
Why Are My Page_faults So High in Perf?

Interesting behavior I have been dealing with lately.

Have you seen this perf profile before?

  PerfTop:   12025 irqs/sec  kernel:21.0%  exact:  0.0% [1000Hz cycles],  (all, 12 CPUs)
------------------------------------------------------------------------------------------------

             samples  pcnt function                                 DSO
             _______ _____ ________________________________________ ____________________________

            19739.00 22.3% page_fault                               [kernel.kallsyms]
             9475.00 10.7% __GI_time                                /lib64/libc-2.5.so
             4934.00  5.6% free                                     /usr/lib64/libjemalloc.so.1
             4816.00  5.4% arena_dalloc_bin_locked                  /usr/lib64/libjemalloc.so.1
             3111.00  3.5% calloc                                   /usr/lib64/libjemalloc.so.1
             2399.00  2.7% __GI_vfprintf                            /lib64/libc-2.5.so
             2258.00  2.6% malloc                                   /usr/lib64/libjemalloc.so.1
             2176.00  2.5% arena_tcache_fill_small                  /usr/lib64/libjemalloc.so.1

I had not seen it, probably because no one makes this many syscalls in high performance code (dur). Googling turned up nothing useful aside from the normal major/minor page fault material explained by the Linux kernel's paging model.

I then asked the smartest guy I know, and he pointed me in the right direction. What follows is the explanation, in case you run into this yourself. I suspect most people never will, because it comes down to a libc/kernel version mismatch.

We are working on a complex event processor at AppNexus. This thing is responsible for aggregating, joining, filtering, etc., on advertising events. As such, it's very concerned with what time it is on the system: various RAM eviction mechanisms rely on windows of time so that we don't consume all the RAM on the box holding onto events longer than necessary. Turns out that we were naively calling:

time_t now = time(NULL);

in a few key places (i.e., every time we processed an incoming event). That time() calls would eat a majority of our cycles was slightly unexpected. What was far more unexpected is that those time() calls would trigger so many page faults.
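
As an aside, the per-event cost is easy to amortize regardless of the kernel issue below: fetch the current second once per batch and reuse it. A minimal sketch of that pattern (struct event and process_event are hypothetical stand-ins, not our actual code):

#include <stddef.h>
#include <time.h>

struct event { int id; };               /* hypothetical event type */

static void process_event(const struct event *ev, time_t now)
{
  (void) ev;
  (void) now;                           /* real work would happen here */
}

/* One time() call per batch instead of one per event. */
static void process_batch(const struct event *evs, size_t n)
{
  time_t now = time(NULL);
  for (size_t i = 0; i < n; i++)
    process_event(&evs[i], now);
}

int main(void)
{
  struct event evs[4] = { {1}, {2}, {3}, {4} };
  process_batch(evs, 4);
  return 0;
}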

We are using glibc version 2.5 (the stock version on CentOS 5) on x86_64 machines. This glibc implements time() in assembly as a call into the vsyscall page provided by Linux. From the time.S file:

glibc/sysdeps/unix/sysv/linux/x86_64/time.S
/* For the calculation see asm/vsyscall.h.  */
#define VSYSCALL_ADDR_vtime           0xffffffffff600400


/* Return the current time as a `time_t' and also put it in *T if T is
   not NULL.  Time is represented as seconds from Jan 1 00:00:00 1970.  */

ENTRY (time)
      /* Align stack.  */
      sub      $0x8, %rsp
      cfi_adjust_cfa_offset(8)

      movq    $VSYSCALL_ADDR_vtime, %rax
      callq   *%rax

      add $0x8, %rsp
      cfi_adjust_cfa_offset(-8)
      ret
PSEUDO_END_NOERRNO(time)
libc_hidden_def (time)
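
In C terms, all that assembly amounts to calling through a function pointer aimed at a fixed, ABI-frozen address. A sketch of the same thing (x86_64 only, and it assumes the kernel still maps the vsyscall page; under vsyscall=none this will simply SIGSEGV):

#include <stdio.h>
#include <time.h>

/* Same constant glibc's time.S uses; see asm/vsyscall.h. */
#define VSYSCALL_ADDR_vtime 0xffffffffff600400UL

int main(void)
{
  /* Call whatever lives at the magic address as if it were time(). */
  time_t (*vtime)(time_t *) = (time_t (*)(time_t *)) VSYSCALL_ADDR_vtime;
  printf("seconds since epoch: %ld\n", (long) vtime(NULL));
  return 0;
}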

However, because we want new-hotness kernels, we have a relatively new 3.8.13 kernel installed on most of these stock CentOS 5 machines. So what is this magic address? Take a look at a process's memory map:

cat /proc/self/maps

You should see something like:

00400000-00405000 r-xp 00000000 fd:00 13467750                           /bin/cat
00604000-00606000 rw-p 00004000 fd:00 13467750                           /bin/cat
01b15000-01b36000 rw-p 00000000 00:00 0                                  [heap]
3b58600000-3b5861c000 r-xp 00000000 fd:00 91324604                       /lib64/ld-2.5.so
3b5881c000-3b5881d000 r--p 0001c000 fd:00 91324604                       /lib64/ld-2.5.so
3b5881d000-3b5881e000 rw-p 0001d000 fd:00 91324604                       /lib64/ld-2.5.so
3b58a00000-3b58b4d000 r-xp 00000000 fd:00 91324605                       /lib64/libc-2.5.so
3b58b4d000-3b58d4d000 ---p 0014d000 fd:00 91324605                       /lib64/libc-2.5.so
3b58d4d000-3b58d51000 r--p 0014d000 fd:00 91324605                       /lib64/libc-2.5.so
3b58d51000-3b58d52000 rw-p 00151000 fd:00 91324605                       /lib64/libc-2.5.so
3b58d52000-3b58d57000 rw-p 00000000 00:00 0
7fa403eb5000-7fa40748c000 r--p 00000000 fd:00 120077090                  /usr/lib/locale/locale-archive
7fa40748c000-7fa40748e000 rw-p 00000000 00:00 0
7fa40749e000-7fa4074a0000 rw-p 00000000 00:00 0
7ffff864b000-7ffff866c000 rw-p 00000000 00:00 0                          [stack]
7ffff87f8000-7ffff87fa000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

These entries are sorted from lowest to highest address in the process's memory map. You can see the binary itself, its heap, then the DSOs, some locale-related data, a couple of anonymous mappings, the stack, the vdso, and finally the vsyscall page.

On newer libcs we are supposed to reach sys_time through the vDSO instead. That is out of scope for this blog post, but if you are interested: http://lwn.net/Articles/446528/. Note that when that article says "The static vsyscall page will not continue to exist in its current form, and applications which still use it will continue to work but will get a little bit slower", it is severely understating how much slower they will get. More on that below.
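
If you want to see where the kernel put the vDSO for your own process, the auxiliary vector carries its base address. A quick sketch (needs glibc 2.16+ for getauxval, so ironically it will not build against the glibc 2.5 in question):

#include <stdio.h>
#include <sys/auxv.h>

int main(void)
{
  /* AT_SYSINFO_EHDR is the base of the vDSO ELF image for this process. */
  unsigned long vdso = getauxval(AT_SYSINFO_EHDR);
  printf("vdso mapped at %#lx\n", vdso);
  return 0;
}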

The final mapping, from 0xffffffffff600000 to 0xffffffffff601000, must contain a time() function at 0xffffffffff600400. glibc is merely calling the function at that fixed address. On this kernel, that causes a trap (the page_fault above) and special handling that emulates the syscalls that used to live there. This "fixes" the security hole and still allows legacy binaries to work, just much more slowly.

To isolate the problem from any other complications in my own code, I created a simple test program:

#include <time.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
  time_t now = 0;
  int count = atoi(argv[1]);

  for (int i = 0; i < count; i++) {
    now = time(NULL);
  }

  return 0;
}
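
If you want to check the fault-per-call theory before profiling cycles, perf's software page-fault counter makes it cheap; on an affected machine I would expect the count to scale with the loop count, roughly one fault per time() call:

perf stat -e page-faults ./tpf 10000000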

The perf output on this simpler program is pretty revealing as to what’s happening.

# Events: 637  cycles
#
# Overhead  Command      Shared Object                     Symbol
# ........  .......  .................  .........................
#
    48.66%      tpf  [kernel.kallsyms]  [k] page_fault
    16.89%      tpf  libc-2.5.so        [.] __GI_time
     4.30%      tpf  [kernel.kallsyms]  [k] trace_hardirqs_off_caller
     3.68%      tpf  [kernel.kallsyms]  [k] trace_hardirqs_off_thunk
     3.22%      tpf  tpf                [.] main
     3.07%      tpf  [kernel.kallsyms]  [k] emulate_vsyscall
     2.61%      tpf  [kernel.kallsyms]  [k] error_entry
     2.46%      tpf  [kernel.kallsyms]  [k] do_page_fault
     2.30%      tpf  [kernel.kallsyms]  [k] __bad_area_nosemaphore
     2.14%      tpf  [kernel.kallsyms]  [k] is_prefetch
     1.54%      tpf  [kernel.kallsyms]  [k] restore
     1.07%      tpf  [kernel.kallsyms]  [k] spurious_fault
     1.03%      tpf  [kernel.kallsyms]  [k] system_call_after_swapgs
     0.92%      tpf  [kernel.kallsyms]  [k] error_sti
     0.92%      tpf  [kernel.kallsyms]  [k] trace_hardirqs_on_caller
     0.78%      tpf  [kernel.kallsyms]  [k] retint_swapgs
     0.77%      tpf  [kernel.kallsyms]  [k] error_exit
     0.61%      tpf  [kernel.kallsyms]  [k] restore_args
     0.52%      tpf  [kernel.kallsyms]  [k] sys_time
     0.49%      tpf  [kernel.kallsyms]  [k] write_ok_or_segv

The dereference and call through that special page at 0xffffffffff600000 causes the kernel to take a page_fault, because that is not memory userspace is allowed to execute. The fault goes through do_page_fault, which eventually calls __bad_area_nosemaphore:

linux/arch/x86/mm/fault.c
static void
__bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
                       unsigned long address, int si_code)
{
        struct task_struct *tsk = current;

        /* User mode accesses just cause a SIGSEGV */
        if (error_code & PF_USER) {
                /*
                 * It's possible to have interrupts off here:
                 */
                local_irq_enable();

                /*
                 * Valid to do another page fault here because this one came
                 * from user space:
                 */
                if (is_prefetch(regs, error_code, address))
                        return;

                if (is_errata100(regs, address))
                        return;

#ifdef CONFIG_X86_64
                /*
                 * Instruction fetch faults in the vsyscall page might need
                 * emulation.
                 */
                if (unlikely((error_code & PF_INSTR) &&
                             ((address & ~0xfff) == VSYSCALL_START))) {
                        if (emulate_vsyscall(regs, address))
                                return;
                }
#endif
                /* Kernel addresses are always protection faults: */
                if (address >= TASK_SIZE)
                        error_code |= PF_PROT;

                if (likely(show_unhandled_signals))
                        show_signal_msg(regs, error_code, address, tsk);

                tsk->thread.cr2         = address;
                tsk->thread.error_code  = error_code;
                tsk->thread.trap_nr     = X86_TRAP_PF;

                force_sig_info_fault(SIGSEGV, si_code, address, tsk, 0);

                return;
        }

        if (is_f00f_bug(regs, address))
                return;

        no_context(regs, error_code, address, SIGSEGV, si_code);
}

So we have accessed a bad area (kernel memory) in order to execute an instruction there. A user-mode access like this is supposed to just segfault, except on x86_64 (which we are): if the instruction fetch faults inside the vsyscall page, the kernel emulates the vsyscall instead.

linux/arch/x86/kernel/vsyscall_64.c
bool emulate_vsyscall(struct pt_regs *regs, unsigned long address)
{
        struct task_struct *tsk;
        unsigned long caller;
        int vsyscall_nr, syscall_nr, tmp;
        int prev_sig_on_uaccess_error;
        long ret;

        /*
         * No point in checking CS -- the only way to get here is a user mode
         * trap to a high address, which means that we're in 64-bit user code.
         */

        WARN_ON_ONCE(address != regs->ip);

        if (vsyscall_mode == NONE) {
                warn_bad_vsyscall(KERN_INFO, regs,
                                  "vsyscall attempted with vsyscall=none");
                return false;
        }

        vsyscall_nr = addr_to_vsyscall_nr(address);

        trace_emulate_vsyscall(vsyscall_nr);

        if (vsyscall_nr < 0) {
                warn_bad_vsyscall(KERN_WARNING, regs,
                                  "misaligned vsyscall (exploit attempt or buggy program) -- look up the vsyscall kernel parameter if you need a workaround");
                goto sigsegv;
        }

        if (get_user(caller, (unsigned long __user *)regs->sp) != 0) {
                warn_bad_vsyscall(KERN_WARNING, regs,
                                  "vsyscall with bad stack (exploit attempt?)");
                goto sigsegv;
        }

        tsk = current;

        /*
         * Check for access_ok violations and find the syscall nr.
         *
         * NULL is a valid user pointer (in the access_ok sense) on 32-bit and
         * 64-bit, so we don't need to special-case it here.  For all the
         * vsyscalls, NULL means "don't write anything" not "write it at
         * address 0".
         */
        switch (vsyscall_nr) {
        case 0:
                if (!write_ok_or_segv(regs->di, sizeof(struct timeval)) ||
                    !write_ok_or_segv(regs->si, sizeof(struct timezone))) {
                        ret = -EFAULT;
                        goto check_fault;
                }

                syscall_nr = __NR_gettimeofday;
                break;

        case 1:
                if (!write_ok_or_segv(regs->di, sizeof(time_t))) {
                        ret = -EFAULT;
                        goto check_fault;
                }

                syscall_nr = __NR_time;
                break;

        case 2:
                if (!write_ok_or_segv(regs->di, sizeof(unsigned)) ||
                    !write_ok_or_segv(regs->si, sizeof(unsigned))) {
                        ret = -EFAULT;
                        goto check_fault;
                }

                syscall_nr = __NR_getcpu;
                break;
        }

        /*
         * Handle seccomp.  regs->ip must be the original value.
         * See seccomp_send_sigsys and Documentation/prctl/seccomp_filter.txt.
         *
         * We could optimize the seccomp disabled case, but performance
         * here doesn't matter.
         */
        regs->orig_ax = syscall_nr;
        regs->ax = -ENOSYS;
        tmp = secure_computing(syscall_nr);
        if ((!tmp && regs->orig_ax != syscall_nr) || regs->ip != address) {
                warn_bad_vsyscall(KERN_DEBUG, regs,
                                  "seccomp tried to change syscall nr or ip");
                do_exit(SIGSYS);
        }
        if (tmp)
                goto do_ret;  /* skip requested */

        /*
         * With a real vsyscall, page faults cause SIGSEGV.  We want to
         * preserve that behavior to make writing exploits harder.
         */
        prev_sig_on_uaccess_error = current_thread_info()->sig_on_uaccess_error;
        current_thread_info()->sig_on_uaccess_error = 1;

        ret = -EFAULT;
        switch (vsyscall_nr) {
        case 0:
                ret = sys_gettimeofday(
                        (struct timeval __user *)regs->di,
                        (struct timezone __user *)regs->si);
                break;

        case 1:
                ret = sys_time((time_t __user *)regs->di);
                break;

        case 2:
                ret = sys_getcpu((unsigned __user *)regs->di,
                                 (unsigned __user *)regs->si,
                                 NULL);
                break;
        }

        current_thread_info()->sig_on_uaccess_error = prev_sig_on_uaccess_error;

check_fault:
        if (ret == -EFAULT) {
                /* Bad news -- userspace fed a bad pointer to a vsyscall. */
                warn_bad_vsyscall(KERN_INFO, regs,
                                  "vsyscall fault (exploit attempt?)");

                /*
                 * If we failed to generate a signal for any reason,
                 * generate one here.  (This should be impossible.)
                 */
                if (WARN_ON_ONCE(!sigismember(&tsk->pending.signal, SIGBUS) &&
                                 !sigismember(&tsk->pending.signal, SIGSEGV)))
                        goto sigsegv;

                return true;  /* Don't emulate the ret. */
        }

        regs->ax = ret;

do_ret:
        /* Emulate a ret instruction. */
        regs->ip = caller;
        regs->sp += 8;
        return true;

sigsegv:
        force_sig(SIGSEGV, current);
        return true;
}

This basically checks that we aren't trying to have the syscall write its result into memory we have no permissions on (write_ok_or_segv), then calls sys_time, stores the answer in regs->ax, and emulates the rest of the syscall protocol, including the final ret.
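
You can poke at that write_ok_or_segv check from userspace by handing time() a pointer it cannot write through; on the emulated path the kernel should deliver SIGSEGV rather than scribble on anything (a vDSO-based time() would also fault, on the store itself, while a raw syscall would just return -EFAULT). A small sketch:

#include <signal.h>
#include <time.h>
#include <unistd.h>

static void on_segv(int sig)
{
  (void) sig;
  /* write_ok_or_segv (or a plain bad store) rejected our pointer. */
  static const char msg[] = "got SIGSEGV, as expected\n";
  write(STDOUT_FILENO, msg, sizeof(msg) - 1);
  _exit(0);
}

int main(void)
{
  signal(SIGSEGV, on_segv);
  time((time_t *) 0x1);  /* 0x1 is not a writable user address */
  return 1;              /* only reached if no fault fired */
}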

Going back to the test program, let's benchmark it (crudely) so we can see how much slower it really is:

10MM calls to time(NULL) on a broken system (built with gcc -std=c99 -O2 -o tpf tpf.c):

[rberton@broken_machine test_page_fault []]$ time ./tpf 10000000

real               0m6.415s
user               0m2.230s
sys                0m4.046s

10MM calls to time(NULL) on a fairly modern Ubuntu:

[rberton@ubuntu ~]$ uname -a
Linux ubuntu 3.8.0-35-generic #50-Ubuntu SMP Tue Dec 3 01:24:59 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

[rberton@ubuntu ~]$ time ./tpf 10000000

real                0m0.041s
user                0m0.040s
sys                 0m0.000s

That is roughly 156x faster (6.415s vs 0.041s). "A little bit slower" is missing the mark here.
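
If upgrading libc is not immediately in the cards, you can also route hot call sites around the broken wrapper with syscall(2): still a full kernel entry, but a normal one with no fault-and-emulate detour. A hedged sketch (x86_64, where SYS_time exists; time_nofault is my name, not a real API):

#include <stdio.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* time() via the ordinary syscall path, skipping the vsyscall page. */
static time_t time_nofault(time_t *t)
{
  return (time_t) syscall(SYS_time, t);
}

int main(void)
{
  printf("%ld\n", (long) time_nofault(NULL));
  return 0;
}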

tl;dr If you use a really old libc (say glibc 2.5) and upgrade your kernel to something new because you want the hotness, you are likely making every vsyscall (time, gettimeofday, getcpu) dramatically slower on x86_64 machines. You should either upgrade your libc at the same time as the Linux kernel, or do what Paul suggests: statically link uClibc.