I had not seen it. Probably I had not seen it because no one does this many syscalls in high-performance code (dur). I googled it and nothing useful came up aside from the normal major/minor page fault stuff explained by the paging model in the Linux kernel.
I then asked the smartest guy I know, and he pointed me in the right direction. What follows is the explanation in case you run into this. I suspect no one else will run into this because it has to do with libc/kernel mismatches.
We are working on a complex event processor at AppNexus. This thing is responsible for aggregating, joining, filtering, etc. on advertising events. As such, it’s very concerned with what time it is on the system, because various RAM eviction mechanisms rely on windows of time for processing so that we don’t consume all the RAM on the system holding onto events longer than necessary. Turns out that we were naively calling:
time_t now = time(NULL);
In a few key places (i.e. every time we processed an incoming event). That these time() calls would eat a majority of our cycles was somewhat unexpected. What’s more unexpected is that these time() calls would invoke so many page faults.
We are using glibc version 2.5 (the stock version on CentOS 5) on x86_64 machines. This glibc implements the time() call by using assembly to invoke a syscall on the vsyscall page provided by Linux. From the time.S file:
However, because we want new hotness kernels, we have a relatively new 3.8.13 kernel installed on most of these stock CentOS 5 machines. What is this magic address?
These are sorted lowest to highest address in the memory map of this process. You can see the process itself, its heap, then DSOs, then locale-related stuff, then some unknown mappings, then the stack, then the vdso, and finally the vsyscall page.
On newer libcs we are supposed to get sys_time via the vdso. That is out of the scope of this blog post, but if you are interested: http://lwn.net/Articles/446528/. Note that when that article says “The static vsyscall page will not continue to exist in its current form, and applications which still use it will continue to work but will get a little bit slower”, it is severely understating how much slower they will get. More on that below.
The page starting at 0xffffffffff600000 and ending at 0xffffffffff601000 must contain a time() function at 0xffffffffff600400. We are merely dereferencing that memory address and calling the function defined there. This causes the kernel to trap (page fault) on this address and invoke special handling to emulate the syscalls that used to live there. This “fixes” the security hole and still allows legacy apps to work (much more slowly).
I created a simple test program to avoid any other complications in my own code to isolate this problem:
The dereference and function call on that special page at 0xffffffffff600000 causes the kernel to trigger a page fault, as that is not RAM I should be accessing. This enters do_page_fault, which eventually calls __bad_area_nosemaphore.
static void
__bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
		       unsigned long address, int si_code)
{
	struct task_struct *tsk = current;

	/* User mode accesses just cause a SIGSEGV */
	if (error_code & PF_USER) {
		/*
		 * It's possible to have interrupts off here:
		 */
		local_irq_enable();

		/*
		 * Valid to do another page fault here because this one came
		 * from user space:
		 */
		if (is_prefetch(regs, error_code, address))
			return;

		if (is_errata100(regs, address))
			return;

#ifdef CONFIG_X86_64
		/*
		 * Instruction fetch faults in the vsyscall page might need
		 * emulation.
		 */
		if (unlikely((error_code & PF_INSTR) &&
			     ((address & ~0xfff) == VSYSCALL_START))) {
			if (emulate_vsyscall(regs, address))
				return;
		}
#endif
		/* Kernel addresses are always protection faults: */
		if (address >= TASK_SIZE)
			error_code |= PF_PROT;

		if (likely(show_unhandled_signals))
			show_signal_msg(regs, error_code, address, tsk);

		tsk->thread.cr2 = address;
		tsk->thread.error_code = error_code;
		tsk->thread.trap_nr = X86_TRAP_PF;

		force_sig_info_fault(SIGSEGV, si_code, address, tsk, 0);

		return;
	}

	if (is_f00f_bug(regs, address))
		return;

	no_context(regs, error_code, address, SIGSEGV, si_code);
}
So we’ve accessed a bad area (kernel memory) in order to execute an instruction there. User accesses are supposed to segfault unless we are on x86_64 (which we are). In that case, the kernel emulates the vsyscall.
bool emulate_vsyscall(struct pt_regs *regs, unsigned long address)
{
	struct task_struct *tsk;
	unsigned long caller;
	int vsyscall_nr, syscall_nr, tmp;
	int prev_sig_on_uaccess_error;
	long ret;

	/*
	 * No point in checking CS -- the only way to get here is a user mode
	 * trap to a high address, which means that we're in 64-bit user code.
	 */

	WARN_ON_ONCE(address != regs->ip);

	if (vsyscall_mode == NONE) {
		warn_bad_vsyscall(KERN_INFO, regs,
				  "vsyscall attempted with vsyscall=none");
		return false;
	}

	vsyscall_nr = addr_to_vsyscall_nr(address);

	trace_emulate_vsyscall(vsyscall_nr);

	if (vsyscall_nr < 0) {
		warn_bad_vsyscall(KERN_WARNING, regs,
				  "misaligned vsyscall (exploit attempt or buggy program) -- look up the vsyscall kernel parameter if you need a workaround");
		goto sigsegv;
	}

	if (get_user(caller, (unsigned long __user *)regs->sp) != 0) {
		warn_bad_vsyscall(KERN_WARNING, regs,
				  "vsyscall with bad stack (exploit attempt?)");
		goto sigsegv;
	}

	tsk = current;

	/*
	 * Check for access_ok violations and find the syscall nr.
	 *
	 * NULL is a valid user pointer (in the access_ok sense) on 32-bit and
	 * 64-bit, so we don't need to special-case it here.  For all the
	 * vsyscalls, NULL means "don't write anything" not "write it at
	 * address 0".
	 */
	switch (vsyscall_nr) {
	case 0:
		if (!write_ok_or_segv(regs->di, sizeof(struct timeval)) ||
		    !write_ok_or_segv(regs->si, sizeof(struct timezone))) {
			ret = -EFAULT;
			goto check_fault;
		}

		syscall_nr = __NR_gettimeofday;
		break;

	case 1:
		if (!write_ok_or_segv(regs->di, sizeof(time_t))) {
			ret = -EFAULT;
			goto check_fault;
		}

		syscall_nr = __NR_time;
		break;

	case 2:
		if (!write_ok_or_segv(regs->di, sizeof(unsigned)) ||
		    !write_ok_or_segv(regs->si, sizeof(unsigned))) {
			ret = -EFAULT;
			goto check_fault;
		}

		syscall_nr = __NR_getcpu;
		break;
	}

	/*
	 * Handle seccomp.  regs->ip must be the original value.
	 * See seccomp_send_sigsys and Documentation/prctl/seccomp_filter.txt.
	 *
	 * We could optimize the seccomp disabled case, but performance
	 * here doesn't matter.
	 */
	regs->orig_ax = syscall_nr;
	regs->ax = -ENOSYS;
	tmp = secure_computing(syscall_nr);
	if ((!tmp && regs->orig_ax != syscall_nr) || regs->ip != address) {
		warn_bad_vsyscall(KERN_DEBUG, regs,
				  "seccomp tried to change syscall nr or ip");
		do_exit(SIGSYS);
	}
	if (tmp)
		goto do_ret;  /* skip requested */

	/*
	 * With a real vsyscall, page faults cause SIGSEGV.  We want to
	 * preserve that behavior to make writing exploits harder.
	 */
	prev_sig_on_uaccess_error = current_thread_info()->sig_on_uaccess_error;
	current_thread_info()->sig_on_uaccess_error = 1;

	ret = -EFAULT;
	switch (vsyscall_nr) {
	case 0:
		ret = sys_gettimeofday(
			(struct timeval __user *)regs->di,
			(struct timezone __user *)regs->si);
		break;

	case 1:
		ret = sys_time((time_t __user *)regs->di);
		break;

	case 2:
		ret = sys_getcpu((unsigned __user *)regs->di,
				 (unsigned __user *)regs->si,
				 NULL);
		break;
	}

	current_thread_info()->sig_on_uaccess_error = prev_sig_on_uaccess_error;

check_fault:
	if (ret == -EFAULT) {
		/* Bad news -- userspace fed a bad pointer to a vsyscall. */
		warn_bad_vsyscall(KERN_INFO, regs,
				  "vsyscall fault (exploit attempt?)");

		/*
		 * If we failed to generate a signal for any reason,
		 * generate one here.  (This should be impossible.)
		 */
		if (WARN_ON_ONCE(!sigismember(&tsk->pending.signal, SIGBUS) &&
				 !sigismember(&tsk->pending.signal, SIGSEGV)))
			goto sigsegv;

		return true;  /* Don't emulate the ret. */
	}

	regs->ax = ret;

do_ret:
	/* Emulate a ret instruction. */
	regs->ip = caller;
	regs->sp += 8;
	return true;

sigsegv:
	force_sig(SIGSEGV, current);
	return true;
}
This basically checks that we aren’t trying to write the syscall’s return value into memory we have no permission to touch (write_ok_or_segv), then calls sys_time, stores the result in regs->ax, and emulates the rest of the syscall return protocol.
Going back to the test program, let’s benchmark it (crudely) so we can see how much slower it really is:
10MM calls to time(NULL) on a broken system (gcc -std=c99 -O2 -o tpf tpf.c)
[rberton@broken_machine test_page_fault []]$ time ./tpf 10000000
real 0m6.415s
user 0m2.230s
sys 0m4.046s
10MM calls to time(NULL) on a fairly modern ubuntu:
[rberton@ubuntu ~]$ uname -a
Linux ubuntu 3.8.0-35-generic #50-Ubuntu SMP Tue Dec 3 01:24:59 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
[rberton@ubuntu ~]$ time ./tpf 10000000
real 0m0.041s
user 0m0.040s
sys 0m0.000s
156X faster. A “little slower” is missing the mark here.
tl;dr If you use a really old libc (say glibc 2.5) and upgrade your kernel to something new because you want the hotness, you are likely making every vsyscall on x86_64 machines even slower. You should either upgrade your libc at the same time as the Linux kernel, or do what Paul suggests: statically link uclibc.