Note that this is a synthetic microbenchmark, not a real-world benchmark. In the real world, other effects often swamp the kinds of things measured here.
Poller_bench sets up an array of socketpairs, has a Poller monitor the read end of each socketpair, and measures how long it takes to execute the following snippet of code with various styles of Poller:
for (k = 0; k < num_active; k++)
    write(fdpairs[k * spacing + spacing/2][1], "a", 1);
poller.waitAndDispatchEvents(0);

where spacing = num_pipes / num_active.
poller.waitAndDispatchEvents() calls poll() or ioctl(m_dpfd, DP_POLL, &dopoll), as appropriate, then calls an event handler for each ready socket.
The event handler for this benchmark just executes read(event->fd, buf, 1);
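To make the measured operation concrete, here is a minimal standalone sketch of what the poll()-style test boils down to, written directly against poll() rather than through the Poller class; the constants and variable names are illustrative, not taken from the Poller_bench source.

// Minimal sketch of the poll()-based case: monitor the read end of each
// socketpair, make a few of them readable, then find and drain them.
// Illustrative only; not the Poller_bench source.  Error handling omitted.
#include <sys/socket.h>
#include <poll.h>
#include <unistd.h>

enum { NUM_PIPES = 1000, NUM_ACTIVE = 1 };

int main()
{
    static int fdpairs[NUM_PIPES][2];
    static struct pollfd pfds[NUM_PIPES];

    for (int i = 0; i < NUM_PIPES; i++) {
        socketpair(AF_UNIX, SOCK_STREAM, 0, fdpairs[i]);
        pfds[i].fd = fdpairs[i][0];          // watch the read end
        pfds[i].events = POLLIN;
    }

    int spacing = NUM_PIPES / NUM_ACTIVE;

    // This is the part the benchmark times.
    for (int k = 0; k < NUM_ACTIVE; k++)
        write(fdpairs[k * spacing + spacing/2][1], "a", 1);

    int n = poll(pfds, NUM_PIPES, 0);        // must scan all NUM_PIPES entries
    for (int i = 0; i < NUM_PIPES && n > 0; i++) {
        if (pfds[i].revents & POLLIN) {
            char buf[1];
            read(pfds[i].fd, buf, 1);        // the benchmark's "event handler"
            n--;
        }
    }
    return 0;
}

The point to notice is that poll() is handed, and the kernel scans, all NUM_PIPES descriptors on every call even though only NUM_ACTIVE of them are ready; that is the behavior the results below quantify.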
Before running the /dev/poll test on Linux, build a kernel with the /dev/poll patch applied and create the /dev/poll device node:

cd /dev
mknod poll u 15 125
chmod 666 /dev/poll

where 15 is MISC_MAJOR and 125 is DEVPOLL_MINOR from the kernel sources; your MISC_MAJOR may differ, so be sure to check /usr/src/linux/include/linux/major.h for the definition of MISC_MAJOR on your system.
You may also need to make the devpoll.h header visible to user programs:

cd /usr/include/asm
ln -s ../linux/devpoll.h
Also, a line near the end of /usr/include/sys/poll_impl.h may need to be moved to get it to compile when included from C++ programs.
On Linux, if you want kernel profile results, boot with the argument 'profile=2' to enable the kernel's built-in profiler.
Run the shell script Poller_bench.sh as follows:
su
sh Poller_bench.sh
The script raises file descriptor limits, then runs the command
./Poller_bench 5 1 spd 100 1000 10000
It should be run on an idle machine, with no email client, web browser, or X server running. The Pentium III machine at my disposal was running a single sshd; the Solaris machine was running two sshd's and an idle X Window server, so it wasn't quite as idle.
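For reference, the per-process descriptor limit the script raises can also be raised from inside a program with setrlimit(); the following is a generic sketch of that call, not code taken from Poller_bench.sh, and the numbers are illustrative.

// Generic sketch: raise this process's open-file limit with setrlimit()
// before creating thousands of socketpairs.  Not taken from Poller_bench.sh.
#include <sys/resource.h>
#include <stdio.h>

int raise_fd_limit(rlim_t wanted)            // e.g. wanted = 20010 for 10000 socketpairs
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    rl.rlim_cur = wanted;
    if (rl.rlim_max < wanted)
        rl.rlim_max = wanted;                // raising the hard limit requires root
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("setrlimit");
        return -1;
    }
    return 0;
}

On 2.2-era Linux the system-wide limits (e.g. /proc/sys/fs/file-max) may also need raising, which is presumably one reason the script is run as root.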
On a 167MHz sun4u Sparc Ultra-1 running SunOS 5.7 (Solaris 7) Generic_106541-11:
pipes 100 1000 10000
select 151 - -
poll 470 676 3742
/dev/poll 61 70 92
165133 microseconds to open each of 10000 socketpairs
29646 microseconds to close each of 10000 socketpairs
On a 4x400 MHz Enterprise 450 running Solaris 8 (results contributed by Doug Lea):
pipes 100 1000 10000
select 60 - -
poll 273 388 1559
/dev/poll 27 28 34
116586 microseconds to open each of 10000 socketpairs
19235 microseconds to close each of 10000 socketpairs
(The machine wasn't idle, but at most one CPU was doing other work during the test, and the test seemed to occupy only one CPU.)
On an idle 650 MHz dual Pentium III running Red Hat Linux 6.2
with kernel 2.2.14smp plus the /dev/poll patch plus Dave Miller's
patch to speed up close():
pipes 100 1000 10000
select 28 - -
poll 23 890 11333
/dev/poll 19 146 4264
(Time to open or close socketpairs was not recorded, but was under 14 microseconds.)
On the same machine as above, but with kernel 2.4.0-test10-pre4 smp:
pipes 100 1000 10000
select 52 - -
poll 49 1184 14660
26 microseconds to open each of 10000 socketpairs
14 microseconds to close each of 10000 socketpairs
(Note: I believe the /dev/poll patch does not apply cleanly to recent 2.4.0-test kernels, so I did not try it.)
On a single-processor 600 MHz Pentium III with 512 MB of memory running FreeBSD 4.x-STABLE (results contributed by Jonathan Lemon):
pipes 100 1000 10000 30000
select 54 - - -
poll 50 552 11559 35178
kqueue 8 8 8 8
(Note: Jonathan also varied the number of active pipes, and found that kqueue's time scaled linearly with that number, whereas poll's time scaled linearly with the total number of pipes.)
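That scaling difference follows from the shape of the kqueue interface: descriptors are registered with the kernel once, and each kevent() call hands back only the descriptors that are actually ready, so per-call work tracks the number of active pipes rather than the total. Here is a minimal sketch of the pattern (illustrative only; not taken from the benchmark code):

// Minimal kqueue sketch (FreeBSD): register read interest once, then
// retrieve only the descriptors that are ready.  Illustrative only,
// not taken from the benchmark code.  Error handling omitted.
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <unistd.h>

int watch_and_drain(int *readfds, int nfds)
{
    int kq = kqueue();

    // One-time registration of every read end.
    for (int i = 0; i < nfds; i++) {
        struct kevent ev;
        EV_SET(&ev, readfds[i], EVFILT_READ, EV_ADD, 0, 0, NULL);
        kevent(kq, &ev, 1, NULL, 0, NULL);
    }

    // Per-iteration cost: only ready descriptors come back.
    struct kevent ready[64];
    struct timespec zero = { 0, 0 };
    int n = kevent(kq, NULL, 0, ready, 64, &zero);
    for (int i = 0; i < n; i++) {
        char buf[1];
        read((int)ready[i].ident, buf, 1);
    }
    close(kq);
    return n;
}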
The test was also run with pipes instead of socketpairs (results not shown); the performance on Solaris was about the same, but the /dev/poll driver on Linux did not perform well with pipes. According to Niels Provos,
"The hinting code which causes a considerable speedup for /dev/poll only applies to network sockets. If there are any serious applications that make use of pipes in a manner that would benefit from /dev/poll, then the pipe code needs to return hints too."
2.4.0-test10-pre4 was slower than 2.2.14 in all cases tested.
I should show results for pipes as well as socketpairs.
The Linux 2.2.14 /dev/poll driver printed messages to the console when sockets were closed; this should probably be disabled for production.
The 2.2.14 Linux /dev/poll driver was about six times faster than poll() for 1000 fds, but dropped to only 2.7 times faster at 10000 fds. The Solaris /dev/poll driver was about seven times faster than poll() at 100 fds, and increased to 40 times faster at 10000 fds.
Under Linux 2.2.14, when the number of idle sockets was increased from 100 to 10000, the time to check for active sockets with poll() and /dev/poll increased by a factor of 493 and 224, respectively. This is terribly, horribly bad scaling behavior.
Under Linux 2.4.0-test10-pre4, when the number of idle sockets was increased from 100 to 10000, the time to check for active sockets with poll() increased by a factor of 300. This is terribly, horribly bad scaling behavior.
There seems to be a scalability problem in poll() under both Linux 2.2.14 and 2.4.0-test10-pre4 and in /dev/poll under Linux 2.2.14.
poll() is stuck with an interface that dictates O(n) behavior in the total number of pipes; still, Linux's implementation of it could be improved. The current Linux /dev/poll patch is also O(n) in the total number of pipes, even though its interface allows it to be O(1) in total pipes and O(n) only in active pipes.
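For comparison, the /dev/poll interface (Solaris's, which the Linux patch mimics) already separates registration from retrieval: pollfd entries are written to the /dev/poll descriptor once, and each DP_POLL ioctl fills in only the descriptors that are ready. A rough sketch of the usage pattern follows, with illustrative names; this is not the benchmark's Poller code.

// Rough sketch of the /dev/poll usage pattern (Solaris header shown; the
// patched Linux kernel used <asm/devpoll.h> as set up above).  Illustrative
// only; not the benchmark's Poller code.  Error handling omitted.
#include <sys/devpoll.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

int devpoll_demo(int *readfds, int nfds)
{
    int dpfd = open("/dev/poll", O_RDWR);

    // Registration happens once: write pollfd entries into the driver.
    for (int i = 0; i < nfds; i++) {
        struct pollfd pfd;
        pfd.fd = readfds[i];
        pfd.events = POLLIN;
        pfd.revents = 0;
        write(dpfd, &pfd, sizeof(pfd));
    }

    // Retrieval returns only ready descriptors, so this call could cost
    // O(active) rather than O(total) work - the point made above.
    struct pollfd ready[64];
    struct dvpoll dopoll;
    dopoll.dp_fds = ready;
    dopoll.dp_nfds = 64;
    dopoll.dp_timeout = 0;
    int n = ioctl(dpfd, DP_POLL, &dopoll);
    for (int i = 0; i < n; i++) {
        char buf[1];
        read(ready[i].fd, buf, 1);
    }
    close(dpfd);
    return n;
}

As the profiling results below show, the 2.2.14 Linux patch nevertheless spends nearly all of its time inside dp_poll(), doing work proportional to the total number of registered pipes.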
See also the recent discussions on linux-kernel.
If you run the above test on a Linux system booted with 'profile=2', Poller_bench will output one kernel profiling data file per test condition. Poller_bench.sh does a coarse analysis using 'readprofile | sort -rn | head > bench%d%c.top' to find the kernel functions with the highest CPU usage, where %d is the number of socketpairs and %c is p for poll, d for /dev/poll, etc.
'more bench10000*.top' shows the results for 10000 socketpairs. On 2.2.14, it shows:
:::::::::::::: bench10000d.dat.top ::::::::::::::
   901 total                       0.0008
   833 dp_poll                     1.4875
    27 do_bottom_half              0.1688
     7 __get_request_wait          0.0139
     4 startup_32                  0.0244
     3 unix_poll                   0.0203
:::::::::::::: bench10000p.dat.top ::::::::::::::
   584 total                       0.0005
   236 unix_poll                   1.5946
   162 sock_poll                   4.5000
   148 do_poll                     0.6727
    24 sys_poll                    0.0659
     7 __generic_copy_from_user    0.1167

This seems to indicate that /dev/poll spends nearly all of its time in dp_poll(), and poll spends a fair bit of time in three routines: unix_poll, sock_poll, and do_poll.
On 2.4.0-test10-pre4 smp, 'more bench10000*.top' shows:
:::::::::::::: 2.4/bench10000p.dat.top ::::::::::::::
  1507 total                       0.0011
   748 default_idle               14.3846
   253 unix_poll                   1.9167
   209 fget                        2.4881
   195 sock_poll                   5.4167
    29 sys_poll                    0.0342
    29 fput                        0.1272
    29 do_pollfd                   0.1648

It seems curious that the idle routine should show up so much, but it's probably just the second CPU doing nothing.
Poller_bench.sh will also try to do a fine analysis of dp_poll() using the 'profile' tool (source included), a variant of readprofile that shows hotspots within kernel functions. Looking at its output for the run on 2.2.14, the three four-byte regions that take up the most CPU time in dp_poll() in the 10000-socketpair case are:
c01d9158  39.135654%   326
c01d9174  11.404561%    95
c01d91a0  27.250900%   227

Looking at the output of 'objdump -d /usr/src/linux/vmlinux', that region corresponds to the object code:
c01d9158:  c7 44 24 14 00 00 00   movl   $0x0,0x14(%esp,1)
c01d915f:  00
c01d9160:  8b 74 24 24            mov    0x24(%esp,1),%esi
c01d9164:  8b 86 8c 04 00 00      mov    0x48c(%esi),%eax
c01d916a:  3b 50 04               cmp    0x4(%eax),%edx
c01d916d:  73 0a                  jae    c01d9179
c01d916f:  8b 40 10               mov    0x10(%eax),%eax
c01d9172:  8b 14 90               mov    (%eax,%edx,4),%edx
c01d9175:  89 54 24 14            mov    %edx,0x14(%esp,1)
c01d9179:  83 7c 24 14 00         cmpl   $0x0,0x14(%esp,1)
c01d917e:  75 12                  jne    c01d9192
c01d9180:  53                     push   %ebx
c01d9181:  ff 74 24 3c            pushl  0x3c(%esp,1)
c01d9185:  e8 5a fc ff ff         call   c01d8de4
c01d918a:  83 c4 08               add    $0x8,%esp
c01d918d:  e9 d1 00 00 00         jmp    c01d9263
c01d9192:  8b 7c 24 10            mov    0x10(%esp,1),%edi
c01d9196:  0f bf 4f 06            movswl 0x6(%edi),%ecx
c01d919a:  31 c0                  xor    %eax,%eax
c01d919c:  f0 0f b3 43 10         lock btr %eax,0x10(%ebx)
c01d91a1:  19 c0                  sbb    %eax,%eax

I'm not yet familiar enough with kernel hacker tools to associate those with lines of code in /usr/src/linux/drivers/char/devpoll.c, but that 'lock btr' hotspot appears to be the call to test_and_clear_bit().
For comparison, here are lmbench results for the Solaris Ultra-1 and the Linux Pentium III. The source used was lmbench-2alpha10 from bitmover.com. I did not investigate why the TCP test failed on the Linux box.
                L M B E N C H  1 . 9   S U M M A R Y
                ------------------------------------
                (Alpha software, do not distribute)

Processor, Processes - times in microseconds - smaller is better
----------------------------------------------------------------
Host                 OS  Mhz null null      open selct   sig  sig fork exec   sh
                             call  I/O stat clos   inst hndl proc proc proc
--------- ------------- ---- ---- ---- ---- ---- ----- ---- ---- ---- ---- ----
sparc-sun     SunOS 5.7  167  2.9  12.   48   55 0.40K  6.6   81 3.8K  15K  32K
i686-linu Linux 2.2.14d  651  0.5  0.8    4    5 0.03K  1.4    2 0.3K   1K   6K

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------
Host                 OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                        ctxsw  ctxsw  ctxsw  ctxsw  ctxsw   ctxsw   ctxsw
--------- ------------- ----- ------ ------ ------ ------ ------- -------
sparc-sun     SunOS 5.7    19     69    235    114    349     116     367
i686-linu Linux 2.2.14d     1      5     17      5    129      30     129

*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
Host                 OS 2p/0K  Pipe   AF   UDP  RPC/   TCP  RPC/  TCP
                        ctxsw       UNIX         UDP         TCP conn
--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
sparc-sun     SunOS 5.7    19    60  120   197         215       1148
i686-linu Linux 2.2.14d     1     7   13    31          80

File & VM system latencies in microseconds - smaller is better
--------------------------------------------------------------
Host                 OS 0K File       10K File      Mmap    Prot  Page
                        Create Delete Create Delete Latency Fault Fault
--------- ------------- ------ ------ ------ ------ ------- ----- -----
sparc-sun     SunOS 5.7                                6605    15  5.2K
i686-linu Linux 2.2.14d     10      0     19      1    5968     1  0.5K

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------
Host                 OS Pipe   AF  TCP   File   Mmap  Bcopy  Bcopy  Mem   Mem
                            UNIX       reread reread (libc) (hand) read write
--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
sparc-sun     SunOS 5.7   60   55   54     84    122    177     89  122   141
i686-linu Linux 2.2.14d  528  366   -1    357    451    150    138  451   171

Memory latencies in nanoseconds - smaller is better
    (WARNING - may not be correct, check graphs)
---------------------------------------------------
Host                 OS  Mhz  L1 $  L2 $  Main mem  Guesses
--------- ------------- ---- ----- ----- --------- -------
sparc-sun     SunOS 5.7  167    12    59       273
i686-linu Linux 2.2.14d  651     4    10       131