It's time for web servers to handle ten thousand clients simultaneously, don't you think? After all, the web is a big place now.
And computers are big, too. You can buy a 1000MHz machine with 2 gigabytes of RAM and a 1000Mbit/sec Ethernet card for $1500 or so. Let's see - at 20000 clients, that's 50KHz, 100Kbytes, and 50Kbits/sec per client. It shouldn't take any more horsepower than that to take four kilobytes from the disk and send them to the network once a second for each of twenty thousand clients. (That works out to $0.08 per client, by the way. Those $100/client licensing fees some operating systems charge are starting to look a little heavy!) So hardware is no longer the bottleneck.
In 1999 one of the busiest ftp sites, cdrom.com, actually handled 10000 clients simultaneously through a Gigabit Ethernet pipe. As of 2001, that same speed is now being offered by several ISPs, who expect it to become increasingly popular with large business customers.
And the thin client model of computing appears to be coming back in style -- this time with the server out on the Internet, serving thousands of clients.
With that in mind, here are a few notes on how to configure operating systems and write code to support thousands of clients. The discussion centers around Unix-like operating systems, for obvious reasons.
If you haven't read it already, go out and get a copy of Unix Network Programming : Networking Apis: Sockets and Xti (Volume 1) by the late W. Richard Stevens. It describes many of the I/O strategies and pitfalls related to writing high-performance servers. It even talks about the 'thundering herd' problem.
Prepackaged libraries are available that abstract some of the techniques presented below, insulating your code from the operating system and making it more portable.
The following five combinations seem to be popular:
... set nonblocking mode on all network handles, and use select() or poll() to tell which network handle has data waiting. This is the traditional favorite. With this scheme, the kernel tells you whether a file descriptor is ready, whether or not you've done anything with that file descriptor since the last time the kernel told you about it. (The name 'level triggered' comes from computer hardware design; it's the opposite of 'edge triggered'. Jonathan Lemon introduced the terms in his paper on kqueue().)
Note: it's particularly important to remember that readiness notification from the kernel is only a hint; the file descriptor might not be ready anymore when you try to read from it. That's why it's important to use nonblocking mode when using readiness notification.
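For example, here's a minimal sketch (not taken from any particular server; the application-level handling is a placeholder) of setting nonblocking mode and treating readiness as just a hint:

    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Put a socket into nonblocking mode. */
    int set_nonblocking(int fd)
    {
        int flags = fcntl(fd, F_GETFL, 0);
        if (flags == -1)
            return -1;
        return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    }

    /* Called when select()/poll() says fd is readable.  Because readiness
     * is only a hint, read() may still return EAGAIN; in that case we just
     * go back to waiting for the next notification. */
    void on_readable(int fd)
    {
        char buf[4096];
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            /* ... hand buf[0..n-1] to the application ... */
        } else if (n == 0) {
            close(fd);      /* peer closed the connection */
        } else if (errno != EAGAIN && errno != EWOULDBLOCK) {
            close(fd);      /* real error */
        }
        /* EAGAIN/EWOULDBLOCK: false alarm, nothing to do. */
    }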
An important bottleneck in this method is that read() or sendfile() from disk blocks if the page is not in core at the moment; setting nonblocking mode on a disk file handle has no effect. The same thing goes for memory-mapped disk files. The first time a server needs disk I/O, its process blocks, all clients must wait, and that raw nonthreaded performance goes to waste.

Worker threads or processes that do the disk I/O can get around this bottleneck. One approach is to use memory-mapped files, and if mincore() indicates I/O is needed, ask a worker to do the I/O, and continue handling network traffic. Jef Poskanzer mentions that Pai, Druschel, and Zwaenepoel's Flash web server uses this trick; they gave a talk at Usenix '99 on it.

It looks like mincore() is available in BSD-derived Unixes like FreeBSD and Solaris, but is not part of the Single Unix Specification. It's available as part of Linux as of kernel 2.3.51, thanks to Chuck Lever.
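As a rough illustration (my own sketch, not Flash's actual code; note that the third argument of mincore() is unsigned char* on Linux but char* on some BSDs), the residency check might look like this:

    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Return nonzero if every page in the given range of a memory-mapped
     * file is already resident, i.e. it can probably be sent without
     * blocking on disk I/O. */
    int range_in_core(void *addr, size_t len)
    {
        size_t page = (size_t) sysconf(_SC_PAGESIZE);
        char *start = (char *) ((uintptr_t) addr & ~(uintptr_t) (page - 1));
        size_t span = (size_t) ((char *) addr + len - start);
        size_t npages = (span + page - 1) / page;
        unsigned char vec[1024];   /* covers up to 4MB with 4KB pages */
        size_t i;

        if (npages > sizeof vec)
            return 0;              /* too big to check here; assume not resident */
        if (mincore(start, span, vec) != 0)
            return 0;
        for (i = 0; i < npages; i++)
            if (!(vec[i] & 1))     /* low bit set means "page is resident" */
                return 0;
        return 1;
    }

Before calling write() or sendfile() on part of a mapped file, the server could call something like this and, when it returns 0, hand the request to a worker thread instead.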
There are several ways for a single thread to tell which of a set of nonblocking sockets are ready for I/O:
See Poller_select (cc, h) for an example of how to use select() interchangeably with other readiness notification schemes.
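For reference, the classic select() loop over a set of nonblocking sockets looks roughly like this sketch (client_fds[] and handle_client() stand in for your own bookkeeping):

    #include <sys/select.h>

    #define MAX_CLIENTS FD_SETSIZE

    int client_fds[MAX_CLIENTS];        /* hypothetical bookkeeping; -1 = unused slot */
    extern void handle_client(int fd);  /* placeholder: does nonblocking I/O */

    void select_loop(void)
    {
        for (;;) {
            fd_set readable;
            int i, maxfd = -1;

            FD_ZERO(&readable);
            for (i = 0; i < MAX_CLIENTS; i++) {
                if (client_fds[i] < 0)
                    continue;
                FD_SET(client_fds[i], &readable);
                if (client_fds[i] > maxfd)
                    maxfd = client_fds[i];
            }

            /* Level-triggered: any descriptor with unread data shows up
             * every time through the loop, whether or not we touched it. */
            if (select(maxfd + 1, &readable, NULL, NULL, NULL) <= 0)
                continue;

            for (i = 0; i < MAX_CLIENTS; i++)
                if (client_fds[i] >= 0 && FD_ISSET(client_fds[i], &readable))
                    handle_client(client_fds[i]);
        }
    }

One well-known limitation: select() can't watch descriptors numbered FD_SETSIZE or higher, which is one reason the alternatives below exist.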
Some OS's (e.g. Solaris 8) speed up poll() et al by use of techniques like poll hinting, which was implemented and benchmarked by Niels Provos for Linux in 1999.
See Poller_poll (cc, h, benchmarks) for an example of how to use poll() interchangeably with other readiness notification schemes.
The idea behind /dev/poll is to take advantage of the fact that often poll() is called many times with the same arguments. With /dev/poll, you get an open handle to /dev/poll, and tell the OS just once what files you're interested in by writing to that handle; from then on, you just read the set of currently ready file descriptors from that handle.
It appeared quietly in Solaris 7 (see patchid 106541) but its first public appearance was in Solaris 8; according to Sun, at 750 clients, this has 10% of the overhead of poll().
Linux is also starting to support /dev/poll, although it's not in the main kernel as of 2.4.4. See N. Provos, C. Lever, "Scalable Network I/O in Linux," May, 2000. [FREENIX track, Proc. USENIX 2000, San Diego, California (June, 2000).] A patch implementing /dev/poll for the 2.2.14 Linux kernel is available from www.citi.umich.edu. Note: Niels reports that under heavy load it can drop some readiness events. Kevin Clark updated it for the 2.4.3 kernel (see www.cs.unh.edu/~kdc/dev-poll-patch-linux-2.4.3). Davide Libenzi updated this patch for 2.4.6 or so (see here), but didn't fix its essential problems.
See Christopher K. St. John's /dev/yapoll page for some interesting benchmarks, and yet another try at a /dev/poll-like interface.
See Poller_devpoll (cc, h, benchmarks) for an example of how to use /dev/poll interchangeably with many other readiness notification schemes. (Caution - the example is for Linux /dev/poll, and might not work right on Solaris.)
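Roughly, using the interface looks like the sketch below. This follows the documented Solaris API (struct dvpoll and the DP_POLL ioctl); the Linux patches may differ in details, and error handling is omitted:

    #include <sys/devpoll.h>   /* Solaris: struct dvpoll, DP_POLL */
    #include <sys/ioctl.h>
    #include <fcntl.h>
    #include <poll.h>
    #include <unistd.h>

    void devpoll_example(int *fds, int nfds)
    {
        int dp = open("/dev/poll", O_RDWR);
        struct pollfd pfd;
        struct pollfd results[1024];
        struct dvpoll dopoll;
        int i, n;

        /* Register interest once, by writing pollfd records to /dev/poll. */
        for (i = 0; i < nfds; i++) {
            pfd.fd = fds[i];
            pfd.events = POLLIN;
            pfd.revents = 0;
            write(dp, &pfd, sizeof pfd);
        }

        /* From then on, each DP_POLL ioctl returns only the ready descriptors. */
        dopoll.dp_fds = results;
        dopoll.dp_nfds = 1024;
        dopoll.dp_timeout = -1;        /* block until something is ready */
        n = ioctl(dp, DP_POLL, &dopoll);
        for (i = 0; i < n; i++) {
            /* results[i].fd is ready; handle it without blocking */
        }
        close(dp);
    }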
[Banga, Mogul, Druschel '99] described this kind of scheme in 1999.
There are several APIs which let the application retrieve 'file became ready' notifications:
FreeBSD 5.0 (and possibly 4.1) supports a generalized alternative to poll() called kqueue()/kevent(); it supports both edge-triggering and level-triggering. (See also Jonathan Lemon's page and his Usenix paper on kqueue().)
Like /dev/poll, you allocate a listening object, but rather than opening the file /dev/poll, you call kqueue() to allocate one. To change the events you are listening for, or to get the list of current events, you call kevent() on the descriptor returned by kqueue(). It can listen not just for socket readiness, but also for plain file readiness, signals, and even for I/O completion.
Note: as of October 2000, the threading library on FreeBSD does not interact well with kqueue(); evidently, when kqueue() blocks, the entire process blocks, not just the calling thread.
See Poller_kqueue (cc, h, benchmarks) for an example of how to use kqueue() interchangeably with many other readiness notification schemes.
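Here's a minimal sketch of registering sockets with kqueue() and waiting for readiness (error handling omitted; handle_client() is a placeholder):

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>

    extern void handle_client(int fd);   /* placeholder: does nonblocking I/O */

    void kqueue_example(int *fds, int nfds)
    {
        int kq = kqueue();
        struct kevent change;
        struct kevent events[64];
        int i, n;

        /* Register each socket for read readiness.  EV_ADD gives the default
         * level-triggered behavior; add EV_CLEAR for edge-triggered. */
        for (i = 0; i < nfds; i++) {
            EV_SET(&change, fds[i], EVFILT_READ, EV_ADD, 0, 0, NULL);
            kevent(kq, &change, 1, NULL, 0, NULL);
        }

        /* Retrieve up to 64 ready descriptors; a NULL timeout means block. */
        n = kevent(kq, NULL, 0, events, 64, NULL);
        for (i = 0; i < n; i++)
            handle_client((int) events[i].ident);
    }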
Examples and libraries using kqueue():
The 2.4 linux kernel can deliver socket readiness events via a particular realtime signal. Here's how to turn this behavior on:
    /* Mask off SIGIO and the signal you want to use. */
    sigemptyset(&sigset);
    sigaddset(&sigset, signum);
    sigaddset(&sigset, SIGIO);
    sigprocmask(SIG_BLOCK, &sigset, NULL);

    /* For each file descriptor, invoke F_SETOWN, F_SETSIG, and set O_ASYNC. */
    fcntl(fd, F_SETOWN, (int) getpid());
    fcntl(fd, F_SETSIG, signum);
    flags = fcntl(fd, F_GETFL);
    flags |= O_NONBLOCK|O_ASYNC;
    fcntl(fd, F_SETFL, flags);

This sends that signal when a normal I/O function like read() or write() completes. To use this, write a normal poll() outer loop, and inside it, after you've handled all the fd's noticed by poll(), you loop calling sigwaitinfo().
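A sketch of that inner loop follows; for simplicity it uses sigtimedwait() with a zero timeout rather than sigwaitinfo(), so it drains the signal queue without blocking (handle_fd() is a placeholder):

    #include <signal.h>
    #include <time.h>

    extern void handle_fd(int fd);   /* placeholder: do nonblocking I/O on this fd */

    /* Drain queued readiness signals.  Returns 1 if a plain SIGIO was seen,
     * meaning the realtime signal queue overflowed and the caller should
     * fall back to poll()ing every descriptor. */
    int drain_rtsig_queue(int signum, const sigset_t *sigset)
    {
        struct timespec zero = { 0, 0 };
        siginfo_t info;
        int sig;

        for (;;) {
            sig = sigtimedwait(sigset, &info, &zero);
            if (sig < 0)
                return 0;               /* queue empty (EAGAIN) or interrupted */
            if (sig == signum)
                handle_fd(info.si_fd);  /* this descriptor became ready */
            else if (sig == SIGIO)
                return 1;               /* overflow; caller must poll() everything */
        }
    }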
See Poller_sigio (cc, h) for an example of how to use rtsignals interchangeably with many other readiness notification schemes.
See Zach Brown's phhttpd for example code that uses this feature directly.
[Provos, Lever, and Tweedie 2000] describes a recent benchmark of phhttpd using a variant of sigtimedwait(), sigtimedwait4(), that lets you retrieve multiple signals with one call. Interestingly, the chief benefit of sigtimedwait4() for them seemed to be that it allowed the app to gauge system overload (so it could behave appropriately). (Note that poll() provides the same measure of system overload.)
Vitaly Luban announced a patch implementing this scheme on 18 May 2001; his patch lives at www.luban.org/GPL/gpl.html. (Note: as of Sept 2001, there may still be stability problems with this patch under heavy load. dkftpbench at about 4500 users may be able to trigger an oops.)
See Poller_sigfd (cc, h) for an example of how to use signal-per-fd interchangeably with many other readiness notification schemes.
This has not yet become popular in Unix, probably because few operating systems support asynchronous I/O, also possibly because it's hard to use. Under Unix, this is done with the aio_ interface (scroll down from that link to "Asynchronous input and output"), which associates a signal and value with each I/O operation. Signals and their values are queued and delivered efficiently to the user process. This is from the POSIX 1003.1b realtime extensions, and is also in the Single Unix Specification, version 2, and in glibc 2.1. The generic glibc 2.1 implementation may have been written for standards compliance rather than performance.
AIO is normally used with edge-triggered completion notification, i.e. a signal is queued when the operation is complete. It can also be used with level triggered completion notification by calling aio_suspend().
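For instance, a level-triggered use of the aio_ interface might look like this minimal sketch, which queues one read and waits for it with aio_suspend() (on Linux/glibc, link with -lrt):

    #include <aio.h>
    #include <errno.h>
    #include <string.h>

    /* Queue an asynchronous read and wait for it to finish.
     * Returns the number of bytes read, or -1 on error. */
    int read_block_async(int fd, char *buf, size_t len, off_t offset)
    {
        struct aiocb cb;
        const struct aiocb *list[1];

        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf = buf;
        cb.aio_nbytes = len;
        cb.aio_offset = offset;

        if (aio_read(&cb) != 0)          /* queue the read */
            return -1;

        /* A real server would go handle other clients here and only come
         * back when notified; aio_suspend() is the simple way to wait. */
        list[0] = &cb;
        while (aio_error(&cb) == EINPROGRESS)
            aio_suspend(list, 1, NULL);

        return (int) aio_return(&cb);
    }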
SGI has implemented high-speed AIO with kernel support. As of version 1.1, it's said to work well with both disk I/O and sockets. It seems to use kernel threads.
Ben LaHaise is working on another implementation for Linux AIO that does not use kernel threads, and has a more efficient underlying api:
The O'Reilly book POSIX.4: Programming for the Real World is said to include a good introduction to aio.
A tutorial for the earlier, nonstandard, aio implementation on Solaris is online at Sunsite. It's probably worth a look, but keep in mind you'll need to mentally convert "aioread" to "aio_read", etc.
Under Windows, asynchronous I/O is associated with the terms "Overlapped I/O" and IOCP or "I/O Completion Port". Microsoft's IOCP combines techniques from the prior art like asynchronous I/O (like aio_write) and queued completion notification (like when using the aio_sigevent field with aio_write) with a new idea of holding back some requests to try to keep the number of running threads associated with a single IOCP constant. For more information, see Jeffrey Richter's book "Programming Server-Side Applications for Microsoft Windows 2000" (Amazon, MSPress), U.S. patent #06223207, or MSDN.
... and let read() and write() block. Has the disadvantage of using a whole stack frame for each client, which costs memory. Many OS's also have trouble handling more than a few hundred threads. If each thread gets a 2MB stack (not an uncommon default value), you run out of *virtual memory* at (2^32 / 2^21) = 2048 threads on a 32 bit machine.
Unfortunately, for various reasons, people seem to be hooked on threads. Fortunately for them, two groups (NGPT and NPTL) have been working hard to improve Linux's thread support.
Ulrich Drepper, though, didn't like NGPT, and figured he could do better. (For those who have ever tried to contribute a patch to glibc, this may not come as a big surprise :-) Over the next few months, Ulrich Drepper, Ingo Molnar, and others contributed glibc and kernel changes that make up something called the Native POSIX Thread Library (NPTL). NPTL uses all the kernel enhancements designed for NGPT, and takes advantage of a few new ones. Ingo Molnar described the kernel enhancements as follows:
While NPTL uses the three kernel features introduced by NGPT: getpid() returns PID, CLONE_THREAD and futexes; NPTL also uses (and relies on) a much wider set of new kernel features, developed as part of this project.

One big difference between the two is that NPTL is a 1:1 threading model, whereas NGPT is an M:N threading model (see below). In spite of this, Ulrich's initial benchmarks seem to show that NPTL is indeed much faster than NGPT. (The NGPT team is looking forward to seeing Ulrich's benchmark code to verify the result.)

Some of the items NGPT introduced into the kernel around 2.5.8 got modified, cleaned up and extended, such as thread group handling (CLONE_THREAD). [The CLONE_THREAD changes which impacted NGPT's compatibility got synced with the NGPT folks, to make sure NGPT does not break in any unacceptable way.]
The kernel features developed for and used by NPTL are described in the design whitepaper, http://people.redhat.com/drepper/nptl-design.pdf ...
A short list: TLS support, various clone extensions (CLONE_SETTLS, CLONE_SETTID, CLONE_CLEARTID), POSIX thread-signal handling, sys_exit() extension (release TID futex upon VM-release), the sys_exit_group() system-call, sys_execve() enhancements and support for detached threads.
There was also work put into extending the PID space - e.g. procfs crashed due to 64K PID assumptions, max_pid, and pid allocation scalability work. Plus a number of performance-only improvements were done as well.
In essence the new features are a no-compromises approach to 1:1 threading - the kernel now helps in everything where it can improve threading, and we precisely do the minimally necessary set of context switches and kernel calls for every basic threading primitive.
As of September 2002, NPTL is *not ready for general use* because it is not yet easy to set up. That will change as distributions update to newer versions of glibc and the Linux kernel; when the 2.6 kernel is released, distributions that support it will probably offer NPTL support as well. (If you need better thread support before then, use NGPT.)
Novell and Microsoft are both said to have done this at various times, at least one NFS implementation does this, khttpd does this for Linux and static web pages, and "TUX" (Threaded linUX webserver) is a blindingly fast and flexible kernel-space HTTP server by Ingo Molnar for Linux. Ingo's September 1, 2000 announcement says an alpha version of TUX can be downloaded from ftp://ftp.redhat.com/pub/redhat/tux, and explains how to join a mailing list for more info.

The linux-kernel list has been discussing the pros and cons of this approach, and the consensus seems to be that instead of moving web servers into the kernel, the kernel should have the smallest possible hooks added to improve web server performance. That way, other kinds of servers can benefit. See e.g. Zach Brown's remarks about userland vs. kernel http servers. It appears that the 2.4 linux kernel provides sufficient power to user programs, as the X15 server runs about as fast as Tux, but doesn't use any kernel modifications.
Richard Gooch has written a paper discussing I/O options.
The Apache mailing lists have some interesting posts about why they prefer not to use select() (basically, they think that makes plugins harder). Still, they're planning to use select()/poll()/sendfile() for static requests in Apache 2.0.
Mark Russinovich wrote an editorial and an article discussing I/O strategy issues in the 2.2 Linux kernel. Worth reading, even if he seems misinformed on some points. In particular, he seems to think that Linux 2.2's asynchronous I/O (see F_SETSIG above) doesn't notify the user process when data is ready, only when new connections arrive. This seems like a bizarre misunderstanding. See also comments on an earlier draft, Ingo Molnar's rebuttal of 30 April 1999, Russinovich's comments of 2 May 1999, a rebuttal from Alan Cox, and various posts to linux-kernel. I suspect he was trying to say that Linux doesn't support asynchronous disk I/O, which used to be true, but now that SGI has implemented KAIO, it's not so true anymore.
See these pages at sysinternals.com and MSDN for information on "completion ports", which he said were unique to NT; in a nutshell, win32's "overlapped I/O" turned out to be too low level to be convenient, and a "completion port" is a wrapper that provides a queue of completion events, plus scheduling magic that tries to keep the number of running threads constant by allowing more threads to pick up completion events if other threads that had picked up completion events from this port are sleeping (perhaps doing blocking I/O).
See also OS/400's support for I/O completion ports.
There was an interesting discussion on linux-kernel in September 1999 titled "> 15,000 Simultaneous Connections" (and the second week of the thread). Highlights:
Interesting reading!
    set kern.maxfiles=XXXX

where XXXX is the desired system limit on file descriptors, and reboot. Thanks to an anonymous reader, who wrote in to say he'd achieved far more than 10000 connections on FreeBSD 4.3, and says

"FWIW: You can't actually tune the maximum number of connections in FreeBSD trivially, via sysctl.... You have to do it in the /boot/loader.conf file."

Another reader says

"The reason for this is that the zalloci() calls for initializing the sockets and tcpcb structures zones occurs very early in system startup, in order that the zone be both type stable and that it be swappable.

You will also need to set the number of mbufs much higher, since you will (on an unmodified kernel) chew up one mbuf per connection for tcptempl structures, which are used to implement keepalive."

As of FreeBSD 4.4, the tcptempl structure is no longer allocated; you no longer have to worry about one mbuf being chewed up per connection. Huzzah!
    echo 32768 > /proc/sys/fs/file-max
    echo 65536 > /proc/sys/fs/inode-max

increases the system limit on open files, and

    ulimit -n 32768

increases the current process' limit. I verified that a process on Red Hat 6.0 (2.2.5 or so plus patches) can open at least 31000 file descriptors this way. Another fellow has verified that a process on 2.2.12 can open at least 90000 file descriptors this way (with appropriate limits). The upper bound seems to be available memory.
On any architecture, you may need to reduce the amount of stack space allocated for each thread to avoid running out of virtual memory. You can set this at runtime with pthread_attr_setstacksize() if you're using pthreads.
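For example, something like this sketch (the 64KB figure is arbitrary; it just needs to be at least PTHREAD_STACK_MIN and big enough for your handler) creates threads with small stacks:

    #include <pthread.h>

    extern void *client_thread(void *arg);   /* placeholder thread body */

    int spawn_small_stack_thread(pthread_t *tid, void *arg)
    {
        pthread_attr_t attr;
        int err;

        pthread_attr_init(&attr);
        /* 64KB instead of the multi-megabyte default stack. */
        pthread_attr_setstacksize(&attr, 64 * 1024);
        err = pthread_create(tid, &attr, client_thread, arg);
        pthread_attr_destroy(&attr);
        return err;
    }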
See also Volano's detailed instructions for raising file, thread, and FD_SET limits in the 2.2 kernel. Wow. This stunning little document steps you through a lot of stuff that would be hard to figure out yourself. This is a must-read, even though some of its advice is already out of date.
Up through JDK 1.3, Java's standard networking libraries mostly offered the one-thread-per-client model. There was a way to do nonblocking reads, but no way to do nonblocking writes.
In May 2001, JDK 1.4 introduced the package java.nio to provide full support for nonblocking I/O (and some other goodies). See the release notes for some caveats. Try it out and give Sun feedback!
HP's java also includes a Thread Polling API.
In 2000, Matt Welsh implemented nonblocking sockets for Java; his performance benchmarks show that they have advantages over blocking sockets in servers handling many (up to 10000) connections. His class library is called java-nbio; it's part of the Sandstorm project. Benchmarks showing performance with 10000 connections are available.
See also Dean Gaudet's essay on the subject of Java, network I/O, and threads, and the paper by Matt Welsh on events vs. worker threads.
There are several proposals for improving Java's networking APIs:
A zero-copy implementation of sendfile() is on its way for the 2.4 kernel. See LWN Jan 25 2001.
One developer using sendfile() with FreeBSD reports that using POLLWRBAND instead of POLLOUT makes a big difference.
Solaris 8 (as of the July 2001 update) has a new system call 'sendfilev'. A copy of the man page is here. The Solaris 8 7/01 release notes also mention it. I suspect that this will be most useful when sending to a socket in blocking mode; it'd be a bit of a pain to use with a nonblocking socket.
See LWN Jan 25 2001 for a summary of some very interesting discussions on linux-kernel about TCP_CORK and a possible alternative MSG_MORE.
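To illustrate on Linux (a sketch only, with error handling omitted; the sendfile() signature here is the Linux one): cork the socket, write the headers, sendfile() the body, then uncork so the final partial packet goes out.

    #include <netinet/in.h>
    #include <netinet/tcp.h>     /* TCP_CORK */
    #include <sys/sendfile.h>
    #include <sys/socket.h>
    #include <string.h>
    #include <unistd.h>

    /* Send an HTTP header followed by a file, letting the kernel coalesce
     * the header and the first chunk of the file into full-sized packets. */
    void send_response(int sock, const char *hdr, int filefd, size_t filelen)
    {
        int on = 1, off = 0;
        off_t offset = 0;

        setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof on);
        write(sock, hdr, strlen(hdr));
        sendfile(sock, filefd, &offset, filelen);   /* in-kernel copy to the socket */
        setsockopt(sock, IPPROTO_TCP, TCP_CORK, &off, sizeof off);   /* flush */
    }

In a real nonblocking server you'd also have to handle short counts from both write() and sendfile() and resume later, which is part of what makes this fiddly.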
"I've compared the raw performance of a select-based server with a multiple-process server on both FreeBSD and Solaris/x86. On microbenchmarks, there's only a marginal difference in performance stemming from the software architecture. The big performance win for select-based servers stems from doing application-level caching. While multiple-process servers can do it at a higher cost, it's harder to get the same benefits on real workloads (vs microbenchmarks). I'll be presenting those measurements as part of a paper that'll appear at the next Usenix conference. If you've got postscript, the paper is available at http://www.cs.rice.edu/~vivek/flash99/"
For Linux, it looks like kernel bottlenecks are being fixed constantly. See Linux Weekly News, Kernel Traffic, the Linux-Kernel mailing list, and my Mindcraft Redux page.
In March 1999, Microsoft sponsored a benchmark comparing NT to Linux at serving large numbers of http and smb clients, in which they failed to see good results from Linux. See also my article on Mindcraft's April 1999 Benchmarks for more info.
See also The Linux Scalability Project. They're doing interesting work, including Niels Provos' hinting poll patch, and some work on the thundering herd problem.
See also Mike Jagdis' work on improving select() and poll(); here's Mike's post about it.
Two tests in particular are simple, interesting, and hard:
Jef Poskanzer has published benchmarks comparing many web servers. See http://www.acme.com/software/thttpd/benchmarks.html for his results.
I also have a few old notes about comparing thttpd to Apache that may be of interest to beginners.
Chuck Lever keeps reminding us about Banga and Druschel's paper on web server benchmarking. It's worth a read.
IBM has an excellent paper titled Java server benchmarks [Baylor et al, 2000]. It's worth a read.
Copyright 1999-2002 Dan Kegel
dank@alumni.caltech.edu
Last updated: 28 Sept 2002