The Low Latency Web

March 26, 2012

500,000 Requests/Sec? Piffle! 1,000,000 Is Better

Filed under: HTTP Servers — lowlatencyweb @ 1:58 pm

Modern HTTP servers are capable of handling 500k requests/sec on commodity hardware. However, that article ignored HTTP pipelining, which can have a significant impact on performance. Buggy legacy servers and proxies prevent most browsers from enabling pipelining by default, but that is the high-latency past, not the future. In fact, enabling pipelining nearly doubles the performance of nginx 1.0.14:

Getting to 1M requests/sec required minor tweaking of the client and server compared to the original article. nginx’s worker_processes was reduced from 16 to 14, wrk’s threads were increased from 10 to 11, and 30M requests were made instead of 10M. Maximum performance was reached with a pipeline depth of 8 and 1,100 concurrent connections.

wrk -t 11 -c N -r 30m --pipeline 8 http://localhost:8080/index.html

When pipelining is enabled wrk counts latency as the time from the first request to the last response. In this particular environment, fewer than 400 concurrent connections see worse latency and throughput with 8 pipelined requests; with 400 or more connections, however, both latency and throughput are significantly better.
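
To make the mechanics concrete, here is a minimal, self-contained Java sketch, entirely independent of wrk, of what a pipeline depth of 8 means on the wire: all 8 requests are written before any response is read back, and latency runs from the first byte sent to the last response received. The host, port, path, and depth match the benchmark above; the class name and everything else are illustrative assumptions.

import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class PipelineDemo {
    public static void main(String[] args) throws Exception {
        try (Socket sock = new Socket("localhost", 8080)) {
            // Build 8 requests; the last asks the server to close the
            // connection so we can simply read to end-of-stream below.
            StringBuilder batch = new StringBuilder();
            for (int i = 0; i < 8; i++) {
                batch.append("GET /index.html HTTP/1.1\r\n")
                     .append("Host: localhost\r\n")
                     .append(i == 7 ? "Connection: close\r\n\r\n" : "\r\n");
            }
            byte[] requests = batch.toString().getBytes(StandardCharsets.US_ASCII);

            long start = System.nanoTime();
            OutputStream out = sock.getOutputStream();
            out.write(requests);  // all 8 requests leave before any response is read
            out.flush();

            // Drain all 8 responses; latency is first request to last response,
            // the same way wrk counts it when pipelining is enabled.
            InputStream in = sock.getInputStream();
            byte[] buf = new byte[8192];
            long bytes = 0;
            for (int n; (n = in.read(buf)) != -1; ) bytes += n;

            System.out.printf("%d response bytes in %d µs for 8 pipelined requests%n",
                              bytes, (System.nanoTime() - start) / 1000);
        }
    }
}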

A SPDY Future

What makes these results even more interesting is the accelerating acceptance of SPDY as a replacement for HTTP 1.1. Chrome has supported SPDY for a while, Firefox and nginx will soon, and there is an experimental Apache module. For dynamic content, Jetty and Netty support SPDY today, and Netty's implementation is already in production use by Twitter.

Persistent connections and support for multiple concurrent requests are inherent to SPDY, so a move to SPDY will reduce the number and frequency of new connections, as well as the latency caused by non-pipelined request/response cycles. This will allow HTTP servers to get closer to their theoretical maximum performance, which looks to be very high indeed for nginx.

March 23, 2012

A Note On Benchmarking

Filed under: HTTP Servers — lowlatencyweb @ 2:42 pm

This series of articles has drawn out the cargo cultists who insist that HTTP benchmarks must be run over the network, or that a real application should be tested, as if adding more variables made a benchmark more useful. What would adding NIC overhead and network latency into the mix prove when the goal is to test an HTTP server or a dynamic content stack? Zed Shaw addresses this in an excellent article on confounding, but the fundamental point is that a useful benchmark must isolate the variable being tested.

Critics may claim that for many web applications the database is the bottleneck, so there is little value in testing anything else. If so, that's easily addressed with in-memory caching and alternative data stores, and then what becomes the bottleneck? Probably the HTTP server or dynamic content stack. Modern hardware and databases are very efficient, and performance problems tend to be caused by poor choices in hardware or software rather than by database or network overhead.

However, in the interest of comprehensiveness, here is a plot showing the previous test of dynamic content when the client and server are on separate machines, communicating over a gigabit VLAN. Note that no TCP tuning was done beyond the original tuning to increase the number of ephemeral ports. Both machines are identical hardware from SoftLayer (with whom my only affiliation is that of a satisfied customer).

The absolute difference between localhost and the VLAN is relatively uninteresting given that it depends on the switching hardware, NIC hardware & driver, kernel, TCP tuning parameters, etc. Latency increases as expected, but requests/sec appear to be converging.

March 22, 2012

150,000 Requests/Sec – Dynamic HTML & JSON Can Be Fast

Filed under: HTTP Servers — lowlatencyweb @ 10:45 pm

Most web applications contain copious amounts of static content in the form of JavaScript, CSS, images, etc. Serving that efficiently means a better user experience, and also more CPU cycles free for dynamic content. Useful benchmarks of dynamic content are more difficult due to the huge number of components available today. Do you choose Ruby, Python, node.js, Java? Which framework on top of which runtime? Which templating system?

Well, this is the Low Latency Web! Here's a plot showing requests per second vs. number of concurrent connections for a simple JVM application that prefix-routes each request to a random responder and renders the result as either JSON or a Jade HTML template.

Performance plateaus at around 150,000 requests/sec for HTML output and 180,000 requests/sec for JSON output. Latency is much greater than in the ideal case of static content via nginx, but it is only around 6 ms with 1,000 concurrent connections and below 2 ms with 300 or fewer connections. Not bad for the JVM and a commodity server, and JSON generation performance is better than static content on EC2.

wrk was run the same as before, but the JSON test adds the Accept HTTP header:

wrk -t 10 -c N -r 10m http://localhost:8080/
wrk -t 10 -c N -r 10m -H "Accept: application/json" http://localhost:8080/
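
On the server side, the switch between HTML and JSON amounts to a header check. Here is a minimal sketch of the kind of test the Accept header exercises; the class and method names are invented for illustration and the real app's code (in app.zip) may differ:

public class ContentNegotiation {
    // JSON only when the Accept header contains application/json; HTML otherwise.
    static boolean wantsJson(String acceptHeader) {
        return acceptHeader != null && acceptHeader.contains("application/json");
    }

    public static void main(String[] args) {
        System.out.println(wantsJson("application/json"));  // true  -> JSON
        System.out.println(wantsJson("text/html"));         // false -> HTML
        System.out.println(wantsJson(null));                // false -> HTML
    }
}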

Application

The complete test application is available in app.zip; simply extract it somewhere and run mvn compile exec:java. The only dependencies are a JVM and Maven. The results above are from the server described in the original article, running JDK 7u3:

java version "1.7.0_03"
Java(TM) SE Runtime Environment (build 1.7.0_03-b04)
Java HotSpot(TM) 64-Bit Server VM (build 22.1-b02, mixed mode)

The test app is only 124 lines of Java, but it could just as well have been written in Clojure, JRuby, Scala, or any other JVM language that compiles to bytecode. It uses Jetty 8 as the HTTP server, Scalate as the templating engine, and Jackson as the JSON generator.

When the app starts, it chooses 1,000 words from /usr/share/dict/words and maps each to a responder that keeps a monotonically increasing count of calls. Each request is randomly routed to a responder via a prefix match, and the result is output as JSON or HTML depending on whether the request's Accept header contains application/json.

The benchmark tests the following, in approximate order of CPU time consumed:

  1. Jetty’s HTTP performance
  2. Scalate’s template rendering performance
  3. Jackson’s JSON generating performance

The cost of routing a request appears to be negligible but is theoretically O(log n).
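
The routing code itself ships in app.zip; as a rough, hypothetical sketch of the structure that description implies, a sorted map gives exactly that O(log n) lookup. The class and field names below are invented for illustration, and the real app may well be structured differently:

import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the routing described above. A TreeMap (a
// red-black tree) provides the O(log n) lookup mentioned in the text.
public class PrefixRouter {
    static final class Responder {
        final AtomicLong calls = new AtomicLong();  // monotonically increasing call count
    }

    private final NavigableMap<String, Responder> routes = new TreeMap<>();

    void add(String word) { routes.put(word, new Responder()); }

    // Map a path like "/zebra..." to the responder whose word is the
    // greatest key <= the path: a simplified single-step prefix match.
    // (A full longest-prefix match would also consider shorter keys
    // when this lookup misses.)
    Responder route(String path) {
        String key = path.startsWith("/") ? path.substring(1) : path;
        Map.Entry<String, Responder> e = routes.floorEntry(key);
        if (e == null || !key.startsWith(e.getKey())) return null;
        e.getValue().calls.incrementAndGet();  // each hit bumps the counter
        return e.getValue();
    }
}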

March 21, 2012

Modern HTTP Servers Are Fast, EC2 Is Not

Filed under: HTTP Servers — lowlatencyweb @ 9:55 pm

The previous article showed nginx 1.0.14 performance on a dedicated server from SoftLayer. That server was chosen simply because one was available; its 24GB of RAM was completely unnecessary.

It would be more useful to publish results from an environment that is easy and cheap to replicate, such as Amazon EC2. The Cluster Compute Eight Extra Large instance (cc2.8xlarge) appears to be a good candidate, with dual Intel Xeon E5-2670 CPUs and 60.5GB of RAM. Spot prices are typically $0.54/hour.

Kernel parameters and nginx config are identical to those in the previous article, but EC2 instances run an Amazon Linux AMI.

Despite running one of the latest Intel CPUs, with 4 extra cores, the EC2 instance performs very poorly. Some virtualization overhead is expected; however, each cc2.8xlarge instance should have an entire physical machine to itself, given that the Intel Xeon E5-2670 supports a maximum of 2 sockets.

March 20, 2012

500,000 requests/sec – Modern HTTP servers are fast

Filed under: HTTP Servers — lowlatencyweb @ 7:00 am

A modern HTTP server running on somewhat recent hardware is capable of servicing a huge number of requests with very low latency. Here’s a plot showing requests per second vs. number of concurrent connections for the default index.html page included with nginx 1.0.14.


With this particular hardware & software combination the server quickly reaches over 500,000 requests/sec and sustains that with gradually increasing latency. Even at 1,000 concurrent connections, each requesting the page as quickly as possible, latency is only around 1.5ms.

The plot shows the average requests/sec and per-request latency of 3 runs of wrk -t 10 -c N -r 10m http://localhost:8080/index.html, where N is the number of connections. The load generator is wrk, a scalable HTTP benchmarking tool.

Software

The OS is Ubuntu 11.10 running Linux 3.0.0-16-generic #29-Ubuntu SMP Tue Feb 14 12:48:51 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux. The following kernel parameters were changed to increase the number of ephemeral ports, reduce time spent in TIME_WAIT, increase the allowed listen backlog, and increase the number of connections Netfilter can track:

  # Widen the range of ephemeral ports available for client connections
  echo "2048 64512" > /proc/sys/net/ipv4/ip_local_port_range

  # Recycle and reuse TIME_WAIT sockets quickly, and shorten the FIN timeout
  echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle
  echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse
  echo "10" > /proc/sys/net/ipv4/tcp_fin_timeout

  # Allow a much larger listen backlog
  echo "65536" > /proc/sys/net/core/somaxconn
  echo "65536" > /proc/sys/net/ipv4/tcp_max_syn_backlog

  # Let Netfilter track enough concurrent connections for the benchmark
  echo "262144" > /proc/sys/net/netfilter/nf_conntrack_max

The HTTP server is nginx 1.0.14 built with ./configure && make, and run in-place with objs/nginx -p . -c nginx.conf.

nginx.conf

 
worker_processes     16;
worker_rlimit_nofile 262144;

daemon off;

events {
  use epoll;                 # efficient Linux event notification
  worker_connections 16384;  # per worker: 16 x 16384 = 262144, matching worker_rlimit_nofile
}

error_log error.log;
pid /dev/null;

http {
  sendfile   on;  # copy file data to the socket inside the kernel
  tcp_nopush on;  # send full packets (TCP_CORK) when using sendfile

  keepalive_requests 100;

  open_file_cache max=100;

  gzip            off;
  gzip_min_length 1024;

  access_log off;  # no per-request logging overhead

  server {
    listen *:8080 backlog=16384;

    location / {
      root   html;
      index  index.html;
    }
  }
}

Hardware

A dual Intel Xeon X5670 with 24GB of RAM from SoftLayer. The X5670 has 6 cores @ 2.93 GHz with 2 threads per core, so /proc/cpuinfo shows 24 CPUs.
