The art of benchmarking

A benchmark comparing JavaEE and NodeJS has been making a buzz on the web lately. I was surprised by the results and decided to reproduce it to verify an intuition. The article was also followed by multiple comments that themselves deserve comment. Which brings us to the current blog post.

But what is a benchmark? A benchmark is meant to measure the performance of a piece of software. It is an attempt to reproduce in the laboratory what will happen in production. If you ask the domain experts, they will tell you that benchmarks are a really dangerous, even though useful, tool, and that they pretty much always give false results. From my point of view, that is a bit exaggerated, but not by much, as we will see below.

However, one thing is certain. Benchmarks must not be taken lightly and some precautions are in order. For instance:

  • Have database volumes similar to the target
  • Have a data set sufficiently heterogeneous and meaningful to prevent any cache induced bias
  • Have hardware with throughput at least proportional to the target’s

Note that, contrary to popular belief, benchmarks are generally optimistic. If you have bad response times during your benchmark, you can bet that they won’t improve in production. It is the opposite you should beware of: good response times during a benchmark do not guarantee good response times in production, because the benchmark might be flawed.

Another interesting topic is the benchmark perimeter, that is, its goal. The one that concerns us here is NodeJS vs J2EE. To be precise, reading the code shows that the Java side is a servlet retrieving a JSON document from CouchDB (and not CouchBase) through couchdb4j. That’s what we want to test. It’s meaningless to suggest using Vert.x instead because its behavior is closer to NodeJS. That’s not what we want to test. We are testing NodeJS against a servlet using couchdb4j to retrieve a JSON document. That’s it. Whether another technology could be more efficient is a matter for another benchmark.

Then, and that’s what worries the experts, you need to make sure you are testing what you think you are testing. For instance, for this benchmark, we might criticize the use of a database on the same machine as the application server. It prevents you from clearly knowing which server is using the system resources. However, I’ve configured the same environment to stay close to the original.

Conversely, benchmarks are frequently criticized on the optimization side. “Have you activated the parameter xxx?”. I disagree with that approach. A benchmark is always done to the best of one’s knowledge. As long as you have made sure you are testing the right thing (that it’s really the application server that is the bottleneck, for instance), you’re good.

What do I mean by “to the best of one’s knowledge”?

Say, for example, that I’m testing Java against C++, but I’m not aware that I should add compilation flags to optimize the C++ code, and Java wins the contest. My results are not flawed. Because, to the best of my knowledge, if I put the Java program in production, it will be faster. My goal is reached. My benchmark is valid.

Of course, if I then publish my results and get recommendations from others, my knowledge has now increased. So I can redo the benchmark taking advantage of my new wisdom, if I still have the time and budget for it, obviously. And remember, a benchmark is a scientific experiment. Publishing the protocol is as important as publishing the results.

But back to our original subject. If we dig a little bit into the benchmark, we notice that the results are suspicious. Indeed, it doesn’t feel right to have a stable requests per second rate and longer response times when adding virtual users. That’s a sign that we have a bottleneck somewhere.

So I’ve reproduced the benchmark. I haven’t used the same framework versions because some were quite old and hard to find. So I’ve set up these ones:

  • Java HotSpot 64 bits 1.7.0_21
  • Tomcat 7.0.35
  • CouchDB 1.2.0
  • couchdb4j 0.3.0-i386-1

All this was run on an Ubuntu 13.04 VM with 4 CPUs, but that’s not really important.

I’ve assumed that the “concurrent requests” mentioned were in fact the number of virtual users. I’ve replaced JMeter with Gatling as a matter of personal taste. I got the following results:

Concurrent requests | Average response time (ms) | Requests/second
10                  | 233                        | 43
50                  | 1237                       | 42
100                 | 2347                       | 42
150                 | 3506                       | 42
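
A quick way to read this table is Little's Law: in a closed system where N virtual users fire requests back to back, throughput is roughly N divided by the mean response time. The sketch below applies it to the four rows above. It assumes the users have no think time between requests (an assumption, since the Gatling scenario is not shown here), and the class and method names are made up for the illustration.

```java
// Sanity check with Little's Law: throughput = users / response time.
// Assumption: a closed workload with no think time between requests.
final class LittlesLawCheck {

    /** Throughput in requests/second implied by N users and a mean response time in ms. */
    static double impliedThroughput(int users, double responseTimeMs) {
        return users / (responseTimeMs / 1000.0);
    }

    public static void main(String[] args) {
        // Values copied from the results table above.
        int[] users = {10, 50, 100, 150};
        double[] responseMs = {233, 1237, 2347, 3506};
        int[] measuredReqPerSec = {43, 42, 42, 42};

        for (int i = 0; i < users.length; i++) {
            System.out.printf("%3d users: implied %.1f req/s, measured %d req/s%n",
                    users[i],
                    impliedThroughput(users[i], responseMs[i]),
                    measuredReqPerSec[i]);
        }
    }
}
```

The implied throughput stays between roughly 40 and 43 requests per second on every row. In other words, each added user only adds waiting time, which is what a saturated resource looks like from the outside.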

Good news! My results are consistent with the ones in the article. By that I mean that I also encounter this suspicious stability of the requests per second rate (raw performance is a lot lower for both Java and NodeJS on my machine, for some reason that is hard to explain without the original benchmark code). It still feels like a bottleneck. To be convinced, a little vmstat is in order.

cs   us sy id
1134  3  4 93
1574 13 13 74
1285 13 13 74
1047 12 13 75

It shows us that our CPUs, under heavy load, are just hanging around doing nothing.

So I logically follow up this analysis with a thread dump to determine what’s blocking my system.

"http-bio-8080-exec-294" - Thread t@322
java.lang.Thread.State: WAITING
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <49921538> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
at org.apache.http.impl.conn.tsccm.WaitingThread.await(WaitingThread.java:159)
at org.apache.http.impl.conn.tsccm.ConnPoolByRoute.getEntryBlocking(ConnPoolByRoute.java:339)
at org.apache.http.impl.conn.tsccm.ConnPoolByRoute$1.getPoolEntry(ConnPoolByRoute.java:238)
at org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager$1.getConnection(ThreadSafeClientConnManager.java:175)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:324)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:555)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:487)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:465)
at com.fourspaces.couchdb.Session.http(Session.java:476)
at com.fourspaces.couchdb.Session.get(Session.java:433)
at com.fourspaces.couchdb.Database.getDocument(Database.java:361)
at com.fourspaces.couchdb.Database.getDocument(Database.java:315)
at com.henri.couchbench.GetDocumentServlet.doGet(GetDocumentServlet.java:26)

Interesting. The thread is waiting to obtain an HTTP connection to CouchDB. The problem is that couchdb4j uses httpcomponents, and httpcomponents only allows, by default, 2 parallel connections per route, that is, per target host. That’s sad for my 150 users. They just have to wait for their turn.
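
The effect of such a small pool can be reproduced in isolation. The sketch below is a toy model, not the benchmark code: a Semaphore stands in for the connection pool, Thread.sleep stands in for the round trip to the database, and every name and number in it is made up for the illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Toy model of a bounded connection pool (hypothetical names and values).
final class PoolBottleneckDemo {

    /**
     * Starts `users` threads; each issues `requestsPerUser` fake requests that
     * hold one of `poolSize` permits for `serviceMs`. Returns elapsed time in ms.
     */
    static long simulate(final int users, final int poolSize,
                         final int requestsPerUser, final long serviceMs) {
        final Semaphore pool = new Semaphore(poolSize, true); // stand-in for the pool
        ExecutorService exec = Executors.newFixedThreadPool(users);
        long start = System.nanoTime();
        for (int u = 0; u < users; u++) {
            exec.submit(new Runnable() {
                @Override
                public void run() {
                    try {
                        for (int r = 0; r < requestsPerUser; r++) {
                            pool.acquire();              // wait for a free "connection"
                            try {
                                Thread.sleep(serviceMs); // pretend to query the database
                            } finally {
                                pool.release();
                            }
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }
        exec.shutdown();
        try {
            exec.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        int users = 20;
        int requestsPerUser = 5;
        long serviceMs = 10;
        for (int poolSize : new int[] {2, 20}) {
            long elapsedMs = simulate(users, poolSize, requestsPerUser, serviceMs);
            double reqPerSec = users * requestsPerUser * 1000.0 / elapsedMs;
            System.out.printf("pool=%d: %d ms total, about %.0f req/s%n",
                    poolSize, elapsedMs, reqPerSec);
        }
    }
}
```

With 2 permits, at most 2 requests are in flight however many users are started, so the throughput is capped and the extra users only queue. Raising the permit count to match the number of users removes the queue, at least in this toy model.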

I fix that by hacking the driver to hardcode a limit of 150 connections. Presto! My response times are cut roughly in half and the requests per second rate is now around 100.

Concurrent requests | Average response time (ms) | Requests/second
10                  | 100                        | 96
50                  | 416                        | 103
100                 | 1023                       | 95
150                 | 1450                       | 89

But my job is not done. If I look at my new metrics, CPU usage is higher. However, I’m seeing quite a high system CPU (sy) and context switching (cs) going through the roof.

cs    us sy id
11979 57 21 22
12538 54 22 24

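Tomcat 7 uses a blocking HTTP connector (BIO) by default, which ties up one thread per connection. Let’s switch to the NIO connector (org.apache.coyote.http11.Http11NioProtocol, set as the Connector protocol in server.xml) to reduce the number of threads required to handle the load.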

Concurrent requests | Average response time (ms) | Requests/second
10                  | 54                         | 177
50                  | 135                        | 357
100                 | 278                        | 348
150                 | 1988                       | 73

Yippee! The results are really good. For 50 users, compared to the first run, the requests per second rate is 8.5 times higher and response times are about 9 times lower.

But there is a surprise. The response times for 150 users are worse. The reason is simple. If you have a look at the GC graphs, because of the higher throughput, the memory given to the JVM isn’t sufficient for 150 users. The CPU time dedicated to GC is now at 30%.
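
For readers without a GC graph at hand, a rough equivalent can be computed from inside the JVM with the standard java.lang.management beans. The sketch below is only an approximation of the figure above: it reports accumulated collection time (elapsed time, not CPU time) as a share of the uptime since JVM start, not over the load test window. The class and method names are made up for the illustration.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Rough GC overhead estimate using the standard management beans.
final class GcOverhead {

    /** Accumulated GC time divided by JVM uptime since start. */
    static double gcTimeFraction() {
        long gcMs = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                gcMs += t;
            }
        }
        long uptimeMs = ManagementFactory.getRuntimeMXBean().getUptime();
        return uptimeMs > 0 ? (double) gcMs / uptimeMs : 0.0;
    }

    public static void main(String[] args) {
        System.out.printf("GC time so far: %.1f%% of uptime%n", 100 * gcTimeFraction());
    }
}
```

Sampling this value before and after a load run, and comparing the two, would give a figure closer to what a GC graph shows for that window.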

But I will stop my optimization for now. This post is already pretty long.

To conclude, a benchmark is a really useful tool, to be used with care. Nobody knows all the possible optimizations of a system and everyone does their best. However, you should keep your eyes open. Suspicious results are a sign that something is wrong and should be investigated. In case of doubt, call an expert ;-)