The art of benchmarking

A benchmark comparing JavaEE and NodeJS created quite a buzz on the web lately. I was surprised by the results and decided to reproduce it to verify an intuition. The article was also followed by multiple comments that are themselves worth commenting on. Which brings us to the current blog post.

But what is a benchmark? A benchmark is meant to measure the performance of a piece of software: an attempt to reproduce in the laboratory what will happen in production. If you ask the domain experts, they will tell you that benchmarks are a really dangerous, even though useful, tool, and that they pretty much always produce misleading results. From my point of view, that is a bit exaggerated, but not by much, as we will see below.

However, one thing is certain: benchmarks must not be taken lightly, and some precautions are in order. For instance:

  • Use database volumes similar to the target
  • Use a data set heterogeneous and meaningful enough to prevent any cache-induced bias
  • Run on hardware with throughput at least proportional to the target

Note that, contrary to popular belief, benchmarks are generally optimistic. If you see bad response times during your benchmark, you can bet they won’t improve in production. It is the opposite you should beware of: good response times during a benchmark do not guarantee good response times in production, because the benchmark might be flawed.

Another interesting topic is the benchmark’s perimeter: its goal. The one that concerns us here is NodeJS vs J2EE. To be precise, by reading the code, we notice that on the Java side it is a servlet retrieving a JSON document from CouchDB (and not Couchbase) through couchdb4j. That’s what we want to test. It is pointless to suggest using Vert.x instead because its behavior is closer to NodeJS. That’s not what we want to test. We are testing NodeJS against a servlet using couchdb4j to retrieve a JSON document. That’s it. Knowing whether another technology could be more efficient is another benchmark.

Then, and that’s where experts get wary, you need to make sure you are testing what you think you are testing. For this benchmark, for instance, we might criticize running the database on the same machine as the application server: it prevents you from knowing clearly which server is using the system resources. However, I’ve configured the same environment to stay close to the original.

Conversely, benchmarks are frequently criticized on the optimization side: “Have you activated parameter xxx?”. I disagree with that approach. A benchmark is always done to the best of one’s knowledge. As long as you have made sure you are testing the right thing (that it is really the application server that is the bottleneck, for instance), you’re good.

What do I mean by “to the best of one’s knowledge”?

Say, for example, I’m testing Java against C++, but I’m not aware that I should add compilation flags to optimize the C++ code, and Java wins the contest. My results are not flawed. Because, to the best of my knowledge, if I put the Java program in production, it will be faster. My goal is reached. My benchmark is valid.

Of course, if I then publish my results and get recommendations from others, my knowledge has increased. So I can redo the benchmark taking advantage of my new wisdom. If I still have the time and budget for it, obviously. And remember, a benchmark is a scientific experiment. Publishing the protocol is as important as publishing the results.

But back to our original subject. If we dig a little into the benchmark, we notice that the results are suspicious. Indeed, it doesn’t feel right to see a stable requests-per-second rate but slower response times as virtual users are added. That’s the sign of a bottleneck somewhere.

So I reproduced the benchmark. I haven’t used the same framework versions because some were quite old and hard to find, so I bootstrapped these:

  • Java HotSpot 64 bits 1.7.0_21
  • Tomcat 7.0.35
  • CouchDB 1.2.0
  • couchdb4j 0.3.0-i386-1

All this ran on an Ubuntu 13.04 VM with 4 CPUs, but that’s not really important.

I’ve assumed that the “concurrent requests” mentioned were in fact the number of virtual users. I replaced JMeter with Gatling as a matter of personal taste. I got the following results:

 

Concurrent Requests   Average Response Time (ms)   Requests/second
10                    233                          43
50                    1237                         42
100                   2347                         42
150                   3506                         42
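These numbers already carry the signature of a saturated bottleneck. In a closed benchmark, Little’s law says that once throughput X is pinned, the average response time for N virtual users is roughly W = N / X. A quick sanity check (the helper below is mine, not part of the original benchmark; the measured values come from the table above):

```java
// Little's law check: W = N / X for a closed system at saturation.
// A flat ~42 req/s predicts the measured response times almost exactly,
// which points at a queue in front of a bottleneck, not at slow code.
public class LittlesLaw {

    // Predicted average response time (ms) for `users` virtual users
    // when the system is capped at `throughput` requests per second.
    static double predictMs(int users, double throughput) {
        return users / throughput * 1000.0;
    }

    public static void main(String[] args) {
        double throughput = 42.0;                   // flat rate from the table
        int[] users      = {10, 50, 100, 150};
        int[] measuredMs = {233, 1237, 2347, 3506}; // measured above
        for (int i = 0; i < users.length; i++) {
            System.out.printf("%3d users: predicted %4.0f ms, measured %4d ms%n",
                    users[i], predictMs(users[i], throughput), measuredMs[i]);
        }
    }
}
```

The predictions (238, 1190, 2381 and 3571 ms) track the measurements closely, which is exactly what a queue in front of a capped resource produces.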

Good news! My results are consistent with the ones in the article. By that I mean that I also observe this suspicious stability of the requests-per-second rate (raw performance is a lot lower for both Java and NodeJS on my machine, for some reason hard to explain without the original benchmark code). It still feels like a bottleneck. To be convinced, a little vmstat is in order.

cs   us sy id
1134  3  4 93
1574 13 13 74
1285 13 13 74
1047 12 13 75

It shows us that, even under heavy load, our CPUs are mostly idle, just hanging around doing nothing.

So I logically follow this analysis with a thread dump to determine what’s blocking my system.

"http-bio-8080-exec-294" - Thread t@322
java.lang.Thread.State: WAITING
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <49921538> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
at org.apache.http.impl.conn.tsccm.WaitingThread.await(WaitingThread.java:159)
at org.apache.http.impl.conn.tsccm.ConnPoolByRoute.getEntryBlocking(ConnPoolByRoute.java:339)
at org.apache.http.impl.conn.tsccm.ConnPoolByRoute$1.getPoolEntry(ConnPoolByRoute.java:238)
at org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager$1.getConnection(ThreadSafeClientConnManager.java:175)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:324)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:555)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:487)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:465)
at com.fourspaces.couchdb.Session.http(Session.java:476)
at com.fourspaces.couchdb.Session.get(Session.java:433)
at com.fourspaces.couchdb.Database.getDocument(Database.java:361)
at com.fourspaces.couchdb.Database.getDocument(Database.java:315)
at com.henri.couchbench.GetDocumentServlet.doGet(GetDocumentServlet.java:26)

Interesting. The system is waiting to obtain an HTTP connection to CouchDB. The problem is that couchdb4j uses HttpComponents, and HttpComponents only allows, by default, 2 connections in parallel. That’s sad for my 150 users. They just have to wait their turn.
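The shape of that queue can be reproduced with nothing but the JDK: a Semaphore with 2 permits stands in for the 2-connection pool, and throughput flattens at permits / service-time no matter how many threads are waiting. A minimal sketch (the numbers are illustrative, not taken from the benchmark):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Stdlib-only model of the bottleneck: many request threads funneled
// through a tiny pool of "connections". Just like in the thread dump,
// most threads spend their time parked waiting for a permit.
public class PoolBottleneck {

    // Runs `users` concurrent requests through a pool of `permits`
    // connections, each holding one for `serviceMs` ms.
    // Returns the total elapsed wall-clock time in milliseconds.
    static long runSimulation(int users, int permits, long serviceMs)
            throws InterruptedException {
        Semaphore pool = new Semaphore(permits);
        ExecutorService exec = Executors.newFixedThreadPool(users);
        long start = System.nanoTime();
        for (int i = 0; i < users; i++) {
            exec.submit(() -> {
                try {
                    pool.acquire();        // threads park here, exactly like
                    try {                  // in ConnPoolByRoute.getEntryBlocking
                        Thread.sleep(serviceMs);
                    } finally {
                        pool.release();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        exec.shutdown();
        exec.awaitTermination(1, TimeUnit.MINUTES);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        // 20 requests / 2 permits * 50 ms each take at least 500 ms total:
        // throughput is capped at ~40 req/s however many users are queued.
        System.out.println("elapsed ≈ " + runSimulation(20, 2, 50) + " ms");
    }
}
```

Doubling the number of users in this model doubles the average response time while leaving the requests-per-second rate untouched, which is precisely the pattern in the first results table.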

I fix that by hacking the driver to hardcode a limit of 150 connections. Hop! My response times are cut in half and the requests-per-second rate is now around 100.

 

Concurrent Requests   Average Response Time (ms)   Requests/second
10                    100                          96
50                    416                          103
100                   1023                         95
150                   1450                         89

But my job is not done. Looking at my new metrics, CPU usage is indeed higher. However, I’m seeing quite a lot of system CPU (sy), and the context switching is sky-high.

cs    us sy id
11979 57 21 22
12538 54 22 24

Tomcat 7 uses a blocking (synchronous) HTTP connector by default. Let’s switch to NIO to reduce the number of threads required to handle the load.
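For reference, the switch is a one-attribute change on the HTTP connector in Tomcat’s conf/server.xml (a sketch; every value other than protocol is whatever your installation already uses):

```xml
<!-- conf/server.xml: replace the default blocking HTTP/1.1 connector
     with the NIO one so a small thread pool can serve many connections -->
<Connector port="8080"
           protocol="org.apache.coyote.http11.Http11NioProtocol"
           connectionTimeout="20000"
           redirectPort="8443" />
```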

 

Concurrent Requests   Average Response Time (ms)   Requests/second
10                    54                           177
50                    135                          357
100                   278                          348
150                   1988                         73

Yippee! The results are really good. For 50 users, an 8.5× increase in requests per second and response times 10× lower.

But there is an unexpected surprise: the response times for 150 users are worse. The reason is simple. If you look at the GC graphs, because of the higher throughput, the JVM heap is no longer sufficient for 150 users. The CPU time spent in GC is now at 30%.
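Without reproducing the graphs here, the kind of fix is straightforward: give the JVM more heap and make the GC cost observable. A hypothetical tweak via Tomcat’s bin/setenv.sh (the sizes are illustrative, not tuned for this workload; the logging flags are HotSpot 7 syntax):

```shell
# bin/setenv.sh -- illustrative values, not a tuned recommendation.
# A larger fixed heap relieves the pressure caused by the higher
# throughput; the GC flags make the ~30% GC time visible directly.
export CATALINA_OPTS="-Xms1g -Xmx1g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
```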

But I will stop my optimization for now. This post is already pretty long.

To conclude, a benchmark is a really useful tool that must be handled with care. Nobody knows all the possible optimizations of a system, and everyone does their best. However, keep an eye open: suspicious results are a sign that something is wrong and should be investigated. In case of doubt, call an expert ;-)

7 comments for “The art of benchmarking”

  1. It looks like “The art of profiling” would be a better title. Nice post anyway!

  2. Hi Henri,

    Brilliant post rebutting a bad benchmark. Also this is a great example of how to use my Performance Diagnostic Model as I teach it in my course. ;-) I might

  3. You can use adaptive control valves to determine the optimal level of concurrency which is something that is not necessarily constant throughout the day or under different workloads.

    I wrote two articles on applying this control mechanism to Apache Cassandra.

    http://www.jinspired.com/site/adaptively-controlling-apache-cassandra-client-request-processing

    http://www.jinspired.com/site/canceling-uncontrolled-jitter-with-adaptively-controlled-jitter

  4. @Praise, no, this is the art of benchmarking: recognizing when things are wrong and taking appropriate steps to fix them. Since benchmarking is about performance, it seems appropriate that to fix it you’d engage in performance troubleshooting… which Henri did.

    Henri recognized the results were the product of an artificially constrained workload and proceeded from there. You don’t have to read William’s blog postings to know that having properly sized connection/thread pools is an important aspect of a well-tuned system.

  5. “You don’t have to read William’s blog postings to know that having properly sized connection/thread pools is an important aspect of a well-tuned system”

    Well, it looks like you yourself did not read the articles…hmmm…what’s new.

    The point of the articles is that no one (no man) can determine from the outset the appropriate set point of such a system. It will change with time and with workload changes. It is far better for the system to adaptively tune this setting. Our work should be to engineer such adaptive mechanisms and then stand back and let the self-regulating mechanism (the machine) do its magic.

    I covered this in a TSS video here (3rd video segment):

    http://www.theserverside.com/feature/A-revolutionary-new-approach-to-application-monitoring-with-William-Louth

  6. Hi William,

    The maths behind queues is very well known, and I would suggest that one man can indeed set a reasonable limit on a thread pool size. It just requires that you first understand what your constraining/contended resource is and its average service time; then you can set a limit that prevents your users from taking over your system. I’ll agree that this gets trickier in a multi-modal system, but it’s still a reasonable approach.

    As for other thoughts from your blog: I now use it as a perfect example of how *not* to find memory issues. By treating a memory problem as an execution problem, you end up saying the world is a single color and everything should be viewed that way, and in doing so you make it more difficult to see and understand why. If you look at it as a memory problem, the different view makes it much, much simpler to see and understand why.

  7. Kirk, it is not a memory problem; it is either a (resource) capacity management problem or a workflow control problem. You either add more capacity to meet the requirements, assuming you could make the world ever so small and static, or you control the cost of execution (and its allocation rate).

    …and queuing laws are still, to this day, being misrepresented and misinterpreted… even by those claiming to be specialists. But AGAIN, that is not the issue here. Code changes. Workload changes. Systems and resources change (they too have queues). Time changes.

    THERE IS NO ONE NUMBER THAT CAN BE OPTIMAL AT ALL TIMES. THE SYSTEM AND THE VALUES THAT DRIVE ITS BEHAVIOR MUST ALSO BE ADAPTED WITHIN AN EXPLORATIVE MODE. BUT IF PEOPLE CAN’T ADAPT OR THINK IN TERMS OF ADAPTIVE MECHANISMS THEN YOU MIGHT AS WELL PICK FROM THE FOLLOWING SET

    {10, 25, 50, 100, 1000}
