Dropwizard TechEmpower Benchmark - Part 2:Ā URI Parsing,Ā Guava, and Exceptions
This is part 2 of an (I hope) many part series of posts exploring Dropwizard's TechEmpower benchmarks. When we left off, I'd managed to get the benchmarks running locally and wasn't impressed by what I saw.
Where to Begin?
Where you start when things are slow (or even broken) is important. We begin with relatively little information to go on. Simple HTTP requests to http://localhost:9090/json have much higher latency and latency variability than I would expect. While the benchmark weighs in at a mere 485 lines of Java (cloc is a wonderful utility), Dropwizard weighs in at a much healthier 38,235 lines of Java (although that includes all of the modules, whether used or not). In addition, Dropwizard pulls many libraries in via Maven.
Consequently, it's unlikely that we're going to solve this problem by simply doing a code review (I mean, you mightā¦but I'm not). So we need to find another way to get started. In times like this I reach for a profiler of some sort. YourKit has always been a popular choice if you have budget. JConsole can work in a pinch and ships with the JDK. Lately I've been hearing more about Java Mission Control, so I gave that a spin. I won't get into too many details, but Java Mission Control and it's companion, Java Flight Controller are pretty excellent and they both come with the Oracle JDK.
How to Profile?
Posessing my new toy, I fired up the TechEmpower Dropwizard program to be benchmarked, started Java Flight Recorder, and proceded to run a script I'd constructed by dissecting what traffic the benchmark was generating. The script runs wrk, an HTTP load generation tool (a newer ApacheBench for those of you who know that tool). The script executes a short "priming" run where it's looking to make sure the server being tested is actually working. It then executes an intensive "warmup" run where it's attempting to load code, prime caches, and get the program through any JIT compilation stages (important for benchmarking a Java server). Once that's finished, it runs the benchmark stages: 15 second tests with 2 traffic driving threads at 8, 16, 32, 64, 128, and 256 concurrent connections. When the script finishes, I stop Java Flight Recorder and dig into the results.
What I'm Looking For
Now that I've collected my profile, it's time to see what was going on while the benchmark was running. Given how erratic the response times are in the original benchmarking runs, I suspect the JVM garbage collector is at play. Depending on what garbage collector you choose (I hope to do an entire post on that in the future), many of your collections are going to run without too much impact to your process. Major collections, however, are stop-the-world affairs. In these situations, the JVM stops all application threads while the garbage collector scours and scrubs memory in an attempt to reclaim memory. The JVM doesn't care what your application is doing when it decides to do one of these. Your threads get stopped, even if they're in the middle of servicing a user request. Externally, the benchmark script has no idea this is happening. It just notices that requests stall momentarily. This periodic slowing of requests is really good at providing the latency distributions we're seeing: nice, low latency numbers in the lower percentiles and ugly, higher latency numbers in the upper percentiles.
Java Flight Recorder has an entire section of the UI dedicated to memory. You can see how much is used over time. You can see the garbage collections over time, both minor and major, and how long they take (pausing your application in the process). Even more useful, like any other profiler Java Flight Recorder records all of the object allocations including the number of allocations, the size of the allocations, and the stack trace showing where the allocations are made. I quickly scan the list of classes:
There's no real smoking gun just from looking at the object types. Byte arrays aren't really surprising for a server that's reading and writing bytes to sockets. Character arrays and Strings aren't very surprising, eitherā¦you can easily make a lot of Strings. Integer arrays and Object arrays might be interesting. It kind of depends on what they're getting used for.
Curious to dig a bit, I select java.lang.Object[] and look at the stack traces where they're being generated:
Whoaā¦stop right there. Do you see the third entry: java.lang.Throwable.fillInStackTrace(int)? Throwable is part of Java's Exception hierarchy. I didn't see any failed requests during the run, so what's throwing so many exceptions that I'm generating 1.39GB of java.lang.Object[] just to fill in stack traces?
Ohā¦Hey, Guava
Expanding the java.lang.Throwable.fillInStackTrace(int) stack trace reveals a few things. First, Jersey (the reference implementation of JAX-RS, which Dropwizard uses for REST) has repackaged and included its own copy of Google's Guava (a popular collection of common code) internally (you can see it in the jersey.repackaged.⦠classes in the call stack). The exception being thrown is created in com.google.common.net.InetAddresses.forUriString(String), which is called by com.google.common.net.InetAddresses.isUriInetAddress().
Looking more closely at the code, this is a common anti-pattern where exceptions are used as flow control. In this case, isUriInetAddress() calls forUriString(), which throws an IllegalArgumentException if the host string given isn't a valid IP address. isUriInetAddress() catches the exception and returns false. In this particular case, the exception isn't necessary. It's immediately caught and discardedā¦including the java.lang.Object[] that was allocated to fill in the stack trace.
Following the stack trace a bit farther, we see org.glassfish.jersey.servlet.ServletContainer.service(ServletRequest, ServletResponse). This means that the exception throwing code in Guava is being called on every HTTP request. Given that we're trying to run a benchmark, we know we're trying to do a very high rate of requests. That causes us to dump a high rate of unused Object arrays onto the heap for the garbage collector to deal with.
Allocation is expensive enough as it is, but you pay a double cost when allocating memory with languages like Java because of the garbage collector. So it's best not to allocate memory unless you really need to.
This is also a case where the cost of allocation is largely hidden from the caller. The developer thinks they're just throwing an Exception. Under the hood, however, is a relatively expensive operation. This particular case is even more costly because the code generating the message for the exception, formatIllegalArgumentException(String, Object...) is executing java.lang.String.format(Locale, String, Object...), which is creating even more throw away memory, not to mention unnecessarily burning CPU cycles.
Guava, Could You Not?
Fortunately, the solution is pretty straightforward. isUriInetAddress() needs a version of forUriString() that doesn't throw an IllegalArgumentException. Maybe one that just returns null instead, so isUriInetAddress() can simply return forUriString(ipString) != null. I made the change locally, rebuilt the server, and ran a new benchmark against the original benchmark code and my newly enhanced Guava-based code.
This time instead of running the TechEmpower benchmark I ran a simpler benchmark that executed a fixed number of requests (500,000). With a fixed number of requests we can more accurately compare memory profiles. If we fixed the time, instead, it's possible that a faster server would execute more requests and since the memory use we're targeting is based on the number of requests being made, the memory profiles would be more difficult to compare.
Once again, I recorded both runs with Java Flight Recorder. I then held them side by side. The "base" version of the code executed 281 minor garbage collections (0 major) while the "enhanced" version executed 268 minor garbage collections (also 0 major). I'm not entirely sure how to reconcile the small difference here. Used, reserved, and committed heap size appears pretty similar in both tests. You may be able to see in the screenshot that I've bounded the time considered to just when the benchmark was running. You can also see that the bounded time for the "base" test was about 42 seconds while the bounded time for the "enhanced" test was about 36.5 seconds. So even though the collections were comparable, the "enhanced" run was about 13% shorter.
The major sign of improvement comes from looking at the Thread Local Allocation Buffer (TLAB) statistics. For the "base" version, there was a TLAB count of 185,187 and 16.70GB of total memory allocated. For the "enhanced" version, there was a TLAB count of 155,059 (>16% reduction) and 12.91GB of total memory allocated (>22% reduction). One thing that's incredible about this is that the "base" profile showed only 130.77MB of java.lang.Object[] allocated as a result of Guava throwing exceptions. We shaved 3.79GB off the total allocations, however. So it's obvious that while Object arrays are what got our attention, they aren't the only thing that was being allocated by the thrown exceptions.
I suspect the lack of improvement in GC counts and time spent is mostly due to GC tuning. I did a separate run with -XX:NewRatio=1, which doubled the size of the "young" generation: the default is -XX:NewRatio=2 and sizes the young-to-tenured generations by setting the young generation to 1 / (1 + NewRatio) of the heap. Comparing -XX:NewRatio=2 to -XX:NewRatio=1 with the Guava enhancement yields a drop from 268 to 178 minor collections and a drop in overall GC time from 685ms to 458ms. Nothing in this run affected how much garbage was being generated or how much needed cleaning, only the ratio of the size of the young generation to the tenured generation.
This makes sense since most of the objects being created are during a request, which we've seen can complete in about a millisecond. A larger young generation means more space for these short-lived objects. One thing that's slightly unintuitive to me, however, is that even though the collections are less frequent they take just as long, even though they're collecting more garbage. A quick look at the verbose GC output shows a minor collection with -XX:NewRatio=2 collecting 69,976K in 0.0014054 seconds and a -XX:NewRatio=1 minor collection wrangling 104,965K in 0.0019387 seconds. That comes out to 49,790,806K/s and 54,141,950K/s respectively. That's an 8.7% improvement in collection speed for doing nothing more than re-arranging our heap.
This is where my garbage collection knowledge begins to run out, so I'll leave explaining what's going on there for another day. In the meantime, I'll also make a mental note to adjust the young-to-tenured ratio for the official benchmark.
But Does the Benchmark Improve?
Now the important question, did the benchmark improve at all? I'll preface this by saying I don't have a strong benchmark setup locally just yet. So some of my numbers aren't the best. I think there's contention between a few processes that needs to be sorted out by setting up a proper benchmark environment.
With that said, I moved the code to an EC2 instance (a t2.micro, which I realize is horribly under-powered). With the code there, I re-ran the benchmarks and the tl;dr is yes, the JSON benchmark is quite a lot better.
At 8 concurrent connections average latency goes from 2.40ms to 1.86ms with the standard deviation reducing from 4.90ms to 1.91ms. 99th percentile latency drops from 25.33ms to 6.90ms. Overall throughput climbs from 4,546 requests per second to 4,763 requests per second.
We see similar gains at 16 concurrent connections as well. Average latency drops from 6.87ms to 3.13ms. Standard deviation drops from 15.82ms to 3.65ms. 99th percentile latency drops from 80.92ms to 13.71ms. Throughput shoots from 4,746 requests per second to 5,742 requests per second.
We continue seeing gains at 32, 64, 128, and 256 concurrent connections. However, the gains aren't nearly as strong. This may be partly due to the t2.micro instance not having the horsepower to keep up (not enough CPU cores for so many concurrent requests). Future development is going to require a better benchmark environment. Hopefully something closer to whatever TechEmpower is using.
Output for the base and enhanced Guava versions of the benchmark is available here.
So, Guavaā¦
Right, so none of this helps if we don't get the patches in the hands of the right people. As luck would have it, my very first contribution to Guava was recently accepted! Now everyone can check host strings for IP addresses without fear of over-taxing their garbage collector.
A few things need to happen, however. First, remember how I mentioned that Jersey packages their own Guava? Yeahā¦they're going to have to upgrade the version of Guava that they bundle to include my changes. Then they're going to have to release a new Jersey with the new, bundled Guava. Once that's done, Dropwizard needs to pick up the new Jersey (preferably the new Guava as well since Dropwizard has it's own dependency on not-bundled-with-Jersey Guava). Once that's done, all of the Dropwizard users need to update to the new Dropwizard with the new Jersey with the new Guava (I make it sound so simple).
Of course, if you're impatient you can patch your own JAR files. That's how I tested all of this before submitting any patches. š
We're Done, Then?
Goodness no, we're not done with the TechEmpower benchmark by a long shot. This is just the first thread I pulled. In the course of looking for this leak I found a few other areas to explore and not all of them are memory-related (tease). Before we get there, however, I need to get a proper benchmark environment set up and maybe re-run the analysis from the Guava fix to establish new, better baselines for comparison.
If you've enjoyed this post and you're not already, follow me on Twitter to be notified of the next post. I'm also interested in feedback, questions, suggestions, emoji (preferrably not š©, keep that to yourself), cat pics, and reactions (GIFs are more than acceptable). Send those via Twitter.
If you're an intrepid Dropwizard user and you patch your service before new Guava, Jersey, and Dropwizard libraries are released I'd love to hear how it goes. If you want help figuring out how to do that, hit me up on Twitter. I'd love to see if this positively impacts real world applications. Same goes for Guava users who aren't also Dropwizard users. I'm curious to know if I've made life measurably better for you as well.
Next post in about a week (I hope). As I said, I already have some other fun areas for exploration in mind.
If you've read this far, know that I appreciate you. š













