I am writing this post to summarize the 1.5 weeks I spent on performance tuning of the new machines we got.
Before that, I had gotten the best performance out of our multi-threaded DNS cache resolver on another set of machines, which were borrowed from another team. Then we purchased these new machines, which have 20 CPU cores (versus 16 in the other set), 132G of memory (versus 64G), and the same NIC (10GbE). The operating system is CentOS 6.5 on both sets of machines. But the performance I got from the new machines was miserable (1/5 of the best performance).
This is the first time I have ever been involved in machine performance tuning, so I had no clue where to start.
I started by inspecting all the network card settings: the receive and transmit buffer settings and the rx-flow-hash setting. Except that the new machines have a newer version of the network card firmware, I didn't find anything different. I asked our IT guys if it was possible to downgrade the firmware version, and was told NO.
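A comparison like this is easier to do with a small script than by eye. The following is a minimal sketch, assuming the output of a command such as `ethtool -k eth0` has been saved from each machine as plain "name: value" lines. The function names and the sample values are made up for illustration and are not taken from the real machines.

```python
def parse_settings(text):
    """Turn 'name: value' lines into a dict, ignoring headers and blank lines."""
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip():
            settings[name.strip()] = value.strip()
    return settings


def diff_settings(old_text, new_text):
    """Return {name: (old_value, new_value)} for every setting that differs."""
    old, new = parse_settings(old_text), parse_settings(new_text)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }


# Hypothetical sample data, only to show the shape of the input.
OLD = """Features for eth0:
rx-checksumming: on
generic-receive-offload: on
"""
NEW = """Features for eth0:
rx-checksumming: on
generic-receive-offload: off
"""

if __name__ == "__main__":
    print(diff_settings(OLD, NEW))
```

A setting that exists on only one machine shows up with None on the other side, so missing entries are not silently skipped.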
Then I started looking at the CPU settings. Man, there is so much difference there. First I noticed the CPUs in the new machines have a different set of flags from the CPUs in the old machines. So I asked the IT guy if we could change the flags. I was told NO again, unless we were the OEM (of course we are not). At this point I felt pretty stupid for asking these two questions, although I couldn't really find out online whether those changes could be made. The IT guy got a little impatient too, I think, because he sent out an email to all the engineers with a link to the Red Hat performance tuning book and some SystemTap scripts.
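The flag comparison can be reduced to a set difference. This is a rough sketch, assuming the contents of /proc/cpuinfo from each machine are available as text and that the machines are x86 (where the field is called "flags"). The sample strings are invented, not the real flags of either machine.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def compare_flags(old_text, new_text):
    """Return (flags only on the old machine, flags only on the new machine)."""
    old, new = cpu_flags(old_text), cpu_flags(new_text)
    return sorted(old - new), sorted(new - old)


# Hypothetical sample data for illustration only.
OLD_CPUINFO = "model name\t: Example CPU A\nflags\t\t: fpu sse sse2 avx\n"
NEW_CPUINFO = "model name\t: Example CPU B\nflags\t\t: fpu sse sse2 avx avx2\n"

if __name__ == "__main__":
    only_old, only_new = compare_flags(OLD_CPUINFO, NEW_CPUINFO)
    print("only on old:", only_old)
    print("only on new:", only_new)
```

On a real machine the text would come from reading /proc/cpuinfo. Flags mostly reflect the CPU model, which is consistent with the answer that they cannot simply be changed.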
The other difference I found between the two sets of machines is that the new machines have only one NUMA node containing all 40 logical cores (20 physical cores, hyperthreaded), while the old machines have two NUMA nodes with 32 logical cores in total (16 physical cores, hyperthreaded). The new machines have two sockets too, so why only one NUMA node? I really suspected that was the whole reason, and started looking into how to add another NUMA node. Unfortunately there was little information I could find online on that topic. Then my coworker told me about a tool that does exactly what I was testing, only without the I/O. I ran that tool and got better performance on the new machines than on the old ones, which ruled out a CPU performance issue.
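One way to check the NUMA layout described above is to read it from sysfs. The sketch below assumes a Linux system that exposes /sys/devices/system/node/nodeN/cpulist; the helper names are mine, and on a system without that directory the result is simply empty.

```python
import glob
import os


def parse_cpulist(text):
    """Expand a kernel CPU list such as '0-3,8-11' into a list of CPU numbers."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        cpus.extend(range(int(first), int(last or first) + 1))
    return cpus


def numa_layout(root="/sys/devices/system/node"):
    """Map each NUMA node name to its CPUs; empty dict if no node info is found."""
    layout = {}
    for path in sorted(glob.glob(os.path.join(root, "node[0-9]*", "cpulist"))):
        node = os.path.basename(os.path.dirname(path))
        with open(path) as handle:
            layout[node] = parse_cpulist(handle.read())
    return layout


if __name__ == "__main__":
    for node, cpus in numa_layout().items():
        print(node, len(cpus), "CPUs:", cpus)
```

A two-socket machine that reports a single node here may have NUMA disabled or node interleaving enabled in the firmware settings, but that is only a guess; nothing in this investigation confirmed it.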
Now I had the idea that I should do some benchmarking, which should have been the first thing I did.
The network performance tool I picked was iperf, because it can test performance with multiple parallel streams, which is how our engine works. The results showed exactly the same network performance on the two sets of machines.
Then I started benchmarking everything else with sysbench. It's a wonderful tool because 1) it can test multiple things, including CPU, memory, threads and file I/O, and 2) it supports multithreading as well.
So I tested all of those. On the new machines the results showed slightly slower multithreaded file I/O, CPU and memory, but better threads and single-thread file I/O. Since our internal tool showed better performance on the new machines without the I/O, I wasn't too concerned about the slightly slower CPU and memory. But then what was the issue?
I turned to the SystemTap scripts our IT guy had thrown at me and found nettop.stp. To use the script I had to install the debug package matching the running Linux kernel, which I did on both sets of machines; that is when I found out the kernel on the old machine was a slightly newer version. A big difference I found is that on the new machines ksoftirqd was doing all of the packet receiving, while our engine was doing the transmitting. On the old machines our engine was also doing the transmitting, but it also did some of the receiving, and the rest of the receiving was done by a process called swapper (with process ID 0).
I had no idea why that was. So I started suspecting the newer kernel version on the old machine. There I was again, quickly filing a ticket asking the IT guy to update the kernel on the new machines to the same version. But then I saw that this was actually the only machine in the whole set of old machines that had the newer kernel. All the others have the same version as the new machines, and I got the same performance from those other old machines as well. OK, that's probably not it. Apologies again, Tod.
Still looking, I found another possible culprit - irqbalance. It was not even installed on the new machines!
So the kernel itself has an irq balancer, but it is not intelligent enough to distribute the workload across the cores effectively for the best performance. irqbalance could help, so I installed it on the new machines. But I didn't see much performance improvement.
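To see how the network card's interrupts are actually spread over the cores, with or without irqbalance, the counters in /proc/interrupts can be summed per CPU. This is a minimal sketch, assuming the NIC queues appear in the last column under a name containing the interface name (for example "eth0-TxRx-0", which is a hypothetical name here). The sample table is made up.

```python
def irq_totals_per_cpu(interrupts_text, device):
    """Sum per-CPU interrupt counts over all IRQ lines that mention `device`."""
    lines = interrupts_text.splitlines()
    if not lines:
        return []
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = [0] * ncpus
    for line in lines[1:]:
        _label, sep, rest = line.partition(":")
        fields = rest.split()
        # Everything after the per-CPU counters describes the IRQ source.
        if not sep or device not in " ".join(fields[ncpus:]):
            continue
        counts = fields[:ncpus]
        if len(counts) == ncpus and all(c.isdigit() for c in counts):
            for cpu, count in enumerate(counts):
                totals[cpu] += int(count)
    return totals


# Hypothetical sample in the layout of /proc/interrupts, not real measurements.
SAMPLE = """           CPU0       CPU1       CPU2       CPU3
  0:         46          0          0          0   IO-APIC-edge      timer
 64:       1000          0          0          0   PCI-MSI-edge      eth0-TxRx-0
 65:          0       2000          0          0   PCI-MSI-edge      eth0-TxRx-1
ERR:          0
"""

if __name__ == "__main__":
    print(irq_totals_per_cpu(SAMPLE, "eth0"))
```

Taking two readings a few seconds apart and subtracting them shows where interrupts are landing right now, rather than since boot. If nearly all of the difference sits on one or two CPUs, the load is not being distributed.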
Then I turned it off on the old machines - performance dropped by half! I started it again, and performance didn't come back!!!
Now I was super confused. Maybe there was some configuration I needed to do for irqbalance? I had a 1-1 with my manager, who suggested I ask the performance engineer, who might know more about the configuration of those old machines and could give me a quick answer.
A quick asking-around told me nobody had touched any performance tuning on those old machines. So I was left with no clue again... I was pretty tired and decided to take a break.
The next morning I came in and started looking at all the configurations. I found I had tricked myself by not setting the rx-flow-hash on the old machines again after I restarted them, and that was the whole reason why the performance didn't come back after I started irqbalance.
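A mistake like this is easy to repeat, so a quick check after every restart can help. The sketch below makes several assumptions that the investigation above does not confirm: that the traffic of interest is UDP over IPv4 (flow type "udp4"), that the desired setting hashes on the source and destination ports, that `ethtool -n <interface> rx-flow-hash udp4` is supported by the driver, and that Python 3.7 or later is available. The sample output shows the format I would expect from ethtool, but the exact wording may vary between versions.

```python
import subprocess


def hashes_on_ports(ethtool_output):
    """True if the reported hash fields include both L4 source and destination ports."""
    text = ethtool_output.lower()
    return "src port" in text and "dst port" in text


def rx_flow_hash(interface, flow_type="udp4"):
    """Run `ethtool -n <interface> rx-flow-hash <flow_type>` and return its output."""
    result = subprocess.run(
        ["ethtool", "-n", interface, "rx-flow-hash", flow_type],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Illustrative output only; not captured from the machines in this post.
SAMPLE = """UDP over IPV4 flows use these fields for computing Hash flow key:
IP SA
IP DA
L4 bytes 0 & 1 [TCP/UDP src port]
L4 bytes 2 & 3 [TCP/UDP dst port]
"""

# Usage on a real machine ("eth0" is a placeholder interface name):
#     if not hashes_on_ports(rx_flow_hash("eth0")):
#         print("rx-flow-hash is not hashing on ports; re-apply the setting")
```

Settings applied with ethtool generally do not survive a reboot unless they are re-applied from a startup script, which matches what happened here.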
So now both sets of machines are functioning, with good performance. I cannot say best performance, but it is to my satisfaction at least. Whew~