It's often surprising just how much software performance depends on how the software is deployed. All the time and effort you've invested in optimization can be erased by a few bad decisions in scheduler policy, affinity, or background workload on a server.
So here are a few things I check for when an app's performance is unexpectedly bad. These are things that should apply to any OS running on a modern server, but the specific tools I'll mention are for Linux.
Are you running the binary you think you are?
It's funny how often seemingly bizarre problems have simple explanations. So I start by checking that I'm really running the right binary. I use md5sum to get a hash:
md5sum /path/to/your/executable
Then I verify that the hash matches the md5sum of the app I was trying to deploy. On Linux, you can use the same trick to check a running binary via the proc filesystem if you know the process's PID:
md5sum /proc/$PID/exe
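Comparing two hashes by eye is error-prone, so the check can be wrapped in a small helper. This is only a sketch: same_hash is a made-up name, the paths and PID are placeholders, and it assumes md5sum is on your PATH.

```shell
# same_hash is a hypothetical helper, not a standard tool.
# It succeeds (exit 0) only when both files have the same MD5 hash,
# and returns 2 if either file cannot be read.
same_hash() {
    hash_a=$(md5sum < "$1") || return 2
    hash_b=$(md5sum < "$2") || return 2
    [ "$hash_a" = "$hash_b" ]
}

# Usage (path and PID are placeholders):
#   same_hash /path/to/your/executable "/proc/$PID/exe" && echo "same binary"
```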
Are the dynamic libraries in use the same?
Sometimes the same app performs differently because the dynamic libraries it loads are not the ones you'd expect. The ldd tool can tell you which libraries will be linked at startup time:
ldd /path/to/your/executable 
# or 
ldd /proc/$PID/exe
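Raw ldd output includes load addresses, which typically change from run to run and make comparisons noisy. The sketch below reduces it to library name and resolved path, so the results from two deployments can be compared with diff. The name resolved_libs is hypothetical, and the parsing assumes the usual "name => path (address)" layout.

```shell
# resolved_libs is a hypothetical helper: it reads `ldd` output on stdin and
# prints sorted "name path" pairs. Lines without "=>" (such as the vDSO and
# the dynamic loader) are skipped; unresolved libraries show as NOT-FOUND.
resolved_libs() {
    awk '$2 == "=>" { print $1, ($3 == "not" ? "NOT-FOUND" : $3) }' | sort
}

# Usage (path is a placeholder):
#   ldd /path/to/your/executable | resolved_libs
```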
Did you affinitize the process?
With Linux, you can restrict the cores that a process runs on. That can be a benefit because it helps keep the process's data warm in the processor's cache. For a single-threaded app, affinitizing to a single core might be the right choice, but a busy multi-threaded app may require multiple cores.
You can see which cores a process is allowed to run on via taskset -p $PID. The same tool can also change which cores a process runs on.
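The kernel also reports a process's allowed cores in the Cpus_allowed_list field of /proc/$PID/status, which is handy on machines where taskset isn't installed. Here's a rough sketch: allowed_cpus is a made-up helper name, and the taskset lines in the comments use its -c (CPU list) and -p (PID) options, with ./app standing in for your program.

```shell
# allowed_cpus is a hypothetical helper: it reads the contents of a
# /proc/<pid>/status file on stdin and prints the allowed CPU list.
allowed_cpus() {
    awk '/^Cpus_allowed_list:/ { print $2 }'
}

# Usage (PID and ./app are placeholders):
#   allowed_cpus < "/proc/$PID/status"   # read the current affinity
#   taskset -c 2 ./app                   # start a program pinned to core 2
#   taskset -cp 0,2 "$PID"               # re-pin a running process to cores 0 and 2
```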
Don't forget about NUMA effects
Modern servers use NUMA (non-uniform memory access), which means that latency and throughput to RAM, disk, or the network depend on which core an application is running on. Though the penalty is small for each operation (in the range of hundreds of nanoseconds), when aggregated across an application the effect can be noticeable.
Keep each application close to the things it uses. If an application uses the network, then affinitize the application to a core that's on the same NUMA node as the network adapter that it's using.
On Linux, you can see the topology of your hardware using numactl -H. Here's sample output:
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14
node 0 size: 65442 MB
node 0 free: 63882 MB
node 1 cpus: 1 3 5 7 9 11 13 15
node 1 size: 65536 MB
node 1 free: 63515 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 
The output tells you that there are 2 nodes, each with 64 GB of RAM and 8 cores.
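If you want to pin a process to every core on one node, the cpus line can be turned into the comma-separated list that taskset -c accepts. The following sketch parses output in the format shown above; node_cpus is a hypothetical helper and ./app is a placeholder.

```shell
# node_cpus is a hypothetical helper: it reads `numactl -H` output on stdin
# and prints the CPUs of NUMA node $1 as a comma-separated list.
node_cpus() {
    awk -v node="$1" '
        $1 == "node" && $2 == node && $3 == "cpus:" {
            for (i = 4; i <= NF; i++)
                printf "%s%s", $i, (i < NF ? "," : "\n")
        }'
}

# Usage (./app is a placeholder):
#   taskset -c "$(numactl -H | node_cpus 0)" ./app
```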
What about other processes?
Just because you affinitized your app to a specific core doesn't mean that other apps won't also use that core. So once you start affinitizing one app, you'll want to affinitize the other apps on the server as well.
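One way to see who else is on a core is the psr column of ps, which reports the processor each process is currently assigned to. It's only a snapshot, so it's worth running a few times. A sketch, with on_cpu as a hypothetical filter:

```shell
# on_cpu is a hypothetical filter: it reads `ps -eo pid,psr,comm` output on
# stdin and keeps the header plus the rows whose PSR column equals $1.
on_cpu() {
    awk -v cpu="$1" 'NR == 1 || $2 == cpu'
}

# Usage: list the processes currently assigned to core 3
#   ps -eo pid,psr,comm | on_cpu 3
```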
For a while now, the Linux kernel has had a command line option to reserve cores from boot time: isolcpus. For instance, booting Linux with the kernel parameter isolcpus=1,3-5 tells the kernel that, by default, no process should be scheduled on cores 1, 3, 4 and 5. However, we, as well as others, have found that isolcpus can lead to unintended behavior where load is concentrated rather than spread across cores, so we don't use it.
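If you inherit a machine and aren't sure whether isolcpus is in effect, the boot parameters are visible in /proc/cmdline. A minimal sketch (isolcpus_setting is a made-up helper name):

```shell
# isolcpus_setting is a hypothetical helper: it reads a kernel command line
# on stdin and prints the value of isolcpus=, or nothing if it is not set.
isolcpus_setting() {
    tr ' ' '\n' | sed -n 's/^isolcpus=//p'
}

# Usage:
#   isolcpus_setting < /proc/cmdline
```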
Affinity and other hardware
If an app uses a lot of peripherals (e.g. network or storage), make sure the app is affinitized to the same NUMA node as the peripheral.
To check the NUMA node of an ethernet device, you can use sysfs:
cat /sys/class/net/$ETH/device/numa_node
The Linux tool hwloc-ls will also tell you how system components map to NUMA nodes.
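The sysfs lookup can feed straight into a launch command. In the sketch below, nic_numa_node and the SYSFS_ROOT override are names I made up for illustration, and ./app is a placeholder. The kernel reports -1 when it has no NUMA information for the device, so the helper treats that as a failure rather than passing it along.

```shell
# nic_numa_node is a hypothetical helper: it prints the NUMA node of network
# interface $1, or fails if the node is unknown (-1) or cannot be read.
# SYSFS_ROOT is an assumed override for dry runs; it defaults to /sys.
nic_numa_node() {
    node=$(cat "${SYSFS_ROOT:-/sys}/class/net/$1/device/numa_node") || return 1
    [ "$node" -ge 0 ] || return 1
    echo "$node"
}

# Usage (ETH and ./app are placeholders): run the app with CPU and memory
# bound to the adapter's node.
#   node=$(nic_numa_node "$ETH") &&
#       numactl --cpunodebind="$node" --membind="$node" ./app
```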
Machine setup
Sometimes the problem isn't how the software is deployed: the performance difference comes from the machine itself, because either its hardware or its software setup is not quite what you'd expect.
Performance on a virtual machine is often quite a bit worse than on a physical machine. You can check if a machine is virtual by looking for the hypervisor flag in /proc/cpuinfo:
grep -q '^flags.* hypervisor.*' /proc/cpuinfo && echo this is a VM
Is this the software you expect?
For starters, you can figure out the version of the Linux kernel on a machine with uname -a. Different kernels can behave very differently on the same workload.
You can also use your OS package manager to list all the packages and versions. Often I'll run the same command on two servers and diff the output:
function hdiff () { diff -u <(ssh "$1" "$3") <(ssh "$2" "$3"); }
You can use this to diff the software installed on two hosts:
hdiff $host1 $host2 "dpkg -l"
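The full dpkg -l listing includes descriptions and column padding that can add noise to a diff. As a sketch, a small filter can cut it down to installed package names and versions first. The name pkg_versions and the output file names are hypothetical, and the filter assumes the usual dpkg -l layout where installed packages start with "ii".

```shell
# pkg_versions is a hypothetical filter: it reads `dpkg -l` output on stdin
# and prints "name version" for each fully installed package (status "ii").
pkg_versions() {
    awk '$1 == "ii" { print $2, $3 }'
}

# Usage (host names and file names are placeholders):
#   ssh "$host1" dpkg -l | pkg_versions > host1.pkgs
#   ssh "$host2" dpkg -l | pkg_versions > host2.pkgs
#   diff -u host1.pkgs host2.pkgs
```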
Is the hardware what you expect?
As a first step, check the processor model and speed via cat /proc/cpuinfo.
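Since /proc/cpuinfo repeats a similar block for every logical CPU, it helps to collapse it. This sketch assumes the x86 layout, where each block has a "model name" line; cpu_models is a hypothetical helper.

```shell
# cpu_models is a hypothetical helper: it reads /proc/cpuinfo content on
# stdin and prints each distinct model name with the number of logical CPUs
# reporting it.
cpu_models() {
    sed -n 's/^model name[[:space:]]*: //p' | sort | uniq -c
}

# Usage:
#   cpu_models < /proc/cpuinfo
```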
DMI can tell you many things about the hardware you're dealing with:
hdiff $host1 $host2 "sudo /usr/sbin/dmidecode"
The output of dmidecode is huge and very detailed. One thing to pay particular attention to is the version of the BIOS:
BIOS Information
    Vendor: Computers Inc.
    Version: 1.5.1
    Release Date: 06/23/2012
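When you only care about the BIOS version, it can be pulled out of the full dump. The sketch below assumes dmidecode's usual layout, where sections are separated by blank lines; bios_version is a made-up helper name. dmidecode can also print single fields itself with -s, for example dmidecode -s bios-version.

```shell
# bios_version is a hypothetical helper: it reads dmidecode output on stdin
# and prints the Version field of the "BIOS Information" section only.
bios_version() {
    awk '
        /^BIOS Information/ { in_bios = 1; next }
        /^$/                { in_bios = 0 }
        in_bios && $1 == "Version:" {
            sub(/^[ \t]*Version:[ \t]*/, "")
            print
        }'
}

# Usage:
#   sudo /usr/sbin/dmidecode | bios_version
```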
Finally, when dealing with the unexpected, it never hurts to check whether the server you're running on has been rebooted recently enough:
$ uptime
13:16:41 up 300 days, 9:21, 1 user, load average: 0.00, 0.00, 0.00
Three hundred days is way too long.
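For scripted checks, /proc/uptime is easier to parse than the uptime banner: its first field is the number of seconds since boot. A sketch, where uptime_days is a hypothetical helper and the threshold you compare against is up to you:

```shell
# uptime_days is a hypothetical helper: it reads /proc/uptime content on
# stdin and prints the whole number of days since boot.
uptime_days() {
    awk '{ print int($1 / 86400) }'
}

# Usage:
#   uptime_days < /proc/uptime
```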
Summary
Advances in server architecture have led to spectacular performance gains, but often the gains are only realized when apps are tuned properly. This post only scratches the surface of the issues in performance tuning. Still, I've found these tools useful and I hope you will too.
By the way, if you enjoy solving these sorts of problems, Jane Street is hiring!