How is the await field in iostat calculated?
PART 1: 15 seconds of await
One of our customers was running some third party monitoring software, which was reporting very occasional spikes of many seconds (6-15 seconds)worth of await on their local disk. Looking at the datadog code, we can see that the python is really just running iostat and capturing the output:
if Platform.is_linux(): stdout, _, _ = get_subprocess_output(['iostat', '-d', '1', '2', '-x', '-k'], self.logger)
From the man page, the options iostat is running with are:
-x Display extended statistics. This option works with post 2.5 kernels since it needs /proc/diskstats file or a mounted sysfs to get the statistics. This option may also work with older kernels (e.g. 2.4) only if extended statistics are available in /proc/partitions (the kernel needs to be patched for that).
-k Display statistics in kilobytes per second instead of blocks per second. Data displayed are valid only with kernels 2.4 and later.
-d Display the device utilization report.
The '1' and '2' will mean that iostat runs every one second, and will return two results before exiting. iostat with the -x flag will read from /proc/diskstats by default (it says it can also use sysfs, but a look at the code shows that /proc/diskstats is preferred when available). /proc/diskstats is a file containing a set of incrementing counters with various disk statistics. The first set of output from iostat will be statistics since the system was booted (as per the man page), and the second result will be statistics collected during the interval since the previous report. This will give us output like this:
# iostat -d -k -x 1 2 Linux 2.6.32-642.6.2.el6.x86_64 (linux-test2) 01/26/2017 _x86_64_ (1 CPU) Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util xvdb 0.01 0.02 0.00 0.00 0.05 0.09 69.08 0.00 31.58 3.05 53.63 0.74 0.00 xvda 0.00 1.40 0.05 1.22 1.10 10.49 18.18 0.00 0.66 0.62 0.67 0.13 0.02 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util xvdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 xvda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
The first thing to do to try and investigate the issue was to try and replicate the results using iostat ourselves. We ran the following in a while loop:
# while true; do date >> /var/log/iostat.log ; iostat -d 1 2 -x -k >> /var/log/iostat.log; sleep 1; done
After running that for a while, we calculated the min, max and average values that iostat logged for await:
# cat /var/log/iostat.log | grep "dm-2" | awk '{print $10}' | sort -V | awk 'NR == 1 { max=$1; min=$1; sum=0 } { if ($1>max) max=$1; if ($1<min) min=$1; sum+=$1;} END {printf "Min: %d\tMax: %d\tAverage: %f\n", min, max, sum/NR}'
The results looked something like this:
Min: 0 Max: 15506 Average: 1.862469
Suspiciously the max was indeed showing 15506ms. Given that iostat is running every second, 15000ms (15 seconds!) of await is logically impossible. But in case the customer pressed further, the next question is what actually is await? A quick google around the internet doesn't show us exactly how it's calculated, so that means we need to go to the source.
PART 2: Obtaining the source code
The iostat program is part of a package included in the 'sysstat' package. The customer was running sysstat-9.0.4-27.el6.x86_64. The first thing to do was to download the source RPM for this version; it's not sufficient to browse the latest code repository online since software can change significantly from version to version. We found the SRPM on the red hat website and downloaded it on a test machine, and installed it.
# wget http://ftp.redhat.com/pub/redhat/linux/enterprise/6Server/en/os/SRPMS/sysstat-9.0.4-27.el6.src.rpm # rpm -Uvh sysstat-9.0.4-27.el6.src.rpm # cd rpmbuild
Looking in the SOURCES directory, there is some code and a bunch of .patch files. If we wanted to look at the source for the program as it was installed on disk, we would need to apply the patch files to the original source. We can do this with rpmbuild. From the relevant section of the rpmbuild man page:
-bp Executes the "%prep" stage from the spec file. Normally this involves unpacking the sources and applying any patches.
Before we can apply the patch files, we need to install some dependencies. We got an error for missing dependencies when trying to build the package initially which told us which dependencies were missing - gettext and if.h (which a yum whatprovides shows is provided by the gettext and kernel-devel package respectively):
# yum install kernel-devel gettext
Finally we can run rpmbuild.
# rpmbuild -bp SPECS/sysstat.spec
Once this finished, we can find the patched source for the version of sysstat we care about in the ~/rpmbuild/BUILD/sysstat-9.0.4 directory. From the naming of the files, the one we want is iostat.c.
PART 3: Reading the source
Since this is quite a short piece of code (~2000 lines), the first step is to go through and read the comments, functions and variable naming to get a broad picture for what we're looking at. Initially the first thing that sticks out is the write_ext_stat function - the comment above it reads this:
/* *************************************************************************** * Display extended stats, read from /proc/{diskstats,partitions} or /sys. * * IN: * @curr Index in array for current sample statistics. * @itv Interval of time. * @fctr Conversion factor. * @shi Structures describing the devices and partitions. * @ioi Current sample statistics. * @ioj Previous sample statistics. *************************************************************************** */ void write_ext_stat(int curr, unsigned long long itv, int fctr, struct io_hdr_stats *shi, struct io_stats *ioi, struct io_stats *ioj)
That looks like what we want. Looking at the code, we can see pretty clearly where the iostats output is printed when displaying extended stats:
/* DEV rrq/s wrq/s r/s w/s rsec wsec rqsz qusz await svctm %util */ printf("%-13s %8.2f %8.2f %7.2f %7.2f %8.2f %8.2f %8.2f %8.2f %7.2f %6.2f %6.2f\n", devname, S_VALUE(ioj->rd_merges, ioi->rd_merges, itv), S_VALUE(ioj->wr_merges, ioi->wr_merges, itv), S_VALUE(ioj->rd_ios, ioi->rd_ios, itv), S_VALUE(ioj->wr_ios, ioi->wr_ios, itv), ll_s_value(ioj->rd_sectors, ioi->rd_sectors, itv) / fctr, ll_s_value(ioj->wr_sectors, ioi->wr_sectors, itv) / fctr, xds.arqsz, S_VALUE(ioj->rq_ticks, ioi->rq_ticks, itv) / 1000.0, xds.await, /* The ticks output is biased to output 1000 ticks per second */ xds.svctm, /* Again: Ticks in milliseconds */ xds.util / 10.0);
The part we care about is the await part, which is the third column from the right. As above, we can see that what's printed for the await column is "xds.await". Looking a bit further up the function, we can see that xds is likely set here:
compute_ext_disk_stats(&sdc, &sdp, itv, &xds);
The iostat.c file doesn't contain this function, but a quick grep over the source directory shows that the definition is present in common.c:
/* *************************************************************************** * Compute "extended" device statistics (service time, etc.). * * IN: * @sdc Structure with current device statistics. * @sdp Structure with previous device statistics. * @itv Interval of time in jiffies. * * OUT: * @xds Structure with extended statistics. *************************************************************************** */ void compute_ext_disk_stats(struct stats_disk *sdc, struct stats_disk *sdp, unsigned long long itv, struct ext_disk_stats *xds)
Within this function, we can see that await is set based on some arithmetic using various variables, like this:
/* * Kernel gives ticks already in milliseconds for all platforms * => no need for further scaling. */ xds->await = (sdc->nr_ios - sdp->nr_ios) ? ((sdc->rd_ticks - sdp->rd_ticks) + (sdc->wr_ticks - sdp->wr_ticks)) / ((double) (sdc->nr_ios - sdp->nr_ios)) : 0.0;
From the above, it's important for us to know a few pieces of information:
The sdc and sdp questions can be answered by the comment above the compute_ext_disk_stats function:
* @sdc Structure with current device statistics. * @sdp Structure with previous device statistics.
The next question - what nr_ios actually is - can be obtained back in the write_ext_stat function. There, we have some lines of code which look like this:
sdc.nr_ios = ioi->rd_ios + ioi->wr_ios; sdp.nr_ios = ioj->rd_ios + ioj->wr_ios;
It appears then that nr_ios is the sum of rd_ios and wr_ios.
The remainder of the information can be obtained back in iostat.c. We know from the man page that extended disk stats rely on the /proc/diskstats file. A quick search for this shows a function called read_diskstats_stat:
/* *************************************************************************** * Read stats from /proc/diskstats. * * IN: * @curr Index in array for current sample statistics. *************************************************************************** */ void read_diskstats_stat(int curr)
This function will parse the contents of /proc/diskstats. The important part of this function:
/* major minor name rio rmerge rsect ruse wio wmerge wsect wuse running use aveq */ i = sscanf(line, "%u %u %s %lu %lu %llu %u %lu %lu %llu %u %u %u %u", &major, &minor, dev_name, &rd_ios, &rd_merges_or_rd_sec, &rd_sec_or_wr_ios, &rd_ticks_or_wr_sec, &wr_ios, &wr_merges, &wr_sec, &wr_ticks, &ios_pgr, &tot_ticks, &rq_ticks); ... sdev.rd_ticks = rd_ticks_or_wr_sec;
Documentation on sscanf and % format specifiers is here:
https://linux.die.net/man/3/sscanf https://www.le.ac.uk/users/rjm1/cotter/page_30.htm
From the sscanf man page:
The scanf() family of functions scans input according to format as described below. This format may contain conversion specifications; the results from such conversions, if any, are stored in the locations pointed to by the pointer arguments that follow format.
So the first field of each line is an int and will be read into &major, the second field of each line is an int and will be read into &minor, the third field of each line is a char array and will be read into dev_name, and so on.
From the above, now we know that rd_ios, wr_ios, rd_ticks and wr_ticks directly correlate to fields in /proc/diskstats. We can get a plain English description of what each field in the /proc/diskstats file means by reading the kernel documentation, here: https://www.kernel.org/doc/Documentation/ABI/testing/procfs-diskstats
From cross referencing the variable names above against the fields in the kernel documentation, we can get a plaintext description of what these variables actually are.
rd_ios = field 4. In English, "reads completed successfully"
rd_ticks = field 7. In English, "time spent reading (ms)"
wr_ios = field 8. In English, "writes completed"
wr_ticks = field 11. In English, "time spent writing (ms)"
Now we know all of the information we need to work out what await actually is. A summary:
sdc - Structure with current device statistics
sdp - Structure with previous device statistics
nr_ios - The sum of reads and writes completed successfully (fields 4 and 8 in /proc/diskstats)
rd_ticks - Time spent reading in ms (field 7 in /proc/diskstats)
wr_ticks - Time spent writing in ms (field 11 in /proc/diskstats)
Back to the await calculation, we can see that it's set to this:
xds->await = (sdc->nr_ios - sdp->nr_ios) ? ((sdc->rd_ticks - sdp->rd_ticks) + (sdc->wr_ticks - sdp->wr_ticks)) / ((double) (sdc->nr_ios - sdp->nr_ios)) : 0.0;
One last bit of key information: in C, what you're looking at here with the "?" is a conditional operator (sometimes called a ternary operator). It evaluates to its second argument if the first argument evaluates to true. It evaluates to its third argument (the bit after the :) if the first argument evaluates to false. In English, if there is no difference between the current count for number of I/Os and the previous counter it means there was no disk activity, so set await to 0.0. Otherwise, set await to:
((sdc->rd_ticks - sdp->rd_ticks) + (sdc->wr_ticks - sdp->wr_ticks)) / ((double) (sdc->nr_ios - sdp->nr_ios))
Rewriting that in English:
await = (The difference between the time spent reading in ms plus the time spent writing in milliseconds), divided by (the number of writes completed plus the number of reads completed)
await is the time performing read/write operations divided by the number of read/write operations performed.
We know that in the 1000ms between iostat runs, there can not possibly be more than 1000ms worth of reading/writing done. If iostat is sampling 1 second apart, it's impossible for there to be a 15,000ms await. iostat is returning incorrect results. The customer should upgrade to a later version of sysstat, and if the bug persists then we should try and reproduce it, and then open a support case with Red Hat to get it fixed.