Disk I/O bottlenecks are easy to overlook when analyzing CI pipeline performance, but tools like iostat and fio can help shed light on what might be slowing down your pipelines more than you realize.
GitHub offers different hosted runners with a range of specs, but for this test we are using the default ubuntu-22.04 runner in a private repository, which gives us an additional 2 vCPUs but does not alter the disk performance.
How to monitor disk performance
Getting a baseline benchmark from a tool like fio is useful for comparing the relative disk performance of different runners. However, to investigate whether you are hitting disk I/O bottlenecks in your CI pipeline, it is more useful to monitor disk performance during pipeline execution.
We can use a tool like iostat to monitor the disk while installing dependencies from the cache and see how close we come to saturating it.
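As a minimal sketch, you can start iostat in the background, run the step you care about, then stop the collector and inspect the log afterward. The device name, file names, and the tar command here are assumptions to adapt to your own workflow:

```bash
# Install sysstat if the runner image doesn't already include it.
sudo apt-get install -y sysstat

# Collect extended, per-device stats (in MB) once per second in the background.
# "sda" is an assumption; run `lsblk` to find the disk backing your workspace.
iostat -dxm 1 sda > iostat.log &
IOSTAT_PID=$!

# The step being measured -- e.g. extracting a restored dependency cache
# (cache.tgz is a hypothetical file name).
tar -xzf cache.tgz

# Stop the collector and look at the log. The columns to watch are
# r/s and w/s (IOPS), rMB/s and wMB/s (throughput), and %util.
kill "$IOSTAT_PID"
```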
Monitoring disk during untar of Next.js dependencies
In the above test, we used iostat to monitor disk performance while the cache action downloaded and untarred the dependencies for vercel/next.js:
The full step took 12s to complete, and we can estimate the download took around 3s, leaving 9s for the untar operation.
The compressed tarball is only about 328MB, but after extraction, the total amount of data written to the disk is about 1.6GB. That smaller size got our cache across the network plenty fast, and most CPUs can handle decompression fast enough, meaning higher compression is often favorable. Once download and decompression are no longer the bottleneck, that leaves writing to disk.
Reading from a tarball is fairly efficient since it's mostly sequential reads; however, we then need to write each file to disk individually. This is where we can hit disk I/O bottlenecks, especially with a large number of small files.
It’s important to note that this is just a single run, not an average. Running multiple tests over time will give you a much clearer picture of the overall performance. Variance between runs can be quite high, so an individual bad run doesn’t necessarily indicate a problem.
What this run suggests is a possible throughput bottleneck. We're seeing spikes in the maximum total throughput, with most hovering around ~220MB/s. This is likely the maximum throughput we can achieve to this disk; we'll verify that next. We should continue to monitor this and compare it to other runners to see if we can find an ideal runner for our workflow. We'll use fio to double-check whether we are hitting the disk's maximum throughput.
An interesting aside before we move on: from this side-by-side we can see how few read operations there are relative to writes. Since we're reading from a tarball, most reads are sequential, which tends to be more efficient. That read data is likely buffered before being written to disk in a more random pattern as each file is created, which is why we see higher write IOPS than read IOPS.
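If you're logging iostat output as sketched above, a quick way to eyeball that read/write imbalance is to average the r/s and w/s columns. This assumes the header format of a recent sysstat release and the same hypothetical device name as before:

```bash
# Average read vs write IOPS for one device from an `iostat -dxm 1` log.
awk '
  # Remember which column each header name lives in.
  $1 == "Device" { for (i = 1; i <= NF; i++) col[$i] = i; next }
  # Accumulate the per-interval read and write IOPS for the device.
  $1 == "sda" && ("r/s" in col) {
    reads  += $(col["r/s"])
    writes += $(col["w/s"])
    n++
  }
  END { if (n) printf "avg read IOPS: %.0f   avg write IOPS: %.0f\n", reads / n, writes / n }
' iostat.log
```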
Maximum disk throughput
One of the first optimizations developers usually make to their CI pipelines is caching dependencies. Even though the cache still gets uploaded and downloaded with each run, it speeds things up by packaging all your dependencies into one compressed file. This skips the hassle of resolving dependencies, avoids multiple potentially slow downloads, and cuts down on network delays.
But as we saw above, network speed isn't usually our bottleneck when downloading the cache.
| Test Type | Block Size | Bandwidth |
| --- | --- | --- |
| Read Throughput | 1024KiB | ~209MB/s |
| Write Throughput | 1024KiB | ~209MB/s |
Using fio to test our throughput, notice that read and write throughput are both capped at the same value. This is a fairly telling sign that the limitation here is not the disk itself, but rather a bandwidth limit imposed by GitHub. This is standard practice for dividing up resources among multiple users who may be accessing the same physical disk from their virtual machines. It isn't always documented, but most providers have higher bandwidth limits on higher-tier runners.
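For reference, sequential fio jobs along these lines produce this kind of measurement. The job names, file size, queue depth, and runtime below are illustrative assumptions, not the exact parameters behind the table above:

```bash
# Sequential read and write bandwidth with 1MiB (1024KiB) blocks.
fio --name=seq-read  --rw=read  --bs=1M --size=2G --ioengine=libaio --direct=1 \
    --iodepth=16 --runtime=30 --time_based --filename=fio-testfile
fio --name=seq-write --rw=write --bs=1M --size=2G --ioengine=libaio --direct=1 \
    --iodepth=16 --runtime=30 --time_based --filename=fio-testfile
rm -f fio-testfile
```

The --direct=1 flag bypasses the page cache, so the numbers reflect the disk (and any imposed limit) rather than memory.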
What we measured here aligns fairly closely with the 220MB/s we saw in the untar test, giving us another hint that we are likely being slowed down during our dependency installation, not by the network or CPU, but by the disk.
Regardless of how fast our download speed is, we won't be able to write to disk any faster than our max throughput to the disk.
As a rough estimate, you can divide the extracted size of your cache by the disk's maximum throughput to get a lower bound on how long writing it out will take.
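As a back-of-the-envelope check using this run's numbers (~1.6GB written against a ~209MB/s cap):

```bash
# Lower bound on time to write the extracted cache: size / max throughput.
awk 'BEGIN { printf "%.1f seconds\n", 1600 / 209 }'   # ~7.7 seconds
```

That is roughly consistent with the ~9s we estimated for the untar step once some decompression and filesystem overhead is added.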
Realistically, your disk performance will vary greatly depending on your specific cache size, the number of files, and just general build-to-build variance. That's why it's a good idea to monitor your CI runners for a consistent baseline, and we'll talk about testing your workflow on multiple runners for comparison.
Maximum IOPS (Input/Output Operations Per Second)
After downloading the cache tarball, it needs to be extracted. Depending on the compression level, this can be a CPU-intensive operation, but that isn't usually a problem. When untarring the dependencies, we perform a lot of small read and write operations, which is where we can hit disk I/O bottlenecks.
| Test Type | Block Size | IOPS |
| --- | --- | --- |
| Read IOPS | 4096B | ~51K |
| Write IOPS | 4096B | ~57K |
| Random Read IOPS | 4096B | ~9370 |
| Random Write IOPS | 4096B | ~3290 |
IOPS is a measure of how many read/write operations can be performed in a second. When we have a lot of small files, as we especially do with a node_modules directory, it is possible to saturate the IOPS limit of the disk (or the imposed limit) and hit a different kind of I/O bottleneck.
Similarly to how we can't write to the disk any faster than the bandwidth limit, there is a limit to how many IOPS we can perform on the disk.
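To check the IOPS ceiling the same way, a couple of small-block random fio jobs are enough. As before, the exact parameters are illustrative assumptions:

```bash
# Random read/write IOPS with 4KiB blocks -- the pattern that dominates
# when writing out thousands of small files like node_modules.
fio --name=rand-read  --rw=randread  --bs=4k --size=1G --ioengine=libaio --direct=1 \
    --iodepth=64 --runtime=30 --time_based --filename=fio-testfile
fio --name=rand-write --rw=randwrite --bs=4k --size=1G --ioengine=libaio --direct=1 \
    --iodepth=64 --runtime=30 --time_based --filename=fio-testfile
rm -f fio-testfile
```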
Running benchmarks on different runners
If you are seeing bottlenecks of any kind in your CI pipeline, the first step is to optimize around them with strategies like caching and parallelizing where possible. But you also need to know whether you are hitting the limits of the runner you are using. It's easy enough to add a matrix strategy to your workflow to run the same steps on multiple runners for a quick comparison of how they perform on different hardware.
To get a more detailed look at the specific disk performance of each runner, you can use the fio benchmarking tool we mentioned earlier. This will give you a better picture of each runner's disk characteristics and a reference point for checking for bottlenecks in your CI pipeline.
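One way to structure that is a small script run as a step inside the matrix job, so every runner produces the same set of fio results you can compare side by side. Everything here (file names, job parameters, JSON output) is a sketch rather than a prescribed setup:

```bash
#!/usr/bin/env bash
# disk-bench.sh -- run the same fio jobs on each runner in a matrix job
# and save the results for comparison (e.g. upload them as artifacts).
set -euo pipefail

for job in read:1M write:1M randread:4k randwrite:4k; do
  rw="${job%%:*}"; bs="${job##*:}"
  fio --name="$rw" --rw="$rw" --bs="$bs" --size=1G --ioengine=libaio --direct=1 \
      --iodepth=32 --runtime=20 --time_based --filename=fio-testfile \
      --output-format=json > "fio-$rw.json"
done
rm -f fio-testfile
```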
Ultra-fast disk I/O with Depot Ultra Runner
Depot is launching a new runner type with ultra-fast disk I/O, the Depot Ultra Runner. The Ultra Runner utilizes a large RAM disk cache and higher-powered CPUs to maximize performance in both high IOPS and high throughput scenarios.
Want to be notified when the Depot Ultra Runner is available? Subscribe to our changelog for all major updates.
Try comparing your current workflow on a Depot runner. Sign up for our 7-day free trial and compare your CI pipeline performance on Depot Runners with a matrix job.