Speeding up CI in AWS

One of our development teams highlighted that their build was taking too long to run. We obtained a nearly three-fold speed improvement, largely by moving to newer AWS instance types and allocating fewer Buildkite agents per CPU.

Envato use the excellent Buildkite to run integration tests and code deployments. As a bring-your-own-hardware platform, Buildkite offers us a lot of flexibility in where and how these tasks run.

This means that we’re able to analyse how a build uses its hardware resources and work out a better configuration.

The build in question is for the “Market Shopfront” product: a React & node.js application written in TypeScript, built with webpack, and tested using Jest and Cypress.

On-branch builds were taking between ten and twenty-five minutes. master builds, which also include a separate build of a production container and a deploy to a staging environment, were taking between fifteen and forty minutes.

A typical 'master' build, taking over 35 minutes.

Builds should take less than five minutes: any longer and waiting for a build becomes a reason to switch to something else, forcing an expensive context switch back once the build has finished. Worse, a consistently failing build can easily consume an entire day, especially if the failure is only reproducible in CI.

The efforts described below are one part of a larger project to improve this build’s performance and use what we learned to improve other builds at Envato.

Investigation

The first thing that stood out to me was the very high variance in build times. This hinted that either:

  • the build was relying on third party APIs with varying response times, or
  • the build’s performance was being affected by other builds stealing its resources

The first possibility was quickly ruled out: the parts of the build that talk to things on the internet or in AWS showed the same level of variance as other parts of the build that are entirely local.

We don’t (yet) have external instrumentation on the build nodes, so we ssh’d into them individually and used the sysstat toolkit to watch each instance’s performance. We found that CPU was almost entirely utilised, while memory, disk bandwidth, disk operations per second, and network throughput all still had a fair amount of headroom. The large variance in build times turned out to be caused by concurrent builds on the same node competing for that fully utilised CPU.
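For reference, the kind of spot checks we ran looked roughly like the following. This is a sketch only; the sampling intervals are arbitrary and other sysstat tools could be substituted:

    # CPU utilisation per core, sampled every 5 seconds
    mpstat -P ALL 5

    # Extended per-device disk statistics: IOPS, throughput, utilisation
    iostat -dx 5

    # Network throughput per interface
    sar -n DEV 5

Watching mpstat while several builds ran on the same node made the CPU saturation obvious, while iostat and sar showed disk and network sitting well below their limits.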

This confirmed what several people in the team already suspected: we needed more CPU.

Exploratory Research

We created dedicated spot fleets and Buildkite queues to run indicative tests of how different node configurations and instance classes affect build performance.

The existing configuration was c3.xlarge and m3.xlarge spot instances with one agent per AWS virtual CPU.
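Pointing builds at an experimental fleet like this is mostly a matter of tagging its agents with a dedicated queue and then targeting that queue from the pipeline’s steps via their agents block. The queue name below is illustrative, not the one we actually used:

    # On the experimental fleet's instances: tag the agents with a dedicated queue
    buildkite-agent start --tags "queue=ci-experiment-c5"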

We tried:

  • increasing the instance size from xlarge to 2xlarge
  • halving the number of agents per virtual CPU
  • moving to current generation c5 and m5 instances
  • using the newly released z1d instances with their high-frequency CPUs

We found that:

  • current generation instances provide a 50% speed increase over their previous generation counterparts
  • the difference between m5 and c5 instances was minimal
  • z1d instances provided a further 30% performance increase, but at double the cost
  • halving the number of agents per virtual CPU provided a performance increase
  • using smaller instance types meant steps more often needed to docker pull cached image layers, which unpredictably increased build times

These findings are indicative only, however: we took just one sample for each instance class.

Modern Instances

We isolated two steps from the build that were not network-dependent and were idempotent: the initial webpack build and the first set of unit tests. Each was run multiple times, using the avgtime utility, across a set of instance types:
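As a rough sketch, each measurement looked something like this. The image name and step commands are illustrative rather than our real pipeline commands, and we’re assuming avgtime’s -r (repetitions) and -q (quiet) options:

    # Repeat the webpack build step ten times and report the average and median wall time
    avgtime -r 10 -q docker run --rm shopfront-ci yarn build

    # Same again for the first set of unit tests
    avgtime -r 10 -q docker run --rm shopfront-ci yarn test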

This confirmed (at least, for these two steps) the indicative findings on the performance improvements offered by the newer instance types: c5 and m5 instances are approximately 50% faster than their older c3 counterparts for this type of work. It is also interesting that c5 and m5 instances are almost exactly as fast as each other for these steps, despite the c5’s reported 3.0GHz clock speed versus the m5’s 2.5GHz.

Virtual vs “Real” CPUs

AWS advertises its instances as having a certain number of “virtual CPUs”, or vCPUs. This can be misleading if you’re not already familiar with Intel’s Hyperthreading, where for every processor core that is physically present, two “logical” cores are made available to the operating system. AWS’ vCPUs map directly to logical cores, not physical ones.

Our instances were configured to run one agent per logical core, not per physical core. This meant that even single-threaded build steps could take up to twice as long to run if the instance’s CPU was fully taxed. This was originally a cost-saving measure, based on the assumption that most build steps would spend their time waiting on network resources or other tasks. For this build queue, that assumption proved to be incorrect.
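The distinction is easy to see on a running instance: nproc reports the logical CPU count that AWS sells as vCPUs, while lscpu shows the underlying physical topology.

    # Logical CPUs visible to the OS (what AWS counts as vCPUs)
    nproc

    # Physical topology: threads per core, cores per socket, sockets
    lscpu | grep -E 'Thread\(s\) per core|Core\(s\) per socket|Socket\(s\)'

On a Hyperthreaded instance, lscpu reports two threads per core, so the physical core count is half the number nproc gives you.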

We ran two benchmarks on a single c3.large: one with a single webpack build running, and one with two running in parallel. We also ran the same benchmark on a c5.large to determine whether the newer instance type provided better Hyperthreading optimisations:

On both classes of instance, running two identical steps at the same time on the same physical CPU nearly doubled the execution time versus running only one, despite the benefits offered by Hyperthreading.
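The contention test itself was simple. A sketch of the approach, again with an illustrative image name and build command; both a c3.large and a c5.large expose 2 vCPUs backed by a single physical core, so two parallel builds are guaranteed to share it:

    # One build on its own
    time docker run --rm shopfront-ci yarn build

    # Two identical builds contending for the same physical core
    time ( docker run --rm shopfront-ci yarn build & \
           docker run --rm shopfront-ci yarn build & \
           wait )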

Other findings: Docker COPY vs Bind Mounts

All of the tests above were run via docker run on a container without volumes or bind mounts: node_modules and the project’s source were baked into the image via COPY . /app. Running the webpack build with these files bind mounted instead (via -v $(pwd):/app) showed a significant performance improvement:

Unfortunately, this isn’t something that we can easily take advantage of in our builds without making them significantly more complicated. Bind mounts also gave us no performance improvements when running the unit test step.
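For reference, the two variants compared above looked roughly like this (the image name and build command are illustrative):

    # Baked-in sources: node_modules and the app source were COPYed into the image at build time
    docker run --rm shopfront-ci yarn build

    # Bind-mounted sources: the working directory is mounted over /app at runtime
    docker run --rm -v "$(pwd)":/app shopfront-ci yarn build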

Configuration Changes

Based on the above results, we decided on two initial actions:

  • moving to c5d.2xlarge and m5d.2xlarge instances
  • halving the number of agents per virtual CPU

We opted for the d-class instances as we wished to keep using the instance storage provided by the c3 and m3 classes. Doubling the instance size while halving the number of agents per virtual CPU meant that we still had the same number of agents per spot instance, so we’d increase build performance without increasing cache misses on Docker image layers.
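In practical terms the agent density change is just arithmetic: a c3.xlarge has 4 vCPUs and ran 4 agents, while a c5d.2xlarge has 8 vCPUs and, at half an agent per vCPU, still runs 4. A sketch of how an instance could start its agents under the new policy; our real fleet does this through its launch configuration, and the queue name here is illustrative:

    # Run half as many agents as there are logical CPUs (vCPUs)
    AGENT_COUNT=$(( $(nproc) / 2 ))

    # Spawn that many agents from a single buildkite-agent process
    buildkite-agent start --spawn "$AGENT_COUNT" --tags "queue=shopfront"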

This was recorded as a set of Architectural Decision Records in the git repository containing the StackMaster configuration for this fleet so that future maintainers would know the context and thinking behind these changes.

Costs

Predicting how much this would increase costs was difficult: we anticipated that while each individual instance would cost twice as much, we’d ultimately need fewer of them, as faster builds would trigger the spot fleet’s autoscaling rules less frequently. We still expected the change to increase costs, as our fleet is configured to always have one instance running regardless of load, and we’d be doubling that instance’s size.

We found that the change more than doubled the cost for this fleet: the newer instance types are in higher demand and therefore attract higher spot prices. Fortunately for us the original costs were very low, so this level of increase was not a big worry!
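Spot pricing is easy to sanity-check yourself; something like the following shows the recent Linux spot price history for an instance type in the current region:

    # Recent Linux spot prices for c5d.2xlarge
    aws ec2 describe-spot-price-history \
        --instance-types c5d.2xlarge \
        --product-descriptions "Linux/UNIX"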

Impact

This change had an almost immediate and significant effect on branch builds, as shown in the scatter plot below from the middle of November onwards. Master builds have also improved, but less so, as the deploy to staging still adds a significant chunk of time:

This change has made builds both much faster and much more consistent: branch builds that previously took between ten and twenty-five minutes now take between four and ten, and master builds that took between fifteen and thirty-five minutes now take between seven and thirteen.

Other improvements have been made to this build, but of all of them this change had the highest impact. We’re now hoping to take what we’ve learned here and roll it out to a single consolidated fleet of agents shared by all projects, rather than one fleet per project. This will allow us to consider faster instance types (like the lightning-fast z1d instances) as we’ll have fewer “idle” agents, offsetting the cost.

Eagle-eyed readers will notice that the times in the scatter plot above are better than the speculative improvements from our initial runs would suggest. The change described above isn’t the whole story, just the one with the highest impact: further improvements were made to our webpack configuration, the balancing of E2E tests between nodes, and our Docker layer caching strategies.

More on these further changes soon!