To The Cloud in-depth

In a previous post, Envato Market: To The Cloud!, we discussed why we moved the Envato Market websites to Amazon Web Services (AWS) and a little about how we did it. In this post we’ll explore more of the technologies we used, why we chose them, and the pros and cons we’ve found along the way.

To begin with, there are a few key aspects of our design that we feel helped modernise the Market infrastructure and allowed us to take advantage of running in a cloud environment.

  • Where possible, everything should be an artefact
    • Source code for the Market site
    • Servers
    • System packages (services and libraries)
  • Everything is defined by code
    • Amazon Machine Images (AMIs) are built from code that lives in source control
    • Infrastructure is built entirely using code that lives in source control
    • The Market site is bundled into a tarball using scripts
  • Performance and resiliency testing
    • Form hypotheses about our infrastructure and then define mechanisms to prove them

We made a few technical decisions along the way to achieve these goals. Below we lay out those decisions, why they worked for us, and some caveats we discovered.

The implementation

Auto Scaling Groups (ASG)

We rely heavily on Auto Scaling Groups (ASGs) to keep our infrastructure running; they are the night watchmen keeping our servers running so we don’t have to. At the core of designing infrastructure to run in the cloud is the concept of designing for failure; only when you embrace failure do you enable yourself to take advantage of the scalability and reliability of cloud services.

Every server lives in an Auto Scaling Group, which defines a healthcheck to ensure the server is running. If the server fails, it is terminated and replaced with a new one. We also run our ASGs across three Availability Zones (different data centres in the same region). If an Availability Zone fails, replacements for the failed servers are launched automatically in another.
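The replace-on-failure behaviour described above can be pictured with a toy model. This is only a sketch of the idea, not the algorithm AWS uses: the `Instance` class, the `reconcile` function and the "fewest instances first" zone choice are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Instance:
    instance_id: str
    zone: str
    healthy: bool = True


def reconcile(instances, zones, desired, failed_zones=()):
    """Return the fleet after one healthcheck pass (toy model, not the AWS algorithm)."""
    # Terminate anything that fails its healthcheck or sits in a failed zone.
    survivors = [i for i in instances if i.healthy and i.zone not in failed_zones]
    usable_zones = [z for z in zones if z not in failed_zones]
    if not usable_zones:
        return survivors

    per_zone = Counter(i.zone for i in survivors)
    next_id = len(instances) + 1
    while len(survivors) < desired:
        # Hypothetical placement rule: pick the usable zone with the fewest instances.
        zone = min(usable_zones, key=lambda z: per_zone[z])
        survivors.append(Instance(f"i-new{next_id}", zone))
        per_zone[zone] += 1
        next_id += 1
    return survivors
```

For example, with one instance in each of three zones and a desired capacity of three, marking one zone as failed leaves two survivors and launches one replacement in a remaining zone.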

In order to use ASGs we must define a server artefact to launch. To provide operational efficiencies we want that artefact to be built automatically.

Packer and Puppet

For simple servers, like our log forwarder, we use vanilla Packer with embedded bash code to build AMIs. The JSON files are our code and the AMIs our build artefact.

We’ve been using Puppet for a number of years to manage our servers and we’re comfortable with it. Since the migration took many months it was also good to use the same code we used to define our servers in both our old and new environments, so we didn’t miss any updates or fixes. So for our application servers (which by far have the most complex requirements) we decided to build our AMIs with Puppet and Packer, using BuildKite to do the build for us to ensure consistency.

We also have a lot of ServerSpec tests we run locally on our laptops to test our infrastructure code. Running them locally was sometimes a slow and buggy process, especially for those of us who work from home and don’t have fast internet. It was also not entirely accurate, as the virtual machine on a laptop doesn’t exactly replicate the AMI we are building. So we developed AmiSpec to help us use our Continuous Integration systems to test our servers before they go into production.

We build as much of the software and configuration as we can into our AMIs, which enables us to launch replacement instances quickly. Unlike Netflix, however, we don’t bake our application into the AMIs, an approach known as “immutable AMIs”. This gives us the flexibility to deploy as often as we do at lower cost, while still allowing us to launch new servers relatively quickly (generally within a few minutes).

Code Deploy

In a previous post we discussed how we implemented automated deploys. During the migration we moved our aging Capistrano deployment code to CodeDeploy. We make upwards of 40 changes and 18 deployments to our website a day, and we need the deployments to be reliable and fast. The change was significant but necessary, because our existing deployment code had many problems:

  • It had grown organically over many years and resembled spaghetti code
  • The mixture of Bash and Ruby made the code difficult to read, write and reason about
  • It had zero unit tests
  • For all these reasons it was extremely fragile, which further prevented us from refactoring it

This led to a very brittle deployment approach that everyone wanted to avoid touching.

With CodeDeploy the deployment code continued to live in our source code repository, but since CodeDeploy handled downloading the source to every server we were able to write most of it in Bash. Some more complicated parts required Ruby, but even with Bash we are able to write tests using Bats. This differs from our previous approach of “shelling out” to Bash from Ruby, because we are defining specific functions in one language. Each component can then be unit tested and swapped out easily for another if necessary.

While we found CodeDeploy was a great choice for us, we also had a couple of issues that tripped us up more than once.

Diagnosing launch failures

It can be tricky to diagnose why a new instance fails to launch. CodeDeploy automatically fails the launch of any instance to which it cannot deploy, causing the instance to be terminated. If your ASG is trying to scale up to meet desired capacity and this keeps happening, you end up launching new instances in a loop. This is very expensive if it goes unchecked, since you pay for one hour of time for each instance launch and can launch many instances an hour. We highly recommend that anyone using CodeDeploy monitors for this scenario and wakes someone up if necessary to resolve it. We chose DataDog for this, but there are other solutions we won’t cover in this article.
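To make the cost of such a loop concrete, here is a minimal sketch of the arithmetic and of one way to spot the pattern. It is not how DataDog or our monitoring works: the function names, the hourly price and the threshold of six launches per hour are assumptions chosen for illustration, and the one-hour charge per launch follows the billing described above.

```python
from datetime import datetime, timedelta


def loop_cost(failed_launches, hourly_price):
    """Cost of a launch loop when every failed launch is billed one full hour."""
    return failed_launches * hourly_price


def looks_like_launch_loop(launch_times, window=timedelta(hours=1), threshold=6):
    """Return True if any sliding window of `window` holds at least `threshold` launches."""
    times = sorted(launch_times)
    start = 0
    for end, current in enumerate(times):
        # Shrink the window from the left until it spans at most `window`.
        while current - times[start] > window:
            start += 1
        if end - start + 1 >= threshold:
            return True
    return False


# Usage: twelve launches five minutes apart, at a hypothetical $0.50 per hour.
first = datetime(2016, 1, 1, 3, 0)
launches = [first + timedelta(minutes=5 * n) for n in range(12)]
print(looks_like_launch_loop(launches))   # the pattern an alert should catch
print(loop_cost(len(launches), 0.50))     # instance-hours billed for nothing
```

Even a single instance respawning every five minutes is billed as twelve instance-hours per hour of wall-clock time, which is why this is worth paging someone for.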

To troubleshoot this, first check the CodeDeploy deployment log, available in the AWS Console. You can also use the get-console-output CLI command to see the output from your instance at boot time, which helps you understand whether the server started correctly.

Creating a new CodeDeploy application with no successful deployment revisions

If you have to re-create your CodeDeploy application then there are no healthy revisions of your application. When you have no healthy revisions it is impossible to get CodeDeploy to deploy if you use an Elastic Load Balancer (ELB) healthcheck.

Your instances won’t get the application deployed to them on boot because you have no previous “healthy” revision (that is, a revision that was successfully deployed). Without the application deployed, the instances never become healthy, and without healthy instances you can’t deploy the application. They end up stuck in the respawn loop described above.

We chose to switch to EC2-based checks to work around this situation.

Concurrent deploys

There’s a limit of 10 concurrent deploys per account. Each instance you launch in a CodeDeploy deployment group counts as one deployment. When we scaled up our ASG by more than 10 instances at a time, the rest of the instances failed to launch and were terminated because their heartbeat timed out (10 minutes by default). The maximum number of concurrent deployments is a service limit you can ask AWS to raise.
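Besides asking AWS to raise the limit, one possible workaround is to scale up in steps that never exceed it. The sketch below only computes the intermediate desired-capacity values; it is an illustration rather than what we run, and the function name and the idea of waiting for each step’s deployments to finish before applying the next are assumptions.

```python
def scale_up_steps(current, target, max_concurrent=10):
    """Return successive desired-capacity values, each adding at most `max_concurrent` instances."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    steps = []
    while current < target:
        # Never request more new instances than the concurrent deploy limit allows.
        current = min(current + max_concurrent, target)
        steps.append(current)
    return steps


# Usage: growing from 5 to 30 instances under a limit of 10 concurrent deploys.
print(scale_up_steps(5, 30))  # [15, 25, 30]
```

Each value would be applied as the ASG’s desired capacity only after the deployments from the previous step have completed.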

Starting the CodeDeploy agent on boot

If you require your user data to run before your app can be deployed, you need to start CodeDeploy from your user data. The CodeDeploy agent can start before user data runs, resulting in a race condition bug.

We found the AWS CodeDeploy Under the Hood blog post extremely valuable for understanding how CodeDeploy works and troubleshooting these types of issues.

Elastic Load Balancers (ELB)

This is a no-brainer for our core web servers. ELBs scale to support hundreds of thousands of requests per minute and are at the core of almost every major AWS deployment.

We also spent a lot of time planning and creating our healthcheck endpoint. We chose Rack ECG, an open source tool developed in-house, to create a simple endpoint for the ELB to check. We were deliberate about only checking hard dependencies of our application, like our databases and cache. We check that our databases are writable, so if a database fails and Rails does not reconnect or re-resolve the DNS entry, the instance is terminated and a new one provisioned. We did lots of testing with different failure scenarios to make sure we could recover automatically where possible, and as quickly as possible.
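The shape of such a healthcheck can be sketched in a few lines. This is not Rack ECG’s API (Rack ECG is a Ruby tool); it is a Python illustration of the idea, with SQLite standing in for the real database and every name here (`run_checks`, `database_is_writable`, the `healthcheck` table) being hypothetical.

```python
import sqlite3
from http import HTTPStatus


def database_is_writable(conn):
    """Prove the database accepts writes by inserting a row, then rolling it back."""
    conn.execute("CREATE TABLE IF NOT EXISTS healthcheck (checked_at TEXT)")
    conn.execute("INSERT INTO healthcheck VALUES (datetime('now'))")
    conn.rollback()


def run_checks(checks):
    """Run every hard-dependency check; report 200 only if all of them pass."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:  # any failure marks the instance unhealthy
            results[name] = f"failed: {exc}"
    healthy = all(outcome == "ok" for outcome in results.values())
    status = HTTPStatus.OK if healthy else HTTPStatus.SERVICE_UNAVAILABLE
    return int(status), results


# Usage: only hard dependencies are listed, so a failure means "replace me".
conn = sqlite3.connect(":memory:")
status, results = run_checks({"database": lambda: database_is_writable(conn)})
print(status, results)
```

Because the load balancer only sees the status code, keeping soft dependencies out of the check avoids terminating instances for problems a restart cannot fix.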

ELB connection draining over application reloads

One decision we made, without measuring the performance impact, was how we stop serving traffic on a web server in order to deploy a new revision of our code without impacting users.

We use Unicorn as our backend HTTP server. It supports a reload command that allows existing connections to finish while stopping old workers and starting new ones with our new code. This resulted in a fourfold increase in response times for a brief period, until Unicorn settled down.

By using the ELB to drain connections from our web servers instead, we’ve found that our response times increase by only 50% during deployments.

Route53

For hosts not directly part of our application server group that don’t need to be, or cannot be, load balanced we use Route53 to register domain names with one or more IPs that we associate to our Auto Scaling Group instances on boot.

CloudFormation and StackMaster, a match made in heaven

Last year, as part of a hackfort project, some of our developers put together a tool called StackMaster. You can read more about it in a previous blog post.

Initially we reviewed Terraform as well as StackMaster, but chose StackMaster for its flexibility combined with the maturity of CloudFormation. Time and again we’ve found that the modularity of SparkleFormation dynamics, combined with StackMaster Parameter Resolvers and many other features, produces small, re-usable stacks with little repetition. This reduces the amount of code needed and makes that code easier to reason about.

We like smaller stacks because, in our experience, it’s possible for a stack to become “wedged” in an unrecoverable state through human error or software bugs. When a stack is wedged you have little choice but to destroy it and re-create it. By creating smaller stacks we reduce the impact of having to destroy a CloudFormation stack. That’s also why we chose to define our database resources using scripts rather than in CloudFormation.

Also by splitting certain resources out, like our Elastic IPs and Domain Names, we decouple the infrastructure in a way that allows us to more easily make changes in the future. At the moment we’re considering adding another Load Balancer to an Auto Scaling Group, an operation that requires the Auto Scaling Group be destroyed (along with all its instances) and recreated. This would normally be a change that would cause downtime, but by defining the Domain Name that points to our Load Balancer in a separate stack we can stand an exact copy of that stack up, swap the domain name over to it and delete the old stack, similar to a Blue-green deployment.

In Summary

  • We created what we like to call “semi-immutable” machine images, to balance speed of scaling up with cost and flexibility.

  • We chose to restructure how we deploy our infrastructure and application in order to take advantage of a cloud platform.

  • We spent time investigating our technology choices and design decisions to validate that they solved the right problems.

All this would be for nothing if there was no impact on users. No amount of fancy cloud buzzwords would make it valuable if our customers were not better off. Thankfully we’ve already started to see impressive performance improvements on our site, mostly because we identified some bottlenecks during early performance testing and were able to resolve them quickly.

Here’s a chart of our backend response time, that is, how fast our servers respond to queries from users. Response times after the migration are in blue; response times from before the migration are in grey.

Backend Response times; Rackspace in grey, AWS in blue

Our agility is what allowed us to make this improvement and it’s what drove us to move the Envato Market sites in the first place, so it’s already paying dividends.