Post Mortem report: 19 October 2016

On Wednesday 19 October, Envato Market sites suffered a prolonged incident and were intermittently unavailable for over eight hours. The incident began at 01:56 AEDT (Tuesday, 18 October 2016, 14:56 UTC) and ended at 10:22 AEDT (Tuesday, 18 October 2016, 23:22 UTC). During this time, users would have intermittently seen our “Maintenance” page and therefore been unable to interact with the sites. The issue was caused by an inaccessible directory on a shared filesystem, which in turn was caused by a volume filling to capacity. The incident duration was 8 hours 26 minutes; total downtime of the sites was 2 hours 56 minutes.

We’re sorry this happened. During the periods of downtime, the site was completely unavailable: users couldn’t find or purchase items, and authors couldn’t add or manage theirs. We’ve let our users down and let ourselves down too. We aim higher than this and are working to ensure it doesn’t happen again.

In the spirit of our “Tell it like it is” company value, we are sharing the details of this incident with the public.

Context

Envato Market sites recently moved from a traditional hosting service to Amazon Web Services (AWS). The sites use a number of AWS services, including Elastic Compute Cloud (EC2), Elastic Load Balancing (ELB), and the CodeDeploy deployment service. The sites are served by a Ruby on Rails application, fronted by the Unicorn HTTP server. The web EC2 instances all connect to a shared network filesystem, powered by GlusterFS.
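For readers unfamiliar with Unicorn’s process model, a minimal configuration sketch looks something like the following (the values and socket path are illustrative, not our production settings). The key point is that each worker handles one request at a time, so an instance can only serve as many concurrent requests as it has live workers.

```ruby
# config/unicorn.rb -- illustrative sketch only, not our production configuration.
worker_processes 8            # each worker serves one request at a time
timeout 60                    # the master SIGKILLs workers that exceed this
preload_app true

listen "/tmp/unicorn.sock", backlog: 64

before_fork do |_server, _worker|
  # Disconnect shared resources in the master before forking workers.
  ActiveRecord::Base.connection.disconnect! if defined?(ActiveRecord::Base)
end

after_fork do |_server, _worker|
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord::Base)
end
```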

Timeline

NewRelic graph: outage overview

(All times are in AEDT / UTC+11)

  • [01:56] Load balancer health checks start failing and terminating web EC2 instances
  • [02:00] “Site down” alerts from Pingdom (one of our monitoring systems)
  • [02:09] The on-call engineers begin investigating
  • [02:19] Web instances are replaced and serving traffic, site appears healthy

  • [02:23] Health checks begin failing again, terminating web instances
  • [02:40] Automatic termination is disabled to avoid flapping and facilitate investigation
  • [02:50] We notice all Unicorn worker processes are tied up accessing a Gluster mount
  • [02:55] Sites are isolated from Gluster by way of an outage flip
  • [03:05] We recognize the Gluster mount is at 100% utilization
  • [03:13] All web instances are replaced to clear stalled processes, restoring site functionality
  • [03:30] We wake the Gluster subject matter expert from our Content team to help

  • [04:02] Health checks begin failing once again
  • [04:05] We recognize one particular code path doesn’t respect the shared filesystem outage flip
  • [04:14] All web instances are replaced again to clear workers and restore the site

  • [04:55] The next round of health check failures begins
  • [05:25] A fix to the broken code path above is deployed, but the deployment fails
  • [05:44] “Maintenance mode” is enabled, blocking all users from the site and showing a maintenance page
  • [06:22] We notice that maintenance mode is blocking CodeDeploy deployments
  • [06:47] Maintenance mode is disabled, and we block user access to the site in a different manner
  • [06:56] The deployment of the previous fix finally succeeds, and users are allowed back into the site

  • [08:57] Once again, health checks start failing
  • [09:01] We notice another code path which doesn’t respect the shared filesystem outage flip
  • [09:13] Our Gluster expert identifies a problem with one directory in the shared filesystem
  • [10:22] A fix to use a different shared directory is deployed, restoring the site to service

Analysis

This incident manifested as five “waves” of outages, each subsequent one occurring after we thought the problem had been fixed. In reality there were several problems occurring at the same time, as is usually the case in complex systems. There was not one single underlying cause, but rather a chain of events and circumstances that led to this incident. A section follows for each of the major problems we found.

Disk space and Gluster problems

The first wave of the outage was due to a simple problem that, embarrassingly, went uncaught: our shared filesystem ran out of disk space.

DataDog graph: system disk free

As shown in the graph, free space started dropping fairly quickly prior to the incident, falling from around 200 GiB to 6 GiB in a couple of days. Low free space isn’t a problem in and of itself, but the fact that we didn’t recognize and correct the issue is. Why didn’t we know about it? Because we neglected to set an alert condition for it. We were collecting filesystem usage data, but never generating any alerts! An alert about rapidly decreasing free space may have allowed us to take action to avoid the problem entirely. It’s worth mentioning that we did have alerts on the shared filesystem in our previous environment, but they were inadvertently lost during our AWS migration.
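To illustrate the kind of alert that was missing, here is a rough sketch using the dogapi gem. The metric is Datadog’s standard disk metric, but the tag, threshold, and notification target are assumptions rather than our actual monitoring configuration.

```ruby
# Sketch of a "low free space" monitor -- tag, threshold, and @-mention are assumptions.
require "dogapi"

dog = Dogapi::Client.new(ENV.fetch("DATADOG_API_KEY"), ENV.fetch("DATADOG_APP_KEY"))

# Alert when free space on the (hypothetical) Gluster hosts drops below ~50 GiB.
query = "min(last_5m):min:system.disk.free{role:gluster} < 53687091200"

dog.monitor(
  "metric alert",
  query,
  name:    "Shared filesystem free space is low",
  message: "The Gluster volume is running out of disk space. @pagerduty",
  options: { notify_no_data: true }
)
```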

An out-of-space condition doesn’t explain the behavior of the site during the incident, however. As we came to realize, whenever a user made a request that touched the shared filesystem, the Unicorn worker servicing that request would hang forever waiting to access the shared filesystem mount. If the disk were simply full, one might expect the standard Linux error in that scenario (ENOSPC No space left on device).

The GlusterFS shared filesystem is a cluster consisting of three independent EC2 instances. When the Gluster expert on our Content team investigated, he found that the full disk had caused Gluster to shut down as a safety precaution. When the lack of disk space was addressed and Gluster started back up, it did so in a split-brain condition, with the data in an inconsistent state between the three instances. Gluster attempted to heal this automatically, but was unable to because our application kept attempting to write files to it. The end result was that any access to a particular directory on the shared filesystem stalled forever.
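For reference, the kind of check that would have surfaced this state looks roughly like the following sketch, intended to run on one of the Gluster servers. The volume name is hypothetical, and the gluster CLI output it parses can vary a little between Gluster versions.

```ruby
#!/usr/bin/env ruby
# Heal-state check for a GlusterFS volume -- "market_shared" is a hypothetical name,
# and the output parsing assumes the format of GlusterFS 3.7-era releases.
VOLUME = "market_shared"

def heal_info(args = "")
  output = `gluster volume heal #{VOLUME} info #{args} 2>&1`
  abort("gluster command failed: #{output}") unless $?.success?
  output
end

pending     = heal_info.scan(/Number of entries: (\d+)/).flatten.map(&:to_i).sum
split_brain = heal_info("split-brain").scan(/Number of entries in split-brain: (\d+)/)
                                      .flatten.map(&:to_i).sum

puts "entries pending heal: #{pending}"
puts "entries in split-brain: #{split_brain}"
exit(pending.zero? && split_brain.zero? ? 0 : 1) # non-zero exit can drive an alert
```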

A compounding factor was the uninterruptible nature of any process that tried to access this directory. As stuck Unicorn workers piled up, we tried killing them, first gracefully with SIGTERM, then with SIGKILL. Because the workers were blocked in uninterruptible sleep, neither signal had any effect; the only way to clear the stuck processes was to terminate the instances.
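On a Linux host, this state shows up as processes stuck in uninterruptible sleep (state “D”). A small sketch like the following, purely illustrative, is enough to spot them:

```ruby
# Illustrative: list Unicorn processes stuck in uninterruptible sleep ("D" state),
# the condition that made even SIGKILL ineffective.
Dir.glob("/proc/[0-9]*").each do |dir|
  begin
    cmdline = File.read("#{dir}/cmdline").tr("\0", " ").strip
    next unless cmdline.include?("unicorn")

    state = File.read("#{dir}/status")[/^State:\s+(\S+)/, 1]
    puts "#{File.basename(dir)}: #{cmdline} (state #{state})" if state == "D"
  rescue Errno::ENOENT, Errno::ESRCH, Errno::EACCES
    next # the process exited (or is unreadable) -- skip it
  end
end
```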

Resolution

One of the biggest contributors to the extended recovery time was how long it took to identify the problem with the shared filesystem’s inaccessible directory: just over seven hours. Once we understood the problem, we reconfigured the application to use a different directory, redeployed, and had the sites back up in less than an hour.

GlusterFS is a fairly new addition to our tech stack, and this is the first time we’ve seen errors with it in production. Because we didn’t understand its failure modes, we weren’t able to identify the underlying cause of the issue; instead, we reacted to the symptom and kept trying to isolate our code from the shared filesystem. Happily, the issue was eventually identified and we were able to work around it.

Takeaway: new systems will fail in unexpected ways; be prepared for that when putting them into production.

Unreliable outage flip

To isolate our application from dependencies that are experiencing problems, we’ve implemented a set of “outage flips”: choke points that all code accessing a given system goes through, allowing that system to be disabled in one place.

We have such a flip around our shared filesystem, and most of our code respects it, but not all of it does. Waves 3 and 5 were both due to code paths that accessed the shared filesystem without checking the flip state first. Any request that used one of these code paths would touch the problematic directory and stall its Unicorn worker. When all the available workers on an instance were stalled this way, the instance could not service further requests; when that happened on every instance, the site went down.
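As a rough illustration of the pattern, the module and method names below are hypothetical, but every access to the shared filesystem is meant to go through a single guard like this:

```ruby
# Hypothetical sketch of the outage flip pattern -- names are illustrative.
module SharedFilesystem
  class DisabledError < StandardError; end

  # The flip itself: a single switch stored somewhere cheap to read
  # (an environment variable here; a feature-flag store in practice).
  def self.enabled?
    ENV["SHARED_FS_OUTAGE_FLIP"] != "on"
  end

  # The choke point: code that touches the shared mount should call this
  # rather than hitting the mount directly.
  def self.with_shared_fs
    raise DisabledError, "shared filesystem is flipped off" unless enabled?
    yield
  end
end

# A code path that respects the flip:
def attachment_contents(path)
  SharedFilesystem.with_shared_fs { File.read(path) }
rescue SharedFilesystem::DisabledError
  nil # degrade gracefully instead of stalling a Unicorn worker
end
```

The code paths behind waves 3 and 5 were, in effect, reading from the mount directly rather than going through the guard.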

Resolution

During the incident we identified two code paths that did not respect the shared filesystem outage flip. Had we not identified the underlying cause, we probably would have continued the cycle of fixing broken code paths, deploying, and waiting to find the next one. Luckily, as we fixed the broken code paths, the frequency with which the problem recurred decreased (the broken code we found in wave five took much longer to consume all available Unicorn workers than the code in the first wave).

Takeaway: testing emergency tooling is important; make sure it works before you need it.

Deployment difficulties

We use the AWS CodeDeploy service to deploy our application. The way CodeDeploy deployments work in our environment severely slowed our ability to react to issues with code changes.

When you deploy with CodeDeploy, you create a revision which gets deployed to instances. When deploying to a fleet of running instances, the revision is deployed to each instance in the fleet and the result (success or failure) is recorded. When an instance first comes into service, it receives the revision from the latest successful deployment.

A couple of times during the outage we needed to deploy code changes. The process went something like this:

  1. Deploy the application
  2. The deployment would fail on a few instances, which were in the process of starting up or shutting down due to the ongoing errors.
  3. Scale the fleet down to a small number of instances (2)
  4. Deploy again to only those two instances
  5. Once that deployment was successful, scale the fleet back to nominal capacity

This process takes between 20 and 60 minutes, depending on the current state of the fleet, so it can significantly impact the time to recovery.
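For illustration, the emergency workaround amounted to something like the following sketch using the AWS SDK for Ruby; the application, deployment group, Auto Scaling group, and bucket names, along with the fleet sizes, are invented.

```ruby
# Sketch of the emergency deploy sequence -- all resource names are invented.
require "aws-sdk-autoscaling"
require "aws-sdk-codedeploy"

asg        = Aws::AutoScaling::Client.new
codedeploy = Aws::CodeDeploy::Client.new

# 1. Shrink the fleet so the deployment only has to succeed on two instances.
asg.set_desired_capacity(auto_scaling_group_name: "market-web", desired_capacity: 2)

# 2. Deploy the fix to those two instances.
deployment = codedeploy.create_deployment(
  application_name:      "market",
  deployment_group_name: "market-web",
  revision: {
    revision_type: "S3",
    s3_location: { bucket: "market-releases", key: "fix.tar.gz", bundle_type: "tgz" }
  }
)

# 3. Wait for that deployment to succeed, then scale back to nominal capacity;
#    new instances pick up the now-latest successful revision as they launch.
codedeploy.wait_until(:deployment_successful, deployment_id: deployment.deployment_id)
asg.set_desired_capacity(auto_scaling_group_name: "market-web", desired_capacity: 12)
```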

Resolution

This process was slow but functional. We will investigate whether we’ve configured CodeDeploy properly and look for ways to decrease the time taken during emergency deployments.

Takeaway: consider both happy-path and emergency scenarios when designing critical tooling and processes.

Maintenance mode script

During outages, we sometimes block public access to the site in order to carry out tasks that would otherwise disrupt users. To implement this, we use a script that creates a network ACL (NACL) entry in our AWS VPC to block all inbound traffic. We found that when we used this script, outbound traffic destined for the internet was also blocked. This was especially problematic because it prevented us from deploying any code.

CodeDeploy uses an agent process on each instance to facilitate deployments: it communicates with the remote AWS CodeDeploy service and runs code locally. To talk to the service, the agent initiates outbound HTTPS requests to the CodeDeploy service endpoint on port 443. When we enabled maintenance mode, the agent was no longer able to establish connections with the service.

After the incident, we investigated the cause further, which turned out to be an oversight in the design of the script. Our network is partitioned into public and private subnets: web instances live in private subnets and communicate with the outside world via gateways residing in public subnets. Traffic destined for the public internet crosses the boundary between private and public subnets, and it is at that boundary that network access controls are imposed. In this case, the internet-bound traffic, including the CodeDeploy agent’s connections, was blocked by the NACL entry added by the maintenance mode script.
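For illustration, the blocking step of the script amounted to a single deny-all inbound NACL entry, roughly like this sketch using the AWS SDK for Ruby (the NACL ID and rule number are invented):

```ruby
# Sketch of the maintenance-mode block -- the NACL ID and rule number are invented.
require "aws-sdk-ec2"

ec2 = Aws::EC2::Client.new

ec2.create_network_acl_entry(
  network_acl_id: "acl-0123456789abcdef0", # hypothetical NACL on the public subnets
  rule_number:    10,
  protocol:       "-1",   # all protocols
  rule_action:    "deny",
  egress:         false,  # inbound rule
  cidr_block:     "0.0.0.0/0"
)
```

Because NACLs are stateless and are evaluated for all traffic entering a subnet, a blanket inbound deny like this also catches traffic crossing between our own subnets. One likely shape for the fix is a more targeted rule set, for example an explicit allow for traffic originating inside the VPC ahead of the deny, so that internal traffic (including deployments) keeps flowing while public access stays blocked.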

Resolution

As soon as we realized that the maintenance mode script was blocking deployments, we disabled it and used a different mechanism to block access to the site. This was effectively the first time the script was used in anger, and although it did work, it had unintended side effects.

Takeaway: again, testing emergency tooling is important.

Corrective measures

During this incident and the subsequent post-incident review meeting, we’ve identified several opportunities to prevent these problems from recurring.

  1. Alert on low disk space condition in shared filesystem

    This alert should have been in place as soon as Gluster was put into production. If we’d been alerted to the low disk space condition before the space ran out, we may have been able to avoid this incident entirely. We’re also considering more advanced alerting options to catch the scenario where the available space is consumed rapidly.

    This action is complete; we now receive alerts when the free space drops below a threshold.

  2. Add monitoring for GlusterFS error conditions

    When Gluster is not serving files as expected (due to low disk space, shutdown, healing, or any other type of error) we want to know about it as soon as possible.

  3. Add more disk space

    On the day of the incident, we freed space by deleting some unused files. We also need to add more space so we have an appropriate amount of “headroom” to avoid similar incidents in the future.

  4. Investigate interruptible mounts for GlusterFS

    The stalled processes which were unable to be killed significantly increased our time to recovery. If we could have killed the stuck workers, we may have been able to recover the site much faster. We’ll look into how we can mount the shared filesystem in an interruptible way.

  5. Reconsider GlusterFS

    Is GlusterFS the right choice for us? Are there alternatives that may work better? Do we need a shared filesystem at all? We will consider these questions to decide the future of our shared filesystem dependency. If we do stick with Gluster, we’ll upskill our on-callers in troubleshooting it.

  6. Ensure all code respects outage flip

    Had all our code respected the shared filesystem outage flip, this would have been a much smaller incident. We will audit all code which touches the shared filesystem and ensure it respects the state of the outage flip.

  7. Fix the maintenance mode script

    The maintenance mode script’s unintended side effect of blocking deployments extended the downtime unnecessarily. The script will be fixed to allow the site to function internally while still blocking public access.

  8. Ensure incident management process is followed

    We have an incident management process to follow, which (amongst other things) describes how incidents are communicated internally. The process was not followed appropriately, so we’ll make sure that it’s clear to on-call engineers.

  9. Fire drills

    The incident response process can be practiced by running “fire drills”, where an incident is simulated and on-call engineers respond as if it were real. We’ve not had many major incidents recently, so we need some practice. This practice will also include shared filesystem failure scenarios, since that system is relatively new.

Summary

Like many incidents, this one was due to a chain of events that ultimately resulted in a long, drawn-out outage. By addressing the links in that chain, we can avoid similar problems in the future. We sincerely regret the downtime, but we’ve learned a lot of valuable lessons and welcome this opportunity to improve our systems and processes.