This is the story of how we moved Envato’s Market sites to the cloud. Envato Market is a family of seven themed websites selling digital assets. We’re busy; our sites operate to the tune of 25,000 requests per minute on average, serving up roughly 140 million pageviews per month. We have nearly eleven million unique items for sale and seven million users. We recently picked these sites up out of the home they’d occupied for the past six years and moved them to Amazon Web Services (AWS). Read on to learn why we did it, how we did it, and what we learned!
A short history of hosting at Envato
Back in 2010, Envato was hosted at EngineYard, and looking to move. EngineYard was then a Ruby-only application hosting service. The Market sites were growing to the point where the EngineYard service was no longer suitable, and Envato wanted to focus on its core business of building marketplaces rather than running servers. In August 2010 the Market sites moved to Rackspace’s managed hosting platform.
From 2010 to 2016 the Market sites were hosted by Rackspace. While managed hosting was a good choice for the Envato of that time, the company and the community have grown significantly since then. Around 2013, we found ourselves looking once again for a platform that better fit our needs.
Like many tech companies, Envato runs “hack weeks”, where we pause our normal work and spend a week or two trying out new ideas. In a hack week in September 2014, a team wondered if it was possible to move Market to AWS in one week. In true hack week style, this project focused solely on that goal, and was successful. The team had one Market site running in AWS within the week, and proved that the “lift and shift” strategy was feasible. While we’d have loved to migrate to AWS then and there, the work was only a proof of concept and nowhere near production-ready.
Flash forward nearly two years from that first try, and we’ve made it a reality for all the Market sites!
Why we moved
A strong element in the development culture at Envato is a “do it yourself” attitude – rather than waiting for someone else to do something for us, we’d prefer to do it ourselves. Managed hosting was no longer a good fit: we had a lot to do, and we were constantly constrained by the delays inherent in a managed service.
The managed nature of the service meant we were effectively hands-off the infrastructure. While we had access to the virtual machines that ran our sites, everything else – physical hardware, storage, networking – was controlled by Rackspace and required a very manual support ticket process to change. This process was often lengthy, and held us back from operating with the speed we desired.
Working in AWS requires a paradigm shift. While Amazon still manages the physical infrastructure, everything else is up to you. Provisioning a new server, adding storage capacity, changing firewall rules or even network layout: these tasks, which would have taken days to weeks in a managed hosting environment, can be accomplished in seconds to minutes in AWS.
Sometimes we run experiments to prove or disprove an idea’s feasibility, often in the form of putting a new website up and observing how people use it. That means taking the idea from nothing to a functional site in a short period of time. With managed hosting, that could take weeks or even months to accomplish. In AWS, we can build out a site and its supporting infrastructure very rapidly. This ability to quickly run an experiment is crucial to developing new products and features.
Finally, there is a cost incentive to moving to AWS. In Rackspace, we leased dedicated hardware and paid a fixed cost no matter how much traffic we were serving: we had to pay for enough capacity to handle peak traffic load at all times, and at non-peak times we paid the same rate. In AWS, you pay for what you use – you’re billed only for the resources you actually provision, for as long as you run them. The ease with which capacity can be added or removed means we can add capacity during peak times and remove it during non-peak times, saving money. After an initial settling period we’ll be able to model our usage, and we expect to see cost savings on the order of 30-50%.
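To make the pricing difference concrete, here is a small back-of-the-envelope sketch in Python. All of the numbers below (instance counts, hourly rate, peak hours) are hypothetical, not our actual figures; the point is only the shape of the calculation.

```python
# Hypothetical comparison of fixed-capacity vs. pay-for-what-you-use pricing.
# Every constant here is invented for illustration.

HOURLY_RATE = 0.50        # cost per instance-hour (hypothetical)
PEAK_INSTANCES = 40       # capacity needed to handle peak traffic
OFF_PEAK_INSTANCES = 15   # capacity needed outside peak hours
PEAK_HOURS_PER_DAY = 8

def fixed_cost_per_day():
    # Dedicated hardware: pay for peak capacity around the clock.
    return PEAK_INSTANCES * 24 * HOURLY_RATE

def autoscaled_cost_per_day():
    # Usage-based: scale down to a smaller fleet outside peak hours.
    peak = PEAK_INSTANCES * PEAK_HOURS_PER_DAY * HOURLY_RATE
    off_peak = OFF_PEAK_INSTANCES * (24 - PEAK_HOURS_PER_DAY) * HOURLY_RATE
    return peak + off_peak

fixed = fixed_cost_per_day()        # 480.0
scaled = autoscaled_cost_per_day()  # 160.0 + 120.0 = 280.0
savings = 1 - scaled / fixed        # ~0.42, i.e. ~42%
print(f"fixed: ${fixed:.2f}/day, autoscaled: ${scaled:.2f}/day, saving {savings:.0%}")
```

With these made-up numbers the daily saving lands at roughly 42%, comfortably inside the 30-50% range we expect to see once our own usage is modelled.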
With a limited timeframe to accomplish the migration, we had some tough decisions to make. Rebuilding the application from scratch to work in the cloud was not an option, due to the enormous amount of time that would take. Instead, we chose a common strategy in software: MVP, minimum viable product. We did just the amount of work required to deliver the new, migrated platform, without rebuilding every component. This reduced the time to market and let us focus on the core problems.
A big choice faced by companies moving workloads to the cloud is whether to “lift and shift” or to rearchitect. Lift and shift refers to picking up an entire application and rehosting it in the cloud unchanged. This has the advantage of speed and reduced development effort; however, applications migrated this way can’t leverage the capabilities of the cloud platform and often cost more than they did pre-migration. Rearchitecting, on the other hand, is cost- and time-intensive, but results in an application built for the platform, which can benefit from all the features it provides.
Envato performed a lift and shift migration of our Single Sign-on system (account.envato.com) a couple of years ago; we learned that while this approach can be accomplished quickly, it requires significant work after the fact to get the systems involved running as desired. Had we realized that up front, we might have chosen to do that work as we migrated.
Why not both?
Instead of picking one or the other, we chose a hybrid approach in moving the Market. Functionality that could easily be left unchanged, was. Only changes that were required or that would be immediately beneficial were made.
Amongst the more important changes made were the following:
We replaced our aging Capistrano-based deployment scripts with AWS CodeDeploy. CodeDeploy integrates with other AWS systems to make deployments easier. While Capistrano can be made to work in the cloud, it falls short in supporting rapid scaling.
Scout, our existing Rails-specific monitoring system, has been replaced by Datadog for monitoring and alerting. Monitoring in the cloud requires first-class support for ephemeral systems, and Datadog provides that along with excellent visualization, aggregation, and communication functionality.
The key component of the Market sites, our database, was moved from a self-managed MySQL installation to Amazon Aurora, a high performance MySQL-compatible managed database service from AWS. Aurora offers significant performance increases, high availability, automated failover, and many other features.
For some core services, we opted to use AWS’ managed versions, rather than managing ourselves. We chose Amazon ElastiCache for application-level caching; the Aurora database mentioned above is also a managed service; and we make use of the Elastic Load Balancing service for our load balancers.
The application now runs on Amazon EC2 instances managed by Autoscaling groups, effectively removing single points of failure from our infrastructure. If a problem affects any given instance, it is easily and quickly replaced and returned to service. Adding and removing capacity takes nothing more than the click of a button.
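To give a flavour of what these pieces look like in practice, here is a heavily simplified CloudFormation fragment (in YAML) defining an Autoscaling group of the kind described above. The resource names and sizes are hypothetical, and the referenced launch configuration and load balancer are assumed to be defined elsewhere in the same template – this is a sketch, not our production configuration.

```yaml
# Sketch only; names and sizes are hypothetical, and MarketWebLaunchConfig
# and MarketWebLoadBalancer are assumed to be defined elsewhere.
Resources:
  MarketWebGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: "4"
      MaxSize: "20"
      # Spread instances across every availability zone in the region
      AvailabilityZones: !GetAZs ""
      LaunchConfigurationName: !Ref MarketWebLaunchConfig
      LoadBalancerNames:
        - !Ref MarketWebLoadBalancer
      # Replace any instance the load balancer reports as unhealthy
      HealthCheckType: ELB
      HealthCheckGracePeriod: 300
```

Raising or lowering capacity is then just a change to `MinSize`/`MaxSize` (or an attached scaling policy), rather than a support ticket.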
As a counterpoint, some specific things which didn’t change:
Shared filesystem (NFS) for some content: while we really wanted to get rid of this part of our architecture, it would have been too time consuming to remove our reliance on it. We’ve instead marked it as something to address post-migration.
Logging infrastructure: we had a good look at Amazon Kinesis which looked to provide a new AWS-integrated log aggregation system. However, it turned out that there were irreconcilable problems with this approach, so we left the current system unchanged. Again, we’ll review this at a later date.
The vast majority of the Market codebase was untouched during the migration. Any code that didn’t need to be changed, wasn’t.
A key decision we made early on in the project was to manage our infrastructure as code. Traditionally, infrastructure is defined by disparate systems: routers, firewalls, load balancers, switches, databases, hosts, and rarely do these systems share a common definition language or configuration mechanism. That’s a major difference in AWS; everything is defined in the same way. We chose the AWS CloudFormation provisioning tool, which lets you define your infrastructure in “stacks”. The benefit is that our infrastructure is under source control; changes can be reviewed before being applied, and we have a history of all changes. We use CloudFormation to such an extent that we’ve written StackMaster to make working with stacks easier.
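As an illustration, StackMaster drives CloudFormation from a checked-in `stack_master.yml` file along the lines of the sketch below. The region, stack names, and template files here are invented for the example:

```yaml
# Hypothetical stack_master.yml; stack names and template files
# are made up for illustration.
stacks:
  us-east-1:
    market-web:
      template: market_web.rb
      tags:
        purpose: market-frontend
    market-db:
      template: market_db.rb
```

A proposed change can then be reviewed in source control, and StackMaster can show the difference between a template and the running stack before it is applied with a command along the lines of `stack_master apply us-east-1 market-web`.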
In Rackspace, our systems were spread over a small number of physical hosts, on which we were the only tenants. Contrast that to AWS, where our systems are spread out over hundreds of physical hosts which we share with other AWS customers. A consequence of the increased number of systems is an increased failure rate of individual servers. However, this can be mitigated by architecting with expected failures in mind:
As mentioned previously, all our instances are members of Autoscaling groups, which means they are automatically replaced if they become unhealthy.
Most systems are deployed to multiple physical locations, ensuring a problem (e.g. loss of power, cooling, or internet connectivity) at any one location does not affect the availability of the site. Those systems deployed to only a single location are able to run in any location, and when disrupted in one location can launch in another.
Managed services (Aurora and ElastiCache, most notably) are also configured to run in multiple locations, and are tolerant of the loss of a location.
Not only have we followed the cloud best practice of designing for failure, we’ve taken it a step further by researching possible failure scenarios, validating our assumptions, and where possible, optimizing our designs for quick recovery. Additionally, we’ve worked to create self-healing systems; many problems can be resolved without human intervention. This gives us the confidence that not only can we tolerate most failures, but when they do occur we can quickly recover.
Readers familiar with cloud architecture may ask, “why not multi-region?” This refers to running applications in multiple AWS regions. Even though we’ve architected for availability by running in multiple locations (availability zones) and storing our data in multiple regions, we still only serve customer traffic from a single region at a time. For availability and resiliency on a global scale, we could run out of multiple regions concurrently. Running a complex application like Market simultaneously in multiple regions is a hard problem, but it is on our roadmap.
The mandate from our CTO was clear: “optimize for safety.” Many of our community members depend on Market for their livelihoods; any data loss would be unacceptable. This requirement led to a hard decision: the migration would incur downtime – Market sites would be entirely shut down during the actual cutover from Rackspace to AWS.
While we would have liked to keep the Market sites open for business the entire time, there was no way to guarantee that every change – purchases, payments, item updates – would be recorded appropriately. This is due in large part to the fact that the source of truth for all this data, our primary database, was moving at the same time. Maintaining multiple writable databases is a very difficult problem to solve, and we opted to take the safer route and temporarily disable Market sites.
Months of planning led to the formation of a runsheet: a spreadsheet containing the details of every single change to be made during the cutover, including timing, personnel, specific commands, and every other detail required to make each change. Multiple rollback plans were made: instructions for undoing the changes in the event of a major failure.
The community was notified; authors were alerted, vendors were consulted, Envato employees informed. Preparation for the cutover day, scheduled for a Sunday morning (our time of lowest traffic and purchases), began the week prior. On Sunday morning, the team arrived (physically and virtually) and ran the plan. Market was taken down, the move commenced, and four and a half hours later, the sites were live on AWS! Not only live, but showing a small performance increase as well!
In the following app-level view from one of our monitoring systems, you can clearly see the spike in the middle of the graph showing the cutover, and the decreased (faster) response time following it:
In this browser-level view, you can again see the cutover at the same time, and following that the better-than-historical behavior of the new site:
While the sites have successfully been moved to AWS, we’re far from done. There is plenty of clean-up work to do, removing now-unused code and configuration. Our infrastructure at Rackspace needs to be decommissioned.
Another major task which will continue for some time is to start modifying the Market to take advantage of the AWS platform – or as it’s more commonly known, “drinking the kool-aid.” AWS provides many services, and we’ve only scratched the surface during the migration. As we continue to develop and operate the Market sites in AWS, we’ll evaluate these services and use them where it makes sense.
A factor that really contributed to the success of this migration was having the right team involved. The migration team had representatives from several parts of the business: the Customer group (owners of the Market sites themselves), the Infrastructure team (responsible for company-wide shared infrastructure), and the Content group (who look after all the content we sell on the sites). Having stakeholders from each area involved in the day-to-day work of the migration meant that we had confidence that everyone was up to speed and we weren’t missing any major components.
Another contributing factor was the “get it done” strategy we employed – the team was empowered to make the necessary decisions to complete the project. That’s not to say that we didn’t involve other people in the decision-making process, but we were able to avoid the “analysis paralysis” problem by not asking each and every team their opinion on how to proceed.
With a project of this scale, there will certainly be things that don’t go right. One area where we could have improved is communication. This project affected many teams at Envato, but our communication plan didn’t reflect that. Notifications were left until later in the project, and we didn’t communicate every detail we should have. Given another chance, communicating early and often to the rest of the company would have helped ensure everyone was on the same page and had all the information they required. Similarly, we didn’t communicate our plan to the community until the project was nearing its end; more lead time would have been helpful.
On cutover day, we had trouble with the database. Indeed, migrating the database was far and away the most complex part of the move. We had a detailed plan for it, but because the database contained live data, and because of the complexity surrounding it, we had no opportunity to practice that part of the migration. Finding a way, however difficult, to practice the database migration might have mitigated some of this trouble. Ultimately, though, we found solutions to the problems and the database was safely migrated without ever putting data at risk of loss or corruption.
Were we to offer any tips to the reader thinking about a similar migration, they’d be these:
First and foremost, understand your application. A solid understanding of what the app does and how it works is critical to a successful migration. Our biggest fear, happily unrealized, was of some unknown detail of our ten-year-old system that would show up and stop the show.
Get AWS expertise on board. There’s no substitute for experience, and having that experience in the team was critical. If necessary, send team members to training to build that knowledge, and make sure they practice it too.
Beware the shiny things! There are a lot of cool technologies in AWS, and it’s tempting to use them anytime you see a fit. This can be dangerous and distract from the migration goal. You can always revisit things once the project is complete.
Consider AWS Enterprise Support. It may seem expensive, but having a technical account manager (TAM) on call to answer your questions, or pass them to internal service teams when required, will save your team valuable time. The TAM will also analyze your designs, highlight potential problems, and help you address them before they become real ones. AWS also provides a service called Infrastructure Event Management (IEM), where the TAM is available during major events (e.g. migrations), proactively monitoring for issues and liaising with internal service teams in real time to address any problems that arise.
As this post has hopefully demonstrated, a lot of thought went into this migration. Due to the comprehensive planning the move went relatively smoothly. We’re now in a position to start capitalizing on our new platform and making Envato even better!
A follow-on post, To The Cloud in-depth, provides more detail on how our new systems work.
Update 2016-08-18: Thanks to John Barton for his correction.