Migrating edge network providers

Unbeknownst to our users, we recently migrated edge network providers. The move involved some particularly interesting problems that we needed to solve in order to migrate without impacting the availability or integrity of our services.

Before we get into how we made the move, let’s look at what an edge network actually does.

An edge network allows us to serve content to users from a physically nearby location. This lets us deliver content quickly and securely, because not every request has to travel back to our origin infrastructure, which may be physically distant from the user.

On the security front, using an edge provider allows us to perform security mitigations without tying up our origin infrastructure resources. This becomes quite important when we start talking about Distributed Denial of Service (DDoS for short) attacks, which aim to saturate your network and consume all of your compute resources, making it difficult for legitimate users to visit your site. By offloading the defense against malicious traffic and the mitigation work to a set of purpose-built servers distributed across the globe, you free up your origin resources to do what they need to do: serve your users.

DDoS + WAF

Malicious users are a very real threat and something we deal with on a daily basis. The majority of these attacks aren’t volumetric; however, they can still impact other users if they manage to generate enough requests to slow down a particular part of our service ecosystem.

Depending on the type of malicious traffic, we have two options: a Web Application Firewall (WAF) and DDoS scrubbing.

The WAF is used for most of our mitigations. It is a series of rule sets that have been developed over the years based on attacks we’ve seen against our services. We “fingerprint” a large sample of requests and then extract common traits, which we use to either block the traffic, tarpit the request or present a challenge that requires human interaction to proceed.
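
To make that concrete, here is a deliberately simplified, hypothetical sketch of the fingerprint-and-act idea in plain Ruby. The rule names, matched traits and actions below are purely illustrative; our real rule sets are far richer and are evaluated at the edge rather than in application code:

# Illustrative only: a toy take on fingerprint-based WAF rules.
Rule = Struct.new(:name, :matcher, :action)

RULES = [
  # Block a known-bad client signature hitting a sensitive path
  Rule.new('scripted-signin-abuse',
           ->(req) { req[:path].start_with?('/sign_in') && req[:user_agent] =~ /curl|python-requests/i },
           :block),
  # Tarpit clients missing headers that real browsers always send
  Rule.new('missing-accept-language',
           ->(req) { req[:headers]['Accept-Language'].nil? },
           :tarpit)
]

def evaluate(request)
  rule = RULES.find { |r| r.matcher.call(request) }
  rule ? [rule.action, rule.name] : [:allow, nil]
end

request = { path: '/sign_in', user_agent: 'python-requests/2.25', headers: {} }
p evaluate(request) # => [:block, "scripted-signin-abuse"]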

DDoS scrubbing comes into play when we face a highly distributed attack or are seeing high volumes of network traffic. The goal of this mitigation is to filter out the malicious requests (much like the WAF does); however, it is usually done far more aggressively and involves inspecting aspects of the traffic beyond just HTTP.

Prior to the move, these were two separate systems, each sitting in the request path in front of our origin.

This setup wasn’t perfect and led to a few issues.

  • Debugging was very difficult. To get to the bottom of any request issue, we needed to use both systems to piece together the full picture. While both systems had correlation IDs that we could map against each other, it was easy to get confused about which part of the request/response you were looking at in either system.
  • Coordinating changes was hard. As we added features to either our DDoS defense or our WAF, we needed to do extra work to ensure that rolling out changes in one would continue to work with the other.
  • API differences. We are big users and advocates of Infrastructure as Code; however, only some of our service providers offered this facility, and even then only for limited portions of their services. This resulted in us either needing to use the UI with manual reviews, or storing only part of the configuration in code, which added confusion about what lived where.
  • Getting blocked in one system could be misunderstood by the other. If a user managed to trigger one of our WAF rules, the DDoS system, which had some basic origin health-check capabilities, could read the response as the origin being under load and start throwing confusing errors. This got in the way of finding the real issue, as you would see errors from both systems instead of just one.

So, we set out to combine the two systems into a single port of call for our traffic mitigation needs.

DNS

We maintain both internal and external DNS services. For our internal DNS we use AWS Route53, as it is already well integrated with our infrastructure. Externally, however, we needed something that would do all the standard things, plus cloak our origin so that recursive lookups couldn’t discover it.

Something else we wanted to improve was the auditability of our DNS zone changes. Our existing DNS provider didn’t lend itself very well to managing records as code. This meant changes had to be staged in a UI and then posted to Slack channels for review by other engineers before being committed. Managing our DNS in code would help us level up our security practices: it keeps DNS changes easily searchable and helps mitigate vectors like dangling DNS vulnerabilities.
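
To give a feel for what records-as-code looks like, here is a minimal, hypothetical sketch using the aws-sdk-route53 Ruby gem (which we already use for our internal zones). The zone ID, record name and value are placeholders, and the same pattern applies to whichever provider hosts the external zones:

require 'aws-sdk-route53'

# Illustrative sketch: record definitions live in version control and are
# reviewed like any other code change before being applied.
RECORDS = [
  { name: 'www.example-zone.net.', type: 'CNAME', ttl: 300, value: 'origin.hostname.' }
].freeze

client = Aws::Route53::Client.new(region: 'us-east-1')

RECORDS.each do |record|
  client.change_resource_record_sets(
    hosted_zone_id: 'Z_PLACEHOLDER', # placeholder hosted zone ID
    change_batch: {
      changes: [{
        action: 'UPSERT',
        resource_record_set: {
          name: record[:name],
          type: record[:type],
          ttl: record[:ttl],
          resource_records: [{ value: record[:value] }]
        }
      }]
    }
  )
end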

Preparing for the move

One of our biggest concerns with migrating these services was conformity between the new and the old. Discrepancies between the two systems could cause subtle problems that, if not monitored, would snowball into bigger issues for us.

We decided to address this the same way we prevent regressions in our applications: we would build out a test suite. Our engineering teams are very autonomous, which meant the test suite needed to be easy for the majority of engineers to understand and use, since any of them could potentially be making changes and would need to verify behaviour.

After some discussions, we landed on RSpec. RSpec is already a well-understood framework in our test suites, and the majority of our teams use it daily. Even though RSpec would get us most of the way, we still needed to extend it with support for the HTTP interactions and conditions we expected. To do this, we wrote HttpSpec: our HTTP RSpec library that drives the underlying HTTP request/response lifecycle and provides a bunch of custom matchers for methods, statuses, caching, internal routing and protocol negotiation. Here is an example of something you might see in our test suite:

# HTTP/S
RSpec.describe 'themeforest.net' do
  it { is_expected.to redirect_http_to_https }
  it { is_expected.to support_tlsv1_2 }
end

# Caching
RSpec.describe 'themeforest.net/category/all' do
  it { is_expected.to be_browser_cachable }
  it { is_expected.to be_shared_cachable }
  it { is_expected.to serve_stale_while_revalidating(ttl: 60) }
end
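
HttpSpec itself is internal, but its matchers are ordinary RSpec matchers under the hood. As a rough idea of how a matcher like redirect_http_to_https can be built (a simplified sketch, not the actual implementation), relying on the described hostname string becoming the example’s subject:

require 'rspec'
require 'net/http'

# Simplified sketch; the real library also handles retries, timeouts,
# protocol negotiation and richer failure messages.
RSpec::Matchers.define :redirect_http_to_https do
  match do |hostname|
    response = Net::HTTP.get_response(URI("http://#{hostname}/"))
    response.is_a?(Net::HTTPRedirection) && URI(response['Location']).scheme == 'https'
  end

  failure_message do |hostname|
    "expected http://#{hostname}/ to redirect to an https:// URL"
  end
end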

This solved the issue for most of the functionality we were looking to port; however, we still didn’t have a solution for DNS. We started putting together a proof of concept that relied on parsing dig responses, but a short while later decided it wouldn’t scale to our configuration due to the number of variations that could be encountered. This prompted us to go in search of a more maintainable tool. Luckily for us, Spotify had already solved this problem and open sourced rspec-dns. rspec-dns was a great option for us since it could be integrated into our existing RSpec test suites and gave us the same benefits we wanted in our edge test suite. This is what our DNS tests looked like:

RSpec.describe 'themeforest.net' do
  # CNAME
  it { is_expected.to have_dns.with_type('CNAME').with_rdata('origin.hostname') }

  # TXT
  it { is_expected.to have_dns.with_type('TXT').with_data('v=spf1 include:spf.mandrillapp.com -all') }

  # MX
  it { is_expected.to have_dns.with_type('MX').with_exchange('aspmx.l.google.com').with_preference(1) }
end

Now that we had a way of confirming behaviour on both systems, we were ready to migrate!

Making the move

The second big issue we hit was that the two providers didn’t use the same terminology. This meant that a “zone” in provider A wasn’t necessarily going to be the same thing in provider B.

Remedying this wasn’t a straightforward process and required a fair amount of documentation diving and experimentation with both providers. In the end, we built a CLI tool that took the API responses from our old provider and mapped them to what our new provider expected in order to manage the equivalent resources. This greatly reduced the chance of human error when migrating these resources and ensured that we could reliably create and destroy resources over and over again. An upside of taking this automated approach is that we could couple resource creation with spec creation. For instance, if the CLI tooling found a DNS record in provider A, it would also update our specs to include an assertion based on what was going to be created in provider B (yay for free test coverage!).
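
In rough terms, the tool did something along these lines. The provider APIs, field names and helper methods below are placeholders rather than the real integration, but they show the shape of the mapping and the spec generation:

require 'json'

# Placeholder: imagine this wraps the old provider's export/REST API.
def old_provider_records(zone)
  JSON.parse(File.read("exports/#{zone}.json"))
end

# Translate the old provider's terminology into the new provider's.
def to_new_provider(record)
  {
    'name'    => record['hostname'],
    'type'    => record['record_type'],
    'ttl'     => record['ttl'],
    'content' => record['value']
  }
end

# Emit a matching rspec-dns assertion so every migrated record is test-covered.
def spec_assertion(record)
  "it { is_expected.to have_dns.with_type('#{record['type']}')" \
    ".with_rdata('#{record['content']}') }"
end

old_provider_records('themeforest.net').each do |old_record|
  new_record = to_new_provider(old_record)
  # create_in_new_provider(new_record)  # placeholder call to the new provider's API
  puts spec_assertion(new_record)
end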

Our approach to cutting over sites followed standard Test Driven Development process:

  • Create the expected tests
  • Run our test suite and see them all fail
  • Port over the functionality to the new provider
  • Re-run the test suite and see what is missing
  • Rinse and repeat

As an additional safety measure, we configured our edge and DNS test suite to run hourly (outside of regular Pull Request triggered builds) and to trigger notifications for any failures. This ensured we got constant feedback on a quickly changing system and would know straight away if we broke anything.
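
Conceptually, the scheduled job was nothing more than the normal suite with a notification bolted on; something like the sketch below, wired up to an hourly trigger in CI (the suite paths and webhook URL are placeholders):

#!/usr/bin/env ruby
# Illustrative sketch: run the edge and DNS suites, ping Slack on failure.
require 'net/http'
require 'json'
require 'uri'

suite_passed = system('bundle exec rspec spec/edge spec/dns')

unless suite_passed
  Net::HTTP.post(
    URI('https://hooks.slack.com/services/PLACEHOLDER'),
    { text: 'Hourly edge/DNS spec run failed - check the latest build' }.to_json,
    'Content-Type' => 'application/json'
  )
end

exit(suite_passed ? 0 : 1)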

To keep the blast radius of changes as small as possible while we gained confidence in the migration process, we migrated systems in order of their traffic and potential for customer impact. Taking this approach allowed us to give stakeholders confidence that we could bring over the larger systems without impacting users.

Once we were happy a site was working as expected, we would release it to our staff using split horizon DNS to gain confidence that nothing had been missed. If a regression was found, we’d go back through the TDD process until we were completely confident in our changes.
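
Split horizon here simply means our internal resolvers answered with the new provider while the public internet still saw the old one. A quick, hypothetical way to sanity-check both views from a staff machine (the internal resolver address is a placeholder):

require 'resolv'

# Compare the answer staff get from our internal resolver with what the
# public internet still sees during the split horizon period.
def cname_via(resolver, hostname)
  Resolv::DNS.open(nameserver: [resolver]) do |dns|
    dns.getresources(hostname, Resolv::DNS::Resource::IN::CNAME).map { |r| r.name.to_s }
  end
end

puts "internal: #{cname_via('10.0.0.2', 'themeforest.net')}" # staff view (new provider)
puts "public:   #{cname_via('8.8.8.8', 'themeforest.net')}"  # everyone else (old provider)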

After we were happy with the testing, we’d schedule some time to perform the public cutover. On cutover day, the migration team would jump into a Hangout, step through the runbook and monitor for any abnormal changes.

A caveat to note about DNS NS record TTLs: despite taking precautions such as lowering the NS TTLs weeks beforehand, the TTLs on NS records are largely ignored by most resolver implementations. This means that while you may cut over at 8am on a Monday, the change can see a long tail before full propagation is achieved; in our case, up to three or four days in some regions. For this reason we introduced a new 24x7 on-call roster that would help the system owners mitigate this should we have needed to roll back the cutover.
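
To keep an eye on that long tail, it helps to poll a handful of public resolvers and see which ones are still returning the old delegation; a minimal sketch (the resolver list is illustrative):

require 'resolv'

# Check which public resolvers have picked up the new NS delegation.
RESOLVERS = { 'Google' => '8.8.8.8', 'Cloudflare' => '1.1.1.1', 'Quad9' => '9.9.9.9' }.freeze

RESOLVERS.each do |label, address|
  nameservers = Resolv::DNS.open(nameserver: [address]) do |dns|
    dns.getresources('themeforest.net', Resolv::DNS::Resource::IN::NS).map { |ns| ns.name.to_s }
  end
  puts "#{label}: #{nameservers.sort.join(', ')}"
end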

Final thoughts

Embarking on a migration project for your edge network provider is no small feat and it definitely isn’t without risks. However, we are extremely pleased thus far with the improvements and added functionality that we have gained from the move.

In the future, we will be looking to integrate our edge provider more closely with our origin infrastructure. The intention is to automate away some of the manual intervention we currently perform when applying traffic mitigations. In the long term this will help us build a safer and more resilient Envato ecosystem.