Remedying the API gateway

To expose our internal services to the outside world, we use what is known as an API Gateway. This is a central point of contact for the outside world to access the services Envato Market uses behind the scenes. Taking this approach allows authors to leverage the information and functionality Envato provides on its marketplaces within their own applications without duplicating or managing it themselves. It also benefits customers who want to programmatically interact with Envato Market for their purchases instead of using a web browser.
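
To make this concrete, here is a minimal sketch of what a consumer request through the gateway can look like. The endpoint path and token handling below are illustrative assumptions rather than a definitive description of the current API surface.

```typescript
// Minimal sketch of a consumer hitting the public API through the gateway.
// The endpoint path and the personal token are illustrative placeholders.
const API_BASE = 'https://api.envato.com';

async function fetchMarketStats(personalToken: string): Promise<unknown> {
  const response = await fetch(`${API_BASE}/v1/market/total-items.json`, {
    headers: { Authorization: `Bearer ${personalToken}` },
  });

  if (!response.ok) {
    throw new Error(`Gateway responded with ${response.status}`);
  }
  return response.json();
}

// Usage: fetchMarketStats(process.env.ENVATO_TOKEN ?? '').then(console.log);
```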

The old API gateway

The previous generation API gateway was a bespoke NodeJS application hosted in AWS. It was designed to be the single point of contact for authentication, authorisation, rate limiting and proxying of all API requests. This solution was conceived one weekend as a proof of concept and was quickly made ready for production in the weeks that followed.

This solution worked well and allowed Envato to expose a number of internal services via a single gateway, removing the need for consumers to know which underlying service they were talking to and how to query it correctly.
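
To give a rough idea of the shape of that gateway (a simplified, hypothetical sketch rather than the actual application), authentication, rate limiting and proxying all lived in a single NodeJS process, along these lines:

```typescript
import express, { Request, Response, NextFunction } from 'express';

// Hypothetical sketch of a monolithic gateway: not Envato's actual code, just
// an illustration of authentication, rate limiting and proxying interleaved
// in one NodeJS process.
const app = express();

// Routing table mapping public path prefixes to internal services (made up).
const backends: Record<string, string> = {
  '/v1/market': 'http://market-service.internal',
  '/v1/catalog': 'http://catalog-service.internal',
};

const requestCounts = new Map<string, number>();

// Authentication: reject requests without a bearer token.
app.use((req: Request, res: Response, next: NextFunction) => {
  const token = req.headers.authorization?.replace('Bearer ', '');
  if (!token) return void res.status(401).json({ error: 'unauthorised' });
  next();
});

// Naive in-memory rate limiting keyed by client IP.
app.use((req: Request, res: Response, next: NextFunction) => {
  const key = req.ip ?? 'unknown';
  const count = (requestCounts.get(key) ?? 0) + 1;
  requestCounts.set(key, count);
  if (count > 1000) return void res.status(429).json({ error: 'rate limited' });
  next();
});

// Proxying: forward the request to the matching backend and relay the response.
app.use(async (req: Request, res: Response) => {
  const prefix = Object.keys(backends).find((p) => req.path.startsWith(p));
  if (!prefix) return void res.status(404).json({ error: 'unknown route' });

  const upstream = await fetch(`${backends[prefix]}${req.path}`, {
    headers: { authorization: req.headers.authorization ?? '' },
  });
  res.status(upstream.status).send(await upstream.text());
});

app.listen(8080);
```

Every request touches every concern in the same process, which is worth keeping in mind for the debugging story below.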

Here is an overview of how the infrastructure looked:

Whilst building a Ruby client for the Envato API, I noticed some niggling issues that I fixed internally. Throughout the whole process, however, I was getting intermittent empty responses from the gateway. This was annoying, but at the time I didn’t think much of it since my internet connection could have been to blame and there was no evidence of it being a known issue.

In March 2016, Envato experienced a major outage on the private API endpoints due to a change that incorrectly evaluated the authorisation step, resulting in all requests receiving a forbidden response. You can read the PIR for full details, but during this outage many of our authors got in touch and conveyed their justified frustrations. Off the back of this incident, we implemented a number of improvements to the API and created follow-up tasks to address issues that weren’t user facing but would help us answer some of the questions we had about the reliability of our current solution.

Following on from these discussions, a couple of our elite authors got in touch in April regarding ongoing connectivity issues with the API. They were experiencing random freezes in requests that would eventually time out without a response or warning. During the conversations they also mentioned they would occasionally see an empty body in responses. We spent a great deal of time investigating these reports and working with the elite authors to mitigate the issue as much as possible. We finally managed to track down some problematic requests and began trying to replicate the issue locally.

Even though we were able to eventually reproduce the issue locally, it was very difficult to isolate the exact cause of the problem for a number of reasons:

  • The single API gateway application had so many responsibilities that tracing a request showed it crossing concerns at every turn.
  • We were using third-party libraries for various parts of the functionality; however, the versions we were running were quite old and included many custom patches we had added along the way to fit our needs.
  • The proxying functionality (used for sending requests to the backends) didn’t perform a simple passthrough. There was a great deal of code papering over discrepancies in behaviour between backends, and the content was rewritten at various stages to conform to certain expectations (see the sketch after this list).
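
As a rough, hypothetical illustration of that last point, the rewriting logic looked conceptually like this (the field names and backend identifiers are made up):

```typescript
// Sketch of the kind of response rewriting that made the proxy hard to debug.
interface BackendResponse {
  statusCode: number;
  body: string;
}

function normaliseResponse(backend: string, upstream: BackendResponse): BackendResponse {
  const payload = JSON.parse(upstream.body);

  // One backend nests results under `data` while another returns them at the
  // top level, so the gateway rewrites both into a single expected shape.
  const results = backend === 'legacy-market' ? payload : payload.data;

  // Some backends signal errors with a 200 and an `error` field; translate
  // that into a proper HTTP status before responding to the client.
  if (results?.error) {
    return { statusCode: 502, body: JSON.stringify({ error: results.error }) };
  }

  return { statusCode: upstream.statusCode, body: JSON.stringify({ data: results }) };
}
```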

All of the above points were made even more difficult by the fact that we have very little in-house NodeJS expertise, and those who are familiar with it work primarily on the front-end components, not the backend, so this was new territory for them too.

After spending a few weeks trying to diagnose the issue, we realised we weren’t making enough headway and needed a better strategy. We got a few engineers together and started working on proposals to solve this for good. During the meeting we decided that, going forward, NodeJS wasn’t going to work for us and needed to be replaced with a solution that handled our production workload more effectively and that we knew how to run at scale.

The meeting produced the following action items:

  • Throw more hardware into the mix, with the aim of reducing the chance of hanging requests by balancing the load over a larger fleet of instances. While this wouldn’t solve the issue entirely, it meant our consumers would hit it less often.
  • Find a replacement for the NodeJS gateway. It needed to be better supported, designed in a way that gave us better visibility, and be highly scalable and fault tolerant.

The new API gateway

Immediately after the meeting we scaled out the API gateway fleet and saw a drop-off in hanging requests. While the issue wasn’t solved, we saw significantly fewer occurrences, which eased the pressure.

We started assessing our requirements for the new API gateway and came up with a list of bare minimums a solution had to meet before it would be considered viable:

  • Must isolate responsibilities. If a single component of the service was impaired, it should not impact the rest.
  • Must be manageable in version control. This was important for us since we are big fans of infrastructure as code, and all of our services take this approach to ensure we can rebuild our infrastructure reliably every time.
  • Must maintain 100% backwards compatibility with existing clients so that our consumers don’t need to rework their applications to fit our changes.
  • Must have great in-house support. If something goes pear-shaped, we have the skills to solve the problems.

After trialling some PaaS and in-house solutions, we landed on AWS API Gateway. It met all of our criteria and employed many AWS products we were already familiar with, which made the transition far smoother. One problem for us, however, was that much of the functionality we needed was still under development by AWS; for a long time we were building against a private beta of the service and hit various bugs that were still being addressed by the AWS teams.
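
A nice side effect of this choice is that the gateway definition itself can live in version control alongside the rest of our infrastructure code. As a rough sketch of what that can look like (this uses AWS CDK, which postdates this migration, and the resource names and backend URL are hypothetical):

```typescript
import { Stack, StackProps, aws_apigateway as apigateway } from 'aws-cdk-lib';
import { Construct } from 'constructs';

// Hypothetical sketch of a version-controlled gateway definition. This uses
// AWS CDK, which is newer than the migration described in this post; the
// resource names and backend URL are placeholders.
export class ApiGatewayStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const api = new apigateway.RestApi(this, 'PublicApi', {
      restApiName: 'public-api',
      deployOptions: { stageName: 'v1' },
    });

    // Each public path proxies through to an internal backend over HTTP.
    const market = api.root.addResource('market');
    market.addMethod(
      'GET',
      new apigateway.HttpIntegration('https://market-backend.internal.example/market'),
    );
  }
}
```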

We finally managed to ship a private beta of the service to a select few elite authors in late November, and after ironing out a few bugs we found, we dark launched the new gateway for public use in January.

Here is what the infrastructure and request flow look like (as of this writing):

This new infrastructure has allowed us to meet all the requirements we set out to meet, while also removing much of the confusion around which components are responsible for which concerns. When we make changes to a piece of this infrastructure, we know exactly what the impact will be and how best to mitigate it. The move has also brought improvements in scalability and resiliency: if we experience a request surge, the gateway infrastructure can now scale to meet demand instead of throwing errors because all the available resources have been exhausted.

While it’s still early days, we are far more confident in the API Gateway’s reliability. Since the move we have full visibility into each component, something that was lacking before and a major cause of frustration. Consequently, we are able to measure availability and act quickly when a component fails.

P.S. If you haven’t already, why not check out the Envato API?