On Thursday the 10th of January at around 3:25am AEDT (UTC+11) the Envato Marketplaces began suffering a number of failures. These failures caused searches to stop working, buy now purchases to be charged but not recorded, newly submitted items to not be processed, our review queues to be blocked by error pages and the site to be generally unstable.
We’d like to take the time to explain what happened, why so many seemingly unrelated areas of the app failed simultaneously and the measures we’ve put in place to try to prevent similar problems from occurring in the future.
On Thursday at around 3:00am our Solr server (the search platform used by most of the Marketplaces) began responding to requests very slowly.
This presented itself in a number of different ways. Users performing searches experienced slow response times, and several other features of the site (seemingly unrelated to search) began to have issues:
- Buy now purchases failed to actually record the purchase and give buyers their items.
- Our Delayed Job queues that handle pushing authors’ files from our servers up to Amazon S3 and sending out emails began to blow out to huge lengths, causing many jobs to not be processed.
- Reviewers were unable to approve items.
- Page load time rose to unacceptable levels, since all the app servers ended up waiting for the search server to time out before being able to serve other requests.
- To top it all off, our support system crashed.
At around 3:00am, the on-call developer for the week was paged by an automated trigger for excessively high CPU usage on our Solr server and immediately started to look into what was happening. He quickly discovered that search responses were extraordinarily slow, mixed in with many other seemingly unrelated failures across the application.
His first action was to disable Solr. This brought the rest of the site back with reduced functionality: our code handled errors that returned quickly, but not responses that were merely slow.
After an hour the second-in-line on-call developer came online to help. In order to isolate Solr from the application for diagnosis, they firewalled off Solr traffic. Under these conditions the Solr server appeared to respond to locally generated searches with normal response times, but the rest of the app continued to be broken in weird and wonderful ways. We realised later that this was because the firewall dropped the Solr traffic instead of rejecting it, so connections hung until they timed out, producing the same behaviour as slow responses.
Eventually, we shut down the Solr server completely, which fixed the response times, unblocked the review queues and allowed new buy now purchases to complete correctly. This was possible because our code that communicates with Solr gives up right away if the server is offline. What we hadn't accounted for was slow responses.
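To illustrate the distinction, here's a sketch of the two failure modes (the method name, endpoint and timeout values are made up for illustration, not our actual client code). A hard-down server raises `Errno::ECONNREFUSED` immediately, so the rescue fires fast; but without explicit timeouts, a server that accepts connections and responds slowly blocks the caller for `Net::HTTP`'s default 60-second read timeout:

```ruby
require "net/http"

# Hypothetical search helper. A refused connection fails instantly, but a
# slow response only fails fast because we set explicit timeouts.
def search_solr(host, port, query, open_timeout: 1, read_timeout: 1)
  http = Net::HTTP.new(host, port)
  http.open_timeout = open_timeout # cap time spent connecting
  http.read_timeout = read_timeout # cap time waiting on a slow response
  http.get("/solr/select?q=#{query}").body
rescue Errno::ECONNREFUSED, Net::OpenTimeout, Net::ReadTimeout
  nil # degrade gracefully: no search results instead of a hung request
end
```

With the defaults, every web request that touched search would instead have sat waiting on Solr, which is exactly what tied up our app servers.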
This left most of the Marketplaces without search, our Delayed Job queues still growing, and support still broken.
Another team in the business maintains our support server, and they were separately alerted that it was down. They found it in a frozen state, probably caused by the extra load from new support requests prompted by the other failures. Their on-call developer forced the machine to reboot, which brought it back up, but its database came back corrupted. Reloading the latest database backup brought it back online and allowed people to tell us that everything else was broken.
At about 8AM developers started arriving in the office, allowing our pair of on-call developers to hand over to some fresh minds and recuperate with some well-deserved sleep. They wrote up a gist containing the night's IM conversation history, a list of known problems and what had been figured out to date. That made it much simpler to bring each developer up to speed as they arrived in the office.
We discussed the situation and organised to have pairs of developers working to concurrently diagnose and fix the many problems we were seeing.
Buy Now Purchase Problems
The investigation into our buy now purchase problems found that, inside the database transaction, just after completing a buy now purchase and acknowledging it with PayPal, the system incremented the sales count on the item.
This seemingly innocent action triggered a callback to re-index the item in Solr. Solr responded very slowly, slower than our maximum allowed time for a web response, causing the request to be dropped and the database transaction to be rolled back.
The immediate fix was to push the Solr re-indexing into Delayed Job, so that purchases complete quickly no matter how long indexing takes.
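A minimal sketch of the change, with hypothetical class names and a toy queue standing in for Delayed Job. The sales-count callback now only enqueues a re-index job, so the purchase transaction commits quickly regardless of Solr's health:

```ruby
# A background job that re-indexes one item; runs later, in a worker.
SolrReindexJob = Struct.new(:item_id) do
  def perform(solr)
    solr.reindex(item_id)
  end
end

class Item
  attr_reader :id, :sales_count

  def initialize(id, job_queue)
    @id, @sales_count, @job_queue = id, 0, job_queue
  end

  def record_sale!
    @sales_count += 1
    # Before the fix this called Solr synchronously, inside the purchase
    # transaction, so a slow response rolled the whole purchase back.
    @job_queue << SolrReindexJob.new(id)
  end
end
```

The key property is that `record_sale!` no longer does any network I/O: the slowest thing it touches is the job table.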
The guys working on this problem then devised a way to find the buy now purchases affected by this, re-verify with PayPal that they had made the charge, and complete the purchase at our end giving buyers access to their items.
Item Processing and Asset Uploads
We examined the contents of the Delayed Job queue and it immediately became obvious why our asset upload jobs were not completing, and why submission processing had stopped. Solr re-index jobs had been queued with default priority, which happens to be higher priority than our item submission processing, S3 uploads etc.
A Solr re-index job would tie up a worker until the Solr server responded or the request timed out, until eventually every worker was tied up trying to update Solr. This exacerbated the load on our Solr server and stopped our Delayed Job queues from doing any useful work.
We fixed this by temporarily lowering the priority on this type of job, so that any other type of job was worked off before workers started getting blocked trying to update Solr.
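A toy illustration of the priority fix, using Delayed Job's convention that a lower number runs first (the job names and numbers here are made up). Demoting Solr re-index jobs below uploads and submission processing means a worker only risks blocking on Solr once everything else has been worked off:

```ruby
Job = Struct.new(:name, :priority)

# Pick the next job the way a worker would: lowest priority number first.
def next_job(queue)
  queue.min_by(&:priority)
end

queue = [
  Job.new(:solr_reindex, 20),       # demoted below everything else
  Job.new(:s3_upload, 5),
  Job.new(:process_submission, 10),
]
```

Here the S3 upload and submission processing both run before any worker goes near Solr.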
Review Queue Errors
Finally, we examined the review queues and noticed that on approval of an item, its de-normalized representation is updated or created. This triggers a callback to index the item in Solr in-request, which again was taking so long that the request would time out and the reviewer would get an error page.
The final piece of the puzzle, why the Solr server slowed down so dramatically, wasn't solved in such a satisfying manner. Load looked normal, and individual requests to the machine were answered in normal time.
After poking around with it for the better part of the day, and still unable to explain why it was behaving the way it was, we decided to reboot the server “just in case”. After the reboot, we let traffic from a single web server through to Solr and found that performance seemed normal again, so we slowly ramped up how many web servers we allowed through to Solr until it was again serving full load with normal response times.
Thus proving you should always try turning it off and on again.
Various theories floated around the office as to the root cause; was it the leap second from 7 months earlier, suddenly causing problems on a machine that was never rebooted? Had we crossed some invisible point above which load becomes unstable? Was it caused by cosmic rays?
We’re sad to say we’re still not sure. Unexpected failures can and do happen no matter how much engineering you do. This event highlights our need to do better at handling those failures.
Preventing it from Happening Again
The take-home lesson for us is that we need to engineer the Marketplaces to be better isolated from the services they depend on, even the ones that are on our internal network and run by us. Solr may slow down on us again in the future (and we’ll be much quicker to try rebooting the server if it does), but it shouldn’t affect the Marketplaces.
This incident really highlighted how a slow response from a service can be worse than no response at all. There are some great sections on why this is and how to defend against it in "Release It!" by Michael Nygard, specifically sections 4.9, 5.1 and 5.2.
We've since wrapped the calls that modify Solr's index with much shorter timeouts, so web requests finish in a timely manner even if Solr is misbehaving, and added circuit breakers so we don't waste time calling Solr when it's already responding too slowly to be useful.
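Here's a minimal circuit breaker sketch; the class name, thresholds and cooldown are illustrative, not our production values. After `threshold` consecutive failures the breaker opens and calls raise immediately for `cooldown` seconds, so requests stop queueing up behind a service that is already known to be sick:

```ruby
class CircuitBreaker
  class OpenError < StandardError; end

  def initialize(threshold: 3, cooldown: 30)
    @threshold, @cooldown = threshold, cooldown
    @failures, @opened_at = 0, nil
  end

  # Runs the block, tracking consecutive failures. While the breaker is
  # open, the block is skipped entirely and OpenError is raised at once.
  def call
    raise OpenError, "skipping call: circuit open" if open?
    begin
      result = yield
      @failures = 0 # any success resets the count
      result
    rescue
      @failures += 1
      @opened_at = Time.now if @failures >= @threshold
      raise
    end
  end

  def open?
    !@opened_at.nil? && (Time.now - @opened_at) < @cooldown
  end
end
```

A real breaker would also want a half-open state to probe the service after the cooldown, but even this simplified version converts "every request waits on a timeout" into "one request per cooldown waits on a timeout".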
We’re taking this lesson and applying it to all of our services, to hopefully prevent any similar failures in the future.