How we tracked down Ruby heap corruption in amongst 35 million daily requests

Back in November 2015, one of the Envato Market developers made a startling discovery - our exception tracker was overrun with occurrences of undefined method exceptions where the target classes were NilClass and FalseClass. These types of exceptions are often a symptom of Ruby code that doesn't account for a case where the data being accessed returns nil or false. For our users, this manifested as our robot error page letting you know that we had encountered an issue. It was a particularly hairy scenario to be in because the exceptions we were seeing were not legitimate failures: replaying the requests never reproduced the error, and code inspection showed the values could never be set to nil or false.

It’s worth noting that during the assessment of these exceptions we were able to confirm that the security of Envato Market and of our community’s data was not impacted. This was a notion we continuously challenged and re-verified throughout every step of our investigations; if at any stage it was not clear, we stopped and tightened third-party checks and monitoring until we were certain.

These exceptions were harder to track down than the regular errors that manifest at a single point in the application. Our error tracker showed the errors had begun sometime in October, although we could not isolate a specific deployment or change that matched that timeframe. Initially we tried upgrading and downgrading recently introduced gem versions, and when none of that stopped the errors, we rolled our Gemfile back to previous months' versions with no success. We also read close to every article on the internet about Ruby heap corruption and tried all the proposed solutions, without any luck.

To add to the problem, this issue only occurred in production. Attempts to replicate it locally or in our staging environments never triggered the error - even replaying production traffic through another environment didn’t reproduce the issue. In production it was very difficult to create a reproducible test case, as we did not know exactly where the problem originated. About midway through the investigation we thought we had a script to reproduce it, but it turned out to be simply running into the exceptions, much like our usual traffic.

Our suspects

Premature garbage collection

From the early investigations, we suspected premature GC due to valid pointers disappearing mid-call. This became more suspicious when upgrading to MRI 2.2.x and its generational GC made our situation worse. To dig further, we instrumented Ruby GC stats with Graphite and watched for any unusual change in object allocations and heap health. However, we could never find anything that pointed to a GC problem, despite spending a large amount of our time investigating and tuning Ruby’s GC behaviour.
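For illustration, here is a minimal sketch of the kind of instrumentation involved, assuming a statsd-style client that forwards gauges to Graphite; the client, metric names and sampling interval are illustrative rather than what we actually ran.

# Periodically sample GC statistics and forward them to Graphite as gauges.
# Assumes the statsd-ruby gem; any client exposing #gauge would work.
require 'statsd'

statsd = Statsd.new('localhost', 8125)

Thread.new do
  loop do
    stats = GC.stat
    # object allocation and heap health indicators
    statsd.gauge('ruby.gc.total_allocated_objects', stats[:total_allocated_objects])
    statsd.gauge('ruby.gc.heap_live_slots', stats[:heap_live_slots])
    statsd.gauge('ruby.gc.heap_free_slots', stats[:heap_free_slots])
    # minor vs major collection counts
    statsd.gauge('ruby.gc.minor_gc_count', stats[:minor_gc_count])
    statsd.gauge('ruby.gc.major_gc_count', stats[:major_gc_count])
    sleep 10
  end
end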

Unicorn

Envato Market runs unicorn as its application server, and suspicion was raised after we traced a series of requests back to the same parent process. Since unicorn uses forking and copy-on-write, we thought there was a chance that the parent process was becoming corrupt and passing the corruption on to its child processes. We lowered the worker count to 1 and attached rbtrace to the parent and child processes, but came up with nothing that looked like corruption and never managed to capture a segmentation fault in the watched processes, so we were able to rule this out.
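For reference, a rough sketch of that debugging setup is below - the file contents are illustrative rather than our production configuration.

# config/unicorn.rb - a debugging sketch, not our production config.
# A single worker means all traffic flows through one forked child,
# which makes the parent/child pair easier to observe.
worker_processes 1
preload_app true  # workers are forked from a preloaded parent (copy-on-write)

# rbtrace must be required inside the processes you want to inspect;
# it can then be attached from a shell with `rbtrace -p <pid>`.
require 'rbtrace'

after_fork do |server, worker|
  # log each worker's pid so we know which process to attach to
  server.logger.info("worker #{worker.nr} running as pid #{Process.pid}")
end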

C extensions

We use a handful of gems that rely on C extensions to do the low-level work quickly and efficiently. Using C is a double-edged sword: used well it delivers excellent performance, whereas used poorly (especially unintentionally!) it produces issues that are very difficult to diagnose unless you are well versed in the language and environment. For the most part we relied on valgrind and manual review to identify anything we thought could be the cause. To get a list of all the gems we needed to inspect, we used the following snippet, which returns the gems that have an ext directory.

# list every bundled gem that ships a C extension (i.e. has an ext/ directory)
puts `bundle show --paths`.split("\n").select { |dep| File.directory?("#{dep}/ext") }.map { |dep| dep.split('/').last }.join("\n")

Once we had a list of potential culprits, we set off reviewing each one individually and running it through valgrind, looking for memory leaks and anything that didn’t quite look right and might lead us to a heap corruption scenario. Our investigations here didn’t solve our specific issue, however we did submit upstream patches to the projects where we identified potential problems.

Our Ruby build process

Before the Envato Market application moved to MRI 2.x, we ran a custom Ruby build with a series of patches that required us to build and maintain our own fork of Ruby. Ruby is in a far better place now, so we rarely need to patch it, but we still maintain this process because it allows us to package up our version of Ruby and ship it to our hosts as a single package, eliminating the need to download and build on individual hosts.

During the investigations there were concerns that the version of Ruby we were building and deploying was being corrupted at some stage, and that this was what we were seeing as segmentation faults in production. To troubleshoot further, we tried to build our custom version with additional debugging flags, but quickly discovered they were not being applied as we had assumed they were in the past. After spending a bit of time digging into the issue, we identified the cause as the position in which we were passing the flags: they were inadvertently being trampled by ruby-build’s default options. We fixed this within our packaging script, and shortly afterwards we were able to verify that our version of Ruby, once packaged and deployed to our hosts, matched upstream Ruby.
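To illustrate the idea behind the fix, here is a rough sketch of driving ruby-build from a packaging script so the flags cannot be trampled, assuming ruby-build’s documented RUBY_CONFIGURE_OPTS environment variable; the version, install prefix and flags are made up for the example.

# A sketch of invoking ruby-build with extra configure options passed via the
# environment rather than positionally. optflags/debugflags are standard MRI
# configure variables; the values below are illustrative only.
env = { 'RUBY_CONFIGURE_OPTS' => 'optflags=-O0 debugflags=-ggdb3' }

version = '2.2.4'
prefix  = "/opt/rubies/ruby-#{version}-debug"

system(env, 'ruby-build', version, prefix) or abort "ruby-build failed for #{version}"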

Getting help

After a few months of trying and eliminating everything we could think of, we reached out for external help. We had spent a bit of time analysing our C extensions, but we didn’t feel we had gone deep enough with our investigations. To get more insight into the C side of things, we teamed up with James Laird-Wah and started going through our core dumps and the gems that relied on C extensions.

After many long hours of debugging and stepping through core dumps, we found a smoking gun in the form of a Nokogiri release. Nokogiri RC3 introduced a patch for a nasty edge case where, if you have libxml-ruby and Nokogiri loaded in the same application, you can encounter segmentation faults due to the way some fields were managed. Updating to Nokogiri RC3 saw the number of segmentation faults drop to a third of their previous counts! Looking at this behaviour further, we identified that the cause of the edge case was libxml-ruby managing fields regardless of their ownership. To address the remaining segmentation faults, we got together with James and formulated a patch to ensure libxml-ruby would only manage the fields it explicitly owned. We tested it and deployed it to production, where we monitored it closely for 24 hours. Lo and behold, 0 segmentation faults! We’d finally found the cause of our issues! Excited with our discovery, we pushed the patch upstream and it’s now available at xml4r/libxml-ruby#118.

Ensuring this doesn’t happen again

As with most regressions, proactive monitoring and insight into your application’s normal behaviour are the best defence against long-running issues like this. To make sure we are not stung by this again, we are taking the following measures:

  • Implementing alerting based on any occurrences of segmentation faults within our applications. Our applications should never hit segmentation faults, and if they do, we need to treat them as high priority and assign engineers to resolve them (a rough sketch of this kind of check follows this list).
  • Reviewing the dependency graph on each bundle update so we can spot lingering dependencies and remove them. The gem we were using that leveraged libxml-ruby has since been refactored out, but its relationships and dependencies were never cleaned up once it was removed.
  • Better monitoring and roll-ups of exceptions on a team-by-team basis. We are aiming to better integrate our developer tooling with our exception tracker so we can quickly identify an increase in exceptions and work on a resolution before they become a bigger issue.
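On the first point, here is a minimal sketch of the kind of check involved, scanning an application log for MRI’s crash banner; the log path and alerting helper are illustrative assumptions rather than our actual tooling.

# Scan a unicorn/application log for MRI's segfault banner and raise an alert.
# The log path and `page_on_call_team` helper are placeholders.
CRASH_BANNER = /\[BUG\] Segmentation fault/

def page_on_call_team(message)
  # assumption: wire this up to your alerting tool of choice
  warn "ALERT: #{message}"
end

log_path = '/var/log/app/unicorn.stderr.log'
faults   = File.foreach(log_path).grep(CRASH_BANNER).count

page_on_call_team("#{faults} segmentation fault(s) found in #{log_path}") if faults > 0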