Back in November 2015, one of the Envato Market developers made a startling discovery: our exception tracker was overrun with occurrences of `undefined method` exceptions whose target classes were `NilClass` and `FalseClass`. These types of exceptions are often a symptom that you’ve written some Ruby code and not accounted for a particular case where the data you are accessing returns `nil` or `false`. For our users, this manifested as our robot error page letting you know that we had encountered an issue. It was a particularly hairy scenario to be in because the exceptions we were seeing were not legitimate failures: replaying the requests never reproduced the error, and code inspection showed the values could never be set to `nil` or `false`.
It’s worth noting that during the assessment of these exceptions we were able to confirm that the security of Envato Market, and that of our community’s data, was not impacted. This was a notion we continuously challenged and re-verified at every step of our investigation, and if at any stage it was not clear, we stopped and tightened third party checks and monitoring until we were certain.
These exceptions were harder to track down than the regular errors that manifest at a single point of the application, and our error tracker showed they had begun sometime in October, although we could not isolate a specific deployment or change matching that timeframe. Initially we tried upgrading and downgrading recently introduced gem versions, and when none of these stopped the errors, we rolled our `Gemfile` back to previous months with no success. We also read close to every article on the internet related to Ruby heap corruption and tried all the proposed solutions without any luck.
To add to the problem, this issue was only occurring in production. Attempts to replicate it locally or in our staging environments never produced the error; even replaying production traffic through another environment didn’t trigger the issue. In production it was very difficult to create a reproducible test case because we did not know exactly where the problem originated. About midway through the investigation we thought we had a script to reproduce it, but it turned out to simply be running into the exceptions much like our usual traffic was.
Our suspects
Premature garbage collection
From the early investigations, we suspected premature garbage collection, because valid pointers were disappearing mid call. This became more suspicious when upgrading to MRI 2.2.x, and with it generational GC, made our situation worse. To dive further into this we instrumented Ruby’s GC stats with Graphite and watched for any unusual change in object allocations and heap health. However, we could never find anything that pointed to a GC problem, despite spending a large amount of our time investigating and tuning Ruby’s GC behaviour.
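To give a sense of what that kind of instrumentation looks like, here is a minimal sketch that periodically pushes a few `GC.stat` values to a statsd-style daemon feeding Graphite. The client, metric names and interval are illustrative assumptions, not our actual setup.

```ruby
# Sketch: report a few MRI GC counters to statsd/Graphite on an interval.
# Assumes the statsd-ruby gem and a statsd daemon relaying into Graphite.
require "statsd"

statsd = Statsd.new("localhost", 8125)

Thread.new do
  loop do
    stats = GC.stat
    statsd.gauge("ruby.gc.minor_count",     stats[:minor_gc_count])
    statsd.gauge("ruby.gc.major_count",     stats[:major_gc_count])
    statsd.gauge("ruby.gc.heap_live_slots", stats[:heap_live_slots])
    statsd.gauge("ruby.gc.old_objects",     stats[:old_objects])
    sleep 10
  end
end
```

Watching counters like these over time is enough to spot a heap that is growing abnormally or a sudden change in the ratio of major to minor collections.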
Unicorn
Envato Market runs unicorn as the application server, and suspicion was raised after we traced a series of requests back to the same parent process. Since unicorn uses forking and copy-on-write, we thought there was a chance that the parent process was becoming corrupt and then passing the corruption on to its child processes. We lowered the worker count to 1 and attached `rbtrace` to the parent and child processes, but came up with nothing that looked like corruption and never managed to capture a segmentation fault in the watched processes, so we were able to rule this out.
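For reference, a rough sketch of that setup; the file path and values here are assumptions rather than our exact configuration.

```ruby
# config/unicorn.rb (sketch)
# A single worker keeps the parent/child relationship simple to observe.
worker_processes 1
preload_app true    # workers are forked from a preloaded parent (copy-on-write)

# Requiring rbtrace in the master (it is inherited by forked workers)
# allows `rbtrace -p <pid>` to attach to the running parent or a worker.
require "rbtrace"
```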
C extensions
We use a handful of gems that rely on C extensions to perform the low level work in a speedy and efficient manner. Using C is a double edged sword: using it well results in excellent performance, whereas using it poorly (especially unintentionally!) results in issues that are very difficult to diagnose unless you are well versed in the language and environment. For the most part we relied on valgrind and manual review to identify anything that we thought could be the cause. To get a list of all the gems we needed to inspect, we used a snippet that returns the gems which have an `ext` directory.
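A minimal sketch of that kind of check, written here against Bundler’s loaded specs (the exact snippet we used may have differed):

```ruby
# Sketch: list the gems in the current bundle that ship an ext/ directory,
# i.e. the gems that build native C extensions.
require "bundler"

Bundler.load.specs.each do |spec|
  puts spec.name if Dir.exist?(File.join(spec.full_gem_path, "ext"))
end
```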
Once we had a list of potential culprits, we set off reviewing each one individually and running it through valgrind, looking for memory leaks and anything that didn’t quite look right and might lead us to a heap corruption scenario. Our investigations here didn’t solve our specific issue, however we did submit some upstream patches to the projects where we identified potential issues.
Our Ruby build process
Before the Envato Market application got onto MRI 2.x, we ran a custom Ruby build with a series of patches that required us to build and maintain our own fork of Ruby. Ruby is in a far better place now, so we rarely need to patch it, but we still maintain this process as it allows us to package up our version of Ruby and ship it to our hosts as a single package, eliminating the need to download and build on individual hosts.
During the investigations there were concerns that the version of Ruby we were building and deploying was being corrupted at some stage, and that this was what we were seeing as the segmentation faults in production. To troubleshoot this further, we tried to build our custom version with additional debugging flags, but quickly discovered they were not being applied as we had thought they were in the past. After spending a bit of time digging into the issue, we identified the cause: the position in which we were passing the flags meant they were inadvertently trampled by ruby-build’s default options. We fixed this within our packaging script, and shortly after we were able to verify that our version of Ruby matched upstream Ruby once it had been packaged up manually and deployed to our hosts.
Getting help
After a few months of trying and eliminating everything we could think of, we reached out for external help. We had spent a bit of time analysing our C extensions, but we didn’t feel we had gone deep enough with our investigations. To get more insight into the C side of things, we teamed up with James Laird-Wah and started going through our core dumps and the gems that relied on C extensions.
After many long hours of debugging and stepping through core dumps, we found a smoking gun in the form of a Nokogiri release. Nokogiri RC3 introduced a patch for a nasty edge case where, if you have `libxml-ruby` and Nokogiri loaded in the same application, you could encounter segmentation faults due to the way some fields were managed. Updating to Nokogiri RC3 saw the number of segmentation faults drop to a third of their previous counts! Looking at this behaviour further, we identified that the cause of the edge case was the way `libxml-ruby` managed the fields regardless of the ownership on them. In order to address the remaining segmentation faults, we got together with James and formulated a patch to ensure `libxml-ruby` would only manage the fields it explicitly owned. We tested it and deployed it to production, where we monitored it closely for 24 hours. Lo and behold, 0 segmentation faults! We’d finally found the cause of our issues! Excited by our discovery, we pushed the patch upstream and it’s now available at xml4r/libxml-ruby#118.
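The edge case only required both libraries to be loaded in the same process. A trivial illustration of that precondition (not a reproduction, since the faults were nondeterministic and built up over time):

```ruby
# Both gems wrap the same underlying libxml2 library. With both loaded in
# one process, disagreements over who manages certain fields could
# eventually corrupt the heap, as described above.
require "libxml"
require "nokogiri"

LibXML::XML::Document.string("<root/>")
Nokogiri::XML("<root/>")
```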
Ensuring this doesn’t happen again
As with most regressions, proactive monitoring and insight into your application’s normal behaviour are the best defence against long running issues like this. To make sure we are not stung by this again, we are taking the following measures:
- Implementing alerting on any occurrence of segmentation faults within our applications. Our applications should never be hitting segmentation faults, and if they are, we need to treat them as high priority and assign engineers to resolve them.
- Reviewing the dependency graph on each `bundle update` to spot lingering dependencies and possible removals. The gem we were using that leveraged `libxml-ruby` has since been refactored out, but the relationships and dependencies were never cleaned up once it was removed. A sketch of that kind of check follows this list.
- Better monitoring and roll ups of exceptions on a team by team basis. We are aiming to better integrate our developer tooling with our exception tracker to ensure we can quickly identify an increase in exceptions and work on a resolution before they become a bigger issue.
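As an illustration of the dependency review, here is a minimal sketch that reports which gems in the lockfile still pull in a given dependency; the target gem name is just an example.

```ruby
# Sketch: report which gems in Gemfile.lock depend on a given gem, so that
# lingering dependencies (as libxml-ruby was in this case) stand out on review.
require "bundler"

target   = "libxml-ruby" # substitute the dependency you care about
lockfile = Bundler::LockfileParser.new(File.read("Gemfile.lock"))

lockfile.specs.each do |spec|
  if spec.dependencies.any? { |dep| dep.name == target }
    puts "#{spec.name} #{spec.version} depends on #{target}"
  end
end
```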