Happily upgrading Ruby on Rails at production scale

The Envato marketplace sites recently upgraded from Rails 2.3 to Rails 3.2. We did this incrementally at full production scale, while handling 8000 requests per minute, with no outages or problems. The techniques we’ve developed even let us seamlessly and safely experiment with mixing Rails 4 servers into our production stack in the near future.

We wanted to be able to confidently make the huge version jump without having to do an all-or-nothing cutover. To achieve this, we made a number of modifications that allowed us to run Rails 3.2 servers on our load balancer side by side with all of our 2.3 servers. This let us gradually build confidence in the new version, with far lower risk of our users receiving a bad experience.

Concurrent Rails

We’ve released the patches we used as a gem rails_4_session_flash_backport (github) that magically lets Rails 2, 3, and 4 servers live happily side by side.

If you are still stuck on a Rails 2.3 app, this should help kick-start your upgrade to Rails 3 (and beyond to 4 if you’re ready).

This post will go into the technical details around making this upgrade as smooth as it was.

History

The Envato Marketplaces are actually quite an old code base, started back in February 2006 on Rails 1.0. It has seen its fair share of hairy Rails upgrades over the years, but the changes in the framework between Rails 2.3 and 3.0 were pretty much a rewrite. For a long time it felt like we’d accrued too much technical debt from previous upgrades to make the jump without giving the code base a lot of love first.

We were stuck on 2.3 for quite a while, and we let 3.0, 3.1 and a large number of 3.2 releases slip by. Eventually the pain of being on such an old, unsupported version of Rails became great enough that we were able to get approval to pay back some of our technical debt and tackle some long-overdue upgrades.

One of the things we did to build up our own confidence, and the confidence of the business, in our upgrade to Rails 3.2 was to create patches for both Rails 2.3 and Rails 3.2 that allowed requests to bounce seamlessly between servers running either version. This let us run a single Rails 3 server amongst the many Rails 2 servers for short bursts, so that even if the Rails 3 server failed catastrophically it would only cause a small percentage of user-visible failures. It also allowed us to test for performance and reliability without having to do an all-in switch at the first smoke test.

Problem 1 - Changes to SessionHash

The SessionHash is where we store the details that identify which user is logged in, so we can determine which parts of the site they can access, show them their own profile settings, etc.

In Rails 2, the SessionHash (source) was much closer to a plain old hash: if you stored something in session[:foo] and tried to access it via session["foo"], you’d get nil. In Rails 3, the SessionHash (provided by Rack, source) started acting like a HashWithIndifferentAccess, so both session["foo"] and session[:foo] return the same thing.
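To illustrate the difference (a contrived example simulating the two behaviours with plain Ruby, not code from our app):

```ruby
# Rails 2.3 behaviour: the session acts like a plain Hash, so key type matters.
rails2_session = {}
rails2_session[:user_id] = 42
rails2_session[:user_id]   # => 42
rails2_session["user_id"]  # => nil

# Rails 3.x behaviour: keys get indifferent access, simulated here with
# ActiveSupport's HashWithIndifferentAccess.
require "active_support/hash_with_indifferent_access"
rails3_session = ActiveSupport::HashWithIndifferentAccess.new
rails3_session[:user_id] = 42
rails3_session[:user_id]   # => 42
rails3_session["user_id"]  # => 42
```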

This meant that taking a session from Rails 2 to Rails 3 worked correctly: the newer version of Rails didn’t care whether we had stored the session_id under a string or a symbol key, it could happily find it, and we’d end up with the same session_id we had on Rails 2.

The problem was that once a session goes through Rails 3, all of the keys in the SessionHash are stored as strings, meaning that if we take the session back to Rails 2 and it looks for session[:session_id], we get nil and a new session_id is generated, discarding the old session_id and logging out the user.

This is obviously not desirable, as it means any user who hits our Rails 3 test server is very likely to be logged out on their next request: our load balancer is not configured for “sticky sessions” (where a user’s requests always hit the same server), so the next request will probably hit a Rails 2 server.

To deal with this, we wrote a patch for ActionController::Session on Rails 2.3 which approximated the behaviour of Rails 3 closely enough that requests could bounce back and forth between versions and the user remained logged in with the same session_id.

The basic idea is that we store everything in the SessionHash as strings, and when looking things up we try the string key first, falling back to the symbol key if nothing is found (i.e. if it’s a SessionHash written by a vanilla Rails 2.3 server).
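In spirit, the behaviour looks something like this (a minimal, self-contained sketch of the idea, not the gem’s actual code):

```ruby
# Sketch of the lookup behaviour we back-ported onto the Rails 2.3 SessionHash:
# always write string keys (like Rails 3), and on reads fall back to the symbol
# key so data written by a vanilla Rails 2.3 server can still be found.
class IndifferentSession < Hash
  def []=(key, value)
    super(key.to_s, value)            # store everything under string keys
  end

  def [](key)
    return super(key.to_s) if key?(key.to_s)
    super(key.to_sym)                 # fall back for vanilla Rails 2.3 data
  end
end

session = IndifferentSession.new
session[:session_id] = "abc123"
session["session_id"]  # => "abc123"
session[:session_id]   # => "abc123"
```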

Problem 2 - Changes to How Flash Messages are Stored

In Rails, there is the concept of a flash, which is a short-term storage place for messages/errors/etc. that will be displayed to a user on their next page view. For example, when you edit your profile and it saves successfully, a message is passed to the next page via the flash and displayed to say everything went according to plan.

The class used to do this is marshalled into a binary format in the session and then unmarshalled on the next request. The problem here was that the class used for flashes completely changed between Rails 2 and 3, meaning that if you attempted to unmarshall a Rails 2 session with a flash object stored in it on a Rails 3 server (or vice versa), the request would blow up with an ActionDispatch::Session::SessionRestoreError, complaining that it had found an object of a class that isn’t defined anywhere.

In Rails 4 pre-release builds, this practice has stopped and the flash is now stored in the session as a basic Ruby hash, meaning it’ll happily unmarshall on any version of Rails, even if the version of Rails doesn’t know how to get the flash message out from that data structure.

We think this is a much better approach, so we back-ported this new method of flash serialization to both Rails 2.3 and Rails 3.2.
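The net effect is that, instead of a marshalled flash object, the session holds plain Ruby data. A minimal sketch of the idea, assuming keys along the lines of what the Rails 4 pre-releases use (the gem follows the real Rails 4 format):

```ruby
# Turn the flash messages into plain Ruby data before they go into the session,
# and rebuild them from that data on the way back out.
def flash_to_session_value(flashes)
  { "discard" => [], "flashes" => flashes }   # e.g. {"notice" => "Profile updated."}
end

def flash_from_session_value(value)
  value.is_a?(Hash) ? (value["flashes"] || {}) : {}
end

stored = flash_to_session_value("notice" => "Profile updated.")
Marshal.load(Marshal.dump(stored))            # plain hash: safe on any Rails version
flash_from_session_value(stored)["notice"]    # => "Profile updated."
```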

We also got some hacks working that let us unmarshall the other version’s flash class on each version of Rails, even without its full class definition: we could use our knowledge of how things were stored internally in those objects, together with #instance_variable_get, to pull out the messages and bring them into the new format.
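The trick looks roughly like this (a heavily simplified, self-contained sketch that simulates the data; the gem’s real code handles the actual Rails 2 and Rails 3 flash classes in both directions):

```ruby
# Stand-in for the flash class that isn't defined on this Rails version, so
# Marshal.load has a constant to resolve and doesn't raise.
module ActionDispatch
  class Flash
    class FlashHash; end
  end
end

# Simulate the kind of flash object a Rails 3.2 server would have marshalled
# into the session (Rails 3.2 keeps its messages in the @flashes ivar).
old_flash = ActionDispatch::Flash::FlashHash.new
old_flash.instance_variable_set(:@flashes, "notice" => "Profile updated.")
marshalled = Marshal.dump(old_flash)

# On the other side we can load it and dig the messages out with
# #instance_variable_get, without needing the full class definition.
loaded = Marshal.load(marshalled)
loaded.instance_variable_get(:@flashes)  # => {"notice"=>"Profile updated."}
```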

With these patches in place on both our Rails 2 servers and our Rails 3 test server, it was possible for a user to bounce between servers of different versions without being logged out, and without seeing error pages because the current server couldn’t understand the flash message from the previous server. It would theoretically even be possible to add a Rails 4 box to the pool and have the session be happily understood on any server in the pool.

Problem 3 - Consistent URLs Between Versions

The final hurdle in running Rails 2 and 3 concurrently in production was making sure all our URLs remained the same between Rails 2 and 3. The format for the routes file completely changed between these versions and we took this as an opportunity to kill off a lot of overly-permissive routes that let through too many HTTP verbs, many dating back to a time before Rails even spoke REST.
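For example (illustrative routes only, with a hypothetical YourApp application name), a resource plus a named route in the Rails 2.3 routes file:

```ruby
# config/routes.rb on Rails 2.3: everything hangs off the `map` object.
ActionController::Routing::Routes.draw do |map|
  map.resources :items
  map.profile 'profile', :controller => 'profiles', :action => 'show'
end
```

becomes, in the Rails 3.2 DSL, the same URLs restricted to the verbs we actually want to allow:

```ruby
# config/routes.rb on Rails 3.2.
YourApp::Application.routes.draw do
  resources :items
  get 'profile' => 'profiles#show', :as => :profile
end
```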

The initial work on this involved lots of manual testing to ensure URLs and form actions still matched up. Once we had fairly high confidence that most things aligned, we built some time charts in our log aggregation tool, Splunk, which allowed us to see when a request came through that couldn’t be routed to a controller. We always see a level of background noise, as we get lots of requests for random php files and the like. Some of these requests are obviously malicious, some are just innocently bad URLs, but by graphing them on a time chart we can tell what is normal background noise and what are new routing problems caused by Rails 3.
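When one of those charts flags a suspicious URL, each version’s console can confirm how (or whether) it routes. Something along these lines (illustrative path; the method names differ slightly between versions):

```ruby
# Rails 3.2 console: does this path route anywhere?
Rails.application.routes.recognize_path("/items/123", :method => :get)
# => {:controller=>"items", :action=>"show", :id=>"123"}
# Raises ActionController::RoutingError if nothing matches.

# Rails 2.3 console equivalent:
ActionController::Routing::Routes.recognize_path("/items/123", :method => :get)
```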

Rollout

With these patches in place and the Splunk charts at our disposal, we were in a very safe position to silently start serving requests on Rails 3 in short bursts and get an even better level of confidence that all our URLs matched up correctly. Initially we added one Rails 3 server to the load balancer for a one-minute smoke test, which revealed very few problems, just a few unusual routes we’d missed surrounding the API. These were fixed and we did progressively longer tests, each time fixing any problems that were revealed, until eventually we could have a Rails 3 server in rotation for 30+ minutes with no obvious change in error rate.

Conclusion

The extra work involved in getting to a position where we could run Rails 2 and Rails 3 versions of our app concurrently was definitely worthwhile. It allowed us to detect problems we otherwise couldn’t have, without the site appearing broken to all users. This gave us a huge degree of confidence that the final cutover would go smoothly, and it allowed us to properly assess the performance of our Rails 3 app under production levels of traffic without having to “bet the farm”, so to speak.

I ended up being the on-call person for the first night after the full cutover, but because of all the work we’d put in beforehand to make sure things went smoothly, the night was so quiet that I started to wonder whether my phone was actually working.

You would expect that with an upgrade this big, even when you think of everything, there’ll still be something that slips through. The things that did slip through were decidedly minor, and I got an uninterrupted night’s sleep after what was probably our most high-risk upgrade to date.