A number of our services were until recently running on Ubuntu 14 (Trusty Tahr), for which long-term support (LTS) was coming to an end. Trusty Tahr was released in 2014 and had served us quite effectively since; so effectively, in fact, that we had near completely forgotten about it. That is, until one morning an announcement from Ubuntu, declaring that LTS for good ol’ Trusty was coming to an end, demanded attention. Our engineering team, being the enthusiasts that we are, popped the hood, blew the dust off and diligently set upon the task of taking stock, discussing and carding up requirements for the great migration.
It made sense, we reasoned, that since we were reconditioning anyway, we might as well bump all the way up to the present in one go and shepherd in the era of Ubuntu 18 (Bionic Beaver).
The rub, we all agreed, was that there was a fairly large surface area of change between versions, and with layers of provisioning and configuration in the mix that would have to be adapted, the likelihood of this going smoothly was pretty slim.
Now I cannot say whether it was the looming LTS deadline, anarchic tendencies or sheer enthusiasm that had the biggest influence on our decision, but decide to do this we did, and over the ensuing sprints we juggled learning about and adapting our infrastructure to the ways of Bionic.
After all the chopping, changing, testing and refining, it was time to direct production traffic over. Fire extinguishers in hand, we flipped the switch and... you’re expecting to read about the fireworks right about now, aren’t you? Yup, we were expectant too, but nothing fell over. Bionic hummed along, all the gauges and readouts claimed normalcy, and after doubtfully observing this strange phenomenon for a few hours, we hesitantly accepted the possibility of success. For good measure, we decided to let this new and wondrous machine run for a full 24 hours before declaring victory.
Twenty-four hours later, with an evidently well-tuned infrastructure upgrade in production, we made all changes permanent and left the building in victory stride. The migration was done and we were moving on to the next thing. It wasn’t until a couple of days later, when one of our web instances fell over complaining about a missing semaphore, that we began to wonder what we might have overlooked.
What follows is a walk-through of our investigation into the case of the missing semaphore, the rabbit-hole it sent us down and the anagnorisis that hit, revealing the unexpected culprit. See if you can call it before the big reveal.
We started looking at backtraces after a semaphore went missing for a second time. The missing semaphores were used by a part of our code-base that utilises Semian, a library that has been a salutary part of that code-base for a good few years. Semian provides circuit breaker and bulkhead functionality which can be used to improve the resilience of a service that needs to call out to other services.
Semian in turn utilises semaphores to implement this functionality, and what the backtraces were claiming was that our service was referencing a semaphore that did not exist.
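Semian itself is a Ruby library and its internals are not reproduced here; the sketch below is only a minimal, hypothetical illustration in C of how a bulkhead can be built on a System V semaphore set (the kind of semaphore the rest of this story revolves around), with a fixed pool of “tickets” bounding how many processes may call a downstream service at once. The key and ticket count are made up for the example.

```c
/*
 * A minimal, hypothetical bulkhead built on a System V semaphore set:
 * TICKETS limits how many processes may call a downstream service at
 * once. Illustrative only; not Semian's actual implementation.
 */
#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

#define BULKHEAD_KEY 0x53454d01 /* arbitrary illustrative key */
#define TICKETS 5               /* assumed concurrency limit */

union semun { int val; struct semid_ds *buf; unsigned short *array; };

int main(void) {
    /* Create the one-semaphore set if it does not exist yet, else attach. */
    int semid = semget(BULKHEAD_KEY, 1, IPC_CREAT | 0600);
    if (semid == -1) { perror("semget"); return 1; }

    /* Initialise the ticket count (real code would only do this once). */
    union semun arg = { .val = TICKETS };
    if (semctl(semid, 0, SETVAL, arg) == -1) { perror("semctl"); return 1; }

    /* Acquire a ticket: decrement the semaphore, blocking if none are left. */
    struct sembuf acquire = { .sem_num = 0, .sem_op = -1, .sem_flg = SEM_UNDO };
    if (semop(semid, &acquire, 1) == -1) {
        /* If the set has been removed behind our back, this fails with
         * EINVAL or EIDRM: the service's "missing semaphore". */
        perror("semop (acquire)");
        return 1;
    }

    /* ... call out to the downstream service here ... */

    /* Release the ticket: increment the semaphore again. */
    struct sembuf release = { .sem_num = 0, .sem_op = 1, .sem_flg = SEM_UNDO };
    if (semop(semid, &release, 1) == -1) { perror("semop (release)"); return 1; }
    return 0;
}
```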
Our initial investigation did not give us much more than this to go by. Bionic was fresh, but it had been running smoothly for a few days and nothing pointed to it being the perpetrator. Moving through the expected steps: we were still running the same version of Semian, associated code had not been changed, logs and monitoring had nothing, and there were no co-occurring events to read into.
Going by the little we had, this would have to be an exercise in abductive reasoning. Besides `ipcs` confirming that semaphores were indeed missing, and some early speculation about the possibility of bottlenecks, starvation and resource constraints, we had nothing. Opting to garner more information before trying our hand at hypothesising, we set up logging to capture what semaphore activity could be traced, expecting to catch a complaint if/when another semaphore went missing.
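For the curious, `ipcs` (with its `-s` flag) lists the System V semaphore sets the kernel knows about. The programmatic equivalent of that check, sketched below against the same made-up key as in the earlier example, is simply asking `semget` for an existing set and seeing whether the lookup fails:

```c
/* A small sketch of the check ipcs was effectively doing for us:
 * look up a semaphore set by key without creating it, and report
 * whether it still exists. BULKHEAD_KEY is the same made-up key
 * as in the earlier sketch. */
#include <errno.h>
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/sem.h>

#define BULKHEAD_KEY 0x53454d01

int main(void) {
    int semid = semget(BULKHEAD_KEY, 0, 0); /* no IPC_CREAT: lookup only */
    if (semid == -1 && errno == ENOENT) {
        puts("semaphore set is gone");      /* what the backtraces implied */
        return 1;
    }
    if (semid == -1) { perror("semget"); return 1; }
    printf("semaphore set exists, id %d\n", semid);
    return 0;
}
```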
The pager went off early the following morning: all of our web instances had hit this same snag. Between log data and some manual testing we began ruling possible causes out; no complaints, no timeouts, no workers being killed. Everything we looked at was behaving as advertised.
Being unable to incite this misbehaviour ourselves, and with nothing revealing in the logs, we turned our attention to the only other lead we had, which came from observing that all web instances were hit, and were hit at the same time: the time that the daily cron jobs run.
We went on to explicitly invoke those of the daily cron jobs that could be run more frequently without consequences or side-effects. This fell flat, with nothing misbehaving; perhaps the job we were after was not in this selection.
Our investigation up to this point had revealed very little: some cron jobs could not be run until the following morning, others were invoked yet no semaphores disappeared. Inspecting these cron jobs, some of which were contemporaries of Trusty, only served to raise more questions about their relevance and function.
Left with few options, we decided to disable a single, different, daily cron job per web instance and wait for things to get noisy the following morning. The instance that lost none of its semaphores would tell us which cron job could instigate the misbehaviour, on account of it being the one that was disabled on said instance.
This played out as expected, and a specific cron job, whose purpose was to rotate log files, was implicated. The plot thickened a little here: besides being a fairly straightforward job, it did not interact with Semian, semaphores, endpoints or any part of the service that we could use to connect the dots.
What inspection did reveal was that the job only did its work when log files were either old enough or large enough to warrant rotation, which suggested a plausible explanation for why we did not encounter any odd behaviour in the 24 hours that immediately followed the Bionic upgrade.
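We never needed to dig into the mechanics of the rotation itself, but for illustration, a policy with exactly this “old enough or large enough” behaviour looks something like the hypothetical logrotate stanza below. The path, thresholds and the use of logrotate at all are assumptions for the example, not a description of our actual job:

```
# Hypothetical /etc/logrotate.d/our-service -- illustrative only.
# Rotate when the log is a week old, or sooner once it grows past 100M,
# keeping eight compressed rotations.
/var/log/our-service/*.log {
    weekly
    maxsize 100M
    rotate 8
    compress
    missingok
    notifempty
}
```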
Adding to the mystery of how this cron job was instigating misbehaviour: when forced to run, it ran successfully, with no misbehaving here either. Suspicion in Bionic’s direction was mounting by this point, but there was no evidence to go by.
All questions, no answers and no leads. A couple of us got up early the following morning, hopped onto a few web instances, got our tracing tools of preference out and collected a few gigabytes of system event data right as semaphores began to disappear.
There’s plenty to see in a couple of minutes of system event data; making sense of it all is an entirely different matter. We filtered for errors, carefully inspected warnings, looked into the sequence of semaphore-related messages, focused on log rotation events and came up with a number of hypotheses that could rival the best of conspiracy theories, then disproved them all.
Having looked at all the meaty, noisy, interesting or otherwise cryptic events, and having learnt very little, grasping at both straws and sanity, we set out to retrace our steps one more time in the hope of finding something we had missed.
As fate would have it, the thread that served as catalyst to unraveling this mystery was hidden in plain sight between the grasped straws. Having looked at everything else, we were now reading the event messages that were in plain English: one message from `logind` announcing a logout event, in particular.
Now, we knew that the log rotation job, when running, would `sudo` in to a specific account and log out once done. Looking at this `logind` logout event spurred thought along a different trajectory, which ended in a question: what does `logind` clean up at logout?
Things unraveled pretty quickly from here. A search revealed a number of articles on the topic: “systemd is the best example of Suck” caught my attention, which linked to “systemd-logind deletes your message queues”, and next up was “Linux Database crashes when semaphores get removed”. The TL;DR: `logind` is a daemon that systemd employs to manage user logins. While systemd was not the standard in older versions of Ubuntu, it is the default system and service management solution that ships with Bionic, and `logind` is configured by default to remove IPC-related resources when a user account logs out.
A quick check-in with the service’s provisioning code confirmed that the account in question did in fact live in user space, and that remedial action would be to either change the default behaviour by setting `RemoveIPC=no` in `/etc/systemd/logind.conf`, or move the account to system space, where it would be unaffected.
Summed up succinctly: the service’s Unix account was in user space, so whenever it was logged into and then logged out of, all of its Inter-Process Communication (IPC) resources would get cleaned up, semaphores included. From the system’s perspective, semaphores were not missing; they were cleaned up as expected, and it was the service that was misbehaving by trying to refer to semaphores that were no longer there.
Takeaways..
Hindsight tends to be 20/20. Reflecting on the decisions we made on the journey to solving this conundrum, there are a couple of things that I think are worth calling out:
- Diversity amplifies synergy: solving this would have taken longer had it not been for all the unique contributions.
- We might have been able to solve this sooner, perhaps by testing every aspect of the instigating cron job once we had identified which job it was. But debugging with not much to go on is difficult: you do not know what you do not know, so all inference is weighed equally until something new is discovered, and you do not know how far, how close or in which direction that next discovery lies.