Using Historical Incident Management Data to Plan for System Upgrades

Published September 13, 2017 for PagerDuty

For a freelance developer, inheriting projects is a necessary evil. Almost every project has legacy code the team is afraid to touch, but when you inherit a project as a freelancer, more often than not the entire codebase is “legacy.” Dealing with an unfamiliar codebase is tough, but getting that codebase running in a production environment can be even harder.

Guessing Games

Last October, I inherited a project that drove me to near insanity. The source code itself was certainly in shambles, but what made the project such a nightmare was the lack of documentation and communication from the previous developers. I had to reverse-engineer the application just to get it running in the new production environment.

I was essentially playing a guessing game with the architecture. I had a rough idea of what resources I needed to provision, but until the application was in front of users, I really didn’t know what to expect. As you can probably guess, this didn’t end well: because of inefficient programming patterns, the site needed four times the resources it should have just to achieve some modicum of stability.

Luckily, one of the first things I did was integrate incident management tooling into the project. That let me identify specific pain points early and often, and fix them immediately, which in turn drove strategic resource and project upgrades that improved the site’s stability.
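To give a rough idea of what that kind of integration involves, here is a minimal sketch of an alerting hook built on PagerDuty’s Events API v2. The integration key, function name, and payload details are placeholders rather than this project’s actual setup.

```python
# Minimal sketch of an alerting hook, assuming a PagerDuty Events API v2
# integration key. The key and payload values are placeholders, not the
# actual configuration from the project described in this post.
import requests

ROUTING_KEY = "YOUR_EVENTS_API_V2_INTEGRATION_KEY"  # placeholder

def trigger_incident(summary, source, severity="error"):
    """Send a trigger event so the on-call person gets paged."""
    response = requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json={
            "routing_key": ROUTING_KEY,
            "event_action": "trigger",
            "payload": {
                "summary": summary,    # e.g. "Hourly cron job locked the users table"
                "source": source,      # e.g. "web-01"
                "severity": severity,  # info | warning | error | critical
            },
        },
        timeout=10,
    )
    response.raise_for_status()
    return response.json()
```

Wiring something like this into the application’s top-level error handler and cron wrappers is what produced the historical data the rest of this post leans on.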

So, what exactly did I see?

While I felt like I was playing whack-a-mole with half a dozen issues at any given time, two cropped up infrequently enough that I would never have noticed their impact without incident management tools in place: database locking and memory issues. Both are relatively common development problems, but common doesn’t mean easy to diagnose or solve.

Database Locking

After stabilizing the production site, one of the first things we noticed was that the site was going down every hour, on the hour, for about 15 minutes at a time. Thanks to the information provided by our incident management tools, I was able to narrow the problem down to an hourly cron job: a critical script was locking a primary database table every time it ran, effectively taking the site down until the process finished. Once I knew where to look, refactoring that script was easy, and it noticeably increased the site’s uptime and reduced user frustration.
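The offending script isn’t reproduced here, but the general shape of the refactor looks something like the sketch below: instead of one long transaction that holds a lock on a hot table for the entire run, do the work in small batches and commit after each one so the lock is released quickly. The table, the column, and the sqlite3 backend are stand-ins for whatever the real stack happens to be.

```python
# Hypothetical refactor of an hourly cron job; the table, column names, and
# sqlite3 backend are stand-ins -- the real project's stack isn't specified.
import sqlite3

BATCH_SIZE = 500

def recalculate_scores(conn: sqlite3.Connection) -> None:
    # Before: one giant UPDATE inside a single transaction, which held a
    # lock on the primary table for the entire run and took the site down.
    #
    #   conn.execute("UPDATE users SET score = score * 0.9")
    #   conn.commit()
    #
    # After: process ids in small batches and commit between them, so the
    # lock is held only briefly and normal traffic can interleave.
    ids = [row[0] for row in conn.execute("SELECT id FROM users ORDER BY id")]
    for start in range(0, len(ids), BATCH_SIZE):
        batch = ids[start:start + BATCH_SIZE]
        placeholders = ",".join("?" for _ in batch)
        conn.execute(
            f"UPDATE users SET score = score * 0.9 WHERE id IN ({placeholders})",
            batch,
        )
        conn.commit()  # releases the write lock before the next batch
```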

Memory Issues

Memory leaks suck. In a complicated application they can be incredibly difficult to track down, especially when they occur in a production environment. Unfortunately for me, this project was filled to the brim with them. Some are easy to fix, like log entries showing the Redis server running out of memory (insert more memory here), but others are far more elusive.
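For that “insert more memory here” class of problem, the first step is simply confirming how close Redis is to its ceiling and making sure it evicts keys instead of falling over. Here is a small sketch using the redis-py client; the 512MB cap and LRU policy are illustrative values, not this project’s actual settings.

```python
# Illustrative Redis memory check using the redis-py client. The memory
# limit and eviction policy below are example values, not real settings.
import redis

r = redis.Redis(host="localhost", port=6379)

mem = r.info("memory")
print("used_memory_human:", mem["used_memory_human"])
print("maxmemory:", mem["maxmemory"])  # 0 means "no limit": grow until the box runs out

# Cap Redis at 512MB and evict least-recently-used keys instead of erroring out.
r.config_set("maxmemory", "512mb")
r.config_set("maxmemory-policy", "allkeys-lru")
```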

One common but seemingly random memory-related issue was timeouts. Occasionally, the site would start timing out for users after trying to load for five minutes. I knew from experience that this was likely caused by more inefficient database queries, but narrowing down the exact queries was a challenge. Again, thanks to the incident management framework I’d put in place, I was able to identify a specific set of profile pages that were taking almost half an hour to pull data from the database. Because the requests took so long, users kept reloading the page and restarting the whole process, which only compounded the load.

The first thing I did was figure out exactly how long users were waiting before they reloaded the page or gave up (about 1 minute). Then I changed both the web and database server configurations to kill anything that ran longer than a minute. That gave me some breathing room: those pages still failed, but they no longer dragged the rest of the site down with them.
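As one concrete example of the “kill everything after 1 minute” change on the database side: the exact setting depends on the database in use (PostgreSQL calls it statement_timeout, MySQL 5.7+ has a similar max_execution_time, and the web server gets a matching read or proxy timeout). The sketch below assumes a PostgreSQL backend with placeholder connection details, purely for illustration.

```python
# One way to enforce a one-minute cap at the database layer, assuming a
# PostgreSQL backend (the post doesn't specify the actual database).
# Connection parameters are placeholders.
import psycopg2

conn = psycopg2.connect(
    dbname="app",
    user="app",
    host="localhost",
    # Ask the server to abort any statement running longer than 60s (value in ms).
    options="-c statement_timeout=60000",
)

with conn.cursor() as cur:
    cur.execute("SHOW statement_timeout")
    print(cur.fetchone())  # e.g. ('1min',)
```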

Then, I had to identify the exact queries causing the problems. Unfortunately, these particular pages were pretty query-heavy, but after digging through the logs I narrowed it down to one line that was pulling over 1GB of data from the database server without caching the result. From there, the next steps were to refactor the query, cache the result for an appropriate amount of time, and get the fix out to users as soon as possible.
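“Cache the result for an appropriate time” usually means something like the cache-aside sketch below: check Redis for a stored copy of the expensive result, and only hit the database (then re-cache with an expiry) on a miss. The key name, the one-hour TTL, and the run_expensive_profile_query helper are hypothetical stand-ins, not the project’s actual code.

```python
# Cache-aside sketch for an expensive profile query. The key format, the
# one-hour TTL, and run_expensive_profile_query() are hypothetical.
import json
import redis

r = redis.Redis(host="localhost", port=6379)
CACHE_TTL_SECONDS = 3600  # "an appropriate time" depends on how fresh the data must be

def run_expensive_profile_query(profile_id):
    # Placeholder for the refactored (but still heavy) database query.
    raise NotImplementedError

def get_profile_stats(profile_id):
    key = f"profile-stats:{profile_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: skip the database entirely
    result = run_expensive_profile_query(profile_id)
    r.setex(key, CACHE_TTL_SECONDS, json.dumps(result))  # cache miss: store with expiry
    return result
```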

These are just a few examples of the problems I was able to solve thanks to my historical incident management data; if I hadn’t implemented the toolset early on, I would probably still be playing guess-and-check with various solutions. Don’t get me wrong, though: the same incident management tools can also be used to plan upgrades for a well-architected application. Identifying the circumstances under which your servers overload or things start slowing down is crucial to scaling your project to accommodate growth.