Debugging Jita Live is For Real Men
One of the problems we face in running a cluster on the massive scale of our beloved Tranquility is that it's extremely difficult to test specific load issues before code is deployed onto the cluster.
We have a series of load-inducing tests that we run on our test servers, and we get players to participate in huge fights on our public test server to gauge the effects of new code.
In Apocrypha we had a staggering number of changes to the code base from dozens of programmers working on three continents. Keeping tabs on the changes was a daunting task and, as always in large software projects, a few bugs slipped the net and made their way to the production server.
In this case, a bug caused Jita to start suffering performance degradation with 300 people in the system, and we had no idea why. Basically, all the nodes were running hotter than they should have been, and the Jita node was at 100% CPU capacity under a load that should only have put it at around 30%.
Luckily our live team has become quite proficient in debugging issues on TQ and immediately sprang into action. We have extremely talented people in all departments, and for this specific problem Programming, Quality Assurance and Virtual World Operations formed the backbone of Operation: Fix-whatever-is-wrong-on-TQ.
We have extremely good diagnostic tools in place on the cluster, but the first thing we noticed was that those tools did not properly report where the CPU cycles were bleeding. From studying the graphs, it looked as if the server was doing the same amount of work as before, only each work unit was more expensive than it used to be.
After much deliberation, we concluded that the fault must lie in one of the low-level systems of the game that reside outside our timing logic (in layman's terms: Planck-Scale Spacetime Fluctuations). Moving forward, we split the task force into two teams. The first team (Team Alpha) started to examine the code in search of the needle in a haystack, while the second team (Team Bravo) started to look more closely into the program execution on the cluster itself.
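To illustrate why cost outside the timing logic is so hard to pin down, here is a minimal sketch of a per-task timing harness. It is only an assumption of how such instrumentation might look, not CCP's actual code; the names `timed` and `update_ship` are hypothetical.

```python
import time
from collections import defaultdict

# Hypothetical per-task timing harness -- a minimal sketch of the kind of
# "timing logic" mentioned above, NOT CCP's actual instrumentation.
task_totals = defaultdict(float)
task_counts = defaultdict(int)

def timed(task_name):
    """Record wall-clock time spent inside a named task."""
    def wrap(func):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                task_totals[task_name] += time.perf_counter() - start
                task_counts[task_name] += 1
        return inner
    return wrap

@timed("update_ship")   # hypothetical game-loop task
def update_ship(ship_id):
    pass  # game logic would go here

# The catch: extra cost in a low-level layer that every task passes through
# (argument marshalling, connection bookkeeping, and so on) is either
# invisible to these counters or smeared evenly across all of them, so the
# graphs show the same tasks as before, each one simply more expensive.
```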
Team Alpha examined the code carefully and determined that the problem must lie in a piece of logic dealing with "bound object connections". Following that trail led them to an internal reproduction case, and once they had that, finding the actual problem was rather straightforward.
Team Bravo went onto the Jita node and paused the running process to see which segment of code the CPython interpreter was executing.
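The blog does not say exactly which tool Team Bravo used to do this. For illustration only, here is one common way to see what a live CPython process is executing, assuming a Unix-like host and a process you can send signals to:

```python
import signal
import sys
import traceback

# Install a signal handler that dumps every thread's current Python stack.
# This is a generic technique, not a description of CCP's tooling.
def dump_stacks(signum, frame):
    for thread_id, stack in sys._current_frames().items():
        print("--- thread %d ---" % thread_id, file=sys.stderr)
        traceback.print_stack(stack, file=sys.stderr)

signal.signal(signal.SIGUSR1, dump_stacks)

# With the handler installed, `kill -USR1 <pid>` makes the process print the
# exact lines of Python it is currently executing, without stopping it for
# more than an instant. (Python 3.3+ also ships faulthandler.register() for
# the same purpose, and a native debugger attached to the process can walk
# the CPython frames directly, which is closer to pausing the live process.)
```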
By a startling coincidence both teams arrived at the same piece of code at the same time!
It was literally one line of code. Typical...
Fixing the problem was trivial, and the teams then deployed the fixed code directly onto a single node of the live, running cluster. This had never been done in quite this way before and was extremely exciting. The Jita solar system was moved onto that node and we saw immediately that the fix was working.
The next day, during downtime, the fixed server build was deployed cluster-wide and space-time was finally allowed to heal itself.
This is how real men (and women) do it...
- CCP Atlas