Tales of Configuration Management

2007-04-26 - Mephysto

So, it's been a while since my last devblog. I've been pretty busy this year, as my primary role within QA has changed from Tester to Configuration Manager (Tanis is now primarily responsible for the Testing area; you can read his blog here). The definition of the CM role is "The control and adaptation of the evolution of complex systems. It is the discipline of keeping evolving software products under control, and thus contributes to satisfying quality constraints." In more understandable terms, I control which changes made by the programmers and content departments are released to Tranquility. Well, not just me; there is a Change Control Board of four people: myself, Ulfrr (Head of the QA dept.), Oveur (Production Manager) and CCP's CTO (Chief Technical Officer) prepH. So I hardly have the final say, although I like to think I do.

Obviously this brings up the question of "Why aren't all the fixes put on TQ when they are made?" Well, simply put, not all fixes are safe to put onto TQ, and some fixes do not even apply to TQ. One of my bigger problems is when a fix is made for some error, but the fix itself relies on a separate piece of code that's been written for a new feature being introduced in the next expansion. This is happening a lot more often as we get closer to Revelations 2.
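To make that concrete, here's a minimal sketch of the dependency problem. Everything in it is invented for illustration (file and function names, the simplified view of branches as sets of files); it is not EVE code or our actual tooling, just the shape of the issue:

```python
# Hypothetical sketch of the dependency problem described above.
# All names here are invented for illustration; this is not EVE code.

# Files present on each branch (a simplified model of the source tree).
DEVELOPMENT = {"market.py", "expansion_pricing.py"}  # has the new expansion code
STAGING     = {"market.py"}                          # release branch: no new feature yet

# A bug fix and the code it depends on.
bugfix = {
    "description": "Correct market order rounding",
    "touches": {"market.py"},
    "depends_on": {"expansion_pricing.py"},  # the fix calls into new-feature code
}

def safe_to_port(fix, branch_files):
    """A fix can only be ported if everything it depends on exists on the target branch."""
    return fix["depends_on"] <= branch_files

print(safe_to_port(bugfix, DEVELOPMENT))  # True  -- works where it was written
print(safe_to_port(bugfix, STAGING))      # False -- porting it alone would break STAGING
```

In cases like that, the fix either waits for the expansion or has to be rewritten against what the release branch actually contains.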

Once a fix or feature has been ported to the release branch (actually named STAGING), it goes onto Singularity for testing. All changes that end up on Tranquility appear here first, as Singularity is the closest thing we have to a copy of TQ's setup, even if it is on a much smaller scale. Having a complete copy of TQ's hardware just for testing isn't possible.
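As a rough model of that promotion path (the branch and server names come from this post; the rest is an invented sketch, not our actual tooling), a change only ever moves one step at a time and never skips a stage on its way to Tranquility:

```python
# Illustrative model of the promotion path described above.
# Branch/server names are from the blog post; everything else is invented.

PIPELINE = ["DEVELOPMENT", "STAGING", "Singularity", "Tranquility"]

class Change:
    def __init__(self, description):
        self.description = description
        self.stage = 0  # every change starts life in DEVELOPMENT

    def promote(self):
        """Move the change one step along the pipeline, never skipping a stage."""
        if self.stage + 1 >= len(PIPELINE):
            raise ValueError("Already on Tranquility")
        self.stage += 1
        return PIPELINE[self.stage]

fix = Change("Repair market order rounding error")
print(fix.promote())  # STAGING     -- ported to the release branch
print(fix.promote())  # Singularity -- deployed to the test server
print(fix.promote())  # Tranquility -- released to players
```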

Still, Singularity can support over 1,000 players, although we rarely get that many. This is unfortunate, as some issues simply do not appear until a very large number of people are using the same server; that is one of the reasons we have problems with major patches after we have opened the servers to the players. Put simply, the problem was not detectable on any of the test servers and only appeared once 10,000 people or so connected.

Outside of patches, we regularly apply server-side fixes to improve the stability and performance of the cluster. You will very rarely see any of these happening, as the changes are almost always 'under the hood' and thus measurable only by, for example, a reduction in node deaths. (The recent outbreaks of mass node deaths were not code related, but hardware.)

In addition to building patches, I get to beat the programmers and content developers with a stick (or whip, sword, whatever's at hand really) to fix defects for the next patch, which I need right this instant so I can build the test server. All in all, this takes up a remarkable amount of time each day, even when a patch is not imminent.