The Long Lag

As the CSM has been kind enough to request, and with apologies for a holiday-related delay, here is the presentation (2010 - Designing Distributed Applications.pdf) I gave at the Para 2010 conference a couple of months ago.

To give you a little background, this presentation was based on my Ph.D. research, which examined the problem of group organization and co-operation between semi-autonomous robots in scenarios without centralized control. What I was trying to get the robots to do was visual analysis of a shared scene, which turned out to be a harder problem than I initially expected. However, that was just the demonstration. The primary goal of the research was to explore the problems you have to solve in order to create groups of robots that can co-operate on shared tasks when centralized control isn't possible.

Every EVE player who has been in a large fleet fight has experienced this set of constraints, just in the work individual fleet commanders and coordinators do in handing out targets and coordinating actions. Similar management is done by the CEOs and Directors of EVE corporations on a day-to-day basis. The problems are not, in principle, any different in the game's software layers, except that software message processing is considerably faster and more reliable, whilst the human layers are arguably more fault tolerant. Arguably.

What both software and human groups are fighting at a very fundamental level is a nasty set of relationships between organizational structure and available real-time communication capacity. These relationships do not scale at all well as groups get bigger, especially as the shared task gets more complicated and requires more communication. Getting everybody to spam the chat channel at the same time is easy. Coordinating a complicated operation with different fleet components, loadouts and staggered waves of attack is much harder. These are problems that arise from simple mathematical limits on what can be done or communicated by each node in a distributed system at any given instant of real time. A very simple example is a human meeting. Four people in a one-hour slot can each speak for 15 minutes, which is usually plenty of time to provide the insights each person can bring to the topic. Twenty people in the same time period can only talk for 3 minutes each. Meetings just don't scale well as a way of exchanging information.
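The meeting arithmetic above can be sketched in a few lines. This is purely illustrative (the function names are mine, not from the presentation): per-person share of a serial channel shrinks linearly with group size, while the number of potential node-to-node conversations grows quadratically.

```python
# Illustrative sketch: a meeting is a shared serial channel, so each
# participant's share of the time slot shrinks as 1/n, while the
# number of distinct pairwise links grows as n*(n-1)/2.

def speaking_minutes(slot_minutes: int, participants: int) -> float:
    """Each participant's share of a shared, serial communication channel."""
    return slot_minutes / participants

def pairwise_links(participants: int) -> int:
    """Distinct node-to-node links in a fully connected group."""
    return participants * (participants - 1) // 2

for n in (4, 20):
    print(n, speaking_minutes(60, n), pairwise_links(n))
# 4 people get 15 minutes each with 6 possible conversations;
# 20 people get only 3 minutes each, with 190 possible conversations.
```

The mismatch between those two curves is the scaling wall: the information the group could usefully exchange grows much faster than the time available to exchange it.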

Where I personally think this gets particularly interesting is that there turns out to be a topology (arrangement of links between nodes) constraint on the total amount of instantaneous information that any given group of nodes can process. At the extremes you have a strictly hierarchical topology (think dictator), which can communicate the same message (orders) to everybody in the network very quickly; and a distributed topology (think democracy), within which a much larger amount of different messages can be communicated, but actually getting everybody coordinated quickly becomes a distinct problem. For shared tasks whose requirements are at the extremes, it's not a problem, but what do you do if the task requires both quick coordinated action, and sharing a lot of information between nodes?
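The two extremes can be roughed out with back-of-the-envelope formulas. This is a simplified model of my own (the fan-out tree and the doubling-per-round gossip assumption are idealizations, not anything from EVE's architecture): a hierarchy delivers one order to everybody in very few hops but funnels all traffic through the root, while a flat peer-to-peer arrangement spreads capacity out at the cost of more rounds to reach agreement.

```python
# Hypothetical sketch of the topology tradeoff: order propagation in a
# command hierarchy (tree of fan-out k) versus rumor spreading in a
# flat, gossip-style network where each informed node relays to one
# random peer per round (doubling the informed set at best).

import math

def hierarchy_broadcast_hops(n: int, fanout: int) -> int:
    """Hops for one order to reach all n nodes via a tree of given fanout."""
    return math.ceil(math.log(n, fanout)) if n > 1 else 0

def gossip_rounds(n: int) -> int:
    """Best-case rounds for a rumor to cover n nodes with doubling spread."""
    return math.ceil(math.log2(n)) if n > 1 else 0

print(hierarchy_broadcast_hops(1000, 10))  # 3 hops down the chain of command
print(gossip_rounds(1000))                 # 10 rounds of peer-to-peer relay
```

The catch the text describes is hidden in what these formulas don't count: the hierarchy's root must relay every message, so its capacity caps total throughput, while the gossip network has no such bottleneck but pays in coordination latency.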

The presentation at Para 2010 was an attempt to put this into a very high level framework for designing large scale distributed systems. Personally, I think the scaling limitations themselves are fairly obvious especially once they've been pointed out.  A lot of the design problem in practice comes from hierarchical designs being chosen due to ease of initial implementation. They won't scale very well - but that's not necessarily very obvious when you're working in the lab with too few nodes for the scaling constraints to be a problem. Something akin to a state change occurs in these systems as they grow beyond a certain size.

In practice, both hierarchical and distributed approaches end up getting used to solve the same problem, and there is a fairly complex set of tradeoffs that have to be evaluated when designing individual systems to determine which is actually the most appropriate. Games like EVE Online are not really a single distributed system - they are effectively a superposition of multiple distributed systems which simultaneously provide different aspects of game play, and have to be somehow designed to play well with each other.

Our goal is to give you the best possible performance not only across the cluster as a whole, but also for specific activities like fleet fights, measured against the theoretical limits. We have a lot of work to do on that, and plans to do it. From the cluster team's perspective, it's work that would have to be done regardless of whether Incarna or DUST 514 was being rolled out, as we build the cluster out to the next level of scaling for EVE itself.

From time to time we also discuss scaling issues with game design, since that is the only place where some of these distributed scaling problems can be solved. The longer term view on fleet fight performance lag is that whilst we can and will maximize performance within any single server's area of space, we are going to have to continue to work on game design to somehow limit the number of people that can be simultaneously in that space. Fully granted, given the vast physical immensity that is actual Space, it is a little hard to make a game case that there isn't enough room for a piddling few thousand spaceships.

For the specific issue of fleet fight lag where players get stuck jumping into a system (aka "long lag"): we know this is an issue. To be technical, it's a non-reproducible, stochastic problem that manifests itself predominantly under high load. The less formal name for this sort of thing is "every developer's worst nightmare."

Debugging distributed real-time systems is a little different from dealing with sequential programs. For one thing, especially on a cluster the size of EVE's, putting in extra logging can slow down the entire game if we're not careful. Data mining the resulting logs then becomes its own set of issues. You just know you're in trouble when the program you've written to do the analysis itself takes hours to run. There can also be Schrödinger effects, where examining the state of the system changes its behaviour enough that the issue doesn't manifest itself. We also have to be generally careful with what we do, since we can easily take down the entire cluster if we're not. It slows us down in the short term, but it hopefully means we don't introduce any more problems than we have to.
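One common way to keep diagnostic logging cheap under load is to sample: record only a fraction of events, and skip the (relatively expensive) log-line formatting entirely for the rest. The sketch below is a generic illustration of that idea, not EVE's actual instrumentation; the logger name and message are made up for the example.

```python
# Hedged illustration of sampled logging: only a fraction `rate` of
# calls produce a log record, and message formatting is deferred to
# the logging framework so skipped calls cost almost nothing.

import logging
import random

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("jump_diagnostics")  # hypothetical logger name

def log_sampled(rate: float, msg: str, *args) -> None:
    """Emit roughly `rate` of calls; do no formatting work otherwise."""
    if random.random() < rate:
        logger.info(msg, *args)

# e.g. record roughly 1% of jump transitions rather than all of them:
log_sampled(0.01, "player %s jumping into system %s", 42, "HED-GP")
```

Sampling trades completeness for overhead, which also shrinks the Schrödinger effect mentioned above: the lighter the instrumentation, the less it perturbs the behaviour you're trying to observe.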

What I told the CSM was that we are going to fix this, it is going to take time, and we apologize for how long it has already taken. All I can really say beyond that is that these really are hard problems to solve. I know that sounds pretty lame, but that is also the unfortunate reality. They pretty much have to be tackled scientifically, which is to say that you form a hypothesis, figure out how to test it or how to get more information that will help form better theories, swear a lot when your pet theory gets shot down in flames by the empirical evidence, and come in the next day and get to do it all again. It takes time, no matter how smart we try to be about it.

A number of developers have been working on this problem. A number of possible causes have so far been identified, tested, and turned out not to be the direct cause of this particular issue. We thought for a few days that the "Heroic Measures" DB fix that CCP Atlas shepherded through would be it. It certainly fixed some significant problems, but it just didn't fix this problem. I had a pet theory that it was a TCP rate adaptation issue, in conjunction with a system lock affecting multiple clients. No such luck. We narrowed it down a little after I accidentally left myself logged on while stuck overnight, came in the next morning, and found that I had successfully jumped into the system. (Unfortunately, due to the nature of EVE, that really only works in an invulnerable ship.) But that still leaves us with a pretty wide target. We are just starting to reap the benefits of some of the longer-term projects that were initiated back in the spring as backup plans: improved cluster monitoring and a much improved testing environment (major props to the Core Infrastructure team for that, btw) being the major ones. That and the results from the mass testing efforts are helping to focus our suspicions.

In the meantime, I can guarantee that the team here feels just as frustrated about this problem as developers as you do as players. Probably the most frustrating part is that, based on past experience, when we do find this issue (or issues) it will be something that, in retrospect, appears incredibly obvious and silly to have caused so much pain.

So we will continue to beat our heads against this problem until we solve it, and then I suspect we will beat our heads against the nearest wall for quite a while afterwards.

We'll keep you posted.

  • CCP Warlock