'We Are Aware of the Problem'
Following the release of the Dominion expansion we started to get reports from players about an increase in "fleet lag." Before we go any further, I would like to assure everyone that a taskforce of developers has been working on this continuously, but tracking down the source has proven a very difficult task.
So, what is the problem?
"Lag" is an ambiguous term and can refer to many things. The graphics in the client might be slowing it down, the network might be congested or a host of other issues may be coming into play. We have spent a long time trying to reproduce the issue on the test servers as well as monitoring fights as they have been unfolding on Tranquility, where we have received help from the players in the form of testing and bug reports. You might see names like "CCP Atlas", "CCP GingerDude" and "CCP Warlock" poking about in Local. Don't be alarmed, we are simply monitoring the node, looking for glitches. There are senior software engineers spending their evenings and weekends on Tranquility monitoring and measuring since we have not been able to see issues when testing because the sheer number of people required to trigger the problems and the specific 'usage pattern' of the game systems in question are outside what can currently be tested in the lab (although this is something that is being worked on**).**
What we are looking at now specifically is the fact that the performance of the sol nodes (the computers in the clusters handling the solar systems) appears to be worse than before. Even though CPU usage across the cluster has not increased with the Dominion expansion, each reinforced node is no longer able to handle the same number of people fighting as gracefully as it did before.
Additionally, we have the issue of the grid not loading when jumping into a system. Being stuck on the source side and not getting through the gate is one thing since you're not in the battle yet, but if you actually get through the gate and arrive on the other side without managing to load the grid you could be at a disadvantage and possibly vulnerable because of your client's inability to act. This can be detrimental to the fleet jumping into the system.
What is happening is that the client is requesting information about the simulation from the server but is not receiving it in a timely manner. It may take many minutes before the information is finally received, as the request gets stuck in the lower levels of the sol node.
Anatomy of "grid not loading"
When the specific scenario of "grid not loading" occurs, its fingerprint on your client is quite specific (even though sadly we cannot see the effects in the server logs).
Refer to the following handy diagram to determine whether you have encountered the problem:
This is what it would typically look like if you're jumping into a system which exhibits the issue. However, if you are logging into such a system and encounter this issue you will only have a black screen and a "loading" bar.
If you look at the image you will see that the local channel, the right-click menu and the system information are all updated to the new system while the nebula is that of the old system. Your overview will also be empty and there will be no brackets in space. If you open up the monitor (control-shift-alt-m) it will show (at least) one outstanding call. This is the request for the simulation instance.
When you see this, the worst thing you can do is to click any buttons or type in local. It will only exacerbate the problem. Relogging will also not help and might even make the situation worse. The best you can do is wait it out (see the "Fleet fight survival guide" below).
Light at the end of the tunnel
We were finally able to reproduce this with a special fleet test on our Singularity test server on Jan 27th where over 400 people participated in helping us out. That was really fantastic to see and I would like to thank everyone who showed up.
On Jan 28th, we had a 1600 person fleet fight on Tranquility which our team monitored closely, keeping the node alive using methods that make our system admins faint. This was one of the biggest, if not the biggest, fleet fight ever in EVE (at least where the node survived the ordeal). This event allowed us to identify what was causing some of these glitches and deploy fixes live.
We have put a stop-gap measure in place which keeps nodes from dying during these high-stress scenarios, and we have some more permanent solutions in the pipeline that will be deployed over the next few weeks.
How can I help?
There is a mass test scheduled for the Singularity test server on Thursday Feb 4th at 16:00 EVE time. Those of you who wish to help out, please come and provide us with client logs when you encounter the "grid not loading" problem or other issues. We will specifically be testing jumping into several solar systems, trying out a few of the software solutions that have been implemented.
This requires running Logserver.exe from the client installation folder before you start the client. Your logs are extremely important, and each bug report you submit with this information will go a long way toward bringing us closer to a resolution or fix for this issue.
There is more information on this fleet test in this thread.
Additionally, if you are planning to attack a system we also highly recommend that you notify us at least 24 hours beforehand through the Fleet Fight Notification Form so that the system can be placed on a reinforced node. This will also give us the chance to mobilize people to be in the solar system with measuring devices firmly attached to our Polaris frigates.
Fleet Fight Lag Survival Guide
I'd like to talk a little bit about the best tactics we've found to use during periods of extreme server-side lag and methods to make the experience a little more bearable. Note that I am talking about server-side lag here, not low FPS or bad network conditions (for example, if you find yourself in a fleet fight on your 9600bps cell phone modem up on a glacier somewhere). The lag that this refers to is when there are hundreds of players fighting in a solar system and everyone is experiencing delays, where guns could take minutes to fire and there's some gnarly rubberbanding. If you've encountered this type of lag you typically know the symptoms. It's under these circumstances that the "grid not loading" issue manifests itself for example.
Do not hammer that button!
One of the most important things to keep in mind when you are in a system with heavy lag is that you should not hammer any buttons and you should try to execute as few commands as possible. If you are waiting for the grid to load the worst thing you can do is click buttons on the interface or chat in local. These actions will have a negative impact on your client's ability to recover (more to the point, on your session on the server). If you have an outstanding call you should wait for it to complete, and it will complete eventually. It might take minutes but your gun will fire and clicking the button again will not help. Do not browse the market or look up contracts while waiting for the system to load. I know it's hard but you have to leave the client alone while it waits for the calls to complete or you will add to the problem and can even cause the session to get stuck.
Use the Monitor
The network tab in the ingame monitor debugging tool (control-shift-alt-m) can help you understand whether your calls are completing in a timely manner. If there is a number in the "Outstanding" row that means that the server is busy executing your command. Wait for it to complete before executing another one!
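To make the "one command at a time" habit concrete, here is a minimal sketch in Python. It is purely illustrative and is not how the EVE client or server works internally: the CommandGate class, its method names and the example commands are all hypothetical. It simply models a player who checks the "Outstanding" number before issuing the next command.

```python
from dataclasses import dataclass, field


@dataclass
class CommandGate:
    """Toy model of a player who only acts when nothing is outstanding.

    Hypothetical illustration only; not part of any real EVE client API.
    """
    outstanding: int = 0  # what the monitor's "Outstanding" row would show
    sent: list = field(default_factory=list)

    def try_send(self, command: str) -> bool:
        """Issue the command only if no earlier call is still outstanding."""
        if self.outstanding > 0:
            return False  # leave the client alone and wait
        self.outstanding += 1
        self.sent.append(command)
        return True

    def on_call_completed(self) -> None:
        """Record that the server has finished one call."""
        if self.outstanding > 0:
            self.outstanding -= 1


gate = CommandGate()
first = gate.try_send("activate guns")    # accepted: nothing outstanding
second = gate.try_send("activate guns")   # refused: first call still pending
gate.on_call_completed()                  # server finally answers
third = gate.try_send("warp to gate")     # accepted again
```

In this model the second click is simply dropped instead of being queued behind the first, which is the behavior the advice above asks you to imitate by hand.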
Calls do time out
What I said about calls completing eventually is true... from a certain point of view. The call will be handled eventually by the server but it's possible that the client has given up by then. Calls typically time out in 8 minutes and this is true of the request for the simulation instance (grid). If you have waited for 8 minutes for the grid to load you will not recover. If you have the monitor open and see the Outstanding number drop to 0 and your grid doesn't load then you have no choice but to relog, and even that might not fix things if the server is still overloaded.
Note that if you get podded while in this state then the client should recover when your session is moved into your clone station.
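The timeout advice in this section can be summarized as a small decision rule. The sketch below is an assumption-laden illustration, not client code: the function name grid_advice, its parameters and the return strings are hypothetical, and the 8 minute figure is the "typical" timeout mentioned above rather than an exact constant.

```python
# "Calls typically time out in 8 minutes"; treated here as a fixed value
# purely for illustration.
CALL_TIMEOUT_SECONDS = 8 * 60


def grid_advice(outstanding: int, seconds_waited: float, grid_loaded: bool) -> str:
    """Return 'play', 'wait' or 'relog' based on what the monitor shows."""
    if grid_loaded:
        return "play"
    # The grid request is gone (Outstanding dropped to 0) or the client has
    # waited past the typical timeout: the grid will not arrive on its own.
    if outstanding == 0 or seconds_waited >= CALL_TIMEOUT_SECONDS:
        return "relog"
    # A call is still outstanding and has not timed out: hands off the client.
    return "wait"


advice = grid_advice(outstanding=1, seconds_waited=120, grid_loaded=False)
```

Keep in mind that "relog" here is a last resort; as noted above, it might not fix things if the server is still overloaded.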
We have come a long way in understanding the issues that have plagued large fights after Dominion and have gotten important insight into fleet fight lag as a whole. The problems we are experiencing post-Dominion are being worked on and they will be fixed soon, adding more space carnage to your battles.
I would like to ask that we keep the discussion thread attached to this blog productive and try to keep flaming to a minimum.
EVE Online, CCP Games
A Word from Customer Support
Intervening in large fleet engagements in even a remotely fair manner is very problematic for Support due to the complexity of the situation.
Where lag or elusive bugs are involved and are difficult to establish as the deciding factor, CCP cannot justify passing judgment on who should have won or who should have lost. While a problem such as the one in focus here is known to affect some players in certain situations, the evidence available in server-side logs, when available at all, can be haphazard and arbitrary. As such, intervention on CCP's part would risk being arbitrary as well.
What this means is that CCP will not be granting reimbursement for fleet fight losses.
Please understand that fleet fight reimbursement has always been very controversial and few issues have been discussed and argued in more detail within CCP. The conclusion has, however, always been to leave fleet battles alone rather than reimbursing them as a whole. We apologize for any inconvenience this may cause and we appreciate your patience as we work diligently towards resolving this issue.