You always pass failure on the way to success | EVE Online

You always pass failure on the way to success

2006-09-16 - By CCP Oveur

Mickey Rooney said that, and that's how EVE feels these days. EVE has been successful, and the latest 30K record shows that. This was, however, overshadowed by the state Tranquility is in. We'd like to shed some light on the problems and make sure you know we're still here and working on addressing them.

Dragon fixed a lot of issues with stability and load balancing, but it also brought in some new problems. Analyzing and fixing these problems is our top priority at CCP, but we're still far from done.

The problem here is that even though we have been successful in fixing some of the issues, you will not notice an improvement, because different issues can cause the same symptoms.

The most critical symptom is when a problem takes the whole cluster down. The second most critical is when it takes a whole "node" down. A node can contain anything from a single solar system to a constellation, so a node crash can take down one solar system or many. In either case, you will be disconnected.
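To illustrate why a single node death can disconnect players in several solar systems at once, here is a minimal Python sketch. It is not CCP's code: the `Node` class, the `systems_lost` helper, and the node and system names are all hypothetical, chosen only to show the one-node-to-many-systems relationship described above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical cluster node hosting one or more solar systems."""
    name: str
    solar_systems: set = field(default_factory=set)


def systems_lost(nodes, dead_node_names):
    """Return every solar system hosted on a node that has died."""
    lost = set()
    for node in nodes:
        if node.name in dead_node_names:
            lost |= node.solar_systems
    return lost


# Illustrative layout: one node with a single busy system,
# one node carrying several quieter systems.
cluster = [
    Node("node-1", {"system-A"}),
    Node("node-2", {"system-B", "system-C", "system-D"}),
]
```

Calling `systems_lost(cluster, {"node-2"})` returns all three systems on that node, while losing every node is equivalent to the whole cluster going down.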

We currently have a memory leak left to address which can take the whole node down and at the same time cause severe performance issues for all solar systems situated on that node. If you do fleet battles for example, the server will be eating up memory. This can eventually lead to the death of that node, taking with it all solar systems within. We now have an extensive memory logger ready for deployment and hope to catch nodes which experience this.
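As a rough idea of what catching a leaking node from memory logs could look like, the sketch below flags nodes whose sampled memory usage never shrinks and grows by at least a given amount. This is only an assumed heuristic for illustration. The function name `leaking_nodes`, the sample format, and the threshold are hypothetical and say nothing about how the actual memory logger works.

```python
def leaking_nodes(samples, min_growth_bytes):
    """Flag nodes whose memory readings never shrink and grow past a threshold.

    samples maps a (hypothetical) node name to a list of memory readings
    in bytes, ordered from oldest to newest.
    """
    suspects = []
    for node, readings in samples.items():
        if len(readings) < 2:
            continue  # not enough data to judge a trend
        never_shrinks = all(later >= earlier
                            for earlier, later in zip(readings, readings[1:]))
        growth = readings[-1] - readings[0]
        if never_shrinks and growth >= min_growth_bytes:
            suspects.append(node)
    return sorted(suspects)
```

A node hosting a fleet battle would be expected to show exactly this pattern, with usage climbing sample after sample, whereas a healthy node's usage would rise and fall.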

We also have a core system failing, which manifests itself as a lockdown of all services on the affected node. This results in the node failing a heartbeat check, and it is promptly removed from the cluster, taking all solar systems on that node with it. We are deploying tools to debug this situation further, because one of the things that happens before the node dies is that the logserver stops logging errors, which obviously doesn't help the investigation.
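The heartbeat mechanism can be pictured with a small Python sketch. It is a generic illustration of the technique, not Tranquility's implementation: the `HeartbeatMonitor` class, its method names, and the timeout value are all assumptions.

```python
import time


class HeartbeatMonitor:
    """Tracks when each node last reported in and evicts silent ones."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # node name -> time of last heartbeat

    def beat(self, node, now=None):
        """Record a heartbeat from a node (now is injectable for testing)."""
        self.last_seen[node] = time.monotonic() if now is None else now

    def evict_dead(self, now=None):
        """Remove and return nodes that have been silent longer than the timeout."""
        now = time.monotonic() if now is None else now
        dead = [node for node, seen in self.last_seen.items()
                if now - seen > self.timeout_seconds]
        for node in dead:
            del self.last_seen[node]
        return dead
```

A node whose services are locked down stops calling `beat`, so the next `evict_dead` pass drops it from the cluster even though the process itself may still be running.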

We're also seeing an issue with ships within gangs being able to lock down a single solar system. I experienced this yesterday when I was checking up on a solar system which was encountering it. This is connected to delivering ships between sessions and we're on to something after analyzing yesterday’s case while it happened.

Starbases are also proving problematic due to the node deaths. Just today, we had an assault in D7-ZAC where some Starbases were left without shields and force fields after the node died. We had GMs in there to monitor the Starbases, but the node died again shortly afterwards and this time started up the Starbases correctly. However, since this can also affect Sovereignty, it can have long-term effects beyond bugged Starbases. We have extensive tests planned on Singularity with QA and we're constantly reviewing logs from Tranquility to find the cause, but so far no luck.

Load balancing can also be improved; we're seeing some strange things happening there. We have our man on the case, and a number of fixes are already in the pipeline on their way to Tranquility as server hotfixes, which should appear in the next few days. We'll continue to monitor, and we have other improvements and optimizations being programmed to address the scalability of Tranquility.

But as stated earlier, Dragon solved a lot of issues. We're seeing better performance on the servers and the client, and a boatload of issues were fixed, as we can see from our internal error log counters. There is still some way to go, though.

This instability has of course led to our support queues exploding, with each cluster crash alone generating about 1500 petitions. As we have announced before, we're hiring more GMs to address this and to prevent our queues from reaching these dramatic heights again.

The current plan is to continue deploying server hotfixes and to do more testing on Singularity. We'll also be putting more optimizations into Kali. As far as a release date for Kali goes, we're more focused on Tranquility these days, and we'd be happy to see Tranquility in a good, stable state and Kali well tested and deployed in October. It should be pointed out that Kali isn't all "just new stuff"; there are fixes, optimizations and improvements in there which couldn't make it into Dragon.

The Kali release can still shift, since public testing of Kali has inevitably been delayed due to Singularity being used for Tranquility troubleshooting. We are sure you understand the reasons behind our decision. We want to improve EVE because that's what she deserves.

Thank you all for your patience and for sticking with us through thick and thin. We will make it up to you in the near future.