
Login, Database & Chat Problems – The War Continues!

2018-08-01 - By CCP Falcon

Here at CCP, we’re always looking for ways in which we can increase our capacity to allow huge engagements such as the recent fight in UALX-3 to occur more smoothly and with no interruption.

We’ve always chosen the more challenging path when it comes to designing virtual worlds, focusing on single shard, open world sprawling environments where actions have consequences and where a single choice can ripple through the entire game environment, impacting the experiences of hundreds of thousands of players.

New Eden is a digital canvas, painted with fifteen years of incredible stories, including historic battles in 6VDT-H, M-OEE8, B-R5RB, Asakai, Nisuwa and many other systems that have become notorious for huge engagements and massive military action.

Every time an engagement of this magnitude occurs, we gather data that assists us in making changes that help us improve performance. This data also allows us to make informed choices regarding hardware upgrades such as those carried out when we replaced the entire infrastructure of the Tranquility cluster with the upgrade to Tranquility Tech III.

Despite upgrades however, since the start of 2018 we’ve experienced several issues that have affected Tranquility’s performance, not only in large scale engagements but in other areas too.

A timeline of issues

Initial problems began in November and December of 2017 with smaller database issues that caused minor extensions to daily downtime. This culminated in a series of three substantial cluster crashes in late February and March, in the run-up to the March release. During this initial set of issues, we tried various fixes that ultimately did not address the core problem.

When we arrived at the deployment date for the March 2018 release, which included the new chat backend, we once again suffered a series of database crashes on March 20th, deployment day itself.

This highlighted an entirely separate issue with the chat system: as pilots swarmed the server to log back in, their enthusiasm to reconnect to New Eden overwhelmed the new cloud-based chat cluster with connection requests. These events surfaced a scalability problem, along with several other issues that were rapidly resolved.
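(An aside for the technically curious: the standard defence against this kind of reconnection stampede is client-side exponential backoff with jitter, which spreads retry attempts out over time instead of letting every client hammer the service at once. The sketch below is illustrative only, not EVE's actual client code, and connect_to_chat is a hypothetical placeholder.)

```python
import random
import time

def connect_with_backoff(connect, max_attempts=8, base_delay=1.0, cap=60.0):
    """Retry `connect` with exponential backoff plus full jitter.

    Randomising each retry within a growing window stops every client
    from reconnecting at the same instant after an outage.
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            # The window doubles each attempt: 1s, 2s, 4s, ... capped at 60s.
            window = min(cap, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, window))
    raise ConnectionError("chat service unreachable after retries")

# Hypothetical usage: connect_to_chat would be the client's own dialer.
# session = connect_with_backoff(connect_to_chat)
```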

Stability continued to be an issue over the course of April, compounded by several more database issues that caused mass disconnects and reconnects of pilots, adding further load to the chat cluster.

To further complicate the situation, Tranquility was the target of a series of DDoS attacks between Fanfest 2018 and the release of EVE Online: Into The Abyss that added further stress to the system.

These attacks, combined with configuration issues in our DDoS mitigation service that didn’t account for the chat cluster now being hosted outside Tranquility’s infrastructure, meant that our own mitigation caused connectivity issues with the chat service until the configuration could be rectified.

Thankfully, the upgrade to SQL 2017 that we carried out in May resolved the database issues we were experiencing, and on that side of things it’s been much smoother sailing. The chat issues have, however, persisted, and are currently a primary focus for the team working on the new system.

Immediate performance problems were mostly resolved on deployment day, and the connection issues exposed during the DDoS attacks were fixed during that period too. While a significant number of problems have been taken care of, there are still lingering channel inconsistency issues that are currently being worked on.

Separate issues with the login servers then added a further level of complexity to the mix. These problems intensified after the release of Into The Abyss and added further stress to the chat system.

Issues with the servers that host our Single Sign On system, as well as the service that provides login tokens to the launcher, resulted in periods when pilots could see Tranquility online but couldn’t connect, because the launcher wasn’t receiving the correct information from our login servers.
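To make that failure mode concrete: server status and login status are separate health checks, so a launcher-style client has to treat "the token service is failing" differently from "the server is down". Here is a rough Python sketch of that distinction; the endpoint URL and payload shape are hypothetical stand-ins, not our actual SSO API.

```python
import json
import urllib.error
import urllib.request

# Hypothetical endpoint and payload shape -- not the real SSO API.
TOKEN_URL = "https://sso.example.com/token"

def fetch_login_token(refresh_token, timeout=5.0):
    """Ask the login service for a fresh access token.

    Returns None when the login path itself is failing -- the case
    where Tranquility looks online but pilots still cannot connect.
    """
    payload = json.dumps({"grant_type": "refresh_token",
                          "refresh_token": refresh_token}).encode()
    request = urllib.request.Request(
        TOKEN_URL, data=payload,
        headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.load(response).get("access_token")
    except (urllib.error.URLError, TimeoutError):
        # A healthy game cluster does not imply a healthy login path;
        # the two need separate health checks.
        return None
```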

All in all, a range of issues across several different areas of our infrastructure, managed by different specialist teams, coincided to create the perfect storm, which we’re still weathering and resolving as this blog is published.

And then of course, we add our bloodthirsty community into the mix!

So far, this summer has been one of the busiest we’ve seen in the last few years, and the conflict that’s ongoing has created more intense load than we were expecting on Tranquility.

With the addition of new Abyssal Deadspace systems, several large fights, and the ever-hungry needs of resource-harvesting players who spend hour upon hour churning out the raw materials, components and blueprints that fuel the war machine, we’ve seen a far more active summer than we anticipated.

So what’s the plan?

We want to continue to provide the tools for you to wage conflicts of biblical proportions.

We also want to make sure that we continue to have room to grow, to accommodate larger engagements and fiercer fighting.

Regardless of the level of activity that’s occurred over the course of the summer, we’re aware that the impact of these issues on community sentiment has been severe.

We’re working several angles to make sure that the performance and reliability of Tranquility improves immediately, and quickly returns to the level our pilots expect and deserve.

Database Issues:

The upgrade to SQL 2017 has resolved the issues that we were experiencing with the EVE Online database. We’re continuing to monitor performance and are looking at ways to further improve reliability and responsiveness.

DDoS Mitigation:

We’ve worked with our DDoS mitigation provider to ensure that when we next become the target of an attack, our infrastructure handles the traffic scrubbing process correctly and that our services aren’t adversely affected. Several configuration changes have been made in close collaboration with our partner to make sure that these issues do not recur.

Chat System Issues:

Work is ongoing to improve connectivity to the chat cluster and root out issues that are causing players to experience loss of connection. We’ve had two teams dedicated to looking into these issues over the course of May and June, and a third is now taking a fresh look into the more lingering connectivity and channel inconsistency issues.

Login Service Issues:

An engineering reliability task force has been formed at CCP to look at the issues with both the login service and the chat cluster. Their focus will be to look at improving reliability and reducing the occurrence of the issues that our pilots have been experiencing.

Adding More Hardware:

While more hardware isn’t always the right answer, adding more SOL nodes (the server blades that host solar systems in EVE) to Tranquility gives us more overhead to spread the load across more hardware.

This gives the cluster more breathing space in general and improves performance. While it doesn’t tackle the issue of load on a single node during a large-scale engagement, it allows us to assign other systems elsewhere and give these fights a little more horsepower and breathing room.
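For the curious, here is a toy illustration of what that extra overhead buys. A simple least-loaded (greedy) placement of solar systems across nodes shows how a bigger pool lowers the peak load per node, even though one huge fight still lands on a single node. The function name and load figures below are illustrative only, not our actual remapping logic.

```python
import heapq

def assign_systems(systems, node_count):
    """Greedily place solar systems on SOL nodes, least-loaded first.

    `systems` is a list of (name, expected_load) pairs. A bigger node
    pool lowers the peak load per node -- the breathing room described
    above -- though one huge fight still lands on a single node.
    """
    # Min-heap of (current_load, node_id); heaviest systems are placed first.
    nodes = [(0.0, node_id) for node_id in range(node_count)]
    heapq.heapify(nodes)
    placement = {}
    for name, load in sorted(systems, key=lambda s: -s[1]):
        node_load, node_id = heapq.heappop(nodes)
        placement[name] = node_id
        heapq.heappush(nodes, (node_load + load, node_id))
    return placement

# Illustrative numbers only: the big fight dominates one node regardless,
# but every other system spreads across the extra hardware.
demo = assign_systems([("UALX-3", 95.0), ("Jita", 60.0),
                       ("Amarr", 30.0), ("Dodixie", 20.0)], node_count=4)
```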

We usually have an additional Flex Chassis on standby to swap out hardware that’s in service should we need to perform maintenance. Due to the high load on the cluster right now, we’ve placed this chassis and all its nodes into rotation, both to give more horsepower to the hosting of the Alliance Tournament systems and to spread the load caused by higher-than-usual activity.

Buying More Hardware:

In addition to adding existing servers that we’d usually use for maintenance swap-outs, we’ve also spoken with our hardware partners, and we’re currently waiting on delivery of the following new hardware:

  • 4x SOL Nodes with 2x Intel Xeon Gold 5122 4C 105W 3.6GHz Processors.
  • 1x SOL Node with Intel Xeon Gold 6134 8C 130W 3.2GHz Processor for comparison.

These will be added to the cluster to test their performance and see how they handle dealing with so many spaceships; from there, we’ll be better informed on next steps.

Where do we go from here?

After the extended downtime in March caused by issues with the deployment of the new chat backend, and the problems that followed, we’re issuing a skillpoint gift, alongside the publication of this blog, to all pilots who were active at the time of the March release.

Pilots will find that every character on accounts that were considered active on March 20th, 2018 (the date of the March release and the subsequent extended downtime) has been issued a gift of 250,000 skillpoints.

(For reference, we define “active” as accounts that have logged in during the 30 days prior to a given date.)
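In code terms, that definition is just a 30-day window check; here is a minimal sketch, with hypothetical field names:

```python
from datetime import datetime, timedelta

def was_active(last_login: datetime, as_of: datetime) -> bool:
    """An account counts as active if it logged in within the 30 days
    before the given date -- here, the March 20th, 2018 release day."""
    return timedelta(0) <= as_of - last_login <= timedelta(days=30)

# Example: a login on March 1st qualifies for the March 20th cutoff.
assert was_active(datetime(2018, 3, 1), datetime(2018, 3, 20))
```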

While the war rages in New Eden for our pilots, the war for performance continues for us here at CCP.

We know that it’s a never-ending fight, but regardless, we’re pressed ever forward by the achievements and dedication of capsuleers across the globe.

We want the community to be aware that we fully understand concerns over reliability and performance and are working toward resolving these issues as soon as possible.

Our capsuleers deserve better, and we’re on a mission to make sure that performance and reliability return to the levels expected and continue to improve from there.

Our community is everything to us. We were ill prepared for such an increase in activity, and we let our pilots down.

For that, we sincerely apologise.

If there are any comments or questions regarding this blog, please feel free to jump on over to the comments thread on the official forums.