Upgrading Crash Reporting Systems | EVE Online

Detail-oriented Capsuleers,

New Eden is a complex universe with huge player-driven fights, a vast array of gameplay options, a massive player-controlled market, wormhole civilizations, and much more. Striving to find in-game advantages, players are constantly trying mechanics and interactions which were unplanned and emergent.

The broad range of player computer configurations has a similar complexity. Operating system versions (including patches), individual system configuration changes, driver versions, and the reliability of the underlying hardware all play a critical role in how the EVE Online client behaves.

Complex software can fail in complex ways, and client crashes not only take you out of the game unexpectedly but can also disrupt that critical moment you have been waiting for. In some cases, the cause is easy to find and a fix can be deployed quickly. In others, the crashes happen so rarely that they are very difficult to reproduce.

CCP takes client stability extremely seriously. We have recently invested in making our crash capturing ability the best it can be, which should result in us finding more crashes and ultimately reducing the number you experience. This, along with the bug reports you submit using F12, will help us keep the EVE client as stable as possible.


Upgraded reporting tech

When the client crashes, it will attempt to upload a crash report to a 24/7 crash monitoring system. After several years of using the well-known and popular Breakpad library from Google, we have selected a better solution: the Crashpad library. Note that no information is sent to Google with either library.

Breakpad sits within the normal EVE client executable. This means that a client crashing can also cause Breakpad to crash, resulting in a lost report. The same situation also happens if the EVE Client runs out of memory; the valuable crash report may be lost.

Crashpad solves this problem by sitting outside the program it is monitoring. This means you will soon see a separate process, appropriately named “eve_crashmon”, running alongside the EVE client.

[Image: eve_crashmon as it appears in Windows Task Manager]

This separate application uses no more memory or CPU than the old system it replaces, so your experience should remain the same.
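The benefit of an out-of-process monitor can be illustrated with a small Python sketch. This is not Crashpad's API or how eve_crashmon is implemented; the function name `run_monitored` and the stand-in "client" command are hypothetical, and a real crash handler captures far more than an exit code. The point is only that the watcher survives when the watched process dies abruptly.

```python
import subprocess
import sys


def run_monitored(cmd):
    """Launch cmd as a child process and report how it ended.

    The monitor runs in its own process, so it keeps working even if the
    child terminates without any chance to run its own cleanup code.
    """
    proc = subprocess.Popen(cmd)
    exit_code = proc.wait()
    # Assumption for this sketch: any non-zero exit code counts as a crash.
    return {"crashed": exit_code != 0, "exit_code": exit_code}


if __name__ == "__main__":
    # Hypothetical stand-in for a client that dies abruptly.
    # os._exit skips all cleanup, so an in-process handler would never run.
    fake_client = [sys.executable, "-c", "import os; os._exit(3)"]
    print(run_monitored(fake_client))
```

An in-process handler in the stand-in client above would have had no opportunity to record anything, while the parent still learns that the child ended abnormally.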


How crash reports are used

A crash report will typically fall into one of the five categories below:

  • The crash was caused by the client—confirmed with a reproduction.
  • The crash was likely caused by the client but with no confirmed reproduction.
  • The crash was caused by a device driver.
  • The crash was caused by external software interacting with the client.
  • The crash was likely caused by a hardware problem.

While it may sound counter-intuitive, having a crash confirmed as being caused by the client is the best-case scenario. There is a clearer path to implement a fix, test it, deploy it, and monitor the results.

If the client is likely at fault but the crash cannot be reproduced, your bug reports become especially valuable; more on these shortly.

Driver issues are normally easy to spot, as the crash reports will come from systems with the same GPU manufacturer and often the same driver version number. Gathering helpful information from bug reports to give to the GPU manufacturer reduces the time they need to release a fix.
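As a rough illustration of how such a pattern might be spotted, the sketch below groups crash reports by GPU vendor and driver version and flags any combination that accounts for a large share of them. The field names (`gpu_vendor`, `driver_version`), the sample data, and the 50% threshold are all assumptions made for this example, not a description of CCP's tooling.

```python
from collections import Counter

# Hypothetical crash report summaries; the field names are assumptions.
reports = [
    {"gpu_vendor": "VendorA", "driver_version": "1.2.3"},
    {"gpu_vendor": "VendorA", "driver_version": "1.2.3"},
    {"gpu_vendor": "VendorA", "driver_version": "1.2.3"},
    {"gpu_vendor": "VendorB", "driver_version": "9.0"},
]


def driver_clusters(reports, min_share=0.5):
    """Return ((vendor, driver), count) pairs covering at least min_share of reports."""
    counts = Counter((r["gpu_vendor"], r["driver_version"]) for r in reports)
    total = len(reports)
    return [
        (key, n)
        for key, n in counts.most_common()
        if total and n / total >= min_share
    ]


if __name__ == "__main__":
    for (vendor, driver), n in driver_clusters(reports):
        print(f"{vendor} driver {driver}: {n} of {len(reports)} reports")
```

A single vendor and driver pair dominating the reports for one crash is a hint, not proof, that the driver is involved; reproduction in-house would still be needed.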

Software that interacts with the EVE client occasionally causes crashes. These are often game overlays, screen recording software, or hardware monitoring software that show statistics like GPU, CPU, FPS, and temperature information. Always make sure you're running the latest version of the software as well as the latest GPU driver.

Finally, hardware issues can be extremely frustrating as they vary from the client running flawlessly for hours to crashing on startup. They often have no clear pattern.


Bug reports

Bug reports are one of the most useful tools for solving issues in EVE Online. There are two ways to file a bug report and one is much better than the other. You can go to support.eveonline.com and press the yellow button (helpful) or you can press F12 in game (incredibly helpful).

If you think you have found an issue, please file a bug report even when you believe that someone else has already done so.

This is particularly important for technical issues where being able to cross-reference different systems gives much better visibility. For example, if a set of bug reports arrives where graphics are displaying incorrectly and all the reports mention the same GPU manufacturer and series of video card, testing then focuses on those cards to reproduce the problem in-house. This naturally leads to a faster fix being tested, deployed, and monitored.

The key to a great bug report is to include as much information as you can. A description of the issue, along with possible reproduction steps, enables rapid replication.

When it comes to a client crash, it is extremely valuable to have a bug report filed if the client crashed because of a user action, for example when changing from borderless window mode to full screen causes a repeatable crash. Crash reports are not infallible and may be missing a critical piece of the puzzle that you can provide. Filing the report using the F12 in-game menu will include the important client logs about the crash too, even after a restart.


Test lab

As part of CCP’s recent move into new offices, a new test lab was built to our specific requirements. This included increased available space, a more powerful air-conditioning system, and better storage for lots of testing hardware.

This is where fixes are tested before being deployed, especially crash-related or graphics-related fixes. The lab tests against more than one CPU and GPU family, along with different versions of the Windows and Mac operating systems.

The test lab is not just used for client bug fix testing, but also for performance and compatibility testing. Since EVE Online is run on so many different kinds of machines, this test lab checks that new releases work on the most popular configurations and that performance levels fall within expected values.

[Images: TestLab1, TestLab2]

The test lab will continue to evolve in tandem with upgrades to the underlying code of the EVE Online client, including the move to DirectX 12 in the future.


Keeping your software up to date

Not every crash can be replicated in the test lab. Keeping your OS up-to-date is important not just for security, but also for ensuring that operating system defects get fixed. Both Windows and Mac will automatically update on a regular basis and players are encouraged to run the latest OS patches where possible.

Device drivers are a little more complex, as they sometimes need to be updated manually to the latest version. Reports frequently arrive from users who are running a graphics card driver that is several years old, and their issue is fixed by simply updating to the current driver.


Hardware troubles

One of the most difficult cases is the suspected hardware fault. Without having access to the physical machine, these are difficult to verify. Given enough time, every component in your computer will fail. When these failures occur, they normally show as a one-off crash report that has been caused by a situation that should be impossible for the client to find itself in.

While troubleshooting PC hardware issues can be complex, five of the more frequent causes of hardware-related game crashes are:

  • GPU issues: Sometimes these will be shown as a corrupted image with unexpected colors, or a checker box effect being rendered. At other times, a game may drop back to the desktop with a “Display driver stopped responding and has recovered” message – although this is not always hardware related.
  • RAM issues: When RAM has failed, random crashing will often occur, or you will experience BSODs (Blue Screens of Death). If you have a large amount of memory with a single failed address, it may cause an issue only very infrequently.
  • Overclocking: This has become more popular in the last few years due to motherboard manufacturers making it easier. While overclocking can offer rewarding performance increases, it does come with stability risks. If you encounter crashing issues with an overclock, it is recommended to return the system to stock values, and see if the issue persists.
  • PSU issues: Unfortunately, a bad power supply can make almost anything in your system look like it is at fault. It is recommended to use a quality power supply to ensure long term system stability. If a computer often crashes when under high load, then the power unit is something that should be investigated.
  • Overheating issues: These are usually caused by an excessive amount of dust being present or fans not sufficiently cooling due to bearing failure. Depending on what component is overheating, overheating can cause anything from poor performance to complete application crashing.

If you suspect you may have faulty hardware, thankfully there are many free programs to test your system stability. Popular recommendations are memtest86 for RAM and OCCT for general system stability. If you encounter any failures during system testing, the EVE client is likely to also experience problems at some point.


What does the future hold?

Our current crash reports include lots of useful debugging information, which significantly helps us resolve issues. However, they cannot provide context on wider questions, such as how many players are experiencing a given problem. CCP really wants to know “What percentage of players have experienced a new (previously unknown) crash so far today since the patch yesterday, and how does this compare to last week?” While we can find this information out with our current systems, it does involve the manual work of linking these events together.
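To make the question above concrete, here is a minimal sketch of the calculation, assuming crash events have already been linked to players and given a crash signature. The function name, the event fields (`player_id`, `signature`, `time`), and the sample data are hypothetical and only illustrate the arithmetic.

```python
from datetime import datetime


def new_crash_percentage(crash_events, active_players, known_signatures, since):
    """Percent of active players with at least one previously unknown crash at or after `since`."""
    if not active_players:
        return 0.0
    affected = {
        event["player_id"]
        for event in crash_events
        if event["time"] >= since
        and event["signature"] not in known_signatures
        and event["player_id"] in active_players
    }
    return 100.0 * len(affected) / len(active_players)


# Hypothetical sample data for illustration only.
patch_time = datetime(2024, 1, 2, 11, 0)
events = [
    {"player_id": 1, "signature": "sig_new", "time": datetime(2024, 1, 2, 12, 0)},
    # Same player crashing again is counted once.
    {"player_id": 1, "signature": "sig_new", "time": datetime(2024, 1, 2, 13, 0)},
    # Already known crash, so not counted as new.
    {"player_id": 2, "signature": "sig_old", "time": datetime(2024, 1, 2, 12, 30)},
    # Happened before the patch, so outside the window.
    {"player_id": 3, "signature": "sig_new", "time": datetime(2024, 1, 1, 9, 0)},
]
rate = new_crash_percentage(events, {1, 2, 3, 4}, {"sig_old"}, patch_time)
print(f"{rate:.1f}% of active players hit a new crash since the patch")
```

Running the same function over last week's events and player set would give the comparison figure; the hard part in practice is the linking of events, which this sketch takes as given.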

In the future, we want dashboards showing us complex situations unfolding as they happen. To do this, we would need to monitor the health of all EVE Online clients in real-time. The move to Crashpad and having a crash monitoring process outside of the EVE Client is one of the first steps toward this.

Using Crashpad with Sentry as the endpoint for our crash reports allows us to monitor, filter, and take action quickly. To improve responsiveness further, the next step is to integrate some of the recent improvements from the Sentry native SDK.

Thank you and remember to press F12 to file a bug report if needed.