Crash Propagator

June 10, 2021 - Destiny Dev Team

I joined Bungie in 2013, a fair bit before we launched Destiny. I started on the networking team and, as with most new engineers at Bungie, I was tasked with looking at bugs in order to get used to the code base and start learning about our systems. Some networking bugs are very difficult to figure out because you’ll get a symptom on a client that actually starts elsewhere in the Destiny ecosystem. One of my first big Bungie tasks was to try to tackle this problem.

The Problem

Destiny has an extremely complex networking model. At any given time, your Destiny client could be connected to many other clients (consider the Tower), two to four activity hosts, two bubble hosts, and a world server. All these machines (this can end up being over 20 distinct processes) can generate actions that affect any of the other machines, so bugs can happen in very complicated ways! Usually players or testers (or asserts in code) only notice something going wrong on a single machine, which may only have a tiny piece of the puzzle. What to do? 

Enter the crash propagator: a tool we use to intentionally cause crashes at Bungie. Sound crazy? Read on!

Let’s start with an example. Imagine an AI combatant that doesn’t seem to be responding to anything the player does, including shooting it. That likely means that the client (the machine you’re playing Destiny 2 on) has become desynchronized with the bubble host (the server that keeps track of the active scripts and state for the part of the world that you’re playing in). It’s called a “bubble host” because “bubble” is our name for each area of a Destiny destination (the Moon, European Dead Zone, etc.) that you can play in without going through a transition to a new area.

So, our AI seems like it’s not doing what it should, and now it’s not clear if the problem is on the local client or the bubble host, or even something else! The problem might be caused by a setting that came down from our activity host, which is the server that controls the overall state of something that plays out over multiple bubbles, like a mission or a Strike.

Here is a diagram of what the Destiny ecosystem might look like. For simplicity’s sake, we’ve only included one world server (which is where your character information is stored); in the live game you might be connected to multiple of those.

The concept of crash propagation

The above example is just one way things could go wrong. In that example, you’d likely end up with a combatants engineer puzzling over a log from a client and trying to figure out what happened. Eventually they’d probably have to reproduce the bug themselves to get any good info.

What's the alternative?

Months (even years) of problems like this led us down the path of thinking, “Well, how could we get logs from every machine in the ecosystem?” We already have a way to get logs from a machine that crashes. What if we could get logs from every machine connected to the machine I’m interested in? That inspired the idea of the crash propagator: a system that sends a message to every machine it’s connected to, causing each of those machines (and possibly more) to crash. This allows our internal crash reporting system to gather data from all the machines in the same way that it does when the machines crash on their own. This gives us memory dumps we can examine as well as logs captured right up to the relevant moment. Engineers can comb through that data to gather important context from each machine that was involved with the crash.

How does it work?

Destiny has a robust and mature set of crash handling code that intercepts exceptions and failures and records as much data as it can before the game shuts down. The crash propagator was added on top of this and is enabled in internal builds of the Destiny client, the bubble host, and the activity host. In certain cases we define, a crash detected on any one of those can trigger the crash propagator.

Since the game crashing means that something has gone quite wrong, our network connections might be in an undefined state. This means that one of the things the crash propagator needs is a separate, low-level method of communication with all the machines it is in contact with. This lets us still have a robust communications channel during a crash, even if our more complex connections fail. So, whenever Destiny connects to a machine (whether it’s a Destiny client or an activity host or a bubble host) the crash propagator stores the IP address of the machine it’s connecting to in case it needs to send crash communications to that machine later. The crash propagator also sets up a listener to receive messages from other machines trying to propagate a crash.
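
To make the bookkeeping concrete, here is a minimal sketch of the "remember everyone we connect to" part. It is an illustration only, not the actual Destiny code: the names CrashPeerRegistry, PeerAddress, PeerRole, and on_connected are hypothetical, and the dedicated crash port is an assumption. The listener and the low-level send path are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical role tag for a remote process; the names are illustrative.
enum class PeerRole { Client, BubbleHost, ActivityHost };

struct PeerAddress {
    std::string ip;      // address recorded when the connection is made
    std::uint16_t port;  // assumed dedicated port for crash packets
    PeerRole role;

    // Peers are identified by address only, so duplicates collapse.
    bool operator<(const PeerAddress& other) const {
        return std::tie(ip, port) < std::tie(other.ip, other.port);
    }
};

// Hypothetical registry kept separate from the game's normal connection
// objects, so it stays usable even if those are in an undefined state.
class CrashPeerRegistry {
public:
    // Called whenever the game connects to another machine.
    void on_connected(PeerAddress peer) {
        assert(!peer.ip.empty());
        peers_.insert(std::move(peer));
    }

    // Copy of the stored peer list, taken when a crash must be propagated.
    std::vector<PeerAddress> snapshot() const {
        return std::vector<PeerAddress>(peers_.begin(), peers_.end());
    }

private:
    std::set<PeerAddress> peers_;
};
```

Keeping only plain addresses, rather than pointers into the regular networking layer, is the design point here: the crash path should not have to trust any state that the crash itself may have damaged.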

"Crashes" when you want them

Since the kind of information we gather during a crash is quite useful for debugging, we have a way to trigger it manually even when the game hasn’t really crashed. “bug_now” is our debug command that causes the game to trigger a false crash and upload a dump of memory, debug logs, and information for other tools that engineers can use to track down bugs (we also have “crash_now”, which intentionally dereferences a null pointer so we can test the full exception handling pipeline). With the advent of the crash propagator, we added a “bug_now_networking” flavor to crash the whole ecosystem!

Commonly, during a test pass or playtest, someone might see odd behavior that they think is related to networking. Someone shouts, “Quick, do a bug_now with networking!” That triggers everything in the game to come crashing down, while capturing the information we need to start an investigation. That means that the crash propagator has two ways to crash the ecosystem:
  • Engineers can specify that, for a given area of code, the crash propagator should always take over whenever anything in that code crashes. There is convenient markup that uses stack semantics (creating a variable on the stack with a constructor that “marks” the code and a destructor that “unmarks” the code) so that it’s essentially a one-liner to opt into the crash propagation system.
  • When investigating issues, anyone testing the game can force a propagated crash and capture with either the “bug_now” command or a complex combination of buttons on the controller that triggers the same behavior (due to the way you have to contort your hands to press these combinations, they’re called "debug claws" internally). It’s worth noting that, since the crash propagator is a debug-only feature, it’s stripped out of the code in the retail game so there’s no chance of you accidentally causing a crash by pushing the wrong buttons no matter how hard you try to make your own debug claws.
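
The stack-semantics markup from the first bullet can be sketched roughly as follows. This is a guess at the general shape, not the real markup: ScopedCrashPropagation, should_propagate_crash, and the per-thread depth counter are all hypothetical names chosen for illustration.

```cpp
#include <cassert>

// Hypothetical per-thread count of active "propagate on crash" scopes.
thread_local int g_propagation_scope_depth = 0;

// Constructing this object "marks" the enclosing scope; its destructor
// "unmarks" it, including when the scope exits early or by exception.
class ScopedCrashPropagation {
public:
    ScopedCrashPropagation() { ++g_propagation_scope_depth; }
    ~ScopedCrashPropagation() {
        assert(g_propagation_scope_depth > 0);
        --g_propagation_scope_depth;
    }

    // Copying a marker would unbalance the counter, so forbid it.
    ScopedCrashPropagation(const ScopedCrashPropagation&) = delete;
    ScopedCrashPropagation& operator=(const ScopedCrashPropagation&) = delete;
};

// What a crash handler might consult before deciding to propagate.
bool should_propagate_crash() { return g_propagation_scope_depth > 0; }

// Example of opting a piece of code in with a single line.
bool replicate_combatant_state() {
    ScopedCrashPropagation mark;      // the one-liner
    return should_propagate_crash();  // stands in for work that might crash
}
```

Because the marker lives on the stack, nested scopes compose naturally and nobody has to remember to switch propagation back off.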

How the sausage is made… err the crashes are propagated

Once a bug command is issued, the sequence of events is:
  • We check the flags that tell us whether the crash propagator is enabled for the build.
  • We retrieve the crash propagator’s stored list of processes we’re connected to.
  • To tell which crashes were triggered together, we generate a unique ID that can later be used to search for which crashes are related.
  • We then send a crash packet with that unique ID to every machine on the list. It also includes a number of “hops” which tells the receiver how far we want the crash propagation to reach (Two machines from my current client? Three?). In situations involving multiple fireteams and people playing different activities in the same area, you might need to propagate the crash farther than just the machines you were directly connected to. Otherwise, you might not be getting the full picture.
  • When the crash packet is received by another machine, it checks the number of hops. If there is at least one hop left to go, it removes a hop (for the one we just did), and spams out new crash packets to everyone the new machine is connected to.
  • After that, the machine is ready to crash itself using our crash handling tech. If this machine isn’t the original machine that crashed or issued the “bug_now_networking” command, we use a special description so testers know that this propagated crash isn’t a “real” crash that an engineer needs to investigate; it's just context for the original crash with the matching unique ID.
After all this happens, our entire ecosystem is crashed. Well, almost. As much as the crash propagator loves crashing things, we decided that the aforementioned world server should be excluded from the crash propagation system since the work it does is sufficiently separated from the client and activity host that it would rarely produce much useful information (plus it could really put a damper on studio-wide playtests to start having world servers go down).
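
The sequence above can be sketched as a small in-memory simulation. Everything here is an assumption made for illustration: the Machine, CrashPacket, and propagate_crash names, the 64-bit random ID, and the rule that a machine ignores packets once it has already crashed. Real packets travel over the network; a queue stands in for that here.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <map>
#include <random>
#include <string>
#include <vector>

struct CrashPacket {
    std::uint64_t crash_id;  // shared by every crash in the same event
    int hops_remaining;      // how many more times this may be forwarded
};

struct Machine {
    std::vector<std::string> peers;  // the propagator's stored peer list
    bool is_world_server = false;    // world servers are never crashed
    bool crashed = false;
    bool propagated = false;         // true: context only, not a "real" crash
    std::uint64_t crash_id = 0;
};

using Ecosystem = std::map<std::string, Machine>;

// Hypothetical ID generator; a random 64-bit value is an assumption.
std::uint64_t make_crash_id() {
    static std::mt19937_64 rng{std::random_device{}()};
    return rng();
}

void propagate_crash(Ecosystem& eco, const std::string& origin,
                     std::uint64_t crash_id, int hops) {
    assert(hops >= 0);
    struct Delivery { std::string to; CrashPacket packet; };
    std::deque<Delivery> in_flight;  // stands in for the network

    auto send_to_peers = [&](const Machine& sender, CrashPacket packet) {
        for (const std::string& peer : sender.peers)
            in_flight.push_back({peer, packet});
    };

    // The origin sends first, then crashes as the "real" crash.
    Machine& first = eco.at(origin);
    send_to_peers(first, {crash_id, hops});
    first.crashed = true;
    first.crash_id = crash_id;

    while (!in_flight.empty()) {
        Delivery d = in_flight.front();
        in_flight.pop_front();
        Machine& m = eco.at(d.to);
        if (m.is_world_server || m.crashed) continue;

        // At least one hop left: remove a hop and forward to our own peers.
        if (d.packet.hops_remaining >= 1)
            send_to_peers(m, {d.packet.crash_id, d.packet.hops_remaining - 1});

        m.crashed = true;
        m.propagated = true;  // gets the special "context only" description
        m.crash_id = d.packet.crash_id;
    }
}
```

Under this reading of the hop rule, a packet sent with one hop reaches the sender's direct peers and their peers, but no further. Calling propagate_crash(eco, "client_a", make_crash_id(), 1) on a client connected to a bubble host would therefore also crash the other clients and the activity host attached to that bubble host.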

How it turned out

Everything in Destiny is networked, and since the advent of the crash propagator we’ve saved quite a bit of time tracking down pesky bugs in many different parts of the game, including AI, gameplay, and basic networking problems. It’s a tool we routinely pull out of our toolbox during playtests.

One evolution of the system was to realize that it wasn’t always necessary to fully crash everything in the ecosystem. We now utilize separate pathways for a “lightweight” type of false crash where we only gather and upload logs and allow the game to keep running afterwards. The generated data is still tracked as an issue in our tools; it just has less information. This can still enable some types of investigations but is less disruptive to the people testing the game at the time.
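
The difference between the two pathways can be summarized in a tiny sketch. The CaptureMode and CapturePlan names are hypothetical; the three flags simply restate what is described above (a full crash gathers logs and a memory dump and ends the process, while the lightweight path uploads logs only and keeps the game running).

```cpp
// Hypothetical names; only the behavior per mode comes from the text above.
enum class CaptureMode { FullCrash, Lightweight };

struct CapturePlan {
    bool upload_logs;
    bool upload_memory_dump;
    bool terminate_process;
};

constexpr CapturePlan plan_for(CaptureMode mode) {
    return mode == CaptureMode::FullCrash
               ? CapturePlan{true, true, true}     // full false crash
               : CapturePlan{true, false, false};  // logs only, keep playing
}
```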

We continue to improve how we debug and investigate issues across our ecosystem of servers and clients. Crash propagation was an important advance for us, and we're excited to share future discoveries and insights (and crazy bugs) with you here in the Bungie Tech Blog!

- Adam Pino