I joined Bungie in 2013, a fair bit before we launched Destiny. I started on
the networking team and, as with most new engineers at Bungie, I was tasked
with looking at bugs in order to get used to the code base and start learning
about our systems. Some networking bugs are very difficult to figure out
because you’ll get a symptom on a client that actually starts elsewhere in the
Destiny ecosystem. One of my first big Bungie tasks was to try to tackle this
problem.
Destiny has an extremely complex networking model. At any given time, your
Destiny client could be connected to many other clients (consider the Tower),
two to four activity hosts, two bubble hosts, and a world server. All these
machines (this can end up being over 20 distinct processes) can generate
actions that affect any of the other machines, so bugs can happen in very
complicated ways! Usually players or testers (or asserts in code) only notice
something going wrong on a single machine, which may only have a tiny piece of
the puzzle. What to do?
Enter the crash propagator: a tool we use to intentionally cause crashes at
Bungie. Sound crazy? Read on!
Let’s start with an example. Imagine an AI combatant that doesn’t seem to be
responding to anything the player does, including shooting it. That likely
means the client (the machine you’re playing Destiny 2 on) has become
desynchronized with the bubble host (the server that keeps track of the active
scripts and state for the part of the world that you’re playing in). It’s
called a “bubble host” because we call each area of a Destiny destination (the
Moon, European Dead Zone, etc.) that you play in without going through a
transition to a new area a “bubble.”
So, our AI seems like it’s not doing what it should, and now it’s not clear if
the problem is on the local client or the bubble host, or even something else!
The problem might be caused by a setting that came down from our activity
host, which is the server that controls the overall state of something that
plays out over multiple bubbles, like a mission or a Strike.
Here is a diagram of what the Destiny ecosystem might look like. For
simplicity’s sake, we’ve only included one world server (which is where your
character information is stored); in the live game you might be connected to
multiple of those.
The concept of crash propagation
The above example is just one way things could go wrong. In that example,
you’d likely end up with a combatants engineer puzzling over a log from a
client and trying to figure out what happened. Eventually they’d probably
have to reproduce the bug themselves to get any good info.
What's the alternative?
Months (even years) of problems like this led us down the path of thinking,
“Well, how could we get logs from every machine in the ecosystem?” We
already have a way to get logs for a machine that crashes. What if we could
get logs from every machine connected to the machine I’m interested in? That
inspired the idea of the crash propagator: a system that sends a message to
every machine it’s connected to, causing that machine (and possibly more)
to crash. This allows our internal crash reporting system to gather data
from all the machines in the same way that we do when the machines crash on
their own, giving us memory dumps we can examine as well as logs
captured right up to the relevant moment. Engineers can comb through that
data to gather important context from each machine that was involved.
How does it work?
Destiny has a robust and mature set of crash handling code that intercepts
exceptions and failures and records as much data as it can before the game
shuts down. The crash propagator was added on top of this and is enabled in
internal builds of the Destiny client, the bubble host, and the activity
host. In certain cases we define, a crash detected on any one of those can
trigger the crash propagator.
Since the game crashing means that something has gone quite wrong, our
network connections might be in an undefined state. This means that one of
the things the crash propagator needs is a separate, low-level method of
communication with all the machines it is in contact with. This lets us
still have a robust communications channel during a crash, even if our more
complex connections fail. So, whenever Destiny connects to a machine
(whether it’s a Destiny client or an activity host or a bubble host) the
crash propagator stores the IP address of the machine it’s connecting to in
case it needs to send crash communications to that machine later. The crash
propagator also sets up a listener to receive messages from other machines
trying to propagate a crash.
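To make the bookkeeping concrete, here is a minimal sketch of the "remember everyone we connect to" half of that design. It is an illustration only: the class name `CrashPeerRegistry` and the `on_connected` hook are hypothetical, not Destiny's actual code, and the listener and the low-level channel itself are left out.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical registry (not Bungie's real API): remembers the address of
// every machine this process has connected to, independently of the game's
// higher-level connections.
class CrashPeerRegistry {
public:
    // Assumed hook, called whenever a connection to another machine is made.
    void on_connected(const std::string& ip_address) {
        peers_.insert(ip_address);
    }

    // Addresses to notify if this process later needs to propagate a crash.
    const std::set<std::string>& peers() const { return peers_; }

private:
    std::set<std::string> peers_;  // std::set ignores repeated connections
};
```

Keeping this list as plain addresses, rather than as handles into the regular connection objects, is what lets it stay usable when those connections are in an undefined state.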
"Crashes" when you want them
Since the kind of information we gather during a crash is quite useful for
debugging, we have a way to trigger it manually even when the game hasn’t
really crashed. “bug_now” is our debug command that causes the game to
trigger a false crash and upload a dump of memory, debug logs, and
information for other tools that engineers can use to track down bugs (we
also have “crash_now”, which intentionally dereferences a null pointer so we
can test the full exception handling pipeline). With the advent of the crash
propagator, we added a “bug_now_networking” flavor to crash the whole
ecosystem.
Commonly, during a test pass or playtest, someone might see odd behavior
that they think is related to networking. Someone shouts, “Quick, do a
bug_now with networking!” That triggers everything in the game to come
crashing down, while capturing the information we need to start an
investigation. That means there are two ways to trigger the crash
propagator:
- Engineers can specify that, for a given area of code, the crash
  propagator should always take over whenever anything in that code
  crashes. There is convenient markup that uses stack semantics
  (creating a variable on the stack with a constructor that “marks” the
  code and a destructor that “unmarks” the code), so it’s essentially
  a one-liner to opt into the crash propagation system.
- When investigating issues, anyone testing the game can force a
  propagated crash and capture with either the “bug_now” command or a
  complex combination of buttons on the controller that triggers the same
  behavior (due to the way you have to contort your hands to press these
  combinations, they’re called "debug claws" internally). It’s worth
  noting that, since the crash propagator is a debug-only feature, it’s
  stripped out of the code in the retail game, so there’s no chance of you
  accidentally causing a crash by pushing the wrong buttons, no matter how
  hard you try to make your own debug claws.
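The stack-semantics markup in the first option is the familiar C++ RAII pattern. Below is a minimal sketch of how such a marker could look. Every name here (`ScopedCrashPropagation`, `should_propagate_crash`, `process_network_update`) is made up for illustration, and the per-thread counter is an assumption about how the "marked" state might be tracked.

```cpp
#include <cassert>

// Hypothetical per-thread count of live "propagate on crash" markers.
thread_local int g_propagation_scope_depth = 0;

// Marks the enclosing scope on construction and unmarks it on destruction.
class ScopedCrashPropagation {
public:
    ScopedCrashPropagation() { ++g_propagation_scope_depth; }
    ~ScopedCrashPropagation() { --g_propagation_scope_depth; }

    // Non-copyable so the counter always matches the number of live markers.
    ScopedCrashPropagation(const ScopedCrashPropagation&) = delete;
    ScopedCrashPropagation& operator=(const ScopedCrashPropagation&) = delete;
};

// A crash handler could call this to decide whether to propagate.
bool should_propagate_crash() { return g_propagation_scope_depth > 0; }

// Hypothetical networking function showing the one-line opt-in.
bool process_network_update() {
    ScopedCrashPropagation propagate_crashes_here;  // the one-liner
    return should_propagate_crash();  // true while the marker is alive
}
```

Because the destructor runs on every exit path, including early returns, the code is unmarked automatically and nobody has to remember to undo the opt-in.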
How the sausage is made… err the crashes are propagated
Once a bug command is issued, the sequence of events is:
1. We check the flags that tell us whether the crash propagator is enabled
   for the build.
2. We retrieve the crash propagator’s stored list of processes we’re
   connected to.
3. To tell which crashes were triggered together, we generate a unique ID
   that can later be used to search for related crashes.
4. We then send a crash packet with that unique ID to every machine on the
   list. The packet also includes a number of “hops”, which tells the
   receiver how far we want the crash propagation to reach (Two machines
   from my current client? Three?). In situations involving multiple
   fireteams and people playing different activities in the same area, you
   might need to propagate the crash farther than just the machines you
   were directly connected to. Otherwise, you might not be getting the full
   picture.
5. When the crash packet is received by another machine, it checks the
   number of hops. If there is at least one hop left to go, it removes a
   hop (for the one we just did) and spams out new crash packets to
   everyone the new machine is connected to.
6. After that, the machine is ready to crash itself using our crash
   handling tech. If this machine isn’t the original machine that crashed
   or ran “bug_now_networking”, we use a special description so testers
   know that this propagated crash isn’t a “real” crash that an
   engineer needs to investigate; it’s just context for the original crash
   with the matching unique ID.
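The hop-limited fan-out is easier to see in a small simulation. The sketch below models machines as entries in an in-memory map rather than real processes, so it is an illustration under assumptions and not the actual implementation. All type and function names are hypothetical, the "enabled" flag check is omitted, and the hop count is assumed to mean "how many more times this packet may be forwarded". Packets are delivered in the order they were sent, and a machine that has already crashed ignores later packets.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <map>
#include <random>
#include <string>
#include <utility>
#include <vector>

// Hypothetical packet layout for this sketch.
struct CrashPacket {
    std::uint64_t crash_id;  // shared by every crash in one propagation
    int hops_remaining;      // how many more times the packet may be forwarded
};

struct Machine {
    std::vector<std::string> peers;  // stored list of connected machines
    bool crashed = false;
    std::uint64_t crash_id = 0;
    std::string description;
};

using Network = std::map<std::string, Machine>;

// Random 64-bit value standing in for the unique crash ID.
std::uint64_t make_crash_id() {
    std::random_device seed;
    std::mt19937_64 generator(seed());
    return generator();
}

// Stand-in for the real crash handler: record why this machine went down.
void crash(Machine& machine, std::uint64_t crash_id, const std::string& why) {
    machine.crashed = true;
    machine.crash_id = crash_id;
    machine.description = why;
}

// Runs the sequence starting at `origin` and returns the crash ID it used.
std::uint64_t propagate_crash(Network& network, const std::string& origin, int hops) {
    const std::uint64_t crash_id = make_crash_id();
    std::deque<std::pair<std::string, CrashPacket>> in_flight;  // (receiver, packet)

    Machine& source = network.at(origin);
    for (const std::string& peer : source.peers) {
        in_flight.push_back({peer, CrashPacket{crash_id, hops}});
    }
    crash(source, crash_id, "original crash");

    while (!in_flight.empty()) {
        const std::pair<std::string, CrashPacket> next = in_flight.front();
        in_flight.pop_front();
        Machine& receiver = network.at(next.first);
        if (receiver.crashed) continue;  // already down; ignore the packet

        if (next.second.hops_remaining >= 1) {
            // Remove the hop just taken and forward to everyone we know about.
            for (const std::string& peer : receiver.peers) {
                in_flight.push_back(
                    {peer, CrashPacket{crash_id, next.second.hops_remaining - 1}});
            }
        }
        crash(receiver, crash_id, "propagated crash (context only)");
    }
    return crash_id;
}
```

In this model, starting a propagation with one hop from a client takes down its direct peers and their peers, and every affected machine records the same crash ID so the reports can be grouped afterwards.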
After all this happens, our entire ecosystem is crashed. Well, almost. As
much as the crash propagator loves crashing things, we decided that the
aforementioned world server should be excluded from the crash propagation
system since the work it does is sufficiently separated from the client and
activity host that it would rarely produce much useful information (plus it
could really put a damper on studio-wide playtests to start having world
servers go down).
How it turned out
Everything in Destiny is networked, and with the advent of the crash
propagator, we’ve saved quite a bit of time tracking down pesky bugs from
all parts of the game, including AI, gameplay, and basic networking
problems. It’s a tool we routinely pull out of our toolbox during playtests.
One evolution of the system was to realize that it wasn’t always necessary
to fully crash everything in the ecosystem. We now utilize separate pathways
for a “lightweight” type of false crash where we only gather and upload logs
and allow the game to keep running afterwards. The generated data is still
tracked as an issue in our tools; it just has less information. This can
still enable some types of investigations but is less disruptive to the
people testing the game at the time.
We continue to improve how we debug and investigate issues across our
ecosystem of servers and clients. Crash propagation was an important advance
for us, and we're excited to share future discoveries and insights (and
crazy bugs) with you here in the Bungie Tech Blog!
- Adam Pino