Earlier today (Tuesday, February 11th), after the launch of Hotfix 2.7.1.1, we were made aware of the re-emergence of an issue which caused a small percentage of players to lose currency and materials. This comes after the first occurrence of this issue, which caused all players to lose currency and materials on January 28th with the launch of Hotfix 2.7.1, and resulted in player account rollbacks. With today’s incident, we have taken similar steps and rolled back accounts to the state they were in as of 8:30 AM PST (before the launch of 2.7.1.1).
Since both of these incidents are identical in cause and in their effect on our players, and because they happened so close together, we wanted to give you a picture of what went wrong, how we fixed it, and how we’re planning on making sure this doesn’t happen again. First, let’s look at what caused this problem in the first place: a game bug involving inventory management and a series of server configurations that re-introduced the bug after it was fixed.
In Destiny 2, quests are treated similarly to other inventory items, such as currency and materials. All items have a timestamp, based on when they were first added to a player’s inventory. This timestamp is used to sort quests in the order in which they were acquired. The game cleans up a player’s inventory upon each login, to make sure it is consistent with any changes to content, such as the maximum number of items of a particular type the player can carry.
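To make the two mechanisms in this paragraph concrete, here is a minimal sketch of a timestamped inventory with a login-time clean-up. It is an illustration only, not the game's actual code: the `Item` fields, the `clean_up` and `quest_log` functions, the item names, and the cap values are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Item:
    name: str
    quantity: int
    acquired_at: int      # timestamp set when the item first entered the inventory
    is_quest: bool = False


def clean_up(inventory: List[Item], caps: Dict[str, int]) -> None:
    """Hypothetical login-time clean-up: clamp each stack to its current cap."""
    for item in inventory:
        cap = caps.get(item.name)
        if cap is not None and item.quantity > cap:
            item.quantity = cap   # anything above the cap is discarded
        # acquired_at is left untouched so chronological sorting keeps working


def quest_log(inventory: List[Item]) -> List[str]:
    """Quest names in the order they were acquired (oldest first)."""
    quests = [item for item in inventory if item.is_quest]
    return [q.name for q in sorted(quests, key=lambda q: q.acquired_at)]


inventory = [
    Item("Quest B", 1, acquired_at=200, is_quest=True),
    Item("Currency X", 120, acquired_at=50),
    Item("Quest A", 1, acquired_at=100, is_quest=True),
]
clean_up(inventory, caps={"Currency X": 100})
print(quest_log(inventory))
```

In this sketch the clean-up touches quantities only, so the quest order survives every login. The clamp is also the one destructive step: if it is ever run against the wrong cap, the excess is gone.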
Several months ago, players reported that quest log sorting wasn’t working properly, and we wanted to fix that. The team investigated and found that the clean-up process was resetting the timestamp on a subset of quests, which was breaking chronological sorting. We decided to fix this by disabling the timestamp-resetting behavior for quests. That fix was conceptually reasonable but, through subtle side effects, it ended up disabling too much of the clean-up process. The net result was that the game calculated the wrong cap quantity for stacked items (such as currencies and materials), which caused items above the cap to be lost. We knew this code was critical and, per our typical process, we had two domain experts provide code reviews for the change – but sadly, we didn’t spot the bug.
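The exact side effect is not spelled out here, so the following is a purely hypothetical sketch of the general failure mode: a clamp that is correct in itself becomes destructive once it is handed the wrong cap. The `cap_step_ran` flag, the `FALLBACK_CAP` value, and the item name are invented for illustration and do not describe the real code path.

```python
FALLBACK_CAP = 10  # hypothetical stale value used when the real cap was never computed


def resolve_cap(item_name, content_caps, cap_step_ran):
    """Return the cap the clean-up will enforce for a stack.

    Assumption for this sketch: if the step that derives caps from current
    content is skipped, the clean-up falls back to a lower, incorrect value.
    """
    if cap_step_ran:
        return content_caps[item_name]
    return FALLBACK_CAP


def clamp_stack(quantity, cap):
    """Clamp a stack to its cap and report the result as (kept, lost)."""
    kept = min(quantity, cap)
    return kept, quantity - kept


content_caps = {"Material X": 9999}

good = clamp_stack(5000, resolve_cap("Material X", content_caps, cap_step_ran=True))
bad = clamp_stack(5000, resolve_cap("Material X", content_caps, cap_step_ran=False))
print(good, bad)
```

With the correct cap nothing is lost. With the wrong one, the same clamp silently discards most of the stack, and nothing in the clamp itself looks wrong in review.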
A few days later, our internal test teams caught this issue. However, we incorrectly concluded that it was caused by a tooling failure with debug workflows we use for testing, and not an actual bug within the game. Having dodged all our diligence, the issue went live in 2.7.1. Once the bug was identified in the live game, the next step was to figure out how to fix it, which leads us to the next discussion: game servers and their configurations.
Before every major release (for example, Shadowkeep), we do comprehensive stress testing to try and model user behavior and its impact on our service architecture. Because there’s no substitute for millions of real player behaviors, we supplement this testing by closely monitoring service metrics after launch.
Back in October, in order to handle increased CPU and player load for Shadowkeep’s launch, we spun up additional servers (in this case, called WorldServers); more servers, in fact, than we have ever used before for this task. Running with this many servers has had some small side effects that we were tracking but that were generally invisible to players. For example, a small percentage (less than 1 percent) of these servers would crash on start-up because the volume of servers overwhelmed one of the backing databases. Our workaround was to manually restart the crashed servers each time we detected this issue, and this appeared to address the problem without any discernible side effects for players.
Fast-forward to two weeks ago. The 2.7.1 update had the aforementioned bug that caused character data corruption and resulted in our first ever rollback of character data. To fix that issue quickly, we applied a patch to the servers instead of trying to get a full build of the game code deployed. This involved making a change to a server setting to override the game code used to process character data and then restarting the WorldServers to pick up that change.
Fast-forward again to today, February 11th, when we rolled out the 2.7.1.1 update coinciding with the launch of Crimson Days. After launch, some of the WorldServers once again crashed on startup because of a high volume of servers starting simultaneously. Once again we manually restarted those servers and thought everything was fine. We were wrong.
Unbeknownst to us, this crash resulted in those WorldServers not applying the previous character data corruption fix. This meant that a small percentage of WorldServers were running the old code and the bug that was corrupting character data. We have verification systems that detect these sorts of version misconfigurations, but the WorldServer crashes and subsequent manual restarts caused the servers to also skip the verification process. Prior to this morning, we had believed skipping these overrides and verifications to be impossible.
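The failure mode described above can be modeled in a few lines. This is a sketch under stated assumptions, not a description of the real service: the `WorldServer` fields, the `EXPECTED_OVERRIDE` name, and both start-up functions are hypothetical stand-ins for "the override was applied" and "verification ran".

```python
from dataclasses import dataclass
from typing import List, Optional

EXPECTED_OVERRIDE = "character-data-fix"  # hypothetical name for the server-side patch


@dataclass
class WorldServer:
    name: str
    override: Optional[str] = None   # which override, if any, the server loaded
    verified: bool = False           # whether the version check ran on start-up


def normal_start(server: WorldServer) -> None:
    """Normal path: the override is applied and the verification step runs."""
    server.override = EXPECTED_OVERRIDE
    server.verified = True


def restart_after_crash(server: WorldServer) -> None:
    """Models the incident: the restart path skips both override and verification."""
    server.override = None
    server.verified = False


def misconfigured(fleet: List[WorldServer]) -> List[str]:
    """Names of servers that should not be serving character data."""
    return [
        s.name for s in fleet
        if not s.verified or s.override != EXPECTED_OVERRIDE
    ]


fleet = [WorldServer(f"ws-{i}") for i in range(5)]
for server in fleet:
    normal_start(server)
restart_after_crash(fleet[3])   # one server crashed and was restarted by hand
print(misconfigured(fleet))
```

The point of the sketch is that a check performed only inside the start-up path disappears along with that path. A check that inspects the running fleet from the outside, like `misconfigured` here, would still flag the restarted server.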
As a part of our standard practices of verifying a new build, we also have our test teams log in with a number of test accounts in order to verify the player experience. Because we have hundreds of servers in our retail environment, every manual test we performed was (un)lucky enough to hit the “good” servers, and all of them missed the small percentage of servers that were in a bad state. So we gave the all clear.
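A rough calculation shows why a handful of manual logins is unlikely to catch a problem confined to a small share of servers. The numbers below are assumptions chosen for illustration (1 percent of servers bad, 20 test logins, each login landing on an independent, uniformly random server); the actual figures are not stated here.

```python
import math


def miss_probability(bad_fraction: float, test_logins: int) -> float:
    """Chance that every test login lands on a healthy server."""
    return (1.0 - bad_fraction) ** test_logins


def logins_needed(bad_fraction: float, detection_chance: float) -> int:
    """Smallest number of logins that hits a bad server with the given chance."""
    return math.ceil(math.log(1.0 - detection_chance) / math.log(1.0 - bad_fraction))


p_miss = miss_probability(bad_fraction=0.01, test_logins=20)
needed = logins_needed(bad_fraction=0.01, detection_chance=0.95)
print(f"chance all 20 logins miss: {p_miss:.3f}")
print(f"logins needed for a 95% chance of detection: {needed}")
```

Under these assumptions, 20 logins miss the bad servers about 82 percent of the time, and it would take roughly 300 logins to have a 95 percent chance of hitting one. Manual spot checks therefore say little about a fault that affects only a small fraction of a large fleet.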
Today, as the game came back on after the 2.7.1.1 deployment, we started seeing player reports of lost currency. The team began investigating immediately and took the game down at 10:30 AM PST. In that time, hundreds of thousands of unique players had logged into the game or accessed their characters through a third-party service. Our investigation uncovered what we thought was an impossible situation: a small number of our WorldServers had loaded without the correct configuration that fixed the corruption issues from 2.7.1. Unfortunately, anyone whose characters had been accessed using one of these out-of-date servers encountered the character-corrupting problem.
Once we determined this was the same issue that occurred on January 28th and we understood how it happened, the team decided that our best path forward – rather than trying to identify each affected account and risk missing something in the process – was to restore all character data from the backup that took place just before the 2.7.1.1 patch rolled out.
The team has identified a number of additional safeguards that should prevent this particular issue from happening in the future.