Messages not sending
Incident Report for Discord
Postmortem

We take all incidents that affect the reliability and quality of Discord’s service very seriously. We’d like to apologize for any issues experienced by our users today. We hope the transparency we provide in our public post-mortems leaves users confident that we experience the same frustration and disappointment during outages, and that we work hard to improve our service every day.

All times within this document are PDT.

Summary

At 15:28 a node for a service (the “guilds” service) which manages the real-time state and data for Discord’s millions of “guilds” (user-owned servers) experienced a host error, which caused it to immediately reboot. This reboot caused many millions of events to be triggered and sent to another service (the “sessions” service) cluster. Despite backpressure and circuit breakers within the code for this service, the influx of events triggered what we believe to be a bug within the Erlang VM. This bug ground the lower-level functions of the VM to a halt, causing nodes of the sessions cluster to become completely unresponsive.
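
As a brief aside on the protections mentioned above: the snippet below is a minimal, purely illustrative circuit-breaker sketch in Python. It is not Discord’s production code (our real-time services run on the Erlang VM), and the names used here (CircuitBreaker, call) are hypothetical. It shows the general pattern of a caller refusing to forward more work to a dependency once that dependency has failed repeatedly.

    import time

    class CircuitBreaker:
        """Illustrative circuit breaker: trips open after repeated failures,
        then rejects calls until a cool-down period has passed."""

        def __init__(self, failure_threshold=5, reset_timeout=30.0):
            self.failure_threshold = failure_threshold
            self.reset_timeout = reset_timeout
            self.failures = 0
            self.opened_at = None  # None means the breaker is closed (healthy)

        def call(self, fn, *args, **kwargs):
            # While open, reject immediately instead of piling work onto a sick service.
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_timeout:
                    raise RuntimeError("circuit open: downstream considered unhealthy")
                # Cool-down elapsed: allow a trial call ("half-open").
                self.opened_at = None
                self.failures = 0
            try:
                result = fn(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = time.monotonic()  # trip the breaker
                raise
            self.failures = 0  # a success resets the failure count
            return result

The limitation this incident highlighted is that guards like this live in application code; they cannot shed load generated beneath them, inside the VM’s own low-level machinery.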

Sequence of Events

  • 15:28 - A node for the guilds service encounters a host error and immediately reboots.
  • 15:32 - Members of our support team notice issues via reports over multiple channels. They immediately escalate the reports to our internal incident chat, alerting online engineers to the problem. Within seconds of this escalation, engineers begin investigating the scope and cause of the incident.
  • 15:33 - These engineers notice the aforementioned offline node that caused the initial incident and begin investigating the guilds service.
  • 15:36 - Our publicly viewable status page is updated to note an incident related to “messages not sending”.
  • 15:39 - Engineers notice that nodes of the sessions cluster are completely unresponsive.
  • 15:46 - After further investigation, engineers decide to restart nodes of the sessions cluster in an attempt to restore service.
  • 15:47 - All nodes of the sessions cluster are rebooted, disconnecting millions of concurrent users.
  • 15:50 - The status page is updated to note that the outage is causing server-loading issues as well.
  • 15:50 - Engineers note that service is beginning to recover as users reconnect.
  • 15:51 - Engineers notice that a separate service (the “presence” service) does not appear to be recovering alongside the other services.
  • 15:54 - Engineers discover that a specific process running on nodes within the presence service has died and failed to recover. They decide to reboot nodes of the presence cluster to force the process to restart.
  • 15:55 - All nodes of the presence cluster are rebooted.
  • 16:00 - Service for the majority of users recovers.
  • 16:06 - The status page is updated to note that service is recovering.

Investigation and Analysis

The initial issue causing this outage can be attributed to the guilds node experiencing a host error, which caused it to reboot. Despite the effort we put into building services that remain fault tolerant and reliable in the face of individual node failures, this single node issue caused a cascading failure within another cluster. The eventual decision to fully reboot the sessions cluster and reconnect all users was based on experience gained from previous incidents, and allowed service to recover completely.

One of our engineering team’s primary focuses is the constant effort of scaling our internal services to keep pace with the immense growth we’ve experienced. This incident exposed what we believe to be a bug in the Erlang VM which is only triggered at extremely high event velocity, such as that experienced when a node dies and forces a cluster to rebalance. In this case we believe the best path forward is to refactor code to avoid using low-level Erlang utilities that we cannot control or limit with traditional circuit breakers and backpressure.
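
To make the planned refactor a bit more concrete, here is a rough sketch of the general idea in Python rather than the Erlang code that actually runs in production; all names here (rebalance_queue, on_node_down, reassign_guild) are hypothetical. Instead of fanning out one event per affected guild the instant a node dies, the node-down handler pushes work onto a bounded queue drained by a small, fixed pool of workers, so the rebalance proceeds at a rate we control.

    import queue
    import threading

    # Illustrative only: drain node-failure work through a bounded queue so a
    # single node death cannot flood downstream services all at once.
    rebalance_queue = queue.Queue(maxsize=10_000)  # bounded => natural backpressure

    def on_node_down(affected_guild_ids):
        """Called when a guilds node dies; enqueue work instead of fanning out immediately."""
        for guild_id in affected_guild_ids:
            # put() blocks when the queue is full, pacing the producer
            rebalance_queue.put(guild_id)

    def reassign_guild(guild_id):
        # Placeholder for the real reassignment logic.
        pass

    def rebalance_worker():
        """One of a small, fixed pool of workers that re-homes guilds at a controlled rate."""
        while True:
            guild_id = rebalance_queue.get()
            try:
                reassign_guild(guild_id)
            finally:
                rebalance_queue.task_done()

    # A fixed number of workers caps the concurrency of the rebalance.
    for _ in range(8):
        threading.Thread(target=rebalance_worker, daemon=True).start()

The bounded queue and fixed worker count mean a burst of node-failure work is absorbed and drained gradually, rather than being pushed onto downstream clusters faster than they can react.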

Action Items / Response

  • Refactor code which handles node failures to avoid triggering a massive number of events within the cluster.
  • Further investigate the unrelated issues we experienced with the presence service. We believe there is a bug which caused the failure and lack of recovery of the previously mentioned process.
  • Improve alerting around failures like these (see the illustrative sketch below). The only alerts received by on-call engineers arrived well after the degradation in service had been observed.
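
As a rough illustration of the kind of alerting improvement we have in mind (a hypothetical heartbeat check written in Python, not our actual monitoring configuration), an unresponsive node should page on-call engineers within a minute or so rather than the problem surfacing first through user reports.

    import time

    HEARTBEAT_INTERVAL = 10.0   # seconds between expected node heartbeats
    UNRESPONSIVE_AFTER = 60.0   # page if a node has been silent this long

    last_heartbeat = {}  # node name -> timestamp of last heartbeat received

    def record_heartbeat(node):
        last_heartbeat[node] = time.monotonic()

    def check_nodes(page):
        """Run periodically; `page` is whatever notifies on-call engineers."""
        now = time.monotonic()
        for node, seen in last_heartbeat.items():
            if now - seen > UNRESPONSIVE_AFTER:
                page(f"{node} has not sent a heartbeat in {now - seen:.0f}s")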

In total this outage lasted around 20 minutes, with around 30 minutes of service degradation in various forms. As usual, although we’re happy with our team’s ability to quickly diagnose and repair issues, we’re always looking to do better and will be improving our processes and infrastructure around the items above.

Posted Jul 09, 2017 - 21:28 PDT

Resolved
The issue has been resolved. Our team has isolated the root cause and we will be posting a public post-mortem here as we collect all the necessary details. The outage started at 3:26 PM PDT and lasted until 4:01 PM PDT while our engineers triaged and fixed the issue. Our apologies for any inconvenience this issue may have caused.
Posted Jul 09, 2017 - 17:03 PDT
Monitoring
At this time we believe all issues related to message send failures and connection issues are resolved. We're monitoring while we continue to investigate the issue.
Posted Jul 09, 2017 - 16:06 PDT
Update
We're investigating why servers are not loading as well. The hamster wheel appears to be loose...
Posted Jul 09, 2017 - 15:50 PDT
Investigating
We're looking into an issue where messages are not sending.
Posted Jul 09, 2017 - 15:36 PDT