We're marking this incident as resolved because the immediate issue has passed and we've remediated the affected cluster.
However, the underlying root cause is still being investigated. We can say that this is the same failure that has happened a couple of times now: we're seeing runaway memory utilization on a few sessions servers, which leads to them running out of memory. This causes a handful to crash, which forces their users to reconnect to the remaining servers, sometimes pushing those servers out of memory as well. In the worst instances of this behavior, the failure cascades to the entire cluster.
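To illustrate the cascade described above, here is a toy model in Python. It is a sketch under simplifying assumptions, not the actual sessions service: the function name `simulate_cascade`, the load units, the single shared capacity, and the even redistribution of reconnecting users are all hypothetical.

```python
def simulate_cascade(loads, capacity):
    """Toy model of a cascading out-of-memory failure.

    loads: per-server load in arbitrary (hypothetical) memory units.
    capacity: load above which a server runs out of memory and crashes.
    Returns the number of servers still running once the cluster settles.
    """
    alive = list(loads)
    while alive:
        crashed = [load for load in alive if load > capacity]
        if not crashed:
            break  # every remaining server is within capacity
        survivors = [load for load in alive if load <= capacity]
        if not survivors:
            return 0  # the failure has cascaded to the entire cluster
        # Assumption: users from crashed servers reconnect and are
        # spread evenly across the surviving servers.
        extra = sum(crashed) / len(survivors)
        alive = [load + extra for load in survivors]
    return len(alive)


# One overloaded server; the rest have headroom and absorb its users.
print(simulate_cascade([50, 50, 50, 95], capacity=90))
# Same overloaded server, but little headroom: the reconnects tip every
# remaining server over capacity.
print(simulate_cascade([70, 70, 70, 95], capacity=90))
```

In this model, the outcome depends on how much headroom the surviving servers have when the reconnects arrive, which is why the same initial crash can either be absorbed or take down the whole cluster.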
We currently have a lot of data and a couple of hypotheses, but we're still working to fully understand the root cause. Once we do, we'll write up a postmortem. Given the number of times this has happened in the past month, we are treating this as an important issue and putting a lot of the team on understanding what happened and how to properly resolve it for the long term.
Posted Jan 02, 2019 - 17:26 PST
We are continuing to investigate the root cause and to remediate the sessions machines that are low on remaining memory. We will update again when we have more information, but the service is currently stable.
Posted Jan 02, 2019 - 14:21 PST
We continue to see issues with sending direct messages and seeing the online status of your friends. We are still working on the issue.
Posted Jan 02, 2019 - 13:29 PST
We've identified an issue affecting our sessions service. We are taking action to remediate the problem.
Posted Jan 02, 2019 - 13:11 PST
We are aware of an incident affecting some users' ability to connect to Discord and send messages. The team is online and investigating.