We've rolled out fixes that should address the platform instability. We observed several distinct failure modes and have rolled out mitigations for each of them; however, the root cause of the initial failures is still under investigation.
Discord runs a cluster of Redis nodes that we use for caching, namely push notification settings and partial user objects. At peak, this cache cluster serves on the order of multiple millions of queries per second to power Discord. During peak hours today and throughout the past few hours, throughput on 4 of our cache nodes dropped and remained degraded. We ejected some cache nodes from the cluster in an attempt to remediate the issue; however, after ejecting 2 and continuing to observe similar failures on other nodes, we decided not to eject any more, as a further reduction in cache-cluster availability would have hurt service availability even more.
Normally, when a cache node is degraded or unavailable, we have circuit breakers in place which degrade low-priority tasks while preserving platform availability. However, there was a bug in our circuit-breaking code, and 2 bugs downstream of the circuit breaker, that together caused a thundering herd on the upstream data store for these objects (MongoDB), causing it to become overloaded and flap in availability. We've since deployed fixes to our API service layer to prevent the thundering herd issue that adversely affected platform stability.
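For readers unfamiliar with the pattern, the following is a minimal sketch of a circuit breaker of the kind described above. It is illustrative only and is not Discord's implementation: the class name `CircuitBreaker`, the parameters `failure_threshold` and `reset_after`, and the fallback behaviour are all assumptions made for this example. The point it shows is that once a cache node has failed repeatedly, callers get a cheap fallback instead of every request falling through to the upstream data store.

```python
import time


class CircuitBreaker:
    """Hypothetical breaker guarding calls to a single cache node."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_after = reset_after              # seconds to wait before a trial call
        self.clock = clock                          # injectable clock, useful for testing
        self.failures = 0
        self.opened_at = None                       # None means the breaker is closed

    def is_open(self):
        if self.opened_at is None:
            return False
        # Once the cool-down has elapsed, let one trial call through (half-open).
        return self.clock() - self.opened_at < self.reset_after

    def call(self, fn, fallback):
        if self.is_open():
            # Short-circuit: do not touch the degraded node at all.
            return fallback()
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the breaker
            return fallback()
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In a sketch like this, the choice of fallback matters as much as the breaker itself: for a low-priority read, the fallback would return a default or skip the task rather than query the upstream data store, since a fallback that always reads through is one way a thundering herd can form.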
We are, however, still observing issues with our cache nodes that are causing intermittent blips; we hope those blips will not result in service disruption or degraded performance. In parallel, we are investigating the root cause of the Redis node throughput issues with the Google Cloud Platform team.
Posted Feb 12, 2019 - 17:11 PST
Service is restored; engineers are working to address the root cause of the failure.
Posted Feb 12, 2019 - 14:50 PST
Root cause has been identified and remediations are being applied.
Posted Feb 12, 2019 - 13:42 PST
Engineers are actively responding and restoring service.
Posted Feb 12, 2019 - 13:20 PST
We are currently investigating an increased error rate and latency for API requests.