This won't be a long postmortem by our usual standards, but since this is the fourth time this has happened in the past two weeks, I wanted to let you know what's been going on.
In summary, there is a known issue in our stack that is triggered by certain user behavior: Cassandra performance tanks on the nodes serving the affected partition. Our API servers then get tied up waiting on very slow requests to that partition, requests back up behind them, and eventually even users whose data lives on healthy partitions are affected.
The general fix for this is to implement what is called a 'circuit breaker': a timeout mechanism that blacklists the offending partition so further requests fail quickly and don't impact other users. We had rolled out code to do that earlier in the week, but due to a bug in the implementation it didn't trigger during today's outage. We're fixing that now.
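To make the idea concrete, here is a minimal sketch of a per-partition circuit breaker. This is purely illustrative and not our actual implementation; the threshold, cooldown, and partition-keyed bookkeeping are assumptions for the example.

```python
import time

# Illustrative values only; real thresholds would be tuned for the workload.
FAILURE_THRESHOLD = 5      # consecutive slow/failed calls before tripping
COOLDOWN_SECONDS = 30.0    # how long a tripped partition stays blacklisted

class CircuitBreaker:
    def __init__(self):
        self.failures = {}     # partition -> consecutive failure count
        self.open_until = {}   # partition -> time when the circuit re-closes

    def allow(self, partition, now=None):
        """Return False while the partition is blacklisted, so callers fail fast."""
        now = time.monotonic() if now is None else now
        return now >= self.open_until.get(partition, 0.0)

    def record_success(self, partition):
        # A healthy response resets the consecutive-failure count.
        self.failures.pop(partition, None)

    def record_failure(self, partition, now=None):
        now = time.monotonic() if now is None else now
        count = self.failures.get(partition, 0) + 1
        self.failures[partition] = count
        if count >= FAILURE_THRESHOLD:
            # Trip the breaker: further requests to this partition fail fast
            # instead of piling up behind slow Cassandra reads.
            self.open_until[partition] = now + COOLDOWN_SECONDS
            self.failures[partition] = 0
```

The key property is the one described above: once a partition trips the breaker, requests to it are rejected immediately for the cooldown window, so API server capacity isn't consumed waiting on a sick node.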
We're also updating our procedures so that when we implement these kinds of safeguards going forward, we include a manual testing step to ensure they actually work as intended. We wrote unit tests and everything looked good, but nobody actually verified that the functionality worked end to end.
We use Discord a lot and we know you do too. We're sorry for the interruptions that this has caused.
Jan 19, 10:37 PST
All graphs look normal. We're continuing to observe and will post more information about what happened when we've finished analyzing the root cause.
Jan 19, 10:02 PST
We've identified an issue with our Cassandra store for messages and are remediating. We expect service to be restored shortly.
Jan 19, 09:53 PST
We're aware of an issue sending messages at the moment. The team is investigating.
Jan 19, 09:48 PST