At Discord we take service uptime very seriously, and are constantly working with our hosts, bandwidth providers and vendors to ensure that things don't go down.
In this incident, a router misconfiguration caused Anycast traffic to be pulled to CloudFlare's SOF PoP, causing connectivity issues across Europe and North America. Approximately 20% of users who were connected at the time of the incident were unable to connect for up to seven minutes. Following that period, 5% of the users who had been disconnected experienced DNS resolution issues for up to 27 minutes after service was restored.
CloudFlare's System Reliability team identified the issue almost immediately and was able to resolve it quickly. CloudFlare's network team is taking steps to improve its tooling for disabling a carrier globally when manual intervention is required. Engineering work continues on methods to automatically route around major disruptive incidents.
Check out CloudFlare's blog on the matter: https://blog.cloudflare.com/a-post-mortem-on-this-mornings-incident/