API Errors / Latency

Incident Report for Discord

Resolved

Service looks stable and everyone is back online and chatting. We are still working on recovering the search APIs.
Posted Dec 07, 2019 - 14:35 PST

Monitoring

Traffic has been restored to its usual levels, and most users are back in. Our search API is currently down and our engineers are working on recovering it, but otherwise service looks to be stable.

We are following up with Google for a postmortem of the underlying issue, and are monitoring.
Posted Dec 07, 2019 - 14:01 PST

Update

We've started opening the floodgates and are letting users back in. Recovery is proceeding as planned, and we will continue to ramp up traffic levels and monitor.
Posted Dec 07, 2019 - 13:36 PST

Update

Google has updated their status page:

> Current data indicates that approximately 0.01% of PD-SSD volumes in us-east1-b are affected by this issue, from a peak of approximately 8.45%.
> At this time, the issue is contained, and we can confirm that this is only impacting us-east1-b.

We are now beginning service recovery work on our end. Please stand by!
Posted Dec 07, 2019 - 13:06 PST

Update

Google's engineering team continues to work on mitigating the issue, and has posted the following update:

> Mitigation work is currently underway by our engineering team.
> We do not have an ETA for mitigation at this point.

We have all hands on deck and will begin restoring service as soon as this issue is resolved.
Posted Dec 07, 2019 - 11:58 PST

Update

We continue to observe elevated IO latency. Google has now updated their status page, and we are awaiting resolution.

https://status.cloud.google.com/incident/compute/19012
Posted Dec 07, 2019 - 11:35 PST

Update

Google is currently investigating an issue with SSD Persistent Disk in our region (which our database clusters use for storage). We are awaiting a resolution on their end.

Given the IO starvation, we expect continued API latency, as most if not all of our datastores are currently degraded. We will post updates as we get them.
Posted Dec 07, 2019 - 10:54 PST

Identified

Multiple engineers are online and investigating the issue. We are seeing anomalously high iowait across most of our database clusters and other instances, leading us to believe that this is a Google Cloud issue. We are working on a mitigation; however, latency is expected to remain high as we are currently starved for IO resources.
Posted Dec 07, 2019 - 10:37 PST
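
For readers unfamiliar with the metric referenced above: iowait is the share of CPU time spent idle while waiting on disk IO, and a spike across many hosts at once points at the underlying storage rather than any one service. The snippet below is a minimal, hypothetical sketch (not Discord's actual tooling) of how that fraction can be sampled on a Linux host from /proc/stat.

```python
# Minimal sketch (not Discord's tooling): estimate the fraction of CPU time
# spent in iowait on a Linux host by sampling /proc/stat twice.
import time


def cpu_times():
    # First line of /proc/stat: "cpu user nice system idle iowait irq softirq steal ..."
    with open("/proc/stat") as f:
        fields = f.readline().split()[1:]
    return [int(x) for x in fields]


def iowait_fraction(interval=1.0):
    before = cpu_times()
    time.sleep(interval)
    after = cpu_times()
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    # iowait is the fifth field (index 4) of the aggregate "cpu" line.
    return deltas[4] / total if total else 0.0


if __name__ == "__main__":
    print(f"iowait: {iowait_fraction():.1%}")
```

In practice this kind of number is usually collected fleet-wide by tools such as `iostat` or a metrics agent; seeing it elevated on most hosts simultaneously is what suggests a provider-level storage problem rather than a single misbehaving database.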

Investigating

We're experiencing an elevated level of API errors and are currently looking into the issue.
Posted Dec 07, 2019 - 10:22 PST
This incident affected: API.