Engineering has root-caused this issue to a Google Cloud component called "Traffic Director", which is responsible for configuring our load balancing layer. When Traffic Director malfunctioned, it left our internal load balancing layer without a valid configuration, which caused a loss of availability of the API. Engineering remediated by moving to an alternate proxy configuration that did not use Traffic Director. Switching to a different load balancing topology took some time, but we were able to restore service before the Traffic Director issues were resolved.
Posted Mar 08, 2022 - 13:01 PST
Monitoring
Typing events have been re-enabled. At this point all functionality has been restored and the service appears to be operating as designed. Oncall Engineering will continue to monitor the service and will work to understand the root cause of this issue with our provider partners.
Posted Mar 08, 2022 - 12:41 PST
Update
Oncall Engineering has brought the media infrastructure back online, and media embeds should be functional again. Message acknowledgement has also been re-enabled. We are continuing to work to restore full functionality. Typing events are still disabled at this time, and we are investigating reports from bot developers regarding unexpected 403 errors.
Posted Mar 08, 2022 - 12:19 PST
Identified
Remediations are working and traffic is coming back online. While we work to restore full service, some functionality (typing events and message acknowledgement) will remain intentionally disabled until the service stabilizes. Other functionality remains to be restored; media embeds may not work correctly at this time.
Posted Mar 08, 2022 - 11:49 PST
Update
We are working with our providers to correct the root cause. We believe the cause is upstream of our service, and our providers are working to determine and correct it. In the interim, we are implementing a set of remediations to work around the issue. As the service comes back up, some functionality will be intentionally disabled, namely typing events and message acknowledgement.
Posted Mar 08, 2022 - 11:43 PST
Update
While we continue to investigate the root cause, work has begun on restoring service by working around the issue. Oncall Engineering will begin allowing more traffic through as we restore service.
Posted Mar 08, 2022 - 11:29 PST
Update
Oncall Engineering continues to investigate the root cause of this issue. We have engaged our partners and are preparing contingencies to restore service.
Posted Mar 08, 2022 - 10:59 PST
Update
We are continuing to investigate the issue impacting the API to find the root cause.
Posted Mar 08, 2022 - 10:29 PST
Investigating
While monitoring this issue, a new issue has occurred, causing a major outage of the API. Oncall Engineering is working to correct this situation.
Posted Mar 08, 2022 - 10:12 PST
Update
As part of recovery, the root cause was also detected in our streaming service. A controlled restart of this service was performed, which would have caused a temporary disruption of streaming. Streaming should be operating correctly at this time.
Posted Mar 08, 2022 - 10:08 PST
Monitoring
Remediations appear to have restored service to normal operation. Oncall Engineering will monitor for full recovery.
Posted Mar 08, 2022 - 09:54 PST
Identified
The root cause has been determined, and remediations have been executed to restore service.
Posted Mar 08, 2022 - 09:53 PST
Investigating
We are currently investigating an increase in API errors and push notification errors.
Posted Mar 08, 2022 - 09:16 PST
This incident affected: API and Push Notifications.