Elevated Reports of Participant Connection Latency and Errors In US East Region

Incident Report for LiveKit

Postmortem

Summary

LiveKit Cloud validates incoming API and connection requests via an internal authentication service, accessed over gRPC. The service has been in place since 2022 and is designed for resilience: it runs as multiple redundant instances, has pod-level health monitoring that automatically reaps unhealthy pods, and is backed by in-memory caches so it can survive transient database failures.

On 2026-05-28, between 13:55 UTC and 15:45 UTC, a percentage of requests and new connections in our US East region failed or timed out. The root cause was a rare failure mode on a single instance of the authentication service. The instance remained reachable and its TCP connections stayed alive, but it began responding to gRPC requests extremely slowly, in a way that did not trip our existing pod-level health checks or cause gRPC clients to fail over.

We sincerely apologize to customers whose traffic was disrupted. We've let you down, and we are taking this very seriously. In addition to the corrective actions outlined below, we are performing a thorough audit of our systems for other failure modes we may not have anticipated.

Timeline and Impact

(all times in UTC on 2026-05-28)

  • 13:55: Elevated rate of authentication failures observed.
  • 14:15: On-call team notified; investigation begins.
  • 14:46: Error rate continues to climb; incident escalated to Sev 1.
  • 15:35: US East drained; error rate begins to subside.
  • 16:32: Root cause identified; faulty instance shut down.
  • 16:32: Webhooks backlogged during the incident begin to deliver.

Between 2026-05-28 13:55 UTC and 15:35 UTC, a percentage of API requests and new connection attempts in our US East region failed or timed out. This included new participant connections, SIP connection requests, and agent connections originating in US East.

The blast radius of this incident was substantially reduced by our local in-process auth cache. Each service maintains an in-memory cache of recently-seen authentication details, which allowed the majority of requests to continue flowing whenever they landed on a server that had already cached the relevant credentials. The failures concentrated on requests that landed on servers without those credentials cached, which disproportionately impacted customers with lower overall traffic to US East.

Existing realtime sessions in US East continued to operate, and all traffic in other regions was unaffected. The impact subsided once US East was drained and traffic was rerouted to neighboring regions.

There was also a secondary effect for customers who subscribe to outgoing webhooks. During the incident, webhook deliveries that depended on the affected auth path were queued internally while waiting for a valid signing token. Once the faulty instance was removed and the previously stuck operations unblocked, the queued webhooks were delivered in a burst rather than smoothly over time. Customers whose webhook endpoints have limited throughput may have observed a short-lived spike that exceeded their normal load.

Root Cause

The trigger was a rare hardware failure mode on one instance of our authentication service. Typically, when hardware fails, the underlying machine is shut down either by the hypervisor or by Kubernetes (via failed health checks), and gRPC clients reconnect to a healthy instance. In this case, the faulty machine did not terminate. It remained reachable and continued to respond, but extremely slowly. The TCP connection to that instance stayed up, so gRPC clients that had pinned their requests onto that connection continued to use it, and pod-level health checks did not fire. As a result, a percentage of services in US East were unable to fail over to healthy instances and saw their auth requests time out.

Two additional fallback mechanisms did not engage as fully as we had designed:

  1. Client-driven cross-region failover. LiveKit clients can automatically fail over from one region to a neighboring healthy region. However, the failover path itself depended on passing the same authentication gate, so it failed for the same reason.
  2. Local in-process auth cache refresh path. As noted in the Impact section, the in-process cache absorbed a significant share of requests during the incident. However, services always attempt to refresh auth details against the auth service first to ensure they have the most recent data, and fall back to the cache only when that refresh fails or times out. The refresh attempt has a 5-second timeout. Many client SDKs use a default request timeout shorter than 5 seconds and gave up before the cache fallback could engage, even when the relevant credentials were present in the cache.

When the impact was first observed, our initial signal pointed to database connection timeouts in US East. Because of that, we did not immediately drain the region. Two reasons informed that decision:

  1. only a portion of requests were failing (some connections were still succeeding)
  2. if the underlying database had truly been the cause, draining US East would have shifted that load onto neighboring regions and risked spreading the impact rather than containing it.

After confirming that the database was healthy and that the issue was isolated to US East, we drained the region. Error rates began dropping immediately as traffic moved to neighboring regions.

Corrective Actions & Prevention

This incident exposed a machine failure mode we had not yet designed for. The following changes have been implemented and will be rolled out in the next week:

  • Detect and time out slow gRPC connections from internal clients. Internal gRPC clients will detect connections that are alive but unresponsive, and tear them down so they can re-establish against a healthy instance.
  • Improved load balancing between gRPC servers. Retries from internal clients will spread across alternative instances, so a single slow server cannot black-hole a meaningful share of traffic.
  • Reduce inline auth refresh timeout to 1 second. This allows the local in-process auth cache fallback to engage well within the timeout budget that client SDKs typically use.
  • Allow client-driven cross-region failover to function when auth has failed. Region failover will no longer depend on passing the auth gate, so a degraded auth service in one region cannot prevent clients from reaching a healthy region.

We sincerely apologize for the disruption to customers whose traffic was affected. Thank you for your patience, and we welcome any additional feedback from customers who were impacted.

Posted May 30, 2026 - 11:19 PDT

Resolved

Connection errors have remained cleared since the US East drain at 15:44 UTC, and as of 19:30 UTC, US East is back online. We'll share a detailed RCA in the coming days.
Posted May 28, 2026 - 13:13 PDT

Update

Service is operating normally with traffic routing through other regions. We are working on bringing US East back online and will share a detailed RCA once complete.
Posted May 28, 2026 - 11:27 PDT

Update

Connection errors have fully cleared since the US East drain completed at 15:44 UTC. We'll continue monitoring. We will be sharing a detailed RCA.
Posted May 28, 2026 - 09:27 PDT

Monitoring

US East drain is complete and all new traffic is now routing to other regions. Error rate is trending down and we'll continue monitoring.
Posted May 28, 2026 - 08:50 PDT

Identified

We are observing database connection timeouts in the US East region and are currently seeing an impact to all services in the region. We are currently draining the region and routing away all traffic to other regions.
Posted May 28, 2026 - 08:35 PDT

Investigating

We are currently investigating reports of spikes in participant connection latency and errors
Posted May 28, 2026 - 07:53 PDT
This incident affected: Regional Real Time Communication (US East - Real Time Communication).