LiveKit Cloud validates incoming API and connection requests via an internal authentication service, accessed over gRPC. The service has been in place since 2022 and is designed for resilience: it runs as multiple redundant instances, has pod-level health monitoring that automatically reaps unhealthy pods, and is backed by in-memory caches so it can survive transient database failures.
On 2026-05-28, between 13:55 UTC and 15:45 UTC, a portion of requests and new connections in our US East region failed or timed out. The root cause was a rare failure mode on a single instance of the authentication service. The instance remained reachable and its TCP connections stayed alive, but it began responding to gRPC requests extremely slowly, so it neither tripped our existing pod-level health checks nor caused gRPC clients to fail over.
We sincerely apologize to customers whose traffic was disrupted. We've let you down, and we are taking this very seriously. In addition to the corrective actions outlined below, we are performing a thorough audit of our systems for other failure modes we may not have anticipated.
(all times in UTC on 2026-05-28)
Between 2026-05-28 13:55 UTC and 15:35 UTC, a portion of API requests and new connection attempts in our US East region failed or timed out. This included new participant connections, SIP connection requests, and agent connections originating in US East.
The blast radius of this incident was substantially reduced by our local in-process auth cache. Each service maintains an in-memory cache of recently seen authentication details, so most requests continued to flow whenever they landed on a server that had already cached the relevant credentials. The failures were concentrated on requests that landed on servers without those credentials cached, which disproportionately affected customers with lower overall traffic to US East.
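To illustrate why cached credentials shielded most traffic, here is a minimal sketch of an in-process auth cache. It is not LiveKit's implementation: the `AuthCache` class, the `fetch` callback, the TTL value, and the choice to serve an expired entry when the upstream call fails are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class AuthCache:
    """Hypothetical in-process cache in front of a remote auth lookup."""

    def __init__(self, fetch: Callable[[str], str], ttl_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch      # remote lookup; may raise or time out
        self._ttl_s = ttl_s      # assumed freshness window
        self._clock = clock      # injectable so the cache can be tested
        self._entries: Dict[str, Tuple[str, float]] = {}  # key -> (value, stored_at)

    def get(self, api_key: str) -> str:
        entry = self._entries.get(api_key)
        now = self._clock()
        if entry is not None and now - entry[1] < self._ttl_s:
            return entry[0]      # fresh hit: no remote call at all
        try:
            value = self._fetch(api_key)
        except Exception:
            if entry is not None:
                return entry[0]  # assumed policy: serve the expired entry
            raise                # cold miss: nothing to fall back on
        self._entries[api_key] = (value, now)
        return value
```

In this sketch, a key that has been seen before keeps resolving while the upstream is failing, whereas a key that has never been cached on this server surfaces the upstream error. That matches the pattern described above, where low-traffic customers were less likely to find their credentials already cached.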
Existing realtime sessions in US East continued to operate, and all traffic in other regions was unaffected. The impact subsided once US East was drained and traffic was rerouted to neighboring regions.
There was also a secondary effect for customers who subscribe to outgoing webhooks. During the incident, webhook deliveries that depended on the affected auth path were queued internally while waiting for a valid signing token. Once the faulty instance was removed and the previously stuck operations unblocked, the queued webhooks were delivered in a burst rather than smoothly over time. Customers whose webhook endpoints have limited throughput may have observed a short-lived spike that exceeded their normal load.
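The difference between a burst and a smooth drain can be sketched as follows. This is an illustration only, not a description of how LiveKit delivers webhooks or of any corrective action: `schedule_drain` and `max_per_second` are hypothetical names, and the fixed spacing is one simple pacing policy among many.

```python
from collections import deque
from typing import Deque, List, Tuple


def schedule_drain(queued: Deque[str], start_s: float,
                   max_per_second: float) -> List[Tuple[float, str]]:
    """Assign each queued webhook a send time so the drain rate stays bounded.

    A burst would send everything at start_s; here deliveries are spaced
    1 / max_per_second apart, in their original order.
    """
    interval = 1.0 / max_per_second
    plan: List[Tuple[float, str]] = []
    position = 0
    while queued:
        plan.append((start_s + position * interval, queued.popleft()))
        position += 1
    return plan
```

A receiver with limited throughput can apply the same idea on its side by accepting webhooks quickly and processing them from its own queue at a bounded rate.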
The trigger was a rare hardware failure mode on one instance of our authentication service. Typically, when hardware fails, the underlying machine is shut down either by the hypervisor or by Kubernetes (via failed health checks), and gRPC clients reconnect to a healthy instance. In this case, the faulty machine did not terminate. It remained reachable and continued to respond, but extremely slowly. The TCP connection to that instance stayed up, so gRPC clients that had pinned their requests to that connection kept using it, and pod-level health checks did not fire. As a result, a portion of services in US East were unable to fail over to healthy instances and saw their auth requests time out.
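A backend that is alive but slow is invisible to checks that only ask whether the connection is up. One general way to catch it is for the client to treat repeated deadline misses as a health signal of their own. The sketch below shows that idea in plain Python. It is an assumption-laden illustration, not LiveKit's client or any gRPC API: `EjectingPicker`, `DeadlineExceeded`, and the threshold of three consecutive timeouts are all hypothetical.

```python
from typing import Callable, Dict, List


class DeadlineExceeded(Exception):
    """Raised by a call that did not finish within its deadline."""


class EjectingPicker:
    """Round-robin picker that stops using backends that keep timing out."""

    def __init__(self, backends: List[str],
                 max_consecutive_timeouts: int = 3) -> None:
        self._backends = list(backends)
        self._limit = max_consecutive_timeouts
        self._timeouts: Dict[str, int] = {b: 0 for b in backends}
        self._next = 0

    def healthy(self) -> List[str]:
        # A backend is ejected once it reaches the consecutive-timeout limit.
        return [b for b in self._backends if self._timeouts[b] < self._limit]

    def call(self, request: Callable[[str], str]) -> str:
        # Fall back to the full list so that we never eject every backend.
        candidates = self.healthy() or self._backends
        backend = candidates[self._next % len(candidates)]
        self._next += 1
        try:
            result = request(backend)
        except DeadlineExceeded:
            self._timeouts[backend] += 1  # slow response counts against it
            raise
        self._timeouts[backend] = 0       # any success resets the streak
        return result
```

With two backends where one always misses its deadline, this picker fails three requests and then routes everything to the healthy backend, even though the slow backend's connection never drops. A production version would also need to re-admit ejected backends after a probe or cool-down period, which this sketch omits.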
Two additional fallback mechanisms did not engage as fully as we had designed:
When the impact was first observed, our initial signal pointed to database connection timeouts in US East. Because of that, we did not immediately drain the region. Two reasons informed that decision:
After confirming that the database was healthy and that the issue was isolated to US East, we drained the region. Error rates began dropping immediately as traffic moved to neighboring regions.
This incident exposed a machine failure mode we had not yet designed for. The following changes have been implemented and will be rolled out in the next week:
We sincerely apologize for the disruption to customers whose traffic was affected. Thank you for your patience, and we welcome any additional feedback from customers who were impacted.