Cloud Agents experiencing deployment failures in US East

Incident Report for LiveKit

Postmortem

Incident

On May 6 around 04:00 UTC, we started to receive internal alarms about cluster availability from a single cluster in our US East hosted agents region. Upon investigation we found that etcd was unresponsive for this cluster. Without etcd and control plane availability, new actions in the cluster were not processed. Anything already running prior to the start of the incident continued to run; however, builds, autoscaling, and other similar events sent to this cluster failed.
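For context, the first externally visible symptom of an etcd outage is usually an unresponsive Kubernetes API server. A minimal health probe along these lines could drive an alarm like the one that fired here. This is a sketch, not LiveKit's actual tooling; the endpoint path follows the standard Kubernetes `/readyz` convention, the URL and timeout are illustrative, and it assumes the cluster permits unauthenticated access to its health endpoints.

```python
# Hypothetical health probe for the Kubernetes API server's /readyz endpoint.
# When etcd is unresponsive, the API server stops answering, so control plane
# writes (new deployments, autoscaling events) fail while already-running
# workloads keep serving.
import urllib.request


def control_plane_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the API server answers /readyz with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/readyz", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, DNS failure, or timeout
        return False
```

An alerting loop would call this periodically and page when it flips to False; a production setup would authenticate and probe through the cluster's client library rather than raw HTTP.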

Around 04:30 UTC we escalated the issue to our cloud provider. Resolution took longer than expected for several reasons, including additional complexity in getting etcd and the Kubernetes control plane back into a good state. We have multiple clusters in the region, but we did not want to strain the other clusters with the complete workload from the affected cluster. To ensure stability across the rest of the fleet, we decided to add capacity. Around 05:00 UTC we began setting up additional clusters and started making plans to migrate new deployments. Around 08:00 UTC a scale-up operation began to revive the affected cluster, and around 09:15 UTC control plane and etcd resources began to recover. During the recovery process some existing workloads became unstable, but our reconciler resolved the issue soon after. By 09:40 UTC everything had recovered.

Post-incident

We've identified faster escalation and recovery paths for cases like this to ensure a similar delay doesn't happen again. We've also been tuning our workloads so that a similar incident doesn't recur. We'll be adding monitoring and alarms for several specific scenarios uncovered in our investigation. In addition, we're continuing our work to stand up additional compute capacity so that a single cluster failure won't block deployments and scaling actions from completing. Several initiatives along these lines were already underway, and we're raising their priority to ensure they're completed soon.

Posted May 09, 2026 - 00:37 PDT

Resolved

US East Cloud Agents are now fully functional.
Posted May 06, 2026 - 02:41 PDT

Monitoring

US East recovered as of 08:55 UTC and has been serving traffic normally since. New and existing agent deployments are working as expected.

We are moving to monitoring while we confirm sustained stability. A full post-incident review will follow.
Posted May 06, 2026 - 02:17 PDT

Update

We are continuing to work on restoring the Kubernetes API server in US East. Starting at 08:15 UTC, we are observing impact to existing agent deployments in addition to new deployments. Mitigation is in progress.
Posted May 06, 2026 - 01:47 PDT

Update

The etcd service in our US East Kubernetes cluster is currently down, making the API server unresponsive and causing failures for new deployments and redeployments. We are actively working with our data center provider to mitigate this.
Posted May 05, 2026 - 22:34 PDT

Update

Our team is actively working to resolve deployment failures in US East. Existing deployed agents should not be affected in any way.
Posted May 05, 2026 - 22:00 PDT

Update

New agent deployments are temporarily unavailable in the US East region, and our team is actively working on this. Dispatches to previously deployed agents are not affected.
Posted May 05, 2026 - 21:22 PDT

Update

We are currently investigating intermittent deployment failures on Cloud Agents in US East.
Posted May 05, 2026 - 21:06 PDT

Investigating

We are currently investigating degraded performance on Agents Hosted on LiveKit Cloud in US East.
Posted May 05, 2026 - 21:01 PDT
This incident affected: Regional Cloud Agents (US East - Cloud Agents).