Issues with SIP inbound calls

Incident Report for LiveKit

Postmortem

Summary

A bug in our deployment automation unintentionally removed the SIP load balancers from the deployment. This impacted inbound SIP calls globally for around 20 minutes in total, across two outage windows. The load balancers were recreated once the team discovered the impact.

Timeline

Timestamps are in UTC

17:57 - a breaking change was merged into our deployment automation software. The change omitted a Jsonnet generation step.
18:01 - a code change was merged, triggering CI to deploy that change to staging. This step uses our deployment automation software.
18:09 - due to the missing Jsonnet generation step, the deployment config showed all Jsonnet-controlled resources as scheduled for removal [first-outage start]
18:11 - oncall was alerted to SIP load balancers failing pings
18:12 - incident was created internally to triage the issue
18:19 - we traced the breakage to the deployment config change made at 18:09
18:23 - the deployment config change was reverted; inbound calls started recovering at this point [first-outage end]
18:25 - we traced the underlying cause to the bug in the deployment automation code.
18:30 - another CI deploy merged a second config update that was also missing the Jsonnet resources [second-outage start]
18:32 - upon discovering this, we disabled CI internally so no other change would hit the same issue.
18:35 - the second config update was reverted [second-outage end]

Root cause analysis

We use an internal software stack to manage all of our software deployments across 21 clusters globally (including staging and production). The deployment software uses a couple of different templating engines to create the specific configuration for each service in every cluster. Jsonnet was introduced to the stack a couple of months ago and remains a fairly new component. The SIP load balancers were one of the services that used Jsonnet.

A code change to the deployment stack omitted the Jsonnet generation step. As a result, the generated cluster-specific configuration no longer contained the SIP load balancers at all.
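
The failure mode can be illustrated with a minimal sketch. This is not our deployment code: the resource names, `render_desired_state`, and `plan_removals` are hypothetical, and the sketch assumes removals are computed as "resources that are live but absent from the freshly generated configuration".

```python
# Hypothetical illustration; all names and resources are assumptions.

def render_desired_state(generators):
    """Union the resource names produced by each templating step."""
    desired = set()
    for generate in generators:
        desired |= generate()
    return desired


def plan_removals(live, desired):
    """Resources that exist live but are absent from the generated config."""
    return sorted(live - desired)


def other_template_step():
    # Stand-in for a non-Jsonnet templating engine.
    return {"api-server", "media-router"}


def jsonnet_step():
    # Stand-in for the Jsonnet generation step.
    return {"sip-load-balancer"}


live = {"api-server", "media-router", "sip-load-balancer"}

# All steps run: nothing is scheduled for removal.
full_plan = plan_removals(live, render_desired_state([other_template_step, jsonnet_step]))

# The Jsonnet step is skipped: its resources silently appear as removals.
broken_plan = plan_removals(live, render_desired_state([other_template_step]))

print(full_plan)    # []
print(broken_plan)  # ['sip-load-balancer']
```

Nothing in the second run fails outright. The skipped step simply yields a smaller desired state, which is why the removal looked like an ordinary config diff.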

Typically, any deployment that touches production resources is reviewed by an engineer, who inspects the generated diff before approving the configuration deployment. In this case, although the change targeted staging, it impacted production resources due to the bug in the deployment software itself. The combination of the staging target and the bug caused the change to be pushed without engineering review.
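
One way to close this kind of gap is to decide whether review is required from the resources a change actually touches, rather than from the environment it was aimed at. The sketch below is an illustration under assumptions, not the actual deployment software: `Change`, `needs_review`, and the cluster names are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical sketch; cluster names and types are assumptions for illustration.
PRODUCTION_CLUSTERS = {"prod-us-east", "prod-eu-west"}


@dataclass(frozen=True)
class Change:
    cluster: str   # cluster the resource lives in
    resource: str  # resource name
    action: str    # "create", "update", or "delete"


def needs_review(changes):
    """Require review if any change lands on a production cluster,
    regardless of which environment the deploy was targeted at."""
    return any(c.cluster in PRODUCTION_CLUSTERS for c in changes)


staging_only = [Change("staging-1", "web", "update")]
leaky = staging_only + [Change("prod-us-east", "sip-load-balancer", "delete")]

print(needs_review(staging_only))  # False
print(needs_review(leaky))         # True
```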

Mitigations

After the incident, we performed an extensive review of our deployment systems and have identified a few shortcomings that we will address in the coming days.

  • improve error handling in the deployment stack, so the deployment is aborted if a missing dependency is found (complete)
  • CI-triggered deployment is only able to affect non-production resources
  • any production-related config change will require explicit engineering review
  • CI-triggered deployment with significant resource deletion will be disallowed by default
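
As a rough sketch of the first and last items above, a pre-apply check could look like the following. Everything here is hypothetical: `validate_plan`, `PlanRejected`, the step names, and `MAX_DELETE_RATIO` are assumptions for illustration, and this report does not define what counts as "significant" deletion.

```python
# Hypothetical sketch; the threshold and all names are assumptions.
MAX_DELETE_RATIO = 0.10  # illustrative value only


class PlanRejected(Exception):
    """Raised when a CI-triggered plan must not be applied automatically."""


def validate_plan(ran_steps, required_steps, deletions, live_count,
                  allow_deletions=False):
    # Abort if any required generation step did not run.
    missing = sorted(set(required_steps) - set(ran_steps))
    if missing:
        raise PlanRejected(f"missing generation steps: {missing}")

    # Disallow significant deletions unless explicitly permitted.
    ratio = len(deletions) / live_count if live_count else 0.0
    if ratio > MAX_DELETE_RATIO and not allow_deletions:
        raise PlanRejected(
            f"plan deletes {len(deletions)} of {live_count} resources"
        )


required = ["templates", "jsonnet"]

# A run that skipped the Jsonnet step is rejected before any diff is applied.
try:
    validate_plan(["templates"], required, [], live_count=10)
except PlanRejected as err:
    print("rejected:", err)

# A complete run that would still delete 1 of 3 resources is also rejected.
try:
    validate_plan(required, required, ["sip-load-balancer"], live_count=3)
except PlanRejected as err:
    print("rejected:", err)
```

Checking for missing steps first means a skipped generator is reported as such, instead of surfacing indirectly as a large deletion.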
Posted Oct 29, 2025 - 16:59 PDT

Resolved

This incident has been resolved. Inbound SIP fully recovered as of 11:38 AM PT and all metrics look normal.

We'll be sharing a detailed post-mortem shortly.
Posted Oct 28, 2025 - 13:15 PDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Oct 28, 2025 - 11:34 PDT

Identified

We are investigating an issue with inbound SIP calls. Service should start to be restored soon.
Posted Oct 28, 2025 - 11:26 PDT
This incident affected: Global SIP.