Root Cause
During a failover event on the SIP load balancer in US East, UDP packets were incorrectly treated as part of an existing "connection" by the underlying VNIC stack's connection tracking (conntrack) mechanism. As a result, these packets continued to be forwarded to the previous (now non-existent) node, causing SIP INVITEs over UDP to be dropped for the duration of the incident.
Technical Details
The SIP load balancers use Virtual Network Interface Cards (VNICs) to enable high availability. VNICs allow the public IP to remain unchanged while traffic is redirected to different physical instances (e.g., during pod failures, software updates, or maintenance). This supports seamless failover to standby instances and instance cycling.
On March 2, 2026, an instance cycling process was initiated to apply security patches. Standard procedure involves adding additional IPs to DNS before rotating instances to maintain capacity. However, an operator error led to direct patching without first scaling up capacity. This operation should have been safe, as the VNIC is expected to redirect traffic to the new instance.
- For TCP traffic, failover worked as intended.
- For UDP, the VNIC's conntrack behavior differed: UDP flows are tracked by a four-tuple (source IP, source port, destination IP, destination port). New incoming SIP INVITEs from the same client IP/port combination were treated as part of the pre-failover "connection", causing them to be forwarded to the old node even after it was terminated.
- This issue specifically affected traffic from providers like Twilio, whose clients reused the same source IP/port for sequential INVITEs, resulting in dropped inbound calls.
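The divergence above can be illustrated with a minimal sketch. This is a hypothetical flow table, not LiveKit's actual VNIC code: it models conntrack as a map from 4-tuple to backend node, where a TCP SYN establishes a fresh entry but a UDP packet silently reuses whatever entry already matches.

```python
# Hypothetical sketch of conntrack-style routing; the class and node names
# are illustrative, not part of any real VNIC implementation.

class FlowTable:
    """Maps a 4-tuple flow key to a backend node, as conntrack does."""

    def __init__(self, backend):
        self.backend = backend  # current healthy backend node
        self.flows = {}         # (src_ip, src_port, dst_ip, dst_port) -> node

    def route(self, src_ip, src_port, dst_ip, dst_port, tcp_syn=False):
        key = (src_ip, src_port, dst_ip, dst_port)
        # A TCP SYN starts a new connection, so it replaces any stale entry;
        # a UDP packet has no such signal and reuses the existing entry.
        if tcp_syn or key not in self.flows:
            self.flows[key] = self.backend
        return self.flows[key]

table = FlowTable(backend="node-a")
# A SIP INVITE over UDP arrives before failover: the flow is pinned to node-a.
table.route("203.0.113.7", 5060, "198.51.100.1", 5060)

table.backend = "node-b"  # failover: node-a is terminated, node-b takes over

# Same client IP/port sends another INVITE: the stale UDP entry still wins.
stale = table.route("203.0.113.7", 5060, "198.51.100.1", 5060)
# A reconnecting TCP client sends a SYN, which creates a fresh entry.
fresh = table.route("203.0.113.7", 5061, "198.51.100.1", 5060, tcp_syn=True)
print(stale, fresh)  # node-a node-b
```

In this model the UDP INVITE is forwarded to the terminated node-a, while the TCP reconnect correctly lands on node-b, matching the behavior observed during the incident.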
Timeline
- 2026-03-02 10:30:00 UTC: Operator initiated instance cycling.
- 2026-03-02 10:55:00 UTC: Monitoring systems detected the issue and paged the on-call engineer.
- 2026-03-02 11:27:00 UTC: Traffic was redirected away from the affected US East region, mitigating the impact.
Monitoring & Detection
LiveKit employs two layers of end-to-end monitoring for the SIP infrastructure:
- Simulated SIP pings: These send periodic OPTIONS packets to verify reachability of the SIP load balancers. They failed to detect the issue because each ping used a different source IP/port, so no ping matched the problematic conntrack entries.
- End-to-end SIP calls: These use external providers (e.g., Twilio) to place real calls, verifying successful call establishment and bidirectional audio flow. They run at a lower frequency. They successfully detected the drops (Twilio traffic matched the faulty conntrack entries), but detection was delayed by the longer check interval.
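The monitoring gap comes down to source ports. A simplified sketch (assumed behavior of a one-shot probe, not LiveKit's actual monitoring code): each ping binds a fresh UDP socket to port 0, so the OS assigns a new ephemeral source port and every probe presents a brand-new 4-tuple that cannot hit a stale conntrack entry.

```python
# Sketch of why per-ping ephemeral ports dodge stale conntrack entries.
# The probe itself is hypothetical; only the port-allocation behavior is real.
import socket

def probe_socket():
    """Open a UDP socket the way a one-shot SIP OPTIONS ping might."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(("127.0.0.1", 0))  # port 0: let the OS pick an ephemeral port
    return s

a, b = probe_socket(), probe_socket()
port_a, port_b = a.getsockname()[1], b.getsockname()[1]
# Two concurrently open sockets never share a local port; in practice,
# sequential pings likewise get fresh ephemeral ports, so each probe's
# 4-tuple differs from the last and never matches a stale flow entry.
print(port_a != port_b)  # True
a.close(); b.close()
```

A long-lived SIP client like Twilio behaves the opposite way: it reuses one source IP/port for sequential INVITEs, so its traffic keeps matching the stale entry, which is why only the real-call checks caught the failure.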
Mitigations
To prevent recurrence and improve resilience, we will:
- Increase redundancy at the SIP load balancer layer to tolerate up to 2 out of 3 nodes failing without service impact.
- Enforce stricter, standardized operating procedures for instance cycling (e.g., mandatory capacity addition via DNS updates before rotation; additional peer review or automation safeguards).
- Increase the frequency of end-to-end SIP call monitoring to enable faster detection of UDP-specific issues.
This incident highlights subtle differences in how stateful conntrack handles UDP vs. TCP during VNIC failover for protocols like SIP. These changes will make similar maintenance operations significantly more robust in the future.