This incident was caused by a bug in our deployment stack: a breaking change in the deployment automation unintentionally removed the SIP load balancers from a deployment. This impacted inbound SIP calls globally for around 20 minutes. The load balancers were recreated once the team discovered the impact.
All timestamps are in UTC.
17:57 - a breaking change was merged into our deployment automation software. The change omitted a Jsonnet generation step.
18:01 - a code change was merged, triggering CI to deploy that change to staging. This step uses our deployment automation software.
18:09 - due to the missing Jsonnet generation, the deployment config showed all Jsonnet-controlled resources as scheduled for removal [first-outage start]
18:11 - oncall was alerted to SIP load balancers failing pings
18:12 - incident was created internally to triage the issue
18:19 - we traced the breakage to the deployment config change deployed at 18:09
18:23 - the deployment config change was reverted; inbound calls began recovering at this point [first-outage end]
18:25 - we traced the root cause to the breaking change in the deployment automation code
18:30 - another CI deploy merged another config update with the Jsonnet resources missing [second-outage start]
18:32 - upon discovery, we disabled CI internally so no other changes would hit the same issue.
18:35 - the second config update was reverted [second-outage end]
We use an internal software stack to manage all of our software deployments across 21 clusters globally (including staging and production). The deployment software uses several templating engines to create the specific configuration for each service in every cluster. Jsonnet was introduced to the stack a couple of months ago and remains a fairly new component. The SIP load balancers are among the services whose configuration is generated with Jsonnet.
The code change made to the deployment stack omitted the Jsonnet generation step. As a result, the generated cluster-specific configuration was missing the SIP load balancers entirely.
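A simplified sketch of the failure mode may help. This is hypothetical code, not our actual deployment stack (the engine names, resource names, and `plan` function are all illustrative): cluster config is assembled from several templating engines, and the apply step treats any resource absent from the newly generated config as one scheduled for removal.

```python
def generate_config(engines):
    """Merge the resources produced by each templating engine."""
    config = {}
    for engine in engines:
        config.update(engine())
    return config

def plan(old, new):
    """Resources present before but absent now are scheduled for removal."""
    return {"remove": sorted(set(old) - set(new)),
            "add": sorted(set(new) - set(old))}

# Hypothetical engines and resource names, for illustration only.
def helm_engine():
    return {"api-gateway": {}, "media-relay": {}}

def jsonnet_engine():
    return {"sip-lb-eu": {}, "sip-lb-us": {}}

# Healthy generation runs both engines.
old = generate_config([helm_engine, jsonnet_engine])

# The buggy change skipped the Jsonnet step, so its resources vanish
# from the generated config -- and the plan schedules their removal.
new = generate_config([helm_engine])
print(plan(old, new))  # {'remove': ['sip-lb-eu', 'sip-lb-us'], 'add': []}
```

The key point is that the templating bug and the resulting deletions look identical to an intentional decommissioning from the apply step's perspective, which is why the config diff was the only place the problem was visible.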
Typically, any deployment that touches production resources is reviewed by an engineer, who inspects the diff generated from the change before approving the configuration deployment. In this case, while the change targeted staging, it impacted production resources due to the bug in the deployment software itself. The combination of the staging target and the bug caused the change to be pushed without engineering review.
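One way to close this gap is a guard that looks at the computed plan rather than the declared target. The sketch below is illustrative only (the naming-convention check and function are assumptions, not our actual tooling): even for a staging-targeted deploy, auto-apply is blocked if the plan would remove any production resources, forcing a human to review the diff.

```python
# Assumed naming convention for production resources (hypothetical).
PRODUCTION_PREFIXES = ("sip-lb-", "media-")

def requires_review(plan):
    """Return True if the plan would destructively touch production resources."""
    return any(name.startswith(PRODUCTION_PREFIXES) for name in plan["remove"])

# A plan like the one from this incident would be held for review.
incident_plan = {"remove": ["sip-lb-eu", "sip-lb-us"], "add": []}
print(requires_review(incident_plan))  # True
```

Gating on the plan's effects instead of the deploy's stated target means a bug in the targeting logic itself cannot silently bypass review.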
After the incident, we performed an extensive review of our deployment systems and have identified a few shortcomings that we will address in the coming days.