On 2021-12-10, between 01:03 UTC and 04:30 UTC, the Web, API, REST API, Job Queue, SCM Integrations, and Notifications (GitHub Commit Status Notifications and Email Notifications) were unavailable or in a degraded state. The Agent API was partially degraded between 01:01 UTC and 03:30 UTC.
This incident began at 00:59 UTC when we deployed a new feature that adds the ability to create jobs using a new matrix syntax. The feature was deployed behind a feature flag to control access while it was tested. Multiple engineers had reviewed the Pull Request that introduced the feature.
At 01:08 UTC our monitoring systems reported a spike in database connection errors, and we began investigating the cause and customer impact. By 01:21 UTC we had posted a status page update saying we'd seen a spike in API errors and were monitoring it; at that point our internal monitoring showed the errors decreasing. The status page post was initially delayed while we determined which services were impacted by the database connection errors. Shortly afterwards we observed another increase in database connection errors and continued investigating.
At 01:23 UTC, to eliminate the newly deployed feature as the cause of the database connection errors, we created a revert PR. We attempted to push this to production but were delayed because we use Buildkite internally to deploy Buildkite, and the site was unresponsive. At 01:40 UTC we rolled back to a previous known-good commit. However, we didn't see a noticeable improvement in API response times or error rates.
At 01:27 UTC we noticed high load on our primary database. The matrix feature had added new columns to our jobs table, so we pursued the theory that these new columns were changing query planning on the jobs table. By 02:01 UTC we had dropped the new columns from the jobs table, but we didn't observe any noticeable change in jobs table performance.
At 02:01 UTC we noticed that a Postgres ANALYZE had started running against our largest table (jobs) at the beginning of the incident. An ANALYZE refreshes table statistics and can change the plans the query planner chooses; in the past we've manually run an ANALYZE to reset table statistics and improve performance. We started an ANALYZE on the jobs, builds, and artifacts tables, which significantly improved performance for many (but not all) queries.
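For context, ANALYZE refreshes the statistics the Postgres query planner uses when choosing plans. As a rough illustration of the kind of statements involved (a minimal Python/psycopg2 sketch with an assumed DATABASE_URL connection string, not our actual tooling):

```python
import os
import psycopg2

# Illustrative sketch only, not Buildkite's actual tooling. Assumes a
# connection string in the DATABASE_URL environment variable.
conn = psycopg2.connect(os.environ["DATABASE_URL"])
conn.autocommit = True  # run each statement in its own implicit transaction

with conn.cursor() as cur:
    for table in ("jobs", "builds", "artifacts"):
        # ANALYZE samples the table and refreshes the statistics the query
        # planner uses to estimate row counts and choose between plans.
        cur.execute(f"ANALYZE {table};")

conn.close()
```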
At 03:08 UTC we deployed an optimisation to a query that was generating a high amount of database load by sequentially scanning the artifacts table; the optimisation forced the query to use an existing index. Neither the ANALYZE nor the query optimisation had a significant effect on API response times or error rates, so we continued to investigate artifacts and jobs table query performance.
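One way to see the effect of this kind of optimisation is EXPLAIN, which reports whether Postgres plans a sequential scan or an index scan for a given query. A hypothetical sketch (the column name and filter value are placeholders, not the real query we optimised):

```python
import os
import psycopg2

# Hypothetical check, not the actual query we optimised: the column name
# (job_id) and filter value are placeholders.
conn = psycopg2.connect(os.environ["DATABASE_URL"])

with conn.cursor() as cur:
    # EXPLAIN reports the chosen plan without running the query. A plan line
    # with "Seq Scan on artifacts" means the whole table would be read;
    # "Index Scan using ..." means an existing index would be used instead.
    cur.execute("EXPLAIN SELECT * FROM artifacts WHERE job_id = %s", ("placeholder-id",))
    for (plan_line,) in cur.fetchall():
        print(plan_line)

conn.close()
```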
At 04:17 UTC we discovered a build that contained over 60,000 jobs; it's unusual for a build to contain more than 1,000. REST API responses for this build included every job, and building those responses hit request timeout limits; fetching and rendering the jobs exhausted our API server and database capacity. At 04:30 UTC we temporarily revoked the API token for the customer associated with the build, and API error rates and latencies returned to their normal levels.
We discovered a small number of builds with more than 60,000 jobs that were created or running between 00:59 UTC and 01:40 UTC, correlating with the period the matrix feature was deployed in production. Following this line of investigation, we located a bug in the new matrix feature code that caused builds using parallelism groups and retry rules to grow the number of jobs in the build when a single job was retried. A retry of a single job in a parallelism group of N jobs would create N new jobs, creating a feedback loop of endless job creation within a build. Despite the matrix feature code having been reverted at 01:40 UTC, the size of these parallelism groups was still growing at 04:53 UTC because the normal automated retry behaviour was being applied to the now-large parallelism groups. At 04:50 UTC we canceled all in-progress builds with more than 10,000 jobs.
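To make the failure mode concrete, here is a deliberately simplified sketch of the buggy interaction (illustrative only, not our actual job-creation code): each retry of a single job re-created the whole parallelism group, so the group size compounded with every automated retry.

```python
# Simplified illustration of the failure mode; not Buildkite's actual
# job-creation code. Retrying one job in a parallelism group re-created the
# whole group instead of a single replacement job.

def retry_single_job_buggy(jobs, group_size):
    # Intended behaviour: append 1 replacement job.
    # Buggy behaviour: append `group_size` new jobs for a single retry.
    jobs.extend(["job"] * group_size)

jobs = ["job"] * 10              # a parallelism group of 10 jobs
for _ in range(4):               # four automated retries of one failing job
    retry_single_job_buggy(jobs, group_size=len(jobs))

print(len(jobs))                 # 160: each retry doubles the group
```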
From 04:30 UTC, system operation and performance were restored for all but one customer.
Between 04:50 UTC and 07:18 UTC we resolved job dispatch performance issues for that customer and confirmed no other customers were impacted. The incident was resolved at 07:18 UTC.
We’re improving our internal monitoring systems to detect outages earlier, so we’re able to respond more quickly to service disruptions. Our incident response processes are being updated to communicate incidents earlier via our status page.
In response to the retry logic failure mode discovered during this incident, we've increased related test coverage. To avoid unbounded growth of jobs within a build, we are implementing a per-build job limit based on real usage patterns, which will prevent runaway job creation without affecting customers' builds.
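As a rough sketch of the kind of guard we mean (the limit value and enforcement point shown here are assumptions, not the shipped implementation):

```python
from dataclasses import dataclass, field

# Illustrative sketch of a per-build job cap. The value below and the point
# at which it is enforced are assumptions, not the shipped implementation.
MAX_JOBS_PER_BUILD = 10_000

class JobLimitExceeded(Exception):
    """Raised when adding jobs would push a build past the cap."""

@dataclass
class Build:
    id: str
    jobs: list = field(default_factory=list)

def add_jobs(build: Build, new_jobs: list) -> None:
    # Reject any job creation (including retries) that would exceed the cap,
    # so a retry feedback loop cannot grow a build without bound.
    if len(build.jobs) + len(new_jobs) > MAX_JOBS_PER_BUILD:
        raise JobLimitExceeded(
            f"build {build.id} would exceed {MAX_JOBS_PER_BUILD} jobs"
        )
    build.jobs.extend(new_jobs)
```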
We acknowledge that this incident forms part of a recent pattern of outages, which have stemmed from a range of different technical root causes. This has been a huge year for Buildkite, and we need to ensure that our processes keep up with our rate of growth and hiring. In response, in the new year we'll conduct and publish a Reliability Review into the driving factors of this pattern and how we're addressing them.