From 17:33 to 18:20 (UTC) on March 5th, Buildkite Pipelines experienced degraded performance impacting all customers. Between 17:33 and 19:10, some customers experienced severe performance degradation, including periods during which no builds progressed.
The primary impact to customers was a delay in the time it took for a job to be started by an agent, as agents experienced elevated latency and error rates when communicating with the Buildkite API. The graph below shows the average latency customers experienced between when a build was created and when the first job in that build started.
Additionally, the Buildkite website experienced increased latency and error rates during this time.
We run several Aurora RDS clusters for our Pipelines databases, each with a single writer and a single reader instance. At 17:33 (UTC), a hardware failure of the writer instance on one of these clusters triggered an automatic failover to the reader. This meant that for nine minutes all database queries on that cluster were directed to a single database instance, which became overloaded and caused queries to time out. This had a knock-on effect of overloading our Agent API, which was starved of capacity by the number of requests waiting on responses from the database.
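To give a sense of that knock-on effect, a rough Little's-law calculation shows how quickly a fixed pool of API workers is exhausted when database latency spikes. The numbers below are hypothetical, chosen only for illustration:

```python
# Back-of-the-envelope illustration of how a database latency spike starves a
# fixed pool of API workers (Little's law: in-flight requests ≈ rate × latency).
# All numbers here are hypothetical, not Buildkite's actual traffic or capacity.

def in_flight(requests_per_second: float, seconds_per_request: float) -> float:
    """Average number of concurrent requests for a given arrival rate and latency."""
    return requests_per_second * seconds_per_request

WORKERS = 200                       # hypothetical API worker capacity
RATE = 1_000                        # hypothetical requests per second

healthy = in_flight(RATE, 0.05)     # 50 ms database queries -> ~50 in flight
degraded = in_flight(RATE, 5.0)     # 5 s waits and timeouts -> ~5,000 in flight

print(f"healthy:  {healthy:.0f} in-flight requests, {WORKERS} workers available")
print(f"degraded: {degraded:.0f} in-flight requests, workers exhausted and requests queue")
```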
Although the failed database instance recovered at 17:42, the number of concurrent queries hitting the new writer instance was too high for it to recover on its own. Our engineers had been paged automatically at 17:38 due to the high number of errors, and at 18:15 we began shedding load bound for the impacted database instance to reduce concurrency, which restored service for most customers.
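"Shedding load" here means deliberately rejecting a portion of the requests bound for the overloaded shard so that the queries which do get through can complete, rather than letting everything queue and time out. As a minimal sketch of the idea (not our actual implementation), a concurrency limiter in front of a shard might look like this:

```python
# Minimal load-shedding sketch: cap the number of concurrent queries sent to a
# database shard and reject the excess immediately instead of letting it queue.
# This is an illustration of the technique only, not Buildkite's implementation.
import threading

class ShardOverloaded(Exception):
    """Raised when the shard is at its concurrency limit; callers should retry later."""

class ShardLimiter:
    def __init__(self, max_concurrency: int):
        self._slots = threading.BoundedSemaphore(max_concurrency)

    def run(self, query_fn):
        # Non-blocking acquire: if the shard is already saturated, shed this
        # request rather than adding another waiting query to the pile-up.
        if not self._slots.acquire(blocking=False):
            raise ShardOverloaded()
        try:
            return query_fn()
        finally:
            self._slots.release()

# Agents whose requests are shed receive an immediate "retry later" response
# instead of a slow timeout, keeping concurrency on the struggling shard bounded.
limiter = ShardLimiter(max_concurrency=50)
```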
Our team gradually re-enabled service for customers on the affected database, and by 19:06 job start latency had recovered for the remaining customers. At this point we were still seeing a low rate of errors due to two bugs in the Ruby on Rails framework. After a manual restart of the affected services the error rate recovered, and service was fully restored at 19:19.
Hardware failures are a normal part of running a platform such as Buildkite, and this incident has given us insights into how we can design better for this type of failure. This was the first time we'd seen a hardware failure of this kind during peak load, and we didn't anticipate that such a failure wouldn't recover on its own once the database cluster was back to a healthy state.
Since the January Severity 1 incident we have improved the resilience of our platform by strengthening isolation between database shards, but we have more work to do, and this incident reiterates the importance of that work. In particular, we brought forward a project to improve load isolation within our Agent API after the earlier incident. Once complete, that isolation will substantially mitigate cross-shard impact in similar incidents.
The lessons from the earlier incident were invaluable here, as we already had processes in place to shed load, enabling us to restore service more quickly. We have made the following changes to avoid a recurrence of this issue:
Additionally, we are investigating ways to reduce the impact of high concurrency on our database, which causes excess time to be spent waiting on LWLock:lockmanager. This was also a contributing factor in the aforementioned January incident. When the time spent obtaining “non-fast-path locks” reaches a critical point, the database gets into a state that can only be recovered by shedding load. By reducing the number of partitions a query has to scan to find its data, we can reduce the number of locks that need to be obtained, preventing the database from reaching this critical point.
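To make the partition point concrete: PostgreSQL takes a relation lock on every partition a query's plan touches, and once a backend needs more than a small number of locks the extras go through the shared lock manager rather than the per-backend fast path, which is where LWLock:lockmanager contention shows up under high concurrency. Including the partition key in a query lets the planner prune partitions and take far fewer locks. The sketch below is illustrative only; the table name, column names, and connection string are hypothetical:

```python
# Illustrative comparison of a pruned vs. unpruned query against a partitioned
# table. The schema and DSN are hypothetical: "jobs" is assumed to be
# partitioned with "organization_id" as the partition key.
import psycopg2

UNPRUNED = "EXPLAIN SELECT * FROM jobs WHERE id = %s"
PRUNED = "EXPLAIN SELECT * FROM jobs WHERE organization_id = %s AND id = %s"

def show_plan(cur, sql, params):
    cur.execute(sql, params)
    for (line,) in cur.fetchall():
        print(line)
    print("-" * 60)

with psycopg2.connect("dbname=buildkite_example") as conn:
    with conn.cursor() as cur:
        # Without the partition key, the plan scans (and locks) every partition.
        show_plan(cur, UNPRUNED, (123,))
        # With the partition key, the planner prunes to a single partition,
        # so only that partition's locks are taken.
        show_plan(cur, PRUNED, ("org_1", 123))
```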
One of the Ruby on Rails bugs we encountered today has been reported upstream; the second is one we have seen before but has not yet been reported. These bugs cause the Rails database connection pool to get into an inconsistent state when a database stops responding, even for a brief period of time. We will work with the Rails maintainers to get them resolved.
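Independent of the Rails specifics, the behaviour a pool needs after a brief outage is to verify connections before handing them out and to replace any that were left broken. A generic sketch of that behaviour (not the Rails connection pool, and not our fix):

```python
# Generic sketch of a connection pool that validates connections on checkout and
# replaces any left broken by a brief database outage. This illustrates the
# desired behaviour only; it is not the Rails connection pool or Buildkite's fix.
import queue
import psycopg2

class ValidatingPool:
    def __init__(self, dsn: str, size: int):
        self._dsn = dsn
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(self._connect())

    def _connect(self):
        conn = psycopg2.connect(self._dsn)
        conn.autocommit = True
        return conn

    def checkout(self):
        conn = self._idle.get()
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT 1")   # cheap liveness check
            return conn
        except psycopg2.Error:
            conn.close()                  # discard the broken connection...
            return self._connect()        # ...and hand out a fresh one

    def checkin(self, conn):
        self._idle.put(conn)
```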
Introducing horizontal sharding of our databases has substantially improved the scalability and reliability of our system, but, as with any change, has brought with it new challenges. More databases means hardware failures are going to be more common, and we need to handle those failures gracefully. On this occasion, we were not able to do so, and we acknowledge the impact this had on your use of Buildkite.