Over the course of two months, Buildkite experienced three interrelated database issues, resulting in periods of degraded performance and some downtime. We deeply regret the impact this had on our customers, and would like to share what happened and what we’ve done about it.
The three database issues were an unexpected storage limitation, degraded query performance, and a risk of transaction ID wraparound.
On 19 March 2020, we discovered that our AWS RDS PostgreSQL database storage would not scale beyond 16 TiB. This contradicted documentation and announcements stating a 64 TiB limit. Our database monitoring and forecasting showed approximately eight weeks runway before we exceeded 16 TiB.
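As a rough sketch of the kind of check that forecasting can be built on (not our exact tooling), the underlying numbers come from ordinary PostgreSQL catalog queries:

```sql
-- Total on-disk size of the current database, suitable for graphing over time
-- to forecast storage runway.
SELECT pg_size_pretty(pg_database_size(current_database())) AS database_size;

-- The largest tables (including their indexes and TOAST data), to see where
-- the growth is coming from.
SELECT c.oid::regclass AS table_name,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
WHERE c.relkind = 'r'
ORDER BY pg_total_relation_size(c.oid) DESC
LIMIT 10;
```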
We contacted AWS support, and within a few days they explained the issue and proposed a solution. Our database had been provisioned in 2015, when RDS used a 32-bit filesystem limited to 16 TiB. Amongst several theoretical options, the one viable solution involved us creating a read replica of the database, which the RDS team would modify to a 64-bit filesystem, ready to be verified and promoted to primary.
On 26 March 2020 we handed the read replica to the RDS team; within a few days it was ready to scale storage up to a maximum of 64 TiB. We performed integrity testing based on PostgreSQL’s data checksums while preparing for a scheduled weekend maintenance window in which to promote the database to primary.
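As a sketch of what checksum-based verification can look like (our actual process differed in the details): with data checksums enabled, every page read into shared buffers is verified, so forcing a full read of each table on the replica surfaces any corruption as checksum errors. The pg_prewarm extension is one way to drive that read:

```sql
-- Sketch only: confirm data checksums are enabled, then force a full read of
-- every ordinary table so any corrupted page raises a checksum error.
SHOW data_checksums;

CREATE EXTENSION IF NOT EXISTS pg_prewarm;

-- 'buffer' mode (the default) reads each block through shared buffers, which
-- verifies its checksum. Expensive on a large database; run against a replica.
SELECT c.oid::regclass AS table_name,
       pg_prewarm(c.oid::regclass) AS blocks_read
FROM pg_class c
WHERE c.relkind = 'r';
```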
On 18 April 2020 we promoted the new database to primary, buffering inbound webhooks to AWS S3 during maintenance so that builds triggered during that time would run afterwards. We published this as https://www.buildkitestatus.com/incidents/vklxjkn52fpf.
The replica promotion went to plan, and the storage limitation was resolved.
On 21 April 2020, our monitoring system paged us due to high database CPU utilization and degraded query/API performance. We posted an incident at https://www.buildkitestatus.com/incidents/0yywdz6b6kqs
Our first theory was a thundering herd caused by a coinciding GitHub incident (https://www.githubstatus.com/incidents/516zy0q1ns06). During that incident GitHub stopped sending webhook notifications, so our customers’ builds weren’t triggered and their buildkite-agent fleets scaled in; when GitHub then processed its notification backlog, agent and build activity surged.
Our second theory was some kind of hardware issue, causing our usual workload to manifest as much higher CPU utilisation. This was our first weekly peak load on the new database instance, and the previous database instance had handled the same load without issue. We had experienced an instance-specific performance issue a few months prior, where a planned database failover resulted in degraded performance which was ultimately traced back to a bug in the RDS/EBS storage system.
Neither of these theories ended up explaining the issue, but they did cloud our subsequent investigation.
On 22 April 2020 (the following day) we were still seeing degraded performance and had not found a root cause. To reduce customer impact we doubled our database instance class, from db.r5.12xlarge (48 vCPU, 384 GiB RAM) to db.r5.24xlarge (96 vCPU, 768 GiB RAM). This maintenance was done outside business hours for most customers, and was published as https://www.buildkitestatus.com/incidents/0yywdz6b6kqs
The extra 48 CPU cores resolved the customer-impacting performance issues, but CPU utilization was still higher than expected. Over the following week we identified some expensive API usage patterns from particular customers, and worked with them to satisfy those use cases in more efficient ways. Based on metrics and past performance, we believed we could revert to the original database instance class.
On 29 April 2020 we reverted the database to db.r5.12xlarge. This was again done outside business hours for most customers, with webhooks buffered to S3 for subsequent processing, and was tracked in https://www.buildkitestatus.com/incidents/f6492w22g0rf.
The following peak traffic period resulted in degraded API and job dispatch performance, which we published in https://www.buildkitestatus.com/incidents/2b64j3hkqv45. This confirmed that the underlying problem had not been resolved, and that it wasn’t specific to one database instance.
On 30 April 2020 we upgraded the database instance type once again. We discovered a db.r5.16xlarge instance class (64 vCPU, 512 GiB RAM) had been recently introduced, but required a PostgreSQL minor version upgrade before we could use it. That two-step upgrade meant a few extra minutes of downtime, once again done outside most business hours and with webhooks buffered for subsequent processing, which we published in https://www.buildkitestatus.com/incidents/gxncbp2rmxqq.
Performance on the db.r5.16xlarge started looking more acceptable, but not ideal. Over the following week we shipped various mitigations to the application and schema to reduce database load and improve API performance. We also ran various experiments using clones of the production database restored from backup. However, CPU utilization remained higher than expected, and we still had some periods of degraded performance, leading us to publish the problem as an ongoing intermittent incident at https://www.buildkitestatus.com/incidents/h0vnx4gp7djx.
We continued searching for a root cause.
Meanwhile, we had needed to interrupt several VACUUM operations to do database maintenance. The main purpose of vacuuming is to remove dead tuples from index and table data. However, VACUUM serves another, more critical purpose: facilitating transaction ID re-use. PostgreSQL’s MVCC model breaks down if more than 2^32 / 2 (~2 billion) transaction IDs are used between vacuums. The resulting transaction ID wraparound is a catastrophic event, resulting in a forced shutdown and prolonged downtime for a large database.
We closely monitor our unvacuumed transaction ID usage, and given our usual VACUUM duration we were alert but not alarmed.
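The metric in question is the age of the oldest unfrozen transaction ID, which PostgreSQL exposes directly; a minimal version of the kind of check we alert on (thresholds omitted) looks like this:

```sql
-- Age of the oldest unfrozen transaction ID in each database. PostgreSQL's
-- wraparound protection forces a shutdown as this approaches ~2^31.
SELECT datname,
       age(datfrozenxid) AS xid_age,
       2147483648 - age(datfrozenxid) AS approx_ids_remaining
FROM pg_database
ORDER BY age(datfrozenxid) DESC;

-- The same per table, to see which relations most urgently need vacuuming.
SELECT c.oid::regclass AS table_name, age(c.relfrozenxid) AS xid_age
FROM pg_class c
WHERE c.relkind = 'r'
ORDER BY age(c.relfrozenxid) DESC
LIMIT 10;
```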
On 6 May 2020 we became actively concerned about the possibility of a PostgreSQL transaction ID wraparound.
Each interrupted VACUUM had left behind dead tuples it would otherwise have removed, making each subsequent VACUUM slower than the last. Due to an unresolved issue, PostgreSQL has a hard limit of ~180 million dead tuples scanned before it triggers an additional index vacuum phase. By this stage our indexes had become extremely bloated, causing each index vacuum phase to take two days. With ~500 million dead tuples to process, we estimated a total duration of 10 days. That would finish a few days before our estimated transaction ID wraparound, but much too close for comfort.
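Two standard views informed that estimate: per-table dead tuple counts, and the progress and phase of the vacuum that was running. Roughly (a sketch, not our exact dashboards):

```sql
-- Dead tuples per table, showing how much work each VACUUM has ahead of it.
SELECT relname, n_dead_tup, n_live_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- Progress of any running VACUUM, including its current phase and how many
-- index vacuum passes it has already made (the slow part in our case).
SELECT pid, relid::regclass AS table_name, phase,
       heap_blks_scanned, heap_blks_total, index_vacuum_count
FROM pg_stat_progress_vacuum;
```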
We immediately brought the whole Engineering team to work on three fronts: preparing a last-resort plan in case wraparound could not be averted, reducing our transaction ID consumption, and finding a way to finish vacuuming much sooner.
On the first front, we remained confident that we wouldn’t reach wraparound, but wanted options and a solid plan. We designed a much softer landing than a sudden database shutdown, and agreed on a threshold of remaining transaction IDs at which we would execute this last-resort plan.
Thankfully we didn’t need to explore too far down any of these paths.
On the second front, we searched for transaction ID usage hot spots that could be changed to consume fewer IDs. This work reduced our transaction ID usage and also reduced database CPU utilization.
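One rough way to gauge the cluster-wide burn rate while doing this kind of work is to sample txid_current() over an interval; a sketch (the call itself consumes a transaction ID, which is negligible at this scale), with purely illustrative values:

```sql
-- Rough sketch: sample the current transaction ID twice, some interval apart,
-- and divide the difference by the elapsed seconds to get IDs consumed per
-- second across the whole cluster.
SELECT txid_current();  -- e.g. 3100000000
-- ... wait 60 seconds ...
SELECT txid_current();  -- e.g. 3100120000  => ~2,000 transaction IDs/second
```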
On the third front, the ideal outcome would be some way to confidently finish vacuuming much sooner. We ran several experiments on production database clones, evaluating completion time and customer impact for various combinations of vacuum parameters, database instance classes, and PostgreSQL versions.
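The vacuum parameters varied in those experiments were the usual cost and memory settings, applied per session against a clone; the values here are illustrative and the table name is hypothetical:

```sql
-- Illustrative per-session settings for an experimental manual VACUUM on a
-- clone (not our production values):
SET maintenance_work_mem = '1GB';  -- the dead-tuple list is capped at 1 GiB,
                                   -- which is where the ~180M tuple limit comes from
SET vacuum_cost_delay = 0;         -- run the manual VACUUM unthrottled
VACUUM (VERBOSE) builds;           -- 'builds' is a hypothetical table name
```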
We also discovered a new VACUUM parameter introduced in PostgreSQL 12 called INDEX_CLEANUP; setting it to false skips most of the slow vacuuming behaviour and focuses on transaction ID management. We were running PostgreSQL 11, but had plans to upgrade to 12 in the coming weeks. We could bring those plans forwards if needed.
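For reference, the PostgreSQL 12 invocation we were preparing for looks roughly like this (table name hypothetical):

```sql
-- PostgreSQL 12+: skip index vacuuming so the run focuses on freezing tuples
-- and making transaction IDs reclaimable. Index bloat is left for later.
VACUUM (FREEZE, VERBOSE, INDEX_CLEANUP FALSE) builds;  -- hypothetical table name
```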
The results of the experiments showed that upgrading to PostgreSQL 12 and using the new INDEX_CLEANUP parameter was the only option we were confident would finish before transaction ID wraparound. This option required 15-30 minutes downtime, which we decided was regrettable but necessary.
On 7 May 2020 we scheduled, planned and executed the upgrade from PostgreSQL 11 to 12 that we had been planning to do in the coming weeks. This was done outside business hours for most of our customers, with inbound webhooks buffered to S3 for subsequent processing. It was published as part of https://www.buildkitestatus.com/incidents/h0vnx4gp7djx
The upgrade itself went to plan, matching our expectations from the trial upgrades we had run on production clones. We expected cold-start performance degradation for up to five minutes after starting the upgraded instance, but it extended well beyond that. After 15 minutes, we stopped background workers and put the site back into maintenance mode to alleviate the extremely high database load. We drilled into metrics and explored hypotheses such as PostgreSQL 12’s changes to CTE query planning. It took us longer than it should have to realise that we had forgotten to run ANALYZE after the major version upgrade, so the query planner was unable to use indexes effectively.
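A major version upgrade does not carry optimizer statistics across, so the fix was as unglamorous as it sounds; something like the following (or vacuumdb --all --analyze-in-stages from the command line):

```sql
-- Rebuild planner statistics for every table in the current database; until
-- this runs, the planner has little to base its index choices on.
ANALYZE VERBOSE;
```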
After analyzing all tables, performance on the PostgreSQL 12 instance was excellent. The new emergency vacuum (with INDEX_CLEANUP disabled) ran in just 12 hours, rather than the 10-day forecast for our earlier autovacuum. We were then able to use PostgreSQL 12’s new REINDEX CONCURRENTLY feature to quickly reduce index bloat. After that, a regular vacuum (with index cleanup) of our largest table took a very reasonable 21 hours.
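For reference, that bloat reduction used the new PostgreSQL 12 form of REINDEX, which rebuilds an index without the write-blocking lock a plain REINDEX takes (index and table names hypothetical):

```sql
-- PostgreSQL 12+: rebuild a bloated index without blocking writes.
REINDEX INDEX CONCURRENTLY idx_builds_on_project_id;  -- hypothetical index name

-- Or rebuild every index on a table in one statement.
REINDEX TABLE CONCURRENTLY builds;                    -- hypothetical table name
```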
Ever since the initial alert of degraded performance, we had been constantly testing theories and searching for a root cause. We understood the unexpected storage limitation, and we understood the risk of transaction ID wraparound, but we still did not understand the query performance degradation.
We had previously seen correlations with an upstream outage and with a new database instance, but neither had turned out to be the cause.
Eventually we found a third correlation.
Our logs showed that an ANALYZE of our largest table had finished just before the CPU utilization increased. That also correlated with a huge increase in pages read per second for that table, although from RAM rather than disk, so it didn’t show up in IO stats. This new finding prompted us to take yet another look at some key query plans.
We finally discovered that ever since that fateful ANALYZE, one of our primary-key-scoped queries was no longer using the primary key index, and instead scanning a far less efficient index. The ANALYZE operations that should have happened in the intervening weeks had been prevented by long-running vacuums holding table locks. The ANALYZE we belatedly ran after the PostgreSQL 12 upgrade fully resolved the root cause of the performance issue.
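Both observations map onto standard PostgreSQL views and tools; here is a simplified, hypothetical sketch of the kind of checks involved (the table, index, and query are illustrative, not our real schema):

```sql
-- Buffer-cache reads per table: a sudden jump in block hits for one table,
-- with no matching increase in disk reads, was the clue here.
SELECT relname, heap_blks_hit, heap_blks_read, idx_blks_hit, idx_blks_read
FROM pg_statio_user_tables
ORDER BY heap_blks_hit DESC
LIMIT 10;

-- Checking which index a primary-key-scoped query actually uses. A healthy
-- plan shows "Index Scan using builds_pkey"; the regressed plan used a far
-- less selective index and touched vastly more buffers.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM builds WHERE id = 12345;
```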
We deeply regret the impact this chain of events had intermittently throughout April and May. It has taught us many things about PostgreSQL and our internal processes. We have made a number of changes to ensure it won’t happen again, and we’re commencing a project to make further improvements to our database resilience.
We have a number of notable changes in place already:
Here’s what we will keep working on:
Thank you to Buildkite’s amazing customer community for being patient, understanding, and supportive while we got to the bottom of these issues. We now fully understand the unexpected storage limitation, query performance degradation, and risk of transaction ID wraparound that we experienced.