Performance of the Agent API and Buildkite web UI was degraded between 2023-11-21 19:19 and 21:11 UTC, then again between 22:22 and 22:50 UTC.
Elevated error rates and request latency meant that customers were unable to run many of their builds.
While investigating the degraded performance, we identified a number of customer builds that had entered runaway loops writing agent metadata. This spike in activity put high load on our primary pipelines database and degraded the performance of all other Buildkite agent interactions.
After we interrupted the runaway loops, database performance remained poor. We identified that the extreme volume of metadata writes had left two partitions of our builds table in an unhealthy state, resulting in poor performance for any related queries. We manually vacuumed these partitions, which brought database load back to normal levels and allowed normal Buildkite operations to resume.
At 22:22 UTC, about an hour after we first resolved this incident, we noted the same degraded performance pattern emerging, and identified the same runaway metadata write loop as the cause. We interrupted these runaway loops again and worked with the customer to find and fix the cause of this issue within their own builds. Ordinary Buildkite operations fully resumed by 22:50 UTC.
We’ve implemented a rate limit on per-build agent metadata writes, which will ensure pathological builds do not create the widespread impact we saw earlier today. This limit has been in place since 2023-11-22 03:00 UTC. Our default limit (1,000 writes/build/minute) should be enough to cover all ordinary usage of agent metadata, but if you experience any issues, please contact Buildkite support.
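To illustrate the general idea of a per-build write limit, here is a minimal sketch of a fixed-window counter keyed by build. This is not Buildkite's implementation: the class name `PerBuildWriteLimiter`, the `allow` method, the in-memory dictionary, and the choice of a fixed window are all assumptions made for the example. Only the default of 1,000 writes per build per minute comes from the text above.

```python
import time


class PerBuildWriteLimiter:
    """Hypothetical fixed-window limiter: at most `limit` writes per build per window."""

    def __init__(self, limit=1000, window_seconds=60.0, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._windows = {}  # build_id -> (window_start, writes_in_window)

    def allow(self, build_id):
        """Return True and count the write if the build is under its limit, else False."""
        now = self.clock()
        start, count = self._windows.get(build_id, (now, 0))

        # Start a fresh window once the previous one has fully elapsed.
        if now - start >= self.window_seconds:
            start, count = now, 0

        if count >= self.limit:
            # Over the limit: reject the write without counting it.
            self._windows[build_id] = (start, count)
            return False

        self._windows[build_id] = (start, count + 1)
        return True
```

In this sketch, a service would call `limiter.allow(build_id)` before each metadata write and reject the write when it returns `False`. Because each build has its own counter, one build in a runaway loop exhausts only its own allowance and leaves other builds unaffected. A production system would more likely keep these counters in a shared store and expire entries for finished builds, which this example omits.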
We’re also adjusting the data fetching mechanism for the builds listing pages on our web interface. These pages update themselves in real time in response to new builds or updates to your existing builds, and they will now fetch data more selectively, ensuring that they do not exacerbate any issues during periods of degraded performance. This change will be live within the next 24 hours.