On 2023-10-13, between 08:45 and 11:28 UTC (2h43m), some customers were unable to run builds because jobs were not being assigned to available agents. For customers using Clusters, some queues may have kept working while others were impacted.
Buildkite received reports from several customers that jobs were not running despite agents being available. Our monitoring did not alert us to any issues, there were no high-volume errors, and our metrics and tracing showed that, for most customers, jobs were still being dispatched to agents, although the total volume was lower than usual. No code or operational changes correlated with the timing of the behavior change. We paged more engineers to investigate why some customers were experiencing this issue. After some time, the problem was traced to a process responsible for recovering from failures, errors, and timeouts in assigning jobs to agents. That code had recently been updated, but the change did not cause a visible problem until some job assignments failed and were unable to self-heal as they normally would. The change was behind a feature flag, which we were able to revert right away, after which the system immediately recovered for all customers.
We have added automated alerting on statistical anomalies in the volume of jobs being assigned to agents, which we believe would have flagged this issue and reduced the time it took us to identify it. Additionally, we have increased the visibility of long-running job assignments.
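As a rough illustration of the kind of volume-based anomaly check described above, the following is a minimal sketch, not Buildkite's actual alerting. It assumes job assignment counts are sampled per fixed interval and flags a sample that falls well below the recent baseline. The function name, the sample data, and the threshold of three standard deviations are all hypothetical. A relative check like this matters in a case where jobs are still being dispatched, only in lower numbers than usual, so error-based alerts stay quiet.

```python
from statistics import mean, stdev


def is_anomalous_drop(history, current, threshold=3.0):
    """Return True if `current` is more than `threshold` sample standard
    deviations below the mean of `history`.

    `history` is a list of recent per-interval job assignment counts
    (hypothetical data; the interval length is an assumption).
    """
    if len(history) < 2:
        # Not enough data to estimate spread, so do not alert.
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat baseline: any decrease counts as a drop.
        return current < mu
    return (mu - current) / sigma > threshold


# Hypothetical per-interval assignment counts during normal operation.
baseline = [1000, 1020, 980, 1010, 990, 1005]

print(is_anomalous_drop(baseline, 995))  # within normal variation
print(is_anomalous_drop(baseline, 700))  # a sustained partial drop
```

A production check would likely also need to account for daily and weekly seasonality in build volume, which this sketch ignores.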
We have also enhanced the observability of our feature flag system so that incident responders can quickly see which flags were recently changed, which would also have reduced the time to identify the issue in this case.