On Saturday, 22 July 2023 at 10:06:20 UTC, our internal primary key for connected agents exceeded the maximum value of the PostgreSQL `integer` type (signed 32-bit integer; 2³¹ − 1 = 2,147,483,647). This was neither unexpected nor directly problematic: the primary key was stored as `bigint` (signed 64-bit integer), which has ample capacity. However, an associated foreign key mapping build jobs to agents was stored as the smaller `integer` type. As a result, jobs could not be assigned to agents that connected after that moment.
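To make the arithmetic concrete, here is a minimal Python sketch of the mismatch between the two column types. The helper names (`fits_integer`, `fits_bigint`) are illustrative only and not part of any Buildkite code; the bounds are the documented ranges of PostgreSQL's `integer` and `bigint` types.

```python
# Documented ranges of PostgreSQL's signed integer types.
INT4_MIN, INT4_MAX = -(2**31), 2**31 - 1   # `integer` (32-bit)
INT8_MIN, INT8_MAX = -(2**63), 2**63 - 1   # `bigint`  (64-bit)


def fits_integer(value: int) -> bool:
    """True if `value` can be stored in an `integer` column (hypothetical helper)."""
    return INT4_MIN <= value <= INT4_MAX


def fits_bigint(value: int) -> bool:
    """True if `value` can be stored in a `bigint` column (hypothetical helper)."""
    return INT8_MIN <= value <= INT8_MAX


last_safe_id = INT4_MAX        # 2,147,483,647: still fits both columns
first_failing_id = INT4_MAX + 1  # 2,147,483,648: fits only the bigint primary key

print(fits_bigint(first_failing_id))   # True: the primary key itself is fine
print(fits_integer(first_failing_id))  # False: the foreign key cannot hold it
```

The primary key keeps working past the threshold, which is why the failure only surfaced when a job needed to reference one of the new agent IDs.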
Within 4 minutes, our automated monitoring had alerted us to the situation. Our team began work on migrating that foreign key to `bigint`, while also investigating faster remediations. Neither the SQL standard nor PostgreSQL supports unsigned integer types. However, we discovered that PostgreSQL sequences do support negative start and increment values, and that Rails/ActiveRecord accepts them, so we could use the negative half of the signed 32-bit integer range as a temporary workaround. These identifiers are internal only (Buildkite uses separate UUIDs for public reference), so there was no risk of external systems rejecting the unusual negative IDs.
At 11:40 UTC, after verifying the solution in non-production environments and reconfiguring our table partitioning, we altered this ID sequence to start at -1 and “increment” by -1. Agents that connected after that moment worked correctly. By 12:08 UTC we had disconnected any remaining agents that had connected during the 10:06–11:40 window, so that they would reconnect with new IDs.
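The effect of a sequence that starts at -1 and steps by -1 can be modelled in a few lines of Python. This is only a simulation of the ID stream under those two settings, not the statement we ran; `descending_ids` is a hypothetical name introduced for illustration.

```python
from itertools import count, islice

INT4_MIN = -(2**31)  # lowest value a PostgreSQL `integer` column can hold


def descending_ids(start: int = -1, increment: int = -1):
    """Model a sequence that begins at `start` and moves by `increment` each call."""
    return count(start, increment)


# The first few IDs handed out after the change.
first_ids = list(islice(descending_ids(), 5))
print(first_ids)  # [-1, -2, -3, -4, -5]

# Every new ID is negative, so it cannot collide with an existing positive ID,
# and it still fits in the 32-bit foreign key column.
print(all(INT4_MIN <= i < 0 for i in first_ids))  # True

# Headroom bought by the workaround: the values -1 down to -2,147,483,648.
negative_capacity = -INT4_MIN
print(negative_capacity)  # 2147483648
```

The negative range is about as large as the positive range that had just been exhausted, which is why this counts as a temporary workaround and not a replacement for migrating the foreign key to `bigint`.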
Alongside our ongoing database sharding project, we have been using time-ordered UUID primary keys for new tables. However, we will retain existing numeric primary keys for some time, and this incident highlighted a blind spot in our monitoring. We have taken several actions to prevent this from happening again:
Our team would like to apologize to the customers who were impacted by this incident over the weekend. We are continuing to invest heavily in the scalability, reliability and resilience of our systems.