From 2023-07-12 23:48 to 2023-07-13 01:14 (UTC), customers experienced elevated latency and error rates on the Agent API. This impacted agents trying to accept and update jobs, upload artifacts and job logs, and request OIDC tokens.
At 23:43 some file uploads to a third-party provider started to experience significant latency (up to 15 seconds before timing out). Because these requests were occurring within a database transaction, the number of open transactions began to climb over the next few minutes.
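The failure mode is easiest to see in code. Here is a minimal sketch of the pattern, assuming a Python service; the table name, endpoint, and client libraries are illustrative, not our actual implementation:

```python
import psycopg2
import requests

def finalize_artifact(conn, artifact_id, file_data):
    # Hypothetical sketch of the problematic pattern: a transaction is
    # opened before the upload and held until after it completes.
    with conn:  # begins a transaction; commits (or rolls back) on exit
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE artifacts SET state = 'uploading' WHERE id = %s",
                (artifact_id,),
            )
            # Slow third-party call made *inside* the transaction. With a
            # 15-second timeout, the transaction (and its pooled database
            # connection) stays pinned for up to 15 seconds per request.
            requests.put(
                "https://third-party.example.com/upload",
                data=file_data,
                timeout=15,
            )
            cur.execute(
                "UPDATE artifacts SET state = 'finished' WHERE id = %s",
                (artifact_id,),
            )
```

Every request stuck waiting on the upload holds its transaction, and therefore its connection, for the full duration of the upload, so slow uploads translate directly into a growing count of open transactions.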
By 23:49 the database connection limits we place via PgBouncer were reached, and requests to our primary pipelines database from Agent API servers became significantly degraded. These limits help contain the impact to Agent API requests alone, so other systems (Web, REST, and GraphQL) continued to behave as normal.
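The containment works because each service reaches the database through its own bounded pool, so exhausting one pool starves only that service. A rough client-side analogue, with a hypothetical pool size and DSN (note PgBouncer actually queues waiting clients until a timeout rather than failing fast, which is why requests degraded rather than erroring outright):

```python
from psycopg2.pool import ThreadedConnectionPool

# Hypothetical bounded pool, analogous to a per-service PgBouncer pool limit.
pool = ThreadedConnectionPool(
    minconn=1,
    maxconn=20,  # once all 20 connections are pinned by slow transactions...
    dsn="dbname=pipelines user=agent_api",  # illustrative DSN
)

# ...further checkouts raise PoolError rather than opening new database
# connections, so the exhaustion is scoped to this one service's pool.
conn = pool.getconn()
```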
At 23:59 our engineers received a page and began investigating.
At 00:20 we identified that the file uploads were much slower than normal.
At 00:46 we identified the root cause as a networking issue.
At 01:14 we rolled out a change to route traffic for uploads via a different network endpoint, which resulted in an almost immediate recovery.