Elevated error rates

Incident Report for Buildkite

Postmortem

Service Impact

At 03:27 UTC on November 12th, our Redis cache cluster failed over during routine maintenance. The writer node became unavailable and the replica node was automatically promoted to be the new writer. This caused an error spike that peaked at 03:48 UTC, with 70% of HTTP requests returning errors, before rapidly falling until recovery at 03:54 UTC.

During the error spike, requests to our Web interface and APIs returned errors, and some Jobs were delayed by up to several minutes in being assigned to Agents.

Incident Summary

03:26 UTC We applied routine maintenance to our Redis cluster. This would normally result in little or no downtime; however, for reasons as yet unknown, our application did not handle the event gracefully.
03:31 We declared an incident due to a small but definite increase in errors communicating with Redis.
03:48 The error rate began to rapidly spike.
03:49 We canceled the queued maintenance on other Redis clusters.
03:54 The error rate rapidly returned to a baseline and we started seeing recovery.
03:55 As an additional precaution, we restarted our application to ensure all connections were updated to the new writer node.
04:17 The incident was marked as resolved.

Changes we're making

We are investigating why our application did not fail over to the new writer node as expected. We had previously upgraded our client library to fix a bug with failovers when using AWS ElastiCache, but this incident indicates there is still work to do to ensure routine maintenance has minimal impact on our systems. We will also update our Redis cluster upgrade process to include a review of relevant Redis client updates.
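
For readers unfamiliar with the failure mode, the sketch below illustrates one general way a client can recover when a writer node is replaced: on a connection or read-only error, it discards the connection, waits briefly, opens a fresh connection, and retries. This is not our application code and does not describe the root cause, which is still under investigation. Every name in it (ReadOnlyError, FakeNode, call_with_failover, the node names) is hypothetical and stands in for a real Redis client.

```python
import time


class ReadOnlyError(Exception):
    """Hypothetical stand-in for the error a non-writer node returns on a write."""


def call_with_failover(connect, operation, attempts=4, base_delay=0.05, sleep=time.sleep):
    """Run operation(conn); on a connection or read-only error, reconnect and retry."""
    conn = connect()
    for attempt in range(attempts):
        try:
            return operation(conn)
        except (ConnectionError, ReadOnlyError):
            if attempt == attempts - 1:
                raise  # out of retries, surface the error to the caller
            sleep(base_delay * 2 ** attempt)  # exponential backoff between tries
            conn = connect()  # fresh connection, so the writer endpoint is looked up again


class FakeNode:
    """Hypothetical in-memory node used only to exercise the retry logic."""

    def __init__(self, name, writable):
        self.name = name
        self.writable = writable
        self.data = {}

    def set(self, key, value):
        if not self.writable:
            raise ReadOnlyError(f"{self.name} is not the writer")
        self.data[key] = value
        return self.name


# Simulated failover: the first lookup still yields the old node,
# the second lookup yields the promoted replica.
old_writer = FakeNode("node-a", writable=False)
new_writer = FakeNode("node-b", writable=True)
endpoints = iter([old_writer, new_writer])

result = call_with_failover(
    connect=lambda: next(endpoints),
    operation=lambda conn: conn.set("build:1", "queued"),
    sleep=lambda seconds: None,  # skip real waiting in this demonstration
)
```

In this simulation the first write fails against the old node and the retry lands on the promoted replica. A real client would also need a cap on total retry time so that a prolonged outage does not pile up blocked requests.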

Posted Nov 13, 2024 - 00:33 UTC

Resolved

The issue is now fixed. This incident has been resolved.

Posted Nov 12, 2024 - 04:21 UTC

Monitoring

The fix has been deployed. We are now monitoring the issue.

Posted Nov 12, 2024 - 04:16 UTC

Investigating

We are investigating elevated error rates across our services.

Posted Nov 12, 2024 - 04:00 UTC

This incident affected: Web.