At 03:27 UTC on November 12th, our Redis cache cluster experienced a failover during routine maintenance: the writer node became unavailable and the replica node was automatically promoted to writer. The failover triggered an error spike that peaked at 03:48 UTC, when 70% of HTTP requests were returning errors, and then fell rapidly until full recovery at 03:54 UTC.
During the error spike, requests to our Web interface and APIs failed intermittently, and some Jobs were delayed by up to several minutes before being assigned to Agents.
We are investigating why our application did not fail over to the new writer node as expected. We had previously upgraded our Redis client library to fix a failover bug affecting AWS ElastiCache, but this incident shows there is still work to do to ensure routine maintenance has minimal impact on our systems. We will also update our Redis cluster upgrade process to include a review of relevant Redis client updates.
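To illustrate the kind of client-side handling under review, here is a minimal sketch of how a Python service using redis-py could be configured to retry commands and reconnect when an ElastiCache primary endpoint fails over to a newly promoted writer. The client library, endpoint name, timeouts, and retry budget shown are assumptions for illustration only, not our production configuration or the specific fix being applied.

```python
# Hypothetical sketch: configuring redis-py to tolerate an ElastiCache failover.
from redis import Redis
from redis.backoff import ExponentialBackoff
from redis.retry import Retry
from redis.exceptions import ConnectionError, TimeoutError

client = Redis(
    # ElastiCache's primary endpoint is a DNS name that is repointed to the
    # newly promoted writer during a failover, so reconnecting re-resolves it.
    # This hostname is a placeholder, not a real endpoint.
    host="example-cache.xxxxxx.ng.0001.use1.cache.amazonaws.com",
    port=6379,
    socket_timeout=2,             # fail fast instead of hanging on a dead writer
    socket_connect_timeout=2,
    retry=Retry(ExponentialBackoff(cap=5, base=0.5), retries=5),
    retry_on_error=[ConnectionError, TimeoutError],
    health_check_interval=30,     # periodically verify the connection is alive
)

# Commands issued during the failover window are retried with backoff; once
# DNS points at the new writer, the reconnect succeeds and traffic resumes.
client.set("failover:connectivity_check", "ok")
```

The key idea is that the client should treat connection and timeout errors during a failover as retryable, with bounded backoff, rather than surfacing them directly to requests.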