Major outage of services
Incident Report for Buildkite
Postmortem

Last week on the 4th of March, Buildkite had a major outage of all services (Web, API and Agent API), with downtime that lasted just under 3 hours. During this time no builds could run and all incoming webhooks were lost.

This sort of downtime is unacceptable, and we apologise for the inconvenience this has caused. We know ourselves how frustrating it can be when services we use every day go down, especially when companies go quiet afterwards. So here we'll describe what happened, and what steps we're taking to ensure it doesn't happen again.

Mar 04, 2015 - 07:22 AEDT

We received a PagerDuty alert, triggered by Pingdom, telling us that one of our application servers had run out of memory. Generally when this happens our process monitors restart the offending Unicorns, but in this case they didn't seem to be working. The entire team was notified, and everyone jumped onto the company Slack to figure out what the problem was.

At this point we started digging into what was causing the boxes to run out of memory so fast. We figured it must be a spike in traffic, because New Relic was showing a massive increase in inbound traffic.

Mar 04, 2015 - 07:48 AEDT

We started the investigation by looking through our Papertrail logs, but whatever was causing our problems was also chewing through our Papertrail quota, which made the logs useless.

After about 20 minutes of digging manually through logs on the servers and exceptions from Bugsnag, we identified the problem as an agent sending us a very large build log. After a little more digging, we were able to identify which agent and which build.

The agent has a dumb approach to streaming logs to Buildkite. It's actually very dumb: we send the entire log to Buildkite every second. For small logs (the average log is around 50–100KB) this approach is fine...

However... in this case, a customer's build log grew to a few gigabytes, so we were getting hit with gigabytes of traffic every second. We disabled the offending agent from our end, but it was still sending us the build log. At this point we discovered that our NGINX config was allowing POST bodies of up to 4GB, so we reduced it to a more reasonable number.
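
For reference, that cap is a single NGINX directive. The value below is illustrative rather than the exact limit we settled on:

```nginx
# Illustrative only: cap request bodies far below the multi-gigabyte logs that hit us.
# NGINX rejects anything larger with a 413 Request Entity Too Large response.
http {
    client_max_body_size 5m;
}
```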

Mar 04, 2015 - 08:39 AEDT

Once we had shut down the agent, and alerted the customer that we had done so, we started getting alerts from our database that writes were failing. In true Murphy's law fashion, our database had run out of disk space. We currently store build logs in our database (another dumb thing), and this one really large build log pushed our disk usage over the edge.

We didn't have CloudWatch rules set up to alert us when our RDS instance was close to running out of space, so this hit us out of nowhere.
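
We've since added alarms for this (see the list below). Roughly speaking they look like the sketch here, although the instance identifier, threshold and SNS topic are placeholders rather than our real values:

```python
# Sketch: raise an alarm when an RDS instance's free storage drops below ~10GB.
# "buildkite-production" and the SNS topic ARN are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="rds-free-storage-low",
    Namespace="AWS/RDS",
    MetricName="FreeStorageSpace",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "buildkite-production"}],
    Statistic="Average",
    Period=300,                 # average over 5 minute windows
    EvaluationPeriods=1,
    Threshold=10 * 1024 ** 3,   # bytes, i.e. 10GB
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-pager"],
)
```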

At this point we had no choice but to bring everything down and scale our database server up to a higher capacity. We turned off our web processes, which caused the maintenance page to be served.

Unfortunately, ELB saw the maintenance page as an error and took the application servers out of rotation, which meant that when you tried to access buildkite.com you just got a blank page instead of a maintenance page, bringing you here.
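
One way to avoid this in future (and it's on the list below) is to keep the load balancer's health check returning a 200 while every other request gets the maintenance page. Here's a sketch of that in NGINX, where /_health is a placeholder that would need to match the health check path configured on the ELB:

```nginx
# Sketch: keep the ELB health check passing while the app is down for maintenance.
server {
    listen 80;

    # The load balancer keeps this instance in rotation as long as this returns 200.
    location = /_health {
        return 200 "OK";
    }

    # Everyone else gets a 503 and the static maintenance page.
    error_page 503 /maintenance.html;
    location / {
        return 503;
    }
    location = /maintenance.html {
        root /var/www/maintenance;
        internal;
    }
}
```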

Mar 04, 2015 - 10:17 AEDT

The database migration finished, we checked that everything was operational, and started bringing services back online. By this point, Buildkite had been down for just under 3 hours. We contacted the customer with the large build log and started working with them to fix their build problems.

What we're doing about it

There are many things we've learnt from this downtime. We've labelled each item with its current status.

  • [Done] Enabled burst-log billing in Papertrail to handle unusual spikes in log volume
  • [Done] Updated our NGINX config to reject any POST body over a certain size
  • [Done] Added new CloudWatch rules to alert us when our database cluster is running low on disk space

  • [Started] Streamline the information we output in our production logs so we can debug problems like this faster.

  • [Started] Refactor how logs are streamed to Buildkite (see the sketch after this list). This will be available in our upcoming 1.0 release of the agent.

  • [Todo] Review how long we retain build logs, consider purging logs older than 90 days, and look into an alternative storage location.

  • [Todo] Allow a maintenance page to be brought up during downtimes.

  • [Todo] Receive webhooks in a way that allows them to be captured during downtime and replayed afterwards.
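
To give an idea of what the log streaming refactor looks like, here's a rough sketch of the kind of approach: track how much of the log has already been uploaded, and only send the new bytes on each tick. This is illustrative Python rather than the agent's actual implementation, and the upload URL and field names are made up:

```python
# Sketch: stream a build log incrementally, sending only the bytes written since
# the previous upload instead of the entire log every second.
# The upload URL and payload fields are made up for illustration.
import time
import requests


def stream_log(log_path, upload_url, interval=1.0):
    offset = 0  # how many bytes of the log we've already sent
    while True:
        with open(log_path, "rb") as f:
            f.seek(offset)
            chunk = f.read()

        if chunk:
            requests.post(upload_url, data={
                "offset": offset,    # where this chunk starts in the full log
                "content": chunk,
            })
            offset += len(chunk)

        time.sleep(interval)
```

A real implementation would also cap the size of each chunk and stop once the job finishes, but even this naive version keeps each request proportional to the new output rather than to the total size of the log.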

We're working hard to ensure downtime like this doesn't happen again. If you'd like to talk more about the downtime, don't hesitate to contact me directly by emailing keith@buildkite.com.

Again, Buildkite and the team personally apologise for the downtime, and for the delay in getting this postmortem written.

Thank you for your understanding.

Posted Mar 11, 2015 - 13:30 UTC

Resolved
Our database is functional and all of our services are running again. We've started working on a postmortem and will post it as soon as it's ready.
Posted Mar 03, 2015 - 23:28 UTC
Monitoring
Our systems are coming back online and we're monitoring the situation to make sure everyone's builds are running smoothly again.
Posted Mar 03, 2015 - 23:17 UTC
Identified
We're currently experiencing issues with our database and are working hard to resolve the issues.
Posted Mar 03, 2015 - 21:39 UTC
Monitoring
We've identified the issue as a giant spike of incoming traffic. We know the source and we're working to mitigate it.
Posted Mar 03, 2015 - 20:48 UTC
Identified
We've identified that our services are currently unavailable. We're looking into this issue now.
Posted Mar 03, 2015 - 20:22 UTC