Between 2023-06-02 00:53 and 04:12 (UTC), a very limited number of customers experienced DNS resolution failures when browsing to buildkite.com and related sub-domains (api.buildkite.com, agent.buildkite.com and graphql.buildkite.com). Most customers, including agents running in AWS and GCP, were not impacted, owing to inconsistent propagation of DNS changes and the default behaviour of not verifying DNSSEC signatures.
Over several months we had been working to improve our edge capabilities, including the introduction of a Content Delivery Network (CDN) and Web Application Firewall (WAF) to provide improved speed, reliability and security. As part of this work we decided to consolidate our Domain Name System (DNS) to the same provider as our new CDN, enabling us to make the cutover incrementally and reduce the risk of future migrations.
Our preparation for the DNS cutover involved ensuring the records in both systems were identical and testing, both manually and with automated checks, that DNS records resolved correctly from the new system.
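A record comparison of this kind can be sketched as follows. This is a minimal illustration, not the tooling we used: it assumes each provider's zone has already been exported as a list of (name, type, value) tuples, and the names `load_zone` and `diff_zones` are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

RecordKey = Tuple[str, str]  # (name, type), e.g. ("api.buildkite.com.", "CNAME")
Zone = Dict[RecordKey, Set[str]]


def normalise(name: str) -> str:
    """Lower-case a DNS name and ensure a trailing dot so equal names compare equal."""
    name = name.strip().lower()
    return name if name.endswith(".") else name + "."


def load_zone(records: Iterable[Tuple[str, str, str]]) -> Zone:
    """Group (name, type, value) tuples into a mapping of record sets."""
    zone: Zone = {}
    for name, rtype, value in records:
        key = (normalise(name), rtype.strip().upper())
        zone.setdefault(key, set()).add(value.strip())
    return zone


def diff_zones(old: Zone, new: Zone):
    """Return record sets missing from `new`, extra in `new`, and changed between the two."""
    missing = {k: old[k] for k in old.keys() - new.keys()}
    extra = {k: new[k] for k in new.keys() - old.keys()}
    changed = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return missing, extra, changed
```

A comparison like this only covers records inside the zone. The DS record that anchors DNSSEC lives in the parent zone and is managed through the registrar, so it needs a separate check.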
At 00:53 we migrated our name servers to a new provider by updating our registrar. All testing by the team indicated no problems.
02:17 A customer reports that buildkite.com is failing to load in a browser, but their agents are working correctly.
02:45 We complete rolling back the migration
03:25 We investigate possible failure to disable DNSSEC on our old Zone as the underlying cause
03:50 A second customer reports DNS resolution issues
03:58 We manually flush NS, DS and DNSKEY records on Cloudflare and Google's public resolvers
04:05 We confirm our suspicion that the issue relates to misconfigured DNSSEC
04:12 We manually flush the cache at OpenDNS public resolvers
04:14 We set up DNS monitoring from multiple locations and confirm the issue is resolved
04:28 We confirm that customers running with AWS VPC defaults (including the Elastic Stacks) aren’t impacted by these changes, as they disable DNSSEC signature verification
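The resolver checks in the timeline above can be approximated with a small script that asks several public resolvers for the same name and reports the response code. This is a sketch under stated assumptions, not the monitoring we set up: the resolver list and the function names are illustrative, it queries from a single machine rather than multiple locations, and it uses only the Python standard library. A validating resolver typically answers SERVFAIL when DNSSEC validation fails, while a non-validating resolver answers NOERROR for the same name.

```python
import secrets
import socket
import struct
from typing import Dict, Optional

# Illustrative list of public resolvers (an assumption, not the monitored set).
RESOLVERS = {"Cloudflare": "1.1.1.1", "Google": "8.8.8.8", "OpenDNS": "208.67.222.222"}

RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name: str, qtype: int = 1, query_id: Optional[int] = None) -> bytes:
    """Build a minimal DNS query packet (RFC 1035); qtype 1 is an A record."""
    if query_id is None:
        query_id = secrets.randbits(16)
    # Header: ID, flags (recursion desired), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    question = labels + b"\x00" + struct.pack("!HH", qtype, 1)  # class IN
    return header + question


def parse_rcode(response: bytes) -> str:
    """Extract the response code from the low four bits of the header flags."""
    if len(response) < 12:
        raise ValueError("DNS response shorter than the 12-byte header")
    flags = struct.unpack("!H", response[2:4])[0]
    code = flags & 0x000F
    return RCODES.get(code, f"RCODE{code}")


def check(name: str, resolver_ip: str, timeout: float = 3.0) -> str:
    """Send one UDP query to a resolver and return its response code or TIMEOUT."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(query, (resolver_ip, 53))
        try:
            response, _ = sock.recvfrom(512)
        except socket.timeout:
            return "TIMEOUT"
    if response[:2] != query[:2]:
        return "ID_MISMATCH"  # reply does not belong to this query
    return parse_rcode(response)


def check_all(name: str) -> Dict[str, str]:
    """Query every resolver in RESOLVERS and map its label to the result."""
    return {label: check(name, ip) for label, ip in RESOLVERS.items()}
```

Calling `check_all("buildkite.com")` returns one result per resolver. A mix of SERVFAIL and NOERROR across resolvers for the same name is the pattern to look for when a DNSSEC misconfiguration is suspected.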
Future changes of this type will follow a standard process including:
Collating all steps in one document, such as:
Seeking out critical review from subject matter experts in the technologies involved