April 13,2023
Outage Time: 16:00 to 16:20 AEST (20mins)
What happened?
The Centry admin was down and users are having 500 errors. The team noticed everything started to go offline and could not talk to the DB, Redis instances, and even to some of other AWS services endpoint.
Root Cause
The issue was caused by the deployments for workspaces within the btccorp-prod-vpc-shared-nodes VPC in prod. It involved adding DHCP option set to the DNS resolver which caused all DNS resolution to fail
Recovery Actions
The deployments to add new DHCP option sets were reverted and the the services started to come back online.
We need to understand the assignment of dhcp option sets at the account level, this shouldn’t have interfered with the prod vpc but looked like it did, will investigate more tomorrow on this behaviour
Learnings and Remediations