API connectivity outage

Incident Report for Banxa Status Updates

Postmortem

April 13,2023
Outage Time: 16:00 to 16:20 AEST (20mins)

What happened?
The Centry admin was down and users are having 500 errors. The team noticed everything started to go offline and could not talk to the DB, Redis instances, and even to some of other AWS services endpoint.

Root Cause
The issue was caused by the deployments for workspaces within the btccorp-prod-vpc-shared-nodes VPC in prod. It involved adding DHCP option set to the DNS resolver which caused all DNS resolution to fail

Recovery Actions

The deployments to add new DHCP option sets were reverted and the the services started to come back online.
We need to understand the assignment of dhcp option sets at the account level, this shouldn’t have interfered with the prod vpc but looked like it did, will investigate more tomorrow on this behaviour

Learnings and Remediations

Team will be more careful when adding such entries.
We will also review the existing dhcp option set as the new one that was created was assigned to a totally different VPC than the prod one.

Posted Apr 14, 2023 - 10:39 AEST

Resolved

This incident has been resolved.

Posted Apr 13, 2023 - 16:48 AEST

Investigating

We are currently investigating this issue.

Posted Apr 13, 2023 - 16:21 AEST

This incident affected: API Services.