Increased response times and intermittent failures

Write-up

Problem:

The Partner API service experienced a sudden increase in response times to over 1000 ms, starting at approximately 01:03 AM AEST on May 08, 2026. All API endpoints were uniformly affected, with request latencies increasing by several seconds.

Impact:

Response times fluctuated but consistently breached the 1000 ms threshold.
Per-request durations increased across multiple API endpoints.
No errors were observed; however, significant latency degradation occurred.

Root Cause Analysis:

Elevated network latency between the Partner API service and the caching layer caused slower response times. The issue was traced to unusually large cache keys stored for the GET /v1/orders and GET /v2/orders endpoints, which were heavily requested by multiple partners. This inadvertently overloaded the Redis cache infrastructure.
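One common way to keep cache keys small is to hash the variable part of the key down to a fixed-length digest. The sketch below is illustrative only: it assumes the oversized keys came from embedding full request parameters in the key, which the analysis above does not confirm, and the function name, the "partner-api" prefix, and the parameter shapes are all hypothetical.

```python
import hashlib
import json


def compact_cache_key(endpoint: str, params: dict, prefix: str = "partner-api") -> str:
    """Build a fixed-length cache key from an endpoint and its parameters.

    Hypothetical helper: the name, prefix, and key layout are assumptions,
    not the Partner API's actual scheme.
    """
    # Serialize deterministically so equivalent requests map to the same key
    # regardless of parameter order.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    # SHA-256 yields a 64-character hex digest no matter how large params is.
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{prefix}:{endpoint}:{digest}"


if __name__ == "__main__":
    # A request with thousands of order IDs still produces a short key.
    key = compact_cache_key("GET /v1/orders", {"ids": list(range(10000))})
    print(len(key), key)
```

The trade-off is that hashed keys are no longer human-readable, so debugging usually relies on logging the original parameters alongside the digest.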

Steps Taken to Resolve:

Adjusted the caching strategy for all endpoints to reduce the load on the cache.
Applied temporary rate limiting to targeted requests to allow the application to recover.
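The write-up does not say which rate-limiting algorithm was applied. As an illustration of how temporary limiting of targeted requests might work, here is a minimal token-bucket sketch. The class name, the rate and capacity values, and the idea of one bucket per partner and endpoint are assumptions, not details from the incident.

```python
import time


class TokenBucket:
    """Minimal token-bucket limiter (illustrative; not the production limiter)."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start with a full bucket
        self.clock = clock        # injectable clock, useful for testing
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True if the request may proceed, consuming `cost` tokens."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    # Hypothetical limit: 5 requests per second with bursts of up to 10.
    bucket = TokenBucket(rate=5.0, capacity=10.0)
    allowed = sum(bucket.allow() for _ in range(50))
    print(f"{allowed} of 50 back-to-back requests allowed")
```

In practice a limiter like this would typically be keyed per partner and per endpoint, so that heavy traffic to the order endpoints is throttled without affecting other callers.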

Further Actions:

Segregating order endpoints from transaction services to better manage resources.

Investigating alternative caching solutions for further improvement.
