API Outage on Jan 14, 2022 (Retroactive)
Between 18:00 CET and 18:58 CET on January 14, 2022, all API users received 502 error responses or timeouts when calling any API endpoint. The portal was also affected; the returns portal was not.
The event was triggered by an operational issue on our primary database cluster, which caused connections from the API layer to the database to time out. Because traffic kept arriving, the API processes queued subsequent requests until their maximum thresholds were reached, at which point they could no longer serve requests. The configured autoscaling of the infrastructure did not trigger in time because, by its metrics, capacity was sufficient for the existing load and CPU load remained low.
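The failure mode described above can be illustrated with a small simulation. This is a minimal sketch, not a model of our actual infrastructure: the function names, worker count, queue limit, timeout, and CPU figures are all hypothetical values chosen for illustration. It shows how workers blocked on database timeouts fill a bounded queue while CPU utilization stays well below a typical scale-out threshold.

```python
from dataclasses import dataclass


@dataclass
class Tick:
    second: int
    queued: int      # requests waiting for a free worker
    rejected: int    # requests dropped this second (queue full)
    cpu_util: float  # rough CPU utilization, 0.0 to 1.0


def simulate(seconds, arrivals_per_s, workers, queue_limit, db_timeout_s,
             cpu_per_busy_worker=0.02):
    """Discrete-time model in which every request blocks until the DB call times out.

    All parameters are hypothetical; none are taken from the incident.
    """
    queue = 0
    busy = []  # remaining blocked seconds for each in-flight request
    history = []
    for t in range(seconds):
        # Workers whose database call has timed out become free again.
        busy = [r - 1 for r in busy if r - 1 > 0]
        # New requests join the queue; anything over the limit is rejected.
        queue += arrivals_per_s
        rejected = max(0, queue - queue_limit)
        queue -= rejected
        # Free workers pick up queued requests and block on the database.
        started = min(workers - len(busy), queue)
        busy.extend([db_timeout_s] * started)
        queue -= started
        # Blocked workers are waiting on I/O, so they consume little CPU.
        cpu = len(busy) * cpu_per_busy_worker
        history.append(Tick(t, queue, rejected, cpu))
    return history


def cpu_rule_fires(history, threshold=0.7):
    """A CPU-based scale-out rule (threshold is an assumed value)."""
    return any(tick.cpu_util > threshold for tick in history)


def queue_rule_fires(history, queue_limit, fraction=0.8):
    """An alternative rule keyed on queue depth (also an assumption)."""
    return any(tick.queued >= fraction * queue_limit for tick in history)


if __name__ == "__main__":
    hist = simulate(seconds=60, arrivals_per_s=50, workers=20,
                    queue_limit=200, db_timeout_s=30)
    print("CPU rule fires:  ", cpu_rule_fires(hist))
    print("Queue rule fires:", queue_rule_fires(hist, queue_limit=200))
```

In this sketch the queue saturates within a few seconds and requests start being rejected, yet the CPU-based rule never fires. A rule keyed on queue depth would have reacted, although adding hosts alone would not help while the database connections themselves keep timing out.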
The incident was detected at 18:01 CET, when the uptime monitors for the API services were triggered and the ops team was paged. Work on the incident began at 18:14 CET.
At 18:27 CET a deployment was retriggered, and the API partially recovered at 18:32 CET as the first hosts were rotated into load. The connection issue persisted, however, and the API fell below its threshold for healthy hosts again at 18:42 CET.
At 18:49 CET the team opted to rotate the API infrastructure completely. The API began to come back up at 18:54 CET and reached a 100% success rate at 18:58 CET.
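The timestamps in this report can be turned into the usual response intervals with a few lines of arithmetic. The sketch below uses only the times stated above; the event labels and the helper names are our own shorthand, introduced here for illustration.

```python
from datetime import datetime


def at(hhmm):
    """Parse a CET wall-clock time on the day of the incident."""
    return datetime.strptime(f"2022-01-14 {hhmm}", "%Y-%m-%d %H:%M")


# Event labels are illustrative shorthand; the times come from this report.
timeline = {
    "impact_start": at("18:00"),
    "detected": at("18:01"),
    "work_began": at("18:14"),
    "redeploy": at("18:27"),
    "partial_recovery": at("18:32"),
    "relapse": at("18:42"),
    "full_rotation": at("18:49"),
    "recovery_started": at("18:54"),
    "resolved": at("18:58"),
}


def minutes_between(start, end):
    """Whole minutes elapsed between two labeled events."""
    delta = timeline[end] - timeline[start]
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    print("Time to detect:      ", minutes_between("impact_start", "detected"), "min")
    print("Detection to work:   ", minutes_between("detected", "work_began"), "min")
    print("Rotation to resolved:", minutes_between("full_rotation", "resolved"), "min")
    print("Total impact:        ", minutes_between("impact_start", "resolved"), "min")
```

Read this way, detection took 1 minute, 13 minutes passed between the page and the start of work, and the full rotation restored service 9 minutes after it was started, for a total impact of 58 minutes.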
Lessons learned and corrective actions will be addressed in the following development sprints.