The incident began at around 14:00 with long response times for StormApi on Azure. The first alert at Norce was triggered at 14:03 and the incident response team started working on the issue. The symptoms were long response times and/or timeouts for StormApi, causing problems for customers. The problem was resolved at around 15:25.
The incident response team had indications of problems at around 13:30; these did not affect customers to a large degree, and that problem resolved itself without major disturbance.
The problem was caused by connectivity/throughput issues between StormApi and the backend cache service provided by Azure. The root cause is not yet known, but it is likely that periodic connectivity issues caused requests to stack up, consuming all available network resources on the StormApi compute nodes.
The incident team tried to relieve nodes of load to allow them to catch up with traffic, but the problem returned immediately when traffic was routed back to the nodes. To resolve the problem we scaled StormApi out to significantly larger compute resources, which allowed the initial cluster to recover. About 1.5 hours after StormApi was fully back online we gradually moved back to the normal setup.
The underlying problem is that intermittent network issues with the backend cache cause severe problems for StormApi. The development team is working on rewriting how the backend cache is used, to ensure that issues like this do not negatively impact application performance.
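A minimal sketch of the general pattern such a rewrite aims at, not the actual StormApi code: strict timeouts and a cap on in-flight cache calls, so that cache connectivity trouble costs extra latency on individual requests instead of letting calls stack up and exhaust a node's network resources. The helper names fetch_from_cache and fetch_from_origin and the concrete timeout and pool-size values are assumptions for illustration only.

```python
import concurrent.futures

# Illustrative values only; tune per service and node size.
CACHE_TIMEOUT_SECONDS = 0.2   # fail fast instead of letting requests queue up
_cache_pool = concurrent.futures.ThreadPoolExecutor(max_workers=50)  # bound concurrent cache calls per node

def get_value(key, fetch_from_cache, fetch_from_origin):
    """Read through the cache, falling back to the origin when the cache is
    slow or unreachable, so cache trouble degrades performance gracefully
    instead of taking the node down."""
    future = _cache_pool.submit(fetch_from_cache, key)
    try:
        value = future.result(timeout=CACHE_TIMEOUT_SECONDS)
    except (concurrent.futures.TimeoutError, ConnectionError):
        future.cancel()  # abandon the slow or failed cache call
        return fetch_from_origin(key)
    return value if value is not None else fetch_from_origin(key)
```

With this shape, an unresponsive cache makes affected requests roughly 200 ms slower while they fall back to the origin, rather than holding connections open until the node runs out of resources.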
Statuspage was updated much too late; the incident response team has received feedback to prioritise Statuspage updates.