The incident began at around 14:00 with long response times for StormApi on Azure. The first alert at Norce was triggered at 14:03 and the incident response team started working on the issue. The symptoms were long response times and/or timeouts for StormApi, causing problems for customers. The problem was resolved at around 15:25.
The incident response team had indications of problems at around 13:30; these did not affect customers to a large degree, and that problem resolved itself without major disturbance.
The problem was caused by connectivity/throughput issues between StormApi and the backend cache service provided by Azure. The root cause is not yet known, but it is likely that periodic connectivity issues caused requests to stack up, consuming all available network resources on the StormApi compute nodes.
The incident team tried to relieve nodes of load to allow them to catch up with traffic, but the problem returned immediately when traffic was routed back to the nodes. To resolve the problem we scaled StormApi out to significantly larger compute resources, which allowed the initial cluster to recover. About 1.5 hours after StormApi was fully back online we gradually moved back to the normal setup.
The underlying problem is that intermittent network issues with the backend cache cause severe problems for StormApi. The development team is working on rewriting how the backend cache is used, to ensure that issues like this do not negatively impact application performance.
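A minimal sketch of the general pattern such a rewrite aims at, not the actual StormApi code: strict timeouts and a cap on in-flight cache calls, so that cache connectivity trouble costs extra latency on individual requests instead of letting calls stack up and exhaust a node's network resources. The helper names fetch_from_cache and fetch_from_origin and the concrete timeout and pool-size values are assumptions for illustration only.

```python
import concurrent.futures

# Illustrative values only; tune per service and node size.
CACHE_TIMEOUT_SECONDS = 0.2   # fail fast instead of letting requests queue up
_cache_pool = concurrent.futures.ThreadPoolExecutor(max_workers=50)  # bound concurrent cache calls per node

def get_value(key, fetch_from_cache, fetch_from_origin):
    """Read through the cache, falling back to the origin when the cache is
    slow or unreachable, so cache trouble degrades performance gracefully
    instead of taking the node down."""
    future = _cache_pool.submit(fetch_from_cache, key)
    try:
        value = future.result(timeout=CACHE_TIMEOUT_SECONDS)
    except (concurrent.futures.TimeoutError, ConnectionError):
        future.cancel()  # abandon the slow or failed cache call
        return fetch_from_origin(key)
    return value if value is not None else fetch_from_origin(key)
```

With this shape, an unresponsive cache makes affected requests roughly 200 ms slower while they fall back to the origin, rather than holding connections open until the node runs out of resources.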
Statuspage was updated much too late; the incident response team has received feedback to prioritise Statuspage updates.