We're experiencing StormApi issues
Incident Report for Storm Commerce
Postmortem

Incident: Norce Storm, Monday, Feb 5th, 2024

On the evening of Monday the 5th, at ca 19:30 CET, StormApi in particular started to suffer from degraded performance. At ca 19:33 our monitoring systems alerted us to the problem.

The symptoms of the problem were slow response times and, to an extent, timeouts or errors.

The Norce incident team started investigating the problem at 19:37 and, after a brief impact analysis, the Norce [Storm] Statuspage was updated at 19:53 and subscribers were notified.

The problem continued throughout the evening with varying availability and response times until ca 23:00, when services were back to normal.

Root cause

The root cause of the problem was a high load on certain endpoints that require extensive use of Azure resources (depending on the setup in Norce Storm). The load accumulated until requests "stacked up" and the compute nodes ran out of sockets/resources. Performance then dropped, and certain requests timed out or received an error. We also detected degraded network performance between the Iver site and Azure, which escalated the problem.
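The "stacking up" described above is a common failure mode when slow downstream calls are allowed to queue without a bound. As a minimal sketch only, and not a description of how StormApi is implemented, the following shows one general technique for avoiding it: capping the number of in-flight calls and rejecting the excess immediately instead of letting it wait. The class name `LoadShedder` and the parameter `max_in_flight` are hypothetical names introduced for illustration.

```python
import threading


class LoadShedder:
    """Admit at most `max_in_flight` concurrent calls; reject the rest at once.

    Hypothetical helper for illustration, not part of Norce Storm.
    """

    def __init__(self, max_in_flight: int) -> None:
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_run(self, handler):
        # Non-blocking acquire: if no slot is free, fail fast instead of
        # queueing, so waiting requests cannot pile up and hold sockets.
        if not self._slots.acquire(blocking=False):
            return ("rejected", None)
        try:
            return ("ok", handler())
        finally:
            # Always free the slot, even if the handler raises.
            self._slots.release()
```

A caller would typically translate a `"rejected"` result into a quick error response, which keeps resources on the compute node free for requests that can still be served.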

Actions during the incident

To handle the problem while investigating the root cause, we scaled up the compute nodes for StormApi by a factor of 2. That helped at least marginally, but did not provide a satisfactory service level.

At ca 22:40 we were able to pinpoint the requests that caused the majority of the problem and block that type of request, which brought StormApi back to normal service levels.

At ca 23:30 the network was back to normal and could likely handle the load, but we decided to continue blocking the requests to ensure stability.

Future actions

The development team has released a fix to make the endpoint less Azure resource intensive. We are also changing the behaviour of Norce Storm for the endpoint to ensure that processing is limited regardless of the configuration in Norce Storm.
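One possible reading of "limited regardless of the configuration" is a hard upper bound that a per-client setting can lower but never raise. The sketch below illustrates that idea under this assumption; the function `effective_limit`, the constant `HARD_CAP` and its value are hypothetical and do not reflect the actual fix.

```python
from typing import Optional

# Hypothetical system-wide ceiling; the real value, if any, is not stated
# in this report.
HARD_CAP = 50


def effective_limit(configured: Optional[int], hard_cap: int = HARD_CAP) -> int:
    """Return the processing limit to apply for one request.

    The configured value may lower the limit, but it can never exceed
    `hard_cap`. A missing or non-positive setting falls back to the cap.
    """
    if configured is None or configured <= 0:
        return hard_cap
    return min(configured, hard_cap)
```

With this shape, a setup that asks for more than the cap is served at the cap, so a single configuration cannot make the endpoint arbitrarily expensive.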

Downtime

After reviewing the availability of the service, we found that StormApi was available during large parts of the incident period. Storm received orders and successfully served requests during the period. However, due to the intermittent behaviour and the number of failed requests, we will consider Norce StormApi to have had a "major outage" for 2h 43m during the period between 19:33 and 23:00.
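To put the figures above in relation to each other, the short calculation below compares the stated outage duration with the length of the 19:33 to 23:00 window. It uses only the times given in this report; how the 2h 43m figure itself was derived is not stated here, so the sketch takes it as given.

```python
from datetime import datetime, timedelta

# Incident window as stated in the report (CET, Feb 5th 2024).
window_start = datetime(2024, 2, 5, 19, 33)
window_end = datetime(2024, 2, 5, 23, 0)
window = window_end - window_start          # 3h 27m

# Duration classified as "major outage" in the report.
outage = timedelta(hours=2, minutes=43)

# Share of the window counted as outage, and the remainder.
outage_share = outage / window              # roughly 0.79
not_counted = window - outage               # 44 minutes

print(f"window: {window}, outage share: {outage_share:.0%}, remainder: {not_counted}")
```

In other words, about 79% of the incident window is counted as major outage, and 44 minutes of it are not.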

Posted Feb 09, 2024 - 08:15 CET

Resolved
The problem has been resolved and we're back to normal operation. The cause of the problem was a high load on StormApi which, in combination with a degraded network connection between Iver and Azure services, caused requests to stack up, slow down and eventually hang. We have worked throughout the evening to reduce the traffic to Azure resources to minimize the impact of events like this.

During the coming days we will post a Post Mortem/Incident Report here on the status page detailing our future actions to prevent a problem like this.
Posted Feb 05, 2024 - 23:40 CET
Update
We continue to restore services. We still have instability and degraded performance, but online purchase processing should be restored. We still experience slightly degraded performance and a degree of failed requests.
Posted Feb 05, 2024 - 22:56 CET
Identified
We continue to work on the issue but have partially restored the services.
Posted Feb 05, 2024 - 22:30 CET
Update
We continue to investigate the issue. We are experiencing network issues between Storm/StormAPI and Azure resources, which, in combination with a high load, are causing significant problems. The incident team is implementing changes to mitigate the problem.
Posted Feb 05, 2024 - 21:29 CET
Update
The load is causing an almost full outage. We continue to work on the issue.
Posted Feb 05, 2024 - 20:13 CET
Update
We are experiencing very high load on StormApi and a high degree of outage. We continue to investigate.
Posted Feb 05, 2024 - 20:05 CET
Investigating
We are currently investigating the issue.
Posted Feb 05, 2024 - 19:53 CET
This incident affected: Storm API.