On the evening of Monday the 5th, at ca 19:30 CET, StormApi in particular started to suffer from degraded performance. At ca 19:33 our monitoring systems alerted us to the problem.
The symptoms of the problem were slow response times and, to some extent, timeouts or errors.
The Norce incident team started investigating the problem at 19:37 and, after a brief impact analysis, the Norce [Storm] Statuspage was updated at 19:53 and subscribers were notified.
The problem continued throughout the evening, with varying availability and response times, until ca 23:00, when services were back to normal.
The root cause of the problem was high load on certain endpoints that require extensive use of Azure resources (depending on the setup in Norce Storm). The load accumulated until requests "stacked up" and the compute nodes ran out of sockets and other resources. As a result, performance dropped and certain requests timed out or received an error. We also detected degraded network performance between the Iver site and Azure, which escalated the problem.
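To illustrate the "stacking up" effect in general terms, the sketch below shows one common way to keep in-flight work bounded: rejecting requests once a fixed number are already being processed, rather than letting them queue and hold sockets. This is a minimal, generic example and not how StormApi is implemented; the class name `LoadShedder` and the parameter `max_in_flight` are hypothetical.

```python
import threading


class LoadShedder:
    """Hypothetical helper: cap the number of requests processed at once."""

    def __init__(self, max_in_flight: int) -> None:
        # One semaphore slot per request allowed to run concurrently.
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_handle(self, handler):
        # Non-blocking acquire: if every slot is taken, reject immediately
        # instead of waiting and tying up a socket while the backlog grows.
        if not self._slots.acquire(blocking=False):
            return ("rejected", None)
        try:
            return ("ok", handler())
        finally:
            # Always free the slot, even if the handler raised.
            self._slots.release()
```

With a cap in place, a slow downstream dependency results in fast rejections for the excess requests instead of an ever-growing backlog on the compute nodes.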
To handle the problem while investigating the root cause, we scaled up the compute nodes for StormApi by a factor of 2. This helped marginally, but did not provide a satisfactory service level.
At ca 22:40 we pinpointed the requests that caused the majority of the problem and blocked that type of request, which brought StormApi back to normal service levels.
At ca 23:30 the network was back to normal and could likely handle the load, but we decided to continue blocking the requests to ensure stability.
The development team has released a fix to make these requests less Azure resource intensive. We are also changing the behaviour of Norce Storm for the endpoint to ensure that processing is limited regardless of the configuration in Norce Storm.
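As a rough illustration of limiting processing regardless of configuration, the sketch below clamps a configured value to a fixed ceiling. It is an assumption-laden example, not the actual fix: the constant `HARD_MAX_ITEMS`, its value, and the function `effective_limit` are all hypothetical.

```python
from typing import Optional

# Hypothetical platform-wide ceiling; the real limit is not stated in this report.
HARD_MAX_ITEMS = 100


def effective_limit(configured_limit: Optional[int]) -> int:
    """Return the limit to apply, never exceeding the hard ceiling."""
    # A missing or non-positive setting falls back to the ceiling
    # rather than meaning "unlimited".
    if configured_limit is None or configured_limit <= 0:
        return HARD_MAX_ITEMS
    # A configured value can lower the limit but can never raise it.
    return min(configured_limit, HARD_MAX_ITEMS)
```

The point of this design is that a client setup can only make processing cheaper, never more expensive than the platform is prepared to handle.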
After reviewing the availability of the service, we found that StormApi was available during large parts of the incident period. Storm received orders and successfully served requests during the period. However, due to the intermittent behaviour and the number of failed requests, we will consider Norce StormApi to have had a "major outage" of 2h 43m during the period between 19:33 and 23:00.
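For reference, the figures above can be put in proportion with some simple arithmetic. The snippet uses only the times stated in this report; the variable names are ours, and the calculation says nothing about how the 2h 43m figure itself was derived.

```python
from datetime import timedelta

# Times of day from the report, expressed as offsets from midnight.
window_start = timedelta(hours=19, minutes=33)
window_end = timedelta(hours=23)
major_outage = timedelta(hours=2, minutes=43)

one_minute = timedelta(minutes=1)
window_minutes = (window_end - window_start) // one_minute  # 207
outage_minutes = major_outage // one_minute                 # 163

# Share of the incident window counted as "major outage".
outage_share = outage_minutes / window_minutes

print(f"Incident window: {window_minutes} min")
print(f"Major outage:    {outage_minutes} min ({outage_share:.1%} of the window)")
```

In other words, 163 of the 207 minutes between 19:33 and 23:00 are counted as major outage, leaving 44 minutes that are not.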