Major outage in Storm Frontend API

Incident Report for Storm Commerce

Postmortem

The StormAPI incident was first registered at 03:24, and all systems were available again at 10:29 (7h 5m of downtime). Our sourcing vendor started error tracing and carried out the restart instructions immediately, Storm began its own analysis early in the morning, and together we determined the root cause at approximately 10:20.

The root cause was a Client Certificate Revocation List that had gone stale. The Client Certificate Revocation List is a list of client certificates (used to access StormAPI) that have been revoked, which happens when access to a Storm Application is removed.

Because the revocation list had become stale, all requests using client certificates (i.e. all requests to StormAPI) were denied. Updating the revocation list is a manual procedure performed when access to an Application is removed. Since no certificates needed revocation, the list was not updated before its expiry time.
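The failure mode described above can be illustrated with a minimal Python sketch. It is not the StormAPI implementation: the `RevocationList` class, the `is_request_allowed` function and the fail-closed rule are assumptions made for illustration, and the dates in the usage note are arbitrary.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class RevocationList:
    """Hypothetical in-memory stand-in for a revocation list (not a real CRL parser)."""
    this_update: datetime              # when the list was published
    next_update: datetime              # after this instant the list is stale
    revoked_serials: set = field(default_factory=set)


def is_request_allowed(serial: int, crl: RevocationList, now: datetime) -> bool:
    # Assumed fail-closed behaviour: a stale list cannot prove that a
    # certificate is still valid, so every client certificate is rejected.
    if now > crl.next_update:
        return False
    # With a fresh list, only explicitly revoked certificates are rejected.
    return serial not in crl.revoked_serials
```

For example, with a list published on an arbitrary date and valid for 30 days, a non-revoked certificate is accepted on day 1 but rejected on day 31, even though nothing about the certificate itself has changed.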

To prevent this problem from recurring, we will automate publishing of the certificate revocation list. We have also already been working on a new authentication mechanism based on modern standards rather than Client Certificates, which we estimate will be available during Q2.
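One possible shape for the automated publishing is a scheduled job that republishes the list whenever it approaches expiry, whether or not any certificate has been revoked. The Python sketch below rests on assumptions: `needs_republish`, `republish_if_due`, the `publish` callback, the validity period and the safety margin are all hypothetical names and parameters, not the actual Storm tooling.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable


def needs_republish(next_update: datetime, now: datetime,
                    safety_margin: timedelta) -> bool:
    """True once 'now' is inside the safety margin before the list expires."""
    return now >= next_update - safety_margin


def republish_if_due(next_update: datetime, now: datetime,
                     publish: Callable[[datetime], None],
                     validity: timedelta,
                     safety_margin: timedelta) -> datetime:
    """Republish the list if it is close to expiry; return the expiry now in effect."""
    if not needs_republish(next_update, now, safety_margin):
        return next_update            # still fresh, nothing to do
    new_next_update = now + validity
    publish(new_next_update)          # hypothetical hook that signs and uploads the list
    return new_next_update
```

Run on a schedule that is shorter than the safety margin, a job like this would keep the list fresh even during long periods with no revocations, which is the situation that caused this incident.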

This problem is not related to the incident last weekend.

Regards

./Anders Heintz, CTO

Posted Feb 22, 2020 - 12:42 CET

Resolved

The incident is resolved. We will be back with a postmortem.
Posted Feb 22, 2020 - 10:32 CET

Update

The team is working with our infrastructure partner to find the root cause, ruling out potential causes according to our standard protocol.
Posted Feb 22, 2020 - 10:02 CET

Investigating

We are experiencing a major outage in the Storm Frontend API in the production environment.

We are investigating the issue.
Posted Feb 22, 2020 - 08:15 CET
This incident affected: Storm API.