Atlas DSS serving HTTP 503 errors on identity creation calls
Incident Report for GMO GlobalSign
Postmortem

Background Details

At approximately 14:00 UTC on Wednesday 28th October 2020, long certificate issuance queues caused DSS identity requests (which are issued synchronously) to time out after 6 seconds, returning an HTTP 503 error to customers.

After an initial investigation, Infrastructure engineers parked the remaining queued certificates at 15:20, allowing partial recovery of the service.

Full service was restored at about 16:10 UTC.

Timeline (all times in UTC)

Wednesday 28th October

14:00           UK Infrastructure Team receive a number of alerts for failing OCSP certificate renewals.
14:00 - 14:29   Investigations show that the failures relate to CAs which are no longer in service. A configuration rollout to OCSP is performed and the first OCSP services are restarted. They are unable to renew their certificates successfully.
14:29           DSS error rate moving averages are seen to be increasing. Infrastructure Management escalate a possible issue to Support.
14:30 - 15:30   Issue identified as being related to queued requests from OCSP servers. The active OCSP pool is heavily reduced to minimise issuance requests while still serving inbound OCSP requests.
15:30           Infrastructure Engineers confirm that the queue consists of OCSP signer requests and DSS identity requests which have already timed out. Queued certificates in a third of the queues are parked, allowing partial recovery of the service. OCSP servers are brought online slowly.
16:00           Due to unsatisfactory recovery speed, a further third of the queues are parked, vastly reducing timeouts and recovering the service to a normal state by 16:15.
16:20           All certificates in the “parked” queues are re-submitted for issuance to ensure every requested certificate has been issued.
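
The parking and re-submission steps in the timeline can be pictured with a minimal sketch. This is not the Atlas implementation; the function names and the use of in-memory queues are assumptions made purely for illustration. It shows requests being moved out of an active queue so that new requests are served promptly, then being re-submitted later so that nothing is lost.

```python
from collections import deque


def park(active_queue, parked):
    """Move every waiting request out of the active queue, preserving order.

    Hypothetical helper: new requests arriving after this call no longer
    wait behind the old backlog.
    """
    while active_queue:
        parked.append(active_queue.popleft())


def resubmit(parked, active_queue):
    """Put parked requests back on the active queue, oldest first."""
    while parked:
        active_queue.append(parked.popleft())


active = deque(["ocsp-signer-1", "identity-2"])
holding = deque()

park(active, holding)          # backlog cleared; fresh requests are served first
active.append("identity-3")    # a new request arrives after parking
resubmit(holding, active)      # later, the parked work is re-submitted
```

The trade-off in this sketch is that parked requests are delayed rather than dropped, which matches the report's statement that every requested certificate was eventually issued.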

Root Cause Analysis

On 27th October, the key material for a number of CA certificates was destroyed in a planned Key Ceremony. The issuance configuration profiles linked to these CAs were marked as disabled as part of this process.

At approximately 14:00 UTC, the Atlas OCSP services started renewing their signing certificates. This is normal behaviour, but the renewals included a number of certificate requests for 6 of the ICAs whose key material had been destroyed the previous day.

Unfortunately, due to a problem with the issuer management API, the frontend (DSS/HVCA) API servers had not been notified that the profiles for these CAs had been disabled. This meant that they continued to forward requests for these ICAs to the issuance queues.

There is a feature of the Atlas certificate signing service which causes it to “back off” if it experiences a high number of failures in a short amount of time. This is to prevent a server that (for instance) is experiencing hardware issues from causing problems for customers when other instances are available to provide service.
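
As a rough illustration of this kind of back-off, the sketch below tracks recent failures in a sliding window and stops accepting work for a fixed period once a threshold is reached. The class name, thresholds, and durations are all assumptions chosen for the example; the report does not describe how the Atlas signing service implements this.

```python
import time
from collections import deque


class BackOffGuard:
    """Hypothetical failure tracker for a single signing server."""

    def __init__(self, max_failures=5, window_seconds=10.0,
                 backoff_seconds=30.0, clock=time.monotonic):
        # All three thresholds are illustrative values, not Atlas settings.
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self.backoff_seconds = backoff_seconds
        self.clock = clock
        self._failures = deque()  # timestamps of recent failures
        self._resume_at = 0.0     # the server takes no work before this time

    def record_failure(self):
        now = self.clock()
        self._failures.append(now)
        # Discard failures that have aged out of the sliding window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        # Too many failures in a short time: stop taking work for a while.
        if len(self._failures) >= self.max_failures:
            self._resume_at = now + self.backoff_seconds
            self._failures.clear()

    def accepting_work(self):
        return self.clock() >= self._resume_at
```

A guard like this protects customers when one server is unhealthy and others can take over. If every server sees the same unissuable requests, however, every guard trips at about the same time and no healthy instance is left to serve the queue.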

Due to the high number of certificate requests that couldn’t be issued, all issuance servers backed off, heavily reducing issuance throughput and causing a queue to build up. This queue was further inflated when the Infrastructure team moved to resolve the issue by removing the retired CAs from the OCSP configuration. This caused an immediate inflow of certificate requests. Queue time increased to the point where OCSP certificate requests for valid CAs were timing out and being re-requested, continuing to flood the issuance queues.
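
The relationship between backlog, throughput, and the 6 second limit can be shown with simple arithmetic. The sketch below is illustrative only: the function names and the figures in the usage lines are invented for the example and are not measurements from the incident.

```python
TIMEOUT_SECONDS = 6.0  # the timeout stated in this report


def estimated_wait_seconds(backlog, requests_per_second):
    """Approximate time a new request waits behind `backlog` queued requests."""
    if requests_per_second <= 0:
        return float("inf")  # nothing is being issued, so the wait is unbounded
    return backlog / requests_per_second


def will_time_out(backlog, requests_per_second, timeout=TIMEOUT_SECONDS):
    """True if the estimated queue wait reaches the timeout."""
    return estimated_wait_seconds(backlog, requests_per_second) >= timeout


# Illustrative numbers only: the same backlog is harmless at full throughput
# but exceeds the timeout once throughput drops after a back-off.
healthy = will_time_out(backlog=100, requests_per_second=50)
degraded = will_time_out(backlog=100, requests_per_second=10)
```

Requests that time out and are re-requested add to the backlog while throughput stays low, which is why the queue kept growing instead of draining.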

DSS identity issuance is presented to the client as a synchronous API call, but the issuance itself is an asynchronous process. When the latency for issuing a certificate hit 6 seconds due to time spent in the queue, an internal timeout caused identity creation calls (POST /identity) to fail with an HTTP 503 error.
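
The pattern described here, a synchronous call wrapped around asynchronous work with a fixed timeout, might look roughly like the following. This is a sketch under stated assumptions: `create_identity`, `submit_to_queue`, the response bodies, and the 200 success status are hypothetical and are not taken from the DSS API. Only the 6 second timeout and the 503 response come from this report.

```python
from concurrent.futures import Future, TimeoutError as FutureTimeout

ISSUANCE_TIMEOUT_SECONDS = 6.0  # the internal timeout stated in this report


def create_identity(submit_to_queue, request, timeout=ISSUANCE_TIMEOUT_SECONDS):
    """Hypothetical handler for POST /identity. Returns (status, body).

    `submit_to_queue` is an assumed callable that enqueues the request and
    immediately returns a Future that completes once the certificate exists.
    """
    future = submit_to_queue(request)
    try:
        # Block the caller until issuance finishes or the timeout elapses.
        certificate = future.result(timeout=timeout)
    except FutureTimeout:
        # Time spent waiting in the queue counts against the timeout.
        return 503, {"error": "issuance timed out"}
    return 200, {"certificate": certificate}  # success status is a placeholder
```

Note that in this sketch a timed-out request is still sitting in the queue after the caller has received its 503, which is consistent with the queue containing identity requests that had already timed out.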

Note that automatic OCSP certificate renewal takes place far in advance of the expiry of the current certificates, so coupled with careful management by the Infrastructure Team, OCSP responders were never affected by the issue.

Preventative Measures

The Atlas Development Team have analysed the back-off code in the issuance service and have examined the possible events that could trigger this code path. Changes are being made to prevent disabled/absent CAs from triggering a back-off event.
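
One way to picture this change is to classify each failure before it is counted towards a back-off. The sketch below is an assumption about the general approach, not the actual Atlas change; the enum, function names, and issuer identifiers are hypothetical.

```python
from enum import Enum


class FailureKind(Enum):
    # The CA is disabled or absent: a property of the request, not the server.
    ISSUER_UNAVAILABLE = "issuer_unavailable"
    # Anything else, e.g. hardware trouble: a property of this server.
    SERVER_FAULT = "server_fault"


def classify_failure(issuer_id, active_issuers):
    """Decide whether a failed request says anything about server health."""
    if issuer_id not in active_issuers:
        return FailureKind.ISSUER_UNAVAILABLE
    return FailureKind.SERVER_FAULT


def count_backoff_failures(failed_issuer_ids, active_issuers):
    """Count only the failures that should contribute to a back-off."""
    return sum(
        1
        for issuer_id in failed_issuer_ids
        if classify_failure(issuer_id, active_issuers) is FailureKind.SERVER_FAULT
    )


# Six failures for a retired CA and one for a live CA: only one is counted.
active_issuers = {"ica-live"}
counted = count_backoff_failures(["ica-retired"] * 6 + ["ica-live"], active_issuers)
```

With this separation, a burst of requests for retired CAs would be rejected without reducing issuance throughput for valid CAs.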

The team are also improving the functionality of the issuer management API to ensure that updates for profiles for CAs which no longer exist are pushed to the front-end API servers.

In the meantime, the Infrastructure and Compliance teams have evaluated the procedures used during the destruction of CAs. Destroying CAs is an unprecedented requirement, but we have identified measures to safeguard the service while the above mitigations are being developed.

Posted Nov 08, 2020 - 10:04 UTC

Resolved
At approximately 14:00 UTC on Wednesday 28th October 2020, high queues for certificate issuance caused DSS identities (which are issued synchronously) to time out after 6 seconds, returning an HTTP 503 error to customers. After an initial investigation, Infrastructure engineers parked remaining queued certificates at 15:20, allowing partial recovery of the service. Full service was restored at about 16:10 UTC.
Posted Oct 28, 2020 - 14:05 UTC