Friday 14th October 2022

Deployments: Deployment issues

Due to the Pulsar incident, some deployments may fail intermittently.

Some hypervisors are behaving abnormally. We are monitoring and fixing them.

EDIT 10:20:00 UTC: Deployments are currently unavailable while we work around the issue.

EDIT 11:31:00 UTC: The deployment issues are fixed. We continue to monitor the situation. If you have trouble redeploying an application, please contact our support team.

POSTMORTEM: The Pulsar outage that started around 04:30 UTC (see https://www.clevercloudstatus.com/incident/574) interfered with:

  • the deployment process, breaking some notifications at 09:30 UTC.
  • the uptime of some persistent VMs, such as databases (see https://www.clevercloudstatus.com/incident/576), which caused the monitoring to trigger deployments.

The Pulsar notification system is being gradually rolled out across our infrastructure, having passed the tests in our preproduction zone. We do have a fallback method for notifications. However, during this incident the Pulsar notifications did not fail cleanly: they timed out only after a long delay, which prevented the fallback from triggering. We stopped all deployments at 10:20 UTC, then quickly added an emergency flag that prevents the hypervisors from using Pulsar for notifications. With this flag set, we bypass Pulsar and go straight to the fallback method.
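The interaction between the fallback and the emergency flag can be sketched as follows. This is a minimal illustration in Python, not our actual code: `notify`, `primary`, `fallback`, and `bypass_primary` are hypothetical names chosen for the example.

```python
from typing import Callable


def notify(
    event: str,
    primary: Callable[[str], None],   # hypothetical: the Pulsar-based sender
    fallback: Callable[[str], None],  # hypothetical: the fallback sender
    bypass_primary: bool,             # hypothetical: the emergency flag
) -> str:
    """Send a notification and return the name of the path that was used."""
    if bypass_primary:
        # Emergency flag set: skip the primary path entirely.
        fallback(event)
        return "fallback"
    try:
        # If this call hangs instead of raising, the except branch below is
        # never reached, so the fallback cannot trigger.
        primary(event)
        return "primary"
    except Exception:
        # A clean failure is the only case where the fallback takes over.
        fallback(event)
        return "fallback"
```

In this sketch the fallback only runs when the primary path raises an error, so a call that stalls for a long time blocks the whole notification. The flag avoids the problem by never entering the primary path.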

To prevent this issue from recurring, we are working on the following:

  • monitoring the Pulsar logs so that we detect problems before they impact the rest of production.
  • mitigating the long timeout on the notification actors, allowing for a quicker fallback.
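
One possible way to get a quicker fallback is to put a short deadline on the primary notification call. The sketch below is in Python and only illustrates the idea; `notify_with_deadline`, `primary`, `fallback`, and the 2-second default are assumptions made for the example, not a description of the notification actors themselves.

```python
import concurrent.futures as cf
from typing import Callable


def notify_with_deadline(
    event: str,
    primary: Callable[[str], None],   # hypothetical: the Pulsar-based sender
    fallback: Callable[[str], None],  # hypothetical: the fallback sender
    timeout_s: float = 2.0,           # assumed value, to be tuned
) -> str:
    """Try the primary path for at most timeout_s seconds, then fall back."""
    pool = cf.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(primary, event)
    try:
        # Wait for the primary call, but never longer than the deadline.
        future.result(timeout=timeout_s)
        return "primary"
    except cf.TimeoutError:
        # The primary call is stalled: stop waiting and use the fallback.
        fallback(event)
        return "fallback"
    except Exception:
        # The primary call failed cleanly: use the fallback as well.
        fallback(event)
        return "fallback"
    finally:
        # Do not block on the worker thread if the primary call is stuck.
        pool.shutdown(wait=False)
```

Note that in this sketch a timed-out primary call keeps running in its background thread; a real system would also need a way to cancel or clean up that call.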