Some systems are experiencing issues

Past Incidents

Wednesday 19th October 2022

Deployments Deployments slowness issue

We are observing slow deployment times and are investigating the cause.

EDIT 18:10 UTC: The issue has been identified and actions to resolve it have been performed.

Tuesday 18th October 2022

No incidents reported

Monday 17th October 2022

No incidents reported

Sunday 16th October 2022

No incidents reported

Saturday 15th October 2022

No incidents reported

Friday 14th October 2022

Deployments Deployments issues

Due to the Pulsar incident, some deployments may fail intermittently.

Some hypervisors are behaving strangely. We are watching and fixing them.

EDIT 10:20:00 UTC: Deployments are currently unavailable while we work around the issue.

EDIT 11:31:00 UTC: Deployment issues are fixed. We continue to monitor the situation. If you have trouble redeploying an application, please contact our support.

POSTMORTEM: The Pulsar outage that started around 04:30 UTC (see https://www.clevercloudstatus.com/incident/574) got in the way of:

  • the deployment process, breaking some notifications at 09:30 UTC.
  • the uptime of some persistent VMs such as databases (see https://www.clevercloudstatus.com/incident/576), which caused the monitoring to trigger deployments.

The Pulsar notification system is being gradually deployed on our infrastructure, having passed the tests in our preproduction zone. We do have a fallback method for notifications. However, the Pulsar notifications did not fail cleanly: they timed out only after a long delay, which prevented the fallback from triggering. We stopped all deployments at 10:20 UTC, then quickly added an emergency flag that prevents the hypervisors from using Pulsar for notifications. With this flag, we can bypass Pulsar and go straight to the fallback method.

To avoid this issue, we are working on the following:

  • monitoring the Pulsar logs so that problems are detected before they impact the rest of production.
  • mitigating the long timeout on the notification actors, allowing for a quicker fallback.
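
The fallback behaviour described in this postmortem can be illustrated with a minimal Python sketch. It is not our actual implementation: the function names, the timeout value and the `bypass_primary` flag are all hypothetical. It shows a primary notification call bounded by a short timeout, so that a hanging primary still lets the fallback run, plus a flag that skips the primary entirely.

```python
import concurrent.futures
import time

# All names and values below are hypothetical, chosen for illustration only.
PRIMARY_TIMEOUT_SECONDS = 0.2

_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def notify(message, primary, fallback, bypass_primary=False,
           timeout=PRIMARY_TIMEOUT_SECONDS):
    """Send `message` via `primary`; use `fallback` if the primary fails,
    takes longer than `timeout` seconds, or is bypassed by the flag."""
    if bypass_primary:
        # Emergency flag: skip the primary channel entirely.
        fallback(message)
        return "fallback"
    future = _executor.submit(primary, message)
    try:
        future.result(timeout=timeout)
        return "primary"
    except Exception:
        # Covers both a timeout while waiting and an error raised by the
        # primary. Note that a timed-out call keeps running in its worker
        # thread; cancel() only helps if the call had not started yet.
        future.cancel()
        fallback(message)
        return "fallback"


delivered = []


def hanging_primary(message):
    time.sleep(0.5)  # stands in for a connection that does not answer


def working_primary(message):
    delivered.append(("primary", message))


def fallback_send(message):
    delivered.append(("fallback", message))
```

With these definitions, `notify("deploy-1", hanging_primary, fallback_send, timeout=0.05)` returns after roughly the timeout and delivers through the fallback. The sketch bounds the wait but does not reclaim the stuck worker, so a real system would also need to limit how many such calls can pile up.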

Pulsar Pulsar add-ons issues

The Pulsar cluster hosting the Pulsar add-ons is experiencing issues. We are investigating.

POSTMORTEM (all times are UTC):

Around 04:30: Timeouts in inter-node connections started to show up in the logs. They did not trigger alerts in the monitoring.

Around 05:00: We started getting issues in our infrastructure from software using that cluster.

11:30: We disabled the brokers to analyze the issue.

14:42: The incident is now resolved. If you still encounter any problems, please contact our support.
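
The postmortem above notes that timeouts appeared in the logs without triggering any alert. As a minimal sketch of one way to close that gap, the Python class below counts log lines containing a marker string within a sliding time window and signals when a threshold is reached. The class name, the `"timeout"` marker, the threshold and the log line format are assumptions for illustration; they do not reflect Pulsar's actual log format or our monitoring stack.

```python
from collections import deque


class TimeoutRateAlert:
    """Signal an alert when at least `threshold` log lines containing
    `marker` are seen within the last `window_seconds` seconds."""

    def __init__(self, threshold, window_seconds, marker="timeout"):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.marker = marker  # hypothetical marker, matched case-insensitively
        self._hits = deque()  # timestamps of matching lines, oldest first

    def observe(self, timestamp, line):
        """Feed one log line with its timestamp (in seconds).
        Return True if the alert condition currently holds."""
        if self.marker in line.lower():
            self._hits.append(timestamp)
        # Drop matches that have fallen out of the sliding window.
        while self._hits and self._hits[0] <= timestamp - self.window_seconds:
            self._hits.popleft()
        return len(self._hits) >= self.threshold
```

For example, with `TimeoutRateAlert(threshold=3, window_seconds=60)`, three matching lines within one minute make `observe` return True, while isolated timeouts spread over a longer period do not.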

Infrastructure [RETROACTIVE] [PAR] Some database instances went down.

At 04:30 UTC: a Pulsar cluster started to behave strangely (see https://www.clevercloudstatus.com/incident/574).

At 05:30 UTC: on PAR, notification services on the hypervisors tried to send messages in a loop, filling the system with stuck processes.

At 07:00 UTC: the OS of these hypervisors started killing processes to free up resources. This impacted some applications and databases. We started shutting down the stuck processes and restarting the broken instances.

At 10:00 UTC: we finished restarting all the broken instances.

Thursday 13th October 2022

No incidents reported