Past Incidents

Sunday 12th March 2023

MongoDB shared cluster Free MongoDB cluster on PAR unreachable

(All times in UTC)

16:30 We started seeing alerts about high load on the primary node.

17:00 We started receiving reports that the cluster was unreachable.

18:00 After checking the cluster, we decided to restart the primary node.

Data may have been lost as the node was not writing / replicating correctly. We are still waiting for the primary node to restart. The secondary does not seem to elect itself as primary.

19:30 The secondary finally got promoted to primary. We are blocking users who are making unfair use of the cluster.

22:45 We detected that the node we restarted failed to rejoin the cluster. We decided to remove it entirely and re-create that node from scratch.

2023-03-13 10:00 The node has fully reached the "SECONDARY" state. We put it back into production.

Measures have been taken to prevent unfair use of the cluster in the future.

Saturday 11th March 2023

API Main API is down

(All times in UTC)

11:30 Our main API intermittently stops responding. We are investigating. This affects the following, on an irregular basis:

clever ssh may not succeed
Some deployments may not go through

Applications should keep running, but some monitoring deployments may fail.

12:55 The API seems to have stabilized. The database appears to have been under heavy load. We are investigating the queries responsible for that load and trying to improve them.

Friday 10th March 2023

Infrastructure [PAR] Investigating network issues

We are currently investigating network issues on our Paris zone.

EDIT 17:15 UTC: The issue is now resolved. Part of our infrastructure in Paris could no longer reach some public DNS servers, causing multiple DNS queries to fail. An upstream network provider made a change that fixed the problem around 16:52 UTC.

Thursday 9th March 2023

No incidents reported

Wednesday 8th March 2023

No incidents reported

Tuesday 7th March 2023

API Core API is experiencing issues

Clever Cloud Core API is currently experiencing performance issues. We are investigating it.

EDIT 16:03 UTC: We are seeing improvements. We continue to monitor the situation and investigate the root cause, and we are adding more data collection around the various points of contention.

Infrastructure [PAR] A hypervisor went down

A hypervisor went down and we are investigating. Applications are being redeployed.

Update 11:11 AM UTC: The hypervisor has been rebooted and add-ons should be reachable. The root cause of the issue will be determined later. In the meantime, applications hosted on that hypervisor are still redeploying. We continue to monitor the situation.

Update 03:13 PM UTC: The same hypervisor went down again. It has been rebooted and add-ons should be reachable. In the meantime, applications hosted on that hypervisor are still redeploying. We continue to monitor the situation.

Monday 6th March 2023

No incidents reported