Some systems are experiencing issues

Past Incidents

Monday 28th June 2021

Infrastructure PAR: A hypervisor is currently unavailable

A hypervisor is currently unavailable. Applications are restarting, and add-ons hosted on that hypervisor are currently unavailable. We are looking into the root cause.

EDIT 14:45 UTC: The server will not reboot as of now and we are not yet sure why. We are continuing to look into it. In the meantime, you can create a new add-on and import last night's backup into it. Please contact our support team for any further assistance.

EDIT 14:58 UTC: The server still will not reboot; we are continuing to investigate the cause.

EDIT 15:08 UTC: A ticket has been opened with the manufacturer. The server is still unreachable as of now.

EDIT 15:12 UTC: A server replacement is currently being discussed. In the meantime, we advise you to import last night's backup into a new add-on. If the hypervisor comes back, you will still be able to access your old add-on and possibly recover the data written between last night's backup and now, allowing you to merge it into the new add-on if possible. The current ETA is 24 hours.
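
A minimal sketch of that import, assuming the affected add-on is a PostgreSQL database and you have downloaded last night's dump from your backups (the file name, credentials and host below are placeholders, not real values):

pg_restore --no-owner --no-acl --dbname="postgresql://user:password@new-addon-host:5432/dbname" backup_2021-06-27.dump

For other add-on types, the equivalent is loading the dump with that database's own client (for example mysql for MySQL). Contact our support team if you are unsure which command applies to your add-on.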

EDIT 16:38 UTC: No server replacement will happen; we will have more information to share tomorrow once the manufacturer gets back to us.

EDIT 16:54 UTC: Clarification: no server replacement will happen tonight. There is no sign of disk or data corruption; it seems to be purely a hardware problem, which we cannot fix right now.

EDIT 29/06/21 09:30 UTC: Maintenance on the server should take place in the next few minutes. The goal is to replace the faulty hardware component. More information to come.

EDIT 13:17 UTC: The maintenance has been performed and a hardware component has been replaced, but it did not fix the issue. We are continuing to investigate.

EDIT 13:26 UTC: The component initially replaced was the network card. Another replacement, this time of the motherboard, has been planned for tomorrow. We do not yet have the exact time.

EDIT 30/06/21 11:09 UTC: The motherboard has been replaced; additional checks are being performed.

EDIT 13:03 UTC: The motherboard replacement did not improve the situation. The server reboots fine without the network card, which has already been changed. A full server replacement is being considered by the manufacturer.

EDIT 18:23 UTC: Our infrastructure provider has supplied a temporary replacement server, which is now up and running. Add-ons and custom services are all up and running again. Do note that this is a temporary replacement: once the manufacturer returns the repaired server or provides a fully working permanent replacement, we will have to switch to it (meaning a shutdown of a few minutes). Affected customers will be e-mailed about this.

Deployments: Invalid build cache

Some applications may fail to deploy because they try to compile on a runtime instance when a build instance has been configured. Explicitly triggering a rebuild should fix the issue.
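
For example, assuming the clever-tools CLI is installed and linked to the affected application, a command along these lines should force a rebuild (the --without-cache option is an assumption here; check clever help restart for the exact flags available in your version):

clever restart --without-cache

Alternatively, pushing a new commit (even an empty one created with git commit --allow-empty) to the Clever Cloud remote triggers a fresh deployment.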

Sunday 27th June 2021

No incidents reported

Saturday 26th June 2021

No incidents reported

Friday 25th June 2021

No incidents reported

Thursday 24th June 2021

No incidents reported

Wednesday 23rd June 2021

No incidents reported

Tuesday 22nd June 2021

Infrastructure PAR: connectivity issue / high latency

2021-06-22

We are currently experiencing connectivity issues and high latency to part of our Paris infrastructure. Our network provider is aware of the issue and is currently investigating.

10:03 UTC: The issue seems to only affect one of the datacenters. Applications that use services deployed in another datacenter might suffer from connectivity issues or increased latency.

10:15 UTC: We are removing the IPs of the affected datacenter from all DNS records of load balancers (public, internal and Clever Cloud Premium customers) and are awaiting more info from our network provider.

10:19 UTC: Packet loss and latency have been decreasing since 10:12 UTC and seem to be back to normal now. We are awaiting confirmation that the incident is actually resolved.

10:23 UTC: We are working on resolving issues caused by this network instability and making sure everything works fine.

10:25 UTC: Logs ingestion is fixed. We are working on bringing back Clever Cloud Metrics.

10:31 UTC: IPs removed from DNS records at 10:15 UTC will be added back once we have confirmation that the network issue is definitely fixed.

10:41 UTC: Full loss of connectivity between the two Paris datacenters for a few seconds around 10:39 UTC. We are still experiencing packet loss now. Our network provider is working with the affected peering network on this issue.

10:45 UTC: The two Paris datacenters may be unreachable, depending on your own network provider.

10:49 UTC: Network is overall very flaky. Our network provider and peering network provider are still investigating.

10:57 UTC: According to our network provider, many optical fibers in Paris have been damaged. Some interconnection equipment might be flooded. We are waiting for more information.

11:02 UTC: (Network and infrastructure inside each datacenter are safe. The issue is clearly happening outside the datacenters.)

11:13 UTC: Network is still flaky. Overall very slow. We are still waiting for a status update from our network and peering providers.

11:20 UTC: Network seems better towards one of the datacenters. We invite you to remove all IPs starting with "46.252.181" from your DNS records.

11:42 UTC: Still waiting for information from our network providers. Still no ETA.

12:16 UTC: Packet loss between the datacenters has decreased a bit. The Console should be more accessible.

12:21 UTC: Connections are starting to come back UP. We are still watching and waiting for more information from our network providers.

12:30 UTC: Update from our provider: out of the 4 optical fibers, 1 is "fine". They cannot promise this one will stay fine. They are still working on it; teams have been dispatched on site.

13:15 UTC: Network is still stable. We are keeping Metrics down for now as it uses a significant amount of bandwidth between datacenters.

13:48 UTC: A second optical fiber is back UP. According to our provider, "it should be fine, now". The other two fibers are still down. The on-site teams are analysing the situation.

13:41 UTC: You can now add back these IPs to your domains:

@ 10800 IN A 46.252.181.103
@ 10800 IN A 46.252.181.104
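
Once re-added, you can verify that the records are being served again with dig, where example.com stands in for your own domain:

dig +short A example.com

If your local resolver still returns a cached answer, querying a public resolver directly (for instance dig @8.8.8.8 +short A example.com) can help confirm propagation.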

15:35 UTC: We are bringing Clever Cloud Metrics back up. It is now ingesting the data that accumulated in the queue while the storage backend was down.

16:45 UTC: Clever Cloud Metrics ingestion delay is back to normal.

17:16 UTC: The situation is currently stable but may deteriorate again. We are monitoring it closely. A postmortem will be published in the coming days. If the issue comes back, this incident will be updated again. Sorry for the inconvenience.

17:31 UTC: A 30-second network interruption happened between 17:22:42 and 17:23:10; it was an isolated maintenance event performed by the datacenter's network provider.

2021-06-23

07:01 UTC: This incident has been marked as fixed: everything has been working fine, as expected, since the second optical fiber link was restored, except for the interruption mentioned in the previous update. Do note that as of now we are not at the normal redundancy level, as the other two optical fiber links are still down. We will update this once we have more information.

10:23 UTC: We have confirmation that a non-redundant third optical fiber link was added at 00:30 UTC. It is only meant to add bandwidth capacity and does not solve the redundancy issue. However, our network provider also tells us that their monitoring shows the redundant link just came back up, although this may be temporary and the link may not be using its usual optical path.

16:13 UTC: The redundant link that came back at 10:23 UTC is stable. It may be re-routed over another physical path at some point, but we can now consider our inter-datacenter connectivity to be redundant again.