Some systems are experiencing issues

Past Incidents

Tuesday 22nd June 2021

Infrastructure PAR: connectivity issue / high latency

2021-06-22

We are currently experiencing connectivity issues or high latency in part of our Paris infrastructure. Our network provider is aware of the issue and is currently investigating.

10:03 UTC: The issue seems to affect only one of the datacenters. Applications that use services deployed on another datacenter might suffer from connectivity issues or increased latency.

10:15 UTC: We are removing the IPs of the affected datacenter from all DNS records of load balancers (public, internal and Clever Cloud Premium customers) and are awaiting more info from our network provider.

10:19 UTC: Packet loss and latency have been decreasing since 10:12 UTC and seem to be back to normal now. We are awaiting confirmation that the incident is actually resolved.

10:23 UTC: We are working on resolving issues caused by this network instability and making sure everything works fine.

10:25 UTC: Logs ingestion is fixed. We are working on bringing back Clever Cloud Metrics.

10:31 UTC: IPs removed from DNS records at 10:15 UTC will be added back once we have confirmation that the network issue is definitely fixed.

10:41 UTC: Full loss of connectivity between the two Paris datacenters for a few seconds around 10:39 UTC. We are still experiencing packet loss now. Our network provider is working with the affected peering network on this issue.

10:45 UTC: Depending on your own network provider, the two Paris datacenters may be unreachable.

10:49 UTC: Network is overall very flaky. Our network provider and peering network provider are still investigating.

10:57 UTC: According to our network provider, many optical fibers in Paris are damaged. Some interconnection equipment might be flooded. We are waiting for more information.

11:02 UTC: (Network and infrastructure inside each datacenter are safe. The issue is clearly happening outside the datacenters.)

11:13 UTC: Network is still flaky. Overall very slow. We are still waiting for a status update from our network and peering providers.

11:20 UTC: Network seems better towards one of the datacenters. We invite you to remove all IPs starting with "46.252.181" from your DNS.

11:42 UTC: Still waiting for information from our network providers. Still no ETA.

12:16 UTC: Packet loss between the datacenters has decreased a bit. The Console should be more accessible.

12:21 UTC: Connections are starting to come back UP. We are still watching and waiting for more information from our network providers.

12:30 UTC: Update from our provider: of the 4 optical fibers, 1 is "fine". They cannot promise it will stay that way. They are still working on it, and teams have been dispatched on the premises.

13:15 UTC: Network is still stable. We are keeping Metrics down for now as it uses a significant amount of bandwidth between datacenters.

13:48 UTC: A second optical fiber is back UP. According to our provider, "it should be fine, now". The other two fibers are still down. The on-site teams are analysing the situation.

13:41 UTC: You can now add back these IPs to your domains:

@ 10800 IN A 46.252.181.103
@ 10800 IN A 46.252.181.104
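
If you removed these records earlier, here is a minimal sketch to confirm that a domain resolves to both addresses again, using only the Python standard library. "example.com" is a placeholder for your own domain, and resolvers may keep serving cached answers until the 10800-second TTL expires.

# Check that a domain's A records include both re-added load balancer IPs.
# "example.com" is a placeholder; replace it with your own domain.
import socket

EXPECTED = {"46.252.181.103", "46.252.181.104"}

def resolved_a_records(domain: str) -> set[str]:
    # gethostbyname_ex returns (hostname, alias list, IPv4 address list)
    _, _, addresses = socket.gethostbyname_ex(domain)
    return set(addresses)

if __name__ == "__main__":
    addresses = resolved_a_records("example.com")
    missing = EXPECTED - addresses
    print("resolved:", sorted(addresses))
    print("OK, both IPs present" if not missing else "missing: " + ", ".join(sorted(missing)))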

15:35 UTC: We are bringing Clever Cloud Metrics back up. It is now ingesting the data that accumulated in the queue while the storage backend was down.

16:45 UTC: Clever Cloud Metrics ingestion delay is back to normal.

17:16 UTC: The situation is currently stable but may deteriorate again. We are closely monitoring it. A postmortem will be published in the following days. If the issue comes back, this incident will be updated again. Sorry for the inconvenience.

17:31 UTC: A 30-second network interruption happened between 17:22:42 and 17:23:10; it was an isolated maintenance event performed by the datacenter's network provider.

2021-06-23

07:01 UTC: This incident has been marked as fixed: everything has been working as expected since the second optical fiber link was restored, except for the interruption mentioned in the previous update. Do note that, as of now, we are not at our normal redundancy level, as the other two optical fiber links are still down. We will update this once we have more information.

10:23 UTC: We have confirmation that a non-redundant third optical fiber link was added at 00:30 UTC. It is only meant to add bandwidth capacity and does not solve the redundancy issue. However, our network provider also tells us that their monitoring shows the redundant link just came back up, although this may be temporary and the link may not be using its usual optical path.

16:13 UTC: The redundant link that came back at 10:23 UTC is stable. It may be re-routed to use another physical path at some point but we can now consider that our inter-datacenter connectivity is indeed redundant again.

Monday 21st June 2021

No incidents reported

Sunday 20th June 2021

No incidents reported

Saturday 19th June 2021

No incidents reported

Friday 18th June 2021

No incidents reported

Thursday 17th June 2021

Deployments Deployments delayed

From 08:41 UTC to 08:52 UTC, deployments were queued up and very few of them were starting.

This was due to an update that has now been rolled back.

Wednesday 16th June 2021

Infrastructure PAR: Network accessibility issue

Post Mortem

(The original incident text can be found at the end)

A network issue caused 17 minutes of full unreachability of the Paris zone. This in turn caused some applications to go down, slowed down our deployment system while it restarted the affected applications, and impacted several other services.

Timeline

10:12 UTC: The whole PAR network is unreachable from outside; the cross-datacenter network is down as well.

10:16 UTC: The on-call team is warned by an external monitoring system.

10:21 UTC: Our network provider informs us that they are aware of the issue.

10:29 UTC: The network is back.

10:30 UTC: The monitoring systems are starting to queue a lot of deployments. The load of one monitoring system in charge of one of the PAR datacenters increases significantly. Other systems such as Logs, Metrics, and Access Logs (collection and query) are also impacted and unavailable. Some applications relying on FSBucket services (mostly PHP applications) are also having communication issues with their FSBuckets. This might have made some applications unreachable and their I/O very high, sometimes leading to Monitoring/Scaling deployments. This particular issue was detected later during the incident.

10:35 UTC: Our network provider confirms to us that the issue is fixed.

10:50 UTC: Deployments are slow to start because many of them are in the queue.

11:00 UTC: Because its load is too high, the faulty monitoring system sees more applications as down than there actually are, and queues even more deployments for applications that are in fact reachable.

11:15 UTC: Clever Cloud Metrics is back, delayed data points have been ingested. Writing to the ingestion queue is still subject to problems.

11:20 UTC: We notice the build cache management system is overloaded, slowing down deployments and failing those that rely on the build cache feature. The retrying of these failed deployments adds even more items to the deployment queue.

11:28 UTC: We start upscaling the build cache management system beyond its original maximum setting.

11:52 UTC: We believe an issue found in the past few days within the build cache management system is responsible for the slowness/unreachability of the build cache service. This issue caused a thread leak which had been triggering more upscalings than usual. A fix was being tested on our testing environment but was not yet validated. We try to push this fix to production.

12:48 UTC: The fix pushed to production at 11:52 UTC is not effective. We upscale the build cache management system again.

13:00 UTC: Logs collection is back. Logs collected before this time were lost. Queries are also available but might still fail sometimes or return delayed logs.

13:05 UTC: We prevent the overloaded monitoring system from queuing up more deployments and empty out its internal alerting queue.

13:10 UTC: We roll back a change made on the database a few days ago, which we believe is the root cause of the ongoing issue.

13:16 UTC: The build cache management system database load starts to go up. This is caused by the application making requests to the database more efficiently thanks to the previous rollback.

13:18 UTC: The build cache management system database is overloaded.

13:33 UTC: We start looking into optimizing requests and clearing up stale data.

13:59 UTC: We manage to bring the build cache management system database load down.

14:05 UTC: The build cache management system is still overloaded/slow despite its database now working properly. A deployment is queued with an environment config change but is slow to start. We restart the application manually to apply this change.

14:10 UTC: The configuration change is effective; the deployment queue starts to empty, but there are still a lot of deployments in it.

14:15 UTC: An older deployment that was waiting to be processed, performed without the environment change, finishes successfully, leading to about half of the build cache requests failing.

14:17 UTC: We start reapplying the fix manually on live instances while a new deployment with the correct environment is started. The deployment queue size is going down.

14:29 UTC: The deployment queue is filling up again.

14:53 UTC: We realize the faulty monitoring system is still queuing deployments despite its alerting queue being empty and the alerting action being disabled.

14:57 UTC: We completely restart the faulty monitoring system and make sure it stops queuing deployments.

15:10 UTC: We are now certain the previously faulty monitoring system has stopped queuing deployments for false positives. The deployment queue is back to normal and the deployment system is more responsive.

15:15 UTC: We start cleaning stuck deployments and making sure everything is working fine.

15:42 UTC: We start redeploying all Paris PHP applications which have not been deployed since the network came back.

16:00 UTC: Some PHP deployments seem to be failing due to a connection timeout to their PHP session stored on an FSBucket. We abort the PHP deployment queue to avoid any more errors.

16:10 UTC: The connection was only broken on one hypervisor and is now fixed. We also make sure every other hypervisor can contact all FSBucket servers on the PAR zone.

16:15 UTC: The PHP deployment queue is started again, with a shorter delay between deployments.

16:42 UTC: Clever Cloud Metrics / Access logs ingestion is now fixed. Queries should be returning up-to-date data. Access logs were stored in a different queue and have been entirely consumed.

17:05 UTC: The PHP deployment queue has now been fully processed. All other applications in the PAR zone that had not been redeployed since the network came back have also been queued for redeployment, to fix any connection issues with their FSBucket add-ons.

19:10 UTC: A few applications which have the “deployment with downtime” option enabled were supposed to be UP but had no running instances. Those applications are now being redeployed.

Network incident details

Foreword: Clever Cloud has servers in two datacenters in the Paris zone (PAR). In this post-mortem, they are named PAR4 and PAR5.

A routine maintenance operation performed by our Network Provider on PAR4 started a few minutes before the incident. This maintenance consisted of decommissioning a router and was not expected to impact the network. Various checks and monitoring were in place, as usual, and a quick rollback procedure was planned in case anything went wrong.

The decommission triggered an unexpected election of another router, which in turn triggered a lot of LSA (link-state advertisement) updates between all the routers of the datacenter, sometimes doubling them. Those updates created new LSA rules on other routers, which at first made them slower to update and to route traffic. Some of the routers then hit a configuration limit on the number of LSA rules. When hitting the limit, a router went into protection mode and shut itself down. Each shutdown triggered further LSA updates on other routers, which then also hit their LSA limit and entered protection mode. This isolated the PAR4 site from the network.

A piece of internal equipment with a link between PAR4 and PAR5 also propagated those LSA updates to the PAR5 routers, replicating the exact same scenario.

To fix this, our Network Provider disconnected some routers, lowering the number of LSA announcements across the network and bringing the routers back online.
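
To make the cascade easier to follow, here is a deliberately simplified toy model of the failure mode described above. The topology, the LSA limit, and the size of the extra flood triggered by a shutdown are illustrative assumptions made for this sketch, not our network provider's actual configuration.

# Toy model: each router tracks an LSA count; exceeding its limit sends it
# into protection mode (shutdown), and the shutdown floods extra LSA updates
# to its neighbours, which can push them over their own limit in turn.
from collections import deque

LSA_LIMIT = 100       # illustrative per-router limit on LSA entries
SHUTDOWN_FLOOD = 20   # illustrative extra LSAs a neighbour receives per shutdown

def simulate(neighbours: dict[str, list[str]], lsa_count: dict[str, int]) -> list[str]:
    down: list[str] = []
    queue = deque(neighbours)              # every router evaluates the initial burst
    while queue:
        router = queue.popleft()
        if router in down or lsa_count[router] <= LSA_LIMIT:
            continue
        down.append(router)                # protection mode: the router shuts itself down
        for peer in neighbours[router]:
            if peer not in down:
                lsa_count[peer] += SHUTDOWN_FLOOD
                queue.append(peer)
    return down

# Two sites joined by one device that relays updates across, as in the incident.
topology = {
    "par4-r1": ["par4-r2", "inter-dc"],
    "par4-r2": ["par4-r1", "inter-dc"],
    "inter-dc": ["par4-r1", "par4-r2", "par5-r1"],
    "par5-r1": ["inter-dc", "par5-r2"],
    "par5-r2": ["par5-r1"],
}
# The unexpected election floods PAR4 first; everything else starts below the limit.
burst = {"par4-r1": 110, "par4-r2": 110, "inter-dc": 90, "par5-r1": 90, "par5-r2": 90}
print(simulate(topology, burst))           # the shutdown cascade reaches both sites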

Actions

Network provider

Actions taken

  • The equipment that had links between the two datacenters has been isolated and is now in its own network. This makes sure LSA updates aren't inadvertently sent to the second datacenter.
  • An isolation timeout has been lowered from 5 minutes to 1 minute, making the system react faster to failures.

Actions planned in a few days

  • Forbid any non-primary router from being elected as leader, to avoid any further issues. Under their support contract, our network provider has officially filed a bug report with the manufacturer of the router that did not behave as expected and is awaiting a fix and any relevant information.
  • Routers will now reject new LSA rules when they hit their limit instead of going into protection mode. This will result in a degraded network at first instead of an outright broken one. There are currently 4 different brands of routers, and each of them will be tested separately.
  • Other security measures have been taken. Additional monitoring and logging will also be added.

Clever Cloud

Actions taken

  • Improved the performance of the build cache management system's database interactions, as well as the performance of the database itself
  • Fixed a deployment system bug with urgent queues, which allows us to deploy some applications before others (internal and Clever Cloud Premium customers)

Actions planned

  • Further improve performance and resilience of the build cache management system.
  • Improve the monitoring of the alerts queue and of the number of unreachable deployments being processed
  • Improve the visibility of urgent alerts among a high number of alerts
  • Improve the monitoring of the logs storage system
  • Improve the monitoring of the connectivity between FS buckets servers and hypervisors
  • Improve the monitoring of applications that should be up but have no running instances
  • Improve our communication on our status page to post updates more frequently

Original incident details

We are currently experiencing a network accessibility issue on our PAR zone. We are investigating.

EDIT 12:21 UTC+2: Our network provider is looking into the issue.

EDIT 12:28 UTC+2: Deployments on other zones might not work correctly, but traffic shouldn't be impacted.

EDIT 12:30 UTC+2: Network connectivity seems to be back. We are awaiting confirmation of incident resolution from our network provider.

EDIT 12:35 UTC+2: Our network provider found the issue and fixed it. The network has been back online since 12:30 UTC+2. An investigation will be conducted to understand why the secondary link wasn't used.

EDIT 12:42 UTC+2: A postmortem will be made available later once everything has been figured out.

EDIT 12:50 UTC+2: The deployment queue is currently being processed; queued deployments might take a few minutes to start.

EDIT 13:00 UTC+2: Logs may also be unavailable, depending on the application.

EDIT 13:20 UTC+2: The deployment queue still has a lot of items; the build cache feature is currently having trouble, which slows down deployments.

EDIT 14:33 UTC+2: The deployment queue is now smaller, but some deployments still have issues. Logs are also partially available.

EDIT 15:30 UTC+2: The build cache feature still has issues; we are currently working on a workaround. Logs should now be back, but there is a processing delay which might affect their availability in the Console / CLI. They might be a few minutes late.

EDIT 16:04 UTC+2: Some applications linked to FSBucket services might have lost their connection to the FSBucket, increasing their I/O and possibly causing them to reboot in a loop for either Monitoring/Unreachable or Monitoring/Scalability reasons. This can cause response timeouts, especially for PHP applications.

EDIT 16:16 UTC+2: Build cache should be fixed, meaning that deployments should take less time

EDIT 16:53 UTC+2: There are still a lot of Monitoring/Unreachable events being sent, making a lot of applications redeploy for no good reason. We are still working on it.

EDIT 17:18 UTC+2: The issue with Monitoring/Unreachable events has been fixed. The size of the deployments queue should go down.

EDIT 18:07 UTC+2: Most issues have been cleared up. PHP applications may still be experiencing issues; we are working on it. If you are experiencing issues on non-PHP applications, please contact us.

EDIT 19:05 UTC+2: All PHP applications have been redeployed. If you are still experiencing issues, please contact us. All other applications which have not already been redeployed since the beginning of the incident will be redeployed in the next few hours (to make sure no apps are stuck in a weird state).