EDIT 19:46 UTC+1: The underlying storage system is currently having issues and is rebalancing its data. No data loss is expected, but timeouts may occur. We are working to stabilize the system.
EDIT 20:09 UTC+1: The underlying storage system has been stable for the last 5 minutes. We are keeping an eye on it to make sure everything is okay.
EDIT 21:47 UTC+1: The service is now stable. We will need to perform additional maintenance to fully fix the underlying issue. We will schedule the corresponding maintenance windows in the coming days.
EDIT 14:00 UTC : We have updated the load balancer configuration; you may have seen some connections cut during the reload.
EDIT 14:30 UTC : We have seen the same instabilities on the RBX HDS database load balancer, so we have applied the same patch to the RBX database load balancer.
We suspect that we may have been impacted by one of our infrastructure provider's maintenance operations. See:
EDIT 14:00 UTC : We are confident this was not linked to the infrastructure provider and was an isolated incident. We are still monitoring, but the issue seems to be resolved.
]]>Scope:
Expected Impact:
Additional Information:
EDIT 11:00 UTC : We have updated the cleverapps load balancers and will restart them soon. We will proceed with further upgrades this afternoon.
EDIT 14:20 UTC : We are beginning the update of the Paris load balancers.
EDIT 18:00 UTC : We are still updating the load balancers.
EDIT 19:30 UTC : We have paused the update process; we will continue it tomorrow.
EDIT 10:00 UTC : We are resuming the load balancer updates.
EDIT 10:45 UTC : We have finished the updates.
]]>Scope:
Expected Impact:
Additional Information:
EDIT : The maintenance window has been updated to next Tuesday.
EDIT 14:00 UTC : We are beginning the maintenance.
EDIT 16:00 UTC : We have finished installing the new hardware alongside the existing one on rbxhds; we will begin switching traffic on the database and software load balancers. We are also starting the installation of load balancers in the rbx region.
EDIT 16:20 UTC : We are switching the database load balancer instances.
EDIT 16:45 UTC : We have fully switched the load balancers of the rbxhds region and finished installing the new load balancers alongside the current ones in the rbx region. We will now begin switching traffic to the new instances.
EDIT 17:15 UTC : We have finished switching traffic from the old load balancers to the new ones. The maintenance is over.
We are working on it.
[2024-02-27 08:50 UTC] We identified the root cause and fixed it. Everything should now be OK.
]]>Query is back online
EDIT 17:45 UTC: We have applied a configuration change to try to mitigate the issue. We are monitoring.
EDIT 18:00 UTC : We have performed a rolling reboot of the load balancers to give them more capacity.
]]>EDIT 16:50 UTC : The shared cluster is up and running.
EDIT 16:45 UTC : The hypervisor has rebooted and is now operational.
EDIT 2024-02-19 21:00 UTC : The maintenance was successfully completed.
]]>EDIT 18:02 UTC: The issue has been identified and fixed. If you pushed any commits that didn't get applied, please let our support know about it so we can force a deployment.
EDIT 15:30 UTC : We have resolved the git authentication issue.
]]>Scope:
Expected Impact:
Additional Information:
EDIT : We will perform the maintenance next week, on 2024-03-07.
EDIT 17:30 UTC : We have finished deploying the new load balancers alongside the current ones. We will now switch traffic from the old instances to the new ones.
EDIT 17:32 UTC : We have switched traffic from the old database load balancer to the new one; an unexpected behavior occurred and has now been fixed. Some connections may have been refused during that time.
EDIT 17:35 UTC : We will begin to switch the application load balancer soon.
EDIT 17:50 UTC : We have switched the first instance of application load balancer.
EDIT 18:00 UTC : We have finished rolling the application load balancers. We are monitoring.
]]>Scope:
Expected Impact:
Additional Information:
EDIT : We have moved the maintenance to 2024-02-23 instead of 2024-02-22.
EDIT 15:00 UTC: We are beginning the maintenance, installing the new setup alongside the current one. We will perform the failover next week.
EDIT 2024-02-26 16:30 UTC : We have added two new IP addresses to the domain.mtl.clever-cloud.com DNS records.
EDIT 2024-02-26 17:00 UTC : We have removed the two old IP addresses from the domain.mtl.clever-cloud.com DNS records.
EDIT 2024-02-26 17:15 UTC : We will update the DNS records for the database load balancers.
EDIT 2024-02-26 17:30 UTC : We have updated the DNS records for the database load balancers; we are monitoring.
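If you want to check that your resolvers are already serving the updated records, here is a minimal sketch using dnspython; the record name comes from the edits above, while the expected addresses are placeholders to fill in with the announced IPs.

```python
# Minimal sketch (dnspython, `pip install dnspython`): check which A records
# your resolver currently serves for the updated name. EXPECTED holds
# placeholder values; replace them with the announced addresses.
import dns.resolver

EXPECTED = {"192.0.2.10", "192.0.2.11"}  # placeholders, not the real new IPs

answer = dns.resolver.resolve("domain.mtl.clever-cloud.com", "A")
served = {rdata.address for rdata in answer}
print(f"served: {served} (TTL {answer.rrset.ttl}s)")
if EXPECTED <= served:
    print("resolver already serves the new records")
else:
    print("old records still cached; wait for the TTL to expire")
```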
]]>Scope:
Expected Impact:
Additional Information:
EDIT 13:30 UTC : We are preparing the hardware and software upgrade alongside the current stack.
]]>Scope:
Expected Impact:
Additional Information:
EDIT 13:30 UTC : We are starting preparations for the upgrades.
EDIT 13:50 UTC : Preparation is complete; we are beginning the rolling update of the application load balancers.
EDIT 14:20 UTC : We have finished rolling the application load balancers.
EDIT 14:30 UTC : We are starting to roll the database load balancers.
EDIT 14:35 UTC : We have finished rolling the database load balancers.
]]>Scope:
Expected Impact:
Additional Information:
EDIT 13:15 UTC : We are beginning the hardware upgrade alongside the current hardware.
EDIT 14:20 UTC : The hardware upgrade is finished; we will start the rolling update with the application load balancers.
EDIT 14:35 UTC : We have rolled the first application load balancer; we are beginning the second one.
EDIT 15:00 UTC : We have finished rolling the application load balancers; we are beginning the database load balancers.
EDIT 15:15 UTC : We have rolled the first database load balancer; we are monitoring.
EDIT 15:25 UTC : We have rolled the second database load balancer.
EDIT 15:25 UTC : We have rolled all load balancers. We are keeping an eye on them, but the maintenance is over.
]]>Expected downtime of the service is 30 minutes. During that time, git and mercurial operations might fail as well as loading the UI.
EDIT 16:45 UTC: The update is over.
]]>EDIT 10:15 UTC : We have started the maintenance
EDIT 11:15 UTC : We have finished the maintenance
EDIT 21:30 UTC : The hypervisor is now responding after a hard reboot. We are currently ensuring that every virtual machine is in a healthy state and investigating the root cause of the HV crash.
EDIT 22:00 UTC: Every VM on the hypervisor is running as expected. The root cause was a kernel panic; the kernel has been moved to a more stable version.
]]>EDIT 15:00 UTC : The software upgrade is still in progress
EDIT 15:30 UTC : The first server that hosts a load balancer instance has been updated.
EDIT 15:45 UTC : We are proceeding with the other load balancers.
EDIT 16:30 UTC : We have updated 2/3 of the load balancers.
EDIT 17:00 UTC : We have updated all load balancers.
]]>EDIT 10:30 UTC : we have begun the maintenance procedure for one of the two instances.
EDIT 11:10 UTC : we have finished the upgrade, we will restart the instance this afternoon around 14:00 UTC.
EDIT 15:00 UTC : We have restarted one of the two load balancer instances; we are watching the metrics to compare the two versions.
EDIT 9:30 UTC D+1 : Since yesterday, the telemetry we have observed shows improvements, so we will begin the update of the second instance.
EDIT 11:00 UTC D+1 : The update was completed without issues.
]]>EDIT 15:34 UTC+1: Patches were applied and services were restarted. The maintenance is now over.
Edit Tue Jan 23 17:59:56 2024 UTC: A faulty configuration was applied to a node to investigate a memory leak. The configuration backfired on the whole cluster, making it unhealthy. The configuration has been rolled back. The storage layer is currently in healing mode. To speed up the recovery, queries have been disabled.
Edit Tue Jan 23 19:51:21 2024 UTC: The cluster is now healthy and catching up the lag, which should take a few hours. Queries will be re-opened once the lag is absorbed.
Edit Wed Jan 24 00:04:59 2024 UTC: The data lag is now OK. We are still reloading the metrics' metadata, so queries are still unavailable. They should be up in a few hours.
Edit Wed Jan 24 01:54:22 2024 UTC: The metadata lag is now OK; queries are back online.
EDIT Thu Jan 25 11:00:00 2024 UTC : The platform is now OK; we are ingesting the lag.
EDIT Thu Jan 25 16:54:00 2024 UTC : The lag has been ingested. Some applications may not have their access logs reachable yet.
]]>Update Tue Jan 16 17:11:02 2024 UTC: cluster is no longer applying rate-limit
]]>We will update this status accordingly.
EDIT 2024-01-10 20:00 UTC: Maintenance is over, no impact during the operations.
]]>EDIT 15:58 UTC: The issue has been identified and deployments should be back to normal since 15:40 UTC.
]]>Update Thu Jan 04 14:48:00 2024 UTC: We have triggered some data balancing. Some queries may take longer than expected. This can impact some of the grafana dashboards or API queries. Write performance may be impacted.
Update Thu Jan 04 20:44:01 2024 UTC: data balancing is more aggressive than expected, overloading some components. Query may be unavailable during that time
Update Fri Jan 05 02:26:05 2024 UTC: some components are still overloaded. We are currently catching up the lag, but query is disabled for now.
Update Fri Jan 05 08:01:45 2024 UTC: our write-path is still overloaded. We are searching for the bottleneck
Update Fri Jan 05 16:03:48 2024 UTC: a cleanup subroutine has been triggered to balance and remove slack space from our internal Btree storage. Query is still disabled to speed-up the process.
Update: Sat Jan 06 11:25:28 2024 UTC: lag has been absorbed. Query is now up, the cleanup subroutine is still in-progress. You may notice latency spikes during query.
Update: Mon Jan 08 14:36:57 2024 UTC: cleanup subroutine is still in-progress, and some workloads triggered an overloading of some components. Query is disabled to speed-up recovery
Update: Mon Jan 08 16:36:18 2024 UTC: query is now open.
Update Tue Jan 09 14:38:34 2024 UTC: Some StorageServers are late, meaning that a really small portion of the data is not available for the query. We are currently catching up with the lag
Update Tue Jan 16 14:56:55 2024 UTC: closing the ticket.
EDIT 15:15 UTC: We are still digging into the issue; the abnormal traffic is over and everything seems to be going back to normal.
EDIT 16:30 UTC : We have put the IP address 46.252.181.103 back in the load balancer pool.
]]>EDIT 2023-12-30 00:51 UTC: The problem has been identified and resolved. The component is back in the pool and is working as expected. This incident is now over.
]]>EDIT 09:44 UTC: The issue is not fully resolved yet but we are seeing improvements. We continue working on the issue.
EDIT 11:04 UTC: Queries are now working since 10:15 UTC, we continue monitoring to ensure everything is working as intended.
EDIT 15:43 UTC: Everything is back to normal, this incident is now over.
EDIT 03:17 UTC : No databases are affected on this hypervisor and applications have been redeployed.
EDIT 03:30 UTC : The hypervisor has been rebooted and everything is back to normal.
]]>EDIT 3:37 UTC : The issue seems to be related with the following OVH incident : https://bare-metal-servers.status-ovhcloud.com/incidents/x135vv46x85l
EDIT 3:45 UTC : Applications on this hypervisor are currently redeploying and no add-ons are hosted on it. We have also temporarily removed the affected A record from domain.rbx.clever-cloud.com to solve connection issues.
EDIT 4:00 UTC : Applications have been redeployed; we are waiting on OVH to go further.
EDIT 05:30 UTC : The hypervisor is reachable again; we are starting the recovery process.
EDIT 05:45 UTC : The recovery process is over and everything works normally. The load balancer IP affected by the incident will be put back in the pool later; for the record, the IP is 87.98.177.176 for domain.rbx.clever-cloud.com.
EDIT 10:20 UTC : The investigation is still in progress and we are mitigating the issue by raising the maximum number of connections.
EDIT 11:00 UTC : We are back to nominal values; we are still monitoring.
]]>We will update this status accordingly.
EDIT 15:10 UTC: Maintenance is over, no impact during the operations.
EDIT 06:07 AM UTC: The hypervisor had become unresponsive due to a very high CPU load average. It has been rebooted. Almost all databases are reachable; we are fixing the last ones.
EDIT 06:45 AM UTC: All databases are now up.
EDIT 2023-12-21 16:00 UTC+1: We found and fixed the root cause. Matomo add-ons can now be ordered again.
]]>We will update this status accordingly.
EDIT 17:30 UTC: Maintenance is over, no impact during the operations.
EDIT 16:00 UTC : We have found that one of our customers is under a DDoS attack; we are mitigating the issue.
EDIT 16:30 UTC : The DDoS seems to be mitigated; we are monitoring.
]]>EDIT 10:55 UTC: The hypervisor went back online at 10:33 UTC. All applications were redeployed to another hypervisor. The incident is now over.
Some databases became unavailable. We are checking that they all rebooted correctly.
EDIT 15:51 UTC: all checks have completed. All the services are operational.
EDIT 04/12/2023 11:00 UTC : It seems that the load balancer behind the ip 212.129.27.183 was impacted by the incident. The issue is solved.
]]>Consequences: some applications on SCW may have lost connection to their database for a few minutes. They may have crashed and been redeployed by our monitoring.
EDIT 19:00 UTC : The issue has been solved.
]]>We will update this status accordingly.
EDIT 17:30 UTC: All updates are now over. Operations went smoothly and no impact was detected.
]]>We will update this status accordingly.
EDIT 23:15 UTC: All updates are now over. Operations went smoothly and no impact was detected.
]]>While performing the move, a network configuration issue arose, impacting only customers using TCP redirections on the PAR region.
As the team was focused on monitoring and fine-tuning the configuration of the new LB, it failed to see the error reports until 14:30 UTC. To prevent such an incident in the future, we have since improved our monitoring and alert tools for TCP redirects.
The issue was fixed by 14:55 UTC.
]]>We are investigating.
Edit 27 Nov 2023 11:02:23: Query is now functional. We are also observing an issue with metrics from add-ons. We are on it.
Edit 27 Nov 2023 06:00 PM: A regression in token regeneration has been fixed, and all tokens have been updated.
]]>EDIT 17:00 UTC : Cellar is available
]]>After running more tests, we discovered performance issues on long-distance connections, possibly caused by HTTP/2, which we activated on Cellar a few weeks ago. Our analyses confirmed that uploading data to Cellar using HTTP/2 in such conditions could heavily limit the throughput, whereas HTTP/1.1 gave us better and consistent results. The improvements seen for customers affected by the identified problems far outweigh the benefits of HTTP/2 seen in few cases. So we're disabling HTTP/2 and monitoring throughput to confirm this on a larger scale.
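As an illustration of the kind of comparison behind this decision, here is a rough, hedged sketch that times the same upload over HTTP/1.1 and HTTP/2 with httpx; the presigned URL is a placeholder you would generate for your own Cellar bucket with your usual S3 tooling.

```python
# Rough throughput comparison for uploads over HTTP/1.1 vs HTTP/2, using httpx
# (pip install "httpx[http2]"). PRESIGNED_URL is a placeholder for a presigned
# S3 PUT URL pointing at your Cellar bucket.
import time
import httpx

PRESIGNED_URL = "https://cellar-c2.services.clever-cloud.com/<bucket>/<key>?<signature>"  # placeholder
payload = b"\0" * (32 * 1024 * 1024)  # 32 MiB test object

for use_http2 in (False, True):
    with httpx.Client(http2=use_http2, timeout=120) as client:
        start = time.monotonic()
        resp = client.put(PRESIGNED_URL, content=payload)
        elapsed = time.monotonic() - start
        # resp.http_version shows which protocol was actually negotiated.
        print(f"{resp.http_version}: status={resp.status_code}, "
              f"{len(payload) / elapsed / 1e6:.1f} MB/s")
```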
We will begin including the new load balancer instances deployed yesterday in the load balancer pool starting at 14:00 UTC. The new load balancer IP addresses that will be added alongside the current ones are:
EDIT 15:30 UTC : Our monitoring saw an increasing number of 404 response status codes. We rolled back the modification and investigated the issue. It was an overlap of internal IP addresses with the Cellar load balancer, which is now fixed.
EDIT 15:45 UTC : After further investigation, we were able to resume the maintenance.
EDIT 18:05 UTC : We have finished deploying the new instances.
We have installed new load balancers. We will review and test them tonight and will add them to the lb pool tomorrow morning (2023-11-28).
We are still seeing a few random SSL errors here and there. We are investigating. The culprit may be a lack of allocated resources. We are following this lead.
… we have fine-tuned the load balancers, which temporarily caused more SSL errors for a minute. Traffic now seems better.
We are experiencing new errors on the load balancers: customers report PR_END_OF_FILE_ERROR errors in their browsers while connecting to their apps and SSL_ERROR_SYSCALL from curl.
We are able to reproduce these errors. They look like the incident from the morning of Friday the 24th. We are looking for the configuration mishap that may have escaped our review.
✅ It's fixed. We have started writing a monitoring script for that kind of configuration error; we will speed up its writing and deployment to production.
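For illustration only (this is not the actual Clever Cloud script, which is internal), an external check for this class of TLS configuration error can be as simple as attempting a handshake against each front-facing domain and flagging failures:

```python
# Purely illustrative sketch: attempt a TLS handshake against each domain and
# report failures such as handshake errors or abruptly closed connections.
import socket
import ssl

DOMAINS = ["example-app-1.cleverapps.io", "example-app-2.cleverapps.io"]  # hypothetical

context = ssl.create_default_context()
for domain in DOMAINS:
    try:
        with socket.create_connection((domain, 443), timeout=5) as sock:
            with context.wrap_socket(sock, server_hostname=domain) as tls:
                print(f"OK   {domain}: {tls.version()}")
    except (ssl.SSLError, OSError) as exc:
        print(f"FAIL {domain}: {exc}")
```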
We've been monitoring the load balancers all weekend: the only desync was observed (and fixed right away by the on-call team) on old sōzu versions (0.13) that are still processing 10% of Paris' public traffic! We plan to remove these old load balancers quickly this week.
We consider the desynchronization issue resolved.
Last Friday, we configured Cellar's front proxies to lower their reload rate. We haven't seen any slowness since, but it was already hard to reproduce on our side. No slowness on Cellar was reported during the weekend, but we are still on the lookout.
After more (successful) load tests, the new version of sōzu (0.15.17) is being installed on all impacted public and private load balancers. Upgrades should be over in the next two hours.
The team continues to investigate the random slowness issues still encountered by some customers, which we are trying to reproduce in a consistent way.
We've tested our new Sōzu release (0.15.17) all night with extra monitoring and no lag or crash was detected. The only remaining issues were on the non updated (0.13.6) instances. They were detected by our monitoring and the on-call team restarted them.
We are pretty confident that this new release solves our load balancers issues. We plan to switch all private and public Sōzu load balancers to 0.15.17 today and monitor them over the coming days.
Temporary incident:
While updating our configuration to grow the traffic share of the new (0.15.17) load balancers, a human mistake (and not a newly discovered bug) broke part of the configuration, causing many SSL version errors on 15% of the requests between 09:25 and 09:50 UTC.
As we planned earlier, the renewal of all certificates in RSA 2048 has been completed, except for a few wildcards (mostly ours) which require manual intervention. This will be dealt with shortly.
We were able to identify the root cause of our desync/lag in Sōzu. A specific request, a ‘double bug’, was causing worker crashes. We developed fixes and are confident they will fix our problems. We’ll test them and be monitoring the situation before deploying them fully in production.
We’ve upgraded our load balancers infrastructure and monitoring tools to check whether this will improve the various types of problems reported to us.
Background: Two months ago, we migrated our Let's Encrypt automatic certificate generation from RSA 2048 keys to RSA 4096 keys. Following a major certificates renewal in early November, this led to timeouts when processing requests, and then 504 errors.
Actions:
Back to normal: Within the day, while we finish regeneration.
Next steps: We have also explored a migration to the ECDSA standard, which according to our initial tests will enable us to improve both the performance and security levels of our platform. Such a migration will be planned in the coming months, after a deeper impact analysis.
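If you want to see which key type and size a given certificate currently uses (RSA 2048/4096 or ECDSA), a small sketch with the standard library and the cryptography package can tell you; the hostname below is a placeholder.

```python
# Hedged sketch: inspect the key type/size of a domain's TLS certificate.
import ssl
from cryptography import x509
from cryptography.hazmat.primitives.asymmetric import ec, rsa

pem = ssl.get_server_certificate(("example-app.cleverapps.io", 443))  # placeholder host
cert = x509.load_pem_x509_certificate(pem.encode())
key = cert.public_key()

if isinstance(key, rsa.RSAPublicKey):
    print(f"RSA {key.key_size} bits")
elif isinstance(key, ec.EllipticCurvePublicKey):
    print(f"ECDSA on curve {key.curve.name}")
else:
    print(type(key).__name__)
```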
Background: We noted a significant drop in HTTPS request processing performance, with capacity reduced from 8,000 to 4,000 requests per second, due in particular to an excessive number of syscalls via rustls.
Actions: We developed a Sōzu update and pushed it on November 16.
Back to normal: The problem is now resolved.
Background: Load balancers are sometimes out of sync, Sōzu gets stuck in TLS handshakes or requests. The workers no longer take the config updates, causing the proxy-manager to freeze. The load balancers then miss all new config updates until we restart them.
Actions: We have improved our tooling to detect the root cause of the problem at a deeper level. We have been able to confirm that this concerns both Sōzu versions 0.13.x and 0.15.x.
Next steps: We'll be tracing the problem in greater depth within the day, to decide what actions to take in the short term to mitigate the problem.
Background: Customers are reporting slowness or timeouts on Cellar, which we are now able to identify and qualify. While the cause has not been fully pinpointed, we have several ways of mitigating the problem.
Actions: Add capacity to front-ends infrastructure and enhance network configuration.
]]>The maintenance will start today November 20, 2023 at 12:00 UTC+1.
EDIT 2023-11-20 12:10 UTC+1: Maintenance is starting
EDIT 2023-11-20 13:00 UTC+1: Maintenance is now over, the addon cellar API is fully available
]]>EDIT 18:20: All addon load balancers have been fixed, we are currently actively monitoring their state
EDIT 2023-11-20 18:49 UTC: The fix has not been as effective as we would have hoped. We are currently issuing another fix. During the next few minutes, you might encounter some connection refused errors when connecting to some add-ons.
EDIT 2023-11-20 18:59 UTC: The operations are done. We are now monitoring the situation.
EDIT 15:00 UTC : The number of errors is decreasing; we are still investigating.
EDIT 15:30 UTC : We have identified the issue; we are deploying a patch.
EDIT 15:45 UTC : The patch has been applied successfully.
EDIT 16:00 UTC : The situation is back to normal; we are monitoring.
]]>EDIT 15th of november 09:09 AM UTC: cluster has been scaled up and partitions distributed among new brokers
]]>(times in CET)
The maintenance will start today November 14, 2023 at 12:00 UTC+1.
EDIT 2023-11-14 12:09 UTC+1: The API is down; we are postponing this maintenance.
EDIT 2023-11-14 15:20 UTC+1: Maintenance is starting.
EDIT 2023-11-14 17:10 UTC+1: Maintenance has been completed successfully.
]]>EDIT Mon Nov 13 18:51:01 2023 UTC: config tuning has been made, cluster is now fully recovered. Lag will be resolved within minutes
]]>EDIT: fixed
EDIT 16:35: This issue has been fixed.
This only concerns the Jenkins, Elasticsearch, MySQL, PostgreSQL, MongoDB and Redis APIs.
For each kind of add-on, expect a downtime of 20 to 30 minutes.
The maintenance will start tonight November 9, 2023 at 21:00 UTC.
EDIT 2023-11-09 22:00 UTC+1: Maintenance is starting.
EDIT 2023-11-10 01:00 UTC+1: Maintenance is now completed.
Edit 2023-11-09 : We are keeping this incident open as the performance issues seem to have diminished but not vanished. There seems to be a seasonality to these issues; we are still investigating why we see these surges in load.
]]>We are investigating this issue.
EDIT 16:30 UTC : The main API is reachable; we are investigating the root cause, which seems related to the database.
EDIT 16:40 UTC : We have detected that the database was lacking capacity. We have increased the capacity, rebooted the database, and are deploying the API again.
EDIT 18:00 UTC : The main API is reachable.
]]>All the applications on that HV are being redeployed. A few add-ons that are on it are unavailable.
The hypervisor would not reboot from our OVHcloud interface. We contacted their support and they brought it back up.
12:28 UTC: the HV is running, we are starting the cleaning procedure and making sure all the add-ons have restarted correctly.
]]>All the access logs are still stored, but the API will not give you the recent ones (up to two weeks).
Edit 2023-11-14: we are still working on making the accesslogs available from the API.
]]>update 08:40 UTC - the reverse proxies have been resynchronized. We are watching it and looking for the reason of the desynchronization.
]]>Update 20:04 UTC - We have fixed the broker issue and restarted every service that failed to reconnect. The situation is back to normal.
]]>These issues impact TLS and the ability to answer correctly.
EDIT 20:32 UTC - fixed.
Edit Sat Oct 28 14:51 2023 UTC: The infrastructure has been scaled up and optimizations on the LBs are underway; you may still experience errors during queries.
Edit 15:00 UTC : We are starting to roll the load balancer records for domain.par.clever-cloud.com.
Edit 15:50 UTC : We have finished rolling the first IP address (46.252.181.103); the next ones should be faster.
Edit 16:00 UTC : We have removed the second record (46.252.181.104); we are waiting for the TTL to expire before proceeding.
Edit 16:10 UTC : We have added back the second record (46.252.181.104); we are waiting for the TTL to expire before going further.
Edit 16:15 UTC : We have removed the third record (185.42.117.108); we are waiting for the TTL to expire before proceeding.
Edit 16:25 UTC : We have added back the third record (185.42.117.108); we are waiting for the TTL to expire before going further.
Edit 16:30 UTC : We have removed the fourth and last record (185.42.117.109); we are waiting for the TTL to expire before proceeding.
Edit 16:40 UTC : We have added back the fourth record (185.42.117.109); we have finished the maintenance.
Edit 17:38 UTC: We have an increase in TLS errors for incoming requests, we are looking into it.
Edit 18:08 UTC: We found a potential issue. We are deploying a fix and will monitor the situation closely.
Edit 19:06 UTC: The fix has been deployed since 18:55 and we are monitoring the situation
Edit D+1 16:00 UTC : We have found the issue with the update and patched the software. We will apply it in a few moments.
Edit D+1 16:30 UTC : We will update the first IP address, 46.252.181.103.
Edit D+1 17:15 UTC : We have updated the second IP address, 46.252.181.104; we will begin the third, 185.42.117.108.
Edit D+1 17:30 UTC : We have updated the fourth IP address, 185.42.117.109.
Edit D+1 18:30 UTC : We have finished the operation; we are monitoring.
EDIT 13:00 UTC: The problem has been fixed and will be investigated further to pinpoint the origin.
EDIT 13:30 UTC: We have applied a patch to solve the issue.
07:34 UTC : We have fixed the issue and keep monitoring it.
13:00 UTC: The issue did not occur again. This incident is now over.
]]>There are 3 kinds of Logs :
There are multiple uses of Metrics data:
Edit 18:08 UTC: We are starting the maintenance operation with the redeployment of apps that depend on tokens (Grafana, scheduler, etc.).
Edit 18:11 UTC: Grafana is being shut down to reconfigure the managed service behind it.
Edit 18:40 UTC: The token manager is successfully up to date. Apps are being redeployed to switch their metrics endpoint.
Edit 18:46 UTC: Web console metrics are unavailable for a few minutes (this is expected).
Edit 19:31 UTC: The web console now has server metrics available.
Edit 20:16 UTC: All Grafana dashboards are back online. If you encounter an "Error 500: invalid token" issue, you can go to your org home page > Metrics in Grafana > and click on the RESET ALL DASHBOARDS button.
Edit 21:20 UTC: Only access-log-based dashboards remain unavailable.
This maintenance concerns DEV PostgreSQL services on the Paris (PAR) region. Applications using those services will be impacted.
For this reason, we have deployed a new cluster running version 15. Starting from today, you can already migrate your DEV add-on to this new cluster, and by Thursday at the latest we will automatically migrate all add-ons that are compatible with PostgreSQL version 15.
For incompatible add-ons, we are planning a maintenance to update the par dev cluster. This maintenance will take place on Thursday the 26th of October 2023, between 15:00 UTC+2 and 17:00 UTC+2.
For the entire duration of the update, services will be unavailable. The time required to perform the update is estimated between 1 and 2 hours. However, total downtime might be longer as every application using the cluster will need to be restarted.
In case you have connection issues after those updates, you can manually trigger a redeployment of your linked applications.
If you do not want to be impacted by your DEV add-on being offline, you can still order or migrate to a dedicated one before this maintenance starts.
Our support team is available for any questions via the ticket center in the console.
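To check which PostgreSQL major version your add-on is actually running before or after the migration, a minimal sketch with psycopg2 works; POSTGRESQL_ADDON_URI is assumed to be the connection string exposed to your linked application (adjust to however you store your credentials).

```python
# Hedged sketch: print the PostgreSQL server version of the add-on.
import os
import psycopg2

conn = psycopg2.connect(os.environ["POSTGRESQL_ADDON_URI"])  # assumed variable name
with conn, conn.cursor() as cur:
    cur.execute("SHOW server_version;")
    print("server version:", cur.fetchone()[0])
conn.close()
```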
EDIT 2023-10-25 15:00 UTC+2: We have delayed the maintenance to 15:00 UTC+2 on the 26th of October 2023.
EDIT 2023-10-26 15:00 UTC+2: Most of the DEV add-ons have been migrated; we are going to start the maintenance.
EDIT 2023-10-26 15:35 UTC+2: The dev cluster par-postgresql-c4 is back online.
EDIT 2023-10-26 16:30 UTC+2: Everything is now back to normal. The maintenance is over.
]]>The maintenance will take place on Sunday 22 October 2023, between 14:00 UTC+2 and 20:00 UTC+2.
During the maintenance, applications and add-ons in this region may experience unexpected connection closes or resets, especially on long-running connections, beginning at 16:00 UTC+2. To prevent issues, you can restart your application if you see connection errors (see the reconnection sketch after the timeline below).
To check which of your services are impacted, you can consult the information section of your applications and see the region where your application is deployed.
14:45 UTC+2 : We are beginning the preparation steps to update the load balancers that receive cleverapps.io traffic.
16:00 UTC+2 : We have identified a bug, so we are skipping the update of the cleverapps.io load balancers for now.
16:30 UTC+2 : We are beginning the update of the last load balancer.
18:00 UTC+2 : We will soon update DNS records to send traffic to the new load balancers.
18:15 UTC+2 : DNS records have been updated.
18:20 UTC+2 : Monitoring is green; the maintenance is done.
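Regarding the advice above about long-running connections, here is a minimal reconnection sketch (assuming a PostgreSQL add-on, psycopg2, and a DATABASE_URI environment variable; adapt to your own driver and configuration) that is enough to ride out the brief resets during the rolling update.

```python
# Minimal reconnection sketch for long-running connections (referenced above).
# Assumptions: a PostgreSQL add-on, psycopg2, and a DATABASE_URI environment
# variable holding the connection string; adapt to your driver and settings.
import os
import time
import psycopg2

def query_with_retry(sql, retries=3, delay=2):
    last_error = None
    for attempt in range(1, retries + 1):
        conn = None
        try:
            conn = psycopg2.connect(os.environ["DATABASE_URI"])
            with conn, conn.cursor() as cur:
                cur.execute(sql)
                return cur.fetchall()
        except psycopg2.OperationalError as exc:
            # Connection closed or reset by the load balancer: back off, retry.
            last_error = exc
            print(f"attempt {attempt} failed: {exc}")
            time.sleep(delay)
        finally:
            if conn is not None:
                conn.close()
    raise RuntimeError(f"database unreachable after {retries} attempts: {last_error}")

print(query_with_retry("SELECT 1;"))
```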
]]>As a result, we will need to shutdown the deployment component for approx. 1 hour.
The maintenance is over, deployments are now usable again.
]]>The maintenance will take place on Saturday 21 October 2023, between 14:00 UTC+2 and 20:00 UTC+2.
During the maintenance, applications and add-ons in this region may experience unexpected connection closes or resets, especially on long-running connections, beginning at 16:00 UTC+2. To prevent issues, you can restart your application if you see connection errors.
To check which of your services are impacted, you can consult the information section of your applications and see the region where your application is deployed.
14:15 UTC+2 : The maintenance will start soon; we are finishing the preparation steps.
15:15 UTC+2 : The preparation steps took more time than estimated; we are rolling out a configuration update on the dedicated load balancers.
16:15 UTC+2 : The rolling update of the dedicated load balancers is finished; we are beginning the public shared load balancers.
17:15 UTC+2 : We are updating the domain name resolutions for the public shared add-on load balancer.
18:30 UTC+2 : We have updated two of the eight servers of the public shared add-on load balancer.
19:00 UTC+2 : We have updated four of the eight servers.
19:15 UTC+2 : We have updated six of the eight servers.
19:15 UTC+2 : We have updated seven of the eight servers.
19:50 UTC+2 : We have updated all servers of the public shared add-on load balancer. As it is late and we are reaching the end of the window, we will update the last load balancers tomorrow afternoon.
This maintenance concerns DEV MySQL services on the Paris (PAR) region. Applications using those services will be impacted.
Only the par dev cluster will be updated during this maintenance.
The maintenance will take place on Monday 23rd of October 2023, between 11:45 UTC+2 and 15:00 UTC+2.
For the entire duration of the update, the services will not be available.
The time required to perform the update is estimated between 1 and 2 hours. However, total downtime might be longer as every application using the cluster will need to be restarted.
In case you have connection issues after those updates, you can manually trigger a redeployment of your linked applications.
If you do not want to be impacted by your DEV add-on being offline, you can still order or migrate to a dedicated one before this maintenance starts.
Our support team is available for any questions via the ticket center in the console.
EDIT 2023-10-23 11:50 UTC+2: Maintenance is starting.
EDIT 2023-10-23 12:20 UTC+2: Dev add-ons are now available again. We will restart linked applications.
EDIT 2023-10-23 12:22 UTC+2: We are investigating an error when creating new DEV add-ons.
EDIT 2023-10-23 12:40 UTC+2: New DEV add-ons can now be created. All applications linked to DEV add-ons are currently restarting.
EDIT 2023-10-23 13:00 UTC+2: All applications linked to DEV add-ons have restarted.
Downtime is expected to last between 30 minutes to 1 hour.
[14:15 UTC] All notifications services are now up and running
]]>The configuration has been fixed at 21:00 UTC and disk access time are now in the normal range. We will keep monitoring the situation in the upcoming days to make sure performance stays in normal ranges.
If your application uses the MongoDB connection URI correctly, the maintenance should only disrupt it for a few seconds. Otherwise, expect up to two hours of maintenance.
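As a hedged illustration of what "using the MongoDB URI correctly" can look like from an application: connect with the add-on connection string and let the driver retry across the brief failover. MONGODB_ADDON_URI is assumed to be the standard add-on variable, and pymongo is one driver choice among others.

```python
# Hedged sketch: connect through the add-on URI and tolerate a short failover.
import os
from pymongo import MongoClient
from pymongo.errors import AutoReconnect

client = MongoClient(
    os.environ["MONGODB_ADDON_URI"],  # assumed variable name
    retryWrites=True,                 # retry a write once after a failover
    serverSelectionTimeoutMS=10_000,  # wait up to 10 s for a new primary
)

try:
    client.admin.command("ping")
    print("connected:", client.address)
except AutoReconnect as exc:
    print("temporarily unavailable, retry shortly:", exc)
```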
]]>The fetch of logs can take a while.
EDIT 21:37 UTC - fixed.
]]>The logs collection (logs drains too) will be unavailable during the maintenance.
EDIT 00:00 UTC: The maintenance is now over.
]]>FS Bucket hosts that will be updated during this maintenance are: n19 and n20.
The maintenance will take place on Friday 20 October 2023, between 12:00 UTC+2 and 14:00 UTC+2.
During the update of each server host, the services will only be available in read-only mode. Once the update is complete, linked applications will be restarted automatically to take into account the environment variables of the updated services and to restore write capacity.
The required update time is estimated at 1 hour but the total time until the applications are restarted might be longer.
In case you have write issues after those updates, you can manually initiate a redeployment of your linked applications in order to avoid waiting for the automatic redeployment.
To check if your services are impacted, you can consult your FS Bucket’s server in the Dashboard tab of your add-ons, in the “Cluster information” section and thus determine the update day(s) that concerns you.
Please check if you have any old applications (>5 years) that are still using a buckets.json file in their code repository, as we will not be able to prioritize the redeployment of these applications and they will most likely suffer from read-only FS Bucket for an extended time. We therefore recommend that you now mount FS Bucket by environment variable (ideally by linking the add-on to your application). See more details in this documentation page: https://www.clever-cloud.com/doc/deploy/addon/fs-bucket/#configuring-your-application
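As a quick way to tell whether your bucket has left read-only mode after the update, a small write probe against the mount point works; the path below is a placeholder for wherever the FS Bucket is mounted in your application.

```python
# Hedged sketch: probe whether the FS Bucket mount is writable again.
import os
import uuid

MOUNT_POINT = "/my-fs-bucket"  # placeholder mount path

probe = os.path.join(MOUNT_POINT, f".write-probe-{uuid.uuid4().hex}")
try:
    with open(probe, "w") as f:
        f.write("ok")
    os.remove(probe)
    print("bucket is writable")
except OSError as exc:
    print("bucket is still read-only:", exc)
```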
Our support is available for any questions via the ticket center in the console.
EDIT 2023-10-20 12:00 UTC+2: Maintenance is starting.
EDIT 2023-10-20 13:24 UTC+2: Applications are currently redeploying.
EDIT 2023-10-20 14:20 UTC+2: Applications have redeployed. We are cleaning things up.
EDIT 2023-10-20 16:12 UTC+2: The maintenance is over.
]]>FS Bucket hosts that will be updated during this maintenance are: n10 and n17.
The maintenance will take place on Thursday 19 October 2023, between 12:00 UTC+2 and 14:00 UTC+2.
During the update of each server host, the services will only be available in read-only mode. Once the update is complete, linked applications will be restarted automatically to take into account the environment variables of the updated services and to restore write capacity.
The required update time is estimated at 1 hour but the total time until the applications are restarted might be longer.
In case you have write issues after those updates, you can manually initiate a redeployment of your linked applications in order to avoid waiting for the automatic redeployment.
To check if your services are impacted, you can consult your FS Bucket’s server in the Dashboard tab of your add-ons, in the “Cluster information” section and thus determine the update day(s) that concerns you.
Please check if you have any old applications (>5 years) that are still using a buckets.json file in their code repository, as we will not be able to prioritize the redeployment of these applications and they will most likely suffer from read-only FS Bucket for an extended time. We therefore recommend that you now mount FS Bucket by environment variable (ideally by linking the add-on to your application). See more details in this documentation page: https://www.clever-cloud.com/doc/deploy/addon/fs-bucket/#configuring-your-application
Our support is available for any questions via the ticket center in the console.
]]>FS Bucket hosts that will be updated during this maintenance are: n15 and n16.
The maintenance will take place on Wednesday 18 October 2023, between 12:00 UTC+2 and 14:00 UTC+2.
During the update of each server host, the services will only be available in read-only mode. Once the update is complete, linked applications will be restarted automatically to take into account the environment variables of the updated services and to restore write capacity.
The required update time is estimated at 1 hour but the total time until the applications are restarted might be longer.
In case you have write issues after those updates, you can manually initiate a redeployment of your linked applications in order to avoid waiting for the automatic redeployment.
To check if your services are impacted, you can consult your FS Bucket’s server in the Dashboard tab of your add-ons, in the “Cluster information” section and thus determine the update day(s) that concerns you.
Please check if you have any old applications (>5 years) that are still using a buckets.json file in their code repository, as we will not be able to prioritize the redeployment of these applications and they will most likely suffer from read-only FS Bucket for an extended time. We therefore recommend that you now mount FS Bucket by environment variable (ideally by linking the add-on to your application). See more details in this documentation page: https://www.clever-cloud.com/doc/deploy/addon/fs-bucket/#configuring-your-application
Our support is available for any questions via the ticket center in the console.
]]>FS Bucket hosts that will be updated during this maintenance are: n12 and n13.
The maintenance will take place on Tuesday 17 October 2023, between 12:00 UTC+2 and 14:00 UTC+2.
During the update of each server host, the services will only be available in read-only mode. Once the update is complete, linked applications will be restarted automatically to take into account the environment variables of the updated services and to restore write capacity.
The required update time is estimated at 1 hour but the total time until the applications are restarted might be longer.
In case you have write issues after those updates, you can manually initiate a redeployment of your linked applications in order to avoid waiting for the automatic redeployment.
To check if your services are impacted, you can consult your FS Bucket’s server in the Dashboard tab of your add-ons, in the “Cluster information” section and thus determine the update day(s) that concerns you.
Please check if you have any old applications (>5 years) that are still using a buckets.json file in their code repository, as we will not be able to prioritize the redeployment of these applications and they will most likely suffer from read-only FS Bucket for an extended time. We therefore recommend that you now mount FS Bucket by environment variable (ideally by linking the add-on to your application). See more details in this documentation page: https://www.clever-cloud.com/doc/deploy/addon/fs-bucket/#configuring-your-application
Our support is available for any questions via the ticket center in the console.
EDIT 2023-10-17 12:10 UTC+2: The maintenance is starting. FSBucket servers are set in read-only mode.
EDIT 2023-10-17 12:47 UTC+2: Applications are being redeployed to use the new FSBucket server. You can also start a deployment on your side to speed things up.
EDIT 2023-10-17 16:15 UTC+2: The maintenance is over. All applications should now have access to their fsbucket since 14:00 UTC+2. Please reach out to our support team if you have any issues following this maintenance.
]]>PHP FTP hosts that will be updated during this maintenance are: n11 and n18.
The maintenance will take place on Monday 16 October 2023, between 12:00 UTC+2 and 14:00 UTC+2.
During the update of each server host, the services will only be available in read-only mode. Once the update is complete, linked applications will be restarted automatically to take into account the environment variables of the updated services and to restore write capacity.
The required update time is estimated at 1 hour but the total time until the applications are restarted might be longer.
In case you have write issues after those updates, you can manually initiate a redeployment of your PHP+FTP applications in order to avoid waiting for the automatic redeployment.
Our support is available for any questions via the ticket center in the console. This maintenance will be updated during the maintenance window.
EDIT 2023-10-16 12:03 UTC+2: The maintenance will begin shortly. FSBucket add-on hosted on those servers will soon become read-only.
EDIT 2023-10-16 12:09 UTC+2: FSBuckets are now read-only
EDIT 2023-10-16 12:54 UTC+2: Applications are being redeployed to use the new FSBucket server. You can also start a deployment on your side to speed things up.
EDIT 2023-10-16 14:04 UTC+2: The maintenance is over. All applications should now have access to their fsbucket since 13:30 UTC+2. Please reach out to our support team if you have any issues following this maintenance.
]]>EDIT 15:49 UTC: All services are now reachable again, the incident is now over.
]]>EDIT 10:00 PM: lag has been fully absorbed
]]>It may impact some databases that are hosted on top of this hypervisor.
06:20 The hypervisor seems to have encountered a kernel panic. It has been rebooted and we have fixed the kernel version to avoid future kernel panics.
06:45 Now that the hypervisor is back up, we are cleaning up: checking that all add-on instances have rebooted successfully and that all applications have redeployed successfully.
07:21 Everything is now back to normal
]]>As a result, deployments on all our zones will be disabled between 11:00 and 01:00 (2023-10-07).
EDIT 2023-10-07 01:32 UTC: The maintenance is over. Deployments are now working again.
]]>09:25 PM UTC: Network is back online. We are bringing back services which are not healthy.
09:30 PM UTC: The network is still flapping; we are on it.
10:06 PM UTC: Network seems stable. We are bringing back services which are not healthy.
10:37 PM UTC: We are bringing back services which are not healthy.
10:52 PM UTC: all services are back online. Good night.
]]>ssh -t ssh@sshgateway-clevercloud-customers.services.clever-cloud.com
or using our CLI command clever ssh
will be unavailable. Existing SSH connections through the gateway to services will be interrupted.
Once the maintenance is over, it is possible that some applications will need to be restarted to be able to be accessed through the SSH Gateway again.
The maintenance is planned to last less than 30 minutes.
EDIT 20:05 UTC: The maintenance is starting.
EDIT 20:27 UTC: The maintenance is now over. We are monitoring the results. You should now be able to access your services using the SSH gateway.
EDIT 20:27 UTC: Everything is working as intended. If you have any issues using the SSH gateway, you can try to redeploy your service and contact our support team.
EDIT 09:06 UTC: We implemented a fix and are monitoring the results. If you pushed new commits that didn't get deployed, you can either contact us through the support with your application id and the associated commit, or use our CLI with clever restart --commit <commit>.
EDIT 12:25 UTC: The incident is now resolved.
]]>The maintenance is expected to last less than 30 minutes.
EDIT 22:00 UTC: The maintenance is starting
EDIT 22:35 UTC: The maintenance is mostly over. Deployments and git repositories have been back for 15 minutes. We continue to make sure everything is running smoothly.
EDIT 23:05 UTC: Everything is back to normal since 22:20 UTC. The maintenance is over. Thanks for your patience.
EDIT 02:55 PM UTC: The network instability has been fixed. Some customers may experience a connection reset.
EDIT 14:46 UTC: This incident is now over. No more incorrect Monitoring/Unreachable alerts were emitted.
]]>EDIT 02:46 PM UTC: Latencies have been fixed by rebalancing data
EDIT 27/09 at 13:00 UTC: Queries were not available
EDIT 27/09 at 14:10 UTC: Queries are open again
We continue to investigate
EDIT 27/09 at 16:10 UTC: Closing incident
]]>EDIT 15:02 UTC: we deployed a new version of the API that will survive future pulsar outages.
]]>We have identified the issue and are working on it.
EDIT 13:31 UTC - we are still working on the issue.
EDIT 14:44 UTC - we are still working on the issue.
EDIT 16:09 UTC - fixed.
EDIT 13:45 UTC : We have found a network issue which caused storage nodes to time out and then crash. Those nodes are now up and running; we are beginning the recovery process.
EDIT 15:10 UTC : We have finished the recovery process and are consuming the lag.
EDIT 18:52 UTC : We have almost consumed all the data lag (an estimated 30 minutes left), but there are still 2 hours of metadata lag.
EDIT 21:00 UTC: We have caught up the data and metadata lag; queries are now open.
]]>EDIT 12:56 UTC: The main issue is now resolved and the API is back online. We continue to see some errors and are working towards identifying their source.
EDIT 14:25 UTC: The API has stabilized but we are still looking for the origin of the troubles.
EDIT 13/09 09:03 UTC: The API is unreachable again, we are working on it
EDIT 13/09 09:15 UTC: The API is now operational, the root cause has been identified.
]]>Our main API may be unavailable for 1 hour.
EDIT 00:30 UTC: The maintenance has been over for 25 minutes. We are monitoring the results.
]]>The maintenance is planned for one hour but is expected to last a few minutes at most.
EDIT 20:00 UTC: The maintenance is starting.
EDIT 20:02 UTC: The API is now unavailable as well as the Console.
EDIT 20:16 UTC: One of the steps took a bit more time than expected, we are back on track.
EDIT 20:44 UTC: Unexpected problems occurred and we are currently doing a rollback of the changes.
EDIT 20:54 UTC: The maintenance is over; the changes were rolled back and everything should now be operational again.
EDIT 18:50 UTC: The hypervisors have been back online for 25 minutes now; all services were restarted by our monitoring.
]]>EDIT 18:45 - The hypervisor had a kernel panic. During the reboot operation the kernel has been upgraded and this issue should not occur again.
]]>EDIT 06:34 UTC: The problem is back with elevated packet loss. Our network provider is currently having an incident and is looking into the issue.
EDIT 06:46 UTC: Some DNS domains for services hosted on other regions may also have issues to resolve because their authoritative server is currently hosted on the PAR region.
EDIT 06:55 UTC: The incident is still ongoing and our network provider is still looking into the issue.
EDIT 07:20 UTC: Our upstream network provider is currently experiencing a DDoS attack. We are currently looking to use an alternative network transit to avoid going through the upstream network provider.
EDIT 07:47 UTC: We are seeing improvements for the last 20 minutes. We still are waiting for a confirmation of the issue resolution.
EDIT 07:58 UTC: We are seeing some loss again.
EDIT 08:15 UTC: The DDoS is still happening. It's partially mitigated. We still see some loss, but there is less impact globally.
EDIT 10:54 UTC: We still see loss from time to time, but much less than before. We are keeping an eye on the situation.
EDIT 15:45 UTC: Most of the DDoS is mitigated; we haven't seen any loss these past few hours. We are still monitoring the situation.
EDIT 2023-09-06 15:24 UTC: No more instabilities were detected since yesterday. The incident is now over.
[EDIT] 21:38 UTC the root cause was identified and a patch deployed
]]>Metrics on the proxies seem ok. We are investigating why they are acting like that.
It seems some applications were causing connections to queue up, blocking new connections. We are looking into ways to prevent this from happening.
The issue is resolved
]]>EDIT 14:53 UTC: We are seeing signs of improvements since 14:50. We continue monitoring the situation.
EDIT 15:23 UTC: We confirm that the issue has been resolved since 14:50. Sorry for the inconvenience this incident may have caused.
EDIT 11:00 UTC: The problem is now resolved. Some logs may have been lost for that period. We apologize for the inconvenience.
]]>EDIT 18:00 UTC: The issue is resolved
]]>Some applications may have been redeployed multiple times with the Monitoring/Unreachable reason. Most of those deployments were false positives. Other applications may currently have troubles deploying.
We are working on restoring the service.
EDIT 15:49 UTC: The underlying issue has been found and fixed. Some deployments may have failed even when there was no reason for them to fail. You can start them again if needed. If you still have deployment issues, feel free to reach out to our support team.
]]>The issue has been identified and has been fixed.
]]>EDIT 12:39 UTC : The ip address is reachable
]]>EDIT 09:18 UTC : We are recovering from the events and consuming the lags. The storage layer is now operational
]]>EDIT 22:48 PM UTC: all reverse proxies are now working properly
]]>Edit 12:50 UTC: Control plane has recovered, everything is now OK
]]>EDIT 15:05 UTC: The issue appears to be limited to the Paris zone
EDIT 15:20 UTC: A counter measure has been deployed to mitigate issues. Deployments are now scheduled as expected. Some errors may still appear in your Logs. We're processing stuck deployments, but you may cancel or start a new one if you want to prioritize your deployment.
EDIT 15:00 UTC : We found that the NS and SOA records were incorrect; we have updated them.
EDIT 16:00 UTC: Everything is back to normal.
]]>EDIT 09:00 UTC The issue was found and fixed
21:11: only apps with redirect_https enabled are impacted
21:56: we rolled back to the old cleverapps load balancers
]]>EDIT 16:44 UTC: The root cause has been identified and a fix has been applied. We are monitoring the results.
EDIT 16:50 UTC: The service is now operational.
]]>EDIT 10:49 UTC: The underlying issue has been fixed. Some applications may have had troubles mounting FSBuckets, writing or reading files stored on that server between 08:50 UTC and 10:25 UTC. Impacted applications are currently being redeployed out of caution (most of them successfully reconnected to the server after the fix has been issued).
]]>Edit 14:10 UTC: query is re-open
We continue to investigate.
]]>EDIT 16:27 UTC: The hypervisor took some time to reboot but it is now up and running. We are making sure services are working fine following this incident.
EDIT 17:10 UTC: The incident is now over. The underlying problem has been identified but the hypervisor is currently in the upgrade queue.
]]>EDIT 18:14 UTC: The maintenance is starting
EDIT 22:00 UTC: The maintenance is now over
]]>EDIT 18:13 UTC: The maintenance is starting
EDIT 23:11 UTC: The maintenance is now over
]]>EDIT 20:43 UTC - fixed.
]]>Edit 04:58 PM UTC: A storage node had a hardware issue, it has been rebooted.
Maintenance will start on the 20th of June, at 03:30 PM UTC.
Edit 03:45 PM UTC: maintenance is starting.
Edit 04:58 PM UTC: maintenance is over.
]]>We are investigating the issue.
EDIT 09:00 PM UTC: the root cause has been corrected.
Maintenance will start on the 18th of June, at 02:30 PM UTC.
EDIT 02:36 PM UTC: maintenance is starting.
Edit 08:21 PM UTC: maintenance is still on-going, storage layer is a few minutes late on average.
EDIT 08:51 PM UTC: maintenance is over, we are catching up lag
EDIT 08:00 PM UTC: An error while catching up the lag has put the storage layer into an inconsistent state. Queries are disabled for now.
EDIT 11:00 PM UTC: storage layer is still inconsistent
EDIT 00:47 PM UTC D+1: storage layer is (finally?) consistent. We are catching up the lag
EDIT 04:30 PM UTC D+1: We have caught up the lag.
EDIT 07:29 AM UTC D+1: The storage layer developed inconsistencies. We are investigating why.
EDIT 08:10 AM UTC D+1: The storage layer is up and running. We are consuming the lag. Queries are disabled during this phase.
EDIT 08:45 AM UTC D+1: We have consumed the lag. Queries are available.
EDIT 00:50 UTC : The monitoring does not see network issues anymore.
EDIT 01:00 UTC : The monitoring has detected connectivity issues, we are fixing.
EDIT 01:30 UTC : The monitoring has detected new connectivity issues, we are on it.
]]>08:10 UTC : We have found the component causing this issue and restarted it. We are still investigating the root cause.
21/06 : The problem was most likely caused by the network instability observed at this time. We haven't detected any problems since.
]]>19:57 UTC: We are going to reboot it. Some databases (that run on this hypervisor) will become unresponsive for a few minutes.
20:18 UTC: Hypervisor has been rebooted. All services hosted on it have been checked: everything is up and running.
Logs show a kernel panic.
]]>EDIT 09:30 UTC : Following the incident https://www.clevercloudstatus.com/incident/669, the storage layer did not perform scheduled tasks.
EDIT 09:45 UTC : The storage layer is accepting writes. The logging system is operating normally.
]]>EDIT 00:27 UTC: The issue has been identified and fixed around 00:11 UTC. We continue identifying the impact on customer and internal services.
EDIT 01:00 UTC: We have identified the services impacted by the incident and have started to recover from the network issue. The identified impacted services are Metrics and access logs, which are taking time to recover; other services should be working normally.
EDIT 02:30 UTC: Metrics and access logs are recovering from the network issue.
EDIT 04:00 UTC: Metrics and access logs are still recovering from the network issue. To follow the incident, you can go to https://www.clevercloudstatus.com/incident/669
]]>EDIT 06:05 UTC: The storage layer is now up and healthy. We are now consuming the ingestion lag, it should take a few hours to fully resolve. Queries are now available but will show outdated data. We will update this status accordingly.
EDIT 10:00 UTC: We've had a slower ingestion than initially anticipated so queries are still returning out of date data. We've made some adjustments and saw an increase in ingestion for the last hour. We will still need a few hours to fully consume the lag.
EDIT 15:00 UTC: The lag has been consumed, the metrics and access logs stack is operating normally.
]]>EDIT 08:32 UTC : We have found the issue and the hypervisor is rebooting
EDIT 08:50 UTC: The hypervisor has finished rebooting and services are working.
]]>EDIT 11:40 UTC: All impacted applications have been redeployed automatically. We will investigate further why this server rebooted. The incident is now over.
]]>12:26 UTC: we restarted the node responsible for the issue. While it re-converges, we stop the egress servers. We will put them back on in a few minutes.
13:31 UTC: Query is back online. We are still catching up the lag, so new datapoints may not be available
14:35 UTC: The lag has been caught up.
]]>EDIT 16:00 UTC : The storage layer is restarted and we are consuming the ingestion lag
]]>We will investigate to understand why this hypervisor rebooted in the first place.
]]>EDIT 13:00 UTC : We have located the root cause, we are applying a fix.
EDIT 14:20 UTC : The issue is resolved
]]>EDIT 03:58 UTC: server is back online. All databases should now be reachable.
EDIT: The lag has been caught up.
]]>2023-06-02 09:15 UTC : after customer complaints we found out about the LB misconfiguration and fixed it.
2023-06-02 09:28 UTC : Monitoring checks have been added to catch this kind of issue right away.
EDIT 17:51 UTC : We have found the issue and the fix has been applied. Everything is operating normally.
]]>EDIT 13:30 UTC: Our ticket center provider told us that the issue has been mitigated on their end and that it is now resolved. We keep monitoring the situation for now but we can indeed see that service are operating normally those last few minutes.
EDIT 14:47 UTC: We did not see any other issues. We consider this incident to be over.
]]>EDIT 11:46 UTC : We have found the issue and fixed it. We are recovering the lag.
EDIT 13:19 UTC: The lag has been consumed; everything is operating normally.
]]>We are awaiting information from our infrastructure provider regarding this incident.
EDIT 19:53 UTC: It seems like multiple servers are impacted at the same time, we believe it to be an issue with a specific OVH rack or room. Multiple services on the zone are thus impacted. We are looking at ways to mitigate the issues.
EDIT 20:05 UTC: The servers have been reachable again for a few minutes. We are currently making sure everything is fine. The OVH incident can be followed here: https://bare-metal-servers.status-ovhcloud.com/incidents/k664s90jxfj0
EDIT 20:15 UTC: Servers in the impacted rack couldn't reach each other up until now, which may have prevented some services from working correctly. It seems like OVH fixed it before we could report it to them. We are continuing to make sure everything is working as expected.
EDIT 20:36 UTC: The incident is over. We are redeploying all the applications of the zone to be on the safe side.
]]>EDIT 14:14 UTC: Metrics ingestion is now back to normal. Access logs are being re-queued and are currently lagging a bit.
EDIT 14:20 UTC: Access logs have been ingested and are now up-to-date. The incident is now over.
EDIT 16:25 UTC: The problem came back, we are working on it.
EDIT 16:56 UTC: The problem is now solved again. Another root cause has been identified and has been fixed.
]]>EDIT 15:05 UTC: The issue has been found and fixed. Performance went back to normal around 13:45 UTC. Additional measures will be taken to avoid this issue in the future.
]]>09:30 UTC : A huge number of add-ons recently created by malicious users was detected. They were issuing a lot of configuration changes on our reverse proxies, making them unstable.
We banned those users and are watching the situation closely.
]]>12:20 UTC: The deployments are still running slowly. We are still cleaning up the situation.
13:16 UTC: we have found a deployment loop with the monitoring. We are stopping it…
13:51 UTC: cleaning is done, we are watching to see if deployments are running as expected
14:00 UTC: we have found an abnormal behaviour, we are investigating
D+1 14:30 UTC: we have made a patch for the abnormal behaviour and we are watching deployments
]]>--
Thursday, April 27 at 2:00 PM CEST (12:00 UTC), we will apply a major update to the Clever Cloud APIs.
This update prepares work for future and current services.
All Clever Cloud public regions are concerned. Gov and Private regions aren't concerned, nor are On Premise regions.
All Applications and Cloud Services will continue to run as expected.
Some API calls may be delayed or refused for a few minutes. Deployments may take a bit longer than expected.
We expect services to be fully operational by 3 PM CEST.
If you manage your own scaling, please make sure your capacity requirements are fulfilled by 1 PM CEST, since autoscaling won't be as reactive during the maintenance window.
--
We will keep you posted on the process here and via this Twitter thread.
EDIT 20:18:00 UTC : The issue is mitigated and we are watching
EDIT 20:50:00 UTC : Everything is back to normal levels
]]>EDIT 25 of April 08:04 AM UTC: We are still experiencing some deployment issues. The issue has been identified and we are working on a fix.
]]>EDIT 8:00 AM UTC: VMs are no longer stuck.
Reason: a malicious user found a way to start a lot of huge instances and run resource-heavy cryptomining operations. This loaded the hypervisors and made some APIs unresponsive. We blocked them and took action to prevent future abuse of our service.
]]>EDIT 10:40 AM UTC: We have found the issue and we are currently fixing it.
EDIT 04:01 PM UTC: The issue is resolved
]]>EDIT 09:35 UTC: The problem should now be mostly resolved. Some services might still have troubles, dedicated incidents will be opened. We continue monitoring the situation.
]]>EDIT 12:42 UTC: The maintenance operation is completed and no more lag is present
]]>EDIT 20:30 UTC: The problem was due to an increased load and capacity has been added to handle it. We continue to monitor the incident.
EDIT 00:53 UTC: We did not see any other issues since 20:30 UTC. This problem is now fixed.
]]>We have disabled the creation of new MongoDB DEV plans. This will give us time to set up a new cluster and clean the existing one.
You can still provision the other MongoDB plans.
]]>It seems that the cluster got a lot of connections and could not handle the load.
The cluster is currently reconstructing. Waiting for it to finish.
19:40 The cluster has finished reconstructing and is now taking connections.
]]>EDIT 09:25 UTC : We have begun the recovery process. We are waiting for the process to terminate.
EDIT 09:32 UTC : The recovery process has ended successfully; the cluster is healthy.
]]>EDIT 15:00 UTC : The deployment system is now in sync and previously frozen deployments are up and running
]]>EDIT 13:56 UTC : A node crashed. The metrics storage layer has finished its recovery process; it will take 20 minutes to consume the lag.
EDIT 14:22 UTC : The lag has been consumed and the metrics storage layer is operating normally.
]]>EDIT 8:31 UTC: hypervisor is up and running.
]]>EDIT 18:56 UTC: To be more specific about the instabilities, the connections were slower to be processed, increasing the response time, sometimes drastically. The root cause has been found and fixed at 18:42 UTC. Since then, everything is back to normal. We continue to monitor the situation.
EDIT 19:11 UTC: Additional investigation will be performed to pinpoint the exact cause of the problem and measures will be added to prevent it from happening again. Sorry for the inconvenience.
]]>During these 30 minutes, some deployments may not go through. Some calls may fail.
Everything seems to have gone well. The operation was over at 21:28.
EDIT 23:15 UTC: It seems like some application creations are having issues following this change, we are investigating.
EDIT 00:10 UTC: A fix has been implemented and applications are now correctly created. Some users may have had the API answer a 200 - OK for application creation but following requests for that application would return a 404 - Not Found. Sorry for the inconvenience.
]]>16:30: We started seeing alerts about high load on the primary node.
17:00: We started getting reports about the cluster being unreachable.
18:00: After checking the cluster, we decided to restart the primary node. Data may have been lost as the node was not writing / replicating correctly. We are still waiting for the primary node to restart. The secondary does not seem to elect itself as primary.
19:30: The secondary finally got promoted to primary. We are blocking users with unfair use of the cluster.
22:45: We detected that the node we restarted failed to get back into the cluster. We decided to remove it entirely and re-create that node from scratch.
2023-03-13 10:00: The node has fully reached the "SECONDARY" state. We put it back into production.
Measures have been taken to prevent future unfair use from users.
]]>11:30 Our main API keeps becoming unresponsive. We are investigating. This impacts the following, in an irregular fashion:
clever ssh may not succeed.
Applications should keep running, but some monitoring deployments may fail.
12:55 The API seems to have stabilized. The database seems to have had a huge load. We are investigating the queries responsible for that load and trying to improve them.
]]>EDIT 17:15 UTC: The issue is now resolved. A part of our infrastructure in Paris couldn't access some public DNS servers anymore, leading to multiple DNS queries failing. An upstream network provider made a change that fixed the problem around 16:52 UTC.
]]>EDIT 16:03 UTC: We are seeing improvements, we continue to monitor the situation and keep investigating the root cause. We continue to add more data collection around the various points of contention.
]]>Update 11:11 AM UTC: The hypervisor has been rebooted, add-ons should be reachable. Root cause of the issue will be determined later. In the meantime, applications hosted on that hypervisor are still redeploying. We continue to monitor the situation.
Update 03:13 PM UTC: the same hypervisor went down again. It has been rebooted. Add-ons should be reachable. In the meantime, applications hosted on that hypervisor are still redeploying. We continue to monitor the situation.
]]>EDIT 14:37 UTC: We are seeing improvements, we continue to monitor the situation.
EDIT 16:23 UTC: The incident is now over.
]]>EDIT 03/03 02:15 PM UTC: Connectivity between MTL and our control plane is now fully restored.
]]>EDIT 10:32 AM UTC: a connectivity issue has been detected between RBX and our control-plane. The issue is now fixed.
]]>The maintenance is expected to last 5 minutes. If you urgently need to contact us, you can send an email to support@clever-cloud.com
EDIT 19:38 UTC: The maintenance is now over. Actions on the ticket center should be fully available. If you encounter any problems following this update, please email us at support@clever-cloud.com
]]>Impacted users will receive an email for each impacted service.
EDIT 2023-02-28 20:25 UTC: The maintenance is starting
EDIT 2023-02-28 22:18 UTC: The maintenance is now over.
]]>EDIT 15:48 UTC: We are seeing improvements and the situation is currently back to normal. The root cause seems to have been a BGP announcement change on GitHub's side that made our traffic go through suboptimal routes, leading to degraded performance. We keep monitoring the situation.
EDIT 16:30 UTC: The incident is fully resolved.
]]>The reboot is planned tonight (15/02/2023) at 22:00 UTC. Maintenance will start at 21:00 UTC.
EDIT 21:07 UTC: The maintenance is starting. Add-ons will be automatically migrated in the next few minutes.
EDIT 22:52 UTC: The maintenance is over.
]]>EDIT 22:47 UTC: The hypervisor is back online and add-ons have been up for a few minutes. The root cause of the issue will be determined later. In the meantime, applications hosted on that hypervisor are still redeploying. We continue to monitor the situation.
EDIT 23:44 UTC: The incident is now over. Sorry for the inconvenience.
]]>EDIT 10:55 UTC: The root cause has been found. It was only impacting multipart uploads. For deployments already at the upload phase, you will need to cancel the current deployment and start a new one for the problem to be fixed. Sorry for the inconvenience.
]]>EDIT 22:24 UTC: The hypervisor has been up again for 10 minutes. Add-ons are available again. We are making sure all applications were redeployed.
EDIT 00:17 UTC: The incident is over.
]]>All services running on that hypervisor are still up and running, but deployments fail to stop the obsolete VMs and we cannot connect to the host itself. We suspect a partial kernel crash on the hypervisor's host. We are investigating and may reboot the hypervisor in the following minutes/hours. (First, we are trying to migrate as many important services as possible to avoid causing too much downtime to our customers.)
EDIT 16:46 UTC: We are starting to migrate add-ons on the impacted hypervisor.
EDIT 18:54 UTC: We rebooted the hypervisor, everything went well, all the remaining services are UP again.
]]>WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!
when pushing code using git+ssh on our Git repositories. This was due to an update of the allowed signature algorithms of our SSH servers. Users that had an old signature algorithm stored in their known_hosts ssh file were impacted.
The change has been rolled back.
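For context on the mechanism: the warning appears when the key your SSH client has stored for the host no longer matches what the server presents. If you want to see which key algorithms your known_hosts file has recorded for a host, here is a minimal sketch in Python (the hostname is a placeholder; hashed entries are skipped; this is not an official Clever Cloud tool). If a stale entry ever needs to be removed, ssh-keygen -R <hostname> is the usual way, but only do so once you have confirmed the key change is legitimate.

```
# Minimal sketch: list the host key algorithms recorded in ~/.ssh/known_hosts
# for a given host. The hostname is a placeholder; hashed entries are skipped.
from pathlib import Path

def known_host_algorithms(hostname: str):
    algorithms = []
    known_hosts = Path.home() / ".ssh" / "known_hosts"
    for line in known_hosts.read_text().splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0].startswith(("#", "@", "|1|")):
            continue  # skip comments, marker lines, hashed hosts, malformed lines
        if hostname in fields[0].split(","):
            algorithms.append(fields[1])  # e.g. "ssh-rsa", "ssh-ed25519"
    return algorithms

if __name__ == "__main__":
    print(known_host_algorithms("git.example.com"))
```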
]]>EDIT 10:00 UTC We have made a hardware upgrade to the MySQL shared cluster
]]>EDIT 01:00 UTC the update deployment has been rolled back
]]>EDIT 9:08 UTC : Backends behind the Clever Cloud API are up and running. The number of timeouts has decreased. Everything is operating normally.
]]>EDIT 10:10 UTC The operation to increase the disk space is done. We are redeploying the associated applications.
]]>EDIT 22:56 UTC : The storage backend has left the read-only mode
]]>** EDIT 13:59 UTC ** One hypervisor is up and running
** EDIT 14:52 UTC ** The second hypervisor is down due to hardware issues
** EDIT 15:22 UTC ** Applications and databases may be difficult to reach as a load balancer node is hosted on the down hypervisor
** EDIT 17:00 UTC ** Deployments may have been impacted, we are redeploying the system
** EDIT 17:30 UTC ** The hypervisor is up and running. We are cleaning up the last things
** EDIT 18:17 UTC ** Hypervisors are up and running. All systems seem to be working normally
]]>04:40 UTC, we took the decision to lower the replication ratio to let the cluster breathe.
A lot of backups failed, though. We will start them again during the day.
]]>EDIT 21:56 UTC: we are experiencing a network connectivity issue, impacting parts of Paris region. Cellar is also impacted.
EDIT 22:01 UTC: Network connectivity is back online. Apps should be reachable. Cellar is in recovery, we are working on it.
EDIT 22:25 UTC: Cellar should be accessible. You may experience a bit more latency due to recovery processes in progress.
EDIT 22:57 UTC: Everything should be up.
]]>Edit 3:25 pm UTC: The hypervisor is back online. All impacted applications have been redeployed. If you are experiencing an issue, please contact our support.
]]>Metrics queries through the Console or Grafana, as well as access logs queries, are currently affected.
EDIT 16:44 UTC: The service is back up, we are starting to process the backlog of events. You should now be able to query the data but it might lag a bit.
EDIT 17:01 UTC: The queue has been ingested. The service is now back to normal. Sorry for the inconvenience
]]>EDIT 13:46 UTC: The slowness is now resolved since 13:35. The initial cause of the slowness has been found and we continue to monitor the situation.
]]>EDIT 15:25 UTC: Deployments are running again. Some more operations will be done in the next few minutes to stabilize the situation. In the meantime, we continue to monitor the health of the deployment system.
EDIT 15:45 UTC: The incident is now over. If you still have troubles deploying your application, please reach out to our support team. Sorry for the inconvenience.
]]>We are working on fixing the clocks on the Ceph monitoring servers. (Ceph is the software we use to provide the Cellar service.)
EDIT 12:40 UTC+1: One of the reverse proxies in front of the Cellar system was desynchronized. This proxy is now out of the pool for further investigation and the issue should now be fixed.
]]>Our metrics and access logs stack is currently unavailable, we are working towards bringing it back up.
Update 9:55 am UTC: Metrics and access log storage is now up. We are catching up the lag
Update 14:33 UTC: The lag of the Metrics and access logs platform is now resolved. Regarding the network instabilities, our network provider identified the issue and is working towards resolving it. It may take a few hours to get back to a nominal situation. We did not see any other instabilities since this morning.
Update 15:59 UTC: Another network issue happened at 15:50 UTC and lasted for ~1 minute; parts of the Paris zone were unreachable during that time.
Update 23:11 UTC: No other incident has been seen, we are still waiting for our network provider to ensure that the issue is resolved on their end.
Update 2023-01-09 14:18 UTC: We've seen two new events, one at 13:23 UTC and another at 14:14 UTC. We notified our network provider. Those may be related to the same problems we've seen last week.
Update 2023-01-09 19:47 UTC: Those two events weren't linked to the ones seen last week. The cause has been identified by the network provider and has been fixed. We are still waiting for confirmation that the original issue is resolved.
]]>Delayed git repository creation for newly created applications
Delayed addition or removal of SSH keys authorized to interact with the git repositories
GitHub applications will not be impacted.
During the maintenance, you will be able to continue to push your updates as well as do deployments. The maintenance is expected to last up to 1 hour. If you have any questions, please reach out to our support team.
EDIT 18:01 UTC+1: The maintenance is starting.
EDIT 18:35 UTC+1: The maintenance is now over. Thanks for your patience.
]]>EDIT 13:45 UTC - done.
]]>EDIT: 2022-12-20 19:47 UTC : During the recovery process, some services went down with TLS issues
]]>Update 4:16 AM UTC: The HV is now up. We are running the cleanup tasks associated with it.
Update: 4:54 AM UTC: Cleanup is over.
]]>The cause has been identified and a solution is currently being investigated. This incident will be updated as soon as we have more information.
EDIT 13:44 UTC: Another network interruption happened at 13:01 UTC. A fix is currently being tested.
EDIT 14 Dec 2022 15:55 UTC: The fix appears to be working as expected. This incident is now over.
]]>EDIT 15:55 UTC: The cause has been found. This issue only affects applications tied to a unique IP proxy service. The issue has been mitigated in the last minutes and we are working to fully fix it.
EDIT 16:20 UTC: The issue has been fixed and should not happen again. If you encounter weird Monitoring/Unreachable deployments, feel free to contact our support team.
]]>EDIT: 18:30 UTC - all systems are up
]]>EDIT 24 of November 9:33 UTC: Balancing is over.
]]>A monitoring desynchronization is causing disturbances on deployments. We are investigating and manually cleaning up unnecessary deployments.
We still have to clean up some stuck deployments, but the system has now recovered.
]]>We are investigating the issue.
EDIT 14:26 UTC: The underlying issue has been identified and fixed. Services, including the Console and CLI should now be loading as usual. Sorry for the inconvenience.
]]>EDIT 11:23 UTC: Ingestion lag is now resolved, metrics and access logs should now be up-to-date.
]]>Status:
UPDATE 20:13 UTC: Network is kind of coming back up, but we see 80% to 90% packet loss.
UPDATE 21:50 UTC: Still a lot (90%) of loss on the PAR -> SGP/SYD route, way less (30%) in the SGP/SYD -> PAR route.
UPDATE 2022-11-01 08:12 UTC: >90% of loss on the PAR -> SGP/SYD route.
UPDATE 2022-11-01 18:12 UTC: Network seems fine.
]]>Update 10:36 UTC: Performance has been fixed.
]]>** EDIT 11:55 UTC **: We have found the root cause and mitigated the issue. We are deploying the fix.
]]>We are investigating and watching the situation.
At 11:53 UTC, the monitoring sees everything up again. We are performing a few checks on some services.
]]>** EDIT 18:10 UTC ** : The issue has been identified and actions to solve it have been performed
]]>Some hypervisors are behaving strangely. We are watching and fixing them.
EDIT 10:20:00 UTC: Deployments are currently unavailable while we work around the issue.
EDIT 11:31:00 UTC: Deployments issues are fixed. We continue to monitor the situation. If you have troubles redeploying an application, please contact our support.
POSTMORTEM: The Pulsar outage that started around 04:30 UTC (see https://www.clevercloudstatus.com/incident/574) got in the way of:
The pulsar notification system is being gradually deployed on our infrastructure, having passed the tests on our preproduction zone. We do have a fallback method for notifications. However, the issue was unusual in that the pulsar notifications were not failing cleanly: they timed out after a long time, preventing the fallback from triggering. We stopped all deployments at 10:20 UTC. We worked on quickly adding an emergency flag to prevent the hypervisors from using pulsar for notifications. This way, we can bypass it and go straight to the fallback method.
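As an illustration of the mitigation described above (bounding the primary notification attempt with a hard timeout so the fallback can fire quickly instead of hanging), here is a minimal sketch; the function names are hypothetical and this is not the actual deployment code:

```
# Minimal sketch: try the primary notification path with a hard timeout, then
# fall back quickly. send_via_pulsar / send_via_fallback are hypothetical stand-ins.
import time
from concurrent.futures import ThreadPoolExecutor

def send_via_pulsar(event):
    time.sleep(5)            # simulate a call that hangs instead of failing cleanly
    return "pulsar"

def send_via_fallback(event):
    return f"fallback delivered: {event}"

def notify(event, timeout_s=2.0, use_pulsar=True):
    """Try the primary path for at most timeout_s seconds, then fall back."""
    if use_pulsar:
        pool = ThreadPoolExecutor(max_workers=1)
        future = pool.submit(send_via_pulsar, event)
        try:
            result = future.result(timeout=timeout_s)  # fail fast instead of hanging
            pool.shutdown(wait=False)
            return result
        except Exception:
            pool.shutdown(wait=False)  # leave the stuck call behind, do not wait for it
    return send_via_fallback(event)

if __name__ == "__main__":
    print(notify("deployment-finished"))  # fallback result after ~2s; the script
                                          # exits once the background call finishes
```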
To avoid this issue, we are working on the following:
POSTMORTEM (all times are UTC): Around 04:30: Timeouts in inter-node connections started to show up in the logs. They did not trigger alerts in the monitoring. Around 05:00: We started getting issues in our infrastructure from software using that cluster.
11:30: We disabled the brokers to analyze the issue.
14:42: The incident is now resolved. If you still encounter any problems, please contact our support.
]]>Additional investigations will be conducted to understand why our monitoring system did not report the issue earlier. Apologies for the inconvenience.
]]>The operation will take 10 minutes, during which the add-on API will be unreachable.
]]>05:45 - To prevent issues on the infrastructure, we disabled all deployments.
05:55 - We detect that some VMs are DOWN. It seems that the pulsar connection issues have overwhelmed the hypervisor's processes.
06:05 - We shut down the processes that fill up the hypervisors. It seems to fix the issue.
06:20 - The deployments seem to be back on track. We continue investigating the pulsar issue before putting it back into the deployment processes.
09:09 - We are still experiencing deployment issues. We are investigating.
12:28 - Deployments have been fixed.
]]>It looks like we are under a DDoS. We are monitoring it and blocking IPs that are performing the most requests.
EDIT 15:08 UTC: we have found the application that was taking 50% of all the platform traffic. We blocked all the IPs trying to reach that application. Traffic is now operational.
]]>EDIT 16:06 UTC: Ingestion lag is now resolved.
]]>** 16:30 UTC **: Incident has been resolved
]]>No service degradation is to be expected from this warning.
Please reach out to our support team should you have any questions regarding this matter.
EDIT 2022-09-29 17:30 UTC: A first license update has been applied. A new license update will be applied in the following days to finish the license update.
EDIT 2022-10-12 16:55 UTC: All licenses have been updated with a valid platinum license. The incident is over.
]]>26/09/2022 12:00 UTC: End of incident
]]>EDIT 16:46 UTC: The fix has been deployed. We are monitoring the situation. This issue also impacted Heptapod runners creation.
EDIT 17:28 UTC: The issue has been fixed, runner creation is now working correctly. Sorry for the trouble.
]]>EDIT 11:50 UTC: First investigations show that it is not only a network issue between our Paris infrastructure and the OVH network. It seems to impact other network links as well. We will reach out to OVH and try to learn more about it.
EDIT 11:51 UTC: The incident has been renamed from "Network issues between Paris and OVH zones" to "Network issues on OVH zones"
EDIT 11:58 UTC: We have been seeing improvements for a few minutes now. Connectivity has been restored from our point of view. We keep waiting for more information.
EDIT 12:09 UTC: We have not seen any new disruption so far. We consider this incident closed while we wait for a more detailed incident report from OVH.
EDIT 12:59 UTC: OVH status: https://network.status-ovhcloud.com/incidents/5mldyhd6v99c
]]>This will also impact some FSBuckets add-ons during which reads and writes will be unavailable. Applications will be redeployed automatically once the maintenance is over to make sure they correctly re-connect to the FSBucket server.
The maintenance is expected to last 15 minutes.
Impacted users will shortly receive an email with the impacted add-ons.
** Edit 08:05 UTC ** Waiting for last migration to end
** Edit 08:25 UTC ** Last migration has ended, the maintenance is beginning
** Edit 08:35 UTC ** The server has rebooted successfully
** Edit 08:55 UTC ** Everything is up and running normally
]]>We have fixed the issue and are watching the service.
]]>EDIT 10:38 UTC: All expired tokens have been regenerated and updated. Sorry for the inconvenience.
]]>EDIT 03/09/2022 12:10 UTC: lag is finally catching up, we will keep you posted.
EDIT 03/09/2022 16:10 UTC: The lag has fully recovered.
]]>This will also impact some FSBuckets add-ons during which reads and writes will be unavailable. Applications will be redeployed automatically once the maintenance is over to make sure they correctly re-connect to the FSBucket server.
The maintenance is expected to last 15 minutes.
Impacted users will shortly receive an email with the impacted add-ons.
EDIT 2022-09-05 21:10 UTC: Add-on migrations are starting
EDIT 2022-09-05 21:40 UTC: Add-ons have been migrated. The hypervisor reboot will happen in twenty minutes.
EDIT 2022-09-05 22:00 UTC: Hypervisor is rebooting
EDIT 2022-09-05 22:28 UTC: The hypervisor was rebooted in 4 minutes; the FSBucket server came back one minute later with most clients reconnecting. We restarted all affected applications to make sure everyone properly reconnects.
]]>It affects: 1 load balancer, 1 Redis add-on, 1 MySQL add-on, and the free PostgreSQL databases on MTL.
Update 16:40: After investigating, we decided to redirect the IP of the load balancer to the second LB. A ticket has been opened with OVHcloud to investigate what seems to be a hardware issue.
Update 17:56: The OVHcloud team physically checked the server: the RAID card was broken. They changed it and restarted the server.
Update 18:05: All VMs on the hypervisor are up and running again.
]]>EDIT 17:38 UTC: Hypervisor has been rebooted. Services are being restarted.
EDIT 18:08 UTC: Services have all been restarted. We continue looking into why the hypervisor went down and continue to monitor the situation.
EDIT 18:27 UTC: Initial investigation shows that a KVM kernel bug was encountered, leading to a kernel crash. We will investigate further to see if this can be mitigated by an update. The incident is now over.
]]>EDIT 07:04 UTC: We are seeing network improvements to reach the zone. It is currently operational but we are still waiting on confirmation from our provider. From our point of view as of now, traffic towards the zone was dropped when reaching the Level3 network transit. Our network provider seems to have changed it to another provider, allowing us to reach the zone again.
EDIT 12:18 UTC. The network problem is fully resolved. We are still waiting for an incident report from the network operator of the Datacenter. We will share it once available.
EDIT 2022-08-26 14:27 UTC: Here is the report from our provider: It has been identified that the incident is due to a bug found in our device at DRT1. As an initial resolution, our team rebooted the device. Consequently, all alarms cleared and all services were restored after executing the said activity. As of the moment, we can confirm that the link has remained clean and error-free since the service went up.
]]>EDIT 18:02 UTC+2: Hypervisor is rebooting
EDIT 18:04 UTC+2: Hypervisor is up again. Services are currently restarting.
EDIT 18:25 UTC+2: Hypervisor services are all up since a few minutes. Add-ons should now be reachable. Applications of owners using the FSBucket server that is hosted on this hypervisor will be redeployed. Since there is a huge number of applications, you can deploy them on your end directly if needed. We will continue to monitor the situation.
EDIT 19:10 UTC+2: The situation seems to be back to normal. We will investigate further why this hypervisor became unresponsive. If you still have any issues, please contact our support team.
]]>EDIT: 00:54 The issue has been resolved; deployments should now be working normally.
]]>Once the maintenance is over, you will have to refresh your Clever Cloud Console to be able to access your tickets or contact our team.
During this maintenance, you will still be able to reach our support team using our email address: support@clever-cloud.com
EDIT 2022-07-26 18:59 UTC+2: The maintenance is about to start.
EDIT 2022-07-26 19:10 UTC+2: The maintenance is now over. You will need to refresh your Clever Cloud Console to access the ticket center.
]]>Investigations will be carried out to understand how this happened and why our monitoring did not raise an alert.
The cluster should now be fully operational.
]]>EDIT 10:40 UTC - fixed.
]]>EDIT 16:03: Connectivity has been restored.
]]>Maintenance will start at 07:30 am UTC
EDIT 07:30 am UTC: Starting maintenance
EDIT 08:16 am UTC: Maintenance is over, we are catching up with the lag
EDIT 08:30 am UTC: Queries are currently disabled to speed up recovery
EDIT 09:17 am UTC: our maintenance triggered a major compaction on our storage layer. To speed up recovery, queries are still disabled
EDIT 16:20 UTC: The major compaction is over. We are struggling to handle both read and write operations at the same time. We are working on it.
EDIT 20:23 UTC: Queries are still disabled. We are testing new configurations to resolve the issue
EDIT 14 of July 9:22 am UTC: it's a brand new day, we are still working on it.
EDIT 14 of July 18:26 UTC: We are still struggling to handle both read and write operations at the same time. We are working on it. Happy French national day.
EDIT 16 of July 17:35 UTC: We found a performance issue triggered when the dotmap on the Console is accessed. We disabled some macros used to retrieve data to allow other users to access metrics. Metrics and access logs are now accessible.
]]>Some applications are being redeployed for Monitoring/Unreachable because the monitoring couldn't see them anymore.
Things seem to be working fine again since 09:37 UTC. We continue to monitor the situation and will try to get more information from OVH.
EDIT 11:12 UTC: The issue has not occurred again. We will wait for any input from OVH and will add it here if we get any useful information.
]]>EDIT 15:32:00 UTC: The server is back online. We are making sure services are correctly restarted. Additional services were impacted: One application reverse proxy and one add-on reverse proxy were unavailable.
EDIT 15:48:00 UTC: We are still investigating the cause of the reboot. We opened a ticket on OVH services to know if they had any un-planned intervention for that machine.
EDIT 16:03:00 UTC: The machine is unreachable again. We are investigating.
EDIT 16:11:00 UTC: The machine is up again. We are starting to suspect a hardware issue.
EDIT 16:30:00 UTC: We will drop all services from the machine to avoid any other issues until we know more about the underlying issue. FSBuckets server will be moved out around 19:00 UTC.
EDIT 19:59:00 UTC: Unfortunately, FSBuckets are going to require more time to move to another server. So far the server is working fine but OVH suspects an issue with the power supply.
EDIT 23:58:00 UTC: The FSBuckets migration is starting. FSBuckets will be set into read-only and applications will be redeployed to use the new server.
EDIT 2022-07-09 00:28:00 UTC: Buckets are fully migrated. The server is now empty and will be investigated further by OVH. This incident is now over.
]]>EDIT 22:30 UTC. The maintenance is starting.
EDIT 22:55 UTC: Maintenance is over, no visible impact happened, links failed over in less than 100ms each time.
]]>One of the partitions is corrupted; we are fixing it.
EDIT 17:10 UTC: The underlying issue has been fixed. The queue is currently being processed. Some events might have been lost during the cluster rebalance. Data points will take a few more hours to be up-to-date in the various dashboards.
EDIT: Queue is in sync
]]>A batch was sent by an employee. The throttle interval was set too small and the batch made a huge number of queries to the database, making it unresponsive. We stopped the batch and will restart it with a higher throttle interval.
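For illustration, a minimal sketch of the throttling idea (the interval value and names are illustrative, not the actual batch code):

```
# Minimal sketch of the throttling idea; the interval and names are illustrative.
import time

def run_batch(items, do_query, interval_s=0.5):
    """Run one query per item, pausing between queries to cap database load.

    Too small an interval_s lets the batch flood the database; raising it
    trades batch duration for database responsiveness.
    """
    for item in items:
        do_query(item)
        time.sleep(interval_s)  # throttle: at most ~1/interval_s queries per second

if __name__ == "__main__":
    run_batch(range(5), lambda i: print(f"query {i}"), interval_s=0.5)
```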
]]>This means that the data transferred to the server is encrypted, and that even if they are intercepted, they cannot be read by a third party. This protection has been provided by the TLS (Transport Layer Security) protocol for almost 20 years, whether it’s a personal site, an online shop or an access to your bank’s services.
Over time, this critical technical brick on the Internet has evolved to strengthen the level of security it offers. In August 2018, its version 1.3 (the latest) was released. Meanwhile, versions 1.0 and 1.1 were considered to no longer offer a sufficient level of protection. They have been deprecated by the IETF (Internet Engineering Task Force) since March 2021 and have therefore been gradually removed from recent browsers such as Firefox, Chrome and its derivatives or Safari.
At Clever Cloud, we have seen our customers adopt TLS 1.2 and 1.3 gradually. On our load balancers, based on our in-house and open source reverse proxy Sōzu, the latest version accounts for over 90% of the requests processed each day. TLS 1.2 for just under 9%. TLS 1.0 and 1.1 for only a few tens of thousands of requests per day, less than 0.1% of our traffic.
While we have maintained these versions for compatibility reasons, this will no longer be the case as of June 30. We will of course inform the customers affected by this choice, and encourage them to switch to more recent versions, which will have advantages for them in terms of security, performance and SEO.
Several reminders will be sent between now and the final shutdown of TLS 1.0 and 1.1. If you have any questions on this subject, please contact our support team through the Console.
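If you are unsure which TLS version your clients actually negotiate with your application, Python's standard ssl module can report it; a minimal sketch (the hostname is a placeholder for your own domain):

```
# Minimal sketch: report which TLS version is negotiated with a server.
# The hostname is a placeholder for your own domain.
import socket
import ssl

def negotiated_tls_version(hostname: str, port: int = 443) -> str:
    context = ssl.create_default_context()
    # Uncomment to refuse anything older than TLS 1.2, mirroring the new policy:
    # context.minimum_version = ssl.TLSVersion.TLSv1_2
    with socket.create_connection((hostname, port), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.version()  # e.g. "TLSv1.3"

if __name__ == "__main__":
    print(negotiated_tls_version("example.com"))
```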
EDIT 2:00 PM UTC: all public load balancers have been updated with the new configuration
]]>EDIT 14:37 UTC: Network connectivity has been resolved. Database is starting.
]]>Edit 07:13 UTC : the ticket center is back online.
]]>EDIT 13:02 UTC: The index has reloaded
]]>EDIT 06:04 UTC: The server experienced a hardware failure. It may not be able to come back. Applications on it were redeployed elsewhere. Custom services and add-ons are currently impacted.
EDIT 06:23 UTC: A public reverse proxy serving requests for domain.par.clever-cloud.com (185.42.117.109) was on this hypervisor. This IP was moved to another server. Between 05:23 and 05:35, it was unreachable.
EDIT 06:52 UTC: ETA for server to come back is 08:00
EDIT 07:46 UTC: Hardware has been changed, server will be rebooted.
EDIT 07:57 UTC: Server is back online, we are making sure all services are up.
EDIT 09:10 UTC: Everything is now back to normal, the incident is over. We will investigate further on the reason of the hardware failure.
]]>EDIT 13:55 UTC: Our provider now indicates that emails should be received with some delay.
EDIT 16:15 UTC: Email delivery should now be working fine again. Our provider's incident is over.
]]>EDIT 21:28 UTC: The issue has been found and fixed. We are monitoring the situation.
EDIT 21:40 UTC: Everything seems to be back to normal. The issue was happening for a couple of applications starting around 16:30 UTC. We will investigate further on why its configuration was out of sync during that time period.
]]>16:24:00 UTC: At first look, it seems that a network error is making us see that hypervisor as down. No information yet on if it's a hardware or software network issue.
16:28:00 UTC: The hypervisor seems to be back up again. We are making sure everything on it is responding well.
16:40:00 UTC: Everything has been checked and is responding correctly.
Impacts:
EDIT 14:55 UTC: The problem has been identified and fixed. Deployments should now be working for the last 10 minutes. Sorry for the inconvenience.
]]>EDIT: We stopped some components which were increasing the load of the cluster. It should be more stable now.
]]>19:40: The culprit is a switch that half stopped responding. It turns out it is not broken enough for its routes to be automatically removed. Our DC contractor is moving to physically remove the switch. ETA is 30 minutes.
20:00: Cellar seems to be up again. We are still watching and waiting for a direct confirmation from our DC contractor.
00:00: Everything is back to normal
]]>Applications will automatically be restarted once the maintenance is over.
EDIT 20:05 UTC: The maintenance is beginning
EDIT 20:28 UTC: The downtime was reduced to a few minutes but multiple network cuts may have happened. Applications linked to this service are currently redeploying.
]]>EDIT: Trying to repair database files.
EDIT: Database filesystem repaired.
EDIT 04/06: The MongoDB process has restarted. Some customers perform expensive queries on the MongoDB cluster, which can cause an OOM of the process.
EDIT 06/06 10:31:06 UTC: mongodb-c2 is still experiencing issues, we are working on it.
EDIT 06/06 11:24:00 UTC: Because of a replication recovery bug not fixed by MongoDB on pre-SSPL version, we are working on making databases back from the previous backups made overnight. Everything should be back on in the afternoon. Users can setup new dedicated database with the previous backups for faster recovery.
EDIT 06/06 13:45:00 UTC: The restore process has begun, it will take a few hours. We will keep you posted.
EDIT 06/06 15:01:00 UTC: We restored half of the customers. We are expecting full recovery in a few hours.
EDIT 06/06 17:01:00 UTC: An issue occurred while restoring the databases. We are investigating.
EDIT 06/06 23:00:00 UTC: We restored all the databases that were not above usage quota. The cluster is now running and we improved how we export connection data so applications will behave better when connecting.
Current state:
EDIT 20:37 UTC - fixed.
]]>EDIT 09:21 UTC: The issue should have been fixed. Your applications might need to be redeployed if the issue persists. We continue to monitor the service.
EDIT 13:11 UTC: We didn't see any other issues with the service, the issue is now resolved.
]]>EDIT 22:46 UTC: The hypervisor doesn't reboot, we continue our investigation.
EDIT 00:06 UTC: The hypervisor has been back online for a few minutes. All services are now available again. The cause of the extended downtime has been identified and will be fixed on similar hypervisors to allow a faster recovery next time.
]]>EDIT 21:04 UTC: Ingestion is now back to normal. Access logs will be processed over the next few hours.
]]>UPDATE: all applications have been redeployed
]]>UPDATE 14:57 UTC: Some add-ons are inaccessible due to a faulty proxy. We're removing it from the pool to mitigate.
UPDATE 14:59 UTC: Services are being reloaded to ensure the faulty proxy is removed from the pool.
UPDATE 15:10 UTC: Services are back online for redeployed apps. A faulty sentry induced an abnormal behaviour in the API.
CALL FOR ACTION 15:23 UTC: Remaining applications are currently redeployed. If you're impacted, we advise you to redeploy your app to accelerate the recovery process
]]>EDIT 14:59 UTC - We have identified the faulty component, which encounters an issue in the connection pooler.
EDIT 15:09 UTC - The deployments queue is being consumed and catching up. The issue is mitigated.
EDIT 15:23 UTC - Incident is fixed.
Root cause: we've found an issue in a messaging driver on a couple of isolated servers. We have removed this specific driver to fall back on an alternative messaging layer. In the coming days, we will dive into this specific bug and will communicate the fix upstream.
]]>EDIT 20:02 UTC: the MySQL shared cluster is back online.
]]>EDIT 21:39 UTC - querying logs is now available.
]]>EDIT 21:39 UTC - shared cluster is now back online
]]>Maintenance is expected to start in a few minutes
EDIT 17:56 UTC: Service is back online, you should now be able to SSH to your instances. Sorry for the inconvenience.
]]>EDIT 23:06 UTC - Storage cluster is now up. We are now catching up the accumulated ingestion lag. Query components will be restarted in a rolling fashion throughout the next 6 hours.
EDIT Sunday 11:27 UTC - Some query components are still reloading
EDIT Sunday 20:27 UTC - We are still experiencing issues on the query components.
EDIT Monday 07:20 UTC - Query is back online
]]>Because of this, those hypervisors became more empty than the others. More VMs were scheduled on them since they had more resources available, which then lead to more Monitoring/Unreachable events.
Instances weren't, for the most part, unreachable, but were redeployed anyway.
This should now be fixed. Sorry for the inconvenience
]]>The migration is expected to last at most 1 hour. All impacted applications will be redeployed during the migration. After the deployment, applications will be able to write to the bucket. Read operations will not be impacted.
Users of buckets that need to be migrated have received emails.
EDIT 2022-05-31 10:00 UTC: The migration is starting, buckets will be put into read-only.
EDIT 2022-05-31 10:25 UTC: The migration is over. Applications have started redeploying, it should take around 2 hours. You can redeploy your application earlier to finish the migration.
EDIT 2022-05-31 13:11 UTC: All applications have been redeployed, the migration is now over.
]]>EDIT 07:16 UTC - Indexes have been rebuilt. Query is now available.
]]>EDIT 17:12 UTC: The queue is still being consumed.
EDIT 17:27 UTC: The queue is now empty. All monitoring actions should now be working as expected.
]]>EDIT 10:50 UTC: Hypervisor is back online. Add-ons hosted on that hypervisor are currently available.
]]>EDIT 15:02 UTC - Indexes have been rebuilt. Query is now available.
]]>EDIT 09:20 UTC - Indexes have been rebuilt. Query is now available.
]]>The migration is expected to last at most 1 hour. All impacted applications will be redeployed during the migration. After the deployment, applications will be able to write to the bucket. Read operations will not be impacted.
Users of buckets that need to be migrated have received emails.
EDIT 24/05/2022 12:00 UTC+2: The migration will start soon. FSBuckets will be put into read-only for a couple of minutes so that all buckets are correctly synchronized.
EDIT 24/05/2022 12:03 UTC+2: FSBuckets are now in read-only mode.
EDIT 24/05/2022 12:39 UTC+2: Synchronization is over. Applications are being redeployed. If you wish to recover faster, you can trigger a deployment through the web Console or CLI. Deployments are expected to all be started within the next 30 minutes.
EDIT 24/05/2022 13:34 UTC+2: The migration is over, if you have any issues, please contact our support team
]]>EDIT 08:28 UTC - We are consuming the lag.
EDIT 08:28 UTC - Indexes are rebuilding.
EDIT 09:34 UTC - Indexes are rebuilt. Query is available.
EDIT 16:03 UTC - Fixed.
]]>EDIT 23:41 UTC - Issue has been identified and we are consuming the lag.
EDIT 07:28 UTC - Lag has been consumed.
EDIT 07:30 UTC - Fixed.
]]>23:55 UTC - The issue has been identified and we are consuming the lag.
00:19 UTC - lag has been consumed.
00:20 UTC - Fixed.
]]>EDIT 09:11 UTC - Metrics/AccessLogs are catching up their lag.
EDIT 16:34 UTC - Fixed.
]]>EDIT 09:06 UTC - The logs are catching up.
EDIT 11:15 UTC - Fixed.
]]>EDIT 08:00 UTC - We have identified ongoing issues.
EDIT 08:02 UTC - New deployments are currently disabled to reduce the impact on our infrastructures. We will reactivate them when the queued ones will be deployed.
EDIT 08:45 UTC - Deployments are still flaky, we are working to resolve the issues.
EDIT 09:08 UTC - The deployments queue is catching up. When it ends, we will redeploy a part of the PAR zone to ensure deployments and monitoring are consistent.
EDIT 09:25 UTC - The mentioned deployments are running.
EDIT 11:16 UTC - We are at about 75% of the deployments completed.
EDIT 12:06 UTC - Finished and fixed.
]]>EDIT 06:34 UTC - Our orchestrator is impacted and the deployments are experiencing issues.
EDIT 06:44 UTC - Core API is fixed.
EDIT 08:34 UTC - We are experiencing issues affecting the Console and CLI. We are investigating.
EDIT 08:45 UTC - Core API is fixed.
]]>EDIT 12:11 UTC: Metrics and logs are now accessible again. Sorry for the inconvenience.
]]>EDIT 10:56 UTC: Metrics and logs are now accessible again. Sorry for the inconvenience.
]]>EDIT: Ingestion fixed, query almost restored
]]>EDIT 13:20 UTC: The issue has been fixed. Some metrics data points have been lost. Access logs are being queued for ingestion again.
]]>EDIT 17:20 UTC: The service has been fully restored. Sorry for the inconvenience.
]]>EDIT 06:45 UTC: fixed.
EDIT 07:22 UTC: we have identified another issue.
EDIT 09:45 UTC: fixed.
]]>EDIT 15:12 UTC: This seems to be back to normal. We did not find the root cause but we keep looking. Some actions may have failed like deployments, git push or accessing the dashboard / using the CLI in general
EDIT 17:34 UTC: We still see some instabilities, resulting in various longer queries or even errors from some services that fail to contact our API. We are still working on identifying the root cause.
EDIT 20:34 UTC: We didn't see any more instabilities since the latest status update. We'll continue to monitor the activity in the next couple of days.
]]>EDIT 20:43 UTC: The delay has now resolved, you should now be able to query the access logs using the CLI or API.
]]>EDIT 15:05 UTC: fixed.
]]>EDIT 18:23 UTC - the SYD zone (provided by OVH) seems only reachable using the OVH network
EDIT 18:30 UTC - we are waiting for our provider's feedback
EDIT 19:00 UTC - fixed https://network.status-ovhcloud.com/incidents/j5vzf90dpzcc
]]>Edit 10:27 UTC: The delay is now resolved. Sorry for the inconvenience.
]]>This should now be resolved. The 7 other reverse proxies were working as usual.
]]>07:40: The reason has been found and has been fixed.
]]>The team has found the origin. We are working on a fix.
]]>Some hypervisors are experiencing issues with qemu. VMs are randomly crashing.
We are investigating.
On the 4th of April, some new deployments could not be completed by the CCOS (Clever Cloud Operating System) orchestrator.
A few days ago, we introduced a new notification subsystem, required to enable the Network Groups feature. This new subsystem caused hypervisor agents to initiate new connections to the messaging component.
An issue in the proxy layer, which did not properly close connections, led to connections stacking up until the pooler was saturated. This made agents accumulate too many processes on hypervisor machines for too long, preventing new processes from being spawned.
Our hypervisor controller struggled to spawn new threads, which led to new deployments being unable to complete. It also prevented the running virtual machines from spawning new threads, thus crashing some of those running VMs.
Network Groups being in ALPHA, we immediately decided to roll back their availability, pushing a non-blocking version which does not rely on our messaging layer.
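For illustration only, here is a minimal sketch of the kind of safeguard that prevents this connection stacking: bounding how many messaging connections an agent can hold at once and closing them deterministically. Names are hypothetical; this is not the actual agent code.

```
# Minimal sketch: cap concurrent messaging connections per agent and always close
# them, so a misbehaving proxy cannot make connections stack up without bound.
import threading
from contextlib import contextmanager

MAX_CONNECTIONS = 8                                   # illustrative limit
_slots = threading.BoundedSemaphore(MAX_CONNECTIONS)

@contextmanager
def messaging_connection(open_connection, timeout_s=2.0):
    """Acquire a slot (or fail fast), open a connection, and always close it."""
    if not _slots.acquire(timeout=timeout_s):
        raise RuntimeError("connection slots exhausted, failing fast")
    connection = open_connection()                    # hypothetical client factory
    try:
        yield connection
    finally:
        connection.close()                            # deterministic close, even on error
        _slots.release()

# Usage: with messaging_connection(factory) as conn: conn.send(event)
# where factory opens the real client; when all slots are taken, callers fail
# fast instead of piling up processes on the hypervisor.
```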
Two different actions are being rolled out.
EDIT 15:32 UTC: The team has found the origin. We are working on a fix.
EDIT 15:50 UTC: Reading is back, the situation is being mitigated.
EDIT 16:01 UTC: Cellar C2 is up and running.
]]>The problem has been identified; we are working to fix it.
EDIT 15:36 UTC: Certain metrics and access logs are still not accessible.
EDIT 18:50 UTC: Metrics and access logs are now accessible.
]]>We are trying to fix these performance issues.
]]>EDIT 18:43 UTC: fixed.
]]>Everything went well. Do not hesitate to reach us via support for any questions.
]]>EDIT 20:57UTC - creation is enabled.
]]>** UPDATE ** 2022-03-24 15:40 UTC website does not have HTTP errors anymore
]]>If you are missing some files that were on it, please contact support with all the information: add-on ID, bucket name, etc.
]]>As a reminder, the service will be shut down on 21/03/22. A few network brownouts will be applied to remind customers that they need to migrate their data.
A total of 5 brownouts will be applied. During these planned downtimes, the service will refuse any connections, be it HTTP or HTTPS.
This brownout will happen on 16/03/22 16:00 UTC for a 30-minute window.
Our support team stays at your disposal for any questions.
]]>Some services are also impacted:
EDIT 18:20 UTC: Our network provider is investigating the issue.
EDIT 18:28 UTC: The issue has been identified and has been escalated. Logs may also be impacted.
EDIT 18:44 UTC: The issue is still being worked on, but Pulsar and Logs are now working fine again.
EDIT 19:26 UTC: The issue has been fixed by the network provider at 18:54 UTC. All components are now working fine again. Access logs are being ingested and may have some lag for a few hours. Sorry for the inconvenience.
]]>EDIT 12:04 UTC: The lag in the ingestion pipeline has been resolved.
]]>As a reminder, the service will be shut down on 21/03/22. A few network brownouts will be applied to remind customers that they need to migrate their data.
A total of 5 brownouts will be applied. During these planned downtimes, the service will refuse any connections, be it HTTP or HTTPS.
This brownout will happen on 18/03/22 10:00 UTC for a 30-minute window.
Our support team stays at your disposal for any questions.
EDIT 11:00 UTC: The brownout has started and will last for 30 minutes.
EDIT 11:30 UTC: The brownout has ended. The service will be decommissioned next Monday.
]]>As a reminder, the service will be shut down on 21/03/22. A few network brownouts will be applied to remind customers that they need to migrate their data.
A total of 5 brownouts will be applied. During these planned downtimes, the service will refuse any connections, be it HTTP or HTTPS.
This brownout will happen on 14/03/22 09:30 UTC for a 30-minute window.
Our support team stays at your disposal for any questions.
EDIT 09:36 UTC: The brownout is starting. It will last for 30 minutes.
EDIT 10:07 UTC: The brownout has ended. Next one will happen on 16/03/22 16:00 UTC for a 30 minutes window.
]]>As a reminder, the service will be shut down on 21/03/22. A few network brownouts will be applied to remind customers that they need to migrate their data.
A total of 5 brownouts will be applied. During these planned downtimes, the service will refuse any connections, be it HTTP or HTTPS.
This brownout will happen on 11/03/22 14:00 UTC for a 10-minute window.
Our support team stays at your disposal for any questions.
EDIT 14:00 UTC: The brownout is starting and will last for 10 minutes.
EDIT 14:10 UTC: The brownout has ended. Next one will happen on 14/03/22 09:30 UTC for a 30 minutes window.
]]>As a reminder, the service will be shut down on 21/03/22. A few network brownouts will be applied to remind customers that they need to migrate their data.
A total of 5 brownouts will be applied. During these planned downtimes, the service will refuse any connections, be it HTTP or HTTPS.
This brownout will happen on 09/03/22 10:00 UTC for a 10-minute window.
Our support team stays at your disposal for any questions.
EDIT 10:00 UTC: The brownout has started.
EDIT 10:10 UTC: The brownout has ended. Next one will happen on 11/03/22 14:00 UTC
]]>Edit: Connectivity issues have been solved by our network provider. The service should run as expected.
]]>EDIT 10:27 UTC: Queries have returned to normal, Metrics and Access logs should now be reachable. We are monitoring the queries.
EDIT 11:03 UTC: Queries have returned to normal, Metrics and Access logs should now be reachable.
]]>Sorry for the inconvenience.
]]>cellar-c1.clvrcld.net or cellar.services.clever-cloud.com. We are investigating the issue.
EDIT 22:42 UTC: After a quick investigation, only one of the 3 IPs serving those domains is having trouble reaching other nodes of the cluster. This IP has been dropped from the DNS. Meanwhile, we are investigating the issue with our network provider.
EDIT 22:39 UTC: Lowering the severity to Performance Issues. A ticket has been opened with our network provider.
EDIT 23:15 UTC: The connectivity is now back since 23:07 UTC with our network provider saying that the issue has been resolved. We will wait a bit before adding back the IP of the faulty node in the DNS just to be sure but this incident is now closed on our end. Sorry for the inconvenience.
]]>EDIT 23:15 UTC: The connectivity is now back since 23:07 UTC with our network provider saying that the issue has been resolved. This incident is now closed on our end. Sorry for the inconvenience.
]]>EDIT 19:16 UTC+1: This does not impact renewal of certificates.
EDIT 19:36 UTC+1: We are now under the rate limit, newly added domains should have their certificates generated in a few minutes, as usual. Sorry for the inconvenience.
]]>The fail-over will be done in the upcoming hour.
EDIT 15:17 UTC: The cluster will fail over in the next few minutes. Some queries might fail as soon as the leader goes down and until your application correctly connects to the new leader.
EDIT 15:28 UTC: The fail-over has been done. Make sure to restart your applications if they can't connect to their add-on.
]]>EDIT 14:25 UTC: The issue has been fixed. A fix has been scheduled for deployment this afternoon which should reduce those delivery issues events. We will monitor the fix closely once it gets deployed.
]]>EDIT 15:07 UTC: Live logs and drains are back. Some drains logs may have been lost during the recovery process. Sorry for the inconvenience.
EDIT 15:52 UTC: Live logs and drains are down again, we are looking into it
]]>EDIT 15:52 UTC: Queries have returned to normal, Metrics and Access logs should now be reachable.
]]>Ingestion is now starting at full capacity again. There will be some delay before having up-to-date access logs but it should be good in a few hours. Sorry for the inconvenience.
]]>We are still identifying which ones are broken to restart them. If you see that your drains are broken, please contact the support so we can restart them!
Edit 15:11 — We restarted all drains to be sure.
Edit 16:27 — Most of the drains are still broken. We are trying to fix the issue by deleting and re-creating message queues in the logs infrastructure.
Edit 16:37 — Deleting and creating back everything seems to have cleaned up the situation. Drains seem to be working again!
]]>EDIT 20:32 UTC - fixed.
]]>EDIT 17:45 UTC: The incident is over, sorry for the inconvenience.
]]>EDIT 20:27 UTC: We identified the issue, and the resolution is ongoing.
EDIT 20:54 UTC : Fixed.
]]>20:39 UTC: The ingestion pipeline is back for now but the underlying issue is not properly fixed yet.
20:49 UTC: Theoretically, the problem is fixed. In any case, the ingestion pipeline is working at full speed. We are keeping an eye on things.
]]>We have dealt with most consequences of that downtime; we are still working on fixing an issue with the ingestion pipeline of Metrics and access logs. There will be some delay.
16:40 UTC: Everything is working as expected, delay will go back to normal soon.
]]>This impacts:
EDIT 17:30 UTC: Everything is back to normal. Sorry for the inconvenience.
]]>EDIT 21:58 UTC: Everything should be back to normal, sorry for the inconvenience.
]]>The migration will start at 19:00 UTC+1 and should apply instantly as soon as you refresh the console.
During the transition, you can directly contact us at supportmail@clever-cloud.com.
EDIT 20:44 UTC+1: The migration has ended, our new support tool is now ready to be used! Make sure to refresh the web console.
]]>17:26 UTC: The maintenance operation did not fix the issue. Deployments are completely disabled at the moment. We are investigating.
17:31 UTC: It was DNS (DNS reverse resolving was too slow when opening connections, which timed out). We are working on bringing everything back up.
17:52 UTC: Everything is back up. If you are experiencing an issue, please contact us.
]]>We do not have any details about this incident as of now.
]]>This is due to an issue with Slack. Slack is replying with 500 errors to our notifications even though it is clearly processing the messages just fine. Our notification system sends multiple retries after receiving failures, so you will receive multiple duplicates and your webhooks will probably be disabled automatically (as they are after too many repeated failures). We will re-enable them once the issue is fixed. If your webhook remains disabled, please contact us.
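To illustrate the behaviour described above (retry on failure, then automatic disabling after repeated failures), here is a minimal, hypothetical sketch; the thresholds and names are made up and do not reflect our actual notification code:

```
# Minimal sketch of retry-then-disable webhook delivery. Thresholds and names
# are hypothetical and do not reflect the actual notification system.
import time
import urllib.request

MAX_RETRIES = 3      # each failed delivery is retried; a receiver that errors
                     # but still processes the message will thus see duplicates
DISABLE_AFTER = 10   # consecutive failed deliveries before the webhook is disabled

def deliver(url: str, payload: bytes, consecutive_failures: int) -> int:
    """Attempt one delivery; return the updated consecutive-failure count."""
    for attempt in range(MAX_RETRIES):
        request = urllib.request.Request(
            url, data=payload, headers={"Content-Type": "application/json"}
        )
        try:
            with urllib.request.urlopen(request, timeout=5):
                return 0                 # success resets the failure counter
        except OSError:                  # covers HTTP 5xx, network errors, timeouts
            time.sleep(2 ** attempt)     # back off before the next retry
    consecutive_failures += 1
    if consecutive_failures >= DISABLE_AFTER:
        print(f"disabling webhook {url} after {consecutive_failures} failed deliveries")
    return consecutive_failures
```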
14:17 UTC: We have not received a single 500 error from Slack in 8 minutes. It looks like this may be fixed. Although a broader incident is still ongoing on Slack's end: https://status.slack.com/2021-12/a17eae991fdc437d
14:44 UTC: Webhooks disabled since 12:00 UTC have been re-enabled. Slack status says messaging/notifications part of the incident is resolved, we are not seeing any errors so this incident is now over. If you are experiencing an error or if your webhook has not been re-enabled, please contact us.
]]>13:40 UTC: Multiple servers in the same rack have gone down at the same time. It's most likely a network issue.
13:45 UTC: Our provider (OVHcloud) is aware of the issue. They will come back to us with more details later.
13:53 UTC: The hypervisor is back online. We are making sure everything is fine.
14:11 UTC: Everything is fine now, there was an issue with outgoing traffic from 13:53 until 14:08 UTC. This is now fixed.
Our provider tells us it was an issue with the cooling system. More info may be posted here: https://bare-metal-servers.status-ovhcloud.com/incidents/5cqtb0q9ht67
]]>EDIT 9h15 UTC : The ingestion pipeline is back to normal. No abnormal delay.
]]>This impacts:
EDIT 14:52 UTC: The queries are available again since 14:20 UTC. This incident is over.
]]>We will apply this option on all add-ons and restart them as an emergency maintenance. For single node add-ons, this will trigger a short downtime of minimum 1 minute (the approximate time it takes Elasticsearch to boot). For clustered add-ons, no downtime is to be expected as it will be a rolling restart.
Newly created add-ons are already patched.
The restart of all add-ons will start at 15:00 UTC. Sorry for the short notice. Feel free to contact our support if you have any questions.
EDIT 15:05 UTC: Add-ons restart is starting
EDIT 16:10 UTC: Add-ons have been restarted. The maintenance is over.
]]>The issue is now fixed.
]]>09:22 UTC: The issue is identified and fixed, logs ingestion should catch up. Logs should appear within a few minutes.
09:38 UTC: The issue is not actually fixed, there is something else blocking the pipeline. We are investigating.
09:55 UTC: The ingestion is working, there are a lot of older logs to be processed so it will take a while before you can see recent logs in real time.
13:07 UTC: The ingestion pipeline is back to normal. No abnormal delay.
]]>10:53 UTC: We are still investigating this issue. The culprit seems to be a peering node.
11:18 UTC: It seems to only affect a few routing paths between our infrastructure and some hosts of Scaleway and Azure. We are trying to narrow down the issue with their network teams.
13:05 UTC: We have seen improvements between Scaleway and our infrastructure since 11:26 UTC. We do not yet know if it's a temporary resolution and are awaiting more information from Scaleway's side.
13:36 UTC: Confirming that the issue between Scaleway and our infrastructure has been fixed. We are still awaiting some details from Scaleway to know if they are indeed the ones who changed their routing configuration to avoid the faulty peer.
15:10 UTC: Scaleway tells us they did not change anything on their end. Still, no issue to report on this side since 11:26 UTC. On the Azure side of things, it seems to be better, the issues we could reproduce earlier cannot be reproduced anymore but some hosts may still be affected. We are marking this as resolved but if you have any specific problems, please contact us so we can troubleshoot the issue more efficiently.
]]>Impacted users will shortly receive an email and can contact us on our technical support for any further questions.
EDIT 20:32 UTC+1: Add-ons migrations are starting
EDIT 21:31 UTC+1: Add-ons have been migrated. Add-ons that couldn't be migrated in the first place will be unavailable for up to one hour. We will announce the planned downtime tomorrow (02/12/2021)
EDIT 02/12/2021: The hypervisor will be rebooted on December 06, 2021 at 11:00 UTC+1. The expected downtime is less than 1 hour.
EDIT 06/12/2021 10:59 UTC+1: The hypervisor is going down at 11:00 UTC+1 as expected. Downtime should not be higher than 1 hour.
EDIT 06/12/2021 11:09 UTC+1: The hypervisor has been back up for 3 minutes, all services should be reachable again. We are making sure everything runs fine.
EDIT 06/12/2021 11:13 UTC+1: The maintenance is over.
]]>12:21 UTC: Incident is resolved (there may be some lag for a few minutes)
]]>PHP versions from 7.0 to 7.2 have reached end of life and will no longer receive security updates, leaving them exposed to unpatched vulnerabilities. You can find the list of end-of-life versions here: https://www.php.net/eol.php.
Affected customers will be e-mailed about this change and can contact our support team for any additional questions.
]]>EDIT 13:05 UTC: fixed.
]]>EDIT 11:00 UTC: A node from the cluster failed to reboot and was stuck in a failed state. We are rebuilding this node. It will take 2 to 3 hours. No data will be lost.
]]>11:45 - The Logs API stopped crashing. We don't know why yet and are continuing to investigate so we can fix this for the long term.
]]>At 11:12 UTC today, the queue was emptied, so webhooks matching events from before that time have not been and will not be sent out. Webhooks for events from 11:12 to 12:25 UTC were all sent at once, and everything has been back to normal since then.
]]>The reverse proxy has been rebooted and this incident is now over.
]]>Impacted users will shortly receive an email and can contact us for any further questions.
EDIT 19/10 18:35 UTC: Migration of add-ons has started
]]>10:40 UTC+2: The issue is resolved.
]]>EDIT 15:07 UTC: The problem has been identified and fixed. Queries should now be back, current data lag is 1 hour and 30 minutes. It should quickly come down in the next hour.
EDIT 17:58 UTC: Ingestion lag is now resolved
]]>https://twitter.com/ovh_status/status/1448185498812485633?s=20
The website travaux.ovh.com is unreachable, preventing us from getting a status update on the maintenance, which was expected to have "No impact".
09:55 UTC+2: We still have no update from OVH.
10:01 UTC+2: https://twitter.com/olesovhcom/status/1448196879020433409?s=20
10:20 UTC+2: Our Montreal zone is reachable, others zones might come back soon.
All our zones are now reachable. You might still experience DNS or other issues due to the OVH incident itself.
]]>12:58 UTC: The issue is resolved.
Here is what we know so far:
The revocation server of the Certification Authority providing this certificate says that this certificate was revoked on 2021-06-23, yet it was still accepted just fine a few hours ago.
We have asked for a reissue of the certificate (this is an automatic operation). The reissued certificate has been installed and is working fine. Meanwhile, we have asked the CA about this revocation without any warning or notice and are waiting for an answer.
]]>We are investigating the issue.
EDIT 10:00 UTC: The issue has been fixed, pushes using the HTTP protocol should now be working as intended. Pushes and clones using the SSH protocol were not impacted. We'll investigate the issue further.
]]>EDIT 18:45 UTC: CC_JAVA_VERSION should now be fixed with the right value. Impacted applications are redeploying to make sure they use the right version.
EDIT 18:58 UTC: If you changed the value of CC_JAVA_VERSION between 09:30 UTC and 18:45 UTC, the value might have been replaced with its previous version. Make sure you set it back to the right version if needed. Sorry for the inconvenience.
]]>EDIT 14:57 UTC: The problem has been fixed, the documentation should now be fully accessible at https://www.clever-cloud.com/doc/
]]>Our Let's Encrypt certificates already provide the up-to-date Let's Encrypt chain, but some older clients might not be able to trust that new chain because they don't have the new root Certificate Authority in their truststore. If you are in this situation with clients you can't update, we can sell certificates that will be trusted by those older clients. You can contact us on the support with the domains you need to protect.
You can also find more information about this expiration on Let's Encrypt website: https://letsencrypt.org/docs/dst-root-ca-x3-expiration-september-2021/
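A simple way to see whether a given trust store still verifies your domain is to attempt a TLS handshake with it. This is a minimal sketch using the Python standard library; replace the host with one of your own domains, and pass an older CA bundle to reproduce the behaviour of outdated clients.

```python
import socket
import ssl

def can_verify(host, port=443, cafile=None):
    # cafile=None uses the system trust store; an outdated bundle that lacks
    # the new ISRG root will make verification fail, like an old client would.
    context = ssl.create_default_context(cafile=cafile)
    try:
        with socket.create_connection((host, port), timeout=5) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return True
    except ssl.SSLCertVerificationError:
        return False

print(can_verify("letsencrypt.org"))
```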
]]>EDIT 14:57 UTC: A fix has been pushed, the errors should be resolved. We continue to monitor the situation.
EDIT 15:19 UTC: No more Internal server errors are happening, this incident is now closed.
]]>EDIT 06:43 UTC: queries should be back to normal, the ingestion lag should take a few minutes to be consumed.
EDIT 11:12 UTC: Everything is back to normal
]]>Edit: New hypervisors were added but they had no support for fsbuckets yet.
]]>EDIT 14:35 UTC: The root cause has been identified, the ingestion lag currently sits at around 2 hours so metrics queries will be out of sync for the time being. Access logs are not ingesting and are currently kept in a separate queue. We expect the lag to start decreasing later tonight. This incident is a follow-up to the urgent maintenance of yesterday which mainly aimed at better stabilizing the cluster.
EDIT 23:34 UTC: Metrics have been fully ingested, access logs are still delayed but they are currently being written. Queries might still be slow, this is expected.
EDIT 6:30 UTC: The situation is back to normal.
]]>EDIT 16:17 UTC: the maintenance is still ongoing. Reads and writes have been disabled since 15:42; this is expected.
EDIT 21:50 UTC: the maintenance is finished. Ingestion is catching up.
]]>EDIT 15:14 UTC: fixed, the related drains are currently catching up.
]]>The timeouts lasted for about 2 minutes before the proxy was taken out of the pool.
Some requests might have failed during the first minute, and then all requests handled during the remaining minute failed. Additional investigation will be performed to analyze what happened.
]]>Update: NPM Registry posted on their status page confirming the incident and are working on a fix: https://status.npmjs.org/incidents/bydjtj102gsn
Update: The issue has now been fixed, node deployments are back to normal.
]]>Affected customers have been e-mailed about this and can contact our support team for any additional questions.
EDIT 21:05 UTC+2: Update is beginning.
EDIT 22:00 UTC+2: Updates are over and were successful for most of the add-ons. Owners of add-ons that couldn't be updated will be contacted. If you encounter any issue following this update, please reach out to our support team.
]]>We will be switching back to the original server (which has been fixed by the manufacturer). The server should be down for 10 minutes if our provider does not encounter any issues (may last up to an hour otherwise).
Affected customers have been e-mailed about this and can migrate their add-ons automatically beforehand.
2021-08-25 19:02 UTC: Server is going down.
19:17 UTC: This is taking longer than expected. Server management software decided to reapply firmware settings; this takes a few minutes.
19:24 UTC: Server is up. Add-ons are starting up.
19:26 UTC: Everything is up. Incident is over.
]]>You can follow their incident here: https://status.mailgun.com/incidents/jj6fx7nqwn9t
21:19 UTC: Incident is resolved.
]]>The root cause has been identified. Unfortunately, the higher load also triggered a lot of redeployments with the Monitoring/Unreachable cause. Most of them were cancelled in time but some of them went through. Some of the deployments that started did not finish correctly and ended up in a blocked state. Those deployments are currently being cancelled and all cancellations should be over in a few minutes.
We have developed a fix that will prevent those events from happening again and it will be deployed in the next few hours.
]]>The root cause has been identified and the issue has been fixed. Unfortunately, the higher load also triggered a lot of redeployments with the Monitoring/Unreachable cause. Most of them were cancelled in time but some of them went through. Some of the deployments that started did not finish correctly and ended up in a blocked state. Those deployments are currently being cancelled and all cancellations should be over in a few minutes.
We will investigate this increased CPU load in more depth and see how we can better prevent it.
]]>We do not have any more information at the moment (though it is most likely a routing issue). Everything is working fine now except for Metrics (and accesslogs) which will come back in a few minutes.
EDIT 19:33 UTC: It happened again at 19:29 UTC. We are awaiting more information from our network provider.
EDIT 19:42 UTC: It happened again at 19:41 UTC.
--
This was due to a maintenance on one of the fiber optic channels between our two Paris datacenters. Our network provider was not made aware of this maintenance which caused the connection to switch back and forth between links when a link went on and off again.
]]>EDIT 9:22 UTC - fixed.
]]>EDIT 13:42 UTC: The ingestion stopped again, we continue looking into it.
EDIT 14:05 UTC: We continue to investigate the issue. If you need to access the logs of your application, you can SSH to the VM and display them: https://www.clever-cloud.com/doc/reference/clever-tools/ssh-access/#show-your-applications-logs
EDIT 14:30 UTC: Part of the ingestion queue could not be consumed and has been lost. The rest of the queue is still being consumed, so up-to-date logs are still delayed
EDIT 17:15 UTC: The queue has been fully consumed and the logs are now up-to-date.
]]>EDIT 12:35 UTC: Logs are back, query should now work again and logs drains should have been sent to their endpoints. No logs have been lost.
]]>EDIT 14:45 UTC: The server won't reboot as of now, we are not yet sure of the reason. We continue to look into it. In the meantime, you can create a new add-on and import last night's backup. Please contact our support team for any further assistance
EDIT 14:58 UTC: The server still won't reboot, we continue to investigate the reason.
EDIT 15:08 UTC: A ticket has been opened to the manufacturer. The server is still unreachable as of now.
EDIT 15:12 UTC: A server replacement is currently being discussed. In the meantime, we advise you to import last night's backup into a new add-on. If the hypervisor ever comes back, you will be able to access your old add-on and possibly access the data between last night's backup and now, allowing you to merge them if possible. Current ETA is 24 hours.
EDIT 16:38 UTC: No server replacement will happen, we'll have more information to share tomorrow once the manufacturer gets back to us.
EDIT 16:54 UTC: Clarification: No server replacement will happen tonight. There are no signs of disk / data corruption; it seems to be only a hardware problem, which we can't fix right now.
EDIT 29/06/21 09:30 UTC: A maintenance on the server should happen in the next few minutes. The goal is to replace the problematic hardware piece. More information to come.
EDIT 13:17 UTC: The maintenance has been performed and a hardware piece has been changed but it didn't fix the issue. We continue investigating.
EDIT 13:26 UTC: The initial hardware replacement was the network card. Another replacement, this time the motherboard, has been planned for tomorrow. We do not yet have the exact time.
EDIT 30/06/21 11:09 UTC: The motherboard has been changed, additional checks are being performed.
EDIT 13:03 UTC: The motherboard replacement did not improve the situation. The server reboots fine without the network card, which has already been changed. A full server replacement is being considered by the manufacturer.
EDIT 18:23 UTC: Our infrastructure provider has been able to provide us with a temporary replacement server which is now up and running. Add-ons and custom services are all up and running. Do note that this is a temporary replacement, once the manufacturer gives us back the fixed server or a fully working permanent replacement, we will have to switch to it (meaning a shutdown of a few minutes). Affected customers will be e-mailed about this.
]]>We are currently having connectivity issue or high latency to some part of our Paris infrastructure. Our network provider is aware of the issue and is currently investigating.
10:03 UTC: It seems like the issue is only affecting one of the datacenters. Applications that use services deployed in another datacenter might suffer from connectivity issues or increased latency
10:15 UTC: We are removing the IPs of the affected datacenter from all DNS records of load balancers (public, internal and Clever Cloud Premium customers) and are awaiting more info from our network provider.
10:19 UTC: Packet loss and latency have been going down from 10:12 UTC and it seems to be back to normal now. We are awaiting confirmation of the actual resolution of the incident.
10:23 UTC: We are working on resolving issues caused by this network instability and making sure everything works fine.
10:25 UTC: Logs ingestion is fixed. We are working on bringing back Clever Cloud Metrics.
10:31 UTC: IPs removed from DNS records at 10:15 UTC will be added back once we have confirmation that the network issue is definitely fixed.
10:41 UTC: Full loss of connectivity between the two Paris datacenters for a few seconds around 10:39 UTC. We are still experiencing packet loss now. Our network provider is working with the affected peering network on this issue.
10:45 UTC: The two Paris datacenters are unreachable depending on your own network provider.
10:49 UTC: Network is overall very flaky. Our network provider and peering network provider are still investigating.
10:57 UTC: According to our network provider, many optical fibers in Paris are deteriorated. Some interconnection equipment might be flooded. We are waiting for more information.
11:02 UTC: (Network and infrastructure inside each datacenter are safe. The issue is clearly happening outside the datacenters.)
11:13 UTC: Network is still flaky. Overall very slow. We are still waiting for a status update from our network and peering providers.
11:20 UTC: Network seems better towards one of the datacenters. We invite you to remove all IPs starting by "46.252.181" from your DNS.
11:42 UTC: Still waiting for information from our network providers. Still no ETA.
12:16 UTC: Network loss between the datacenters has lowered a bit. Console should be more accessible.
12:21 UTC: Connections are starting to come back UP. We are still watching and waiting for more information from our network providers.
12:30 UTC: Info from provider: over the 4 optical fibers, 1 is "fine". They cannot promise this one will stay fine. They are still working on it. Teams have been dispatched on the premises.
13:15 UTC: Network is still stable. We are keeping Metrics down for now as it uses a significant amount of bandwidth between datacenters.
13:48 UTC: A second optical fiber is back UP. According to our provider, "it should be fine, now". The other two fibers are still down. The on-site teams are analysing the situation.
13:41 UTC: You can now add back these IPs to your domains:
@ 10800 IN A 46.252.181.103
@ 10800 IN A 46.252.181.104
15:35 UTC: We are bringing Clever Cloud Metrics back up. It's now ingesting accumulated data in the queue while the storage backend was down.
16:45 UTC: Clever Cloud Metrics ingestion delay is back to normal.
17:16 UTC: The situation is currently stable but may deteriorate again. We are closely monitoring it. A postmortem will be published in the following days. If the issue comes back, this incident will be updated again. Sorry for the inconvenience.
17:31 UTC: A 30-second network interruption happened between 17:22:42 and 17:23:10; it was an isolated maintenance event performed by the datacenter's network provider.
07:01 UTC: This incident has been set to fixed as everything has been working fine, as expected, since the second optical fiber link was restored, except for the interruption mentioned in the previous update. Do note that as of now we are not at the normal redundancy level as the other two optical fiber links are still down. We will update this once we have more information.
10:23 UTC: We have confirmation that a non-redundant third optical fiber link has been added at 00:30 UTC, this is only meant to add bandwidth capacity, it does not solve the redundancy issue. However, our network provider also tells us that their monitoring shows that the redundant link just came back up; although this may be temporary and the link may not be using the usual optical path.
16:13 UTC: The redundant link that came back at 10:23 UTC is stable. It may be re-routed to use another physical path at some point but we can now consider that our inter-datacenter connectivity is indeed redundant again.
]]>This was due to an update that has now been rolled back.
]]>(The original incident text can be found at the end)
A network issue caused 17 minutes of full unreachability of the Paris zone which in turn caused some applications to go down and our deployment system to slow down while restarting affected applications as well as several other services.
10:12 UTC: The whole PAR network is unreachable from outside, cross-datacenter network is down as well.
10:16 UTC: The on-call team is warned by an external monitoring system.
10:21 UTC: Our network provider informs us that they are aware of the issue.
10:29 UTC: The network is back.
10:30 UTC: The monitoring systems are starting to queue a lot of deployments. The load of one monitoring system in charge of one of the PAR datacenters increases significantly. Other systems such as Logs, Metrics, and Access Logs (collection and query) are also impacted and unavailable. Some applications relying on FSBucket services (mostly PHP applications) are also having communication issues with their FSBuckets. This might have made some applications unreachable and their I/O very high, sometimes leading to Monitoring/Scaling deployments. This particular issue was detected later during the incident.
10:35 UTC: Our network provider confirms to us that the issue is fixed.
10:50 UTC: Deployments are slow to start because many of them are in queue.
11:00 UTC: The load of the faulty monitoring system being too high causes it to see more applications down than there actually are, and to queue even more deployments for applications that were actually reachable.
11:15 UTC: Clever Cloud Metrics is back, delayed data points have been ingested. Writing to the ingestion queue is still subject to problems.
11:20 UTC: We notice the build cache management system is overloaded, slowing down deployments and failing those that rely on the build cache feature. The retrying of these failed deployments adds even more items to the deployment queue.
11:28 UTC: We start upscaling the build cache management system beyond its original maximum setting.
11:52 UTC: We believe an issue found in the past few days within the build cache management system is responsible for the slowness/unreachability of the build cache service. This issue caused a thread leak which had been triggering more upscalings than usual. A fix was being tested on our testing environment but was not yet validated. We try to push this fix to production.
12:48 UTC: The fix pushed to production at 11:52 UTC is not effective. We upscale the build cache management system again.
13:00 UTC: Logs collection is back. Logs collected before this time were lost. Queries are also available but might still fail sometimes or return delayed logs.
13:05 UTC: We prevent the overloaded monitoring system from queuing up more deployments and empty out its internal alerting queue.
13:10 UTC: We rollback a change made on the database a few days ago, which we believe is the root cause of the ongoing issue.
13:16 UTC: The build cache management system database load starts to go up. This is caused by the application being more effective at making requests to the database thanks to the previous rollback.
13:18 UTC: The build cache management system database is overloaded.
13:33 UTC: We start looking into optimizing requests and clearing up stale data.
13:59 UTC: We manage to bring the build cache management system database load down.
14:05 UTC: The build cache management system is still overloaded/slow despite its database now working properly. A deployment is queued with an environment config change but is slow to start. We restart the application manually to apply this change.
14:10 UTC: The change of configuration is effective, the deployment queue starts to empty itself but there are still a lot of deployments in the queue.
14:15 UTC: An older deployment, performed without the environment change (which was still waiting to be processed), finishes successfully, leading to about half of the build cache requests failing.
14:17 UTC: We start reapplying the fix manually on live instances while a new deployment with the correct environment is started. The deployment queue size is going down.
14:29 UTC: The deployment queue is filling up again.
14:53 UTC: We realize the faulty monitoring system is still queuing deployments despite its alerting queue being empty and the alerting action being disabled.
14:57 UTC: We completely restart the faulty monitoring system and make sure it stops queuing deployments.
15:10 UTC: We are now certain the previously faulty monitoring system stopped queuing deployments for false positives. The deployment queue is back to normal and the deployment system is more reactive.
15:15 UTC: We start cleaning stuck deployments and making sure everything is working fine.
15:42 UTC: We start redeploying all Paris PHP applications which have not been deployed since the network came back.
16:00 UTC: Some PHP deployments seem to be failing due to a connection timeout to their PHP session stored on an FSBucket. We abort the PHP deployment queue to avoid any more errors.
16:10 UTC: The connection was only broken on one hypervisor and is now fixed. We also make sure every other hypervisor can contact all FSBucket servers on the PAR zone.
16:15 UTC: The PHP deployments queue is started again, with a lower delay between deployments.
16:42 UTC: Clever Cloud Metrics / Access logs ingestion is now fixed. Queries should be returning up-to-date data. Access logs were stored in a different queue and have been entirely consumed.
17:05 UTC: The PHP deployments queue is now completed. All other applications in the PAR zone, which had not been redeployed since the network came back, have also been queued for redeployment to fix any connection issue to their FSBucket add-ons.
19:10 UTC: A few applications which have the “deployment with downtime” option enabled were supposed to be UP but had no running instances. Those applications are now being redeployed.
Foreword: Clever Cloud has servers in two datacenters in the Paris zone (PAR). In this post-mortem, they are named PAR4 and PAR5.
A routine maintenance operation made by our Network Provider on PAR4 started a few minutes before the incident. This maintenance was about decommissioning a router that shouldn’t impact the network. Various checks and monitoring were in place, as usual, and a quick rollback procedure was planned in case anything went wrong.
The decommission triggered an unexpected election of another router, which then triggered a lot of LSA (link-state advertisement) updates between all the routers of the datacenter, sometimes doubling them. Those updates created new LSA rules on other routers, which first made them slower to update and to route traffic. Some of the routers then hit a configuration limit on the number of LSA rules. When hitting the limit, a router went into protection mode and shut itself down. This shutdown triggered further LSA updates on other routers, which then also hit their LSA limit and entered protection mode. This isolated the PAR4 site from the network.
A piece of internal equipment with a link between PAR4 and PAR5 also propagated those LSA updates to the PAR5 routers, replicating the exact same scenario.
To fix this, our Network Provider disconnected some routers, lowering the number of LSA announcements across the network and bringing the routers back online.
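To make the cascade easier to follow, here is a deliberately simplified toy model of it in Python (purely illustrative, not vendor router configuration, and the numbers are made up): every router accepts a bounded number of LSAs, a router over its limit shuts itself down, and each shutdown floods more LSAs onto the surviving routers.

```python
MAX_LSA = 100  # assumed per-router limit

class Router:
    def __init__(self, name, lsa_count):
        self.name = name
        self.lsa_count = lsa_count
        self.up = True

def propagate(routers, extra_lsa_per_shutdown=30):
    changed = True
    while changed:
        changed = False
        for router in routers:
            if router.up and router.lsa_count > MAX_LSA:
                router.up = False  # protection mode: the router shuts itself down
                changed = True
                for other in routers:  # the topology change floods more LSAs
                    if other.up:
                        other.lsa_count += extra_lsa_per_shutdown

routers = [Router(f"r{i}", 90) for i in range(4)]
routers[0].lsa_count = 120  # the unexpected election pushes one router over the limit
propagate(routers)
print([(r.name, r.up) for r in routers])  # every router ends up down
```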
We are currently experiencing a network accessibility issue on our PAR zone. We are investigating.
EDIT 12:21 UTC+2: Our network provider is looking into the issue.
EDIT 12:28 UTC+2: Deployments on other zones might not work correctly. But traffic shouldn't be impacted.
EDIT 12:30 UTC+2: Network connectivity seems to be back. We are awaiting confirmation of incident resolution from our network provider.
EDIT 12:35 UTC+2: Our network provider found the issue and fixed it. Network is back online since 12:30 UTC+2. Investigation will be conducted to understand why the secondary link hasn't been used.
EDIT 12:42 UTC+2: A postmortem will be made available later once everything has been figured out.
EDIT 12:50 UTC+2: The deployment queue is currently processing, queued deployments might take a few minutes to start
EDIT 13:00 UTC+2: Logs may also be unavailable depending on the applications
EDIT 13:20 UTC+2: Deployment queue still has a lot of items; the build cache feature is currently having trouble, which slows down deployments.
EDIT 14:33 UTC+2: The deployment queue is now shorter but some deployments still have issues. Logs are also partially available
EDIT 15:30 UTC+2: The build cache feature still has issues, we are currently working on a workaround. Logs should now be back but there is a processing delay which might affect availability in the Console / CLI. They might be a few minutes late.
EDIT 16:04 UTC+2: Some applications linked to FSBuckets systems might have lost their connection to the FSBucket, increasing their I/O and possibly rebooting in a loop for either Monitoring/Unreachable or Monitoring/Scalability. This can cause response timeouts, especially for PHP applications
EDIT 16:16 UTC+2: Build cache should be fixed, meaning that deployments should take less time
EDIT 16:53 UTC+2: There are still a lot of Monitoring/Unreachable events being sent, making a lot of applications redeploy for no good reason. We are still working on it.
EDIT 17:18 UTC+2: The issue with Monitoring/Unreachable events has been fixed. The size of the deployments queue should go down.
EDIT 18:07 UTC+2: Most issues have been cleared up. PHP applications may still be experiencing issues, we are working on it. If you are experiencing issues on non-PHP applications, please contact us.
EDIT 19:05 UTC+2: All PHP applications have been redeployed. If you are still experiencing issues, please contact us. All other applications which have not already been redeployed since the beginning of the incident will be redeployed in the next few hours (to make sure no apps are stuck in a weird state).
]]>EDIT 16:58 UTC: we mitigated the issues.
]]>EDIT 18:04 UTC+2: One of the reverse proxies stopped accepting new connections. It has been taken out of the pool for further investigation. Stability should have been restored for the last 2 minutes.
EDIT 18:18 UTC+2: Performance is back to normal. We are going to investigate further why this reverse proxy went into this state without being noticed.
]]>The maintenance itself should take no more than an hour. During this time, writes will be queued and reads will be partially available.
Once the maintenance is over, queued-up writes will start being ingested, reads will be available again (except for recent data until queued-up data points are ingested).
11:36 UTC: Maintenance is starting.
12:04 UTC: Maintenance is over. The ingestion pipeline is running at full speed catching up on the queued-up data.
12:18 UTC: Ingestion is caught up.
]]>As a reminder, this cluster is only used by free plans labeled "DEV". This is meant to be used for development and testing purposes only, not production.
If you are using a free plan in production, we suggest you migrate to a dedicated plan using the migration tool in the Clever Cloud console.
10:43 UTC: The cluster is working fine now although it may be slower than usual for now as a node is out of the cluster and will be re-added later.
12:23 UTC: The node mentioned in the last update has been re-added. The incident is over.
]]>EDIT 19:14 UTC: Logs should now be back to normal. Sorry for the interruption.
]]>We are in the process of adding capacity to resolve this issue.
14:28 UTC: Performance is back to normal.
]]>EDIT 12:46 UTC: we are experiencing abnormal new connection rates on public reverse proxies.
EDIT 12:50 UTC: we found the responsible application for this new connection rate and are mitigating it.
EDIT 14:19 UTC: Load balancers have been upscaled so they can handle more traffic. Performance is back to normal since 13:12 UTC.
]]>08:00 UTC: New logs are being ingested. Logs emitted during the incident will not be ingested in the main logs storage system. Log drains may start receiving (part of) the older logs, we are still investigating this part.
08:15 UTC: Looks like everything that could be ingested has been ingested. Ingestion delay may still be a little higher than normal though, it should go back to normal soon.
]]>Cellar-c2 cluster isn't impacted.
EDIT 08:23 UTC: Connection seems to be back, we have notified both network providers used for Cellar-c1 and are still awaiting an answer. We are waiting a bit more to see if the links are correctly back or if we should expect another issue.
EDIT 08:47 UTC: The connection is now down again.
EDIT 09:35 UTC: The connection has been back up for 15 minutes and the root cause may have been found. We are waiting for explanations from our network provider. In the meantime, this issue may also have affected applications that are connecting to external services. We've seen loss to Scaleway and Azure, there might have been more.
EDIT 10:25 UTC: The issue now seems to be resolved. The root cause wasn't entirely found; current investigations show that a transit provider had an issue and traffic was redirected elsewhere, maybe leading to saturation of some links (which would explain why the loss wasn't 100%, but more like 80%).
]]>EDIT: The issue has been fixed
]]>Data is still being ingested.
09:45 UTC: Incident is over.
]]>14:52 UTC: Network issue is resolved. We are assessing the damage.
15:07 UTC: API and deployments are down. We are cleaning everything and bringing it up.
15:20 UTC: API is back. Deployments are back but have a significant delay as of now.
15:42 UTC: We are still working on this. Deployments are quicker now but not yet back to normal.
16:02 UTC: This incident is over. If you are still experiencing issues, please contact us.
A maintenance operation carried out by our network provider a few hours before this incident generated a faulty BGP announcement. Because of this, a significant portion of traffic coming out of our Paris infrastructure was going out via a NYC peer, causing significant delay and even timeouts.
Routers in one of our Paris datacenters were heavily impacted by this issue and failed to accept configuration fixes. After multiple attempts to fix this, our provider ended up power-cycling the affected routers, which caused most of our hypervisors in this datacenter to be cut off from the rest of the network for 3 minutes.
Corrective actions will be taken to prevent this from happening again (BGP filters, dedicated admin network for the routers which was already scheduled to be set up in a few days). We will also make sure that we are warned in due time if a significant network configuration/hardware issue occurs.
]]>Affected clusters are:
This update may affect the performance and availability of the databases.
The upgrade will start in a few minutes. This maintenance will be updated accordingly
EDIT 18:28 UTC+2: Montreal cluster is now up-to-date
EDIT 19:54 UTC+2: Paris cluster is now up-to-date but postgis extension is currently broken due to the update. We are working on a fix
EDIT 20:27 UTC+2: Paris cluster: databases are currently being migrated to a newer version of postgis. It will take a few hours to run on all of the databases
EDIT 20:42 UTC+2: This maintenance is now considered as over
]]>Add-ons will start being migrated at 20:30 UTC+2. Hypervisor will be rebooted at 21:30 UTC+2
EDIT 20:36 UTC+2: Maintenance is starting. Applications are getting redeployed and add-ons are starting their migrations
EDIT 21:30 UTC+2: Add-ons that could be migrated have been migrated, applications have been redeployed. Server will now reboot
EDIT 22:00 UTC+2: Server has finished its reboot, add-ons that weren't migrated should have been reachable since 21:45 UTC+2. The maintenance is over.
]]>Affected applications are being automatically redeployed. Affected addons are unreachable.
21:53 UTC: The hypervisor is back online and is starting addon VMs.
21:55 UTC: All addons are back online. The incident is over.
]]>06:30 UTC: Incident is over.
]]>EDIT 23:02 UTC: the incident is related to one of our hypervisors.
EDIT 23:03 UTC: we restarted the hypervisor; related databases are down.
EDIT 23:04 UTC: hypervisor is up; VMs are starting.
EDIT 23:13 UTC: metrics are down too.
EDIT 23:25 UTC: databases are up. We are now experiencing issues with our internal reverse proxies; the console and API are not available.
EDIT 23:30 UTC: we queued the linked applications for a high-priority redeploy to ensure they reconnect to their databases. Core services are still partially down.
EDIT 0:00 UTC: all applications are redeployed.
EDIT 02:56 UTC: we are still working to fix issues on our internal core services (console, API); users applications/addons are not impacted.
EDIT 03:30 UTC: internal core services are back!
]]>After investigation, the hadoop namenodes were all in standby. At 23:33, after various checks, we promoted one back to active. We then restarted all the hbase regionservers and waited for the cluster to balance and heal.
At 00:04 we restarted the warp10 stores. At 00:07 everything was back to normal.
]]>16:13 - Rollback was successfully executed and everything is back to normal.
]]>17:33 UTC: The issue has been resolved. It was due to a partial upgrade (in progress) of the cluster. Upgraded nodes have been downgraded.
18:08 UTC: The upgrade was in progress to fix the security issue labelled as CVE-2021-20288. Due to the large number of machines, some of them were not yet up to date, which led to the issue we were facing. Some of the machines were unable to authenticate correctly, leading to a cascading failure of multiple machines that weren't yet patched. Another strategy will be used to continue the upgrade of the cluster.
]]>Edit 22:48 UTC: The deployments should be fine since 22:30, we just made sure that everything was okay. Deployments that were stuck were restarted, and those that failed can now be restarted without any issue. Sorry for any inconvenience.
]]>EDIT 14:37 UTC - fixed.
]]>The error started at 12:23:36 UTC and stopped at 12:46:50 UTC, lasting around 23 minutes.
]]>EDIT 13:21 UTC - fixed.
]]>EDIT 17:03 UTC - fixed.
]]>We'll update this incident in the morning. If OVH fixes the issue before then, the liar proxy should recover network access.
EDIT 11:06 UTC+1: This service is in SBG1, which is currently impacted by the fire that took place in SBG. It may take several days to come back online, depending on whether new servers can be ordered at OVH. If you are a user of this service, please contact us on the support if you have any questions.
]]>The culprit was a badly configured NOFILE limit on the RBX reverse proxies. We updated the setting accordingly.
Afterwards: We investigated all the reverse proxies on all the zones to make sure the NOFILE limit was correctly configured everywhere. We updated the reverse proxy software (sozu) to refuse to start when given too few NOFILE. We updated the sozu package to enforce the right NOFILE value upon installation.
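The "refuse to start when given too few NOFILE" guard can be illustrated with a few lines of Python (the real check lives in sozu, which is written in Rust; the threshold below is an assumption, not sozu's actual value):

```python
import resource
import sys

REQUIRED_NOFILE = 65536  # hypothetical minimum for a busy reverse proxy

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
if soft < REQUIRED_NOFILE and hard >= REQUIRED_NOFILE:
    # Raise the soft limit up to what we need if the hard limit allows it.
    resource.setrlimit(resource.RLIMIT_NOFILE, (REQUIRED_NOFILE, hard))
    soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)

if soft < REQUIRED_NOFILE:
    # Failing fast at startup beats failing under load with "too many open files".
    sys.exit(f"NOFILE soft limit {soft} is below {REQUIRED_NOFILE}; refusing to start")
```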
]]>The service is completely unavailable at the moment. We are working on it.
08:50 UTC: The faulty component is working. We are working on bringing everything back up.
08:59 UTC: Everything is back up. The ingestion pipeline is catching up.
09:07 UTC: The incident is over.
]]>EDIT 21:07 - fixed.
]]>The migration is expected to last at most 1 hour. All impacted applications will be redeployed during the migration. After the deployment, applications will be able to write to their bucket again. Read operations will not be impacted.
EDIT: This maintenance has been postponed to 15:00 UTC+1
EDIT 15:00 UTC+1: The maintenance is starting
EDIT 15:02 UTC+1: The buckets are now read-only
EDIT 15:14 UTC+1: Starting now, you can redeploy your applications if you want to regain write access early. Otherwise, affected applications will be redeployed automatically in the upcoming hour, starting with applications of Clever Cloud Premium customers
EDIT 17:14 UTC+1: The deployment queue finished one hour ago, everything has been working fine so far. This maintenance is over
]]>11:00 UTC: Maintenance is starting. Deployments are disabled.
11:02 UTC: API is down.
11:11 UTC: API and deployments are up again. Maintenance is over.
]]>The migration is expected to last at most 1 hour. All impacted applications will be redeployed during the migration. After the deployment, applications will be able to write to their bucket again. Read operations will not be impacted.
Emails will be sent to customers of the impacted add-ons.
EDIT 12:00 UTC+1: The maintenance will begin shortly
EDIT 12:04 UTC+1: The buckets are now read only
EDIT 12:13 UTC+1: The redeployment queue has started; it should not take more than 15 minutes.
EDIT 12:51 UTC+1: The maintenance is over, the queue ended 20 minutes ago and everything seems to be normal.
]]>10:44 UTC: We have found the cause and fixed the issue. It was due to an internal tool unexpectedly making too many costly requests.
]]>EDIT 13:18 UTC: Ingestion is working again, working at full speed to catch up.
EDIT 14:03 UTC: Ingestion has caught up since a few minutes ago, everything should be back to normal.
]]>The migration is expected to last at most 1 hour. All impacted applications will be redeployed during the migration. After the deployment, applications will be able to write to their bucket again. Read operations will not be impacted.
Emails will be sent to customers of the impacted add-ons.
EDIT 11:55 UTC+1: The maintenance will start on time.
EDIT 12:00 UTC+1: The maintenance is starting
EDIT 12:07 UTC+1: Applications are being restarted. The restart queue should be done in about 20 minutes
EDIT 12:41 UTC+1: The migration is over.
]]>13:06 UTC: The issue has been solved, ingestion is catching up.
13:10 UTC: Ingestion is all caught up. This incident is over.
]]>13:53 UTC: The issue is fixed. Everything is back to normal.
]]>EDIT 18:07 UTC: The IP has been restored. OVH blocked it after a 4-hour email notice about phishing that had escaped our own filters. Further investigations will be conducted to avoid this incident in the future.
]]>EDIT 13:26 UTC: The queue has been consumed. Logs should now be up-to-date.
]]>This server has been unavailable for 8 minutes.
]]>Dedicated addons are NOT impacted.
]]>EDIT 15:45 UTC: Two hypervisors went down. The impacted services are:
Add-ons -> add-ons hosted on those servers are currently unavailable
Applications -> applications that were hosted on those servers should be redeployed or in the redeploy queue
Logs -> new logs won't be processed. This includes drains. You might only get old logs when using the CLI / Console
Shared RabbitMQ -> A node of the cluster is down, performance might be degraded
SSH -> No new SSH connection can be made on the applications as of now.
FS Bucket -> an FS Bucket server was on one of the servers. Those buckets are unreachable and may time out when writing / reading files
EDIT 15:54 UTC: Servers are currently rebooting.
EDIT 15:59 UTC: Servers rebooted and the services are currently starting. We are closely monitoring the situation.
EDIT 16:07 UTC: Services are still starting and we are double checking impacted databases.
EDIT 16:11 UTC: Deployments might take a few minutes to start due to the long deployment queue.
EDIT 16:33 UTC: Most services should be back online, including applications and add-ons. The deployment queue is still processing.
EDIT 16:45 UTC: The deployment queue has been empty for a few minutes; all deployments should go through almost instantly.
EDIT 17:13 UTC: Deployment queue is back to normal.
EDIT 17:15 UTC: The incident is over.
]]>15:52 UTC: The issue has been identified and should be fixed. We are monitoring things closely.
16:11 UTC: Overall traffic in the logs ingestion pipeline is not completely back to normal. If one of your applications does not have up-to-date logs you can try to restart it.
16:32 UTC: We have forced the hand of a component of the ingestion pipeline making it catch up with the logs waiting in queue. It should go back to normal in a matter of minutes now.
]]>While investigating the issue, something broke in one of the reverse proxies which is causing availability issues. We are working on this.
10:25 UTC: The availability issue has been resolved. We are still working on resolving the performance issue.
10:32 UTC: We found the culprit and have implemented a work-around. Performance is back to normal. We are still working on an actual fix.
]]>EDIT 14:03 UTC: The problem is now resolved. Some connection issues happened but a retry would have worked.
]]>EDIT 22:30 UTC: Redsmin owners updated the certificate. Redsmin should now be available again
]]>EDIT 11:02 UTC: The server currently has no network. Add-ons hosted on it are currently impacted.
EDIT 11:16 UTC: The network has come back. Waiting for OVH confirmation on the end of the incident.
EDIT 11:19 UTC: OVH closed the incident, everything should be back to normal.
]]>EDIT 11:42 UTC: The issue has been fixed, metrics and access logs can be queried again. There is a delay (currently 30 minutes) in the ingestion that is currently being resolved.
EDIT 12:10 UTC: The ingestion delay is now resolved, everything should be back to normal.
]]>EDIT 18:53 UTC: The maintenance is still in progress.
EDIT 00:00 UTC: The maintenance is done, the custom metrics should be available again.
]]>13:54 UTC: Related to this issue, the API is unavailable at this time. We are working on it.
13:55 UTC: We stopped the deployments to avoid any more missing updates.
13:56 UTC: The API being unavailable means that the Console and the CLI will display various errors.
14:05 UTC: Git pushes are also unavailable; an error will occur. The main problem has been identified and we are working toward a resolution.
14:23 UTC: We are still working on fixing the root cause of this issue.
14:49 UTC: We are still working on fixing the root cause of this issue. In the meantime, we have managed to get a fully up-to-date configuration on some reverse proxies.
15:07 UTC: We believe we have fixed the root cause of the issue and are working on cleaning everything up.
15:15 UTC: Everything is looking good now. If you still have an issue, please contact us.
]]>The root cause has not yet been found, but this shouldn't have happened as we routinely perform such maintenance operations without any issues. We will look further into this. Apologies for the inconvenience.
]]>"HTTP 503 / This application is redeploying" or "HTTP 404 / Not Found" errors alongside the regular application responses.
The root cause of this is still unclear, additional investigations will be performed. A bit before 16:00, we had an incident on an internal tool that may be related.
]]>We are working on it.
16:23 UTC: This incident is over.
]]>The original incident started at around 05:15 UTC and we have been containing it since then with a lag under tens of seconds at worst.
It's now getting worse due to attempts at fixing the issue which are currently doing the opposite. This will take a while to solve.
11:17 UTC: The ingestion delay is now reduced to about 15 seconds. The issue is not completely solved, this is only a first step.
11:58 UTC: The ingestion delay is now back to normal. The root cause is not entirely fixed so this may come back but we will consider this incident as resolved for now.
]]>EDIT 21:10 UTC: The service has been back to normal for ~30 minutes.
]]>EDIT 17:22 UTC+1: The network has been restored on those servers. We are continuing to investigate which services are currently impacted. Applications that lost network connectivity to our monitoring are restarting. Applications that crashed because they lost their database access are also restarting.
EDIT 17:45 UTC+1: Deployments may still take some time to start or, for those ongoing, to finish. We are cleaning up the situation.
EDIT 18:17 UTC+1: Deployments are back to normal since 18:05. We are still cleaning up the rest of the mess and making sure everything is back to normal and working fine.
EDIT 18:25 UTC+1: Incident is over.
The issue arose during a maintenance operation by our infrastructure provider during which multiple power cables were disconnected from active switches. Some of our servers were linked to those switches, which cut their network access for 5 minutes. The backup network links of those servers were also affected, leading to a total loss of network connectivity. We will investigate this incident further with the infrastructure provider.
]]>11:03 UTC: The maintenance is starting, console is in maintenance mode.
11:06 UTC: Maintenance is almost over.
11:07 UTC: Maintenance is over.
]]>EDIT 17:55 UTC - we identified the issue (DDOS).
EDIT 17:56 UTC - we fixed the issue on internal reverse proxies.
EDIT 19:15 UTC - we are still working to fix the issue.
EDIT 20:30 UTC - fixed and the situation is back to normal. We will publish a post mortem.
16:45 UTC: Our monitoring throws an alert: public and internal reverse proxies traffic is abnormally decreasing. Dedicated reverse proxies for Premium clients are not impacted. The on-call team starts investigations;
16:53 UTC: We see a lot of HTTP requests timing out with PR_END_OF_FILE_ERROR randomly on multiple reverse proxies.
17:00 UTC: We identify lots of IPs running an abnormally shaped DDoS attack against identified domain names on our Paris infrastructure, which prevents reverse proxies from accepting connections and causes the reduced traffic;
17:30 UTC: After banning these addresses, new ones are used for the attack and we start banning IP ranges. During this period, we apply custom reverse proxy configurations to limit the attack's impact on various clients;
17:56 UTC: We are applying these bans on the internal reverse proxies, the internal situation comes back to normal; then we ban these on public reverse proxies;
18:00 UTC: Traffic is back to normal; PR_END_OF_FILE_ERROR disappeared and we are now facing SSL_ERROR_SYSCALL. We start investigating;
18:24 UTC: We determine these errors are due to configuration errors applied during the reverse proxies configuration changes.
20:06 UTC: All configurations are fixed, everything is working as usual. We are improving reverse proxies auto-configuration to avoid error-prone manual actions. We are fixing custom clients' configuration items and are watching monitoring data closely.
20:14 UTC: Reverse proxy improved auto-configuration is deployed.
20:30 UTC: We announce the end of the incident. The attack logs will be used to improve our DDoS detection system.
]]>Applications deployed on more than one scaler should not have been impacted (apart from database access, depending on your particular case). Applications deployed on a single instance had about a 50% chance of being affected.
This network incident had an impact on Metrics, the service was unavailable for 15 minutes after the incident and ingestion has been delayed for another 15 minutes.
As of now, we don't know exactly what happened but we expect that a router malfunctioned and went haywire for a minute.
]]>Our network provider is investigating the issue.
10:48 UTC: We no longer experience packet loss on this interconnection. We are awaiting more information from our network provider on the cause and resolution of this incident.
10:57 UTC: The issue is back, we are experiencing the same amount of loss again.
11:07 UTC: The issue went away again. We are still awaiting word from our provider.
11:32 UTC: We are experiencing packet loss again on the same link.
11:35 UTC: The issue went away again.
11:36 UTC: The issue, ultimately, lies with Free and we cannot do anything about it from our side. Until the root cause is properly fixed, the loss issue may come back off and on.
14:53 UTC: Our network provider tells us that the peering link has been affected by the side effects of a DDOS targeting another customer of our network provider. They are working on providing measures to prevent more attacks targeting this network which should in turn prevent this link from getting overwhelmed.
]]>We are investigating this issue.
13:47 UTC: The issue is fixed. All deployments have been working fine during this period, only delayed by a few seconds. The issue came from a misconfigured deployment component which was sending broken messages to hypervisors. The broken component has been dealt with.
]]>13:17 UTC - no other network loss. All critical parts of Clever Cloud have been checked and restarted to make sure they still communicate with each other.
]]>The issue is currently fixed and we are awaiting full resolution.
]]>EDIT 21:25 UTC: The issue is fixed. The PHP applications may not work correctly. We are redeploying them.
EDIT 22:30 UTC : Applications with FS Buckets have been redeployed. The incident is closed.
Post mortem: An incorrect manual action caused the FS Buckets system to follow the wrong path between storage nodes. We applied a fix to prevent this from happening again.
]]>EDIT 15:25 UTC: fixed. We are investigating the reasons.
EDIT 15:45 UTC: we identified the reasons and applied a fix.
]]>Data won't be lost, the ingestion is simply delayed.
Impacted products:
EDIT 14:03 UTC: Ingestion is now catching up on the delay, everything looks good. Looks like it may take 30 to 40 minutes to go completely back to normal.
EDIT 14:25 UTC: Ingestion has now caught up, everything should be back to normal.
EDIT 21:26 UTC: New issues are ongoing, we are investigating.
EDIT 22:16 UTC: Ingestion is running. We are consuming queues.
EDIT 23:30 UTC: Ingestion is back to normal. Fixed.
]]>EDIT 13:02 UTC: A change causing this issue has been backed out. We will investigate further why it went wrong despite working correctly on our test infrastructure. Sorry for the disruption.
]]>EDIT 10:54 UTC: Redsmin is currently working on a fix.
EDIT 19:54 UTC: The fix seems to be complete. Redsmin interfaces should now be able to load.
]]>For any support queries, you can send us an email at support@clever-cloud.com
EDIT 14:26 UTC: The issue has been found and should now be fixed. We will investigate it further to prevent it from happening again.
]]>We are monitoring the situation. Our new Cellar cluster (cellar-c2.services.clever-cloud.com) is still reachable and works fine.
EDIT 12:02 UTC: A reverse proxy node is somehow still able to communicate with the nodes on Scaleway. All cellar-c1 traffic has been routed through that reverse proxy and requests should be served as expected.
EDIT 12:34 UTC: The network issue does not seem to be on Scaleway's side per se, but rather on the Level3/CenturyLink side, which is a more global network provider.
EDIT 15:17 UTC: The incident on Level3/CenturyLink seems to be resolved. The cluster is now fully reachable.
]]>EDIT 15:17 UTC: The incident on Level3/CenturyLink seems to be resolved. The cluster is now fully reachable.
]]>reaching our services: if your ISP uses this provider, you might experience timeouts reaching our infrastructure
reaching external services from our infrastructure: if you contact external services from our infrastructure, the peering routes might use this network provider and your requests might time out too.
This incident will group the previous opened incidents:
https://www.clevercloudstatus.com/incident/294
https://www.clevercloudstatus.com/incident/295
We do not have an ETA for the service to come back to normal.
EDIT 15:17 UTC: The incident on Level3/CenturyLink seems to be resolved. All connections, either incoming or outgoing to/from our services, should be working as expected. Please reach out to our support if not.
]]>This was caused by a human error, partly related to a laggy UI (low-level UI of a server manager used for a group of servers).
The person who triggered this realized the issue immediately and restarted the server, which stopped responding to our monitoring for a total of 3 minutes.
Chronology:
14:01:30 UTC: The server goes down
14:04:30 UTC: The server responds to our monitoring again and starts restarting static VMs (add-ons and custom services)
14:07:05 UTC: The last static VM starts answering to our monitoring again.
Impact:
Customers with add-ons on this server will find connection errors in their application logs during those 3 to 6 minutes and those applications most likely responded with errors to end users during that time.
Customers with applications with a single instance which happened to be on that server will have experienced about 2 to 3 minutes of downtime before a new instance started responding on another server.
]]>This impacts:
The issue has been identified and we are working toward a fix.
EDIT 14:07 UTC: The problem has been solved and the access logs stored have been processed. You should now be able to have an up-to-date livemap and fetch recent access logs using the CLI / API. Request count will be affected and won't be computed for the time window the access logs were not processed.
]]>14:30 UTC: It looks like an issue with the storage backend, we are working on bringing it back to life.
14:52 UTC: The storage backend looks fine but writes are still failing. We are still investigating this issue. It may take a while.
15:11 UTC: Again, the storage backend looked perfectly fine... restarting everything did fix the issue though so then again maybe it wasn't fine after all. Writes are functional, ingestion is working at full speed, fresh data will be available in ~20 minutes.
15:30 UTC: Ingestion delay is back to normal. Incident is over.
]]>16:23 UTC: Some storage nodes were misbehaving. The issue is now fixed: reads are functional again and ingestion is now catching up.
16:28 UTC: Ingestion delay has been divided by two, incident should be over in under 10 minutes.
16:28 UTC: Incident is over.
]]>EDIT 13:27 UTC: all access logs should be available again since 12:50 UTC. The root cause has been identified and will be addressed. Some logs may have been lost during that timeframe.
]]>During this incident, you may have seen random issues while opening new connections to your databases.
]]>The incident on Cloudflare side: https://www.cloudflarestatus.com/incidents/b888fyhbygb8
EDIT 21:35 UTC: The DNS resolution seems to be back again, our services are currently reachable from our point of view. It may vary depending on your location.
EDIT 22:51 UTC: Cloudflare implemented a fix and we did not see any new issue since then. This incident is now closed.
]]>EDIT 16:30 UTC: fixed.
]]>EDIT 14:32 UTC - the cluster has been upscaled.
]]>At the moment, we know the problem is affecting customers of the French ISP Orange.
08:28 UTC: We found that Orange NS servers were indeed still using the faulty NS records from last night's incident. We have updated the zone on those name servers which should have never been used in the first place and hopefully Orange customers will be able to resolve our domains (and by extension their domains) properly.
08:42 UTC: Looks like the propagation is quite fast and this indeed fixed the issue for affected customers.
]]>EDIT 20:04 UTC - fixed.
]]>The service has been restarted and we will monitor it closely, as well as adding monitoring to better catch this ramp up.
Dedicated databases are not impacted by this issue. If you are impacted, you can migrate your free plan to a dedicated plan using the migration feature. You can find it in the "Migration" menu of your add-on.
22:41 UTC: Load seems back to its normal state. The monitoring has been adjusted and we should then receive an alert at the start of the event instead.
23:19 UTC: The issue is back, the load is not as high as before but it might make the cluster slow.
23:54 UTC: Users impacting the cluster the most have been contacted to avoid this issue. Further actions will be taken later today if the issue persists.
2020-07-01 06:25 UTC: The node crashed due to a fatal assertion hit and restarted
06:38 UTC: The node is still unreachable for an unknown reason
07:48 UTC: The cluster is currently being repaired. For an unknown reason, nodes wouldn't listen to their network interfaces.
09:32 UTC: The repair is halfway through. The cluster might be able to be up again in ~1h30
11:50 UTC: The repair is done, the node successfully restarted. You should now be able to connect to the cluster. We are now restarting the follower node so it can rejoin the cluster.
15:09 UTC: The leader node crashed again because of an assertion failure which means it is now unreachable again as mongodb reads its entire journal and rebuilds the indexes.
15:30 UTC: It usually takes 1h30 for mongodb to read the whole journal so it should be up again around 16:20 UTC.
16:34 UTC: It is taking longer than usual.
20:06 UTC: The restarts weren't successful. The secondary node successfully started at some point but was shut down to avoid any issue with the primary one. We'll try starting it again.
2020-07-02 09:15 UTC: The first node has been accessible now and again but keeps on crashing due to user activity. The second node failed to sync to the first node so it cannot be used as primary right now. We are now trying to bring the first node back up without making it accessible to users so we can at least get backups of every database. Once this is done, we will update you on the next steps. This process will take a while as Mongo takes hours (literally) to come up after a crash.
12:00 UTC: The first node is finally back up (but incoming connections are shut off for now). We are now taking backups of all databases, you should see a new backup appear in your dashboard in the coming minutes / hours. Once this is done, we will start working on bringing the second node back in sync. Once the cluster is healthy, we will bring it back online.
14:30 UTC: Backups are over, customers who were using the free shared plan in production can create a new paid dedicated add-on and import the latest backup there. Meanwhile, we are now rebuilding the second node from the first one to make the cluster healthy again. Once it's over, we will bring the service back up (if everything goes well).
15:55 UTC: The second node is synced up and the service is available again. We are still monitoring things closely.
18:35 UTC: The service is working smoothly, no issues or anomalies to report.
]]>08:37 UTC: Ingestion delay is back to normal. Incident was caused by a few storage nodes misbehaving after a short network issue.
]]>EDIT 14:50 UTC: situation is back to normal.
]]>Applications using redis as a session backend were not impacted by the session issue. They may have been impacted if your application generates temporary files, which are stored on the same fs-bucket.
Our clean-up policy for temporary files was not aggressive enough; we'll now clean them up once a day and will continue to monitor whether we need to increase the available disk space.
This incident started at 11:33 UTC+2 and was fully resolved at 11:41 UTC+2
]]>EDIT 15:20 UTC: fixed.
]]>19:00 UTC: If your application deploys, it will not be up-to-date: it will continue to serve the old content, and the old instance will be kept until this incident is over.
19:12 UTC: The issue has been identified, we are fixing it.
19:20 UTC: The issue was caused by the configuration checker, which took way more time than usual before applying each configuration change. A configuration option to disable those checks inside the program handling the configuration has been enabled. The configuration remains checked by the reverse proxy itself but it is way faster.
Deployments should now be up-to-date.
]]>15:22 UTC: We continue to investigate what's been impacted. Deployments are currently disabled to recover from the event.
15:27 UTC: Deployments are now available.
15:56 UTC: The situation on the platform is stabilized. It seems the outage was between both of our datacenters in the Paris zone. We are asking our hosting provider for more details.
16:05 UTC: Our network provider came back to us. The network outage lasted for 1 minute and 20 seconds. One of the links was lost between those two datacenters. The backup link should have been up 2 seconds after the loss of the first link. But for some reason it did not switch (or not correctly). After a 1 minute timeout, all links were closed and reset leading to a new link election which takes ~20 seconds. From there, the connection has been restored. Our network provider will continue to investigate why the initial backup link did not switch.
Once the network started working again, our monitoring was able to check what was currently "down". The services that were down were restarted but nothing should have impacted reaching your application (it was mostly internal services). Add-on connections from applications should have been back at the same time, but if your application crashed because it couldn't reach the add-on, it should have been automatically redeployed once the deployment system was up again, which was a bit before 15:27 UTC.
We are sorry for the inconvenience this outage created. The time of this incident has been changed from 15:06 UTC to 15:04 UTC to correctly match the actual start time.
]]>An index node has been restarted to upscale it. Its replica did not like the surge of requests and decided to crash a few seconds later. We are currently in the process of upscaling all index nodes to avoid such issues, those 2 nodes were the last remaining on the list.
Index nodes have to scan the whole dataset on start, this will take close to an hour to resolve.
08:07 UTC: Incident is over.
]]>13:38 UTC: Incident is over.
]]>EDIT 06:58 UTC: Ingestion is back at its normal rate, we are currently under the 30 minutes of delay. This should be at 0 seconds of delay in the next couple of minutes.
EDIT 06:18 UTC: Ingestion delay has also been back to normal for a few minutes. Incident is over. Everything (access logs / metrics) should have the latest data again.
]]>Update 21:04 UTC: The GEO IP feature has been fixed. It seems to have initially broken with an auto-update of the GEO IP library but more tests will need to be conducted to be sure of the root cause. All access logs between 18:47 UTC and now have been consumed and you should now be able to query them. We will work on improving the monitoring of the whole system to detect this kind of issue faster.
]]>20:30 UTC: Maintenance is starting.
20:35 UTC: API and deployments are back up, maintenance is over.
]]>Sorry for the inconvenience. We keep watching the status of the deployment system to make sure the problem is indeed resolved.
EDIT 10:40 UTC: Everything is back to normal.
]]>A chunk and its replica are both non-responding which means the service as a whole is unavailable. We are working on it.
10:00 UTC: An index node being unavailable threw us off on the wrong track. Its replica was actually working just fine, the issue was with both front read nodes being stuck at the same time. We will improve monitoring and try to figure out what went wrong and why.
]]>14:35 UTC: We found the cause of the issue and are working on fixing it.
14:47 UTC: The root cause is fixed and the ingestion is now running at full speed. The misconfiguration issue was just half the story, what caused this issue was a partial network split.
14:56 UTC: Ingestion is all caught up. Incident is over.
]]>EDIT 13:12 UTC+2: It also impacts the real-time map in the console. You may not see live queries to your applications, but your applications still receive requests as usual.
EDIT 13:22 UTC+2: Fixed; but during the downtime period the access logs were deleted. We identified the root cause and are fixing it.
]]>Dedicated add-ons (XS SmallSpace and above) were not impacted
]]>06:38 UTC: Everything is back online, ingestion is catching up.
06:52 UTC: Ingestion delay is back to normal.
]]>EDIT 13:01 UTC: fixed.
]]>Ingestion is failing. Access to metrics may be difficult.
15:42:30 UTC: The network is back to normal. We are working on getting the ingestion back to its normal state. Metrics access may be shut down temporarily during this.
16:00 UTC: Ingestion is back online, working through 50 minutes of data.
16:14 UTC: Ingestion delay is almost back to normal.
16:17 UTC: Ingestion delay is back to normal. Incident is over.
]]>We are looking into it.
15:42:30 UTC: The network is back to normal. We are making sure the service goes back to normal.
16:15:00 UTC: Replication of objects created during the incident is ongoing. Service is operational but can be a little slower than usual.
17:05:00 UTC: Everything is back to normal
]]>Our CLI and Console were impacted.
We will investigate this incident further.
]]>11:02 UTC: Ingestion is back online. It's unclear exactly what went wrong at the moment but it is most likely linked to the issue from yesterday. A complete reboot of all storage nodes 'fixed' the issue. Those storage nodes now have 48 minutes of buffered data to ingest.
11:11 UTC: Ingestion delay very close to normal.
11:17 UTC: Ingestion delay is back to normal.
]]>During the upgrade process, one of the workers of this reverse proxy continued to accept connections but didn't process them and kept them until the requests timed out. The issue has been resolved by 11:10 UTC+1 and will be investigated further. This is the first time the upgrade process has failed us in months and we will take extra steps to avoid and detect this issue faster.
]]>Because of this, we disabled ingestion temporarily which will make things easier to debug and fix.
17:26 UTC: Network issue seems to be gone, ingestion is restarted
17:31 UTC: Ingestion is going smoothly. As of now, we don't know what happened network-wise, we are awaiting word from our provider. As of now, it looks like a congestion issue from our point of view.
17:35 UTC: Ingestion delay back to normal
]]>We will investigate this further as we have monitoring for such a case and it apparently didn't trigger here.
]]>The issue has been resolved and the root cause has been found. A patch will be applied to avoid this happening again.
]]>EDIT 16:25UTC: fixed.
N.B.: between the issues and the deployments deactivation, some applications were responding with HTTP 503. It's now fixed.
]]>We don't know exactly what happened at this time but it looks like the impact was fairly minimal on actual users as we can't see any meaningful dip in aggregated incoming bandwidth usage of load balancers.
This post will be updated once we get more details from our network operator.
]]>EDIT: it's now fixed, app status and ssh access are now operational.
]]>This is caused by multiple instances of the same component crashing at the same time.
We are working on fixing this, this may take a while for a definitive fix (30 minutes at best, 1h30 at worst).
14:41 UTC: Metrics are currently available but this will probably not last as there is only partial redundancy on the affected component and the cause of the crash is not fixed
15:23 UTC: Metrics cannot be queried again
15:33 UTC: Metrics can be queried, but issues may still arise from time to time, issue is still not fixed.
15:45 UTC: Two nodes of the storage backend crashed under the load caused by the reload of the first components, this caused a delay in the ingestion and a pause in the reload of the first components. At this time, ingestion is catching up on the delay and queries are running fine despite the issues. You will most likely encounter issues as we work our way through this.
16:48 UTC: We have complete redundancy, this issue is now fixed.
]]>EDIT 15:28 - we are still experiencing issues, we are working on a fix;
EDIT 15:39 - fixed.
]]>This issue has been fixed at 09:12 UTC.
From 08:56 UTC to 09:12 UTC, all clever ssh commands would hang forever.
Since 09:12 UTC, you may get the message "Opening an ssh shell." and then nothing. If this does happen, you will have to restart the application you are trying to ssh to.
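As a rough pointer (assuming you use the clever-tools CLI and the application is linked in your current directory; otherwise restart it from the console), the restart can be triggered with:
clever restart
Once the new instances are up, clever ssh should open a shell again.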
]]>EDIT 18:24 - We identified the issues; linked applications are redeploying.
]]>14:17 UTC: A component of the "live logs" part of the pipeline was a bit overloaded and gradually started slowing everything down until it became noticeable. It has been restarted and the pipeline is now working through the delayed logs waiting in the queue.
14:21 UTC: The load came back up soon after the restart, we are working on bringing it down; we may have to shut it down temporarily to scale it up (quick note: we are working on a new pipeline which can be scaled at will without any downtime)
14:25 UTC: We are temporarily shutting down the Logs API to make things easier.
14:34 UTC: Logs API is back and delay is back to <5 seconds, we are still watching the situation closely.
14:58 UTC: Everything is indeed back to normal.
]]>EDIT 13:55 UTC: fixed.
]]>Metrics cannot be read either; this includes access logs, hence the overview of your organizations is not available.
16:20 UTC: The issue is fixed, ingestion is working again. Overview is still not loading for now (because recent data is not there).
16:34 UTC: There was another issue with the reading part, which is now fixed. Everything is now working as normal. Though there may be some hiccups with the ingestion in the coming minutes.
16:43 UTC: This issue is resolved. Sorry about the inconvenience.
]]>During approximately 1 hour, Metrics and access logs (dot maps / requests count in the console) will be unavailable both in reading and writing starting December 26th at 14:00 UTC+1.
All data will be kept and ingested at the end of the maintenance.
EDIT 13:00 UTC: Maintenance is starting
EDIT 13:23 UTC: Initial steps are done; writes have been delayed by up to 8 minutes and some reads may have failed. The second phase of the maintenance will begin shortly.
EDIT 14:32 UTC: Second phase is over. There were two ingestion delays, peaking at 4 minutes each. The maintenance is not over yet but it should not impact the ingestion nor the read.
EDIT 14:58 UTC: It should not have had any impact but it still did. Ingestion is delayed, reads are impossible; we are investigating.
EDIT 15:21 UTC: The issue is solved; reads are back, ingestion is working
EDIT 15:31 UTC: Ingestion delay is back to normal
EDIT 16:00 UTC: Maintenance is over.
]]>17:45 UTC: Ingestion is back to normal performance, delay will be back to normal in 15 minutes.
]]>18:42 UTC: We are still working on it. This is a never-before-seen, massive issue so we are unable to give any ETA at this time.
22:35 UTC: The issue has been narrowed down and is now under resolution. We will wait until tomorrow morning to continue restoring this service. All metrics gathered before this incident are still accessible, only new metrics are not. Those are currently stored and will be processed once the Metrics cluster goes back to normal. More news tomorrow morning.
12:00 UTC: We have been back working on this since 7:30 UTC, things are looking good; still at least a few hours to go.
13:55 UTC: The issue with the storage platform is now finally fixed. The ingestion is now running at full speed and catching up; it's processing the 22 hours of data which have been accumulating.
15:25 UTC: We are about halfway there.
16:50 UTC: We are 4/5 of the way there. It should be resolved in under an hour.
17:30 UTC: You should now already see recent points in your applications' metrics. Delay will be back to normal in less than 30 minutes. Closing this off.
]]>EDIT 19:44 UTC: fixed, the logs collection is catching up its lag.
EDIT 19:49 UTC: back to normal state.
]]>EDIT 11:03 UTC: The cluster is now back up. A node was shut down for maintenance reasons, as has already happened in recent weeks. Somehow the data it hosted was unavailable even though replicated data is available on other nodes. We will investigate this incident further.
]]>08:16 UTC: The issue is resolved
]]>EDIT 16:32 UTC: The network issue seems to be resolving; only one of our datacenters had the issue but it may have impacted applications and add-ons that weren't in this datacenter.
EDIT 16:36 UTC: Console is not stable because of Clever Cloud API issues due to datacenter network problems.
EDIT 16:40 UTC: Our network provider is already aware of the issue and is looking into it.
EDIT 17:00 UTC: Our datacenters still have issues, we are working on it with our provider.
EDIT 17:17 UTC: The network issue on our datacenters is over but it caused additional issues. API is currently having issues and our console is unreachable at the moment.
EDIT 17:34 UTC: Console and API are up again and we are making sure that all services are up and running again.
EDIT 19:26 UTC: The incident is currently over and nothing has come up since 17:34 UTC.
We are still waiting for more information from our Network Provider that we will add here as soon as we get it.
The network perturbation lasted from 16:18 UTC to 16:30 UTC. One of our datacenters experienced high packet loss due to routing issues. Those issues only impacted external traffic (communication between our 2 datacenters was not impacted). Applications and add-ons were UP but unfortunately, because of those routing issues, you may have experienced difficulties reaching your applications.
Those issues also impacted some of our systems and made our API / Console unavailable for 1h during which deployments were also not working.
]]>EDIT 05:30 UTC: routes have been updated to avoid the faulty router. Traffic is back to normal.
]]>Service should be back in 15 minutes.
Meanwhile, ingestion is still working fine.
15:01 UTC: Incident is over.
]]>10:26 UTC: The ingestion issue is fixed, the system is now catching up.
10:33 UTC: The ingestion delay is almost back to normal.
10:36 UTC: There is still a bit of a lag but it should come back to normal in a few minutes. Read performance is still a bit hit or miss but coming back to normal as well. We will reopen the incident if it does not.
11:06 UTC: The ingestion lag is increasing. We are investigating. This may take a while.
11:30 UTC: The cause has been identified and partially fixed.
11:37 UTC: Lag is now <5s ; we are currently working on fixing the issue in a more permanent way.
11:45 UTC: The issue is now fixed.
]]>13:20 UTC: The issue has been identified and at least partially fixed. Logs are coming through but we are still making sure that everything is indeed fine.
13:25 UTC: The issue is indeed fixed. Some older logs are still being collected.
13:33 UTC: Incident is over.
]]>08:40: The problem has been alleviated by allowing more connections. It will slow down the service but you can at least connect to your databases and migrate to paid add-ons if you were using this service for production. We will start a new cluster very soon to improve performance.
]]>17:21 UTC: Incident is over. A monitoring component was still complaining about a few applications in a loop, there was no actual issue, just a very overzealous alerter process. Deployments performance has been back to normal since 16:43 however.
]]>12:44 UTC: The delay is now back to normal. Some deployments may be stuck though, please contact us if you are experiencing such an issue.
]]>EDIT 20:51 UTC: fixed.
EDIT 23:19 UTC: the logging infrastructure is experiencing issues. We are working on a fix.
EDIT 23:25 UTC: fixed.
]]>EDIT 00:00 UTC: Cluster is now available again, no failover happened.
]]>EDIT 06:22 UTC: Logs are now available again. No logs should have been lost but they might be out of order until 06:15 UTC.
]]>EDIT 00:21 UTC: The cluster is getting back to normal, errors have already significantly decreased and most of the requests should now be successful. We keep monitoring failed requests.
EDIT 03:00 UTC: No more failed requests over the last 30 minutes, the incident is closed. We are still in the process of migrating this cluster's data to the new cluster. Until we automatically migrate your buckets, you can migrate them yourself. Feel free to contact our support for more information.
]]>12:11: An orchestrator was experiencing intermittent network issues. The issue is now fixed.
]]>Applications which had instances on these hypervisors have been redeployed automatically because the monitoring could not reach them (even though they were available).
]]>EDIT 9:29 UTC: fixed.
]]>13:27 UTC: Issue fixed.
]]>EDIT 22:05UTC: the hypervisor is restarted.
EDIT 22:20UTC: incident fixed.
]]>EDIT 19:43 UTC: The maintenance is starting, API will be shortly unavailable.
EDIT 19:49 UTC: The maintenance is over!
]]>EDIT 20:03 UTC: The maintenance should start shortly. We will keep you updated on its progress.
EDIT 20:53 UTC: The maintenance is still ongoing. Nothing unusual to report as of now
EDIT 21:20 UTC: Everything is going smoothly as seen in our tests. Nothing unusual to report as of now
EDIT 21:42 UTC: The maintenance is over. No network interruptions have been noticed by our monitoring systems. Everything is back to normal.
]]>Here is a non exhaustive list of affected actions (some of them will succeed):
EDIT 23:30 UTC: Our payment processor issues should now be resolved. Everything should be back to normal on our side too.
]]>10:02 UTC: Deployments queued now will be postponed until the end of the maintenance.
10:04 UTC: The main API is now unavailable.
10:06 UTC: The main API is restarting.
10:09 UTC: Maintenance is over. The main API is available, pending deployments are starting.
]]>An automatic restart at 09:21:48 UTC made them unavailable until the configuration was re-generated without the error at 09:24:40 UTC.
Steps will be taken to prevent this error from happening again.
]]>EDIT 22:49 UTC: Our API is also down for now, that's expected. The console is therefore down too. Clients' websites remain accessible.
EDIT 23:11 UTC: Network came back 5 minutes ago, we are currently checking if everything is ok
EDIT 23:26 UTC: Applications with fs-buckets (including PHP applications) may have issues loading because of their connection to the fs-bucket server, if this server was in the datacenter that lost connectivity.
EDIT 00:26 UTC: Applications with fs-buckets are currently redeploying. Most of them successfully reconnected (sometimes after several minutes) to their bucket server. The incident is over.
]]>We will perform a migration of the Git repositories. Once deployments are enabled again, you may have to wait a few more minutes depending on your DNS cache.
19:00 UTC: Maintenance is starting, deployments are now disabled (except for Github deployments).
19:13 UTC: The maintenance will last longer than initially planned, we are experiencing an issue and are looking into it.
19:15 UTC: The issue is fixed. We are making sure that everything is indeed fine. Some deployments may now go through, depending on your DNS cache.
19:30 UTC: Maintenance is over; if you encounter an issue, please refresh your DNS cache.
]]>We will perform a migration of the Git repositories. Once deployments are enabled again, you may have to wait a few more minutes depending on your DNS cache.
EDIT: This has been postponed.
]]>It should be quicker than that but if you do have deployments planned, make sure to start them well before the beginning of the maintenance.
EDIT 10:01 UTC: Maintenance is starting now, deployments are disabled.
EDIT 10:19 UTC: Deployments are enabled again.
EDIT 10:31 UTC: Deployments are disabled again. Dedicated reverse proxies for Clever Cloud APIs are out of sync, our APIs are down at the moment. We are working on it.
EDIT 10:39 UTC: Main API is back online.
EDIT 10:47 UTC: Reverse proxies are in sync, deployments are enabled again. We are cleaning up.
EDIT 10:53 UTC: Maintenance is over.
]]>It should go back to normal in 30 to 60 minutes.
EDIT 8:56 UTC: There are still clean-up operations in progress which slow down the cluster. Error rate is going down though.
EDIT 9:55 UTC: Incident over since 9:40
]]>EDIT 18:41 UTC: The problem has now been fixed for a couple of minutes. We gathered information as to why this problem happened and will try to narrow it down.
]]>The issue is now fixed, but deployments will take a little while longer to start until the queue is consumed.
EDIT: Incident over at 09:40 UTC
]]>The new Cellar cluster is not impacted by those issues.
EDIT 23:40 UTC: Cluster now seems to be in a good shape again
]]>EDIT 12:33 UTC: We may have identified the root cause. It may be due to a change that happened this morning. We will revert it.
EDIT 12:43 UTC: The change has been reverted and we confirm that it resolves the issue. Sorry for the inconvenience.
]]>EDIT 23:30 UTC: Other nodes need to be restarted. We saw <1% of failing requests, expect the same amount for the remaining restarts.
EDIT 02:00 UTC: Nodes have been restarted, failing requests are getting lower and lower, still under 1%.
]]>EDIT 15:32 UTC: The issue has been identified, we are currently re-deploying the API. Console is still unavailable.
EDIT 15:34 UTC: The API successfully redeployed and is now available. Console is now available too. The incident is over.
]]>EDIT 14:27UTC: finished.
]]>EDIT 15:31 UTC: The issue has been fixed. Some of the logs were lost but not all of them, you should have the last ~15 minutes, the buffer wasn't large enough to keep them all. We will increase it next week.
]]>EDIT 16:20 UTC: Problematic queries have been killed and the cluster load is going down. We continue to monitor the situation but it should go back to normal. We also have a newer MySQL shared cluster on MySQL version 8. You can migrate your database to it using the "Migrate" tool.
EDIT 16:45 UTC: The performance issue is back, we are trying to narrow down the issue
EDIT 17:00 UTC: Performance is back to normal again. We will keep an eye on it. Meanwhile, do not hesitate to migrate to our new cluster to avoid this issue.
EDIT 10/05/19 08:10 UTC: The issue has come back.
EDIT 10/05/19 12:00 UTC: Owners of the potentially abusive queries have been notified. Cluster performance is back to normal. As usual, we will keep an eye on it.
]]>EDIT: Issue resolved at 15:48:20 UTC
]]>EDIT 09:41 UTC: 503 errors are now gone but were replaced by 500 errors that get triggered after a few seconds. We are checking the cluster's state
EDIT 10:10 UTC: Error rate is decreasing but remains significant. Deployments are also impacted by this issue if you are using the build cache.
EDIT 10:27 UTC: Error rate is still at ~20% and continues to decrease.
EDIT 11:52 UTC: We did not receive any errors since 11:40 UTC, the cluster is now in good shape and everything should be back to normal.
This cellar cluster will soon be deprecated (new cellar add-ons are already created on an up-to-date cluster) in favor of a better and maintained version.
]]>EDIT 16:11 UTC: fixed.
]]>EDIT 17:56 UTC: the systems are going back to normal. It was a DNS resolver problem.
EDIT 17:58 UTC: fixed.
]]>This means that your browser may show you a security alert when visiting a cleverapps.io site.
We are looking into reporting the mistake to the relevant lists and services.
Meanwhile, we remind our users that they should never use a cleverapps.io domain for production; they should only be used for development and tests.
]]>It should go back to normal gradually and will not take more than an hour at the most.
EDIT 17:00 UTC: Error rate and performance is back to normal
]]>EDIT 7:25 UTC: multipart uploads are down, the fixes are ongoing.
EDIT 15:38 UTC: the cluster has been fixed, everything is back to normal.
]]>EDIT 06:12 UTC: The network issue is over. This was an issue with our provider which affected all our servers but not all at the same time. Nothing was actually fully unreachable at any point in time but there was a lot of packet loss.
]]>EDIT 9:19 UTC: This only affects the older SFR network, not the SFR-Numericable network. This specifically affects all SFR peering going through TH2.
EDIT 9:50 UTC: This has been resolved at 9:36:30; if you are still experiencing issues, please tell us.
]]>Console is partly down. Some APIs are down.
EDIT 18:20 UTC: Here is the history and context of the network issue:
At 17:25, a maintenance on a component of a redundant network link caused one of the underlying links to fail. For reasons unknown at this time, the failing link was elected and about 30% of packets were lost until 17:29.
At 17:30, the network engineer decided to revert the change; this caused additional loss for about 30 seconds. Network was back to normal at 17:31.
]]>EDIT 14:01 UTC: Error rate is back to normal. Response times are going down, we are still watching the situation closely.
EDIT 15:40 UTC: We are seeing an elevated error rate again, this was caused by a restart of a node which triggered a very high load on other nodes (which is not supposed to happen). We are investigating.
EDIT 16:30 UTC: The error rate went down significantly but it's not over yet. We sadly cannot give any meaningful ETA as of now.
EDIT 16:55 UTC: The error rate is close to normal. One node is still in trouble and it's causing a few errors; it should resolve quickly.
EDIT 17:15 UTC: The failing node went back to normal at 17:02. We are still seeing a few errors for write requests as of now.
EDIT 17:23 UTC: The error rate is back to normal. A few nodes are still a bit slower than usual so performance is a bit hit or miss but it should go completely back to normal in up to an hour.
]]>EDIT 15:33UTC: fixed.
]]>We didn't see any new timeout since 23:45 UTC but we continue to monitor the service.
]]>EDIT 10:30UTC: fixed.
]]>EDIT 21:00 UTC: fixed.
]]>EDIT 15:28 UTC: All add-ons should be back online, some of them took longer than expected to recover. The cause of the reboot will be investigated.
]]>EDIT 16:42 UTC: The root cause has been found. We are redeploying core components to clean everything.
EDIT 16:50 UTC: Deployments are available since a few minutes now. We are still cleaning things up. Sorry about the issue
]]>EDIT 20:50UTC: we are hard rebooting the hypervisor.
EDIT 20:55UTC: the hypervisor is up, the addons hosted on it are starting.
EDIT 20:58UTC: fixed.
]]>10:27 UTC: The issue is now fixed
]]>We are investigating the issue. Deployments are stopped until we find the root cause.
EDIT 16:35 UTC: Applications state should now be OK. Deployments are still stopped until we figure out the issue.
EDIT 16:40 UTC: Problem has been identified. We will resume deployments in a few minutes. All deployment actions were queued and will be consumed.
EDIT 16:42 UTC: Deployments are enabled again. It may take a few minutes before your actions are handled. We consider this incident over.
]]>EDIT 11:31 UTC: Cause has been identified, we are currently fixing the issue on our reverse proxies.
EDIT 11:34 UTC: All reverse proxies now have a consistent state. The issue is fixed.
The issue happened after a configuration error made during a manual operation on some of the reverse proxies. Applications that redeployed since 11:08 UTC were impacted by that issue. Other applications were fine. The changes were rolled back and will again be tested thoroughly on our test infrastructure.
]]>The maintenance will last at least 5 minutes but no more than 20 minutes.
Different parts of the system will be affected throughout this maintenance, please wait until the end of the maintenance before reporting any issues you may be having.
EDIT 11:00 UTC: The maintenance will start in a few minutes. Deployments and GIT repositories will be unavailable. The console might report an "unknown" or not up-to-date state for applications. This is expected.
EDIT 11:05 UTC: Maintenance is starting, deployments are down and so are GIT repositories (push actions will be rejected)
EDIT 11:09 UTC: Deployments are available again. Push actions on GIT repositories are still disabled.
EDIT 11:10 UTC: Our main API is entering read-only mode. 500 errors might appear during this time.
EDIT 11:11 UTC: Git repositories are now available. You might need to clear your DNS cache to be able to push again.
EDIT 11:20 UTC: Our main API should be fully available again. We are checking that everything looks fine.
EDIT 11:23 UTC: Everything is looking fine. The maintenance is over. You might experience git push errors for up to 45 minutes. To avoid that, please clear your DNS cache.
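For reference, flushing the DNS cache depends on your OS; these are common examples rather than an exhaustive list: ipconfig /flushdns on Windows, sudo dscacheutil -flushcache on macOS, and resolvectl flush-caches on Linux systems using systemd-resolved.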
]]>EDIT 07:11 UTC: The issue is resolved
]]>All deployments actions will be queued and started once the deployment stack is back up. The maintenance shouldn't last longer than 2 hours.
Feel free to ask any question on our support regarding this maintenance.
EDIT 11:03 UTC: the maintenance will start soon. Deployments will be shut down in a few minutes. Push actions on our GIT repositories are disabled.
EDIT 11:06 UTC: Deployments are shut down
EDIT 11:20 UTC: Deployments should be back, we are still cleaning up things
EDIT 12:20 UTC: We have been keeping a close eye on deployments, everything is going smoothly. Maintenance is over.
]]>UPDATE 9:40 UTC: deployments have been back for 20 minutes, we are still cleaning things up.
UPDATE 10:30 UTC: Everything is back to normal, sorry for the issue.
]]>EDIT 8:21 UTC: fixed.
]]>EDIT 18:10: Deployments should be back to normal.
]]>EDIT 11:05 UTC: maintenance is finished.
]]>EDIT 16:29 UTC: the new addon dashboard is available. We are continuing the maintenance.
EDIT 17:30 UTC: maintenance finished.
]]>EDIT 18/01/19 00:53 UTC: Root issue is most probably identified. The issue was coming from an internal tool. We will investigate this further. In the meantime, the tool has been deactivated and shouldn't cause any harm.
]]>EDIT 16/01/2019 09:45 UTC: The problem might be due to old clients drivers being used on the cluster. We have set up a new cluster (version 4.0.3) which should greatly improve things. You can create a new add-on to migrate your database.
To dump the data from your existing add-on, you can use this command: mongodump -u "${MONGODB_ADDON_USER}" -p "${MONGODB_ADDON_PASSWORD}" -h "${MONGODB_ADDON_HOST}" -d "${MONGODB_ADDON_DB}" --archive --gzip
You can then import the data into the new database by using the mongorestore command displayed in the dashboard of your new add-on.
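For illustration only (the authoritative restore command, with the correct credentials, is the one displayed in the new add-on's dashboard; the NEW_MONGODB_* variable names below are hypothetical), a dump-and-restore round trip could look like:
mongodump -u "${MONGODB_ADDON_USER}" -p "${MONGODB_ADDON_PASSWORD}" -h "${MONGODB_ADDON_HOST}" -d "${MONGODB_ADDON_DB}" --archive=dump.gz --gzip
mongorestore -u "${NEW_MONGODB_USER}" -p "${NEW_MONGODB_PASSWORD}" -h "${NEW_MONGODB_HOST}" --nsFrom "${MONGODB_ADDON_DB}.*" --nsTo "${NEW_MONGODB_DB}.*" --archive=dump.gz --gzip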
An automatic migration tool for mongodb should be available in the next few days.
]]>16:15 UTC: The maintenance is over. Add-on creation and dashboard are now fully available again.
]]>At 22:27, one of our hypervisors lost access to parts of its disks. Among others, it impacted a deprecated front reverse proxy for applications and a front reverse proxy for add-ons (databases). We moved the IP of one of the proxies. The other one, related to the application reverse proxy (62.210.92.244), couldn't be moved and is now unreachable. If you still use it, you should update your DNS records: https://www.clever-cloud.com/doc/admin-console/custom-domain-names/#personal-domain-names
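As a quick check (dig is only one example of a DNS lookup tool, and the domain below is a placeholder for one of your own), you can verify whether a domain still resolves to the retired proxy:
dig +short A your-domain.example
If the answer contains 62.210.92.244, the record still needs to be updated.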
The situation is stabilized. We still consider the infrastructure not fully recovered.
]]>EDIT 15:00 UTC: new addon dashboard is available, but addon creation is still unavailable.
EDIT 17:28 UTC: maintenance is now finished.
]]>16:38:00 UTC: An alert due to an important change in network traffic is triggered
16:39:30 UTC: The load balancer is restarted
Everything is back to normal now.
]]>EDIT 10:07 UTC: The reverse proxy has been restarted and the issue seems to be resolved. We are monitoring the situation.
]]>EDIT: There were sudden drops in free disk space. We changed the logging method and it seems to have stabilized the system. We are still working on figuring out the issue.
]]>We are having issues with the authentication component. Open connections are working fine, new connections are impossible for now.
17:21 UTC: It should be fixed. We are making sure.
17:30 UTC: Incident over.
]]>EDIT 20:15 UTC: Incident resolved, it was due to a network misconfiguration. We will ensure this doesn't happen again.
]]>START, RESTART, STOP, ... will be unavailable but will remain in queue and will be processed at the end of the maintenance.
EDIT 13:06 UTC: The maintenance is starting
EDIT 13:17 UTC: Deployments are now available again. Queued deployments have been processed.
Maintenance is over.
]]>EDIT 16:53 UTC: API is fixed. We detected a problem on our reverse proxies, we are currently fixing it.
EDIT 16:54 UTC: fixed.
]]>EDIT 15:17 UTC: fixed.
]]>EDIT 12:18 UTC: We are still trying to figure out a fix for the issue.
EDIT 12:47 UTC: The problem should now be fixed. A configuration error made this incident last longer than it should have. Applications may need to be redeployed to get the SSH service back online.
Sorry about this incident.
]]>Sorry for the inconvenience
]]>14:26 UTC: One culprit has been found. The cluster's load has been reduced significantly.
14:38 UTC: The cluster's load is back to normal since 14:30.
]]>08:38 UTC: We are restarting part of the deployment system.
08:49 UTC: For the last 5 minutes, deployments have been processed with some delay.
08:54 UTC: Back to normal.
]]>10:00 UTC: We found the root cause. The console still can't be loaded at the moment but other services should now be available (like deployments).
10:06 UTC: There was an underlying issue preventing the console from loading. It is now fixed. The incident is now over. Sorry for the inconvenience.
]]>EDIT 14:04 UTC: Metrics are getting back up
EDIT 14:10 UTC: Metrics are fully recovered. Sorry for the inconvenience
]]>Cogent will be performing code upgrades in the following areas.
During these upgrades, customers in or transiting the area may experience intermittent periods of packet loss and latency between 15 and 45 minutes for the duration of the window.
Location: Paris, France
Start time: 11/30 00:01 CET
End time: 11/30 06:00 CET
Work order number: NC840-119
Our link with the Montréal (MTL) zone can be affected, so our systems (deployments, monitoring, etc.) in Montréal (MTL) can experience issues.
]]>Sorry for the inconvenience.
]]>An action was taken at 02:30 UTC (2018-11-21) which has successfully fixed this issue. This is only temporary though.
A permanent fix will be applied later today, which will require a downtime of that component.
EDIT 2018-11-21 16:50 UTC: The permanent fix is delayed to tomorrow, 2018-11-22.
EDIT 2018-11-22 10:40 UTC: The fix will be applied at 10:50 UTC; this will require at least one restart of that component, which will lead to an unavailability of Metrics for about 20 minutes.
EDIT 2018-11-22 11:25 UTC: Metrics are back since 11:08 UTC. Incident over.
]]>EDIT 19:21 UTC: Here is the incident of our provider: https://status.online.net/incident/153 (3 racks have lost public connectivity)
EDIT 20:33 UTC: The issue should be fixed. As of now, our monitoring is happy. We are cleaning up.
]]>EDIT 12:10 UTC: The issue seems to be resolved now
]]>EDIT 16:25 UTC: One of the components was failing due to a network configuration error. The network configuration has been fixed and the component is currently restarting. It should be restarted in about 15 minutes.
EDIT 16:40 UTC: The component has restarted, metrics are now available again for read actions. No data was lost. Sorry for the extended interruption.
]]>EDIT 13:28 UTC: The network issue has been resolved since 13:20 UTC. Everything should be back to normal. Sorry for those issues.
]]>Affected applications are being restarted automatically.
Affected addons are unreachable.
EDIT 17:56 UTC: Looks like it's a network issue, we are awaiting word from our provider.
EDIT 18:08 UTC: Our provider tells us they are working on it, no ETA nor details given.
EDIT 18:26 UTC: There was a short electrical outage in the datacenter where this server is, some routers and switches have been impacted by the switch to the backup power source. They are working on fixing affected network hardware.
EDIT 18:44 UTC: The server is back, addons should be reachable. We are making sure that everything is back online.
EDIT 18:56 UTC: Everything is working fine. Incident closed.
]]>Update 16:34 UTC: The cluster nodes have been restarted. The cluster is UP again. Sorry for the inconvenience.
]]>At 12:30 UTC, we found the cause of the issue.
At 12:32 UTC, the issue was fixed and we regenerated the reverse proxies configuration.
At 12:33 UTC, add-ons were available again.
We have put the necessary protections in place to prevent this from happening in the future.
]]>13:09 UTC: We are going to restart one of the deployment core systems. Deployment actions (like the one above) will be unavailable for up to 30 minutes. All actions will be queued and executed at the end of the maintenance.
13:40 UTC: Another problem occurred during the restart of that system. We are now trying to fix this one.
EDIT 14:03 UTC: Deployments are available since ~5 minutes now. We are still cleaning things up before closing this incident.
EDIT 14:30 UTC: Everything should be back to normal now. Sorry for the extra maintenance time and the deployments unavailability.
]]>EDIT 12:50 UTC: The deployments should now be back to normal. Apologies for the delays.
]]>DNS has been updated. Clients should connect back to the database
EDIT 12:22 UTC: The new leader has been correctly serving requests since 00:30 UTC.
]]>We are trying to restart it.
]]>EDIT 10:17 UTC: We are still working on the issue. If you have trouble deploying, you can set your application's scalability settings to the flavor a dedicated build instance would use. Do not hesitate to ping our support if needed.
EDIT 10:25 UTC: ETA is 2 hours if everything goes well.
EDIT 12:30 UTC: The deployments with cache are back. Everything should work as expected from now. Sorry for any failed deployments or longer than expected deployment times.
]]>The problem has been resolved at 16:08 UTC
]]>EDIT 18:50 UTC: The node has successfully restarted, the cluster should now be operational as usual
]]>We are looking into it.
EDIT 15:30 UTC: Our API is back online. The console can now be loaded.
]]>Applications on this hypervisor are being automatically redeployed. Add-ons are unreachable.
EDIT 12:21 UTC: The hypervisor is back online and is restarting the add-ons.
EDIT 12:32 UTC: All add-ons are now reachable.
]]>The maintenance shouldn't last longer than 30 minutes but it may be possible that some delays occur. We will update this ticket to let you know about the status of the maintenance.
EDIT 12:25 UTC+2: New deployments are no longer being consumed.
EDIT 12:30 UTC+2: The maintenance has started
EDIT 12:56 UTC+2: Deployments are back since ~10 minutes. We are still cleaning things up
EDIT 13:03 UTC+2: Maintenance is over and was successful. Do not hesitate to contact us if anything's wrong on your side.
]]>EDIT 19:17 UTC: This was actually a false positive from our monitoring. After verifying that the component is working fine and fixing the monitoring probe, we re-enabled deployments.
]]>EDIT 05:28 UTC: The server is partially and randomly available. The problem has been identified by our provider: it's coming from the switch the server is connected to. They are working on fixing the issue.
EDIT 08:04 UTC: Issue is fully fixed since 07:30 UTC
]]>The failing node is up again.
]]>EDIT 15:30 UTC: The creation of add-ons and buckets is now fixed. It may take a little longer than usual, but this slowness will be resolved in a few hours.
]]>Users are stretching the "fair usage" concept way above reasonable limits. We are working with them to enforce the fair usage.
]]>We are still watching the cluster.
]]>EDIT 13:17 UTC: Logs should be available, the cluster is slowly recovering
EDIT 13:23 UTC: The logs cluster is UP and running again, logs shouldn't have been lost thanks to buffering.
Sorry about the inconvenience.
]]>Write operations like "git push" or "clever deploy" to Clever Cloud repositories won't be possible during 30min. Read access won't be affected during this time.
Thanks for your patience.
EDIT 13:00 UTC+2: The maintenance is starting
EDIT 13:05 UTC+2: The maintenance is now complete. Do not hesitate to open a support ticket if anything goes wrong. Thanks for your patience!
]]>EDIT 10:27 UTC: Connections should now be working again. It seemed that already established connections were also impacted and were slower than expected. This should now also be fixed.
EDIT 10:27 UTC: FS Buckets service is now fully operational.
]]>EDIT 13:25 UTC: Recovery takes longer than expected, we are still working on it.
EDIT 13:59 UTC: We are still working on fixing these issues.
EDIT 14:08 UTC: We are still having issues but deployments can start.
EDIT 14:41 UTC: Deployments performance has been back to normal for more than 15 minutes now. We are still watching the situation closely. If you have an issue, please contact us.
]]>Some of our internal services were impacted by this network issue and thus, automatic re-deployment of applications has been delayed.
Everything is back to normal, applications are currently finishing their redeployment.
]]>Redis should be back as soon as the maintenance ends
EDIT 13:35 UTC: The maintenance is still ongoing
EDIT 13:50 UTC: The maintenance is over. Redis cluster is UP. Logs cluster is getting back UP. Logs should be saved but might not be directly available through the console
EDIT 14:30 UTC: The logs cluster is now fully operational too
]]>EDIT 11:08 UTC: The server was shutdown a few minutes ago. Applications on it are being redeployed. Add-ons are currently unavailable
EDIT 11:52 UTC: We are still waiting for news from our provider regarding the hard drives issue
EDIT 21:20 UTC: Our provider is still working at finding the root cause of the issue
EDIT 2018-06-29 07:05 UTC: We received an answer from our provider and the server can't be brought back online. Databases will need migration. We are waiting for an answer to know if we can access the disk in read-only mode to transfer the databases. If not, backups from the 28th of June will be used.
EDIT 2018-06-29 07:18 UTC: The disks can't be read. Backups will need to be used
]]>EDIT 22:00 UTC: Restart took approximately 30 seconds, most applications sent again the logs they couldn't send during that time
]]>Some databases are unreachable.
EDIT 2018-06-18T23:25:00 UTC: Seems to be a malfunctioning fan. The server is still down for investigation. We are waiting for more information from our hypervisor provider.
EDIT 2018-06-19T00:37:00 UTC: The malfunctioning fans have been replaced. The server is up again. All the databases are up and running.
]]>EDIT 2018-06-18 16:29 UTC: The hypervisor is up again, the databases are getting back up.
Applications that were on this HV were redeployed on another one.
]]>EDIT 15:08 UTC: We are still waiting for our network provider to find the root cause of it.
EDIT 15-06-18 13:00 UTC: Instabilities have ceased since this morning. Everything should be back to normal
]]>EDIT 10:40 UTC: The node has been restarted, we continue to monitor the situation.
EDIT 13:20 UTC: The cluster has been running fine since the incident
]]>EDIT 13:30 UTC: The maintenance has begun. Deployments are shutdown (but are queued) and git repositories aren't available anymore.
EDIT 13:39 UTC: The maintenance is over, deployments and git repositories are available again
]]>EDIT 09:45 UTC: We might have found why connections are hanging, we are currently doing some tests
EDIT 10:10 UTC: The tests worked fine and a fix has been deployed. All connections should have been restarted. If you still experience troubles with connecting to a particular service, please let us know at support@clever-cloud.com with the service you're trying to access
]]>EDIT 14:40 UTC: Metrics are back since 14:15. Performance is gradually coming back to its usual level.
]]>EDIT 08:05 UTC: Deployments should be back to normal, we are keeping an eye on the situation.
EDIT 08:33 UTC: Some deployments still won't start
EDIT 09:00 UTC: Deployments should be back to normal again. We are still keeping an eye on the situation and cleaning up the remaining issues
EDIT 12:28 UTC: Again, some deployments are failing to finish even though they appear as successfully done in the logs. We are looking at it
EDIT 13:27 UTC: Deployments are going to be stopped to fully clean the system. It should not last more than 15 minutes. The maintenance is starting now.
EDIT 14:08 UTC: Deployments are available since 13:45 UTC. The maintenance period is over. We keep looking for everything to go back to normal
EDIT 16:30 UTC: Everything seems to be back to normal
]]>9:17am Paris Time: incident is fixed. All add-ons have recovered.
]]>EDIT 13:50 UTC: Instabilities have stopped for 10 minutes now, we are still closely monitoring the situation.
]]>EDIT 08:00 UTC: A new version of the PHP image has been released. Redeploying your application should be enough to SSH again to the machine
]]>Until it's over, Metrics are not available. Metrics agents on scalers should push the data when the service is back.
EDIT 15:14 UTC: Metrics are back since 15:12 UTC
]]>Traffic was back to normal at 15:32:00 UTC.
]]>We are investigating the problem
EDIT 19:35 UTC: The problem seems to be gone. It may be due to a maintenance operation made on the Cellar cluster which shouldn't have caused this. This maintenance has been done multiple times without problems. We will keep an eye on the cluster when this maintenance starts again, probably tomorrow.
]]>EDIT 15:10 UTC: The source of the problem is one of our customers receiving a DDoS on its application. While the infrastructure can handle such load, we detected a problem with the configuration of our reverse proxies which doesn't allow us to correctly handle the load of this DDoS. We are looking at how we can improve that. In the meantime, traffic targeting that customer's application has been blocked.
EDIT 16:45 UTC: Most of the traffic is filtered. We will continue to watch the issue in the following hours.
]]>EDIT 11:49 UTC: Incident over since 11:45 UTC
]]>EDIT 17:03 UTC: Real-time delivery is back since 16:50 UTC
]]>EDIT 10:53 UTC: You can create the following environment variable for a temporary workaround: CC_PRE_RUN_HOOK=npm install nomnom@1.8.1 -g
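If you prefer the clever-tools CLI (a sketch assuming the standard clever env command; the value is exactly the workaround above), the variable can be set with the following, followed by a redeploy:
clever env set CC_PRE_RUN_HOOK "npm install nomnom@1.8.1 -g"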
EDIT 11:33 UTC: A fix has been made and the new image version is now deploying on our servers.
EDIT 12:33 UTC: The new image is now live. All NodeJS applications will be redeployed to avoid using a now broken image.
]]>EDIT 17:35 UTC: Service is back to normal and collected metrics have all been correctly persisted.
]]>EDIT 15:42 UTC: Incident over since 15:40 UTC.
]]>EDIT 16:41 UTC: the proxy has been successfully restarted. Add-ons should be reachable again. Applications not supporting the loss of an established connection will be redeployed. We continue to monitor the proxy.
EDIT 17:30 UTC: the incident is now over
]]>EDIT 20:17:00 UTC: The cluster has been restarted, impacted applications have been redeployed. The incident is over
]]>EDIT: Delayed to 12:50 UTC
EDIT 12:50 UTC: Will start in a few seconds
EDIT 13:07 UTC: Maintenance over. If you encounter an issue, please tell us.
]]>EDIT 03:15 UTC: Logs are back again
]]>Performance issues and/or partial outages are to be expected. We will try to keep them as low as possible.
The maintenance starts at 22:00 UTC
EDIT 02:00 UTC: the maintenance is now over
]]>EDIT 20:45:00 UTC: The reverse proxy took ~1 minute to restart. It is now restarted
EDIT 20:48:00 UTC: Impacted applications were redeployed as expected. The incident is now over and all add-ons are now reachable again
]]>The Activity pane (Console), clever status (CLI) and the API endpoint /applications/<app>/deployments incorrectly report the deployment status.
Notifications (slack webhooks, mails) correctly report the deployment status (failed or successful) and can be trusted.
EDIT 21:48 UTC: It should now be fixed. Deployments with the "FAILED" state will keep their broken state.
]]>EDIT 2018-06-15 UTC: All 7 days are now available again.
]]>EDIT 11:31 UTC: Maintenance is starting
EDIT 12:06 UTC: Deployments are back, we are now cleaning some old artefacts
EDIT 13:00 UTC: The maintenance is over
]]>EDIT 19:25 UTC: Those slowdowns might require an infrastructure change that will be done next week. Until then, slowdowns should be less frequent and less severe.
EDIT 2017-12-08 12:00 UTC: Deployments take less time after some fixes on our end. The migration will still happen to fix it entirely. The incident is considered closed because we no longer see any extra deployment times.
]]>EDIT 17:31 UTC: Deployments are disabled for now
EDIT 17:38 UTC: Deployments are now back up but may be stopped again in a few minutes if needed
EDIT 17:55 UTC: The incident is now resolved. We will keep an eye on it for the upcoming days
]]>EDIT 14:56 UTC+1: Unreachable servers are being restarted and will be available shortly. In the meantime, impacted applications are being redeployed.
EDIT 15:26 UTC+1: The team is performing the final cleanup. The issue is about to be closed. The remaining apps and add-ons are being restarted.
EDIT 15:50 UTC+1: The outage is now resolved. Contact the support if you encounter any trouble.
]]>EDIT 13:00 UTC+1: The maintenance is over, deployments have been back for 15 minutes.
]]>cleverapps.io domain, the whole domain name has been marked as malicious.
We are working on clearing the alert. In the meantime, we'd like to warn you that cleverapps.io domain names are provided only for test purposes and that they should not be used in production.
]]>EDIT 12:09 UTC: The API is now performing smoothly. We will keep looking into why it went into such a state.
]]>EDIT 17:30 UTC: all shared redis are now available again
]]>If you need it, here is the IP of the domain: 217.70.184.38
EDIT 19:43 UTC: The incident seems to be resolved, .cleverapps.io domains now resolve correctly
]]>Impacted applications are being redeployed
EDIT 10:15 UTC: The server is still under huge load. Services on it continue to answer correctly in most cases. Applications are still redeploying
EDIT 10:30 UTC: The server is now reachable and responsive, we are looking into why it went under such a heavy load
]]>EDIT 14:24 UTC: The server is still down, we are waiting for more information from our provider
EDIT 14:37 UTC: One of the server's fans has died and the server won't start.
EDIT 14:43 UTC: Impacted databases will be migrated on another server on request to the support. We will also contact impacted users. Let us know if you want to start a new database using tonight's backup
EDIT 15:23 UTC: Our provider is replacing the fans, no ETA for now
EDIT 16:55 UTC: Our provider replaced the fans and the server is now back up. Non migrated databases have been started again and linked applications are being redeployed. We will continue to monitor the situation
]]>EDIT: 17:40: Everything is back, sorry for the interruption.
]]>Update 13:57 UTC: deployments are now back up, we continue to monitor the situation
Update 14:15 UTC: it's all good now
]]>Update 09:29 UTC: the master node has been restarted. We're watching it closely
Update 15:45 UTC: the master has been alright since then
]]>Update 20:13 UTC: Network seems more stable now. We are still waiting for more information from our provider
Update 21:17 UTC: Our provider has confirmed the issue is fixed.
]]>EDIT 18:00 UTC: all good now
]]>EDIT 29/07/17 11:35 UTC: the migration will begin at 12:15 UTC. During the migration and for a few hours after, credit card management might not work
EDIT 29/07/17 13:10 UTC: the migration is over, we will continue to monitor payments for a few hours
]]>This is the 2nd step of the maintenance started on the 20th (https://status.clever-cloud.com/incident/31).
This should not have an impact on availability but may have a slightly bigger impact on performance than the first step (which did not have any noticeable impact).
It should take around 10 hours. This is a very rough estimate though, we will be posting updates along the way.
EDIT 08:01 UTC: Maintenance is starting now.
EDIT 11:55 UTC: Everything is going smoothly. Performance impact is very low.
EDIT 19:35 UTC: Maintenance is still in progress. No significant impact; as with the 1st step, consider this event over.
]]>EDIT 12:05 UTC: All affected applications have finished redeploying; we are awaiting an answer from our provider
EDIT 12:47 UTC: Our provider is "running tests" on the affected server and has not given any ETA as of now.
EDIT 13:00 UTC: The server is reporting a hardware error, not disk-related. Our provider is working on fixing the issue.
EDIT 13:31 UTC: The server fails to start. Our provider is giving us another server and will put the disks of the old server into the new one.
EDIT 14:30 UTC: The server is ready, the disks are up and running. We are now rebooting the server in operational mode; we will make sure everything starts up fine and then update the network configuration.
EDIT 15:11 UTC: All databases are available again.
]]>This is a 2-step maintenance, the second step will be scheduled at a later stage.
This should not have an impact on availability but may have a light to moderate impact on upload / download speeds.
No ETA as of now, we will be posting updates along the way.
EDIT 2017-07-20 08:00 UTC: Maintenance is starting now
EDIT 10:00 UTC: We are expecting the maintenance to end between 21:00 UTC and 2017-07-21 01:00 UTC; we are seeing no significant impact on upload / download speeds as of now
EDIT 14:45 UTC: The maintenance is running fine and still has no significant impact on performance, so we are keeping it as-is. Consider this event over; if something goes wrong, we will create a new event.
]]>The maintenance should not last more than 1 hour.
EDIT 10:18 UTC: Maintenance started a few minutes ago, logs collection will be disabled in a few seconds
EDIT 10:44 UTC: Maintenance has been over for a few minutes, logs are now available
]]>At this point, most services were available except for logs, events and notifications.
Thirty minutes after the beginning of this issue, everything is now fully available.
]]>EDIT 06:48 UTC: The network seems to work fine now. Deployments are unavailable, we are working on bringing them back up.
EDIT 07:35 UTC: Deployments have been back up since 07:15, we are still cleaning up the remaining items.
EDIT 07:40 UTC: Everything is cleaned up and functional now. If you have an issue, come ping us.
]]>EDIT 16:12 UTC: Deployments are back
]]>This should last no more than 10 minutes. Deployments should not be delayed by more than a couple minutes.
Maintenance operation will start at 09:10 UTC.
EDIT 09:19 UTC: Deployments should go back to normal in the next few minutes. Maintenance is over, we are now checking that everything is working fine.
EDIT 09:24 UTC: Deployment delays are back to normal; end of incident
]]>We are awaiting news from our provider.
EDIT 15:30 UTC: We are still awaiting a manual operation from our provider
EDIT 15:37 UTC: They have rebooted the server manually but "observed an error" and are "analyzing" the issue
EDIT 16:04 UTC: The power supply is out of order and is being replaced
EDIT 16:55 UTC: The operation is over, the server just rebooted and will now start recovering / cleaning up after the forced reboot. Databases will be coming back online automatically.
EDIT 17:50 UTC: Most databases are available since 17:15 UTC. The remaining databases are now available
]]>Deployments are stopped until the monitoring is back up and running.
]]>EDIT 11:05 UTC: Maintenance is fully over now, deployments have been available since 10:50 UTC.
]]>EDIT 16:00 UTC: The deployment starting time is back to normal
]]>ETA is about an hour.
]]>Also, deployments are delayed until we clean up the non-important redeployments
UPDATE 17:07 UTC: The incident has been resolved, sorry for those redeployments
]]>UPDATE 12:43 UTC: The problem has been resolved, we will investigate why it happened and how to prevent it from happening again.
]]>EDIT: The issue is gone. It looks like it was a temporary network issue of our provider.
]]>Update 5:43 UTC: the Redis machines are now available, impacted applications are restarting
]]>Update: The problem has been fixed at 16:20 UTC
]]>Update at 15:07 UTC: Problem fixed
]]>