EDIT 19:46 UTC+1: The underlying storage system is currently having issues and is rebalancing its data. No data loss is expected, but timeouts may occur. We are working to stabilize the system.
EDIT 20:09 UTC+1: The underlying storage system has been stable for the last 5 minutes. We are keeping an eye on it to make sure everything is okay.
EDIT 21:47 UTC+1: The service is now stable. We will need to perform additional maintenance to fully fix the underlying issue. We will schedule the corresponding maintenance windows in the coming days.
EDIT 14:00 UTC : We have updated the load balancer configuration; you may have seen some connections cut during the reload.
EDIT 14:30 UTC : We have seen the same instabilities on the RBX HDS database load balancer, so we have applied the same patch to the RBX database load balancer.
We suspect that we may have been impacted by one of our infrastructure provider's maintenance operations. See:
EDIT 14:00 UTC : We are confident this was not linked to the infrastructure provider and was an isolated incident. We are still monitoring, but the issue seems to be resolved.
]]>Scope:
Expected Impact:
Additional Information:
EDIT 11:00 UTC : We have updated the cleverapps load balancers and will restart them soon. We will proceed with further upgrades this afternoon.
EDIT 14:20 UTC : We are beginning the update of the Paris load balancers.
EDIT 18:00 UTC : We are still updating the load balancers.
EDIT 19:30 UTC : We have paused the update process; we will continue it tomorrow.
EDIT 10:00 UTC : We are resuming the load balancer updates.
EDIT 10:45 UTC : We have finished the updates.
]]>Scope:
Expected Impact:
Additional Information:
EDIT : The maintenance window has been updated to next Tuesday.
EDIT 14:00 UTC : We are beginning the maintenance.
EDIT 16:00 UTC : We have finished installing the new hardware alongside the existing one on rbxhds; we will begin switching traffic on the database and software load balancers. We are also starting the installation of load balancers in the rbx region.
EDIT 16:20 UTC : We are switching the database load balancer instances.
EDIT 16:45 UTC : We have fully switched the load balancers of the rbxhds region and finished installing the new load balancers alongside the current ones in the rbx region. We will now begin switching traffic to the new instances.
EDIT 17:15 UTC : We have finished switching traffic from the old load balancers to the new ones. The maintenance is over.
We are working on it.
[2024-02-27 08:50 UTC] We identified the root cause and fixed it. Everything should now be OK.
]]>Query is back online
EDIT 17:45 UTC: We have applied a configuration change to try to mitigate the issue. We are monitoring.
EDIT 18:00 UTC : We have performed a rolling reboot of the load balancers to give them more capacity.
]]>EDIT 16:50 UTC : The shared cluster is up and running.
EDIT 16:45 UTC : The hypervisor has rebooted and is now operational.
EDIT 2024-02-19 21:00 UTC : The maintenance was successfully completed.
]]>EDIT 18:02 UTC: The issue has been identified and fixed. If you pushed any commits that didn't get applied, please let our support know about it so we can force a deployment.
EDIT 15:30 UTC : We have resolved the git authentication issue.
]]>Scope:
Expected Impact:
Additional Information:
EDIT : We will perform the maintenance next week, on 2024-03-07.
EDIT 17:30 UTC : We have finished deploying the new load balancers alongside the current ones. We will now switch traffic from the old instances to the new ones.
EDIT 17:32 UTC : We have switched traffic from the old database load balancer to the new one; an unexpected behavior occurred and has now been fixed. Some connections may have been refused during that time.
EDIT 17:35 UTC : We will begin to switch the application load balancer soon.
EDIT 17:50 UTC : We have switched the first instance of application load balancer.
EDIT 18:00 UTC : We have finished rolling the application load balancers. We are monitoring.
]]>Scope:
Expected Impact:
Additional Information:
EDIT : We have moved the maintenance to 2024-02-23 instead of 2024-02-22.
EDIT 15:00 UTC: We are beginning the maintenance, installing the new setup alongside the current one. We will perform the failover next week.
EDIT 2024-02-26 16:30 UTC : We have added two new IP addresses to the domain.mtl.clever-cloud.com DNS records.
EDIT 2024-02-26 17:00 UTC : We have removed the two old IP addresses from the domain.mtl.clever-cloud.com DNS records.
EDIT 2024-02-26 17:15 UTC : We will update the DNS records for the database load balancers.
EDIT 2024-02-26 17:30 UTC : We have updated the DNS records for the database load balancers; we are monitoring.
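If you want to check that your resolvers are already serving the updated records, here is a minimal sketch using dnspython; the record name comes from the edits above, while the expected addresses are placeholders to fill in with the announced IPs.

```python
# Minimal sketch (dnspython, `pip install dnspython`): check which A records
# your resolver currently serves for the updated name. EXPECTED holds
# placeholder values; replace them with the announced addresses.
import dns.resolver

EXPECTED = {"192.0.2.10", "192.0.2.11"}  # placeholders, not the real new IPs

answer = dns.resolver.resolve("domain.mtl.clever-cloud.com", "A")
served = {rdata.address for rdata in answer}
print(f"served: {served} (TTL {answer.rrset.ttl}s)")
if EXPECTED <= served:
    print("resolver already serves the new records")
else:
    print("old records still cached; wait for the TTL to expire")
```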
]]>Scope:
Expected Impact:
Additional Information:
EDIT 13:30 UTC : We are preparing the hardware and software upgrade alongside the current stack.
]]>Scope:
Expected Impact:
Additional Information:
EDIT 13:30 UTC : We are starting preparations for the upgrades.
EDIT 13:50 UTC : Preparation is complete; we are beginning the rolling update of the application load balancers.
EDIT 14:20 UTC : We have finished rolling the application load balancers.
EDIT 14:30 UTC : We are starting to roll the database load balancers.
EDIT 14:35 UTC : We have finished rolling the database load balancers.
]]>Scope:
Expected Impact:
Additional Information:
EDIT 13:15 UTC : We are beginning the hardware upgrade alongside the current hardware.
EDIT 14:20 UTC : The hardware upgrade is finished; we will start the rolling update with the application load balancers.
EDIT 14:35 UTC : We have rolled the first application load balancer; we are beginning the second one.
EDIT 15:00 UTC : We have finished rolling the application load balancers; we are beginning the database load balancers.
EDIT 15:15 UTC : We have rolled the first database load balancer; we are monitoring.
EDIT 15:25 UTC : We have rolled the second database load balancer.
EDIT 15:25 UTC : We have rolled all load balancers. We are keeping an eye on them, but the maintenance is over.
]]>Expected downtime of the service is 30 minutes. During that time, git and mercurial operations might fail as well as loading the UI.
EDIT 16:45 UTC: The update is over.
]]>EDIT 10:15 UTC : We have started the maintenance
EDIT 11:15 UTC : We have finished the maintenance
EDIT 21:30 UTC : The hypervisor is now responding after a hard reboot. We are currently ensuring that every virtual machine is in a healthy state and investigating the root cause of the HV crash.
EDIT 22:00 UTC: Every VM on the hypervisor is running as expected. The root cause was a kernel panic; the kernel has been moved to a more stable version.
]]>EDIT 15:00 UTC : The software upgrade is still in progress
EDIT 15:30 UTC : The first server that hosts a load balancer instance has been updated.
EDIT 15:45 UTC : We are proceeding with the other load balancers.
EDIT 16:30 UTC : We have updated 2/3 of the load balancers.
EDIT 17:00 UTC : We have updated all load balancers.
]]>EDIT 10:30 UTC : we have begun the maintenance procedure for one of the two instances.
EDIT 11:10 UTC : we have finished the upgrade, we will restart the instance this afternoon around 14:00 UTC.
EDIT 15:00 UTC : We have restarted one of the two load balancer instances; we are watching the metrics to compare the two versions.
EDIT 9:30 UTC D+1 : Since yesterday, the telemetry we have observed shows improvements, so we will begin the update of the second instance.
EDIT 11:00 UTC D+1 : The update was completed without issues.
]]>EDIT 15:34 UTC+1: Patches were applied and services were restarted. The maintenance is now over.
Edit Tue Jan 23 17:59:56 2024 UTC: A faulty configuration was applied to a node to investigate a memory leak. The configuration backfired on the whole cluster, making it unhealthy. The configuration has been rolled back. The storage layer is currently in healing mode. To speed up the recovery, queries have been disabled.
Edit Tue Jan 23 19:51:21 2024 UTC: The cluster is now healthy and catching up the lag, which should take a few hours. Queries will be re-opened once the lag is absorbed.
Edit Wed Jan 24 00:04:59 2024 UTC: The data lag is now OK. We are still reloading the metrics' metadata, so queries are still unavailable. They should be up in a few hours.
Edit Wed Jan 24 01:54:22 2024 UTC: The metadata lag is now OK; queries are back online.
EDIT Thu Jan 25 11:00:00 2024 UTC : The platform is now OK; we are ingesting the lag.
EDIT Thu Jan 25 16:54:00 2024 UTC : The lag has been ingested. Some applications may not have their access logs reachable yet.
]]>Update Tue Jan 16 17:11:02 2024 UTC: cluster is no longer applying rate-limit
]]>We will update this status accordingly.
EDIT 2024-01-10 20:00 UTC: Maintenance is over, no impact during the operations.
]]>EDIT 15:58 UTC: The issue has been identified and deployments should be back to normal since 15:40 UTC.
]]>Update Thu Jan 04 14:48:00 2024 UTC: We have triggered some data balancing. Some queries may take longer than expected. This can impact some of the grafana dashboards or API queries. Write performance may be impacted.
Update Thu Jan 04 20:44:01 2024 UTC: data balancing is more aggressive than expected, overloading some components. Query may be unavailable during that time
Update Fri Jan 05 02:26:05 2024 UTC: some components are still overloaded. We are currently catching up the lag, but query is disabled for now.
Update Fri Jan 05 08:01:45 2024 UTC: our write-path is still overloaded. We are searching for the bottleneck
Update Fri Jan 05 16:03:48 2024 UTC: a cleanup subroutine has been triggered to balance and remove slack space from our internal Btree storage. Query is still disabled to speed-up the process.
Update: Sat Jan 06 11:25:28 2024 UTC: lag has been absorbed. Query is now up, the cleanup subroutine is still in-progress. You may notice latency spikes during query.
Update: Mon Jan 08 14:36:57 2024 UTC: cleanup subroutine is still in-progress, and some workloads triggered an overloading of some components. Query is disabled to speed-up recovery
Update: Mon Jan 08 16:36:18 2024 UTC: query is now open.
Update Tue Jan 09 14:38:34 2024 UTC: Some StorageServers are late, meaning that a really small portion of the data is not available for the query. We are currently catching up with the lag
Update Tue Jan 16 14:56:55 2024 UTC: closing the ticket.
EDIT 15:15 UTC: We are still digging into the issue; the abnormal traffic is over and everything seems to be going back to normal.
EDIT 16:30 UTC : We have put the IP address 46.252.181.103 back in the load balancer pool.
]]>EDIT 2023-12-30 00:51 UTC: The problem has been identified and resolved. The component is back in the pool and is working as expected. This incident is now over.
]]>EDIT 09:44 UTC: The issue is not fully resolved yet but we are seeing improvements. We continue working on the issue.
EDIT 11:04 UTC: Queries are now working since 10:15 UTC, we continue monitoring to ensure everything is working as intended.
EDIT 15:43 UTC: Everything is back to normal, this incident is now over.
EDIT 03:17 UTC : No databases are affected on this hypervisor and applications have been redeployed.
EDIT 03:30 UTC : The hypervisor has been rebooted and everything is back to normal.
]]>EDIT 3:37 UTC : The issue seems to be related with the following OVH incident : https://bare-metal-servers.status-ovhcloud.com/incidents/x135vv46x85l
EDIT 3:45 UTC : Applications on this hypervisor are currently redeploying and no add-ons are hosted on it. We have also temporarily removed the affected A record from domain.rbx.clever-cloud.com to solve connection issues.
EDIT 4:00 UTC : Applications have been redeployed; we are waiting on OVH to go further.
EDIT 05:30 UTC : The hypervisor is reachable again; we are starting the recovery process.
EDIT 05:45 UTC : The recovery process is over and everything works normally. The load balancer IP affected by the incident will be put back in the pool later; for the record, the IP is 87.98.177.176 for domain.rbx.clever-cloud.com.
EDIT 10:20 UTC : The investigation is still in progress and we are mitigating the issue by raising the maximum number of connections.
EDIT 11:00 UTC : We are back to nominal values; we are still monitoring.
]]>We will update this status accordingly.
EDIT 15:10 UTC: Maintenance is over, no impact during the operations.
EDIT 06:07 AM UTC: The hypervisor had become unresponsive due to a very high CPU load average. It has been rebooted. Almost all databases are reachable; we are fixing the last ones.
EDIT 06:45 AM UTC: All databases are now up.
EDIT 2023-12-21 16:00 UTC+1: We found and fixed the root cause. Matomo add-ons can now be ordered again.
]]>We will update this status accordingly.
EDIT 17:30 UTC: Maintenance is over, no impact during the operations.
EDIT 16:00 UTC : We have found that one of our customers is under a DDoS attack; we are mitigating the issue.
EDIT 16:30 UTC : The DDoS seems to be mitigated; we are monitoring.
]]>EDIT 10:55 UTC: The hypervisor went back online at 10:33 UTC. All applications were redeployed to another hypervisor. The incident is now over.
Some databases became unavailable. We are checking that they all rebooted correctly.
EDIT 15:51 UTC: all checks have completed. All the services are operational.
EDIT 04/12/2023 11:00 UTC : It seems that the load balancer behind the ip 212.129.27.183 was impacted by the incident. The issue is solved.
]]>Consequences: some applications on SCW may have lost connection to their database for a few minutes. They may have crashed and been redeployed by our monitoring.
EDIT 19:00 UTC : The issue has been solved.
]]>We will update this status accordingly.
EDIT 17:30 UTC: All updates are now over. Operations went smoothly and no impact was detected.
]]>We will update this status accordingly.
EDIT 23:15 UTC: All updates are now over. Operations went smoothly and no impact was detected.
]]>While performing the move, a network configuration issue arose, impacting only customers using TCP redirections on the PAR region.
As the team was focused on monitoring and fine-tuning the configuration of the new LB, it failed to see the error reports until 14:30 UTC. To prevent such an incident in the future, we have since improved our monitoring and alert tools for TCP redirects.
The issue was fixed by 14:55 UTC.
]]>We are investigating.
Edit 27 Nov 2023 11:02:23: Query is now functional. We are also observing an issue with metrics from add-ons. We are on it.
Edit 27 Nov 2023 06:00 PM: A regression in token regeneration has been fixed, and all tokens have been updated.
]]>EDIT 17:00 UTC : Cellar is available
]]>After running more tests, we discovered performance issues on long-distance connections, possibly caused by HTTP/2, which we activated on Cellar a few weeks ago. Our analyses confirmed that uploading data to Cellar using HTTP/2 in such conditions could heavily limit the throughput, whereas HTTP/1.1 gave us better and consistent results. The improvements seen for customers affected by the identified problems far outweigh the benefits of HTTP/2 seen in few cases. So we're disabling HTTP/2 and monitoring throughput to confirm this on a larger scale.
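As an illustration of the kind of comparison behind this decision, here is a rough, hedged sketch that times the same upload over HTTP/1.1 and HTTP/2 with httpx; the presigned URL is a placeholder you would generate for your own Cellar bucket with your usual S3 tooling.

```python
# Rough throughput comparison for uploads over HTTP/1.1 vs HTTP/2, using httpx
# (pip install "httpx[http2]"). PRESIGNED_URL is a placeholder for a presigned
# S3 PUT URL pointing at your Cellar bucket.
import time
import httpx

PRESIGNED_URL = "https://cellar-c2.services.clever-cloud.com/<bucket>/<key>?<signature>"  # placeholder
payload = b"\0" * (32 * 1024 * 1024)  # 32 MiB test object

for use_http2 in (False, True):
    with httpx.Client(http2=use_http2, timeout=120) as client:
        start = time.monotonic()
        resp = client.put(PRESIGNED_URL, content=payload)
        elapsed = time.monotonic() - start
        # resp.http_version shows which protocol was actually negotiated.
        print(f"{resp.http_version}: status={resp.status_code}, "
              f"{len(payload) / elapsed / 1e6:.1f} MB/s")
```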
We will begin including the new load balancer instances deployed yesterday in the load balancer pool starting at 14:00 UTC. The new load balancer IP addresses that will be added alongside the current ones are:
EDIT 15:30 UTC : Our monitoring saw an increasing number of 404 response status codes. We rolled back the modification and investigated the issue. It was an overlap of internal IP addresses with the Cellar load balancer, which is now fixed.
EDIT 15:45 UTC : After further investigation, we were able to resume the maintenance.
EDIT 18:05 UTC : We have finished deploying the new instances.
We have installed new load balancers. We will review and test them tonight and will add them to the lb pool tomorrow morning (2023-11-28).
We are still seeing a few random SSL errors here and there. We are investigating. The culprit may be a lack of allocated resources. We are following this lead.
… we have fine-tuned the load balancers, which temporarily caused more SSL errors for a minute. Traffic now seems better.
We are experiencing new errors on the load balancers: customers report PR_END_OF_FILE_ERROR errors in their browsers while connecting to their apps and SSL_ERROR_SYSCALL from curl.
We are able to reproduce these errors. They look like the incident from the morning of Friday the 24th. We are looking for the configuration mishap that may have escaped our review.
✅ It's fixed. We have started writing a monitoring script for that kind of configuration error; we will speed up its writing and deployment to production.
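For illustration only (this is not the actual Clever Cloud script, which is internal), an external check for this class of TLS configuration error can be as simple as attempting a handshake against each front-facing domain and flagging failures:

```python
# Purely illustrative sketch: attempt a TLS handshake against each domain and
# report failures such as handshake errors or abruptly closed connections.
import socket
import ssl

DOMAINS = ["example-app-1.cleverapps.io", "example-app-2.cleverapps.io"]  # hypothetical

context = ssl.create_default_context()
for domain in DOMAINS:
    try:
        with socket.create_connection((domain, 443), timeout=5) as sock:
            with context.wrap_socket(sock, server_hostname=domain) as tls:
                print(f"OK   {domain}: {tls.version()}")
    except (ssl.SSLError, OSError) as exc:
        print(f"FAIL {domain}: {exc}")
```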
We've been monitoring the load balancers all weekend: the only desync was observed (and fixed right away by the on-call team) on old sōzu versions (0.13) that are still processing 10% of Paris' public traffic! We plan to remove these old load balancers quickly this week.
We consider the desynchronization issue resolved.
Last Friday, we configured Cellar's front proxies to lower their reload rate. We haven't seen any slowness since, but it was already hard to reproduce on our side. No slowness on Cellar was reported during the weekend, but we are still on the lookout.
After more (successful) load tests, the new version of sōzu (0.15.17) is being installed on all impacted public and private load balancers. Upgrades should be over in the next two hours.
The team continues to investigate the random slowness issues still encountered by some customers, which we are trying to reproduce in a consistent way.
We've tested our new Sōzu release (0.15.17) all night with extra monitoring and no lag or crash was detected. The only remaining issues were on the non updated (0.13.6) instances. They were detected by our monitoring and the on-call team restarted them.
We are pretty confident that this new release solves our load balancers issues. We plan to switch all private and public Sōzu load balancers to 0.15.17 today and monitor them over the coming days.
Temporary incident:
While updating our configuration to grow the traffic share of the new (0.15.17) load balancers, a human mistake (and not a newly discovered bug) broke part of the configuration, causing many SSL version errors on 15% of the requests between 09:25 and 09:50 UTC.
As we planned earlier, the renewal of all certificates in RSA 2048 has been completed, except for a few wildcards (mostly ours) which require manual intervention. This will be dealt with shortly.
We were able to identify the root cause of our desync/lag in Sōzu. A specific request, a ‘double bug’, was causing worker crashes. We developed fixes and are confident they will fix our problems. We’ll test them and be monitoring the situation before deploying them fully in production.
We’ve upgraded our load balancers infrastructure and monitoring tools to check whether this will improve the various types of problems reported to us.
Background: Two months ago, we migrated our Let's Encrypt automatic certificate generation from RSA 2048 keys to RSA 4096 keys. Following a major certificates renewal in early November, this led to timeouts when processing requests, and then 504 errors.
Actions:
Back to normal: Within the day, while we finish regeneration.
Next steps: We have also explored a migration to the ECDSA standard, which according to our initial tests will enable us to improve both the performance and security levels of our platform. Such a migration will be planned in the coming months, after a deeper impact analysis.
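If you want to see which key type and size a given certificate currently uses (RSA 2048/4096 or ECDSA), a small sketch with the standard library and the cryptography package can tell you; the hostname below is a placeholder.

```python
# Hedged sketch: inspect the key type/size of a domain's TLS certificate.
import ssl
from cryptography import x509
from cryptography.hazmat.primitives.asymmetric import ec, rsa

pem = ssl.get_server_certificate(("example-app.cleverapps.io", 443))  # placeholder host
cert = x509.load_pem_x509_certificate(pem.encode())
key = cert.public_key()

if isinstance(key, rsa.RSAPublicKey):
    print(f"RSA {key.key_size} bits")
elif isinstance(key, ec.EllipticCurvePublicKey):
    print(f"ECDSA on curve {key.curve.name}")
else:
    print(type(key).__name__)
```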
Background: We noted a significant drop in HTTPS request processing performance, with capacity reduced from 8,000 to 4,000 requests per second, due in particular to an excessive number of syscalls via rustls.
Actions: We developed a Sōzu update and pushed it on November 16.
Back to normal: The problem is now resolved.
Background: Load balancers are sometimes out of sync, Sōzu gets stuck in TLS handshakes or requests. The workers no longer take the config updates, causing the proxy-manager to freeze. The load balancers then miss all new config updates until we restart them.
Actions: We have improved our tooling to detect the root cause of the problem at a deeper level. We have been able to confirm that this concerns both Sōzu versions 0.13.x and 0.15.x.
Next steps: We'll be tracing the problem in greater depth within the day, to decide what actions to take in the short term to mitigate the problem.
Background: Customers are reporting slowness or timeouts on Cellar, which we are now able to identify and qualify. While the cause has not been fully pinpointed, we have several ways of mitigating the problem.
Actions: Add capacity to front-ends infrastructure and enhance network configuration.
]]>The maintenance will start today November 20, 2023 at 12:00 UTC+1.
EDIT 2023-11-20 12:10 UTC+1: Maintenance is starting
EDIT 2023-11-20 13:00 UTC+1: Maintenance is now over, the addon cellar API is fully available
]]>EDIT 18:20: All addon load balancers have been fixed, we are currently actively monitoring their state
EDIT 2023-11-20 18:49 UTC: The fix has not been as effective as we would have hoped. We are currently issuing another fix. During the next few minutes, you might encounter some connection refused errors when connecting to some add-ons.
EDIT 2023-11-20 18:59 UTC: The operations are done. We are now monitoring the situation.
EDIT 15:00 UTC : The number of errors is decreasing; we are still investigating.
EDIT 15:30 UTC : We have identified the issue; we are deploying a patch.
EDIT 15:45 UTC : The patch has been applied successfully.
EDIT 16:00 UTC : The situation is back to normal; we are monitoring.
]]>EDIT 15th of november 09:09 AM UTC: cluster has been scaled up and partitions distributed among new brokers
]]>(times in CET)
The maintenance will start today November 14, 2023 at 12:00 UTC+1.
EDIT 2023-11-14 12:09 UTC+1: The API is down; we are postponing this maintenance.
EDIT 2023-11-14 15:20 UTC+1: Maintenance is starting.
EDIT 2023-11-14 17:10 UTC+1: Maintenance has been completed successfully.
]]>EDIT Mon Nov 13 18:51:01 2023 UTC: config tuning has been made, cluster is now fully recovered. Lag will be resolved within minutes
]]>EDIT: fixed
EDIT 16:35: This issue has been fixed.
This only concerns the Jenkins, Elasticsearch, MySQL, PostgreSQL, MongoDB and Redis APIs.
For each kind of add-on, expect a downtime of 20 to 30 minutes.
The maintenance will start tonight November 9, 2023 at 21:00 UTC.
EDIT 2023-11-09 22:00 UTC+1: Maintenance is starting.
EDIT 2023-11-10 01:00 UTC+1: Maintenance is now completed.
Edit 2023-11-09 : We are keeping this incident open as the performance issues seem to have diminished but not vanished. There seems to be a seasonality to these issues; we are still investigating why we see these surges in load.
]]>We are investigating this issue.
EDIT 16:30 UTC : The main API is reachable; we are investigating the root cause, which seems related to the database.
EDIT 16:40 UTC : We have detected that the database was lacking capacity. We have increased the capacity, rebooted the database, and are deploying the API again.
EDIT 18:00 UTC : The main API is reachable.
]]>All the applications on that HV are being redeployed. A few add-ons that are on it are unavailable.
The hypervisor would not reboot from our OVHcloud interface. We contacted their support and they brought it back up.
12:28 UTC: the HV is running, we are starting the cleaning procedure and making sure all the add-ons have restarted correctly.
]]>All the access logs are still stored, but the API will not give you the recent ones (up to two weeks).
Edit 2023-11-14: we are still working on making the accesslogs available from the API.
]]>update 08:40 UTC - the reverse proxies have been resynchronized. We are watching it and looking for the reason of the desynchronization.
]]>Update 20:04 UTC - We have fixed the broker issue and restarted every service that failed to reconnect. The situation is back to normal.
]]>These issues impact TLS and the ability to answer correctly.
EDIT 20:32 UTC - fixed.
Edit Sat Oct 28 14:51 2023 UTC: The infrastructure has been scaled up and optimizations on the LBs are underway; you may still experience errors during queries.
Edit 15:00 UTC : We are starting to roll the load balancer records for domain.par.clever-cloud.com.
Edit 15:50 UTC : We have finished rolling the first IP address (46.252.181.103); the next ones should be faster.
Edit 16:00 UTC : We have removed the second record (46.252.181.104); we are waiting for the TTL to expire before proceeding.
Edit 16:10 UTC : We have added back the second record (46.252.181.104); we are waiting for the TTL to expire before going further.
Edit 16:15 UTC : We have removed the third record (185.42.117.108); we are waiting for the TTL to expire before proceeding.
Edit 16:25 UTC : We have added back the third record (185.42.117.108); we are waiting for the TTL to expire before going further.
Edit 16:30 UTC : We have removed the fourth and last record (185.42.117.109); we are waiting for the TTL to expire before proceeding.
Edit 16:40 UTC : We have added back the fourth record (185.42.117.109); we have finished the maintenance.
Edit 17:38 UTC: We have an increase in TLS errors for incoming requests, we are looking into it.
Edit 18:08 UTC: We found a potential issue. We are deploying a fix and will monitor the situation closely.
Edit 19:06 UTC: The fix has been deployed since 18:55 and we are monitoring the situation
Edit D+1 16:00 UTC : We have found the issue with the update and patched the software. We will apply it in a few moments.
Edit D+1 16:30 UTC : We will update the first IP address, 46.252.181.103.
Edit D+1 17:15 UTC : We have updated the second IP address, 46.252.181.104; we will begin the third, 185.42.117.108.
Edit D+1 17:30 UTC : We have updated the fourth IP address, 185.42.117.109.
Edit D+1 18:30 UTC : We have finished the operation; we are monitoring.
EDIT 13:00 UTC: The problem has been fixed and will be investigated further to pinpoint the origin.
EDIT 13:30 UTC: We have applied a patch to solve the issue.
07:34 UTC : We have fixed the issue and keep monitoring it.
13:00 UTC: The issue did not occur again. This incident is now over.
]]>There are 3 kinds of Logs :
There are multiple uses of Metrics data:
Edit 18:08 UTC: We are starting the maintenance operation with the redeployment of apps that depend on tokens (Grafana, scheduler, etc.).
Edit 18:11 UTC: Grafana is being shut down to reconfigure the managed service behind it.
Edit 18:40 UTC: The token manager is successfully up to date. Apps are being redeployed to switch their metrics endpoint.
Edit 18:46 UTC: Web console metrics are unavailable for a few minutes (this is expected).
Edit 19:31 UTC: The web console now has server metrics available.
Edit 20:16 UTC: All Grafana dashboards are back online. If you encounter an "Error 500: invalid token" issue, you can go to your org home page > Metrics in Grafana > and click on the RESET ALL DASHBOARDS button.
Edit 21:20 UTC: Only access-log-based dashboards remain unavailable.
This maintenance concerns DEV PostgreSQL services on the Paris (PAR) region. Applications using those services will be impacted.
For this reason, we have deployed a new cluster running version 15. Starting from today, you can already migrate your DEV add-on to this new cluster, and by Thursday at the latest we will automatically migrate all add-ons that are compatible with PostgreSQL version 15.
For incompatible add-ons, we are planning a maintenance to update the par dev cluster. This maintenance will take place on Thursday the 26th of October 2023, between 15:00 UTC+2 and 17:00 UTC+2.
For the entire duration of the update, services will be unavailable. The time required to perform the update is estimated between 1 and 2 hours. However, total downtime might be longer as every application using the cluster will need to be restarted.
In case you have connection issues after those updates, you can manually trigger a redeployment of your linked applications.
If you do not want to be impacted by your DEV add-on being offline, you can still order or migrate to a dedicated one before this maintenance starts.
Our support team is available for any questions via the ticket center in the console.
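To check which PostgreSQL major version your add-on is actually running before or after the migration, a minimal sketch with psycopg2 works; POSTGRESQL_ADDON_URI is assumed to be the connection string exposed to your linked application (adjust to however you store your credentials).

```python
# Hedged sketch: print the PostgreSQL server version of the add-on.
import os
import psycopg2

conn = psycopg2.connect(os.environ["POSTGRESQL_ADDON_URI"])  # assumed variable name
with conn, conn.cursor() as cur:
    cur.execute("SHOW server_version;")
    print("server version:", cur.fetchone()[0])
conn.close()
```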
EDIT 2023-10-25 15:00 UTC+2: We have delayed the maintenance to 15:00 UTC+2 on the 26th of October 2023.
EDIT 2023-10-26 15:00 UTC+2: Most of the DEV add-ons have been migrated; we are going to start the maintenance.
EDIT 2023-10-26 15:35 UTC+2: The dev cluster par-postgresql-c4 is back online.
EDIT 2023-10-26 16:30 UTC+2: Everything is now back to normal. The maintenance is over.
]]>The maintenance will take place on Sunday 22 October 2023, between 14:00 UTC+2 and 20:00 UTC+2.
During the maintenance, applications and add-ons in this region may experience unexpected connection closes or resets, especially on long-running connections, beginning at 16:00 UTC+2. To prevent issues, you can restart your application if you see connection errors (see the reconnection sketch after the timeline below).
To check which of your services are impacted, you can consult the information section of your applications and see the region where your application is deployed.
14:45 UTC+2 : We are beginning the preparation steps to update the load balancers that receive cleverapps.io traffic.
16:00 UTC+2 : We have identified a bug, so we are skipping the update of the cleverapps.io load balancers for now.
16:30 UTC+2 : We are beginning the update of the last load balancer.
18:00 UTC+2 : We will soon update DNS records to send traffic to the new load balancers.
18:15 UTC+2 : DNS records have been updated.
18:20 UTC+2 : Monitoring is green; the maintenance is done.
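Regarding the advice above about long-running connections, here is a minimal reconnection sketch (assuming a PostgreSQL add-on, psycopg2, and a DATABASE_URI environment variable; adapt to your own driver and configuration) that is enough to ride out the brief resets during the rolling update.

```python
# Minimal reconnection sketch for long-running connections (referenced above).
# Assumptions: a PostgreSQL add-on, psycopg2, and a DATABASE_URI environment
# variable holding the connection string; adapt to your driver and settings.
import os
import time
import psycopg2

def query_with_retry(sql, retries=3, delay=2):
    last_error = None
    for attempt in range(1, retries + 1):
        conn = None
        try:
            conn = psycopg2.connect(os.environ["DATABASE_URI"])
            with conn, conn.cursor() as cur:
                cur.execute(sql)
                return cur.fetchall()
        except psycopg2.OperationalError as exc:
            # Connection closed or reset by the load balancer: back off, retry.
            last_error = exc
            print(f"attempt {attempt} failed: {exc}")
            time.sleep(delay)
        finally:
            if conn is not None:
                conn.close()
    raise RuntimeError(f"database unreachable after {retries} attempts: {last_error}")

print(query_with_retry("SELECT 1;"))
```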
]]>As a result, we will need to shutdown the deployment component for approx. 1 hour.
The maintenance is over, deployments are now usable again.
]]>The maintenance will take place on Saturday 21 October 2023, between 14:00 UTC+2 and 20:00 UTC+2.
During the maintenance, applications and add-ons in this region may experience unexpected connection closes or resets, especially on long-running connections, beginning at 16:00 UTC+2. To prevent issues, you can restart your application if you see connection errors.
To check which of your services are impacted, you can consult the information section of your applications and see the region where your application is deployed.
14:15 UTC+2 : The maintenance will start soon; we are finishing the preparation steps.
15:15 UTC+2 : The preparation steps took more time than estimated; we are rolling out a configuration update on the dedicated load balancers.
16:15 UTC+2 : The rolling update of the dedicated load balancers is finished; we are beginning the public shared load balancers.
17:15 UTC+2 : We are updating the domain name resolutions for the public shared add-on load balancer.
18:30 UTC+2 : We have updated two of the eight servers of the public shared add-on load balancer.
19:00 UTC+2 : We have updated four of the eight servers.
19:15 UTC+2 : We have updated six of the eight servers.
19:15 UTC+2 : We have updated seven of the eight servers.
19:50 UTC+2 : We have updated all servers of the public shared add-on load balancer. As it is late and we are reaching the end of the window, we will update the last load balancers tomorrow afternoon.
This maintenance concerns DEV MySQL services on the Paris (PAR) region. Applications using those services will be impacted.
Only the par dev cluster will be updated during this maintenance.
The maintenance will take place on Monday 23rd of October 2023, between 11:45 UTC+2 and 15:00 UTC+2.
For the entire duration of the update, the services will not be available.
The time required to perform the update is estimated between 1 and 2 hours. However, total downtime might be longer as every application using the cluster will need to be restarted.
In case you have connection issues after those updates, you can manually trigger a redeployment of your linked applications.
If you do not want to be impacted by your DEV add-on being offline, you can still order or migrate to a dedicated one before this maintenance starts.
Our support team is available for any questions via the ticket center in the console.
EDIT 2023-10-23 11:50 UTC+2: Maintenance is starting.
EDIT 2023-10-23 12:20 UTC+2: Dev add-ons are now available again. We will restart linked applications.
EDIT 2023-10-23 12:22 UTC+2: We are investigating an error when creating new DEV add-ons.
EDIT 2023-10-23 12:40 UTC+2: New DEV add-ons can now be created. All applications linked to DEV add-ons are currently restarting.
EDIT 2023-10-23 13:00 UTC+2: All applications linked to DEV add-ons have restarted.
Downtime is expected to last between 30 minutes to 1 hour.
[14:15 UTC] All notifications services are now up and running
]]>The configuration has been fixed at 21:00 UTC and disk access time are now in the normal range. We will keep monitoring the situation in the upcoming days to make sure performance stays in normal ranges.
If your application uses the MongoDB connection URI correctly, the maintenance should only disrupt it for a few seconds. Otherwise, expect up to two hours of maintenance.
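As a hedged illustration of what "using the MongoDB URI correctly" can look like from an application: connect with the add-on connection string and let the driver retry across the brief failover. MONGODB_ADDON_URI is assumed to be the standard add-on variable, and pymongo is one driver choice among others.

```python
# Hedged sketch: connect through the add-on URI and tolerate a short failover.
import os
from pymongo import MongoClient
from pymongo.errors import AutoReconnect

client = MongoClient(
    os.environ["MONGODB_ADDON_URI"],  # assumed variable name
    retryWrites=True,                 # retry a write once after a failover
    serverSelectionTimeoutMS=10_000,  # wait up to 10 s for a new primary
)

try:
    client.admin.command("ping")
    print("connected:", client.address)
except AutoReconnect as exc:
    print("temporarily unavailable, retry shortly:", exc)
```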
]]>The fetch of logs can take a while.
EDIT 21:37 UTC - fixed.
]]>The logs collection (logs drains too) will be unavailable during the maintenance.
EDIT 00:00 UTC: The maintenance is now over.
]]>FS Bucket hosts that will be updated during this maintenance are: n19 and n20.
The maintenance will take place on Friday 20 October 2023, between 12:00 UTC+2 and 14:00 UTC+2.
During the update of each server host, the services will only be available in read-only mode. Once the update is complete, linked applications will be restarted automatically to take into account the environment variables of the updated services and to restore write capacity.
The required update time is estimated at 1 hour but the total time until the applications are restarted might be longer.
In case you have write issues after those updates, you can manually initiate a redeployment of your linked applications in order to avoid waiting for the automatic redeployment.
To check if your services are impacted, you can consult your FS Bucket’s server in the Dashboard tab of your add-ons, in the “Cluster information” section and thus determine the update day(s) that concerns you.
Please check if you have any old applications (>5 years) that are still using a buckets.json file in their code repository, as we will not be able to prioritize the redeployment of these applications and they will most likely suffer from read-only FS Bucket for an extended time. We therefore recommend that you now mount FS Bucket by environment variable (ideally by linking the add-on to your application). See more details in this documentation page: https://www.clever-cloud.com/doc/deploy/addon/fs-bucket/#configuring-your-application
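As a quick way to tell whether your bucket has left read-only mode after the update, a small write probe against the mount point works; the path below is a placeholder for wherever the FS Bucket is mounted in your application.

```python
# Hedged sketch: probe whether the FS Bucket mount is writable again.
import os
import uuid

MOUNT_POINT = "/my-fs-bucket"  # placeholder mount path

probe = os.path.join(MOUNT_POINT, f".write-probe-{uuid.uuid4().hex}")
try:
    with open(probe, "w") as f:
        f.write("ok")
    os.remove(probe)
    print("bucket is writable")
except OSError as exc:
    print("bucket is still read-only:", exc)
```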
Our support is available for any questions via the ticket center in the console.
EDIT 2023-10-20 12:00 UTC+2: Maintenance is starting.
EDIT 2023-10-20 13:24 UTC+2: Applications are currently redeploying.
EDIT 2023-10-20 14:20 UTC+2: Applications have redeployed. We are cleaning things up.
EDIT 2023-10-20 16:12 UTC+2: The maintenance is over.
]]>FS Bucket hosts that will be updated during this maintenance are: n10 and n17.
The maintenance will take place on Thursday 19 October 2023, between 12:00 UTC+2 and 14:00 UTC+2.
During the update of each server host, the services will only be available in read-only mode. Once the update is complete, linked applications will be restarted automatically to take into account the environment variables of the updated services and to restore write capacity.
The required update time is estimated at 1 hour but the total time until the applications are restarted might be longer.
In case you have write issues after those updates, you can manually initiate a redeployment of your linked applications in order to avoid waiting for the automatic redeployment.
To check if your services are impacted, you can consult your FS Bucket’s server in the Dashboard tab of your add-ons, in the “Cluster information” section and thus determine the update day(s) that concerns you.
Please check if you have any old applications (>5 years) that are still using a buckets.json file in their code repository, as we will not be able to prioritize the redeployment of these applications and they will most likely suffer from read-only FS Bucket for an extended time. We therefore recommend that you now mount FS Bucket by environment variable (ideally by linking the add-on to your application). See more details in this documentation page: https://www.clever-cloud.com/doc/deploy/addon/fs-bucket/#configuring-your-application
Our support is available for any questions via the ticket center in the console.
]]>FS Bucket hosts that will be updated during this maintenance are: n15 and n16.
The maintenance will take place on Wednesday 18 October 2023, between 12:00 UTC+2 and 14:00 UTC+2.
During the update of each server host, the services will only be available in read-only mode. Once the update is complete, linked applications will be restarted automatically to take into account the environment variables of the updated services and to restore write capacity.
The required update time is estimated at 1 hour but the total time until the applications are restarted might be longer.
In case you have write issues after those updates, you can manually initiate a redeployment of your linked applications in order to avoid waiting for the automatic redeployment.
To check if your services are impacted, you can consult your FS Bucket’s server in the Dashboard tab of your add-ons, in the “Cluster information” section and thus determine the update day(s) that concerns you.
Please check if you have any old applications (>5 years) that are still using a buckets.json file in their code repository, as we will not be able to prioritize the redeployment of these applications and they will most likely suffer from read-only FS Bucket for an extended time. We therefore recommend that you now mount FS Bucket by environment variable (ideally by linking the add-on to your application). See more details in this documentation page: https://www.clever-cloud.com/doc/deploy/addon/fs-bucket/#configuring-your-application
Our support is available for any questions via the ticket center in the console.
]]>FS Bucket hosts that will be updated during this maintenance are: n12 and n13.
The maintenance will take place on Tuesday 17 October 2023, between 12:00 UTC+2 and 14:00 UTC+2.
During the update of each server host, the services will only be available in read-only mode. Once the update is complete, linked applications will be restarted automatically to take into account the environment variables of the updated services and to restore write capacity.
The required update time is estimated at 1 hour but the total time until the applications are restarted might be longer.
In case you have write issues after those updates, you can manually initiate a redeployment of your linked applications in order to avoid waiting for the automatic redeployment.
To check if your services are impacted, you can consult your FS Bucket’s server in the Dashboard tab of your add-ons, in the “Cluster information” section and thus determine the update day(s) that concerns you.
Please check if you have any old applications (>5 years) that are still using a buckets.json file in their code repository, as we will not be able to prioritize the redeployment of these applications and they will most likely suffer from read-only FS Bucket for an extended time. We therefore recommend that you now mount FS Bucket by environment variable (ideally by linking the add-on to your application). See more details in this documentation page: https://www.clever-cloud.com/doc/deploy/addon/fs-bucket/#configuring-your-application
Our support is available for any questions via the ticket center in the console.
EDIT 2023-10-17 12:10 UTC+2: The maintenance is starting. FSBucket servers are set in read-only mode.
EDIT 2023-10-17 12:47 UTC+2: Applications are being redeployed to use the new FSBucket server. You can also start a deployment on your side to speed things up.
EDIT 2023-10-17 16:15 UTC+2: The maintenance is over. All applications should now have access to their fsbucket since 14:00 UTC+2. Please reach out to our support team if you have any issues following this maintenance.
]]>PHP FTP hosts that will be updated during this maintenance are: n11 and n18.
The maintenance will take place on Monday 16 October 2023, between 12:00 UTC+2 and 14:00 UTC+2.
During the update of each server host, the services will only be available in read-only mode. Once the update is complete, linked applications will be restarted automatically to take into account the environment variables of the updated services and to restore write capacity.
The required update time is estimated at 1 hour but the total time until the applications are restarted might be longer.
In case you have write issues after those updates, you can manually initiate a redeployment of your PHP+FTP applications in order to avoid waiting for the automatic redeployment.
Our support is available for any questions via the ticket center in the console. This maintenance will be updated during the maintenance window.
EDIT 2023-10-16 12:03 UTC+2: The maintenance will begin shortly. FSBucket add-on hosted on those servers will soon become read-only.
EDIT 2023-10-16 12:09 UTC+2: FSBuckets are now read-only
EDIT 2023-10-16 12:54 UTC+2: Applications are being redeployed to use the new FSBucket server. You can also start a deployment on your side to speed things up.
EDIT 2023-10-16 14:04 UTC+2: The maintenance is over. All applications should now have access to their fsbucket since 13:30 UTC+2. Please reach out to our support team if you have any issues following this maintenance.
]]>EDIT 15:49 UTC: All services are now reachable again, the incident is now over.
]]>EDIT 10:00 PM: lag has been fully absorbed
]]>It may impact some databases that are hosted on top of this hypervisor.
06:20 The hypervisor seems to have encountered a kernel panic. It has been rebooted and we have fixed the kernel version to avoid future kernel panics.
06:45 Now that the hypervisor is back up, we are cleaning up: checking that all add-on instances have rebooted successfully and that all applications have redeployed successfully.
07:21 Everything is now back to normal
]]>As a result, deployments on all our zones will be disabled between 11:00 and 01:00 (2023-10-07).
EDIT 2023-10-07 01:32 UTC: The maintenance is over. Deployments are now working again.
]]>09:25 PM UTC: Network is back online. We are bringing back services which are not healthy.
09:30 PM UTC: The network is still flapping; we are on it.
10:06 PM UTC: Network seems stable. We are bringing back services which are not healthy.
10:37 PM UTC: We are bringing back services which are not healthy.
10:52 PM UTC: all services are back online. Good night.
]]>ssh -t ssh@sshgateway-clevercloud-customers.services.clever-cloud.com
or using our CLI command clever ssh
will be unavailable. Existing SSH connections through the gateway to services will be interrupted.
Once the maintenance is over, it is possible that some applications will need to be restarted to be able to be accessed through the SSH Gateway again.
The maintenance is planned to last less than 30 minutes.
EDIT 20:05 UTC: The maintenance is starting.
EDIT 20:27 UTC: The maintenance is now over. We are monitoring the results. You should now be able to access your services using the SSH gateway.
EDIT 20:27 UTC: Everything is working as intended. If you have any issues using the SSH gateway, you can try to redeploy your service and contact our support team.
EDIT 09:06 UTC: We implemented a fix and are monitoring the results. If you pushed new commits that didn't get deployed, you can either contact us through the support with your application id and the associated commit, or use our CLI with clever restart --commit <commit>.
EDIT 12:25 UTC: The incident is now resolved.
]]>The maintenance is expected to last less than 30 minutes.
EDIT 22:00 UTC: The maintenance is starting
EDIT 22:35 UTC: The maintenance is mostly over. Deployments and git repositories have been back for 15 minutes. We continue to make sure everything is running smoothly.
EDIT 23:05 UTC: Everything is back to normal since 22:20 UTC. The maintenance is over. Thanks for your patience.
EDIT 02:55 PM UTC: The network instability has been fixed. Some customers may experience a connection reset.
EDIT 14:46 UTC: This incident is now over. No more incorrect Monitoring/Unreachable alerts were emitted.
]]>EDIT 02:46 PM UTC: Latencies have been fixed by rebalancing data
EDIT 27/09 at 13:00 UTC: Queries were not available
EDIT 27/09 at 14:10 UTC: Queries are open again
We continue to investigate
EDIT 27/09 at 16:10 UTC: Closing incident
]]>EDIT 15:02 UTC: we deployed a new version of the API that will survive future pulsar outages.
]]>We have identified the issue and are working on it.
EDIT 13:31 UTC - we are still working on the issue.
EDIT 14:44 UTC - we are still working on the issue.
EDIT 16:09 UTC - fixed.
EDIT 13:45 UTC : We have found a network issue which caused storage nodes to time out and then crash. Those nodes are now up and running; we are beginning the recovery process.
EDIT 15:10 UTC : We have finished the recovery process and are consuming the lag.
EDIT 18:52 UTC : We have almost consumed all the data lag (an estimated 30 minutes left), but there are still 2 hours of metadata lag.
EDIT 21:00 UTC: We have caught up the data and metadata lag; queries are now open.
]]>EDIT 12:56 UTC: The main issue is now resolved and the API is back online. We continue to see some errors and are working towards identifying their source.
EDIT 14:25 UTC: The API has stabilized but we are still looking for the origin of the troubles.
EDIT 13/09 09:03 UTC: The API is unreachable again, we are working on it
EDIT 13/09 09:15 UTC: The API is now operational, the root cause has been identified.
]]>Our main API may be unavailable for 1 hour.
EDIT 00:30 UTC: The maintenance has been over for 25 minutes. We are monitoring the results.
]]>The maintenance is planned for one hour but is expected to last a few minutes at most.
EDIT 20:00 UTC: The maintenance is starting.
EDIT 20:02 UTC: The API is now unavailable as well as the Console.
EDIT 20:16 UTC: One of the steps took a bit more time than expected, we are back on track.
EDIT 20:44 UTC: Unexpected problems occurred and we are currently doing a rollback of the changes.
EDIT 20:54 UTC: The maintenance is over; the changes were rolled back and everything should now be operational again.
EDIT 18:50 UTC: The hypervisors have been back online for 25 minutes now; all services were restarted by our monitoring.
]]>EDIT 18:45 - The hypervisor had a kernel panic. During the reboot operation the kernel has been upgraded and this issue should not occur again.
]]>EDIT 06:34 UTC: The problem is back with elevated packet loss. Our network provider is currently having an incident and is looking into the issue.
EDIT 06:46 UTC: Some DNS domains for services hosted on other regions may also have issues to resolve because their authoritative server is currently hosted on the PAR region.
EDIT 06:55 UTC: The incident is still ongoing and our network provider is still looking into the issue.
EDIT 07:20 UTC: Our upstream network provider is currently experiencing a DDoS attack. We are currently looking to use an alternative network transit to avoid going through the upstream network provider.
EDIT 07:47 UTC: We are seeing improvements for the last 20 minutes. We still are waiting for a confirmation of the issue resolution.
EDIT 07:58 UTC: We are seeing some loss again.
EDIT 08:15 UTC: The DDoS is still happening. It's partially mitigated. We still see some loss, but there is less impact globally.
EDIT 10:54 UTC: We still see loss from time to time, but much less than before. We are keeping an eye on the situation.
EDIT 15:45 UTC: Most of the DDoS is mitigated; we haven't seen any loss these past few hours. We are still monitoring the situation.
EDIT 2023-09-06 15:24 UTC: No more instabilities were detected since yesterday. The incident is now over.
[EDIT] 21:38 UTC the root cause was identified and a patch deployed
]]>Metrics on the proxies seem ok. We are investigating why they are acting like that.
It seems some applications were causing connections to queue up, blocking new connections. We are looking into ways to prevent this from happening.
The issue is resolved
]]>EDIT 14:53 UTC: We are seeing signs of improvements since 14:50. We continue monitoring the situation.
EDIT 15:23 UTC: We confirm that the issue has been resolved since 14:50. Sorry for the inconvenience this incident may have caused.
EDIT 11:00 UTC: The problem is now resolved. Some logs may have been lost for that period. We apologize for the inconvenience.
]]>EDIT 18:00 UTC: The issue is resolved
]]>Some applications may have been redeployed multiple times with the Monitoring/Unreachable reason. Most of those deployments were false positives. Other applications may currently have troubles deploying.
We are working on restoring the service.
EDIT 15:49 UTC: The underlying issue has been found and fixed. Some deployments may have failed even when there was no reason for them to fail. You can start them again if needed. If you still have deployment issues, feel free to reach out to our support team.
]]>The issue has been identified and has been fixed.
]]>EDIT 12:39 UTC : The ip address is reachable
]]>EDIT 09:18 UTC : We are recovering from the events and consuming the lags. The storage layer is now operational
]]>EDIT 22:48 PM UTC: all reverse proxies are now working properly
]]>Edit 12:50 UTC: Control plane has recovered, everything is now OK
]]>EDIT 15:05 UTC: The issue appears to be limited to the Paris zone
EDIT 15:20 UTC: A counter measure has been deployed to mitigate issues. Deployments are now scheduled as expected. Some errors may still appear in your Logs. We're processing stuck deployments, but you may cancel or start a new one if you want to prioritize your deployment.
EDIT 15:00 UTC : We found that the NS and SOA records were incorrect; we have updated them.
EDIT 16:00 UTC: Everything is back to normal.
]]>EDIT 09:00 UTC The issue was found and fixed
21:11: only apps with redirect_https enabled are impacted
21:56: we rolled back to the old cleverapps load balancers
]]>EDIT 16:44 UTC: The root cause has been identified and a fix has been applied. We are monitoring the results.
EDIT 16:50 UTC: The service is now operational.
]]>EDIT 10:49 UTC: The underlying issue has been fixed. Some applications may have had troubles mounting FSBuckets, writing or reading files stored on that server between 08:50 UTC and 10:25 UTC. Impacted applications are currently being redeployed out of caution (most of them successfully reconnected to the server after the fix has been issued).
]]>Edit 14:10 UTC: query is re-open
We continue to investigate.
]]>EDIT 16:27 UTC: The hypervisor took some time to reboot but it is now up and running. We are making sure services are working fine following this incident.
EDIT 17:10 UTC: The incident is now over. The underlying problem has been identified but the hypervisor is currently in the upgrade queue.
]]>EDIT 18:14 UTC: The maintenance is starting
EDIT 22:00 UTC: The maintenance is now over
]]>EDIT 18:13 UTC: The maintenance is starting
EDIT 23:11 UTC: The maintenance is now over
]]>EDIT 20:43 UTC - fixed.
]]>Edit 04:58 PM UTC: A storage node had a hardware issue, it has been rebooted.
Maintenance will start on the 20th of June, at 03:30 PM UTC.
Edit 03:45 PM UTC: maintenance is starting.
Edit 04:58 PM UTC: maintenance is over.
]]>We are investigating the issue.
EDIT 09:00 PM UTC: the root cause has been corrected.
Maintenance will start on the 18th of June, at 02:30 PM UTC.
EDIT 02:36 PM UTC: maintenance is starting.
Edit 08:21 PM UTC: maintenance is still on-going, storage layer is a few minutes late on average.
EDIT 08:51 PM UTC: maintenance is over, we are catching up lag
EDIT 08:00 PM UTC: An error while catching up the lag has put the storage layer into an inconsistent state. Queries are disabled for now.
EDIT 11:00 PM UTC: storage layer is still inconsistent
EDIT 00:47 PM UTC D+1: storage layer is (finally?) consistent. We are catching up the lag
EDIT 04:30 PM UTC D+1: We have caught up the lag.
EDIT 07:29 AM UTC D+1: The storage layer developed inconsistencies. We are investigating why.
EDIT 08:10 AM UTC D+1: The storage layer is up and running. We are consuming the lag. Queries are disabled during this phase.
EDIT 08:45 AM UTC D+1: We have consumed the lag. Queries are available.
EDIT 00:50 UTC : The monitoring does not see network issues anymore.
EDIT 01:00 UTC : The monitoring has detected connectivity issues, we are fixing.
EDIT 01:30 UTC : The monitoring has detected new connectivity issues, we are on it.
]]>08:10 UTC : We have found the component causing this issue and restarted it. We are still investigating the root cause.
21/06 : The problem was most likely caused by the network instability observed at this time. We haven't detected any problems since.
]]>19:57 UTC: We are going to reboot it. Some databases (that run on this hypervisor) will become unresponsive for a few minutes.
20:18 UTC: Hypervisor has been rebooted. All services hosted on it have been checked: everything is up and running.
Logs show a kernel panic.
]]>EDIT 09:30 UTC : Following the incident https://www.clevercloudstatus.com/incident/669, the storage layer did not perform scheduled tasks.
EDIT 09:45 UTC : The storage layer is accepting writes. The logging system is operating normally.
]]>EDIT 00:27 UTC: The issue has been identified and fixed around 00:11 UTC. We continue identifying the impact on customer and internal services.
EDIT 01:00 UTC: We have identified the services impacted by the incident and have started to recover from the network issue. The identified impacted services are Metrics and access logs, which are taking time to recover; other services should be working normally.
EDIT 02:30 UTC: Metrics and access logs are recovering from the network issue.
EDIT 04:00 UTC: Metrics and access logs are still recovering from the network issue. To follow the incident, you can go to https://www.clevercloudstatus.com/incident/669
]]>EDIT 06:05 UTC: The storage layer is now up and healthy. We are now consuming the ingestion lag, it should take a few hours to fully resolve. Queries are now available but will show outdated data. We will update this status accordingly.
EDIT 10:00 UTC: We've had a slower ingestion than initially anticipated so queries are still returning out of date data. We've made some adjustments and saw an increase in ingestion for the last hour. We will still need a few hours to fully consume the lag.
EDIT 15:00 UTC: The lag has been consumed, the metrics and access logs stack is operating normally.
]]>EDIT 08:32 UTC : We have found the issue and the hypervisor is rebooting
EDIT 08:50 UTC: The hypervisor has finished rebooting and services are working.
]]>EDIT 11:40 UTC: All impacted applications have been redeployed automatically. We will investigate further why this server rebooted. The incident is now over.
]]>12:26 UTC: we restarted the node responsible for the issue. While it re-converges, we stop the egress servers. We will put them back on in a few minutes.
13:31 UTC: Query is back online. We are still catching up the lag, so new datapoints may not be available
14:35 UTC: The lag has been caught up.
]]>EDIT 16:00 UTC : The storage layer is restarted and we are consuming the ingestion lag
]]>We will investigate to understand why this hypervisor rebooted in the first place.
]]>EDIT 13:00 UTC : We have located the root cause, we are applying a fix.
EDIT 14:20 UTC : The issue is resolved
]]>EDIT 03:58 UTC: server is back online. All databases should now be reachable.
EDIT: The lag has been caught up.
]]>2023-06-02 09:15 UTC : after customer complaints we found out about the LB misconfiguration and fixed it.
2023-06-02 09:28 UTC : Monitoring checks have been added to catch this kind of issue right away.
EDIT 17:51 UTC : We have found the issue and the fix has been applied. Everything is operating normally.
]]>EDIT 13:30 UTC: Our ticket center provider told us that the issue has been mitigated on their end and that it is now resolved. We keep monitoring the situation for now but we can indeed see that service are operating normally those last few minutes.
EDIT 14:47 UTC: We did not see any other issues. We consider this incident to be over.
]]>EDIT 11:46 UTC : We have found the issue and fixed it. We are recovering the lag.
EDIT 13:19 UTC: The lag has been consumed; everything is operating normally.
]]>We are awaiting information from our infrastructure provider regarding this incident.
EDIT 19:53 UTC: It seems like multiple servers are impacted at the same time, we believe it to be an issue with a specific OVH rack or room. Multiple services on the zone are thus impacted. We are looking at ways to mitigate the issues.
EDIT 20:05 UTC: The servers have been reachable again for a few minutes. We are currently making sure everything is fine. The OVH incident can be followed here: https://bare-metal-servers.status-ovhcloud.com/incidents/k664s90jxfj0
EDIT 20:15 UTC: Servers in the impacted rack couldn't reach each other up until now, which may have prevented some services from working correctly. It seems like OVH fixed it before we could report it to them. We are continuing to make sure everything is working as expected.
EDIT 20:36 UTC: The incident is over. We are redeploying all the applications of the zone to be on the safe side.
]]>EDIT 14:14 UTC: Metrics ingestion is now back to normal. Access logs are being re-queued and are currently lagging a bit.
EDIT 14:20 UTC: Access logs have been ingested and are now up-to-date. The incident is now over.
EDIT 16:25 UTC: The problem came back, we are working on it.
EDIT 16:56 UTC: The problem is now solved again. Another root cause has been identified and has been fixed.
]]>EDIT 15:05 UTC: The issue has been found and fixed. Performance went back to normal around 13:45 UTC. Additional measures will be taken to avoid this issue in the future.
]]>09:30 UTC : A huge number of add-ons recently created by malicious users was detected. They were issuing a lot of configuration changes on our reverse proxies, making them unstable.
We banned those users and are watching the situation closely.
]]>12:20 UTC: The deployments are still running slowly. We are still cleaning up the situation.
13:16 UTC: we have found a deployment loop with the monitoring. We are stopping it…
13:51 UTC: cleaning is done, we are watching to see if deployments are running as expected
14:00 UTC: we have found an abnormal behaviour, we are investigating
D+1 14:30 UTC: we have made a patch for the abnormal behaviour and we are watching deployments
]]>--
Thursday, April 27 at 2:00 PM CEST (12:00 UTC), we will apply a major update to the Clever Cloud APIs.
This update prepares work for future and current services.
All Clever Cloud public regions are concerned. Gov and Private regions aren't concerned, nor are On Premise regions.
All Applications and Cloud Services will continue to run as expected.
Some API calls may be delayed or refused for a few minutes. Deployments may take a bit longer than expected.
We expect services to be fully operational by 3 PM CEST.
If you manage your own scaling, please make sure your capacity requirements are fulfilled by 1 PM CEST, since autoscaling won't be as reactive during the maintenance window.
--
We will keep you posted on the process here and via this Twitter thread.
EDIT 20:18:00 UTC : The issue is mitigated and we are watching
EDIT 20:50:00 UTC : Everything is back to normal levels
]]>EDIT 25 of April 08:04 AM UTC: We are still experiencing some deployment issues. The issue has been identified and we are working on a fix.
]]>EDIT 8:00 AM UTC: VMs are no longer stuck.
Reason: a malicious user found a way to start a lot of huge instances and run resource-heavy cryptomining operations. This loaded the hypervisors and made some APIs unresponsive. We blocked them and took action to prevent future abuse of our service.
]]>EDIT 10:40 AM UTC: We have found the issue and we are currently fixing it.
EDIT 04:01 PM UTC: The issue is resolved
]]>EDIT 09:35 UTC: The problem should now be mostly resolved. Some services might still have troubles, dedicated incidents will be opened. We continue monitoring the situation.
]]>EDIT 12:42 UTC: The maintenance operation is completed and no more lag is present
]]>EDIT 20:30 UTC: The problem was due to an increased load and capacity has been added to handle it. We continue to monitor the incident.
EDIT 00:53 UTC: We did not see any other issues since 20:30 UTC. This problem is now fixed.
]]>We have disabled the creation of new MongoDB DEV plans. This will give us time to set up a new cluster and clean the existing one.
You can still provision the other MongoDB plans.
]]>It seems that the cluster got a lot of connections and could not handle the load.
The cluster is currently reconstructing. Waiting for it to finish.
19:40 The cluster has finished reconstructing and is now taking connections.
]]>EDIT 09:25 UTC : We have begun the recovery process. We are waiting for the process to terminate.
EDIT 09:32 UTC : The recovery process has ended successfully; the cluster is healthy.
]]>EDIT 15:00 UTC : The deployment system is now in sync and previously frozen deployments are up and running
]]>EDIT 13:56 UTC : A node crashed. The metrics storage layer has finished its recovery process; it will take 20 minutes to consume the lag.
EDIT 14:22 UTC : The lag has been consumed and the metrics storage layer is operating normally.
]]>EDIT 8:31 UTC: hypervisor is up and running.
]]>EDIT 18:56 UTC: To be more specific about the instabilities, the connections were slower to be processed, increasing the response time, sometimes drastically. The root cause has been found and fixed at 18:42 UTC. Since then, everything is back to normal. We continue to monitor the situation.
EDIT 19:11 UTC: Additional investigation will be performed to pinpoint the exact cause of the problem and measures will be added to prevent it from happening again. Sorry for the inconvenience.
]]>During these 30 minutes, some deployments may not go through. Some calls may fail.
Everything seems to have gone well. The operation was over at 21:28.
EDIT 23:15 UTC: It seems like some application creations are having issues following this change, we are investigating.
EDIT 00:10 UTC: A fix has been implemented and applications are now correctly created. Some users may have had the API answer a 200 - OK for application creation but following requests for that application would return a 404 - Not Found. Sorry for the inconvenience.
]]>16:30: We started seeing alerts about high load on the primary node.
17:00: We started getting reports about the cluster being unreachable.
18:00: After checking the cluster, we decided to restart the primary node. Data may have been lost as the node was not writing / replicating correctly. We are still waiting for the primary node to restart. The secondary does not seem to elect itself as primary.
19:30: The secondary finally got promoted to primary. We are blocking users with unfair use of the cluster.
22:45: We detected that the node we restarted failed to get back into the cluster. We decided to remove it entirely and re-create that node from scratch.
2023-03-13 10:00: The node has fully reached the "SECONDARY" state. We put it back into production.
Measures have been taken to prevent future unfair use from users.
]]>11:30 Our main API keeps becoming unresponsive. We are investigating. This impacts the following, in an irregular fashion:
clever ssh may not succeed.
Applications should keep running, but some monitoring deployments may fail.
12:55 The API seems to have stabilized. The database seems to have had a huge load. We are investigating the queries responsible for that load and trying to improve them.
]]>EDIT 17:15 UTC: The issue is now resolved. A part of our infrastructure in Paris couldn't access some public DNS servers anymore, leading to multiple DNS queries failing. An upstream network provider made a change that fixed the problem around 16:52 UTC.
]]>EDIT 16:03 UTC: We are seeing improvements, we continue to monitor the situation and keep investigating the root cause. We continue to add more data collection around the various points of contention.
]]>Update 11:11 AM UTC: The hypervisor has been rebooted, add-ons should be reachable. Root cause of the issue will be determined later. In the meantime, applications hosted on that hypervisor are still redeploying. We continue to monitor the situation.
Update 03:13 PM UTC: the same hypervisor went down again. It has been rebooted. Add-ons should be reachable. In the meantime, applications hosted on that hypervisor are still redeploying. We continue to monitor the situation.
]]>EDIT 14:37 UTC: We are seeing improvements, we continue to monitor the situation.
EDIT 16:23 UTC: The incident is now over.
]]>EDIT 03/03 02:15 PM UTC: Connectivity between MTL and our control plane is now fully restored.
]]>EDIT 10:32 AM UTC: a connectivity issue has been detected between RBX and our control-plane. The issue is now fixed.
]]>The maintenance is expected to last 5 minutes. If you urgently need to contact us, you can send an email to support@clever-cloud.com
EDIT 19:38 UTC: The maintenance is now over. Actions on the ticket center should be fully available. If you encounter any problems following this update, please email us at support@clever-cloud.com
]]>Impacted users will receive an email for each impacted service.
EDIT 2023-02-28 20:25 UTC: The maintenance is starting
EDIT 2023-02-28 22:18 UTC: The maintenance is now over.
]]>EDIT 15:48 UTC: We are seeing improvements and the situation is currently back to normal. The root cause seems to have been a BGP announcement change on GitHub's side that made our traffic go through suboptimal routes, leading to degraded performance. We keep monitoring the situation.
EDIT 16:30 UTC: The incident is fully resolved.
]]>The reboot is planned tonight (15/02/2023) at 22:00 UTC. Maintenance will start at 21:00 UTC.
EDIT 21:07 UTC: The maintenance is starting. Add-ons will be automatically migrated in the next few minutes.
EDIT 22:52 UTC: The maintenance is over.
]]>EDIT 22:47 UTC: The hypervisor is back online and add-ons have been up for a few minutes. The root cause of the issue will be determined later. In the meantime, applications hosted on that hypervisor are still redeploying. We continue to monitor the situation.
EDIT 23:44 UTC: The incident is now over. Sorry for the inconvenience.
]]>EDIT 10:55 UTC: The root cause has been found. It was only impacting multipart uploads. For deployments already at the upload phase, you will need to cancel the current deployment and start a new one for the problem to be fixed. Sorry for the inconvenience.
]]>EDIT 22:24 UTC: The hypervisor has been up again for 10 minutes. Add-ons are available again. We are making sure all applications were redeployed.
EDIT 00:17 UTC: The incident is over.
]]>All services running on that hypervisor are still up and running, but deployments fail to stop the obsolete VMs and we cannot connect to the host itself. We suspect a partial kernel crash on the hypervisor's host. We are investigating and may reboot the hypervisor in the following minutes/hours. (First, we are trying to migrate as many important services as possible to avoid causing too much downtime to our customers.)
EDIT 16:46 UTC: We are starting to migrate add-ons on the impacted hypervisor.
EDIT 18:54 UTC: We rebooted the hypervisor, everything went well, all the remaining services are UP again.
]]>WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!
when pushing code using git+ssh on our Git repositories. This was due to an update of the allowed signature algorithms of our SSH servers. Users that had an old signature algorithm stored in their known_hosts ssh file were impacted.
The change has been rolled back.
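For context on the mechanism: the warning appears when the key your SSH client has stored for the host no longer matches what the server presents. If you want to see which key algorithms your known_hosts file has recorded for a host, here is a minimal sketch in Python (the hostname is a placeholder; hashed entries are skipped; this is not an official Clever Cloud tool). If a stale entry ever needs to be removed, ssh-keygen -R <hostname> is the usual way, but only do so once you have confirmed the key change is legitimate.

```
# Minimal sketch: list the host key algorithms recorded in ~/.ssh/known_hosts
# for a given host. The hostname is a placeholder; hashed entries are skipped.
from pathlib import Path

def known_host_algorithms(hostname: str):
    algorithms = []
    known_hosts = Path.home() / ".ssh" / "known_hosts"
    for line in known_hosts.read_text().splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0].startswith(("#", "@", "|1|")):
            continue  # skip comments, marker lines, hashed hosts, malformed lines
        if hostname in fields[0].split(","):
            algorithms.append(fields[1])  # e.g. "ssh-rsa", "ssh-ed25519"
    return algorithms

if __name__ == "__main__":
    print(known_host_algorithms("git.example.com"))
```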
]]>EDIT 10:00 UTC We have made a hardware upgrade to the MySQL shared cluster
]]>EDIT 01:00 UTC the update deployment has been rolled back
]]>EDIT 9:08 UTC : Backends behind the Clever Cloud API are up and running. The number of timeouts has decreased. Everything is operating normally.
]]>EDIT 10:10 UTC The operation to increase the disk space is done. We are redeploying the associated applications.
]]>EDIT 22:56 UTC : The storage backend has left the read-only mode
]]>** EDIT 13:59 UTC ** One hypervisor is up and running
** EDIT 14:52 UTC ** The second hypervisor is down due to hardware issues
** EDIT 15:22 UTC ** Applications and databases may be difficult to reach as a load balancer node is hosted on the down hypervisor
** EDIT 17:00 UTC ** Deployments may have been impacted, we are redeploying the system
** EDIT 17:30 UTC ** The hypervisor is up and running. We are cleaning up the last things
** EDIT 18:17 UTC ** Hypervisors are up and running. All systems seem to be working normally
]]>04:40 UTC, we took the decision to lower the replication ratio to let the cluster breathe.
A lot of backups failed, though. We will start them again during the day.
]]>EDIT 21:56 UTC: we are experiencing a network connectivity issue, impacting parts of Paris region. Cellar is also impacted.
EDIT 22:01 UTC: Network connectivity is back online. Apps should be reachable. Cellar is in recovery, we are working on it.
EDIT 22:25 UTC: Cellar should be accessible. You may experience a bit more latency due to recovery processes in progress.
EDIT 22:57 UTC: Everything should be up.
]]>Edit 3:25 pm UTC: The hypervisor is back online. All impacted applications have been redeployed. If you are experiencing an issue, please contact our support.
]]>Metrics queries through the Console or Grafana, as well as access logs queries, are currently affected.
EDIT 16:44 UTC: The service is back up, we are starting to process the backlog of events. You should now be able to query the data but it might lag a bit.
EDIT 17:01 UTC: The queue has been ingested. The service is now back to normal. Sorry for the inconvenience
]]>EDIT 13:46 UTC: The slowness is now resolved since 13:35. The initial cause of the slowness has been found and we continue to monitor the situation.
]]>EDIT 15:25 UTC: Deployments are running again. Some more operations will be done in the next few minutes to stabilize the situation. In the meantime, we continue to monitor the health of the deployment system.
EDIT 15:45 UTC: The incident is now over. If you still have troubles deploying your application, please reach out to our support team. Sorry for the inconvenience.
]]>We are working on fixing the clocks on the Ceph monitoring servers. (Ceph is the software we use to provide the Cellar service.)
EDIT 12:40 UTC+1: One of the reverse proxies in front of the Cellar system was desynchronized. This proxy is now out of the pool for further investigation and the issue should now be fixed.
]]>Our metrics and access logs stack is currently unavailable, we are working towards bringing it back up.
Update 9:55 am UTC: Metrics and access log storage is now up. We are catching up the lag
Update 14:33 UTC: The lag of the Metrics and access logs platform is now resolved. Regarding the network instabilities, our network provider identified the issue and is working towards resolving it. It may take a few hours to get back to a nominal situation. We did not see any other instabilities since this morning.
Update 15:59 UTC: Another network issue happened at 15:50 UTC and lasted for ~1 minute; parts of the Paris zone were unreachable during that time.
Update 23:11 UTC: No other incident has been seen, we are still waiting for our network provider to ensure that the issue is resolved on their end.
Update 2023-01-09 14:18 UTC: We've seen two new events, one at 13:23 UTC and another at 14:14 UTC. We notified our network provider. Those may be related to the same problems we've seen last week.
Update 2023-01-09 19:47 UTC: Those two events weren't linked to the ones seen last week. The cause has been identified by the network provider and has been fixed. We are still waiting for confirmation that the original issue is resolved.
]]>Delayed git repository creation for newly created applications
Delayed addition or removal of SSH keys authorized to interact with the git repositories
GitHub applications will not be impacted.
During the maintenance, you will be able to continue to push your updates as well as do deployments. The maintenance is expected to last up to 1 hour. If you have any questions, please reach out to our support team.
EDIT 18:01 UTC+1: The maintenance is starting.
EDIT 18:35 UTC+1: The maintenance is now over. Thanks for your patience.
]]>EDIT 13:45 UTC - done.
]]>EDIT: 2022-12-20 19:47 UTC : During the recovery process, some services went down with TLS issues
]]>Update 4:16 AM UTC: The HV is now up. We are running the cleanup tasks associated with it.
Update: 4:54 AM UTC: Cleanup is over.
]]>The cause has been identified and a solution is currently being investigated. This incident will be updated as soon as we have more information.
EDIT 13:44 UTC: Another network interruption happened at 13:01 UTC. A fix is currently being tested.
EDIT 14 Dec 2022 15:55 UTC: The fix appears to be working as expected. This incident is now over.
]]>EDIT 15:55 UTC: The cause has been found. This issue only affects applications tied to a unique IP proxy service. The issue has been mitigated in the last minutes and we are working to fully fix it.
EDIT 16:20 UTC: The issue has been fixed and should not happen again. If you encounter weird Monitoring/Unreachable deployments, feel free to contact our support team.
]]>EDIT: 18:30 UTC - all systems are up
]]>EDIT 24 of November 9:33 UTC: Balancing is over.
]]>A monitoring desynchronization is causing disturbances on deployments. We are investigating and manually cleaning up unnecessary deployments.
We still have to clean up some stuck deployments, but the system has now recovered.
]]>We are investigating the issue.
EDIT 14:26 UTC: The underlying issue has been identified and fixed. Services, including the Console and CLI should now be loading as usual. Sorry for the inconvenience.
]]>EDIT 11:23 UTC: Ingestion lag is now resolved, metrics and access logs should now be up-to-date.
]]>Status:
UPDATE 20:13 UTC: Network is kind of coming back up, but we see 80% to 90% packet loss.
UPDATE 21:50 UTC: Still a lot (90%) of loss on the PAR -> SGP/SYD route, way less (30%) in the SGP/SYD -> PAR route.
UPDATE 2022-11-01 08:12 UTC: >90% of loss on the PAR -> SGP/SYD route.
UPDATE 2022-11-01 18:12 UTC: Network seems fine.
]]>Update 10:36 UTC: Performance has been fixed.
]]>** EDIT 11:55 UTC **: We have found the root cause and mitigated the issue. We are deploying the fix.
]]>We are investigating and watching the situation.
At 11:53 UTC, the monitoring sees everything up again. We are performing a few checks on some services.
]]>** EDIT 18:10 UTC ** : The issue has been identified and actions to solve it have been performed
]]>Some hypervisors are behaving strangely. We are watching and fixing them.
EDIT 10:20:00 UTC: Deployments are currently unavailable while we work around the issue.
EDIT 11:31:00 UTC: Deployments issues are fixed. We continue to monitor the situation. If you have troubles redeploying an application, please contact our support.
POSTMORTEM: The Pulsar outage that started around 04:30 UTC (see https://www.clevercloudstatus.com/incident/574) got in the way of:
The pulsar notification system is being gradually deployed on our infrastructure, having passed the tests on our preproduction zone. We do have a fallback method for notifications. However, the issue was unusual in that the pulsar notifications were not failing cleanly: they timed out after a long time, preventing the fallback from triggering. We stopped all deployments at 10:20 UTC. We worked on quickly adding an emergency flag to prevent the hypervisors from using pulsar for notifications. This way, we can bypass it and go straight to the fallback method.
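As an illustration of the mitigation described above (bounding the primary notification attempt with a hard timeout so the fallback can fire quickly instead of hanging), here is a minimal sketch; the function names are hypothetical and this is not the actual deployment code:

```
# Minimal sketch: try the primary notification path with a hard timeout, then
# fall back quickly. send_via_pulsar / send_via_fallback are hypothetical stand-ins.
import time
from concurrent.futures import ThreadPoolExecutor

def send_via_pulsar(event):
    time.sleep(5)            # simulate a call that hangs instead of failing cleanly
    return "pulsar"

def send_via_fallback(event):
    return f"fallback delivered: {event}"

def notify(event, timeout_s=2.0, use_pulsar=True):
    """Try the primary path for at most timeout_s seconds, then fall back."""
    if use_pulsar:
        pool = ThreadPoolExecutor(max_workers=1)
        future = pool.submit(send_via_pulsar, event)
        try:
            result = future.result(timeout=timeout_s)  # fail fast instead of hanging
            pool.shutdown(wait=False)
            return result
        except Exception:
            pool.shutdown(wait=False)  # leave the stuck call behind, do not wait for it
    return send_via_fallback(event)

if __name__ == "__main__":
    print(notify("deployment-finished"))  # fallback result after ~2s; the script
                                          # exits once the background call finishes
```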
To avoid this issue, we are working on the following:
POSTMORTEM (all times are UTC): Around 04:30: Timeouts in inter-node connections started to show up in the logs. They did not trigger alerts in the monitoring. Around 05:00: We started getting issues in our infrastructure from software using that cluster.
11:30: We disabled the brokers to analyze the issue.
14:42: The incident is now resolved. If you still encounter any problems, please contact our support.
]]>Additional investigations will be conducted to understand why our monitoring system did not report the issue earlier. Apologies for the inconvenience.
]]>The operation will take 10 minutes, during which the add-on API will be unreachable.
]]>05:45 - To prevent issues on the infrastructure, we disabled all deployments.
05:55 - We detect that some VMs are DOWN. It seems that the pulsar connection issues have overwhelmed the hypervisor's processes.
06:05 - We shut down the processes that fill up the hypervisors. It seems to fix the issue.
06:20 - The deployments seem to be back on track. We continue investigating the pulsar issue before putting it back into the deployment processes.
09:09 - We are still experiencing deployment issues. We are investigating.
12:28 - Deployments have been fixed.
]]>It looks like we are under a DDoS. We are monitoring it and blocking IPs that are performing the most requests.
EDIT 15:08 UTC: we have found the application that was taking 50% of all the platform traffic. We blocked all the IPs trying to reach that application. Traffic is now operational.
]]>EDIT 16:06 UTC: Ingestion lag is now resolved.
]]>** 16:30 UTC **: Incident has been resolved
]]>No service degradation is to be expected from this warning.
Please reach out to our support team should you have any questions regarding this matter.
EDIT 2022-09-29 17:30 UTC: A first license update has been applied. A new license update will be applied in the following days to finish the license update.
EDIT 2022-10-12 16:55 UTC: All licenses have been updated with a valid platinum license. The incident is over.
]]>26/09/2022 12:00 UTC: End of incident
]]>EDIT 16:46 UTC: The fix has been deployed. We are monitoring the situation. This issue also impacted Heptapod runners creation.
EDIT 17:28 UTC: The issue has been fixed, runner creation is now working correctly. Sorry for the trouble.
]]>EDIT 11:50 UTC: First investigations show that it is not only a network issue between our Paris infrastructure and the OVH network. It seems to impact other network links as well. We will reach out to OVH and try to learn more about it.
EDIT 11:51 UTC: The incident has been renamed from "Network issues between Paris and OVH zones" to "Network issues on OVH zones"
EDIT 11:58 UTC: We have been seeing improvements for a few minutes now. Connectivity has been restored from our point of view. We keep waiting for more information.
EDIT 12:09 UTC: We have not seen any new disruption so far. We consider this incident closed while we wait for a more detailed incident report from OVH.
EDIT 12:59 UTC: OVH status: https://network.status-ovhcloud.com/incidents/5mldyhd6v99c
]]>This will also impact some FSBuckets add-ons during which reads and writes will be unavailable. Applications will be redeployed automatically once the maintenance is over to make sure they correctly re-connect to the FSBucket server.
The maintenance is expected to last 15 minutes.
Impacted users will shortly receive an email with the impacted add-ons.
** Edit 08:05 UTC ** Waiting for last migration to end
** Edit 08:25 UTC ** Last migration has ended, the maintenance is beginning
** Edit 08:35 UTC ** The server has rebooted successfully
** Edit 08:55 UTC ** Everything is up and running normally
]]>We have fixed the issue and are watching the service.
]]>EDIT 10:38 UTC: All expired tokens have been regenerated and updated. Sorry for the inconvenience.
]]>EDIT 03/09/2022 12:10 UTC: lag is finally catching up, we will keep you posted.
EDIT 03/09/2022 16:10 UTC: The lag has fully recovered.
]]>This will also impact some FSBuckets add-ons during which reads and writes will be unavailable. Applications will be redeployed automatically once the maintenance is over to make sure they correctly re-connect to the FSBucket server.
The maintenance is expected to last 15 minutes.
Impacted users will shortly receive an email with the impacted add-ons.
EDIT 2022-09-05 21:10 UTC: Add-on migrations are starting
EDIT 2022-09-05 21:40 UTC: Add-ons have been migrated. The hypervisor reboot will happen in twenty minutes.
EDIT 2022-09-05 22:00 UTC: Hypervisor is rebooting
EDIT 2022-09-05 22:28 UTC: The hypervisor was rebooted in 4 minutes; the FSBucket server came back one minute later with most clients reconnecting. We restarted all affected applications to make sure everyone properly reconnects.
]]>It affects: 1 load balancer, 1 Redis add-on, 1 MySQL add-on, and the free PostgreSQL databases on MTL.
Update 16:40: After investigating, we decided to redirect the IP of the load balancer to the second LB. A ticket has been opened with OVHcloud to investigate what seems to be a hardware issue.
Update 17:56: The OVHcloud team physically checked the server: the RAID card was broken. They changed it and restarted the server.
Update 18:05: All VMs on the hypervisor are up and running again.
]]>EDIT 17:38 UTC: Hypervisor has been rebooted. Services are being restarted.
EDIT 18:08 UTC: Services have all been restarted. We continue looking into why the hypervisor went down and continue to monitor the situation.
EDIT 18:27 UTC: Initial investigation shows that a KVM kernel bug was encountered, leading to a kernel crash. We will investigate further to see if this can be mitigated by an update. The incident is now over.
]]>EDIT 07:04 UTC: We are seeing network improvements to reach the zone. It is currently operational but we are still waiting on confirmation from our provider. From our point of view as of now, traffic towards the zone was dropped when reaching the Level3 network transit. Our network provider seems to have changed it to another provider, allowing us to reach the zone again.
EDIT 12:18 UTC. The network problem is fully resolved. We are still waiting for an incident report from the network operator of the Datacenter. We will share it once available.
EDIT 2022-08-26 14:27 UTC: Here is the report from our provider: It has been identified that the incident is due to a bug found in our device at DRT1. As an initial resolution, our team rebooted the device. Consequently, all alarms cleared and all services were restored after executing the said activity. As of the moment, we can confirm that the link has remained clean and error-free since the service went up.
]]>EDIT 18:02 UTC+2: Hypervisor is rebooting
EDIT 18:04 UTC+2: Hypervisor is up again. Services are currently restarting.
EDIT 18:25 UTC+2: Hypervisor services are all up since a few minutes. Add-ons should now be reachable. Applications of owners using the FSBucket server that is hosted on this hypervisor will be redeployed. Since there is a huge number of applications, you can deploy them on your end directly if needed. We will continue to monitor the situation.
EDIT 19:10 UTC+2: The situation seems to be back to normal. We will investigate further why this hypervisor became unresponsive. If you still have any issues, please contact our support team.
]]>EDIT: 00:54 The issue has been resolved; deployments should now be working normally.
]]>Once the maintenance is over, you will have to refresh your Clever Cloud Console to be able to access your tickets or contact our team.
During this maintenance, you will still be able to reach our support team using our email address: support@clever-cloud.com
EDIT 2022-07-26 18:59 UTC+2: The maintenance is about to start.
EDIT 2022-07-26 19:10 UTC+2: The maintenance is now over. You will need to refresh your Clever Cloud Console to access the ticket center.
]]>Investigations will be carried out to understand how this happened and why our monitoring did not raise an alert.
The cluster should now be fully operational.
]]>EDIT 10:40 UTC - fixed.
]]>EDIT 16:03: Connectivity has been restored.
]]>Maintenance will start at 07:30 am UTC
EDIT 07:30 am UTC: Starting maintenance
EDIT 08:16 am UTC: Maintenance is over, we are catching up with the lag
EDIT 08:30 am UTC: Queries are currently disabled to speed up recovery
EDIT 09:17 am UTC: our maintenance triggered a major compaction on our storage layer. To speed up recovery, queries are still disabled
EDIT 16:20 UTC: The major compaction is over. We are struggling to handle both read and write operations at the same time. We are working on it.
EDIT 20:23 UTC: Queries are still disabled. We are testing new configurations to resolve the issue
EDIT 14 of July 9:22 am UTC: it's a brand new day, we are still working on it.
EDIT 14 of July 18:26 UTC: We are still struggling to handle both read and write operations at the same time. We are working on it. Happy French national day.
EDIT 16 of July 17:35 UTC: We found a performance issue triggered when the dotmap on the Console is accessed. We disabled some macros used to retrieve data to allow other users to access metrics. Metrics and access logs are now accessible.
]]>Some applications are being redeployed for Monitoring/Unreachable because the monitoring couldn't see them anymore.
Things seem to be working fine again since 09:37 UTC. We continue to monitor the situation and will try to get more information from OVH.
EDIT 11:12 UTC: The issue has not occurred again. We will wait for any input from OVH and will add it here if we get any useful information.
]]>EDIT 15:32:00 UTC: The server is back online. We are making sure services are correctly restarted. Additional services were impacted: One application reverse proxy and one add-on reverse proxy were unavailable.
EDIT 15:48:00 UTC: We are still investigating the cause of the reboot. We opened a ticket on OVH services to know if they had any un-planned intervention for that machine.
EDIT 16:03:00 UTC: The machine is unreachable again. We are investigating.
EDIT 16:11:00 UTC: The machine is up again. We are starting to suspect a hardware issue.
EDIT 16:30:00 UTC: We will drop all services from the machine to avoid any other issues until we know more about the underlying issue. FSBuckets server will be moved out around 19:00 UTC.
EDIT 19:59:00 UTC: Unfortunately, FSBuckets are going to require more time to move to another server. So far the server is working fine but OVH suspects an issue with the power supply.
EDIT 23:58:00 UTC: The FSBuckets migration is starting. FSBuckets will be set into read-only and applications will be redeployed to use the new server.
EDIT 2022-07-09 00:28:00 UTC: Buckets are fully migrated. The server is now empty and will be investigated further by OVH. This incident is now over.
]]>EDIT 22:30 UTC. The maintenance is starting.
EDIT 22:55 UTC: Maintenance is over, no visible impact happened, links failed over in less than 100ms each time.
]]>One of the partitions is corrupted; we are fixing it.
EDIT 17:10 UTC: The underlying issue has been fixed. The queue is currently being processed. Some events might have been lost during the cluster rebalance. Data points will take a few more hours to be up-to-date in the various dashboards.
EDIT: Queue is in sync
]]>A batch was sent by an employee. The throttle interval was set too small and the batch made a huge number of queries to the database, making it unresponsive. We stopped the batch and will restart it with a higher throttle interval.
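For illustration, a minimal sketch of the throttling idea (the interval value and names are illustrative, not the actual batch code):

```
# Minimal sketch of the throttling idea; the interval and names are illustrative.
import time

def run_batch(items, do_query, interval_s=0.5):
    """Run one query per item, pausing between queries to cap database load.

    Too small an interval_s lets the batch flood the database; raising it
    trades batch duration for database responsiveness.
    """
    for item in items:
        do_query(item)
        time.sleep(interval_s)  # throttle: at most ~1/interval_s queries per second

if __name__ == "__main__":
    run_batch(range(5), lambda i: print(f"query {i}"), interval_s=0.5)
```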
]]>This means that the data transferred to the server is encrypted, and that even if they are intercepted, they cannot be read by a third party. This protection has been provided by the TLS (Transport Layer Security) protocol for almost 20 years, whether it’s a personal site, an online shop or an access to your bank’s services.
Over time, this critical technical brick on the Internet has evolved to strengthen the level of security it offers. In August 2018, its version 1.3 (the latest) was released. Meanwhile, versions 1.0 and 1.1 were considered to no longer offer a sufficient level of protection. They have been deprecated by the IETF (Internet Engineering Task Force) since March 2021 and have therefore been gradually removed from recent browsers such as Firefox, Chrome and its derivatives or Safari.
At Clever Cloud, we have seen our customers adopt TLS 1.2 and 1.3 gradually. On our load balancers, based on our in-house and open source reverse proxy Sōzu, the latest version accounts for over 90% of the requests processed each day. TLS 1.2 for just under 9%. TLS 1.0 and 1.1 for only a few tens of thousands of requests per day, less than 0.1% of our traffic.
While we have maintained these versions for compatibility reasons, this will no longer be the case as of June 30. We will of course inform the customers affected by this choice, and encourage them to switch to more recent versions, which will have advantages for them in terms of security, performance and SEO.
Several reminders will be sent between now and the final shutdown of TLS 1.0 and 1.1. If you have any questions on this subject, please contact our support team through the Console.
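If you are unsure which TLS version your clients actually negotiate with your application, Python's standard ssl module can report it; a minimal sketch (the hostname is a placeholder for your own domain):

```
# Minimal sketch: report which TLS version is negotiated with a server.
# The hostname is a placeholder for your own domain.
import socket
import ssl

def negotiated_tls_version(hostname: str, port: int = 443) -> str:
    context = ssl.create_default_context()
    # Uncomment to refuse anything older than TLS 1.2, mirroring the new policy:
    # context.minimum_version = ssl.TLSVersion.TLSv1_2
    with socket.create_connection((hostname, port), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.version()  # e.g. "TLSv1.3"

if __name__ == "__main__":
    print(negotiated_tls_version("example.com"))
```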
EDIT 2:00 PM UTC: all public load balancers have been updated with the new configuration
]]>EDIT 14:37 UTC: Network connectivity has been resolved. Database is starting.
]]>Edit 07:13 UTC : the ticket center is back online.
]]>EDIT 13:02 UTC: The index has reloaded
]]>EDIT 06:04 UTC: The server experienced a hardware failure. It may not be able to come back. Applications on it were redeployed elsewhere. Custom services and add-ons are currently impacted.
EDIT 06:23 UTC: A public reverse proxy serving requests for domain.par.clever-cloud.com (185.42.117.109) was on this hypervisor. This IP was moved to another server. Between 05:23 and 05:35, it was unreachable.
EDIT 06:52 UTC: ETA for server to come back is 08:00
EDIT 07:46 UTC: Hardware has been changed, server will be rebooted.
EDIT 07:57 UTC: Server is back online, we are making sure all services are up.
EDIT 09:10 UTC: Everything is now back to normal, the incident is over. We will investigate further on the reason of the hardware failure.
]]>EDIT 13:55 UTC: Our provider now indicates that emails should be received with some delay.
EDIT 16:15 UTC: Email delivery should now be working fine again. Our provider's incident is over.
]]>EDIT 21:28 UTC: The issue has been found and fixed. We are monitoring the situation.
EDIT 21:40 UTC: Everything seems to be back to normal. The issue was happening for a couple of applications starting around 16:30 UTC. We will investigate further on why its configuration was out of sync during that time period.
]]>16:24:00 UTC: At first look, it seems that a network error is making us see that hypervisor as down. No information yet on if it's a hardware or software network issue.
16:28:00 UTC: The hypervisor seems to be back up again. We are making sure everything on it is responding well.
16:40:00 UTC: Everything has been checked and is responding correctly.
Impacts:
EDIT 14:55 UTC: The problem has been identified and fixed. Deployments should now be working for the last 10 minutes. Sorry for the inconvenience.
]]>EDIT: We stopped some components which were increasing the load of the cluster. It should be more stable now.
]]>19:40: The culprit is a switch that half stopped responding. It turns out it is not broken enough for its routes to be automatically removed. Our DC contractor is moving to physically remove the switch. ETA is 30 minutes.
20:00: Cellar seems to be up again. We are still watching and waiting for a direct confirmation from our DC contractor.
00:00: Everything is back to normal
]]>Applications will automatically be restarted once the maintenance is over.
EDIT 20:05 UTC: The maintenance is beginning
EDIT 20:28 UTC: The downtime was reduced to a few minutes but multiple network cuts may have happened. Applications linked to this service are currently redeploying.
]]>EDIT: Trying to repair database files.
EDIT: Database filesystem repaired.
EDIT 04/06: The MongoDB process has restarted. Some customers perform expensive queries on the MongoDB cluster, which can cause an OOM of the process.
EDIT 06/06 10:31:06 UTC: mongodb-c2 is still experiencing issues, we are working on it.
EDIT 06/06 11:24:00 UTC: Because of a replication recovery bug not fixed by MongoDB on pre-SSPL version, we are working on making databases back from the previous backups made overnight. Everything should be back on in the afternoon. Users can setup new dedicated database with the previous backups for faster recovery.
EDIT 06/06 13:45:00 UTC: The restore process has begun, it will take a few hours. We will keep you posted.
EDIT 06/06 15:01:00 UTC: We restored half of the customers. We are expecting full recovery in a few hours.
EDIT 06/06 17:01:00 UTC: An issue occurred while restoring the databases. We are investigating.
EDIT 06/06 23:00:00 UTC: We restored all the databases that were not above usage quota. The cluster is now running and we improved how we export connection data so applications will behave better when connecting.
Current state:
EDIT 20:37 UTC - fixed.
]]>EDIT 09:21 UTC: The issue should have been fixed. Your applications might need to be redeployed if the issue persists. We continue to monitor the service.
EDIT 13:11 UTC: We didn't see any other issues with the service, the issue is now resolved.
]]>EDIT 22:46 UTC: The hypervisor doesn't reboot, we continue our investigation.
EDIT 00:06 UTC: The hypervisor has been back online for a few minutes. All services are now available again. The cause of the extended downtime has been identified and will be fixed on similar hypervisors to allow a faster recovery next time.
]]>EDIT 21:04 UTC: Ingestion is now back to normal. Access logs will be processed over the next few hours.
]]>UPDATE: all applications have been redeployed
]]>UPDATE 14:57 UTC: Some add-ons are inaccessible due to a faulty proxy. We're removing it from the pool to mitigate.
UPDATE 14:59 UTC: Services are being reloaded to ensure the faulty proxy is removed from the pool.
UPDATE 15:10 UTC: Services are back online for redeployed apps. A faulty sentry induced an abnormal behaviour in the API.
CALL FOR ACTION 15:23 UTC: Remaining applications are currently redeployed. If you're impacted, we advise you to redeploy your app to accelerate the recovery process
]]>EDIT 14:59 UTC - We have identified the faulty component, which encounters an issue in the connection pooler.
EDIT 15:09 UTC - The deployments queue is being consumed and catching up. The issue is mitigated.
EDIT 15:23 UTC - Incident is fixed.
Root cause: we've found an issue in a messaging driver on a couple of isolated servers. We have removed this specific driver to fall back on an alternative messaging layer. In the coming days, we will dive into this specific bug and will communicate the fix upstream.
]]>EDIT 20:02 UTC: the MySQL shared cluster is back online.
]]>EDIT 21:39 UTC - querying logs is now available.
]]>EDIT 21:39 UTC - shared cluster is now back online
]]>Maintenance is expected to start in a few minutes
EDIT 17:56 UTC: Service is back online, you should now be able to SSH to your instances. Sorry for the inconvenience.
]]>EDIT 23:06 UTC - Storage cluster is now up. We are now catching up the accumulated ingestion lag. Query components will be restarted in a rolling fashion throughout the next 6 hours.
EDIT Sunday 11:27 UTC - Some query components are still reloading
EDIT Sunday 20:27 UTC - We are still experiencing issues on the query components.
EDIT Monday 07:20 UTC - Query is back online
]]>Because of this, those hypervisors became more empty than the others. More VMs were scheduled on them since they had more resources available, which then lead to more Monitoring/Unreachable events.
Instances weren't, for the most part, unreachable, but were redeployed anyway.
This should now be fixed. Sorry for the inconvenience
]]>The migration is expected to last at most 1 hour. All impacted applications will be redeployed during the migration. After the deployment, applications will be able to write to the bucket. Read operations will not be impacted.
Users of buckets that need to be migrated have received emails.
EDIT 2022-05-31 10:00 UTC: The migration is starting, buckets will be put into read-only.
EDIT 2022-05-31 10:25 UTC: The migration is over. Applications have started redeploying, it should take around 2 hours. You can redeploy your application earlier to finish the migration.
EDIT 2022-05-31 13:11 UTC: All applications have been redeployed, the migration is now over.
]]>EDIT 07:16 UTC - Indexes have been rebuilt. Query is now available.
]]>EDIT 17:12 UTC: The queue is still being consumed.
EDIT 17:27 UTC: The queue is now empty. All monitoring actions should now be working as expected.
]]>EDIT 10:50 UTC: Hypervisor is back online. Add-ons hosted on that hypervisor are currently available.
]]>EDIT 15:02 UTC - Indexes have been rebuilt. Query is now available.
]]>EDIT 09:20 UTC - Indexes have been rebuilt. Query is now available.
]]>The migration is expected to last at most 1 hour. All impacted applications will be redeployed during the migration. After the deployment, applications will be able to write to the bucket. Read operations will not be impacted.
Users of buckets that need to be migrated have received emails.
EDIT 24/05/2022 12:00 UTC+2: The migration will start soon. FSBuckets will be put into read-only for a couple of minutes so that all buckets are correctly synchronized.
EDIT 24/05/2022 12:03 UTC+2: FSBuckets are now in read-only mode.
EDIT 24/05/2022 12:39 UTC+2: Synchronization is over. Applications are being redeployed. If you wish to recover faster, you can trigger a deployment through the web Console or CLI. Deployments are expected to all be started within the next 30 minutes.
EDIT 24/05/2022 13:34 UTC+2: The migration is over, if you have any issues, please contact our support team
]]>EDIT 08:28 UTC - We are consuming the lag.
EDIT 08:28 UTC - Indexes are rebuilding.
EDIT 09:34 UTC - Indexes are rebuilt. Query is available.
EDIT 16:03 UTC - Fixed.
]]>EDIT 23:41 UTC - Issue has been identified and we are consuming the lag.
EDIT 07:28 UTC - Lag has been consumed.
EDIT 07:30 UTC - Fixed.
]]>23:55 UTC - The issue has been identified and we are consuming the lag.
00:19 UTC - lag has been consumed.
00:20 UTC - Fixed.
]]>EDIT 09:11 UTC - Metrics/AccessLogs are catching up their lag.
EDIT 16:34 UTC - Fixed.
]]>EDIT 09:06 UTC - The logs are catching up.
EDIT 11:15 UTC - Fixed.
]]>EDIT 08:00 UTC - We have identified ongoing issues.
EDIT 08:02 UTC - New deployments are currently disabled to reduce the impact on our infrastructures. We will reactivate them when the queued ones will be deployed.
EDIT 08:45 UTC - Deployments are still flaky, we are working to resolve the issues.
EDIT 09:08 UTC - The deployments queue is catching up. When it ends, we will redeploy a part of the PAR zone to ensure deployments and monitoring are consistent.
EDIT 09:25 UTC - The mentioned deployments are running.
EDIT 11:16 UTC - We are at about 75% of the deployments completed.
EDIT 12:06 UTC - Finished and fixed.
]]>EDIT 06:34 UTC - Our orchestrator is impacted and the deployments are experiencing issues.
EDIT 06:44 UTC - Core API is fixed.
EDIT 08:34 UTC - We are experiencing issues affecting the Console and CLI. We are investigating.
EDIT 08:45 UTC - Core API is fixed.
]]>EDIT 12:11 UTC: Metrics and logs are now accessible again. Sorry for the inconvenience.
]]>EDIT 10:56 UTC: Metrics and logs are now accessible again. Sorry for the inconvenience.
]]>EDIT: Ingestion fixed, query almost restored
]]>EDIT 13:20 UTC: The issue has been fixed. Some metrics data points have been lost. Access logs are being queued for ingestion again.
]]>EDIT 17:20 UTC: The service has been fully restored. Sorry for the inconvenience.
]]>EDIT 06:45 UTC: fixed.
EDIT 07:22 UTC: we have identified another issue.
EDIT 09:45 UTC: fixed.
]]>EDIT 15:12 UTC: This seems to be back to normal. We did not find the root cause but we keep looking. Some actions may have failed like deployments, git push or accessing the dashboard / using the CLI in general
EDIT 17:34 UTC: We still see some instabilities, resulting in various longer queries or even errors from some services that fail to contact our API. We are still working on identifying the root cause.
EDIT 20:34 UTC: We didn't see any more instabilities since the latest status update. We'll continue to monitor the activity in the next couple of days.
]]>EDIT 20:43 UTC: The delay has now resolved, you should now be able to query the access logs using the CLI or API.
]]>EDIT 15:05 UTC: fixed.
]]>EDIT 18:23 UTC - the SYD zone (provided by OVH) seems only reachable using the OVH network
EDIT 18:30 UTC - we are waiting for our provider's feedback
EDIT 19:00 UTC - fixed https://network.status-ovhcloud.com/incidents/j5vzf90dpzcc
]]>Edit 10:27 UTC: The delay is now resolved. Sorry for the inconvenience.
]]>This should now be resolved. The 7 other reverse proxies were working as usual.
]]>07:40: The reason has been found and has been fixed.
]]>The team has found the origin. We are working on a fix.
]]>Some hypervisors are experiencing issues with qemu. VMs are randomly crashing.
We are investigating.
On the 4th of April, some new deployments could not be completed by the CCOS (Clever Cloud Operating System) orchestrator.
A few days ago, we introduced a new notification subsystem, required to enable the Network Groups feature. This new subsystem caused hypervisor agents to initiate new connections to the messaging component.
An issue in the proxy layer, which did not properly close connections, led to connections stacking up until the pooler was saturated. This made agents accumulate too many processes on hypervisor machines for too long, preventing new processes from being spawned.
Our hypervisor controller struggled to spawn new threads, which led to new deployments being unable to complete. It also prevented the running virtual machines from spawning new threads, thus crashing some of those running VMs.
Network Groups being in ALPHA, we immediately decided to roll back their availability, pushing a non-blocking version which does not rely on our messaging layer.
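For illustration only, here is a minimal sketch of the kind of safeguard that prevents this connection stacking: bounding how many messaging connections an agent can hold at once and closing them deterministically. Names are hypothetical; this is not the actual agent code.

```
# Minimal sketch: cap concurrent messaging connections per agent and always close
# them, so a misbehaving proxy cannot make connections stack up without bound.
import threading
from contextlib import contextmanager

MAX_CONNECTIONS = 8                                   # illustrative limit
_slots = threading.BoundedSemaphore(MAX_CONNECTIONS)

@contextmanager
def messaging_connection(open_connection, timeout_s=2.0):
    """Acquire a slot (or fail fast), open a connection, and always close it."""
    if not _slots.acquire(timeout=timeout_s):
        raise RuntimeError("connection slots exhausted, failing fast")
    connection = open_connection()                    # hypothetical client factory
    try:
        yield connection
    finally:
        connection.close()                            # deterministic close, even on error
        _slots.release()

# Usage: with messaging_connection(factory) as conn: conn.send(event)
# where factory opens the real client; when all slots are taken, callers fail
# fast instead of piling up processes on the hypervisor.
```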
Two different actions are being rolled out.
EDIT 15:32 UTC: The team has found the origin. We are working on a fix.
EDIT 15:50 UTC: Reading is back, the situation is being mitigated.
EDIT 16:01 UTC: Cellar C2 is up and running.
]]>The problem has been identified; we are working to fix it.
EDIT 15:36 UTC: Certain metrics and access logs are still not accessible.
EDIT 18:50 UTC: Metrics and access logs are now accessible.
]]>We are trying to fix these performance issues.
]]>EDIT 18:43 UTC: fixed.
]]>Everything went well. Do not hesitate to reach us via support for any questions.
]]>EDIT 20:57UTC - creation is enabled.
]]>** UPDATE ** 2022-03-24 15:40 UTC website does not have HTTP errors anymore
]]>If you are missing some files that were on it, please contact support with all the information: add-on ID, bucket name, etc.
]]>As a reminder, the service will be shut down on 21/03/22. A few network brownouts will be applied to remind customers that they need to migrate their data.
A total of 5 brownouts will be applied. During these planned downtimes, the service will refuse any connections, be it HTTP or HTTPS.
This brownout will happen on 16/03/22 16:00 UTC for a 30-minute window.
Our support team stays at your disposal for any questions.
]]>Some services are also impacted:
EDIT 18:20 UTC: Our network provider is investigating the issue.
EDIT 18:28 UTC: The issue has been identified and has been escalated. Logs may also be impacted.
EDIT 18:44 UTC: The issue is still being worked on, but Pulsar and Logs are now working fine again.
EDIT 19:26 UTC: The issue has been fixed by the network provider at 18:54 UTC. All components are now working fine again. Access logs are being ingested and may have some lag for a few hours. Sorry for the inconvenience.
]]>EDIT 12:04 UTC: The lag in the ingestion pipeline has been resolved.
]]>As a reminder, the service will be shut down on 21/03/22. A few network brownouts will be applied to remind customers that they need to migrate their data.
A total of 5 brownouts will be applied. During these planned downtimes, the service will refuse any connections, be it HTTP or HTTPS.
This brownout will happen on 18/03/22 10:00 UTC for a 30-minute window.
Our support team stays at your disposal for any questions.
EDIT 11:00 UTC: The brownout has started and will last for 30 minutes.
EDIT 11:30 UTC: The brownout has ended. The service will be decommissioned next Monday.
]]>As a reminder, the service will be shut down on 21/03/22. A few network brownouts will be applied to remind customers that they need to migrate their data.
A total of 5 brownouts will be applied. During these planned downtimes, the service will refuse any connections, be it HTTP or HTTPS.
This brownout will happen on 14/03/22 09:30 UTC for a 30-minute window.
Our support team stays at your disposal for any questions.
EDIT 09:36 UTC: The brownout is starting. It will last for 30 minutes.
EDIT 10:07 UTC: The brownout has ended. Next one will happen on 16/03/22 16:00 UTC for a 30 minutes window.
]]>As a reminder, the service will be shut down on 21/03/22. A few network brownouts will be applied to remind customers that they need to migrate their data.
A total of 5 brownouts will be applied. During these planned downtimes, the service will refuse any connections, be it HTTP or HTTPS.
This brownout will happen on 11/03/22 14:00 UTC for a 10-minute window.
Our support team stays at your disposal for any questions.
EDIT 14:00 UTC: The brownout is starting and will last for 10 minutes.
EDIT 14:10 UTC: The brownout has ended. Next one will happen on 14/03/22 09:30 UTC for a 30 minutes window.
]]>As a reminder, the service will be shut down on 21/03/22. A few network brownouts will be applied to remind customers that they need to migrate their data.
A total of 5 brownouts will be applied. During these planned downtimes, the service will refuse any connections, be it HTTP or HTTPS.
This brownout will happen on 09/03/22 10:00 UTC for a 10-minute window.
Our support team stays at your disposal for any questions.
EDIT 10:00 UTC: The brownout has started.
EDIT 10:10 UTC: The brownout has ended. Next one will happen on 11/03/22 14:00 UTC
]]>Edit: Connectivity issues have been solved by our network provider. The service should run as expected.
]]>EDIT 10:27 UTC: Queries have returned to normal, Metrics and Access logs should now be reachable. We are monitoring the queries.
EDIT 11:03 UTC: Queries have returned to normal, Metrics and Access logs should now be reachable.
]]>Sorry for the inconvenience.
]]>cellar-c1.clvrcld.net or cellar.services.clever-cloud.com. We are investigating the issue.
EDIT 22:42 UTC: After a quick investigation, only one of the 3 IPs serving those domains is having trouble reaching other nodes of the cluster. This IP has been dropped from the DNS. Meanwhile, we are investigating the issue with our network provider.
EDIT 22:39 UTC: Lowering the severity to Performance Issues. A ticket has been opened with our network provider.
EDIT 23:15 UTC: The connectivity is now back since 23:07 UTC with our network provider saying that the issue has been resolved. We will wait a bit before adding back the IP of the faulty node in the DNS just to be sure but this incident is now closed on our end. Sorry for the inconvenience.
]]>EDIT 23:15 UTC: The connectivity is now back since 23:07 UTC with our network provider saying that the issue has been resolved. This incident is now closed on our end. Sorry for the inconvenience.
]]>EDIT 19:16 UTC+1: This does not impact renewal of certificates.
EDIT 19:36 UTC+1: We are now under the rate limit, newly added domains should have their certificates generated in a few minutes, as usual. Sorry for the inconvenience.
]]>The fail-over will be done in the upcoming hour.
EDIT 15:17 UTC: The cluster will fail over in the next few minutes. Some queries might fail as soon as the leader goes down and until your application correctly connects to the new leader.
EDIT 15:28 UTC: The fail-over has been done. Make sure to restart your applications if they can't connect to their add-on.
]]>EDIT 14:25 UTC: The issue has been fixed. A fix has been scheduled for deployment this afternoon which should reduce those delivery issues events. We will monitor the fix closely once it gets deployed.
]]>EDIT 15:07 UTC: Live logs and drains are back. Some drains logs may have been lost during the recovery process. Sorry for the inconvenience.
EDIT 15:52 UTC: Live logs and drains are down again, we are looking into it
]]>EDIT 15:52 UTC: Queries have returned to normal, Metrics and Access logs should now be reachable.
]]>Ingestion is now starting at full capacity again. There will be some delay before having up-to-date access logs but it should be good in a few hours. Sorry for the inconvenience.
]]>We are still identifying which ones are broken to restart them. If you see that your drains are broken, please contact the support so we can restart them!
Edit 15:11 — We restarted all drains to be sure.
Edit 16:27 — Most of the drains are still broken. We are trying to fix the issue by deleting and re-creating message queues in the logs infrastructure.
Edit 16:37 — Deleting and creating back everything seems to have cleaned up the situation. Drains seem to be working again!
]]>EDIT 20:32 UTC - fixed.
]]>EDIT 17:45 UTC: The incident is over, sorry for the inconvenience.
]]>EDIT 20:27 UTC: We identified the issue, and the resolution is ongoing.
EDIT 20:54 UTC : Fixed.
]]>20:39 UTC: The ingestion pipeline is back for now but the underlying issue is not properly fixed yet.
20:49 UTC: Theoretically, the problem is fixed. In any case, the ingestion pipeline is working at full speed. We are keeping an eye on things.
]]>We have dealt with most consequences of that downtime; we are still working on fixing an issue with the ingestion pipeline of Metrics and access logs. There will be some delay.
16:40 UTC: Everything is working as expected, delay will go back to normal soon.
]]>This impacts:
EDIT 17:30 UTC: Everything is back to normal. Sorry for the inconvenience.
]]>EDIT 21:58 UTC: Everything should be back to normal, sorry for the inconvenience.
]]>The migration will start at 19:00 UTC+1 and should apply instantly as soon as you refresh the console.
During the transition, you can directly contact us at supportmail@clever-cloud.com.
EDIT 20:44 UTC+1: The migration has ended, our new support tool is now ready to be used! Make sure to refresh the web console.
]]>17:26 UTC: The maintenance operation did not fix the issue. Deployments are completely disabled at the moment. We are investigating.
17:31 UTC: It was DNS (DNS reverse resolving was too slow when opening connections, which timed out). We are working on bringing everything back up.
17:52 UTC: Everything is back up. If you are experiencing an issue, please contact us.
]]>We do not have any details about this incident as of now.
]]>This is due to an issue with Slack. Slack is replying with 500 errors to our notifications even though it is clearly processing the messages just fine. Our notification system sends multiple retries after receiving failures, so you will receive multiple duplicates and your webhooks will probably be disabled automatically (as they are after too many repeated failures). We will re-enable them once the issue is fixed. If your webhook remains disabled, please contact us.
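To illustrate the behaviour described above (retry on failure, then automatic disabling after repeated failures), here is a minimal, hypothetical sketch; the thresholds and names are made up and do not reflect our actual notification code:

```
# Minimal sketch of retry-then-disable webhook delivery. Thresholds and names
# are hypothetical and do not reflect the actual notification system.
import time
import urllib.request

MAX_RETRIES = 3      # each failed delivery is retried; a receiver that errors
                     # but still processes the message will thus see duplicates
DISABLE_AFTER = 10   # consecutive failed deliveries before the webhook is disabled

def deliver(url: str, payload: bytes, consecutive_failures: int) -> int:
    """Attempt one delivery; return the updated consecutive-failure count."""
    for attempt in range(MAX_RETRIES):
        request = urllib.request.Request(
            url, data=payload, headers={"Content-Type": "application/json"}
        )
        try:
            with urllib.request.urlopen(request, timeout=5):
                return 0                 # success resets the failure counter
        except OSError:                  # covers HTTP 5xx, network errors, timeouts
            time.sleep(2 ** attempt)     # back off before the next retry
    consecutive_failures += 1
    if consecutive_failures >= DISABLE_AFTER:
        print(f"disabling webhook {url} after {consecutive_failures} failed deliveries")
    return consecutive_failures
```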
14:17 UTC: We have not received a single 500 error from Slack in 8 minutes. It looks like this may be fixed. Although a broader incident is still ongoing on Slack's end: https://status.slack.com/2021-12/a17eae991fdc437d
14:44 UTC: Webhooks disabled since 12:00 UTC have been re-enabled. Slack status says messaging/notifications part of the incident is resolved, we are not seeing any errors so this incident is now over. If you are experiencing an error or if your webhook has not been re-enabled, please contact us.
]]>13:40 UTC: Multiple servers in the same rack have gone down at the same time. It's most likely a network issue.
13:45 UTC: Our provider (OVHcloud) is aware of the issue. They will come back to us with more details later.
13:53 UTC: The hypervisor is back online. We are making sure everything is fine.
14:11 UTC: Everything is fine now, there was an issue with outgoing traffic from 13:53 until 14:08 UTC. This is now fixed.
Our provider tells us it was an issue with the cooling system. More info may be posted here: https://bare-metal-servers.status-ovhcloud.com/incidents/5cqtb0q9ht67
]]>EDIT 9h15 UTC : The ingestion pipeline is back to normal. No abnormal delay.
]]>This impacts:
EDIT 14:52 UTC: The queries are available again since 14:20 UTC. This incident is over.
]]>We will apply this option on all add-ons and restart them as an emergency maintenance. For single node add-ons, this will trigger a short downtime of minimum 1 minute (the approximate time it takes Elasticsearch to boot). For clustered add-ons, no downtime is to be expected as it will be a rolling restart.
Newly created add-ons are already patched.
The restart of all add-ons will start at 15:00 UTC. Sorry for the short notice. Feel free to contact our support if you have any questions.
EDIT 15:05 UTC: Add-ons restart is starting
EDIT 16:10 UTC: Add-ons have been restarted. The maintenance is over.
]]>The issue is now fixed.
]]>09:22 UTC: The issue is identified and fixed, logs ingestion should catch up. Logs should appear within a few minutes.
09:38 UTC: The issue is not actually fixed, there is something else blocking the pipeline. We are investigating.
09:55 UTC: The ingestion is working, there are a lot of older logs to be processed so it will take a while before you can see recent logs in real time.
13:07 UTC: The ingestion pipeline is back to normal. No abnormal delay.
]]>10:53 UTC: We are still investigating this issue. The culprit seems to be a peering node.
11:18 UTC: It seems to only affect a few routing paths between our infrastructure and some hosts of Scaleway and Azure. We are trying to narrow down the issue with their network teams.
13:05 UTC: We have seen improvements between Scaleway and our infrastructure since 11:26 UTC. We do not yet know if it's a temporary resolution and are awaiting more information from Scaleway's side.
13:36 UTC: Confirming that the issue between Scaleway and our infrastructure has been fixed. We are still awaiting some details from Scaleway to know if they are indeed the ones who changed their routing configuration to avoid the faulty peer.
15:10 UTC: Scaleway tells us they did not change anything on their end. Still, no issue to report on this side since 11:26 UTC. On the Azure side of things, it seems to be better, the issues we could reproduce earlier cannot be reproduced anymore but some hosts may still be affected. We are marking this as resolved but if you have any specific problems, please contact us so we can troubleshoot the issue more efficiently.
]]>Impacted users will shortly receive an email and can contact us on our technical support for any further questions.
EDIT 20:32 UTC+1: Add-ons migrations are starting
EDIT 21:31 UTC+1: Add-ons have been migrated. Add-ons that couldn't be migrated in the first place will be unavailable for up to one hour. We will announce the planned downtime tomorrow (02/12/2021)
EDIT 02/12/2021: The hypervisor will be rebooted on December 06, 2021 at 11:00 UTC+1. The expected downtime is less than 1 hour.
EDIT 06/12/2021 10:59 UTC+1: The hypervisor is going down at 11:00 UTC+1 as expected. Downtime should not be higher than 1 hour.
EDIT 06/12/2021 11:09 UTC+1: The hypervisor has been back up for 3 minutes, all services should be reachable again. We are making sure everything runs fine.
EDIT 06/12/2021 11:13 UTC+1: The maintenance is over.
]]>12:21 UTC: Incident is resolved (there may be some lag for a few minutes)
]]>PHP versions from 7.0 to 7.2 have reached end of life and will no longer receive security updates, leaving them exposed to unpatched vulnerabilities. You can find the list of end-of-life versions here: https://www.php.net/eol.php.
Affected customers will be e-mailed about this change and can contact our support team for any additional questions.
]]>EDIT 13:05 UTC: fixed.
]]>EDIT 11:00 UTC: A node from the cluster failed to reboot and was stuck in a failed state. We are rebuilding this node. It will take 2 to 3 hours. No data will be lost.
]]>11:45 - The Logs API stopped crashing. We don't know why yet and are continuing to investigate so we can fix this for the long term.
]]>At 11:12 UTC today, the queue was emptied, so webhooks matching events from before that time have not been and will not be sent out. Webhooks for events from 11:12 to 12:25 UTC were all sent at once, and everything has been back to normal since then.
]]>The reverse proxy has been rebooted and this incident is now over.
]]>Impacted users will shortly receive an email and can contact us for any further questions.
EDIT 19/10 18:35 UTC: Migration of add-ons has started
]]>10:40 UTC+2: The issue is resolved.
]]>EDIT 15:07 UTC: The problem has been identified and fixed. Queries should now be back, current data lag is 1 hour and 30 minutes. It should quickly come down in the next hour.
EDIT 17:58 UTC: Ingestion lag is now resolved
]]>https://twitter.com/ovh_status/status/1448185498812485633?s=20
The website travaux.ovh.com is unreachable, preventing us from getting a status update on the maintenance, which was expected to have "No impact".
09:55 UTC+2: We still have no update from OVH.
10:01 UTC+2: https://twitter.com/olesovhcom/status/1448196879020433409?s=20
10:20 UTC+2: Our Montreal zone is reachable, others zones might come back soon.
All our zones are now reachable. You might still experience DNS or other issues due to the OVH incident itself.
]]>12:58 UTC: The issue is resolved.
Here is what we know so far:
The revocation server of the Certification Authority providing this certificate says that this certificate was revoked on 2021-06-23, yet it was still accepted just fine a few hours ago.
We have asked for a reissue of the certificate (this is an automatic operation). The reissued certificate has been installed and is working fine. Meanwhile, we have asked the CA about this revocation without any warning or notice and are waiting for an answer.
]]>We are investigating the issue.
EDIT 10:00 UTC: The issue has been fixed, pushes using the HTTP protocol should now be working as intended. Pushes and clones using the SSH protocol were not impacted. We'll investigate the issue further.
]]>EDIT 18:45 UTC: CC_JAVA_VERSION should now be fixed with the right value. Impacted applications are redeploying to make sure they use the right version.
EDIT 18:58 UTC: If you changed the value of CC_JAVA_VERSION between 09:30 UTC and 18:45 UTC, the value might have been replaced with its previous version. Make sure you set it back to the right version if needed. Sorry for the inconvenience.
]]>EDIT 14:57 UTC: The problem has been fixed, the documentation should now be fully accessible at https://www.clever-cloud.com/doc/
]]>Our Let's Encrypt certificates already provide the up-to-date Let's Encrypt chain, but some older clients might not be able to trust that new chain because they don't have the new root Certificate Authority in their truststore. If you are in this situation with clients you can't update, we can sell certificates that will be trusted by those older clients. You can contact us on the support with the domains you need to protect.
You can also find more information about this expiration on Let's Encrypt website: https://letsencrypt.org/docs/dst-root-ca-x3-expiration-september-2021/
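A simple way to see whether a given trust store still verifies your domain is to attempt a TLS handshake with it. This is a minimal sketch using the Python standard library; replace the host with one of your own domains, and pass an older CA bundle to reproduce the behaviour of outdated clients.

```python
import socket
import ssl

def can_verify(host, port=443, cafile=None):
    # cafile=None uses the system trust store; an outdated bundle that lacks
    # the new ISRG root will make verification fail, like an old client would.
    context = ssl.create_default_context(cafile=cafile)
    try:
        with socket.create_connection((host, port), timeout=5) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return True
    except ssl.SSLCertVerificationError:
        return False

print(can_verify("letsencrypt.org"))
```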
]]>EDIT 14:57 UTC: A fix has been pushed, the errors should be resolved. We continue to monitor the situation.
EDIT 15:19 UTC: No more Internal server errors are happening, this incident is now closed.
]]>EDIT 06:43 UTC: queries should be back to normal, the ingestion lag should take a few minutes to be consumed.
EDIT 11:12 UTC: Everything is back to normal
]]>Edit: New hypervisors were added but they had no support for fsbuckets yet.
]]>EDIT 14:35 UTC: The root cause has been identified, the ingestion lag currently sits at around 2 hours so metrics queries will be out of sync for the time being. Access logs are not ingesting and are currently kept in a separate queue. We expect the lag to start decreasing later tonight. This incident is a follow-up to the urgent maintenance of yesterday which mainly aimed at better stabilizing the cluster.
EDIT 23:34 UTC: Metrics have been fully ingested, access logs are still delayed but they are currently being written. Queries might still be slow, this is expected.
EDIT 6:30 UTC: The situation is back to normal.
]]>EDIT 16:17 UTC: the maintenance is still ongoing. Reads and writes have been disabled since 15:42; this is expected.
EDIT 21:50 UTC: the maintenance is finished. Ingestion is catching up.
]]>EDIT 15:14 UTC: fixed, the related drains are currently catching up.
]]>The timeouts lasted for about 2 minutes before the proxy was taken out of the pool.
Some requests might have failed during the first minute, and then all requests handled during the remaining minute failed. Additional investigation will be performed to analyze what happened.
]]>Update: NPM Registry posted on their status page confirming the incident and are working on a fix: https://status.npmjs.org/incidents/bydjtj102gsn
Update: The issue has now been fixed, node deployments are back to normal.
]]>Affected customers have been e-mailed about this and can contact our support team for any additional questions.
EDIT 21:05 UTC+2: Update is beginning.
EDIT 22:00 UTC+2: Updates are over and were successful for most of the add-ons. Owners of add-ons that couldn't be updated will be contacted. If you encounter any issue following this update, please reach out to our support team.
]]>We will be switching back to the original server (which has been fixed by the manufacturer). The server should be down for 10 minutes if our provider does not encounter any issues (may last up to an hour otherwise).
Affected customers have been e-mailed about this and can migrate their add-ons automatically beforehand.
2021-08-25 19:02 UTC: Server is going down.
19:17 UTC: This is taking longer than expected. Server management software decided to reapply firmware settings; this takes a few minutes.
19:24 UTC: Server is up. Add-ons are starting up.
19:26 UTC: Everything is up. Incident is over.
]]>You can follow their incident here: https://status.mailgun.com/incidents/jj6fx7nqwn9t
21:19 UTC: Incident is resolved.
]]>The root cause has been identified. Unfortunately, the higher load also triggered a lot of redeployments with the Monitoring/Unreachable cause. Most of them were cancelled in time but some of them went through. Some of the deployments that started did not finish correctly and ended up in a blocked state. Those deployments are currently being cancelled and all cancellations should be over in a few minutes.
We have developed a fix that will prevent those events from happening again and it will be deployed in the next few hours.
]]>The root cause has been identified and the issue has been fixed. Unfortunately, the higher load also triggered a lot of redeployments with the Monitoring/Unreachable cause. Most of them were cancelled in time but some of them went through. Some of the deployments that started did not finish correctly and ended up in a blocked state. Those deployments are currently being cancelled and all cancellations should be over in a few minutes.
We will investigate this increased CPU load in more depth and see how we can better prevent it.
]]>We do not have any more information at the moment (though it is most likely a routing issue). Everything is working fine now except for Metrics (and accesslogs) which will come back in a few minutes.
EDIT 19:33 UTC: It happened again at 19:29 UTC. We are awaiting more information from our network provider.
EDIT 19:42 UTC: It happened again at 19:41 UTC.
--
This was due to a maintenance on one of the fiber optic channels between our two Paris datacenters. Our network provider was not made aware of this maintenance which caused the connection to switch back and forth between links when a link went on and off again.
]]>EDIT 9:22 UTC - fixed.
]]>EDIT 13:42 UTC: The ingestion stopped again, we continue looking into it.
EDIT 14:05 UTC: We continue to investigate the issue. If you need to access the logs of your application, you can SSH to the VM and display them: https://www.clever-cloud.com/doc/reference/clever-tools/ssh-access/#show-your-applications-logs
EDIT 14:30 UTC: Part of the ingestion queue could not be consumed and has been lost. The rest of the queue is still being consumed, so up-to-date logs are still delayed
EDIT 17:15 UTC: The queue has been fully consumed and the logs are now up-to-date.
]]>EDIT 12:35 UTC: Logs are back, query should now work again and logs drains should have been sent to their endpoints. No logs have been lost.
]]>EDIT 14:45 UTC: The server won't reboot as of now, we are not yet sure of the reason. We continue to look into it. In the meantime, you can create a new add-on and import last night's backup. Please contact our support team for any further assistance
EDIT 14:58 UTC: The server still won't reboot, we continue to investigate the reason.
EDIT 15:08 UTC: A ticket has been opened to the manufacturer. The server is still unreachable as of now.
EDIT 15:12 UTC: A server replacement is currently being discussed. In the meantime, we advise you to import last night's backup into a new add-on. If the hypervisor ever comes back, you will be able to access your old add-on and possibly access the data between last night's backup and now, allowing you to merge them if possible. Current ETA is 24 hours.
EDIT 16:38 UTC: No server replacement will happen, we'll have more information to share tomorrow once the manufacturer gets back to us.
EDIT 16:54 UTC: Clarification: No server replacement will happen tonight. There are no signs of disk / data corruption; it seems to be only a hardware problem, which we can't fix right now.
EDIT 29/06/21 09:30 UTC: A maintenance on the server should happen in the next few minutes. The goal is to replace the problematic hardware piece. More information to come.
EDIT 13:17 UTC: The maintenance has been performed and a hardware piece has been changed but it didn't fix the issue. We continue investigating.
EDIT 13:26 UTC: The initial hardware replacement was the network card. Another replacement, this time the motherboard, has been planned for tomorrow. We do not yet have the exact time.
EDIT 30/06/21 11:09 UTC: The motherboard has been changed, additional checks are being performed.
EDIT 13:03 UTC: The motherboard replacement did not improve the situation. The server reboots fine without the network card, which has already been changed. A full server replacement is being considered by the manufacturer.
EDIT 18:23 UTC: Our infrastructure provider has been able to provide us with a temporary replacement server which is now up and running. Add-ons and custom services are all up and running. Do note that this is a temporary replacement, once the manufacturer gives us back the fixed server or a fully working permanent replacement, we will have to switch to it (meaning a shutdown of a few minutes). Affected customers will be e-mailed about this.
]]>We are currently having connectivity issue or high latency to some part of our Paris infrastructure. Our network provider is aware of the issue and is currently investigating.
10:03 UTC: It seems like the issue is only affecting one of the datacenters. Applications that use services deployed in another datacenter might suffer from connectivity issues or increased latency
10:15 UTC: We are removing the IPs of the affected datacenter from all DNS records of load balancers (public, internal and Clever Cloud Premium customers) and are awaiting more info from our network provider.
10:19 UTC: Packet loss and latency have been going down from 10:12 UTC and it seems to be back to normal now. We are awaiting confirmation of the actual resolution of the incident.
10:23 UTC: We are working on resolving issues caused by this network instability and making sure everything works fine.
10:25 UTC: Logs ingestion is fixed. We are working on bringing back Clever Cloud Metrics.
10:31 UTC: IPs removed from DNS records at 10:15 UTC will be added back once we have confirmation that the network issue is definitely fixed.
10:41 UTC: Full loss of connectivity between the two Paris datacenters for a few seconds around 10:39 UTC. We are still experiencing packet loss now. Our network provider is working with the affected peering network on this issue.
10:45 UTC: The two Paris datacenters are unreachable depending on your own network provider.
10:49 UTC: Network is overall very flaky. Our network provider and peering network provider are still investigating.
10:57 UTC: According to our network provider, many optical fibers in Paris are deteriorated. Some interconnection equipment might be flooded. We are waiting for more information.
11:02 UTC: (Network and infrastructure inside each datacenter are safe. The issue is clearly happening outside the datacenters.)
11:13 UTC: Network is still flaky. Overall very slow. We are still waiting for a status update from our network and peering providers.
11:20 UTC: Network seems better towards one of the datacenters. We invite you to remove all IPs starting by "46.252.181" from your DNS.
11:42 UTC: Still waiting for information from our network providers. Still no ETA.
12:16 UTC: Network loss between the datacenters has lowered a bit. Console should be more accessible.
12:21 UTC: Connections are starting to come back UP. We are still watching and waiting for more information from our network providers.
12:30 UTC: Info from provider: over the 4 optical fibers, 1 is "fine". They cannot promise this one will stay fine. They are still working on it. Teams have been dispatched on the premises.
13:15 UTC: Network is still stable. We are keeping Metrics down for now as it uses a significant amount of bandwidth between datacenters.
13:48 UTC: A second optical fiber is back UP. According to our provider, "it should be fine, now". The other two fibers are still down. The on-site teams are analysing the situation.
13:41 UTC: You can now add back these IPs to your domains:
@ 10800 IN A 46.252.181.103
@ 10800 IN A 46.252.181.104
15:35 UTC: We are bringing Clever Cloud Metrics back up. It's now ingesting accumulated data in the queue while the storage backend was down.
16:45 UTC: Clever Cloud Metrics ingestion delay is back to normal.
17:16 UTC: The situation is currently stable but may deteriorate again. We are closely monitoring it. A postmortem will be published in the following days. If the issue comes back, this incident will be updated again. Sorry for the inconvenience.
17:31 UTC: A 30-second network interruption happened between 17:22:42 and 17:23:10; it was an isolated maintenance event performed by the datacenter's network provider.
07:01 UTC: This incident has been set to fixed as everything has been working fine, as expected, since the second optical fiber link was restored, except for the interruption mentioned in the previous update. Do note that as of now we are not at the normal redundancy level as the other two optical fiber links are still down. We will update this once we have more information.
10:23 UTC: We have confirmation that a non-redundant third optical fiber link has been added at 00:30 UTC, this is only meant to add bandwidth capacity, it does not solve the redundancy issue. However, our network provider also tells us that their monitoring shows that the redundant link just came back up; although this may be temporary and the link may not be using the usual optical path.
16:13 UTC: The redundant link that came back at 10:23 UTC is stable. It may be re-routed to use another physical path at some point but we can now consider that our inter-datacenter connectivity is indeed redundant again.
]]>This was due to an update that has now been rolled back.
]]>(The original incident text can be found at the end)
A network issue caused 17 minutes of full unreachability of the Paris zone which in turn caused some applications to go down and our deployment system to slow down while restarting affected applications as well as several other services.
10:12 UTC: The whole PAR network is unreachable from outside, cross-datacenter network is down as well.
10:16 UTC: The on-call team is warned by an external monitoring system.
10:21 UTC: Our network provider informs us that they are aware of the issue.
10:29 UTC: The network is back.
10:30 UTC: The monitoring systems are starting to queue a lot of deployments. The load of one monitoring system in charge of one of the PAR datacenters increases significantly. Other systems such as Logs, Metrics, and Access Logs (collection and query) are also impacted and unavailable. Some applications relying on FSBucket services (mostly PHP applications) are also having communication issues with their FSBuckets. This might have made some applications unreachable and their I/O very high, sometimes leading to Monitoring/Scaling deployments. This particular issue was detected later during the incident.
10:35 UTC: Our network provider confirms to us that the issue is fixed.
10:50 UTC: Deployments are slow to start because many of them are in queue.
11:00 UTC: The load of the faulty monitoring system being too high causes it to see more applications down than there actually are, and to queue even more deployments for applications that were actually reachable.
11:15 UTC: Clever Cloud Metrics is back, delayed data points have been ingested. Writing to the ingestion queue is still subject to problems.
11:20 UTC: We notice the build cache management system is overloaded, slowing down deployments and failing those that rely on the build cache feature. The retrying of these failed deployments adds even more items to the deployment queue.
11:28 UTC: We start upscaling the build cache management system beyond its original maximum setting.
11:52 UTC: We believe an issue found in the past few days within the build cache management system is responsible for the slowness/unreachability of the build cache service. This issue caused a thread leak which had been triggering more upscalings than usual. A fix was being tested on our testing environment but was not yet validated. We try to push this fix to production.
12:48 UTC: The fix pushed to production at 11:52 UTC is not effective. We upscale the build cache management system again.
13:00 UTC: Logs collection is back. Logs collected before this time were lost. Queries are also available but might still fail sometimes or return delayed logs.
13:05 UTC: We prevent the overloaded monitoring system from queuing up more deployments and empty out its internal alerting queue.
13:10 UTC: We rollback a change made on the database a few days ago, which we believe is the root cause of the ongoing issue.
13:16 UTC: The build cache management system database load starts to go up. This is caused by the application being more effective at making requests to the database thanks to the previous rollback.
13:18 UTC: The build cache management system database is overloaded.
13:33 UTC: We start looking into optimizing requests and clearing up stale data.
13:59 UTC: We manage to bring the build cache management system database load down.
14:05 UTC: The build cache management system is still overloaded/slow despite its database now working properly. A deployment is queued with an environment config change but is slow to start. We restart the application manually to apply this change.
14:10 UTC: The change of configuration is effective, the deployment queue starts to empty itself but there are still a lot of deployments in the queue.
14:15 UTC: An older deployment, performed without the environment change (which was still waiting to be processed), finishes successfully, leading to about half of the build cache requests failing.
14:17 UTC: We start reapplying the fix manually on live instances while a new deployment with the correct environment is started. The deployment queue size is going down.
14:29 UTC: The deployment queue is filling up again.
14:53 UTC: We realize the faulty monitoring system is still queuing deployments despite its alerting queue being empty and the alerting action being disabled.
14:57 UTC: We completely restart the faulty monitoring system and make sure it stops queuing deployments.
15:10 UTC: We are now certain the previously faulty monitoring system stopped queuing deployments for false positives. The deployment queue is back to normal and the deployment system is more reactive.
15:15 UTC: We start cleaning stuck deployments and making sure everything is working fine.
15:42 UTC: We start redeploying all Paris PHP applications which have not been deployed since the network came back.
16:00 UTC: Some PHP deployments seem to be failing due to a connection timeout to their PHP session stored on an FSBucket. We abort the PHP deployment queue to avoid any more errors.
16:10 UTC: The connection was only broken on one hypervisor and is now fixed. We also make sure every other hypervisor can contact all FSBucket servers on the PAR zone.
16:15 UTC: The PHP deployments queue is started again, with a lower delay between deployments.
16:42 UTC: Clever Cloud Metrics / Access logs ingestion is now fixed. Queries should be returning up-to-date data. Access logs were stored in a different queue and have been entirely consumed.
17:05 UTC: The PHP deployments queue is now completed. All other applications in the PAR zone, which had not been redeployed since the network came back, have also been queued for redeployment to fix any connection issue to their FSBucket add-ons.
19:10 UTC: A few applications which have the “deployment with downtime” option enabled were supposed to be UP but had no running instances. Those applications are now being redeployed.
Foreword: Clever Cloud has servers in two datacenters in the Paris zone (PAR). In this post-mortem, they are named PAR4 and PAR5.
A routine maintenance operation made by our Network Provider on PAR4 started a few minutes before the incident. This maintenance was about decommissioning a router that shouldn’t impact the network. Various checks and monitoring were in place, as usual, and a quick rollback procedure was planned in case anything went wrong.
The decommission triggered an unexpected election of another router, which then triggered a lot of LSA (link-state advertisement) updates between all the routers of the datacenter, sometimes doubling them. Those updates created new LSA rules on other routers, which first made them slower to update and to route traffic. Some of the routers then hit a configuration limit on the number of LSA rules. When hitting the limit, a router went into protection mode and shut itself down. This shutdown triggered further LSA updates on other routers, which then also hit their LSA limit and entered protection mode. This isolated the PAR4 site from the network.
A piece of internal equipment with a link between PAR4 and PAR5 also propagated those LSA updates to the PAR5 routers, replicating the exact same scenario.
To fix this, our Network Provider disconnected some routers, lowering the number of LSA announcements across the network and bringing the routers back online.
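To make the cascade easier to follow, here is a deliberately simplified toy model of it in Python (purely illustrative, not vendor router configuration, and the numbers are made up): every router accepts a bounded number of LSAs, a router over its limit shuts itself down, and each shutdown floods more LSAs onto the surviving routers.

```python
MAX_LSA = 100  # assumed per-router limit

class Router:
    def __init__(self, name, lsa_count):
        self.name = name
        self.lsa_count = lsa_count
        self.up = True

def propagate(routers, extra_lsa_per_shutdown=30):
    changed = True
    while changed:
        changed = False
        for router in routers:
            if router.up and router.lsa_count > MAX_LSA:
                router.up = False  # protection mode: the router shuts itself down
                changed = True
                for other in routers:  # the topology change floods more LSAs
                    if other.up:
                        other.lsa_count += extra_lsa_per_shutdown

routers = [Router(f"r{i}", 90) for i in range(4)]
routers[0].lsa_count = 120  # the unexpected election pushes one router over the limit
propagate(routers)
print([(r.name, r.up) for r in routers])  # every router ends up down
```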
We are currently experiencing a network accessibility issue on our PAR zone. We are investigating.
EDIT 12:21 UTC+2: Our network provider is looking into the issue.
EDIT 12:28 UTC+2: Deployments on other zones might not work correctly. But traffic shouldn't be impacted.
EDIT 12:30 UTC+2: Network connectivity seems to be back. We are awaiting confirmation of incident resolution from our network provider.
EDIT 12:35 UTC+2: Our network provider found the issue and fixed it. Network is back online since 12:30 UTC+2. Investigation will be conducted to understand why the secondary link hasn't been used.
EDIT 12:42 UTC+2: A postmortem will be made available later once everything has been figured out.
EDIT 12:50 UTC+2: The deployment queue is currently processing, queued deployments might take a few minutes to start
EDIT 13:00 UTC+2: Logs may also be unavailable depending on the applications
EDIT 13:20 UTC+2: Deployment queue still has a lot of items; the build cache feature is currently having trouble, which slows down deployments.
EDIT 14:33 UTC+2: The deployment queue is now shorter but some deployments still have issues. Logs are also partially available
EDIT 15:30 UTC+2: The build cache feature still has issues, we are currently working on a workaround. Logs should now be back but there is a processing delay which might affect availability in the Console / CLI. They might be a few minutes late.
EDIT 16:04 UTC+2: Some applications linked to FSBuckets systems might have lost their connection to the FSBucket, increasing their I/O and possibly rebooting in a loop for either Monitoring/Unreachable or Monitoring/Scalability. This can cause response timeouts, especially for PHP applications
EDIT 16:16 UTC+2: Build cache should be fixed, meaning that deployments should take less time
EDIT 16:53 UTC+2: There are still a lot of Monitoring/Unreachable events being sent, making a lot of applications redeploy for no good reason. We are still working on it.
EDIT 17:18 UTC+2: The issue with Monitoring/Unreachable events has been fixed. The size of the deployments queue should go down.
EDIT 18:07 UTC+2: Most issues have been cleared up. PHP applications may still be experiencing issues, we are working on it. If you are experiencing issues on non-PHP applications, please contact us.
EDIT 19:05 UTC+2: All PHP applications have been redeployed. If you are still experiencing issues, please contact us. All other applications which have not already been redeployed since the beginning of the incident will be redeployed in the next few hours (to make sure no apps are stuck in a weird state).
]]>EDIT 16:58 UTC: we mitigated the issues.
]]>EDIT 18:04 UTC+2: One of the reverse proxies stopped accepting new connections. It has been taken out of the pool for further investigation. Stability should have been restored for the last 2 minutes.
EDIT 18:18 UTC+2: Performance is back to normal. We are going to investigate further why this reverse proxy went into this state without being noticed.
]]>The maintenance itself should take no more than an hour. During this time, writes will be queued and reads will be partially available.
Once the maintenance is over, queued-up writes will start being ingested, reads will be available again (except for recent data until queued-up data points are ingested).
11:36 UTC: Maintenance is starting.
12:04 UTC: Maintenance is over. The ingestion pipeline is running at full speed catching up on the queued-up data.
12:18 UTC: Ingestion is caught up.
]]>As a reminder, this cluster is only used by free plans labeled "DEV". This is meant to be used for development and testing purposes only, not production.
If you are using a free plan in production, we suggest you migrate to a dedicated plan using the migration tool in the Clever Cloud console.
10:43 UTC: The cluster is working fine now although it may be slower than usual for now as a node is out of the cluster and will be re-added later.
12:23 UTC: The node mentioned in the last update has been re-added. The incident is over.
]]>EDIT 19:14 UTC: Logs should now be back to normal. Sorry for the interruption.
]]>We are in the process of adding capacity to resolve this issue.
14:28 UTC: Performance is back to normal.
]]>EDIT 12:46 UTC: we are experiencing abnormal new connection rates on public reverse proxies.
EDIT 12:50 UTC: we found the responsible application for this new connection rate and are mitigating it.
EDIT 14:19 UTC: Load balancers have been upscaled so they can handle more traffic. Performance is back to normal since 13:12 UTC.
]]>08:00 UTC: New logs are being ingested. Logs emitted during the incident will not be ingested in the main logs storage system. Log drains may start receiving (part of) the older logs, we are still investigating this part.
08:15 UTC: Looks like everything that could be ingested has been ingested. Ingestion delay may still be a little higher than normal though, it should go back to normal soon.
]]>Cellar-c2 cluster isn't impacted.
EDIT 08:23 UTC: Connection seems to be back, we have notified both network providers used for Cellar-c1 and are still awaiting an answer. We are waiting a bit more to see if the links are correctly back or if we should expect another issue.
EDIT 08:47 UTC: The connection is now down again.
EDIT 09:35 UTC: The connection has been back up for 15 minutes and the root cause may have been found. We are waiting for explanations from our network provider. In the meantime, this issue may also have affected applications that are connecting to external services. We've seen loss to Scaleway and Azure, there might have been more.
EDIT 10:25 UTC: The issue now seems to be resolved. The root cause wasn't entirely found; current investigations show that a transit provider had an issue and traffic was redirected elsewhere, maybe leading to saturation of some links (which would explain why the loss wasn't 100%, but more like 80%).
]]>EDIT: The issue has been fixed
]]>Data is still being ingested.
09:45 UTC: Incident is over.
]]>14:52 UTC: Network issue is resolved. We are assessing the damage.
15:07 UTC: API and deployments are down. We are cleaning everything and bringing it up.
15:20 UTC: API is back. Deployments are back but have a significant delay as of now.
15:42 UTC: We are still working on this. Deployments are quicker now but not yet back to normal.
16:02 UTC: This incident is over. If you are still experiencing issues, please contact us.
A maintenance operation carried out by our network provider a few hours before this incident generated a faulty BGP announcement. Because of this, a significant portion of traffic coming out of our Paris infrastructure was going out via a NYC peer, causing significant delay and even timeouts.
Routers in one of our Paris datacenters were heavily impacted by this issue and failed to accept configuration fixes. After multiple attempts to fix this, our provider ended up power-cycling the affected routers, which caused most of our hypervisors in this datacenter to be cut off from the rest of the network for 3 minutes.
Corrective actions will be taken to prevent this from happening again (BGP filters, dedicated admin network for the routers which was already scheduled to be set up in a few days). We will also make sure that we are warned in due time if a significant network configuration/hardware issue occurs.
]]>Affected clusters are:
This update may affect the performance and availability of the databases.
The upgrade will start in a few minutes. This maintenance will be updated accordingly
EDIT 18:28 UTC+2: Montreal cluster is now up-to-date
EDIT 19:54 UTC+2: Paris cluster is now up-to-date but postgis extension is currently broken due to the update. We are working on a fix
EDIT 20:27 UTC+2: Paris cluster: databases are currently being migrated to a newer version of postgis. It will take a few hours to run on all of the databases
EDIT 20:42 UTC+2: This maintenance is now considered as over
]]>Add-ons will start being migrated at 20:30 UTC+2. Hypervisor will be rebooted at 21:30 UTC+2
EDIT 20:36 UTC+2: Maintenance is starting. Applications are getting redeployed and add-ons are starting their migrations
EDIT 21:30 UTC+2: Add-ons that could be migrated have been migrated, applications have been redeployed. Server will now reboot
EDIT 22:00 UTC+2: Server has finished its reboot, add-ons that weren't migrated should have been reachable since 21:45 UTC+2. The maintenance is over.
]]>Affected applications are being automatically redeployed. Affected addons are unreachable.
21:53 UTC: The hypervisor is back online and is starting addon VMs.
21:55 UTC: All addons are back online. The incident is over.
]]>06:30 UTC: Incident is over.
]]>EDIT 23:02 UTC: the incident is related to one of our hypervisors.
EDIT 23:03 UTC: we restarted the hypervisor; related databases are down.
EDIT 23:04 UTC: hypervisor is up; VMs are starting.
EDIT 23:13 UTC: metrics are down too.
EDIT 23:25 UTC: databases are up. We are now experiencing issues with our internal reverse proxies; the console and API are not available.
EDIT 23:30 UTC: we queued the linked applications for a high-priority redeploy to ensure they reconnect to their databases. Core services are still partially down.
EDIT 0:00 UTC: all applications are redeployed.
EDIT 02:56 UTC: we are still working to fix issues on our internal core services (console, API); users applications/addons are not impacted.
EDIT 03:30 UTC: internal core services are back!
]]>After investigation, the hadoop namenodes were all in standby. At 23:33, after various checks, we promoted one back to active. We then restarted all the hbase regionservers and waited for the cluster to balance and heal.
At 00:04 we restarted the warp10 stores. At 00:07 everything was back to normal.
]]>16:13 - Rollback was successfully executed and everything is back to normal.
]]>17:33 UTC: The issue has been resolved. It was due to a partial upgrade (in progress) of the cluster. Upgraded nodes have been downgraded.
18:08 UTC: The upgrade was in progress to fix the security issue labelled as CVE-2021-20288. Due to the large number of machines, some of them were not yet up to date, which led to the issue we were facing. Some of the machines were unable to authenticate correctly, leading to a cascading failure of multiple machines that weren't yet patched. Another strategy will be used to continue the upgrade of the cluster.
]]>Edit 22:48 UTC: The deployments should be fine since 22:30, we just made sure that everything was okay. Deployments that were stuck were restarted, and those that failed can now be restarted without any issue. Sorry for any inconvenience.
]]>EDIT 14:37 UTC - fixed.
]]>The error started at 12:23:36 UTC and stopped at 12:46:50 UTC, lasting around 23 minutes.
]]>EDIT 13:21 UTC - fixed.
]]>EDIT 17:03 UTC - fixed.
]]>We'll update this incident in the morning. If OVH fixes the issue before then, the liar proxy should recover network access.
EDIT 11:06 UTC+1: This service is in SBG1, which is currently impacted by the fire that took place in SBG. It may take several days to come back online, depending on whether new servers can be ordered at OVH. If you are a user of this service, please contact us on the support if you have any questions.
]]>The culprit was a badly configured NOFILE limit on the RBX reverse proxies. We updated the setting accordingly.
Afterwards: We investigated all the reverse proxies on all the zones to make sure the NOFILE limit was correctly configured everywhere. We updated the reverse proxy software (sozu) to refuse to start when given too few NOFILE. We updated the sozu package to enforce the right NOFILE value upon installation.
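The "refuse to start when given too few NOFILE" guard can be illustrated with a few lines of Python (the real check lives in sozu, which is written in Rust; the threshold below is an assumption, not sozu's actual value):

```python
import resource
import sys

REQUIRED_NOFILE = 65536  # hypothetical minimum for a busy reverse proxy

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
if soft < REQUIRED_NOFILE and hard >= REQUIRED_NOFILE:
    # Raise the soft limit up to what we need if the hard limit allows it.
    resource.setrlimit(resource.RLIMIT_NOFILE, (REQUIRED_NOFILE, hard))
    soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)

if soft < REQUIRED_NOFILE:
    # Failing fast at startup beats failing under load with "too many open files".
    sys.exit(f"NOFILE soft limit {soft} is below {REQUIRED_NOFILE}; refusing to start")
```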
]]>The service is completely unavailable at the moment. We are working on it.
08:50 UTC: The faulty component is working. We are working on bringing everything back up.
08:59 UTC: Everything is back up. The ingestion pipeline is catching up.
09:07 UTC: The incident is over.
]]>EDIT 21:07 - fixed.
]]>The migration is expected to last at most 1 hour. All impacted applications will be redeployed during the migration. After the deployment, applications will be able to write to their bucket again. Read operations will not be impacted.
EDIT: This maintenance has been postponed to 15:00 UTC+1
EDIT 15:00 UTC+1: The maintenance is starting
EDIT 15:02 UTC+1: The buckets are now read-only
EDIT 15:14 UTC+1: Starting now, you can redeploy your applications if you want to regain write access early. Otherwise, affected applications will be redeployed automatically in the upcoming hour, starting with applications of Clever Cloud Premium customers
EDIT 17:14 UTC+1: The deployment queue finished one hour ago, everything has been working fine so far. This maintenance is over
]]>11:00 UTC: Maintenance is starting. Deployments are disabled.
11:02 UTC: API is down.
11:11 UTC: API and deployments are up again. Maintenance is over.
]]>The migration is expected to last at most 1 hour. All impacted applications will be redeployed during the migration. After the deployment, applications will be able to write to their bucket again. Read operations will not be impacted.
Emails will be sent to customers of the impacted add-ons.
EDIT 12:00 UTC+1: The maintenance will begin shortly
EDIT 12:04 UTC+1: The buckets are now read only
EDIT 12:13 UTC+1: The redeployment queue has started; it should not take more than 15 minutes.
EDIT 12:51 UTC+1: The maintenance is over, the queue ended 20 minutes ago and everything seems to be normal.
]]>10:44 UTC: We have found the cause and fixed the issue. It was due to an internal tool unexpectedly making too many costly requests.
]]>EDIT 13:18 UTC: Ingestion is working again, working at full speed to catch up.
EDIT 14:03 UTC: Ingestion has caught up since a few minutes ago, everything should be back to normal.
]]>The migration is expected to last at most 1 hour. All impacted applications will be redeployed during the migration. After the deployment, applications will be able to write to their bucket again. Read operations will not be impacted.
Emails will be sent to customers of the impacted add-ons.
EDIT 11:55 UTC+1: The maintenance will start on time.
EDIT 12:00 UTC+1: The maintenance is starting
EDIT 12:07 UTC+1: Applications are being restarted. The restart queue should be done in about 20 minutes
EDIT 12:41 UTC+1: The migration is over.
]]>13:06 UTC: The issue has been solved, ingestion is catching up.
13:10 UTC: Ingestion is all caught up. This incident is over.
]]>13:53 UTC: The issue is fixed. Everything is back to normal.
]]>EDIT 18:07 UTC: The IP has been restored. OVH blocked it after a 4-hour email notice about phishing that had escaped our own filters. Further investigations will be conducted to avoid this incident in the future.
]]>EDIT 13:26 UTC: The queue has been consumed. Logs should now be up-to-date.
]]>This server has been unavailable for 8 minutes.
]]>Dedicated addons are NOT impacted.
]]>EDIT 15:45 UTC: Two hypervisors went down. The impacted services are:
Add-ons -> add-ons hosted on those servers are currently unavailable
Applications -> applications that were hosted on those servers should be redeployed or in the redeploy queue
Logs -> new logs won't be processed. This includes drains. You might only get old logs when using the CLI / Console
Shared RabbitMQ -> A node of the cluster is down, performance might be degraded
SSH -> No new SSH connection can be made on the applications as of now.
FS Bucket -> an FS Bucket server was on one of the servers. Those buckets are unreachable and may time out when writing / reading files
EDIT 15:54 UTC: Servers are currently rebooting.
EDIT 15:59 UTC: Servers rebooted and the services are currently starting. We are closely monitoring the situation.
EDIT 16:07 UTC: Services are still starting and we are double checking impacted databases.
EDIT 16:11 UTC: Deployments might take a few minutes to start due to the long deployment queue.
EDIT 16:33 UTC: Most services should be back online, including applications and add-ons. The deployment queue is still processing.
EDIT 16:45 UTC: The deployment queue has been empty for a few minutes; all deployments should go through almost instantly.
EDIT 17:13 UTC: Deployment queue is back to normal.
EDIT 17:15 UTC: The incident is over.
]]>15:52 UTC: The issue has been identified and should be fixed. We are monitoring things closely.
16:11 UTC: Overall traffic in the logs ingestion pipeline is not completely back to normal. If one of your applications does not have up-to-date logs you can try to restart it.
16:32 UTC: We have forced the hand of a component of the ingestion pipeline making it catch up with the logs waiting in queue. It should go back to normal in a matter of minutes now.
]]>While investigating the issue, something broke in one of the reverse proxies which is causing availability issues. We are working on this.
10:25 UTC: The availability issue has been resolved. We are still working on resolving the performance issue.
10:32 UTC: We found the culprit and have implemented a work-around. Performance is back to normal. We are still working on an actual fix.
]]>EDIT 14:03 UTC: The problem is now resolved. Some connection issues happened but a retry would have worked.
]]>EDIT 22:30 UTC: Redsmin owners updated the certificate. Redsmin should now be available again
]]>EDIT 11:02 UTC: The server currently has no network. Add-ons hosted on it are currently impacted.
EDIT 11:16 UTC: The network has come back. Waiting for OVH confirmation on the end of the incident.
EDIT 11:19 UTC: OVH closed the incident, everything should be back to normal.
]]>EDIT 11:42 UTC: The issue has been fixed, metrics and access logs can be queried again. There is a delay (currently 30 minutes) in the ingestion that is currently being resolved.
EDIT 12:10 UTC: The ingestion delay is now resolved, everything should be back to normal.
]]>EDIT 18:53 UTC: The maintenance is still in progress.
EDIT 00:00 UTC: The maintenance is done, the custom metrics should be available again.
]]>13:54 UTC: Related to this issue, the API is unavailable at this time. We are working on it.
13:55 UTC: We stopped the deployments to avoid any more missing updates.
13:56 UTC: The API being unavailable means that the Console and the CLI will display various errors.
14:05 UTC: Git pushes are also unavailable; an error will occur. The main problem has been identified and we are working toward a resolution.
14:23 UTC: We are still working on fixing the root cause of this issue.
14:49 UTC: We are still working on fixing the root cause of this issue. In the meantime, we have managed to get a fully up-to-date configuration on some reverse proxies.
15:07 UTC: We believe we have fixed the root cause of the issue and are working on cleaning everything up.
15:15 UTC: Everything is looking good now. If you still have an issue, please contact us.
]]>The root cause has not yet been found, but this shouldn't have happened as we routinely perform such maintenance operations without any issues. We will look further into this. Apologies for the inconvenience.
]]>"HTTP 503 / This application is redeploying" or "HTTP 404 / Not Found" errors alongside the regular application responses.
The root cause of this is still unclear, additional investigations will be performed. A bit before 16:00, we had an incident on an internal tool that may be related.
]]>We are working on it.
16:23 UTC: This incident is over.
]]>The original incident started at around 05:15 UTC and we have been containing it since then with a lag under tens of seconds at worst.
It's now getting worse due to attempts at fixing the issue which are currently doing the opposite. This will take a while to solve.
11:17 UTC: The ingestion delay is now reduced to about 15 seconds. The issue is not completely solved, this is only a first step.
11:58 UTC: The ingestion delay is now back to normal. The root cause is not entirely fixed so this may come back but we will consider this incident as resolved for now.
]]>EDIT 21:10 UTC: The service has been back to normal for ~30 minutes.
]]>EDIT 17:22 UTC+1: The network has been restored on those servers. We are continuing to investigate which services are currently impacted. Applications that lost network connectivity to our monitoring are restarting. Applications that crashed because they lost their database access are also restarting.
EDIT 17:45 UTC+1: Deployments may still take some time to start or, for those ongoing, to finish. We are cleaning up the situation.
EDIT 18:17 UTC+1: Deployments are back to normal since 18:05. We are still cleaning up the rest of the mess and making sure everything is back to normal and working fine.
EDIT 18:25 UTC+1: Incident is over.
The issue arose during a maintenance operation by our infrastructure provider during which multiple power cables were disconnected from active switches. Some of our servers were linked to those switches, which cut their network access for 5 minutes. The backup network links of those servers were also affected, leading to a total loss of network connectivity. We will investigate this incident further with the infrastructure provider.
]]>11:03 UTC: The maintenance is starting, console is in maintenance mode.
11:06 UTC: Maintenance is almost over.
11:07 UTC: Maintenance is over.
]]>EDIT 17:55 UTC - we identified the issue (DDOS).
EDIT 17:56 UTC - we fixed the issue on internal reverse proxies.
EDIT 19:15 UTC - we are still working to fix the issue.
EDIT 20:30 UTC - fixed and the situation is back to normal. We will publish a post mortem.
16:45 UTC: Our monitoring throws an alert: public and internal reverse proxies traffic is abnormally decreasing. Dedicated reverse proxies for Premium clients are not impacted. The on-call team starts investigations;
16:53 UTC: We see a lot of HTTP requests timing out with PR_END_OF_FILE_ERROR randomly on multiple reverse proxies.
17:00 UTC: We identify lots of IPs running an abnormally shaped DDoS attack against identified domain names on our Paris infrastructure, which prevents reverse proxies from accepting connections and causes the reduced traffic;
17:30 UTC: After banning these addresses, new ones are used for the attack and we start banning IP ranges. During this period, we apply custom reverse proxy configurations to limit the attack's impact on various clients;
17:56 UTC: We are applying these bans on the internal reverse proxies, the internal situation comes back to normal; then we ban these on public reverse proxies;
18:00 UTC: Traffic is back to normal; PR_END_OF_FILE_ERROR disappeared and we are now facing SSL_ERROR_SYSCALL. We start investigating;
18:24 UTC: We determine these errors are due to configuration errors applied during the reverse proxies configuration changes.
20:06 UTC: All configurations are fixed, everything is working as usual. We are improving reverse proxies auto-configuration to avoid error-prone manual actions. We are fixing custom clients' configuration items and are watching monitoring data closely.
20:14 UTC: Reverse proxy improved auto-configuration is deployed.
20:30 UTC: We announce the end of the incident. The attack logs will be used to improve our DDoS detection system.
]]>Applications deployed on more than one scaler should not have been impacted (apart from database access, depending on your particular case). Applications deployed on a single instance had about a 50% chance of being affected.
This network incident had an impact on Metrics, the service was unavailable for 15 minutes after the incident and ingestion has been delayed for another 15 minutes.
As of now, we don't know exactly what happened but we expect that a router malfunctioned and went haywire for a minute.
]]>Our network provider is investigating the issue.
10:48 UTC: We no longer experience packet loss on this interconnection. We are awaiting more information from our network provider on the cause and resolution of this incident.
10:57 UTC: The issue is back, we are experiencing the same amount of loss again.
11:07 UTC: The issue went away again. We are still awaiting word from our provider.
11:32 UTC: We are experiencing packet loss again on the same link.
11:35 UTC: The issue went away again.
11:36 UTC: The issue, ultimately, lies with Free and we cannot do anything about it from our side. Until the root cause is properly fixed, the loss issue may come back off and on.
14:53 UTC: Our network provider tells us that the peering link has been affected by the side effects of a DDOS targeting another customer of our network provider. They are working on providing measures to prevent more attacks targeting this network which should in turn prevent this link from getting overwhelmed.
]]>We are investigating this issue.
13:47 UTC: The issue is fixed. All deployments have been working fine during this period, only delayed by a few seconds. The issue came from a misconfigured deployment component which was sending broken messages to hypervisors. The broken component has been dealt with.
]]>13:17 UTC - no other network loss. All critical parts of Clever Cloud have been checked and restarted to make sure they still communicate with each other.
]]>The issue is currently fixed and we are awaiting full resolution.
]]>EDIT 21:25 UTC: The issue is fixed. The PHP applications may not work correctly. We are redeploying them.
EDIT 22:30 UTC : Applications with FS Buckets have been redeployed. The incident is closed.
Post mortem: An incorrect manual action caused the FS Buckets system to follow the wrong path between storage nodes. We applied a fix to prevent this from happening again.
]]>EDIT 15:25 UTC: fixed. We are investigating the reasons.
EDIT 15:45 UTC: we identified the reasons and applied a fix.
]]>Data won't be lost, the ingestion is simply delayed.
Impacted products:
EDIT 14:03 UTC: Ingestion is now catching up on the delay, everything looks good. Looks like it may take 30 to 40 minutes to go completely back to normal.
EDIT 14:25 UTC: Ingestion has now caught up, everything should be back to normal.
EDIT 21:26 UTC: New issues are ongoing, we are investigating.
EDIT 22:16 UTC: Ingestion is running. We are consuming queues.
EDIT 23:30 UTC: Ingestion is back to normal. Fixed.
]]>EDIT 13:02 UTC: A change causing this issue has been backed out. We will investigate further why it went wrong despite working correctly on our test infrastructure. Sorry for the disruption.
]]>EDIT 10:54 UTC: Redsmin is currently working on a fix.
EDIT 19:54 UTC: The fix seems to be complete. Redsmin interfaces should now be able to load.
]]>For any support queries, you can send us an email at support@clever-cloud.com
EDIT 14:26 UTC: The issue has been found and should now be fixed. We will investigate it further to prevent it from happening again.
]]>We are monitoring the situation. Our new Cellar cluster (cellar-c2.services.clever-cloud.com) is still reachable and works fine.
EDIT 12:02 UTC: A reverse proxy node is somehow still able to communicate with the nodes on Scaleway. All cellar-c1 traffic has been routed through that reverse proxy and requests should be served as expected.
EDIT 12:34 UTC: The network issue does not seem to be on Scaleway's side per se, but rather on the Level3/CenturyLink side, which is a more global network provider.
EDIT 15:17 UTC: The incident on Level3/CenturyLink seems to be resolved. The cluster is now fully reachable.
]]>EDIT 15:17 UTC: The incident on Level3/CenturyLink seems to be resolved. The cluster is now fully reachable.
]]>reaching our services: if your ISP uses this provider, you might experience timeouts reaching our infrastructure
reaching external services from our infrastructure: if you contact external services from our infrastructure, the peering routes might use this network provider and your requests might time out too.
This incident will group the previous opened incidents:
https://www.clevercloudstatus.com/incident/294
https://www.clevercloudstatus.com/incident/295
We do not have an ETA for the service to come back to normal.
EDIT 15:17 UTC: The incident on Level3/CenturyLink seems to be resolved. All connections, either incoming or outgoing to/from our services, should be working as expected. Please reach out to our support if not.
]]>This was caused by a human error, partly related to a laggy UI (low-level UI of a server manager used for a group of servers).
The person who triggered this realized the issue immediately and restarted the server, which stopped responding to our monitoring for a total of 3 minutes.
Chronology:
14:01:30 UTC: The server goes down
14:04:30 UTC: The server responds to our monitoring again and starts restarting static VMs (add-ons and custom services)
14:07:05 UTC: The last static VM starts answering to our monitoring again.
Impact:
Customers with add-ons on this server will find connection errors in their application logs during those 3 to 6 minutes and those applications most likely responded with errors to end users during that time.
Customers with applications with a single instance which happened to be on that server will have experienced about 2 to 3 minutes of downtime before a new instance started responding on another server.
]]>This impacts:
The issue has been identified and we are working toward a fix.
EDIT 14:07 UTC: The problem has been solved and the access logs stored have been processed. You should now be able to have an up-to-date livemap and fetch recent access logs using the CLI / API. Request count will be affected and won't be computed for the time window the access logs were not processed.
]]>14:30 UTC: It looks like an issue with the storage backend, we are working on bringing it back to life.
14:52 UTC: The storage backend looks fine but writes are still failing. We are still investigating this issue. It may take a while.
15:11 UTC: Again, the storage backend looked perfectly fine... restarting everything did fix the issue though so then again maybe it wasn't fine after all. Writes are functional, ingestion is working at full speed, fresh data will be available in ~20 minutes.
15:30 UTC: Ingestion delay is back to normal. Incident is over.
]]>16:23 UTC: Some storage nodes were misbehaving. The issue is now fixed: reads are functional again and ingestion is now catching up.
16:28 UTC: Ingestion delay has been divided by two, incident should be over in under 10 minutes.
16:28 UTC: Incident is over.
]]>EDIT 13:27 UTC: all access logs should be available again since 12:50 UTC. The root cause has been identified and will be addressed. Some logs may have been lost during that timeframe.
]]>During this incident, you may have seen random issues while opening new connections to your databases.
]]>The incident on Cloudflare side: https://www.cloudflarestatus.com/incidents/b888fyhbygb8
EDIT 21:35 UTC: The DNS resolution seems to be back again, our services are currently reachable from our point of view. It may vary depending on your location.
EDIT 22:51 UTC: Cloudflare implemented a fix and we did not see any new issue since then. This incident is now closed.
]]>EDIT 16:30 UTC: fixed.
]]>EDIT 14:32 UTC - the cluster has been upscaled.
]]>At the moment, we know the problem is affecting customers of the French ISP Orange.
08:28 UTC: We found that Orange NS servers were indeed still using the faulty NS records from last night's incident. We have updated the zone on those name servers which should have never been used in the first place and hopefully Orange customers will be able to resolve our domains (and by extension their domains) properly.
08:42 UTC: Looks like the propagation is quite fast and this indeed fixed the issue for affected customers.
]]>EDIT 20:04 UTC - fixed.
]]>The service has been restarted and we will monitor it closely, as well as adding monitoring to better catch this ramp up.
Dedicated databases are not impacted by this issue. If you are impacted, you can migrate your free plan to a dedicated plan using the migration feature. You can find it in the "Migration" menu of your add-on.
22:41 UTC: Load seems back to its normal state. The monitoring has been adjusted and we should then receive an alert at the start of the event instead.
23:19 UTC: The issue is back, the load is not as high as before but it might make the cluster slow.
23:54 UTC: Users impacting the cluster the most have been contacted to avoid this issue. Further actions will be taken later today if the issue persists.
2020-07-01 06:25 UTC: The node crashed due to a fatal assertion hit and restarted
06:38 UTC: The node is still unreachable for an unknown reason
07:48 UTC: The cluster is currently being repaired. For an unknown reason, nodes wouldn't listen to their network interfaces.
09:32 UTC: The repair is halfway through. The cluster might be able to be up again in ~1h30
11:50 UTC: The repair is done, the node successfully restarted. You should now be able to connect to the cluster. We are now restarting the follower node so it can rejoin the cluster.
15:09 UTC: The leader node crashed again because of an assertion failure which means it is now unreachable again as mongodb reads its entire journal and rebuilds the indexes.
15:30 UTC: It usually takes 1h30 for mongodb to read the whole journal so it should be up again around 16:20 UTC.
16:34 UTC: It is taking longer than usual.
20:06 UTC: The restarts weren't successful. The secondary node successfully started at some point but was shut down to avoid any issue with the primary one. We'll try starting it again.
2020-07-02 09:15 UTC: The first node has been accessible now and again but keeps on crashing due to user activity. The second node failed to sync to the first node so it cannot be used as primary right now. We are now trying to bring the first node back up without making it accessible to users so we can at least get backups of every database. Once this is done, we will update you on the next steps. This process will take a while as Mongo takes hours (literally) to come up after a crash.
12:00 UTC: The first node is finally back up (but incoming connections are shut off for now). We are now taking backups of all databases, you should see a new backup appear in your dashboard in the coming minutes / hours. Once this is done, we will start working on bringing the second node back in sync. Once the cluster is healthy, we will bring it back online.
14:30 UTC: Backups are over, customers who were using the free shared plan in production can create a new paid dedicated add-on and import the latest backup there. Meanwhile, we are now rebuilding the second node from the first one to make the cluster healthy again. Once it's over, we will bring the service back up (if everything goes well).
15:55 UTC: The second node is synced up and the service is available again. We are still monitoring things closely.
18:35 UTC: The service is working smoothly, no issues or anomalies to report.
]]>08:37 UTC: Ingestion delay is back to normal. Incident was caused by a few storage nodes misbehaving after a short network issue.
]]>EDIT 14:50 UTC: situation is back to normal.
]]>Applications using redis as a session backend were not impacted by the session issue. They may have been impacted if your application generates temporary files, which are stored on the same fs-bucket.
Our clean-up policy for temporary files was not aggressive enough; we'll now clean them up once a day and will continue to monitor whether we need to increase the available disk space.
This incident started at 11:33 UTC+2 and was fully resolved at 11:41 UTC+2
]]>EDIT 15:20 UTC: fixed.
]]>19:00 UTC: If your application deploys, it will not be up-to-date: it will continue to serve the old content, and the old instance will be kept until this incident is over.
19:12 UTC: The issue has been identified, we are fixing it.
19:20 UTC: The issue was caused by the configuration checker, which took way more time than usual before applying each configuration change. A configuration option to disable those checks inside the program handling the configuration has been enabled. The configuration remains checked by the reverse proxy itself but it is way faster.
Deployments should now be up-to-date.
]]>15:22 UTC: We continue to investigate what's been impacted. Deployments are currently disabled to recover from the event.
15:27 UTC: Deployments are now available.
15:56 UTC: The situation on the platform is stabilized. It seems the outage was between both of our datacenters in the Paris zone. We are asking our hosting provider for more details.
16:05 UTC: Our network provider came back to us. The network outage lasted for 1 minute and 20 seconds. One of the links was lost between those two datacenters. The backup link should have been up 2 seconds after the loss of the first link. But for some reason it did not switch (or not correctly). After a 1 minute timeout, all links were closed and reset leading to a new link election which takes ~20 seconds. From there, the connection has been restored. Our network provider will continue to investigate why the initial backup link did not switch.
Once the network started working again, our monitoring was able to check what was currently "down". The services that were down were restarted but nothing should have impacted reaching your application (it was mostly internal services). Add-on connections from applications should have been back at the same time, but if your application crashed because it couldn't reach the add-on, it should have been automatically redeployed once the deployment system was up again, which was a bit before 15:27 UTC.
We are sorry for the inconvenience this outage created. The time of this incident has been changed from 15:06 UTC to 15:04 UTC to correctly match the actual start time.
]]>An index node has been restarted to upscale it. Its replica did not like the surge of requests and decided to crash a few seconds later. We are currently in the process of upscaling all index nodes to avoid such issues, those 2 nodes were the last remaining on the list.
Index nodes have to scan the whole dataset on start, this will take close to an hour to resolve.
08:07 UTC: Incident is over.
]]>13:38 UTC: Incident is over.
]]>EDIT 06:58 UTC: Ingestion is back at its normal rate, we are currently under the 30 minutes of delay. This should be at 0 seconds of delay in the next couple of minutes.
EDIT 06:18 UTC: Ingestion delay has also been back to normal for a few minutes. Incident is over. Everything (access logs / metrics) should have the latest data again.
]]>Update 21:04 UTC: The GEO IP feature has been fixed. It seems to have initially broken with an auto-update of the GEO IP library but more tests will need to be conducted to be sure of the root cause. All access logs between 18:47 UTC and now have been consumed and you should now be able to query them. We will work on improving the monitoring of the whole system to detect this kind of issue faster.
]]>20:30 UTC: Maintenance is starting.
20:35 UTC: API and deployments are back up, maintenance is over.
]]>Sorry for the inconvenience. We keep watching the status of the deployment system to make sure the problem is indeed resolved.
EDIT 10:40 UTC: Everything is back to normal.
]]>A chunk and its replica are both non-responding which means the service as a whole is unavailable. We are working on it.
10:00 UTC: An index node being unavailable threw us off on the wrong track. Its replica was actually working just fine, the issue was with both front read nodes being stuck at the same time. We will improve monitoring and try to figure out what went wrong and why.
]]>14:35 UTC: We found the cause of the issue and are working on fixing it.
14:47 UTC: The root cause is fixed and the ingestion is now running at full speed. The misconfiguration issue was just half the story, what caused this issue was a partial network split.
14:56 UTC: Ingestion is all caught up. Incident is over.
]]>EDIT 13:12 UTC+2: It also impacts the real-time map in the console. You may not see live queries to your applications, but your applications still receive requests as usual.
EDIT 13:22 UTC+2: Fixed; but during the downtime period the access logs were deleted. We identified the root cause and are fixing it.
]]>Dedicated add-ons (XS SmallSpace and above) were not impacted
]]>06:38 UTC: Everything is back online, ingestion is catching up.
06:52 UTC: Ingestion delay is back to normal.
]]>EDIT 13:01 UTC: fixed.
]]>Ingestion is failing. Access to metrics may be difficult.
15:42:30 UTC: The network is back to normal. We are working on getting the ingestion back to its normal state. Metrics access may be shut down temporarily during this.
16:00 UTC: Ingestion is back online, working through 50 minutes of data.
16:14 UTC: Ingestion delay is almost back to normal.
16:17 UTC: Ingestion delay is back to normal. Incident is over.
]]>We are looking into it.
15:42:30 UTC: The network is back to normal. We are making sure the service goes back to normal.
16:15:00 UTC: Replication of objects created during the incident is ongoing. Service is operational but can be a little slower than usual.
17:05:00 UTC: Everything is back to normal
]]>Our CLI and Console were impacted.
We will investigate this incident further.
]]>11:02 UTC: Ingestion is back online. It's unclear exactly what went wrong at the moment but it is most likely linked to the issue from yesterday. A complete reboot of all storage nodes 'fixed' the issue. Those storage nodes now have 48 minutes of buffered data to ingest.
11:11 UTC: Ingestion delay very close to normal.
11:17 UTC: Ingestion delay is back to normal.
]]>During the upgrade process, one of the workers of this reverse proxy continued to accept connections but didn't process them and kept them until the requests timed out. The issue has been resolved by 11:10 UTC+1 and will be investigated further. This is the first time the upgrade process has failed us in months and we will take extra steps to avoid and detect this issue faster.
]]>Because of this, we disabled ingestion temporarily which will make things easier to debug and fix.
17:26 UTC: Network issue seems to be gone, ingestion is restarted
17:31 UTC: Ingestion is going smoothly. As of now, we don't know what happened network-wise, we are awaiting word from our provider. As of now, it looks like a congestion issue from our point of view.
17:35 UTC: Ingestion delay back to normal
]]>We will investigate this further as we have monitoring for such a case and it apparently didn't trigger here.
]]>The issue has been resolved and the root cause has been found. A patch will be applied to avoid this happening again.
]]>EDIT 16:25UTC: fixed.
N.B.: between the issues and the deployments deactivation, some applications were responding with HTTP 503. It's now fixed.
]]>We don't know exactly what happened at this time but it looks like the impact was fairly minimal on actual users as we can't see any meaningful dip in aggregated incoming bandwidth usage of load balancers.
This post will be updated once we get more details from our network operator.
]]>EDIT: it's now fixed, app status and ssh access are now operational.
]]>This is caused by multiple instances of the same component crashing at the same time.
We are working on fixing this, this may take a while for a definitive fix (30 minutes at best, 1h30 at worst).
14:41 UTC: Metrics are currently available but this will probably not last as there is only partial redundancy on the affected component and the cause of the crash is not fixed
15:23 UTC: Metrics cannot be queried again
15:33 UTC: Metrics can be queried, but issues may still arise from time to time, issue is still not fixed.
15:45 UTC: Two nodes of the storage backend crashed under the load caused by the reload of the first components, this caused a delay in the ingestion and a pause in the reload of the first components. At this time, ingestion is catching up on the delay and queries are running fine despite the issues. You will most likely encounter issues as we work our way through this.
16:48 UTC: We have complete redundancy, this issue is now fixed.
]]>EDIT 15:28 - we are still experiencing issues, we are working on a fix;
EDIT 15:39 - fixed.
]]>This issue has been fixed at 09:12 UTC.
From 08:56 UTC to 09:12 UTC, all clever ssh commands would hang forever.
Since 09:12 UTC, you may get the message "Opening an ssh shell." and then nothing. If this does happen, you will have to restart the application you are trying to ssh to.
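As a rough pointer (assuming you use the clever-tools CLI and the application is linked in your current directory; otherwise restart it from the console), the restart can be triggered with:
clever restart
Once the new instances are up, clever ssh should open a shell again.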
]]>EDIT 18:24 - We identified the issues; linked applications are redeploying.
]]>14:17 UTC: A component of the "live logs" part of the pipeline was a bit overloaded and gradually started slowing everything down until it became noticeable. It has been restarted and the pipeline is now working through the delayed logs waiting in the queue.
14:21 UTC: The load came back up soon after the restart, we are working on bringing it down; we may have to shut it down temporarily to scale it up (quick note: we are working on a new pipeline which can be scaled at will without any downtime)
14:25 UTC: We are temporarily shutting down the Logs API to make things easier.
14:34 UTC: Logs API is back and delay is back to <5 seconds, we are still watching the situation closely.
14:58 UTC: Everything is indeed back to normal.
]]>EDIT 13:55 UTC: fixed.
]]>Metrics cannot be read either; this includes access logs, hence the overview of your organizations is not available.
16:20 UTC: The issue is fixed, ingestion is working again. Overview is still not loading for now (because recent data is not there).
16:34 UTC: There was another issue with the reading part, which is now fixed. Everything is now working as normal. Though there may be some hiccups with the ingestion in the coming minutes.
16:43 UTC: This issue is resolved. Sorry about the inconvenience.
]]>During approximately 1 hour, Metrics and access logs (dot maps / requests count in the console) will be unavailable both in reading and writing starting December 26th at 14:00 UTC+1.
All data will be kept and ingested at the end of the maintenance.
EDIT 13:00 UTC: Maintenance is starting
EDIT 13:23 UTC: Initial steps are done; writes have been delayed by up to 8 minutes and some reads may have failed. The second phase of the maintenance will begin shortly.
EDIT 14:32 UTC: Second phase is over. There were two ingestion delays, peaking at 4 minutes each. The maintenance is not over yet but it should not impact the ingestion nor the read.
EDIT 14:58 UTC: It should not have had any impact but it still did. Ingestion is delayed, reads are impossible; we are investigating.
EDIT 15:21 UTC: The issue is solved; reads are back, ingestion is working
EDIT 15:31 UTC: Ingestion delay is back to normal
EDIT 16:00 UTC: Maintenance is over.
]]>17:45 UTC: Ingestion is back to normal performance, delay will be back to normal in 15 minutes.
]]>18:42 UTC: We are still working on it. This is a never-before-seen, massive issue so we are unable to give any ETA at this time.
22:35 UTC: The issue has been narrowed down and is now under resolution. We will wait until tomorrow morning to continue restoring this service. All metrics gathered before this incident are still accessible, only new metrics are not. Those are currently stored and will be processed once the Metrics cluster goes back to normal. More news tomorrow morning.
12:00 UTC: We have been back working on this since 7:30 UTC, things are looking good; still at least a few hours to go.
13:55 UTC: The issue with the storage platform is now finally fixed. The ingestion is now running at full speed and catching up; it's processing the 22 hours of data which have been accumulating.
15:25 UTC: We are about halfway there.
16:50 UTC: We are 4/5 of the way there. It should be resolved in under an hour.
17:30 UTC: You should now already see recent points in your applications' metrics. Delay will be back to normal in less than 30 minutes. Closing this off.
]]>EDIT 19:44 UTC: fixed, the logs collection is catching up its lag.
EDIT 19:49 UTC: back to normal state.
]]>EDIT 11:03 UTC: The cluster is now back up. A node was shut down for maintenance reasons, as has already happened in recent weeks. Somehow the data it hosted was unavailable even though replicated data is available on other nodes. We will investigate this incident further.
]]>08:16 UTC: The issue is resolved
]]>EDIT 16:32 UTC: The network issue seems to be resolving; only one of our datacenters had the issue but it may have impacted applications and add-ons that weren't in this datacenter.
EDIT 16:36 UTC: Console is not stable because of Clever Cloud API issues due to datacenter network problems.
EDIT 16:40 UTC: Our network provider is already aware of the issue and is looking into it.
EDIT 17:00 UTC: Our datacenters still have issues, we are working on it with our provider.
EDIT 17:17 UTC: The network issue on our datacenters is over but it caused additional issues. API is currently having issues and our console is unreachable at the moment.
EDIT 17:34 UTC: Console and API are up again and we are making sure that all services are up and running again.
EDIT 19:26 UTC: The incident is currently over and nothing has come up since 17:34 UTC.
We are still waiting for more information from our Network Provider that we will add here as soon as we get it.
The network perturbation lasted from 16:18 UTC to 16:30 UTC. One of our datacenters experienced high packet loss due to routing issues. Those issues only impacted external traffic (communication between our 2 datacenters was not impacted). Applications and add-ons were UP but unfortunately, because of those routing issues, you may have experienced difficulties reaching your applications.
Those issues also impacted some of our systems and made our API / Console unavailable for 1h during which deployments were also not working.
]]>EDIT 05:30 UTC: routes have been updated to avoid the faulty router. Traffic is back to normal.
]]>Service should be back in 15 minutes.
Meanwhile, ingestion is still working fine.
15:01 UTC: Incident is over.
]]>10:26 UTC: The ingestion issue is fixed, the system is now catching up.
10:33 UTC: The ingestion delay is almost back to normal.
10:36 UTC: There is still a bit of a lag but it should come back to normal in a few minutes. Read performance is still a bit hit or miss but coming back to normal as well. We will reopen the incident if it does not.
11:06 UTC: The ingestion lag is increasing. We are investigating. This may take a while.
11:30 UTC: The cause has been identified and partially fixed.
11:37 UTC: Lag is now <5s ; we are currently working on fixing the issue in a more permanent way.
11:45 UTC: The issue is now fixed.
]]>13:20 UTC: The issue has been identified and at least partially fixed. Logs are coming through but we are still making sure that everything is indeed fine.
13:25 UTC: The issue is indeed fixed. Some older logs are still being collected.
13:33 UTC: Incident is over.
]]>08:40: The problem has been alleviated by allowing more connections. It will slow down the service but you can at least connect to your databases and migrate to paid add-ons if you were using this service for production. We will start a new cluster very soon to improve performance.
]]>17:21 UTC: Incident is over. A monitoring component was still complaining about a few applications in a loop, there was no actual issue, just a very overzealous alerter process. Deployments performance has been back to normal since 16:43 however.
]]>12:44 UTC: The delay is now back to normal. Some deployments may be stuck though, please contact us if you are experiencing such an issue.
]]>EDIT 20:51 UTC: fixed.
EDIT 23:19 UTC: the logging infrastructure is experiencing issues. We are working on a fix.
EDIT 23:25 UTC: fixed.
]]>EDIT 00:00 UTC: Cluster is now available again, no failover happened.
]]>EDIT 06:22 UTC: Logs are now available again. No logs should have been lost but they might be out of order until 06:15 UTC.
]]>EDIT 00:21 UTC: The cluster is getting back to normal, errors have already significantly decreased and most of the requests should now be successful. We keep monitoring failed requests.
EDIT 03:00 UTC: No more failed requests over the last 30 minutes, the incident is closed. We are still in the process of migrating this cluster's data to the new cluster. Until we automatically migrate your buckets, you can migrate them yourself. Feel free to contact our support for more information.
]]>12:11: An orchestrator was experiencing intermittent network issues. The issue is now fixed.
]]>Applications which had instances on these hypervisors have been redeployed automatically because the monitoring could not reach them (even though they were available).
]]>EDIT 9:29 UTC: fixed.
]]>13:27 UTC: Issue fixed.
]]>EDIT 22:05UTC: the hypervisor is restarted.
EDIT 22:20UTC: incident fixed.
]]>EDIT 19:43 UTC: The maintenance is starting, API will be shortly unavailable.
EDIT 19:49 UTC: The maintenance is over!
]]>EDIT 20:03 UTC: The maintenance should start shortly. We will keep you updated on its progress.
EDIT 20:53 UTC: The maintenance is still ongoing. Nothing unusual to report as of now
EDIT 21:20 UTC: Everything is going smoothly as seen in our tests. Nothing unusual to report as of now
EDIT 21:42 UTC: The maintenance is over. No network interruptions have been noticed by our monitoring systems. Everything is back to normal.
]]>Here is a non exhaustive list of affected actions (some of them will succeed):
EDIT 23:30 UTC: Our payment processor issues should now be resolved. Everything should be back to normal on our side too.
]]>10:02 UTC: Deployments queued now will be postponed until the end of the maintenance.
10:04 UTC: The main API is now unavailable.
10:06 UTC: The main API is restarting.
10:09 UTC: Maintenance is over. The main API is available, pending deployments are starting.
]]>An automatic restart at 09:21:48 UTC made them unavailable until the configuration was re-generated without the error at 09:24:40 UTC.
Steps will be taken to prevent this error from happening again.
]]>EDIT 22:49 UTC: Our API is also down for now, that's expected. The console is therefore down too. Clients' websites remain accessible.
EDIT 23:11 UTC: Network came back 5 minutes ago, we are currently checking if everything is ok
EDIT 23:26 UTC: Applications with fs-buckets (including PHP applications) may have issues loading because of their connection to the fs-bucket server, if this server was in the datacenter that lost connectivity.
EDIT 00:26 UTC: Applications with fs-buckets are currently redeploying. Most of them successfully reconnected (sometimes after several minutes) to their bucket server. The incident is over.
]]>We will perform a migration of the Git repositories. Once deployments are enabled again, you may have to wait a few more minutes depending on your DNS cache.
19:00 UTC: Maintenance is starting, deployments are now disabled (except for Github deployments).
19:13 UTC: The maintenance will last longer than initially planned, we are experiencing an issue and are looking into it.
19:15 UTC: The issue is fixed. We are making sure that everything is indeed fine. Some deployments may now go through, depending on your DNS cache.
19:30 UTC: Maintenance is over; if you encounter an issue, please refresh your DNS cache.
]]>We will perform a migration of the Git repositories. Once deployments are enabled again, you may have to wait a few more minutes depending on your DNS cache.
EDIT: This has been postponed.
]]>It should be quicker than that but if you do have deployments planned, make sure to start them well before the beginning of the maintenance.
EDIT 10:01 UTC: Maintenance is starting now, deployments are disabled.
EDIT 10:19 UTC: Deployments are enabled again.
EDIT 10:31 UTC: Deployments are disabled again. Dedicated reverse proxies for Clever Cloud APIs are out of sync, our APIs are down at the moment. We are working on it.
EDIT 10:39 UTC: Main API is back online.
EDIT 10:47 UTC: Reverse proxies are in sync, deployments are enabled again. We are cleaning up.
EDIT 10:53 UTC: Maintenance is over.
]]>It should go back to normal in 30 to 60 minutes.
EDIT 8:56 UTC: There are still clean-up operations in progress which slow down the cluster. Error rate is going down though.
EDIT 9:55 UTC: Incident over since 9:40
]]>EDIT 18:41 UTC: The problem has now been fixed for a couple of minutes. We gathered information as to why this problem happened and will try to narrow it down.
]]>The issue is now fixed, but deployments will take a little while longer to start until the queue is consumed.
EDIT: Incident over at 09:40 UTC
]]>The new Cellar cluster is not impacted by those issues.
EDIT 23:40 UTC: Cluster now seems to be in a good shape again
]]>EDIT 12:33 UTC: We may have identified the root cause. It may be due to a change that happened this morning. We will revert it.
EDIT 12:43 UTC: The change has been reverted and we confirm that it resolves the issue. Sorry for the inconvenience.
]]>EDIT 23:30 UTC: Other nodes need to be restarted. We saw <1% of failing requests, expect the same amount for the remaining restarts.
EDIT 02:00 UTC: Nodes have been restarted, failing requests are getting lower and lower, still under 1%.
]]>EDIT 15:32 UTC: The issue has been identified, we are currently re-deploying the API. Console is still unavailable.
EDIT 15:34 UTC: The API successfully redeployed and is now available. Console is now available too. The incident is over.
]]>EDIT 14:27UTC: finished.
]]>EDIT 15:31 UTC: The issue has been fixed. Some of the logs were lost but not all of them, you should have the last ~15 minutes, the buffer wasn't large enough to keep them all. We will increase it next week.
]]>EDIT 16:20 UTC: Problematic queries have been killed and the cluster load is going down. We continue to monitor the situation but it should go back to normal. We also have a newer MySQL shared cluster on MySQL version 8. You can migrate your database to it using the "Migrate" tool.
EDIT 16:45 UTC: The performance issue is back, we are trying to narrow down the issue
EDIT 17:00 UTC: Performance is back to normal again. We will keep an eye on it. Meanwhile, do not hesitate to migrate to our new cluster to avoid this issue.
EDIT 10/05/19 08:10 UTC: The issue has come back.
EDIT 10/05/19 12:00 UTC: Owners of the potentially abusive queries have been notified. Cluster performance is back to normal. As usual, we will keep an eye on it.
]]>EDIT: Issue resolved at 15:48:20 UTC
]]>EDIT 09:41 UTC: 503 errors are now gone but were replaced by 500 errors that get triggered after a few seconds. We are checking the cluster's state
EDIT 10:10 UTC: Error rate is decreasing but remains significant. Deployments are also impacted by this issue if you are using the build cache.
EDIT 10:27 UTC: Error rate is still at ~20% and continues to decrease.
EDIT 11:52 UTC: We did not receive any errors since 11:40 UTC, the cluster is now in good shape and everything should be back to normal.
This cellar cluster will soon be deprecated (new cellar add-ons are already created on an up-to-date cluster) in favor of a better and maintained version.
]]>EDIT 16:11 UTC: fixed.
]]>EDIT 17:56 UTC: the systems are going back to normal. It was a DNS resolver problem.
EDIT 17:58 UTC: fixed.
]]>This means that your browser may show you a security alert when visiting a cleverapps.io site.
We are looking into reporting the mistake to the relevant lists and services.
Meanwhile, we remind our users that they should never use a cleverapps.io domain for production; they should only be used for development and tests.
]]>It should go back to normal gradually and will not take more than an hour at the most.
EDIT 17:00 UTC: Error rate and performance is back to normal
]]>EDIT 7:25 UTC: multipart uploads are down, the fixes are ongoing.
EDIT 15:38 UTC: the cluster has been fixed, everything is back to normal.
]]>EDIT 06:12 UTC: The network issue is over. This was an issue with our provider which affected all our servers but not all at the same time. Nothing was actually fully unreachable at any point in time but there was a lot of packet loss.
]]>EDIT 9:19 UTC: This only affects the older SFR network, not the SFR-Numericable network. This specifically affects all SFR peering going through TH2.
EDIT 9:50 UTC: This has been resolved at 9:36:30; if you are still experiencing issues, please tell us.
]]>Console is partly down. Some APIs are down.
EDIT 18:20 UTC: Here is the history and context of the network issue:
At 17:25, a maintenance on a component of a redundant network link caused one of the underlying links to fail. For reasons unknown at this time, the failing link was elected and about 30% of packets were lost until 17:29.
At 17:30, the network engineer decided to revert the change; this caused additional loss for about 30 seconds. Network was back to normal at 17:31.
]]>EDIT 14:01 UTC: Error rate is back to normal. Response times are going down, we are still watching the situation closely.
EDIT 15:40 UTC: We are seeing an elevated error rate again, this was caused by a restart of a node which triggered a very high load on other nodes (which is not supposed to happen). We are investigating.
EDIT 16:30 UTC: The error rate went down significantly but it's not over yet. We sadly cannot give any meaningful ETA as of now.
EDIT 16:55 UTC: The error rate is close to normal. One node is still in trouble and it's causing a few errors; it should resolve quickly.
EDIT 17:15 UTC: The failing node went back to normal at 17:02. We are still seeing a few errors for write requests as of now.
EDIT 17:23 UTC: The error rate is back to normal. A few nodes are still a bit slower than usual so performance is a bit hit or miss but it should go completely back to normal in up to an hour.
]]>EDIT 15:33UTC: fixed.
]]>We didn't see any new timeout since 23:45 UTC but we continue to monitor the service.
]]>EDIT 10:30UTC: fixed.
]]>EDIT 21:00 UTC: fixed.
]]>EDIT 15:28 UTC: All add-ons should be back online, some of them took longer than expected to recover. The cause of the reboot will be investigated.
]]>EDIT 16:42 UTC: The root cause has been found. We are redeploying core components to clean everything.
EDIT 16:50 UTC: Deployments are available since a few minutes now. We are still cleaning things up. Sorry about the issue
]]>EDIT 20:50UTC: we are hard rebooting the hypervisor.
EDIT 20:55UTC: the hypervisor is up, the addons hosted on it are starting.
EDIT 20:58UTC: fixed.
]]>10:27 UTC: The issue is now fixed
]]>We are investigating the issue. Deployments are stopped until we find the root cause.
EDIT 16:35 UTC: Applications state should now be OK. Deployments are still stopped until we figure out the issue.
EDIT 16:40 UTC: Problem has been identified. We will resume deployments in a few minutes. All deployment actions were queued and will be consumed.
EDIT 16:42 UTC: Deployments are enabled again. It may take a few minutes before your actions are handled. We consider this incident over.
]]>EDIT 11:31 UTC: Cause has been identified, we are currently fixing the issue on our reverse proxies.
EDIT 11:34 UTC: All reverse proxies now have a consistent state. The issue is fixed.
The issue happened after a configuration error made during a manual operation on some of the reverse proxies. Applications that redeployed since 11:08 UTC were impacted by that issue. Other applications were fine. The changes were rolled back and will again be tested thoroughly on our test infrastructure.
]]>The maintenance will last at least 5 minutes but no more than 20 minutes.
Different parts of the system will be affected throughout this maintenance, please wait until the end of the maintenance before reporting any issues you may be having.
EDIT 11:00 UTC: The maintenance will start in a few minutes. Deployments and GIT repositories will be unavailable. The console might report an "unknown" or not up-to-date state for applications. This is expected.
EDIT 11:05 UTC: Maintenance is starting, deployments are down and so are GIT repositories (push actions will be rejected)
EDIT 11:09 UTC: Deployments are available again. Push actions on GIT repositories are still disabled.
EDIT 11:10 UTC: Our main API is entering read-only mode. 500 errors might appear during this time.
EDIT 11:11 UTC: Git repositories are now available. You might need to clear your DNS cache to be able to push again.
EDIT 11:20 UTC: Our main API should be fully available again. We are checking that everything looks fine.
EDIT 11:23 UTC: Everything is looking fine. The maintenance is over. You might experience git push errors for up to 45 minutes. To avoid that, please clear your DNS cache.
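For reference, flushing the DNS cache depends on your OS; these are common examples rather than an exhaustive list: ipconfig /flushdns on Windows, sudo dscacheutil -flushcache on macOS, and resolvectl flush-caches on Linux systems using systemd-resolved.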
]]>EDIT 07:11 UTC: The issue is resolved
]]>All deployments actions will be queued and started once the deployment stack is back up. The maintenance shouldn't last longer than 2 hours.
Feel free to ask any question on our support regarding this maintenance.
EDIT 11:03 UTC: the maintenance will start soon. Deployments will be shut down in a few minutes. Push actions on our GIT repositories are disabled.
EDIT 11:06 UTC: Deployments are shut down
EDIT 11:20 UTC: Deployments should be back, we are still cleaning up things
EDIT 12:20 UTC: We have been keeping a close eye on deployments, everything is going smoothly. Maintenance is over.
]]>UPDATE 9:40 UTC: deployments have been back for 20 minutes, we are still cleaning things up.
UPDATE 10:30 UTC: Everything is back to normal, sorry for the issue.
]]>EDIT 8:21 UTC: fixed.
]]>EDIT 18:10: Deployments should be back to normal.
]]>EDIT 11:05 UTC: maintenance is finished.
]]>EDIT 16:29 UTC: the new addon dashboard is available. We are continuing the maintenance.
EDIT 17:30 UTC: maintenance finished.
]]>EDIT 18/01/19 00:53 UTC: Root issue is most probably identified. The issue was coming from an internal tool. We will investigate this further. In the meantime, the tool has been deactivated and shouldn't cause any harm.
]]>EDIT 16/01/2019 09:45 UTC: The problem might be due to old clients drivers being used on the cluster. We have set up a new cluster (version 4.0.3) which should greatly improve things. You can create a new add-on to migrate your database.
To dump the data from your existing add-on, you can use this command: mongodump -u "${MONGODB_ADDON_USER}" -p "${MONGODB_ADDON_PASSWORD}" -h "${MONGODB_ADDON_HOST}" -d "${MONGODB_ADDON_DB}" --archive --gzip
You can then import the data into the new database by using the mongorestore command displayed in the dashboard of your new add-on.
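For illustration only (the authoritative restore command, with the correct credentials, is the one displayed in the new add-on's dashboard; the NEW_MONGODB_* variable names below are hypothetical), a dump-and-restore round trip could look like:
mongodump -u "${MONGODB_ADDON_USER}" -p "${MONGODB_ADDON_PASSWORD}" -h "${MONGODB_ADDON_HOST}" -d "${MONGODB_ADDON_DB}" --archive=dump.gz --gzip
mongorestore -u "${NEW_MONGODB_USER}" -p "${NEW_MONGODB_PASSWORD}" -h "${NEW_MONGODB_HOST}" --nsFrom "${MONGODB_ADDON_DB}.*" --nsTo "${NEW_MONGODB_DB}.*" --archive=dump.gz --gzip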
An automatic migration tool for mongodb should be available in the next few days.
]]>16:15 UTC: The maintenance is over. Add-on creation and dashboard are now fully available again.
]]>At 22:27, one of our hypervisors lost access to parts of its disks. Among others, it impacted a deprecated front reverse proxy for applications and a front reverse proxy for add-ons (databases). We moved the IP of one of the proxies. The other one, related to the application reverse proxy (62.210.92.244), couldn't be moved and is now unreachable. If you still use it, you should update your DNS records: https://www.clever-cloud.com/doc/admin-console/custom-domain-names/#personal-domain-names
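As a quick check (dig is only one example of a DNS lookup tool, and the domain below is a placeholder for one of your own), you can verify whether a domain still resolves to the retired proxy:
dig +short A your-domain.example
If the answer contains 62.210.92.244, the record still needs to be updated.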
The situation is stabilized. We still consider the infrastructure not fully recovered.
]]>EDIT 15:00 UTC: new addon dashboard is available, but addon creation is still unavailable.
EDIT 17:28 UTC: maintenance is now finished.
]]>16:38:00 UTC: An alert due to an important change in network traffic is triggered
16:39:30 UTC: The load balancer is restarted
Everything is back to normal now.
]]>EDIT 10:07 UTC: The reverse proxy has been restarted and the issue seems to be resolved. We are monitoring the situation.
]]>EDIT: There were sudden drops in free disk space. We changed the logging method and it seems to have stabilized the system. We are still working on figuring out the issue.
]]>We are having issues with the authentication component. Open connections are working fine, new connections are impossible for now.
17:21 UTC: It should be fixed. We are making sure.
17:30 UTC: Incident over.
]]>EDIT 20:15 UTC: Incident resolved, it was due to a network misconfiguration. We will ensure this doesn't happen again.
]]>START, RESTART, STOP, ... will be unavailable but will remain in queue and will be processed at the end of the maintenance.
EDIT 13:06 UTC: The maintenance is starting
EDIT 13:17 UTC: Deployments are now available again. Queued deployments have been processed.
Maintenance is over.
]]>EDIT 16:53 UTC: API is fixed. We detected a problem on our reverse proxies, we are currently fixing it.
EDIT 16:54 UTC: fixed.
]]>EDIT 15:17 UTC: fixed.
]]>EDIT 12:18 UTC: We are still trying to figure out a fix for the issue.
EDIT 12:47 UTC: The problem should now be fixed. A configuration error made this incident last longer than it should have. Applications may need to be redeployed to get the SSH service back online.
Sorry about this incident.
]]>Sorry for the inconvenience
]]>14:26 UTC: One culprit has been found. The cluster's load has been reduced significantly.
14:38 UTC: The cluster's load is back to normal since 14:30.
]]>08:38 UTC: We are restarting part of the deployment system.
08:49 UTC: For the last 5 minutes, deployments have been processed with some delay.
08:54 UTC: Back to normal.
]]>10:00 UTC: We found the root cause. The console still can't be loaded at the moment but other services should now be available (like deployments).
10:06 UTC: There was an underlying issue preventing the console from loading. It is now fixed. The incident is now over. Sorry for the inconvenience.
]]>EDIT 14:04 UTC: Metrics are getting back up
EDIT 14:10 UTC: Metrics are fully recovered. Sorry for the inconvenience
]]>Cogent will be performing code upgrades in the following areas.
During these upgrades, customers in or transiting the area may experience intermittent periods of packet loss and latency between 15 and 45 minutes for the duration of the window.
Location: Paris, France
Start time: 11/30 00:01 CET
End time: 11/30 06:00 CET
Work order number: NC840-119
Our link with the Montréal (MTL) zone can be affected, so our systems (deployments, monitoring, etc.) in Montréal (MTL) can experience issues.
]]>Sorry for the inconvenience.
]]>An action was taken at 02:30 UTC (2018-11-21) which has successfully fixed this issue. This is only temporary though.
A permanent fix will be applied later today, which will require a downtime of that component.
EDIT 2018-11-21 16:50 UTC: The permanent fix is delayed to tomorrow, 2018-11-22.
EDIT 2018-11-22 10:40 UTC: The fix will be applied at 10:50 UTC; this will require at least one restart of that component, which will lead to an unavailability of Metrics for about 20 minutes.
EDIT 2018-11-22 11:25 UTC: Metrics are back since 11:08 UTC. Incident over.
]]>EDIT 19:21 UTC: Here is the incident of our provider: https://status.online.net/incident/153 (3 racks have lost public connectivity)
EDIT 20:33 UTC: The issue should be fixed. As of now, our monitoring is happy. We are cleaning up.
]]>EDIT 12:10 UTC: The issue seems to be resolved now
]]>EDIT 16:25 UTC: One of the components was failing due to a network configuration error. The network configuration has been fixed and the component is currently restarting. It should be restarted in about 15 minutes.
EDIT 16:40 UTC: The component has restarted, metrics are now available again for read actions. No data was lost. Sorry for the extended interruption.
]]>EDIT 13:28 UTC: The network issue has been resolved since 13:20 UTC. Everything should be back to normal. Sorry for those issues.
]]>Affected applications are being restarted automatically.
Affected addons are unreachable.
EDIT 17:56 UTC: Looks like it's a network issue, we are awaiting word from our provider.
EDIT 18:08 UTC: Our provider tells us they are working on it, no ETA nor details given.
EDIT 18:26 UTC: There was a short electrical outage in the datacenter where this server is, some routers and switches have been impacted by the switch to the backup power source. They are working on fixing affected network hardware.
EDIT 18:44 UTC: The server is back, addons should be reachable. We are making sure that everything is back online.
EDIT 18:56 UTC: Everything is working fine. Incident closed.
]]>Update 16:34 UTC: The cluster nodes have been restarted. The cluster is UP again. Sorry for the inconvenience.
]]>At 12:30 UTC, we found the cause of the issue.
At 12:32 UTC, the issue was fixed and we regenerated the reverse proxies configuration.
At 12:33 UTC, add-ons were available again.
We have put the necessary protections in place to prevent this from happening in the future.
]]>13:09 UTC: We are going to restart one of the deployment core systems. Deployment actions (like the one above) will be unavailable for up to 30 minutes. All actions will be queued and executed at the end of the maintenance.
13:40 UTC: Another problem occurred during the restart of that system. We are now trying to fix this one.
EDIT 14:03 UTC: Deployments are available since ~5 minutes now. We are still cleaning things up before closing this incident.
EDIT 14:30 UTC: Everything should be back to normal now. Sorry for the extra maintenance time and the deployments unavailability.
]]>EDIT 12:50 UTC: The deployments should now be back to normal. Apologies for the delays.
]]>DNS has been updated. Clients should connect back to the database
EDIT 12:22 UTC: The new leader has been correctly serving requests since 00:30 UTC.
]]>We are trying to restart it.
]]>EDIT 10:17 UTC: We are still working on the issue. If you have trouble deploying, you can set your application's scalability settings to the flavor a dedicated build instance would use. Do not hesitate to ping our support if needed.
EDIT 10:25 UTC: ETA is 2 hours if everything goes well.
EDIT 12:30 UTC: The deployments with cache are back. Everything should work as expected from now. Sorry for any failed deployments or longer than expected deployment times.
]]>The problem has been resolved at 16:08 UTC
]]>EDIT 18:50 UTC: The node has successfully restarted, the cluster should now be operational as usual
]]>We are looking into it.
EDIT 15:30 UTC: Our API is back online. The console can now be loaded.
]]>Applications on this hypervisor are being automatically redeployed. Add-ons are unreachable.
EDIT 12:21 UTC: The hypervisor is back online and is restarting the add-ons.
EDIT 12:32 UTC: All add-ons are now reachable.
]]>The maintenance shouldn't last longer than 30 minutes but it may be possible that some delays occur. We will update this ticket to let you know about the status of the maintenance.
EDIT 12:25 UTC+2: New deployments are no longer being consumed.
EDIT 12:30 UTC+2: The maintenance has started
EDIT 12:56 UTC+2: Deployments are back since ~10 minutes. We are still cleaning things up
EDIT 13:03 UTC+2: Maintenance is over and was successful. Do not hesitate to contact us if anything's wrong on your side.
]]>EDIT 19:17 UTC: This was actually a false positive from our monitoring. After verifying that the component is working fine and fixing the monitoring probe, we re-enabled deployments.
]]>EDIT 05:28 UTC: The server is partially and randomly available. The problem has been identified by our provider: it's coming from the switch the server is connected to. They are working on fixing the issue.
EDIT 08:04 UTC: Issue is fully fixed since 07:30 UTC
]]>The failing node is up again.
]]>EDIT 15:30 UTC: The creation of add-ons and buckets is now fixed. It may take a little longer than usual, but this slowness will be resolved in a few hours.
]]>Users are stretching the "fair usage" concept way above reasonable limits. We are working with them to enforce the fair usage.
]]>We are still watching the cluster.
]]>EDIT 13:17 UTC: Logs should be available, the cluster is slowly recovering
EDIT 13:23 UTC: The logs cluster is UP and running again, logs shouldn't have been lost thanks to buffering.
Sorry about the inconvenience.
]]>Write operations like "git push" or "clever deploy" to Clever Cloud repositories won't be possible during 30min. Read access won't be affected during this time.
Thanks for your patience.
EDIT 13:00 UTC+2: The maintenance is starting
EDIT 13:05 UTC+2: The maintenance is now complete. Do not hesitate to open a support ticket if anything goes wrong. Thanks for your patience!
]]>EDIT 10:27 UTC: Connections should now be working again. It seemed that already established connections were also impacted and were slower than expected. This should now also be fixed.
EDIT 10:27 UTC: FS Buckets service is now fully operational.
]]>EDIT 13:25 UTC: Recovery takes longer than expected, we are still working on it.
EDIT 13:59 UTC: We are still working on fixing these issues.
EDIT 14:08 UTC: We are still having issues but deployments can start.
EDIT 14:41 UTC: Deployments performance has been back to normal for more than 15 minutes now. We are still watching the situation closely. If you have an issue, please contact us.
]]>Some of our internal services were impacted by this network issue and thus, automatic re-deployment of applications has been delayed.
Everything is back to normal, applications are currently finishing their redeployment.
]]>Redis should be back as soon as the maintenance ends
EDIT 13:35 UTC: The maintenance is still ongoing
EDIT 13:50 UTC: The maintenance is over. Redis cluster is UP. Logs cluster is getting back UP. Logs should be saved but might not be directly available through the console
EDIT 14:30 UTC: The logs cluster is now fully operational too
]]>EDIT 11:08 UTC: The server was shutdown a few minutes ago. Applications on it are being redeployed. Add-ons are currently unavailable
EDIT 11:52 UTC: We are still waiting for news from our provider regarding the hard drives issue
EDIT 21:20 UTC: Our provider is still working at finding the root cause of the issue
EDIT 2018-06-29 07:05 UTC: We received an answer from our provider and the server can't be brought back online. Databases will need migration. We are waiting for an answer to know if we can access the disk in read-only mode to transfer the databases. If not, backups from the 28th of June will be used.
EDIT 2018-06-29 07:18 UTC: The disks can't be read. Backups will need to be used
]]>EDIT 22:00 UTC: Restart took approximately 30 seconds, most applications sent again the logs they couldn't send during that time
]]>Some databases are unreachable.
EDIT 2018-06-18T23:25:00 UTC: Seems to be a malfunctioning fan. The server is still down for investigation. We are waiting for more information from our hypervisor provider.
EDIT 2018-06-19T00:37:00 UTC: The malfunctioning fans have been replaced. The server is up again. All the databases are up and running.
]]>EDIT 2018-06-18 16:29 UTC: The hypervisor is up again, the databases are getting back up.
Applications that were on this HV were redeployed on another one.
]]>EDIT 15:08 UTC: We are still waiting for our network provider to find the root cause of it.
EDIT 15-06-18 13:00 UTC: Instabilities have ceased since this morning. Everything should be back to normal
]]>EDIT 10:40 UTC: The node has been restarted, we continue to monitor the situation.
EDIT 13:20 UTC: The cluster has been running fine since the incident
]]>EDIT 13:30 UTC: The maintenance has begun. Deployments are shutdown (but are queued) and git repositories aren't available anymore.
EDIT 13:39 UTC: The maintenance is over, deployments and git repositories are available again
]]>EDIT 09:45 UTC: We might have found why connections are hanging, we are currently doing some tests
EDIT 10:10 UTC: The tests worked fine and a fix has been deployed. All connections should have been restarted. If you still experience troubles with connecting to a particular service, please let us know at support@clever-cloud.com with the service you're trying to access
]]>EDIT 14:40 UTC: Metrics are back since 14:15. Performance is gradually coming back to its usual level.
]]>EDIT 08:05 UTC: Deployments should be back to normal, we are keeping an eye on the situation.
EDIT 08:33 UTC: Some deployments still won't start
EDIT 09:00 UTC: Deployments should be back to normal again. We are still keeping an eye on the situation and cleaning up the remaining issues
EDIT 12:28 UTC: Again, some deployments are failing to finish even though they appear as successfully done in the logs. We are looking at it
EDIT 13:27 UTC: Deployments are going to be stopped to fully clean the system. It should not last more than 15 minutes. The maintenance is starting now.
EDIT 14:08 UTC: Deployments are available since 13:45 UTC. The maintenance period is over. We keep looking for everything to go back to normal
EDIT 16:30 UTC: Everything seems to be back to normal
]]>9:17am Paris Time: incident is fixed. All add-ons have recovered.
]]>EDIT 13:50 UTC: Instabilities have stopped for 10 minutes now, we are still closely monitoring the situation.
]]>EDIT 08:00 UTC: A new version of the PHP image has been released. Redeploying your application should be enough to SSH again to the machine
]]>Until it's over, Metrics are not available. Metrics agents on scalers should push the data when the service is back.
EDIT 15:14 UTC: Metrics are back since 15:12 UTC
]]>Traffic was back to normal at 15:32:00 UTC.
]]>We are investigating the problem
EDIT 19:35 UTC: The problem seems to be gone. It may be due to a maintenance operation made on the Cellar cluster which shouldn't have caused this. This maintenance has been done multiple times without problems. We will keep an eye on the cluster when this maintenance starts again, probably tomorrow.
]]>EDIT 15:10 UTC: The source of the problem is one of our customers receiving a DDoS on its application. While the infrastructure can handle such load, we detected a problem with the configuration of our reverse proxies which doesn't allow us to correctly handle the load of this DDoS. We are looking at how we can improve that. In the meantime, traffic targeting that customer's application has been blocked.
EDIT 16:45 UTC: Most of the traffic is filtered. We will continue to watch the issue in the following hours.
]]>EDIT 11:49 UTC: Incident over since 11:45 UTC
]]>EDIT 17:03 UTC: Real-time delivery is back since 16:50 UTC
]]>EDIT 10:53 UTC: You can create the following environment variable for a temporary workaround: CC_PRE_RUN_HOOK=npm install nomnom@1.8.1 -g
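If you prefer the clever-tools CLI (a sketch assuming the standard clever env command; the value is exactly the workaround above), the variable can be set with the following, followed by a redeploy:
clever env set CC_PRE_RUN_HOOK "npm install nomnom@1.8.1 -g"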
EDIT 11:33 UTC: A fix has been made and the new image version is now deploying on our servers.
EDIT 12:33 UTC: The new image is now live. All NodeJS applications will be redeployed to avoid using a now broken image.
]]>EDIT 17:35 UTC: Service is back to normal and collected metrics have all been correctly persisted.
]]>EDIT 15:42 UTC: Incident over since 15:40 UTC.
]]>EDIT 16:41 UTC: the proxy has been successfully restarted. Add-ons should be reachable again. Applications not supporting the loss of an established connection will be redeployed. We continue to monitor the proxy.
EDIT 17:30 UTC: the incident is now over
]]>EDIT 20:17:00 UTC: The cluster has been restarted, impacted applications have been redeployed. The incident is over
]]>EDIT: Delayed to 12:50 UTC
EDIT 12:50 UTC: Will start in a few seconds
EDIT 13:07 UTC: Maintenance over. If you encounter an issue, please tell us.
]]>EDIT 03:15 UTC: Logs are back again
]]>Performance issues and/or partial outages are to be expected. We will try to keep them as low as possible.
The maintenance starts at 22:00 UTC
EDIT 02:00 UTC: the maintenance is now over
]]>EDIT 20:45:00 UTC: The reverse proxy took ~1 minute to restart. It is now restarted
EDIT 20:48:00 UTC: Impacted applications were redeployed as expected. The incident is now over and all add-ons are now reachable again
]]>The Activity pane (Console), clever status (CLI) and the API endpoint /applications/<app>/deployments incorrectly report the deployment status.
Notifications (slack webhooks, mails) correctly report the deployment status (failed or successful) and can be trusted.
EDIT 21:48 UTC: It should now be fixed. Deployments with the "FAILED" state will keep their broken state.
]]>EDIT 2018-06-15 UTC: All 7 days are now available again.
]]>EDIT 11:31 UTC: Maintenance is starting
EDIT 12:06 UTC: Deployments are back, we are now cleaning some old artefacts
EDIT 13:00 UTC: The maintenance is over
]]>EDIT 19:25 UTC: Those slowdowns might require an infrastructure change that will be done next week. Until then, slowdowns should be less frequent and less severe.
EDIT 2017-12-08 12:00 UTC: Deployments take less time after some fixes on our end. The migration will still happen to fix it entirely. The incident is considered closed because we no longer see any extra deployment times.
]]>EDIT 17:31 UTC: Deployments are disabled for now
EDIT 17:38 UTC: Deployments are now back up but may be stopped again in a few minutes if needed
EDIT 17:55 UTC: The incident is now resolved. We will keep an eye on it for the upcoming days
]]>EDIT 14:56 UTC+1: Unreachable servers are being restarted and will be available shortly. In the meantime, impacted applications are being redeployed.
EDIT 15:26 UTC+1: The team is performing the final cleanup. The issue is about to be closed. The remaining apps and add-ons are being restarted.
EDIT 15:50 UTC+1: The outage is now resolved. Contact the support if you encounter any trouble.
]]>EDIT 13:00 UTC+1: The maintenance is over, deployments have been back for 15 minutes.
]]>cleverapps.io domain, the whole domain name has been marked as malicious.
We are working on clearing the alert. In the meantime, we'd like to warn you that cleverapps.io domain names are provided only for test purposes and that they should not be used in production.
]]>EDIT 12:09 UTC: The API is now performing smoothly. We will keep looking into why it went into such a state.
]]>EDIT 17:30 UTC: all shared redis are now available again
]]>If you need it, here is the IP of the domain: 217.70.184.38
EDIT 19:43 UTC: The incident seems to be resolved, .cleverapps.io domains now resolve correctly
]]>Impacted applications are being redeployed
EDIT 10:15 UTC: The server is still under huge load. Services on it continue to answer correctly in most cases. Applications are still redeploying
EDIT 10:30 UTC: The server is now reachable and responsive, we are looking into why it went under such a heavy load
]]>EDIT 14:24 UTC: The server is still down, we are waiting for more information from our provider
EDIT 14:37 UTC: One of the server's fans has died and the server won't start.
EDIT 14:43 UTC: Impacted databases will be migrated on another server on request to the support. We will also contact impacted users. Let us know if you want to start a new database using tonight's backup
EDIT 15:23 UTC: Our provider is replacing the fans, no ETA for now
EDIT 16:55 UTC: Our provider replaced the fans and the server is now back up. Non migrated databases have been started again and linked applications are being redeployed. We will continue to monitor the situation
]]>EDIT: 17:40: Everything is back, sorry for the interruption.
]]>Update 13:57 UTC: deployments are now back up, we continue to monitor the situation
Update 14:15 UTC: it's all good now
]]>Update 09:29 UTC: the master node has been restarted. We're watching it closely
Update 15:45 UTC: the master has been alright since then
]]>Update 20:13 UTC: Network seems more stable now. We are still waiting for more information from our provider
Update 21:17 UTC: Our provider has confirmed the issue is fixed.
]]>EDIT 18:00 UTC: all good now
]]>EDIT 29/07/17 11:35 UTC: the migration will begin at 12:15 UTC. During the migration and for a few hours after, credit card management might not work
EDIT 29/07/17 13:10 UTC: the migration is over, we will continue to monitor payments for a few hours
]]>This is the 2nd step of the maintenance started on the 20th (https://status.clever-cloud.com/incident/31).
This should not have an impact on availability but may have a slightly bigger impact on performance than the first step (which did not have any noticeable impact).
It should take around 10 hours. This is a very rough estimate though, we will be posting updates along the way.
EDIT 08:01 UTC: Maintenance is starting now.
EDIT 11:55 UTC: Everything is going smoothly. Performance impact is very low.
EDIT 19:35 UTC: Maintenance is still in progress. No significant impact; as with the 1st step, consider this event over.
]]>EDIT 12:05 UTC: All affected applications have finished redeploying; we are awaiting an answer from our provider
EDIT 12:47 UTC: Our provider is "running tests" on the affected server and has not given any ETA as of now.
EDIT 13:00 UTC: The server is reporting a hardware error, not disk-related. Our provider is working on fixing the issue.
EDIT 13:31 UTC: The server fails to start. Our provider is giving us another server and will put the disks of the old server into the new one.
EDIT 14:30 UTC: The server is ready, the disks are up and running. We are now rebooting the server in operational mode; we will make sure everything starts up fine and then update the network configuration.
EDIT 15:11 UTC: All databases are available again.
]]>This is a 2-step maintenance, the second step will be scheduled at a later stage.
This should not have an impact on availability but may have a light to moderate impact on upload / download speeds.
No ETA as of now, we will be posting updates along the way.
EDIT 2017-07-20 08:00 UTC: Maintenance is starting now
EDIT 10:00 UTC: We are expecting the maintenance to end between 21:00 UTC and 2017-07-21 01:00 UTC; we are seeing no significant impact on upload / download speeds as of now
EDIT 14:45 UTC: The maintenance is running fine and still has no significant impact on performance, so we are keeping it as-is. Consider this event over; if something goes wrong, we will create a new event.
]]>The maintenance should not last more than 1 hour.
EDIT 10:18 UTC: Maintenance started a few minutes ago, logs collection will be disabled in a few seconds
EDIT 10:44 UTC: Maintenance has been over for a few minutes, logs are now available
]]>At this point, most services were available except for logs, events and notifications.
Thirty minutes after the beginning of this issue, everything is now fully available.
]]>EDIT 06:48 UTC: The network seems to work fine now. Deployments are unavailable, we are working on bringing them back up.
EDIT 07:35 UTC: Deployments have been back up since 07:15, we are still cleaning up the remaining items.
EDIT 07:40 UTC: Everything is cleaned up and functional now. If you have an issue, come ping us.
]]>EDIT 16:12 UTC: Deployments are back
]]>This should last no more than 10 minutes. Deployments should not be delayed by more than a couple minutes.
Maintenance operation will start at 09:10 UTC.
EDIT 09:19 UTC: Deployments should go back to normal in the next few minutes. Maintenance is over, we are now checking that everything is working fine.
EDIT 09:24 UTC: Deployment delays are back to normal; end of incident
]]>We are awaiting news from our provider.
EDIT 15:30 UTC: We are still awaiting a manual operation from our provider
EDIT 15:37 UTC: They have rebooted the server manually but "observed an error" and are "analyzing" the issue
EDIT 16:04 UTC: The power supply is out of order and is being replaced
EDIT 16:55 UTC: The operation is over, the server just rebooted and will now start recovering / cleaning up after the forced reboot. Databases will be coming back online automatically.
EDIT 17:50 UTC: Most databases are available since 17:15 UTC. The remaining databases are now available
]]>Deployments are stopped until the monitoring is back up and running.
]]>EDIT 11:05 UTC: Maintenance is fully over now, deployments have been available since 10:50 UTC.
]]>EDIT 16:00 UTC: The deployment starting time is back to normal
]]>ETA is about an hour.
]]>Also, deployments are delayed until we clean up the non-important redeployments
UPDATE 17:07 UTC: The incident has been resolved, sorry for those redeployments
]]>UPDATE 12:43 UTC: The problem has been resolved, we will investigate why it happened and how to prevent it from happening again.
]]>EDIT: The issue is gone. It looks like it was a temporary network issue of our provider.
]]>Update 5:43 UTC: the Redis machines are now available, impacted applications are restarting
]]>Update: The problem has been fixed at 16:20 UTC
]]>Update at 15:07 UTC: Problem fixed
]]>