Infrastructure: Multiple Paris hypervisors unreachable

Multiple hypervisors in the Paris zone are unreachable. We are investigating.

14:52 UTC: Network issue is resolved. We are assessing the damage.

15:07 UTC: API and deployments are down. We are cleaning everything up and bringing services back up.

15:20 UTC: API is back. Deployments are back but are currently significantly delayed.

15:42 UTC: We are still working on this. Deployments are quicker now but not yet back to normal.

16:02 UTC: This incident is over. If you are still experiencing issues, please contact us.

Post-mortem

A maintenance operation carried out by our network provider a few hours before this incident generated a faulty BGP announcement. Because of this, a significant portion of the traffic leaving our Paris infrastructure was routed through a peer in NYC, causing significant latency and even timeouts.

Routers in one of our Paris datacenters were heavily impacted by this issue and failed to accept configuration fixes. After multiple attempts to fix this, our provider ended up power-cycling the affected routers, which cut most of our hypervisors in this datacenter off from the rest of the network for 3 minutes.

Corrective actions will be taken to prevent this from happening again: BGP filters, and a dedicated admin network for the routers (which was already scheduled to be set up in a few days). We will also make sure that we are warned in due time if a significant network configuration or hardware issue occurs.
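The post-mortem does not say what form the BGP filters will take. As an illustration only, here is a minimal sketch of one common kind, a prefix-list filter that accepts an announcement only if it falls inside an expected prefix and is not more specific than allowed. The allow-list, the function name, and the prefixes (documentation-only ranges) are hypothetical and are not taken from our actual router configuration.

```python
import ipaddress

# Hypothetical allow-list: (covering prefix, longest prefix length accepted).
# The ranges below are reserved for documentation and stand in for real ones.
ALLOWED_PREFIXES = [
    (ipaddress.ip_network("198.51.100.0/24"), 24),
    (ipaddress.ip_network("203.0.113.0/24"), 28),
]


def accept_announcement(prefix: str) -> bool:
    """Return True if the announced prefix passes the allow-list."""
    net = ipaddress.ip_network(prefix)
    for covering, max_len in ALLOWED_PREFIXES:
        # Compare only within the same IP version; subnet_of() raises otherwise.
        if net.version != covering.version:
            continue
        if net.subnet_of(covering) and net.prefixlen <= max_len:
            return True
    return False
```

With this allow-list, `accept_announcement("203.0.113.16/28")` is accepted, while an unexpected prefix such as `192.0.2.0/24` or a default route `0.0.0.0/0` is rejected. Real routers express this in their own configuration language; the Python version only shows the decision logic.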

No incidents reported

Hypervisor reboot, scheduled 3 years ago

A hypervisor needs to be rebooted. Impacted customers will receive an email shortly, and add-ons that can be migrated will be migrated before the reboot. Estimated downtime is about 15 minutes.

Add-ons will start being migrated at 20:30 UTC+2. The hypervisor will be rebooted at 21:30 UTC+2.

EDIT 20:36 UTC+2: Maintenance is starting. Applications are being redeployed and add-ons are starting their migrations.

EDIT 21:30 UTC+2: Add-ons that could be migrated have been migrated, and applications have been redeployed. The server will now reboot.

EDIT 22:00 UTC+2: The server has finished its reboot. Add-ons that weren't migrated should have been reachable since 21:45 UTC+2. The maintenance is over.

PostgreSQL shared cluster upgrade, scheduled 3 years ago

Following the release announced at https://www.postgresql.org/about/news/postgresql-133-127-1112-1017-and-9622-released-2210/, our PostgreSQL shared clusters will be upgraded to the latest minor version of their branch.
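To make "latest minor version of their branch" concrete, here is a small sketch that maps a running version to its upgrade target. The release numbers are the ones named in the announcement URL (13.3, 12.7, 11.12, 10.17, 9.6.22). The function names and the example input versions are hypothetical, since this notice does not state which versions the clusters currently run.

```python
# Minor releases named in the announcement, keyed by major branch.
LATEST_MINOR = {
    "13": "13.3",
    "12": "12.7",
    "11": "11.12",
    "10": "10.17",
    "9.6": "9.6.22",
}


def branch_of(version: str) -> str:
    """Return the major branch of a PostgreSQL version string.

    From PostgreSQL 10 onward the branch is the first number ("12.5" -> "12");
    before that it is the first two numbers ("9.6.20" -> "9.6").
    """
    parts = version.split(".")
    if int(parts[0]) >= 10:
        return parts[0]
    return ".".join(parts[:2])


def upgrade_target(current: str) -> str:
    """Return the latest listed minor release of the branch `current` is on."""
    branch = branch_of(current)
    if branch not in LATEST_MINOR:
        raise ValueError(f"no release listed for branch {branch}")
    return LATEST_MINOR[branch]
```

For example, a cluster on a hypothetical 12.5 would move to 12.7 and one on 9.6.20 would move to 9.6.22. A minor upgrade never changes the branch.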

Affected clusters are:

postgresql-c4: Paris zone
postgresql-c5: Montreal zone

This update may affect the performance and availability of the databases.

The upgrade will start in a few minutes. This maintenance notice will be updated accordingly.

EDIT 18:28 UTC+2: Montreal cluster is now up to date.

EDIT 19:54 UTC+2: Paris cluster is now up to date, but the PostGIS extension is currently broken due to the update. We are working on a fix.

EDIT 20:27 UTC+2: Paris cluster: databases are currently being migrated to a newer version of PostGIS. It will take a few hours to run on all of the databases.

EDIT 20:42 UTC+2: This maintenance is now considered over.

No incidents reported

Infrastructure: Hypervisor unresponsive in PAR zone

A hypervisor became unresponsive in the PAR zone. It's currently rebooting.

Affected applications are being automatically redeployed. Affected addons are unreachable.

21:53 UTC: The hypervisor is back online and is starting addon VMs.

21:55 UTC: All addons are back online. The incident is over.

Access logs: Metrics/AccessLogs are experiencing issues

Metrics/AccessLogs queues are being consumed. Recent data values are currently unavailable.

06:30 UTC: Incident is over.

API: Core services are experiencing issues

Core services (console, API, metrics, access logs) are experiencing issues. We identified the problem and are working to resolve it.

EDIT 23:02 UTC: the incident is related to one of our hypervisors.

EDIT 23:03 UTC: we restarted the hypervisor; related databases are down.

EDIT 23:04 UTC: hypervisor is up; VMs are starting.

EDIT 23:13 UTC: metrics are down too.

EDIT 23:25 UTC: databases are up. We are now experiencing issues with our internal reverse proxies, and the console and API are not available.

EDIT 23:30 UTC: we queued the linked applications for a high-priority redeploy to ensure they reconnect to their databases. Core services are still partially down.

EDIT 00:00 UTC: all applications are redeployed.

EDIT 02:56 UTC: we are still working to fix issues on our internal core services (console, API); user applications/addons are not impacted.

EDIT 03:30 UTC: internal core services are back!

No incidents reported

Past Incidents

Monday 17th May 2021

Post-mortem

Sunday 16th May 2021

Saturday 15th May 2021

Friday 14th May 2021

Thursday 13th May 2021

Wednesday 12th May 2021

Tuesday 11th May 2021