Multiple hypervisors in the Paris zone are unreachable. We are investigating.

14:52 UTC: Network issue is resolved. We are assessing the damage.

15:07 UTC: API and deployments are down. We are cleaning everything and bringing it up.

15:20 UTC: API is back. Deployments are back but have a significant delay as of now.

15:42 UTC: We are still working on this. Deployments are quicker now but not yet back to normal.

16:02 UTC: This incident is over. If you are still experiencing issues, please contact us.

Post-mortem

A maintenance operation carried out by our network provider a few hours before this incident generated a faulty BGP announce. Because of this, a significant portion of traffic coming out of our Paris infrastructure was going out via a NYC peer causing significant delay and even timeouts.

Routers in one of our Paris datacenter were heavily impacted by this issue and failed to accept configuration fixes. After multiple attempts to fix this, our provider ended up power-cycling affected routers which caused most of our hypervisors in this datacenter to be cut off from the rest of the network for 3 minutes.

Corrective actions will be taken to prevent this from happening again (BGP filters, dedicated admin network for the routers which was already scheduled to be set up in a few days). We will also make sure that we are warned in due time if a significant network configuration/hardware issue occurs.

Monday 17th May 2021

Post-mortem