Sunday 3rd April 2022

Infrastructure VMs are crashing on some hypervisors

Live updates:

Some hypervisors are experiencing issues with qemu. VMs are randomly crashing.

We are investigating.

  • 0323: It looks like too many processes are being started and systemd is killing qemu threads.
  • 0330: We suspect a recent update is causing the thread exhaustion on the HVs.
  • 0345: We start applying a patch to revert the update.
  • 0407: We have finished checking everything. The HVs look fine now.

Post Mortem:

Incident summary

On the 4th of April, the CCOS (Clever Cloud Operating System) orchestrator was unable to complete some new deployments.

A few days ago, we introduced a new notification subsystem, which was required to enable the Network Groups feature. With this subsystem, hypervisor agents started initiating new connections to the messaging component.

An issue in the proxy layer, which did not properly close connections, caused connections to pile up until the pooler was saturated. As a result, agents accumulated too many long-lived processes on the hypervisor machines, preventing new processes from being spawned.
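The failure mode above can be illustrated with a toy model. This is a minimal sketch, not the actual proxy or pooler: `BoundedPool`, `reap_idle`, and `simulate` are hypothetical names, and the pool size and idle timeout are arbitrary values chosen for the example. It shows how a bounded pool saturates when clients never close their connections, and how evicting idle connections keeps it available.

```python
class BoundedPool:
    """Toy connection pool (hypothetical, for illustration only)."""

    def __init__(self, max_size, idle_timeout):
        self.max_size = max_size
        self.idle_timeout = idle_timeout
        self.last_used = {}  # connection id -> time of last activity
        self.next_id = 0

    def acquire(self, now):
        # Refuse new connections once the pool is full.
        if len(self.last_used) >= self.max_size:
            raise RuntimeError("pool saturated")
        conn_id = self.next_id
        self.next_id += 1
        self.last_used[conn_id] = now
        return conn_id

    def reap_idle(self, now):
        # Drop every connection that has been idle for too long.
        stale = [c for c, t in self.last_used.items()
                 if now - t >= self.idle_timeout]
        for c in stale:
            del self.last_used[c]
        return len(stale)


def simulate(reap):
    """Open one connection per tick without ever closing it.

    Returns how many of the 10 requests were served.
    """
    pool = BoundedPool(max_size=3, idle_timeout=2)
    served = 0
    for tick in range(10):
        if reap:
            pool.reap_idle(tick)
        try:
            pool.acquire(tick)  # the client never releases the connection
            served += 1
        except RuntimeError:
            pass  # the request is rejected because the pool is full
    return served
```

Without reaping, the pool fills up after the first few requests and every later request is rejected. With reaping, stale connections are removed and all requests are served.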

Our hypervisor controller was unable to spawn new threads, so new deployments could not be completed. Running virtual machines were also unable to spawn new threads, which crashed some of them.

Short term resolution

Since Network Groups is in ALPHA, we immediately decided to roll back its availability and redeploy a non-blocking version that does not rely on our messaging layer.

Long term resolution

Two different actions are being rolled out.

  • The first is a patch, currently being tested on a dedicated deployment, that ensures connections are garbage collected on the messaging service proxy layer.
  • The second is an architectural change to the hypervisor agent that prevents too many processes from being spawned. A dedicated driver has been set up as a service to maintain a single connection and a single process, instead of spawning an on-demand process for each notification. This change should protect the hypervisors from issues with the messaging service, including problems other than connection handling.
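The second action can be sketched as follows. This is only an illustration of the general pattern, under the assumption that notifications can be queued and handled sequentially: `NotificationDriver` and its methods are hypothetical names, and a single worker thread stands in for the single long-lived process described above. The real driver and its connection to the messaging service are not shown.

```python
import queue
import threading


class NotificationDriver:
    """One long-lived worker that handles every notification (hypothetical)."""

    def __init__(self, handler):
        self._queue = queue.Queue()
        self._handler = handler
        # A single worker is started once, not once per notification.
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def notify(self, message):
        # Callers only enqueue; they never spawn anything themselves.
        self._queue.put(message)

    def _run(self):
        while True:
            message = self._queue.get()
            if message is None:  # sentinel: stop the worker
                break
            self._handler(message)

    def close(self):
        # Ask the worker to stop once the queue is drained, then wait for it.
        self._queue.put(None)
        self._worker.join()


handled = []
worker_ids = set()


def record(message):
    handled.append(message)
    worker_ids.add(threading.get_ident())


driver = NotificationDriver(record)
for i in range(100):
    driver.notify(i)
driver.close()
```

However many notifications arrive, the number of workers stays at one. If the messaging side becomes slow, notifications wait in the queue instead of turning into extra processes on the hypervisor.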