Some systems are experiencing issues

Past Incidents

Thursday 8th June 2023

Access logs Metrics system writes are slow

Our metrics system's HBase cluster is in an inconsistent state. We have identified the nodes responsible and are fixing them.

12:26 UTC: We restarted the node responsible for the issue. While it re-converges, the egress servers are stopped. We will bring them back online in a few minutes.

13:31 UTC: Query is back online. We are still catching up on the lag, so new data points may not be available yet.

14:35 UTC: The lag has been caught up.

Wednesday 7th June 2023

Access logs Metrics and access logs storage layer unreachability

Our monitoring has detected a failure on the storage layer for metrics and access logs. We found that a storage node had lost several disks. We have removed the faulty disks and restarted the storage node.

EDIT 16:00 UTC: The storage layer has been restarted and we are working through the ingestion lag.

Infrastructure [RBX] A hypervisor has rebooted
  • 2023-06-07 08:56 UTC: A hypervisor in the RBX zone rebooted.
  • 09:00 UTC: The machine has fully rebooted and is restarting all its VMs. Application VMs are redeploying on other hypervisors.
  • 09:31 UTC: The checks are done and everything appears to be running normally.

We will investigate to understand why this hypervisor rebooted in the first place.

Tuesday 6th June 2023

Reverse Proxies [JED] Load balancer metrics show abnormal response status codes

Our load balancer monitoring is detecting an abnormal number of HTTP 404 responses. We are investigating.

EDIT 13:00 UTC: We have located the root cause and are applying a fix.

EDIT 14:20 UTC: The issue is resolved.

Monday 5th June 2023

No incidents reported

Sunday 4th June 2023

Infrastructure [RBX] Lost connectivity with a hypervisor

We lost connectivity with a hypervisor on RBX. Applications have been redeployed, but some databases may not be reachable. We are investigating.

EDIT 03:58 UTC: The server is back online. All databases should now be reachable.

Saturday 3rd June 2023

No incidents reported

Friday 2nd June 2023

Access logs Metrics/access logs storage layer issue

We are detecting errors on the storage layer that holds metrics and access log data. We are investigating.

EDIT: The lag has been caught up.