Thursday 4th January 2024

Access logs [Metrics] Elevated query error rate

We are seeing an elevated error rate for metrics read queries, caused by the underlying storage system. The problem has been identified and we are working toward its resolution. Some Grafana dashboards or API queries may be impacted. Write performance is not impacted.

Update Thu Jan 04 14:48:00 2024 UTC: We have triggered some data balancing. Some queries may take longer than expected, which can impact some Grafana dashboards or API queries. Write performance may be impacted.

Update Thu Jan 04 20:44:01 2024 UTC: Data balancing is more aggressive than expected and is overloading some components. Queries may be unavailable during that time.

Update Fri Jan 05 02:26:05 2024 UTC: Some components are still overloaded. We are currently catching up on the lag, but queries are disabled for now.

Update Fri Jan 05 08:01:45 2024 UTC: Our write path is still overloaded. We are searching for the bottleneck.

Update Fri Jan 05 16:03:48 2024 UTC: A cleanup subroutine has been triggered to balance our internal B-tree storage and remove slack space from it. Queries are still disabled to speed up the process.

Update Sat Jan 06 11:25:28 2024 UTC: The lag has been absorbed. Queries are available again, but the cleanup subroutine is still in progress. You may notice latency spikes when querying.

Update Mon Jan 08 14:36:57 2024 UTC: The cleanup subroutine is still in progress, and some workloads have overloaded some components. Queries are disabled to speed up recovery.

Update Mon Jan 08 16:36:18 2024 UTC: Queries are now open again.

Update Tue Jan 09 14:38:34 2024 UTC: Some StorageServers are lagging, meaning that a very small portion of the data is not available to queries. We are currently catching up on the lag.

Update Tue Jan 16 14:56:55 2024 UTC: Closing the ticket.