A fix has been implemented and we are monitoring the results.
Posted Mar 13, 2024 - 16:40 UTC
Identified
A node is running out of memory because of a sharp increase in memory usage by grafana-agent. A rogue Hangfire process in a dev container also appears to be contributing.
A new node has been spawned. The rogue container has been scaled down to establish whether it and grafana-agent are together the root cause of the problem.
Posted Mar 13, 2024 - 15:26 UTC
This incident affected: Kubernetes critical components (Kubernetes [Hellman] - Capacity/Scheduling).