One Kubernetes node has crashed. Multiple pods affected

Incident Report for DFDS IT

Postmortem

Postmortem published at: https://github.com/dfds/postmortems/blob/master/PM2024-001 - Two Kubernetes nodes went into NotReady state.md

Posted Mar 18, 2024 - 08:52 UTC

Resolved

This incident has been resolved.

Posted Mar 14, 2024 - 09:20 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Mar 13, 2024 - 16:40 UTC

Identified

The node is running out of memory due to a sharp increase in the memory usage of grafana-agent. A rogue hangfire process in a dev container also seems to contribute to this.

A new node has been spawned. The rogue container has been scaled down to establish whether it and grafana-agent together are the root cause of the problem.

Posted Mar 13, 2024 - 15:26 UTC

This incident affected: Kubernetes critical components (Kubernetes [Hellman] - Capacity/Scheduling).