ZenML Pro Tenant Outage

Incident Report for ZenML Public Services

Postmortem

A Kubernetes node running ZenML server pods came under memory pressure, and the kubelet evicted pods to reclaim node resources. The cluster autoscaler provisioned a new node, and all pods were rescheduled onto it. This caused a short disruption in service.

We have analyzed the incident and prepared a plan to make ZenML server pods more highly available going forward. We are treating this work as a priority.
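For readers unfamiliar with node-pressure eviction, the sketch below shows how Kubernetes assigns a Quality of Service (QoS) class to a pod from its resource requests and limits. Under memory pressure the kubelet first evicts pods whose memory usage exceeds their requests, so BestEffort pods and over-request Burstable pods typically go before Guaranteed ones. This is an illustration only: it does not describe the QoS class of the affected ZenML server pods or the content of the remediation plan, neither of which is stated in this report. The function name `qos_class` and the dictionary layout are hypothetical, and quantities are compared as plain strings for simplicity.

```python
def qos_class(containers: list[dict]) -> str:
    """Return the Kubernetes QoS class for a pod (simplified sketch).

    Each container is a dict with optional "requests" and "limits"
    mappings, e.g. {"requests": {"memory": "1Gi"}, "limits": {...}}.
    Quantities are compared as strings, so "1Gi" != "1024Mi" here.
    """
    any_set = False      # does any container declare a request or limit?
    guaranteed = True    # do all containers satisfy the Guaranteed rule?

    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # When only a limit is set, the request defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False

    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


# Hypothetical pod specs, not taken from the ZenML deployment.
pinned = [{"requests": {"cpu": "500m", "memory": "1Gi"},
           "limits": {"cpu": "500m", "memory": "1Gi"}}]
partial = [{"requests": {"memory": "512Mi"}}]
unset = [{}]

print(qos_class(pinned), qos_class(partial), qos_class(unset))
```

Setting requests equal to limits for both CPU and memory on every container is one common way to lower a pod's eviction risk, at the cost of reserving that capacity on the node.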

Posted Nov 08, 2024 - 10:35 UTC

Resolved

Many ZenML Pro tenants were inaccessible today for 2 minutes (03:19 to 03:21 UTC).

Posted Nov 07, 2024 - 03:19 UTC