ZenML Pro Workspace Outage

Incident Report for ZenML Public Services

Postmortem

📅 Timeline of Events

Prior to May 2025:
- The production EKS cluster hosting ZenML Pro workspaces was running an outdated Kubernetes version and AMI image type.
- Decision to upgrade was taken.
May 09 2025:
- Full upgrade of staging EKS cluster was completed successfully with minimal disruption. It was also discovered that an upgrade of the EKS and VPC Terraform modules was also necessary to complete the EKS cluster upgrade.
- Upgrade plan for production was approved for the weekend of May 10–11.
May 10, 2025:
- 18:00 – Upgrade initiated. Most ZenML Pro workspaces were idle at that time. The following actions completed without issue:
  - VPC module upgrade
  - EKS module upgrade
- 23:38 – Secrets encryption with AWS KMS was enabled via Terraform as part of the EKS module upgrade. This took significantly longer than on staging (likely due to node count and secret volume).
- 00:45+ – Upgrade left to continue overnight. No issues were expected.
May 11, 2025:
- 06:00 – Upgrade was resumed. Production cluster was found in a degraded state:
  - Secrets encryption process had been successful but as a side effect, Terraform also removed key IAM permissions from EKS worker nodes.
  - Resulted in inaccessibility of ZenML Pro server pods from outside the cluster.
  - All workspaces effectively down.
- 06:00–7:30 – Downtime persisted while troubleshooting:
  - Attempts to restore access by modifying IAM roles failed.
  - Full deletion and re-creation of the node group was ultimately required.
- ~7:30 – Full restoration of services completed.

🧠 Root Cause Analysis

Primary Cause:

Regression in EKS worker node IAM permissions caused by applying KMS-based secret encryption through Terraform and hitting an undocumented bug.

Contributing Factors:

Reliance on outdated EKS and VPC Terraform modules and legacy AMI versions.
Upgrade complexity accumulated due to skipped Kubernetes and module versions.
Misjudged risk of change based on successful upgrade in staging cluster with significantly lower scale.

📉 Impact Assessment

Availability: ~8 hours of downtime for all production ZenML Pro workspaces.
Confidentiality/Integrity: No data loss, corruption, or unauthorized access occurred.
Customer Impact: Temporary loss of access to workspace UIs and pipelines

🛠️ Resolution and Remediation

Recreated affected EKS node group with correct IAM permissions and configuration.
Restored service access across all affected ZenML Pro environments.

Post-Incident Actions:

✅ Upgraded all Terraform modules (EKS, VPC) to current supported versions.
✅ Completed EKS cluster upgrades as planned.
🟢 Initiated review of periodic Kubernetes and infrastructure upgrade strategy to prevent recurrence.

Posted May 12, 2025 - 10:20 UTC

Resolved

On May 11th 2025, between 12 AM and 8 AM CEST, all ZenML Pro workspace servers were unreachable. The ZenML Pro dashboard was unaffected.

Posted May 11, 2025 - 00:00 UTC