DNS problems for *.zenml.io subdomains

Incident Report for ZenML Public Services

Incident Postmortem: DNS Outage During Cloudflare Migration

Date of Incident: February 25, 2025
Duration: approximately 15 minutes
Severity: High — partial service disruption affecting *.zenml.io subdomains
Authors: ZenML Infrastructure Team
Status: Resolved

Summary

On February 25, 2025, we performed a planned DNS migration for zenml.io, switching nameservers from AWS Route 53 to Cloudflare as part of a broader move from Webflow to Cloudflare Pages for our website hosting. During the cutover, we discovered that the Cloudflare zone did not contain all DNS records present in Route 53 — specifically, several wildcard and subdomain records under *.zenml.io were missing. This caused resolution failures for those subdomains, impacting users and internal services that relied on them.

We rolled back by reverting the nameservers to AWS Route 53, which restored service. Because we had intentionally preserved the Route 53 hosted zone and had previously lowered TTLs, the rollback propagated quickly.

Impact

  • What was affected: DNS resolution for multiple *.zenml.io subdomains failed, meaning those services were unreachable from the public internet.
  • Who was affected: Users accessing ZenML services hosted on subdomains of zenml.io, as well as internal tooling and integrations relying on those DNS records.
  • What was not affected: Email delivery (MX, SPF, DKIM, and DMARC records had been verified on the Cloudflare side) and the apex zenml.io domain.

Root Cause

When the Cloudflare zone for zenml.io was set up, DNS records were imported from the existing Route 53 hosted zone. Cloudflare's automatic DNS import does not guarantee a complete, one-to-one copy of the source zone. Wildcard records, less common record types, and Route 53-specific features (such as alias records and weighted or latency-based routing policies) are particularly likely to be dropped or translated imperfectly.

Several records under *.zenml.io were either not imported at all or were not correctly replicated during the initial Cloudflare zone setup. This gap was not caught during the pre-migration review because the review focused primarily on critical records (MX, SPF, DKIM, DMARC, apex, www) and did not include a comprehensive, record-by-record diff between the Route 53 zone and the Cloudflare zone.

When the nameservers were switched, Cloudflare became authoritative for zenml.io. Any DNS query for a subdomain that existed in Route 53 but not in Cloudflare returned NXDOMAIN (non-existent domain), causing those services to become unreachable.

What Went Well

  • Rollback plan worked exactly as designed. We had explicitly decided to keep the Route 53 hosted zone intact rather than deleting it, which meant reverting the nameservers immediately restored all DNS resolution.
  • Low TTLs accelerated recovery. TTLs on Route 53 records had been lowered ahead of the migration, so both the initial propagation and the rollback propagated quickly.
  • Quick detection. The issue was identified within minutes of the cutover, limiting the overall impact window.
  • Clear ownership. The migration had a defined plan with assigned owners, which meant the team knew who to contact and what the rollback procedure was.

What Went Wrong

  • Incomplete DNS record import. Cloudflare's automatic import did not capture all records from Route 53. This is a known limitation of cross-provider DNS imports, but we did not account for it sufficiently.
  • No exhaustive pre-migration audit. The review of imported records focused on a curated checklist (MX, TXT, CNAME for www, apex redirect) rather than a full export-and-diff of every record in the Route 53 zone against every record in the Cloudflare zone.
  • No pre-cutover validation of subdomain resolution. We did not test resolving *.zenml.io subdomains against the Cloudflare nameservers directly (e.g., using dig @arnold.ns.cloudflare.com subdomain.zenml.io) before switching the authoritative nameservers. This would have revealed the missing records without any production impact.
  • Wildcard and subdomain records were not explicitly inventoried. The migration plan focused on the website migration (apex + www) and email, and did not include a comprehensive inventory of all subdomains in active use.
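The missing pre-cutover check above is easy to automate. The sketch below is one way we might do it (the helper name, hostname list, and resolver wiring are illustrative, not part of our actual tooling): before switching authoritative nameservers, query the new provider's nameservers directly for every hostname in the inventory, and fail loudly if any of them do not resolve.

```python
# Sketch of an automated pre-cutover DNS validation step. The function
# name and the resolver interface are assumptions for illustration.
# In production, `resolve` would wrap something like
# `dig @arnold.ns.cloudflare.com <hostname>` or a dnspython query
# pinned to the candidate provider's nameservers.

def find_missing_records(hostnames, resolve):
    """Return the hostnames that the candidate nameserver cannot answer.

    `resolve` is any callable that takes a hostname and returns a list
    of answers; an empty list or an exception is treated as NXDOMAIN.
    """
    missing = []
    for host in hostnames:
        try:
            answers = resolve(host)
        except Exception:
            answers = []
        if not answers:
            missing.append(host)
    return missing
```

Run against the full subdomain inventory, a non-empty result from this check would have blocked the nameserver switch and surfaced the gap with zero production impact.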

Lessons Learned

  1. Automated DNS imports between providers are not reliable enough to trust without verification. Always treat an imported zone as a starting point, not a finished state.
  2. A full zone export diff is a mandatory pre-migration step. Before any DNS provider migration, export the complete zone file from the source provider and compare it record-by-record against the destination. For Route 53, this means exporting via the AWS CLI or console and comparing against the Cloudflare zone export.
  3. Pre-cutover resolution testing is essential. Querying the new nameservers directly (e.g., dig @new-ns.example.com record.domain.com) before switching authoritative nameservers is a low-effort, high-value validation step that would have caught this issue with zero user impact.
  4. Subdomain inventory matters. For domains with many subdomains or wildcard records, maintaining an inventory of all active subdomains and their purposes is critical before any migration.
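The export-and-diff step from lesson 2 can be sketched as a small comparison over parsed record sets. This is a minimal illustration, not our actual tooling: record sets are modeled as dicts keyed by (name, type), and in practice the two sides would be built from a Route 53 export (e.g. `aws route53 list-resource-record-sets`) and the destination provider's zone export.

```python
# Minimal sketch of a record-by-record zone diff (names and record
# values below are illustrative). Each zone is a dict mapping
# (record_name, record_type) to a set of record values.

def diff_zones(source, destination):
    """Return records that are missing or mismatched in `destination`."""
    problems = {}
    for key, values in source.items():
        dest_values = destination.get(key)
        if dest_values is None:
            # Record exists in the source zone but not the destination.
            problems[key] = {"missing": set(values)}
        elif set(dest_values) != set(values):
            # Record exists on both sides but with different values.
            problems[key] = {"expected": set(values), "found": set(dest_values)}
    return problems
```

A migration gate as simple as "refuse to switch nameservers unless this diff is empty (or every difference is explicitly acknowledged)" would have caught the missing wildcard records before cutover.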

Closing Notes

We take the reliability of our infrastructure seriously, and we're sorry for the disruption this caused. The good news is that the rollback mechanisms we put in place worked as intended, and the incident was short-lived. We're using this as an opportunity to improve our migration processes so that the next attempt — and any future DNS migrations — go smoothly.

If you experienced issues during this window and are still seeing problems, please reach out to us at [support channel/email]. We're happy to help.

Posted Feb 25, 2025 - 17:57 UTC

Resolved

The DNS issue affecting cloud.zenml.io, cloudapi.zenml.io, and workspaces has been resolved. Due to DNS propagation, some regions may take longer to fully restore. If you are still experiencing connectivity issues, please try flushing your local DNS cache and retrying. If the problem persists, reach out to hello@zenml.io. A full post-mortem will follow.
Posted Feb 25, 2025 - 17:07 UTC

Monitoring

A fix has been implemented on the DNS side and is currently propagating. Access to cloud.zenml.io, cloudapi.zenml.io, and workspaces should be gradually restoring. We are monitoring the rollout to ensure full recovery. If you continue to experience issues, please try clearing your DNS cache and retrying.
Posted Feb 25, 2025 - 17:01 UTC

Identified

The issue has been identified and a fix is being implemented.
Posted Feb 25, 2025 - 16:56 UTC

Update

We are continuing to investigate this issue.
Posted Feb 25, 2025 - 16:53 UTC

Update

We are continuing to investigate this issue.
Posted Feb 25, 2025 - 16:51 UTC

Investigating

We are currently investigating DNS issues affecting cloud.zenml.io. Users may experience timeouts or inability to reach ZenML Pro services. Our team is actively working to identify and resolve the root cause. We will provide updates as more information becomes available.
Posted Feb 25, 2025 - 16:47 UTC
This incident affected: ZenML Pro.