Date of Incident: February 25, 2025
Duration: approximately 15 minutes
Severity: High — partial service disruption affecting *.zenml.io subdomains
Authors: ZenML Infrastructure Team
Status: Resolved
On February 25, 2025, we performed a planned DNS migration for zenml.io, switching nameservers from AWS Route 53 to Cloudflare as part of a broader move from Webflow to Cloudflare Pages for our website hosting. During the cutover, we discovered that the Cloudflare zone did not contain all DNS records present in Route 53 — specifically, several wildcard and subdomain records under *.zenml.io were missing. This caused resolution failures for those subdomains, impacting users and internal services that relied on them.
We rolled back by reverting the nameservers to AWS Route 53, which restored service. Because we had intentionally preserved the Route 53 hosted zone and had previously lowered TTLs, the rollback propagated quickly.
When the Cloudflare zone for zenml.io was set up, DNS records were imported from the existing Route 53 hosted zone. Cloudflare's automatic DNS import does not guarantee a complete, one-to-one copy of all records. This is particularly true for wildcard records, less common record types, and records that depend on Route 53-specific features such as alias records, weighted routing, or latency-based routing.
Several records under *.zenml.io were either not imported at all or were not correctly replicated during the initial Cloudflare zone setup. This gap was not caught during the pre-migration review because the review focused primarily on critical records (MX, SPF, DKIM, DMARC, apex, www) and did not include a comprehensive, record-by-record diff between the Route 53 zone and the Cloudflare zone.
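The record-by-record diff described above can be sketched in a few lines. This is a minimal illustration, not the tooling we used: it assumes both zones have been exported to a simplified "name TTL IN TYPE value" text form, and the Record type, the parse_zone and missing_records helpers, and the example.com names are all hypothetical. TTLs are ignored on purpose, since they often differ between providers without affecting resolution.

```python
from typing import Iterable, NamedTuple


class Record(NamedTuple):
    """One DNS record, identified by name, type, and value (TTL ignored)."""
    name: str
    rtype: str
    value: str


def parse_zone(lines: Iterable[str]) -> set[Record]:
    """Parse simplified 'name TTL IN TYPE value' lines into a set of records.

    Assumption: one record per line, no multi-line or $ORIGIN syntax.
    """
    records = set()
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(";"):  # skip blanks and comment lines
            continue
        name, _ttl, _cls, rtype, value = line.split(None, 4)
        # Normalize case and the trailing dot on the owner name only.
        records.add(Record(name.lower().rstrip("."), rtype.upper(), value.strip()))
    return records


def missing_records(source: set[Record], target: set[Record]) -> list[Record]:
    """Return records present in the source zone but absent from the target."""
    return sorted(source - target)
```

Feeding the Route 53 export in as the source and the Cloudflare export as the target gives the list of records that would stop resolving after a cutover. Provider-specific records (for example Route 53 alias records) would need to be translated into plain records before a comparison like this is meaningful.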
When the nameservers were switched, Cloudflare became authoritative for zenml.io. Any DNS query for a subdomain that existed in Route 53 but not in Cloudflare returned NXDOMAIN (non-existent domain), causing those services to become unreachable.
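Because a nameserver can be queried directly before it becomes authoritative, this kind of gap can be checked for ahead of a cutover. The sketch below is one possible way to do that, assuming the dig command-line tool is installed; the dig_short and unresolved helpers and the host names in the usage note are hypothetical and not part of our actual runbook.

```python
import subprocess
from typing import Callable, Iterable


def dig_short(nameserver: str, name: str, rtype: str) -> list[str]:
    """Query one nameserver directly with `dig +short` and return answer lines."""
    result = subprocess.run(
        ["dig", "+short", f"@{nameserver}", name, rtype],
        capture_output=True, text=True, timeout=10, check=False,
    )
    # Lines starting with ';' are dig diagnostics (e.g. timeouts), not answers.
    return [
        line for line in result.stdout.splitlines()
        if line.strip() and not line.startswith(";")
    ]


def unresolved(
    expected: Iterable[tuple[str, str]],
    nameserver: str,
    resolve: Callable[[str, str, str], list[str]] = dig_short,
) -> list[tuple[str, str]]:
    """Return the (name, type) pairs for which the nameserver gives no answer."""
    return [
        (name, rtype)
        for name, rtype in expected
        if not resolve(nameserver, name, rtype)
    ]
```

For example, unresolved([("app.example.com", "CNAME")], "new-ns.example.com") returns the pairs that the candidate nameserver cannot answer. A wildcard record is best checked by querying a concrete name that the wildcard is expected to cover. The resolve parameter is injectable so the check can be exercised without network access.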
We did not query the Cloudflare nameservers directly (e.g., dig @arnold.ns.cloudflare.com subdomain.zenml.io) before switching the authoritative nameservers. This would have revealed the missing records without any production impact.
Querying the new nameservers directly (e.g., dig @new-ns.example.com record.domain.com) before switching authoritative nameservers is a low-effort, high-value validation step that would have caught this issue with zero user impact.
We take the reliability of our infrastructure seriously, and we're sorry for the disruption this caused. The good news is that the rollback mechanisms we put in place worked as intended, and the incident was short-lived. We're using this as an opportunity to improve our migration processes so that the next attempt, and any future DNS migrations, go smoothly.
If you experienced issues during this window and are still seeing problems, please reach out to us at [support channel/email]. We're happy to help.