A deploy on Tuesday, zero traffic by Friday
It starts the same way every time. A routine deploy goes out. No errors, no failed tests, no alerts. The site works perfectly for every visitor who types in the URL directly. But organic traffic drops to zero - and nobody notices for days.
The culprit? A robots.txt file that tells Google to stop crawling the entire site. Two lines of text, deployed in under a second, undoing months of SEO work.
If you're not sure how robots.txt works, read our robots.txt guide first. This post is about what happens when it goes wrong in production - the real scenarios we've seen, why they're so hard to catch, and how to make sure they never happen to you.
Real-world disaster scenarios
The staging-to-production leak
The most common scenario. Your staging environment has a restrictive robots.txt to keep test content out of Google:
User-agent: *
Disallow: /
This is correct for staging. The disaster happens when your deployment pipeline copies the entire build output - including robots.txt - to production without environment-specific overrides.
We've seen this happen with:
- Docker builds that bake robots.txt into the image at build time, using the staging config
- CI/CD pipelines that run next build with staging environment variables, then deploy the output to production
- Monorepo setups where a shared public/robots.txt is used across environments without conditional logic
- Platform migrations where the new platform auto-generates a restrictive default that overrides your custom file
The fix isn't "remember to change it" - humans forget. The fix is automation (more on that below).
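One way to automate it is to generate robots.txt at build time from an environment variable, so the restrictive version can never ship to production by accident. Here's a minimal sketch; DEPLOY_ENV, the public/ output directory, and the sitemap URL are placeholder names to adapt to your own pipeline:

```shell
#!/bin/sh
# Sketch: generate public/robots.txt at build time from the environment.
# DEPLOY_ENV and the URLs below are placeholders - adapt to your pipeline.
set -eu
mkdir -p public

if [ "${DEPLOY_ENV:-staging}" = "production" ]; then
  # Production: allow crawling and advertise the sitemap
  printf 'User-agent: *\nAllow: /\n\nSitemap: https://yourdomain.com/sitemap.xml\n' > public/robots.txt
else
  # Anything else (staging, preview, local): block all crawlers
  printf 'User-agent: *\nDisallow: /\n' > public/robots.txt
fi

cat public/robots.txt
```

Because the default branch is the restrictive one, a missing or misspelled environment variable fails safe: staging stays blocked, and only an explicit production value opens the site to crawlers.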
The CDN cache that won't let go
You fix the robots.txt, deploy it, and verify it looks correct at your origin server. But your CDN is still serving the old, restrictive version from cache. Google keeps seeing Disallow: / for hours - or days, if your cache TTL is long.
# Bypassing the cache, you see the correct file
# (note: some CDNs ignore request Cache-Control headers -
# curl your origin server directly to be sure)
curl -H "Cache-Control: no-cache" https://yourdomain.com/robots.txt
# ✅ Shows: Allow: /
# But Google gets the cached edge response
curl https://yourdomain.com/robots.txt
# ❌ Still shows: Disallow: /
After fixing a robots.txt issue, always purge your CDN cache for that specific file. On most platforms:
- Vercel: Automatic on deploy, but verify with curl -I to check cache headers
- Cloudflare: Purge the specific URL in the dashboard or via API
- AWS CloudFront: Create an invalidation for /robots.txt
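A quick way to see whether you're looking at a cached copy is to inspect the response's cache headers. The header names vary by CDN (cf-cache-status is Cloudflare, x-cache is CloudFront), and yourdomain.com is a placeholder - this is a sketch, not a universal check:

```shell
#!/bin/sh
# Sketch: spot a stale CDN copy of robots.txt by its cache headers.
# Header names vary by CDN; yourdomain.com is a placeholder.
inspect_headers() {
  # Reads raw response headers on stdin, keeps only cache-related ones
  grep -iE '^(age|cache-control|cf-cache-status|x-cache):'
}

# Live usage (needs network access):
#   curl -sI https://yourdomain.com/robots.txt | inspect_headers

# Demo on a captured response:
printf 'HTTP/2 200\ncf-cache-status: HIT\nage: 86400\ncontent-type: text/plain\n' | inspect_headers
```

A HIT with a large age value after you've deployed a fix is the tell: the edge is still serving the old file, and you need a purge rather than another deploy.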
The framework upgrade that regenerates the file
You upgrade your framework or CMS, and the new version generates a robots.txt with different defaults. Your carefully configured file gets overwritten by a generic one.
This is especially common with:
- Next.js robots.ts files that get reset during major version upgrades
- WordPress updates that regenerate virtual robots.txt rules
- Headless CMS deploys where the build step generates a new robots.txt from a template
If your robots.txt is generated dynamically (like Next.js app/robots.ts), add a test that verifies the output matches your expectations. If it's a static file, make sure your build step doesn't overwrite it.
The domain migration nobody told SEO about
Your team moves from www.example.com to example.com. DNS is updated, redirects are in place, the site works. But the robots.txt on the new canonical domain either doesn't exist (404) or has the old restrictive rules.
Google treats robots.txt per domain. A working robots.txt on www.example.com does nothing for example.com. After any domain change, verify robots.txt is accessible and correct on the new domain.
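A post-migration check can loop over every hostname that should be live and flag any that don't serve robots.txt. This is a sketch with placeholder domains (example.com, www.example.com); the live curl loop is commented out because it needs network access:

```shell
#!/bin/sh
# Sketch: after a domain migration, confirm robots.txt resolves on every
# hostname that should be live. The domains here are placeholders.
verdict() {
  # $1 = hostname, $2 = HTTP status code
  if [ "$2" = "200" ]; then
    echo "$1: robots.txt OK ($2)"
  else
    echo "$1: robots.txt PROBLEM ($2)"
  fi
}

# Live check (needs network access):
# for host in example.com www.example.com; do
#   code=$(curl -s -o /dev/null -w '%{http_code}' "https://$host/robots.txt")
#   verdict "$host" "$code"
# done

# Demo with literal status codes:
verdict "example.com" 200
verdict "www.example.com" 404
```

Run it against the new canonical domain and any legacy hostnames that still receive traffic; a 404 on the canonical domain is exactly the failure mode described above.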
Why standard monitoring misses this
Robots.txt failures are uniquely hard to detect:
- No errors in your logs. The file returns 200. It's valid. It just says the wrong thing.
- No impact on user experience. Every visitor can browse the site normally. Only bots are affected.
- Delayed symptoms. Google doesn't de-index instantly. Traffic declines gradually over days, making it hard to correlate with a specific deploy.
- No alerts from uptime monitoring. Your monitoring checks that the site is up, not that robots.txt has the right content.
By the time someone notices the traffic drop, investigates in Google Search Console, identifies the crawl block, fixes the file, and waits for Google to re-crawl - you've lost weeks of organic traffic. And depending on your domain authority, it can take just as long to recover.
How to prevent this permanently
1. Add a robots.txt assertion to your CI pipeline
The single most effective prevention. Add a test that runs on every deploy and fails if robots.txt contains Disallow: /:
# In your CI pipeline (GitHub Actions, etc.)
ROBOTS=$(curl -s https://yourdomain.com/robots.txt)
if echo "$ROBOTS" | grep -q "Disallow: /$"; then
echo "❌ FATAL: robots.txt is blocking all crawlers"
exit 1
fi
echo "✅ robots.txt allows crawling"
For pre-deploy checks against build output:
# Check the built file before deploying
if grep -q "Disallow: /$" ./out/robots.txt 2>/dev/null; then
echo "❌ Build produced a restrictive robots.txt"
exit 1
fi
2. Use environment-aware generation
Don't use a static robots.txt file. Generate it dynamically based on the environment:
// Next.js: app/robots.ts
import type { MetadataRoute } from 'next';
export default function robots(): MetadataRoute.Robots {
const isProduction =
process.env.NEXT_PUBLIC_SITE_URL === 'https://yourdomain.com';
if (!isProduction) {
return {
rules: { userAgent: '*', disallow: '/' },
};
}
return {
rules: { userAgent: '*', allow: '/' },
sitemap: 'https://yourdomain.com/sitemap.xml',
};
}
This way, staging automatically blocks crawlers and production automatically allows them. No manual switching required.
3. Monitor the file content, not just the status code
Set up a recurring check that fetches yourdomain.com/robots.txt and alerts if the content changes or contains blocking rules. This can be:
- A LintPage scheduled scan that checks robots.txt as part of a full SEO audit
- A cron job that curls the file and sends a Slack alert if Disallow: / appears
- A GitHub Action on a schedule that runs the CI assertion against your live site
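Such a scheduled check can be a few lines of shell. In this sketch, SLACK_WEBHOOK_URL and yourdomain.com are placeholders, and the live curl calls are commented out; the demo runs the same logic on inline content:

```shell
#!/bin/sh
# Sketch of a scheduled content check: fail (and alert) when robots.txt
# blocks all crawlers. SLACK_WEBHOOK_URL and the domain are placeholders.
check_robots() {
  # Reads robots.txt content from stdin; returns 1 if it blocks everything
  if grep -q '^Disallow: /$'; then
    echo "BLOCKED"
    return 1
  fi
  echo "OK"
}

# Live usage (needs network access):
#   curl -s https://yourdomain.com/robots.txt | check_robots || \
#     curl -X POST -H 'Content-Type: application/json' \
#       -d '{"text":"robots.txt is blocking all crawlers!"}' "$SLACK_WEBHOOK_URL"

# Demo on inline content:
printf 'User-agent: *\nDisallow: /\n' | check_robots || true
printf 'User-agent: *\nAllow: /\n' | check_robots
```

The key difference from uptime monitoring is that this asserts on the file's content, not its status code - which is precisely what standard monitoring misses.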
4. Purge CDN cache on every deploy
Add a post-deploy step that invalidates the CDN cache for /robots.txt. This ensures Google always sees the latest version:
# GitHub Actions example (Cloudflare)
- name: Purge robots.txt cache
run: |
curl -X POST \
"https://api.cloudflare.com/client/v4/zones/$ZONE_ID/purge_cache" \
-H "Authorization: Bearer $CF_TOKEN" \
-H "Content-Type: application/json" \
-d '{"files":["https://yourdomain.com/robots.txt"]}'
Check yours right now
It takes 10 seconds to verify your production robots.txt is correct. Paste your URL below - if there's a problem, you'll know immediately.
Robots.txt Validator
Validate your robots.txt file for syntax errors and blocking rules.