A deploy on Tuesday, zero traffic by Friday
It starts the same way every time. A routine deploy goes out. No errors, no failed tests, no alerts. The site works perfectly for every visitor who types in the URL directly. But organic traffic drops to zero - and nobody notices for days.
The culprit? A robots.txt file that tells Google to stop crawling the entire site. Two lines of text, deployed in under a second, undoing months of SEO work.
If you're not sure how robots.txt works, read our robots.txt guide first. This post is about what happens when it goes wrong in production - the real scenarios we've seen, why they're so hard to catch, and how to make sure they never happen to you.
Real-world disaster scenarios
The staging-to-production leak
The most common scenario. Your staging environment has a restrictive robots.txt to keep test content out of Google:
User-agent: *
Disallow: /
This is correct for staging. The disaster happens when your deployment pipeline copies the entire build output - including robots.txt - to production without environment-specific overrides.
We've seen this happen with:
- Docker builds that bake robots.txt into the image at build time, using the staging config
- CI/CD pipelines that run next build with staging environment variables, then deploy the output to production
- Monorepo setups where a shared public/robots.txt is used across environments without conditional logic
- Platform migrations where the new platform auto-generates a restrictive default that overrides your custom file
The fix isn't "remember to change it" - humans forget. The fix is automation (more on that below).
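One way to automate it is to generate robots.txt at build time from an environment variable, so the restrictive version can never ship to production by accident. Here's a minimal sketch; DEPLOY_ENV, the public/ output directory, and the sitemap URL are placeholder names to adapt to your own pipeline:

```shell
#!/bin/sh
# Sketch: generate public/robots.txt at build time from the environment.
# DEPLOY_ENV and the URLs below are placeholders - adapt to your pipeline.
set -eu
mkdir -p public

if [ "${DEPLOY_ENV:-staging}" = "production" ]; then
  # Production: allow crawling and advertise the sitemap
  printf 'User-agent: *\nAllow: /\n\nSitemap: https://yourdomain.com/sitemap.xml\n' > public/robots.txt
else
  # Anything else (staging, preview, local): block all crawlers
  printf 'User-agent: *\nDisallow: /\n' > public/robots.txt
fi

cat public/robots.txt
```

Because the default branch is the restrictive one, a missing or misspelled environment variable fails safe: staging stays blocked, and only an explicit production value opens the site to crawlers.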
The CDN cache that won't let go
You fix the robots.txt, deploy it, and verify it looks correct at your origin server. But your CDN is still serving the old, restrictive version from cache. Google keeps seeing Disallow: / for hours - or days, if your cache TTL is long.
# Bypassing the cache, you see the correct file
# (note: some CDNs ignore request Cache-Control headers -
# curl your origin server directly to be sure)
curl -H "Cache-Control: no-cache" https://yourdomain.com/robots.txt
# ✅ Shows: Allow: /
# But Google gets the cached edge response
curl https://yourdomain.com/robots.txt
# ❌ Still shows: Disallow: /
After fixing a robots.txt issue, always purge your CDN cache for that specific file. On most platforms:
- Vercel: Automatic on deploy, but verify with curl -I to check cache headers
- Cloudflare: Purge the specific URL in the dashboard or via API
- AWS CloudFront: Create an invalidation for /robots.txt
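A quick way to see whether you're looking at a cached copy is to inspect the response's cache headers. The header names vary by CDN (cf-cache-status is Cloudflare, x-cache is CloudFront), and yourdomain.com is a placeholder - this is a sketch, not a universal check:

```shell
#!/bin/sh
# Sketch: spot a stale CDN copy of robots.txt by its cache headers.
# Header names vary by CDN; yourdomain.com is a placeholder.
inspect_headers() {
  # Reads raw response headers on stdin, keeps only cache-related ones
  grep -iE '^(age|cache-control|cf-cache-status|x-cache):'
}

# Live usage (needs network access):
#   curl -sI https://yourdomain.com/robots.txt | inspect_headers

# Demo on a captured response:
printf 'HTTP/2 200\ncf-cache-status: HIT\nage: 86400\ncontent-type: text/plain\n' | inspect_headers
```

A HIT with a large age value after you've deployed a fix is the tell: the edge is still serving the old file, and you need a purge rather than another deploy.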
The framework upgrade that regenerates the file
You upgrade your framework or CMS, and the new version generates a robots.txt with different defaults. Your carefully configured file gets overwritten by a generic one.
This is especially common with:
- Next.js robots.ts files that get reset during major version upgrades
- WordPress updates that regenerate virtual robots.txt rules
- Headless CMS deploys where the build step generates a new robots.txt from a template
If your robots.txt is generated dynamically (like Next.js app/robots.ts), add a test that verifies the output matches your expectations. If it's a static file, make sure your build step doesn't overwrite it.
The domain migration nobody told SEO about
Your team moves from www.example.com to example.com. DNS is updated, redirects are in place, the site works. But the robots.txt on the new canonical domain either doesn't exist (404) or has the old restrictive rules.
Google treats robots.txt per domain. A working robots.txt on www.example.com does nothing for example.com. After any domain change, verify robots.txt is accessible and correct on the new domain.
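A post-migration check can loop over every hostname that should be live and flag any that don't serve robots.txt. This is a sketch with placeholder domains (example.com, www.example.com); the live curl loop is commented out because it needs network access:

```shell
#!/bin/sh
# Sketch: after a domain migration, confirm robots.txt resolves on every
# hostname that should be live. The domains here are placeholders.
verdict() {
  # $1 = hostname, $2 = HTTP status code
  if [ "$2" = "200" ]; then
    echo "$1: robots.txt OK ($2)"
  else
    echo "$1: robots.txt PROBLEM ($2)"
  fi
}

# Live check (needs network access):
# for host in example.com www.example.com; do
#   code=$(curl -s -o /dev/null -w '%{http_code}' "https://$host/robots.txt")
#   verdict "$host" "$code"
# done

# Demo with literal status codes:
verdict "example.com" 200
verdict "www.example.com" 404
```

Run it against the new canonical domain and any legacy hostnames that still receive traffic; a 404 on the canonical domain is exactly the failure mode described above.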
Why standard monitoring misses this
Robots.txt failures are uniquely hard to detect:
- No errors in your logs. The file returns 200. It's valid. It just says the wrong thing.
- No impact on user experience. Every visitor can browse the site normally. Only bots are affected.
- Delayed symptoms. Google doesn't de-index instantly. Traffic declines gradually over days, making it hard to correlate with a specific deploy.
- No alerts from uptime monitoring. Your monitoring checks that the site is up, not that robots.txt has the right content.
By the time someone notices the traffic drop, investigates in Google Search Console, identifies the crawl block, fixes the file, and waits for Google to re-crawl - you've lost weeks of organic traffic. And depending on your domain authority, it can take just as long to recover.
How to prevent this permanently
1. Add a robots.txt assertion to your CI pipeline
The single most effective prevention. Add a test that runs on every deploy and fails if robots.txt contains Disallow: /:
# In your CI pipeline (GitHub Actions, etc.)
ROBOTS=$(curl -s https://yourdomain.com/robots.txt)
if echo "$ROBOTS" | grep -q "Disallow: /$"; then
echo "❌ FATAL: robots.txt is blocking all crawlers"
exit 1
fi
echo "✅ robots.txt allows crawling"
For pre-deploy checks against build output:
# Check the built file before deploying
if grep -q "Disallow: /$" ./out/robots.txt 2>/dev/null; then
echo "❌ Build produced a restrictive robots.txt"
exit 1
fi
2. Use environment-aware generation
Don't use a static robots.txt file. Generate it dynamically based on the environment:
// Next.js: app/robots.ts
import type { MetadataRoute } from 'next';
export default function robots(): MetadataRoute.Robots {
const isProduction =
process.env.NEXT_PUBLIC_SITE_URL === 'https://yourdomain.com';
if (!isProduction) {
return {
rules: { userAgent: '*', disallow: '/' },
};
}
return {
rules: { userAgent: '*', allow: '/' },
sitemap: 'https://yourdomain.com/sitemap.xml',
};
}
This way, staging automatically blocks crawlers and production automatically allows them. No manual switching required.
3. Monitor the file content, not just the status code
Set up a recurring check that fetches yourdomain.com/robots.txt and alerts if the content changes or contains blocking rules. This can be:
- A LintPage scheduled scan that checks robots.txt as part of a full SEO audit
- A cron job that curls the file and sends a Slack alert if Disallow: / appears
- A GitHub Action on a schedule that runs the CI assertion against your live site
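Such a scheduled check can be a few lines of shell. In this sketch, SLACK_WEBHOOK_URL and yourdomain.com are placeholders, and the live curl calls are commented out; the demo runs the same logic on inline content:

```shell
#!/bin/sh
# Sketch of a scheduled content check: fail (and alert) when robots.txt
# blocks all crawlers. SLACK_WEBHOOK_URL and the domain are placeholders.
check_robots() {
  # Reads robots.txt content from stdin; returns 1 if it blocks everything
  if grep -q '^Disallow: /$'; then
    echo "BLOCKED"
    return 1
  fi
  echo "OK"
}

# Live usage (needs network access):
#   curl -s https://yourdomain.com/robots.txt | check_robots || \
#     curl -X POST -H 'Content-Type: application/json' \
#       -d '{"text":"robots.txt is blocking all crawlers!"}' "$SLACK_WEBHOOK_URL"

# Demo on inline content:
printf 'User-agent: *\nDisallow: /\n' | check_robots || true
printf 'User-agent: *\nAllow: /\n' | check_robots
```

The key difference from uptime monitoring is that this asserts on the file's content, not its status code - which is precisely what standard monitoring misses.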
4. Purge CDN cache on every deploy
Add a post-deploy step that invalidates the CDN cache for /robots.txt. This ensures Google always sees the latest version:
# GitHub Actions example (Cloudflare)
- name: Purge robots.txt cache
run: |
curl -X POST \
"https://api.cloudflare.com/client/v4/zones/$ZONE_ID/purge_cache" \
-H "Authorization: Bearer $CF_TOKEN" \
-H "Content-Type: application/json" \
-d '{"files":["https://yourdomain.com/robots.txt"]}'
Check yours right now
It takes 10 seconds to verify your production robots.txt is correct. Paste your URL below - if there's a problem, you'll know immediately.
Robots.txt Validator
Validate your robots.txt file for syntax errors and blocking rules.