Your robots.txt says GPTBot is welcome. Your server says 403.
Your robots.txt lists User-agent: GPTBot — Allow: /. The page loads fine in a browser. The "AI crawler" checkers say you're configured correctly. But every time ChatGPT-User actually fetches your site, it gets a 403. You don't show up in ChatGPT when people ask about your product. You don't show up in Perplexity. The standard tools can't see why, because they're reading the wrong file.
This is the most common AI crawler accessibility failure in 2026, and almost nothing on the open web explains it correctly. Most write-ups stop at "here are five user-agents, add them to your robots.txt." That's table stakes. The actual blocks happen one layer up — at the CDN, at the WAF, in the JS shell of an SPA — and you can configure robots.txt perfectly while still being invisible to every model that matters.
What's in this post
- The three ways your site gets blocked
- The bots that matter in 2026
- The Cloudflare default block problem
- The JS-rendering trap
- The opt-outs that actually matter
- How to test everything in 30 seconds
The three ways your site gets blocked
Three layers. They fail for different reasons, they need different fixes, and from the outside they all look the same: your site, missing from ChatGPT, no obvious cause. Most write-ups treat them as one thing. That's how readers end up patching the wrong layer.
Layer 1: robots.txt disallow (application layer)
The obvious case. Your robots.txt explicitly disallows an AI user-agent, or disallows * and never re-enabled the bots you actually wanted.
# Common failure mode: copied from a staging config
User-agent: *
Disallow: /
# Or the version that explicitly blocks AI bots
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
How to test: fetch /robots.txt directly and grep for AI user-agents. This is what every "AI crawler" tool already does. If this is your problem, the fix takes thirty seconds. The reason it gets so much airtime is that it's the easiest failure to detect and explain. Not because it's the most common. For more on robots.txt failure modes, see this post.
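A quick way to run that check from a terminal (swap in your own domain; the grep pattern covers the user-agents discussed later in this post):

# Pull robots.txt and show every rule that mentions an AI crawler, plus the
# two lines after each match, which is usually where the Allow/Disallow lives
$ curl -s https://your-site.com/robots.txt \
  | grep -i -A 2 -E "gptbot|chatgpt-user|oai-searchbot|claudebot|claude-user|anthropic|perplexity|google-extended|applebot-extended|meta-externalagent|bytespider"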
Layer 2: CDN / WAF edge block
This is the failure mode that's killing 2026 AI visibility for most sites that "did everything right." Your origin never sees the request. Cloudflare, AWS WAF, Fastly, or a custom edge rule (the one someone added at 2am after a scraper incident and nobody has touched since) returns a 403 before robots.txt gets read.
The tell: your robots.txt is permissive. The bots get blocked anyway. Parsers say everything is fine.
# What a healthy response looks like
$ curl -A "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)" -I https://your-site.com
HTTP/2 200
content-type: text/html; charset=utf-8
# What an edge-level block looks like
$ curl -A "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)" -I https://your-site.com
HTTP/2 403
server: cloudflare
cf-ray: 8b9c2f1e4a8d3c12-FRA
server: cloudflare plus a 403 or 429 means the request died at the edge. The same shape shows up behind AWS (server: CloudFront and a WAF rule) or Fastly (an x-served-by or via header on the 403). We'll go deep on Cloudflare below; it's the biggest source of silent blocks by a wide margin.
Layer 3: origin or application block
The rest happens at your server. Less common than edge blocks. Easier to hide:
- Custom user-agent filtering. Someone added if (ua.includes("Bot")) return 403 to middleware years ago. It catches GPTBot along with everything else.
- Rate limiting. Per-IP limits hit AI crawlers harder than human traffic because the crawler IPs are concentrated in a handful of datacenters.
- Geo-blocking. AI bots fetch from regions your geo rules don't trust.
- JS-rendering invisibility. 200 OK, empty body, model walks away with nothing. Worth its own section, coming up.
How to test: curl your page with each AI user-agent and read the response body. Don't just check the status code. A 200 with no content is a 200 that means nothing to a language model.
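A rough loop that does this for several bots at once (a sketch; the UA strings are simplified, since most filters key on the bot name substring rather than the full vendor string):

# Status code and body size for each AI user-agent
$ for ua in GPTBot ChatGPT-User OAI-SearchBot ClaudeBot Claude-User PerplexityBot Perplexity-User; do
>   printf '%-16s ' "$ua"
>   curl -s -o /tmp/ai-body -w '%{http_code} ' -A "Mozilla/5.0 (compatible; $ua/1.0)" https://your-site.com
>   wc -c < /tmp/ai-body
> done

A 403 on any row is a block. A 200 with a tiny byte count is the JS-rendering trap covered below.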
The bots that matter in 2026
Most "AI crawler" lists copy each other and never explain what each bot actually does. Here's the practical version, sorted by what it costs you to block each one.
USER-AGENT · PURPOSE · BLOCKING IMPACT · SEVERITY
─────────────────────────────────────────────────────────────────────────────────────────────────────────────
GPTBot · OpenAI training crawler · Excluded from future GPT training · low–medium *
ChatGPT-User · ChatGPT live retrieval · Invisible in ChatGPT answers · CRITICAL
OAI-SearchBot · ChatGPT Search index · Excluded from ChatGPT Search · HIGH
ClaudeBot · Anthropic training crawler · Excluded from Claude training · low–medium *
Claude-User · Claude live retrieval · Invisible in Claude answers · CRITICAL
anthropic-ai · Legacy Anthropic UA · Same as Claude-User (older clients) · HIGH
PerplexityBot · Perplexity index · Excluded from Perplexity · HIGH
Perplexity-User · Perplexity live retrieval · Invisible to Perplexity queries · CRITICAL
Google-Extended · Gemini + AI Overviews · Excluded from Google's AI surfaces · HIGH
Applebot-Extended · Apple Intelligence training · Excluded from Apple AI features · LOW
meta-externalagent · Meta AI training / retrieval · Excluded from Meta AI · MEDIUM
Bytespider · ByteDance crawler · Excluded from ByteDance AI products · LOW
* Blocking a training crawler is a legitimate choice; lots of sites opt out and consider that fine. Blocking a live retrieval crawler is almost always an accident that destroys your AI visibility.
This distinction is the only AI crawler concept that actually matters. Everything else is footnotes. Two categories, opposite blast radius (and yes, the naming is genuinely awful — ChatGPT-User and GPTBot sound interchangeable, they aren't):
- Training crawlers (GPTBot, ClaudeBot, Google-Extended, Applebot-Extended). They index your content for future model training. Opting out keeps you out of training data. Your live AI visibility doesn't move.
- Live-retrieval crawlers (ChatGPT-User, Claude-User, Perplexity-User). They fetch a page right now because a human asked a question that needs it. Blocking these is what makes you invisible in the answer.
Every "should I block AI?" debate that skips this distinction is wasted oxygen. You can opt out of training and still show up in answers. The configuration is just different.
The Cloudflare default block problem
In July 2024, Cloudflare shipped a one-click "Block AI Scrapers and Crawlers" toggle, and a year later made it the default on new zones. It blocks at the edge, runs before your origin sees the request, and bypasses robots.txt entirely. (Cloudflare announcement)
This single setting is likely responsible for more silent AI invisibility in 2026 than every misconfigured robots.txt combined. Three things make it especially destructive:
- It's on by default for many zones. Anyone who created a Cloudflare site in the last 18 months may have it enabled without knowing.
- Standard tools can't see it. Cloudflare blocks before your origin runs. robots.txt is served by your origin, so parsers only see what the origin says — they're talking to a server that has no idea the conversation happened.
- It blocks live retrieval too. It doesn't distinguish training crawlers from live-retrieval ones. ChatGPT-User gets the same 403 as GPTBot.
Picture Cloudflare as a bouncer at the door. The bouncer checks the user-agent on the ID and decides whether to let the request through. Your robots.txt is a sign on the wall inside the building. The bouncer never reads it. The bot never gets close enough to.
Anyone running a Cloudflare zone created since mid-2024 who hasn't checked their AI Audit settings should assume the bots are blocked until they've proven otherwise.
How to verify
Run the curl tests from the previous section. If you see server: cloudflare with a 403 on bot user-agents but a 200 on a regular browser UA, this is what's happening:
# Browser UA — passes
$ curl -A "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36" -I https://your-site.com
HTTP/2 200
# AI bot UA — blocked at the edge
$ curl -A "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)" -I https://your-site.com
HTTP/2 403
server: cloudflare
cf-mitigated: challenge
To confirm in the dashboard: Cloudflare → Security → Bots → Configure. Look at the AI Audit / "Block AI Crawlers" toggle and the Super Bot Fight Mode settings.
How to fix
Three options, in increasing order of granularity:
1. Turn the global AI block off
→ Cloudflare → Security → Bots → uncheck "Block AI Scrapers"
→ Use this if you want to be discoverable in all AI surfaces.
2. Allow specific bots, block the rest
→ Cloudflare → Security → WAF → Create a custom rule
→ Match: cf.client.bot AND http.user_agent contains "ChatGPT-User"
→ Action: Skip
→ Useful if you want to block training but allow live retrieval (a fuller expression sketch follows this list).
3. Use the verified-bot allowlist
→ Cloudflare maintains a list of verified AI bots that bypass blocks.
→ Settings → Bots → Verified Bots → review which AI categories you trust.
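For option 2, here's a sketch of the Skip rule's expression in Cloudflare's rule language, widened to cover all three live-retrieval bots rather than just ChatGPT-User (verify the field names and the Skip scope against your dashboard before saving):

# Custom WAF rule, action: Skip
(http.user_agent contains "ChatGPT-User") or
(http.user_agent contains "Claude-User") or
(http.user_agent contains "Perplexity-User")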
The same pattern shows up across edge providers. AWS WAF has a managed Bot Control rule group (AWSManagedRulesBotControlRuleSet) that can block AI bots by user-agent. Fastly customers have written custom VCL to do the same. If you're on any CDN, the question to ask is: is anything filtering by user-agent before my origin sees the request?
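Whichever provider is in front, the response headers usually tell you whose edge answered the request. A quick fingerprint:

# cf-ray means Cloudflare, x-amz-cf-id means CloudFront, x-served-by usually means Fastly
$ curl -sI -A "Mozilla/5.0 (compatible; GPTBot/1.0)" https://your-site.com \
  | grep -iE "^(server|via|cf-ray|cf-mitigated|x-amz-cf-id|x-served-by):"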
The JS-rendering trap
There's a fourth way to be invisible that isn't technically a block, and it's worse because every diagnostic lies. Your server returns 200. Your headers look healthy. The bot fetches your page and walks away with nothing.
AI crawlers don't run JavaScript. They read the initial HTML payload and stop. (Yes, Google's been rendering JS for search since 2019. The AI crawlers haven't caught up. They probably won't soon.) If your site is a client-rendered SPA, the bot is reading an empty <body> and a div#root that never gets populated.
# Server returns 200, but the body is empty
$ curl -s -A "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)" https://your-spa.com | wc -c
1247
# 1247 bytes — basically just the shell. The actual content is rendered by JS.
If you want to be more rigorous:
# Pipe the response through a text extractor and count actual content
$ curl -s -A "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)" https://your-spa.com \
| sed 's/<[^>]*>//g' | tr -s '[:space:]' ' ' | wc -w
14
# 14 words of content visible to GPTBot. The page actually has 1,200.
Or just open dev tools, disable JavaScript, and reload. If your content disappears, the AI crawlers are seeing the same blank page.
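The same comparison works from the terminal with headless Chrome, which does execute the JavaScript (the binary name varies by platform: chrome, google-chrome, or chromium):

# What the crawler sees: raw HTML, no JS
$ curl -s https://your-spa.com | sed 's/<[^>]*>//g' | tr -s '[:space:]' ' ' | wc -w
14

# What a browser sees: the DOM after JavaScript has run
$ chrome --headless --dump-dom https://your-spa.com | sed 's/<[^>]*>//g' | tr -s '[:space:]' ' ' | wc -w
1200

# A large gap between those two numbers is the JS-rendering trap.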
The fix is server-rendering or static generation. In a Next.js app, server components and SSG routes work without intervention; the failure mode is reaching for 'use client' at the page level when you don't actually need it. Astro, Remix, and SvelteKit all ship full HTML by default, at build time or on the server. The pattern that breaks is the pure SPA: CRA, Vite without SSR, anything that ships an empty index.html and hydrates from there.
Not a quick fix. But if you've ruled out robots.txt and edge blocks and you're still invisible, this is probably what's happening. The bot can't see your content because there isn't any to see.
The opt-outs that actually matter
Not every AI bot deserves the same answer. Treating "block AI" as a binary instead of a per-bot judgment call is how you end up either too permissive (free training data, no upside) or too restrictive (invisible in answers you wanted to be in). A short opinionated breakdown:
Always allow live-retrieval bots. ChatGPT-User, Claude-User, Perplexity-User. There is no downside. These fetch your page only when a human is actively asking a question that points at your content. Blocking them is a self-inflicted wound.
Allow training bots only if you want to be in training data. GPTBot, ClaudeBot, anthropic-ai. Opting out is a legitimate choice that more sites are making, especially publishers and SaaS companies who'd rather not have their docs used as gradient updates. Your live AI visibility isn't affected either way.
Google-Extended is the complicated one. It controls Gemini and AI Overviews. Opting out keeps you out of those surfaces, which is increasingly costly as AI Overviews show up on a growing share of Google searches. Google-Extended is separate from Googlebot, so disallowing the former has zero effect on your regular Google rankings. Plenty of sites block it thinking they're opting out of "AI" in general; all they actually do is tank their Gemini visibility while their search traffic stays exactly the same. The worst of both trades.
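If you decide the trade is worth it, the opt-out is a single robots.txt group. Google-Extended is a token that Googlebot honors, not a separate crawler, so normal indexing is untouched:

# Regular Google Search: unaffected (Googlebot is not mentioned here)
# Google's AI surfaces: opted out
User-agent: Google-Extended
Disallow: /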
Applebot-Extended, meta-externalagent, Bytespider are lower-stakes. These bots feed AI surfaces with much smaller market share. Decide on principle, not blast radius.
The framing that helps: every AI bot is either a customer (live retrieval, sends users back to you) or a vendor (training, builds models that may or may not link back to you). Most sites should let the customers in.
How to test everything in 30 seconds
Testing all three layers manually is tedious. You need a curl loop with a dozen user-agents, a parser for cf-ray and similar headers, a body-size heuristic for the JS-rendering trap, and a way to cross-reference robots.txt rules. Every time another "AI crawler checker" parses a robots.txt and pronounces it fine, we lose a little hope. We built the AI Crawler Checker to do all of that in one scan.
It fetches your page as each of the bots in the table above. Reports the real HTTP response, not the robots.txt claim. Flags Cloudflare and WAF-level blocks via response headers. Runs the body-content heuristic to catch SPA shells. If you've been getting clean reports from the other AI crawler tools and you're still not in ChatGPT, this is what you want to run.
While you're there, it's worth running the rest of the pre-launch SEO checklist. AI visibility issues tend to cluster with normal indexing issues. The same staging environment that ships a restrictive robots.txt to production also tends to ship noindex tags (47-day case study) and broken sitemaps. The full check registry lists everything LintPage scans for.
The 30-second version
Configuring robots.txt right in 2026 keeps you from being trivially invisible. It doesn't make you visible. The failure mode killing AI visibility for most sites isn't a missing robots.txt directive. It's an edge-level block they didn't know was on, or a JS-rendered page the crawler can't read. If robots.txt is the only thing you test, you're checking the layer where almost nothing actually goes wrong.
Test the live fetch. Run it under every bot UA you care about. And don't just check the status code — read what came back. Your site might be one Cloudflare toggle away from being invisible to half the web in 2027 — and that toggle is already there, waiting.