Why ChatGPT Can't Find Your Site (Even Though robots.txt Says It Can)

Marius Orzaru · May 15, 2026 · 10 min read

Your robots.txt says GPTBot is welcome. Your server says 403.

Your robots.txt lists User-agent: GPTBot — Allow: /. The page loads fine in a browser. The "AI crawler" checkers say you're configured correctly. But every time ChatGPT-User actually fetches your site, it gets a 403. You don't show up in ChatGPT when people ask about your product. You don't show up in Perplexity. The standard tools can't see why, because they're reading the wrong file.

This is the most common AI crawler accessibility failure in 2026, and almost nothing on the open web explains it correctly. Most write-ups stop at "here are five user-agents, add them to your robots.txt." That's table stakes. The actual blocks happen one layer up — at the CDN, at the WAF, in the JS shell of an SPA — and you can configure robots.txt perfectly while still being invisible to every model that matters.

What's in this post

  • The three ways your site gets blocked
  • The bots that matter in 2026
  • The Cloudflare default block problem
  • The JS-rendering trap
  • The opt-outs that actually matter
  • How to test everything in 30 seconds

The three ways your site gets blocked

Three layers. They fail for different reasons, they need different fixes, and from the outside they all look the same: your site, missing from ChatGPT, no obvious cause. Most write-ups treat them as one thing. That's how readers end up patching the wrong layer.

Layer 1: robots.txt disallow (application layer)

The obvious case. Your robots.txt explicitly disallows an AI user-agent, or disallows * and never re-enabled the bots you actually wanted.

# Common failure mode: copied from a staging config
User-agent: *
Disallow: /
# Or the version that explicitly blocks AI bots
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

How to test: fetch /robots.txt directly and grep for AI user-agents. This is what every "AI crawler" tool already does. If this is your problem, the fix takes thirty seconds. The reason it gets so much airtime is that it's the easiest failure to detect and explain. Not because it's the most common. For more on robots.txt failure modes, see this post.
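
For anything beyond a grep, a parser gives a per-bot answer. The sketch below uses Python's standard-library robots.txt parser. `blocked_agents`, the `AI_AGENTS` list, and the `STAGING` sample are names made up for this illustration, and the standard-library parser applies rules in file order, which may differ slightly from how a given crawler resolves conflicts.

```python
from urllib.robotparser import RobotFileParser

# Tokens taken from the table later in this post; extend as needed.
AI_AGENTS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot",
    "Claude-User", "PerplexityBot", "Perplexity-User",
]


def blocked_agents(robots_txt, agents=AI_AGENTS, path="/"):
    """Return the agents that this robots.txt forbids from fetching `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, path)]


# The staging config from above: everything is disallowed for everyone.
STAGING = """\
User-agent: *
Disallow: /
"""

print(blocked_agents(STAGING))
```

Paste the text of your own /robots.txt in place of `STAGING`. An empty list only means Layer 1 is clean; it says nothing about the two layers below.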

Layer 2: CDN / WAF edge block

This is the failure mode that's killing 2026 AI visibility for most sites that "did everything right." Your origin never sees the request. Cloudflare, AWS WAF, Fastly, or a custom edge rule (the one someone added at 2am after a scraper incident and nobody has touched since) returns a 403 before robots.txt gets read.

The tell: your robots.txt is permissive. The bots get blocked anyway. Parsers say everything is fine.

# What a healthy response looks like
$ curl -A "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)" -I https://your-site.com
HTTP/2 200
content-type: text/html; charset=utf-8

# What an edge-level block looks like
$ curl -A "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)" -I https://your-site.com
HTTP/2 403
server: cloudflare
cf-ray: 8b9c2f1e4a8d3c12-FRA

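server: cloudflare and cf-ray appear on every response Cloudflare proxies, so on their own they only tell you the edge is in the path. The tell is the contrast: a 403 or 429 for the bot user-agent and a 200 for a browser user-agent from the same machine. A cf-mitigated header, when present, settles it. The same shape shows up on other providers, for example x-amz-cf-id when AWS WAF sits behind CloudFront, or a via / x-served-by header on Fastly. We'll go deep on Cloudflare below; it's the biggest source of silent blocks we see in practice.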
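
The comparison can be written down as a small decision function. This is a rough sketch, not a definitive detector: `edge_vendor` and `diagnose` are hypothetical helpers, and the header-to-vendor mapping is a heuristic we are assuming for illustration (CDNs add and rename headers over time).

```python
from typing import Dict, Optional


def edge_vendor(headers: Dict[str, str]) -> Optional[str]:
    """Guess which edge provider produced a response (heuristic, assumed mapping)."""
    h = {name.lower(): value.lower() for name, value in headers.items()}
    via = h.get("via", "")
    if "cf-ray" in h or h.get("server") == "cloudflare":
        return "cloudflare"
    if "x-amz-cf-id" in h or "cloudfront" in via:
        return "cloudfront"
    if "x-served-by" in h or "fastly" in via or "varnish" in via:
        return "fastly"
    return None


def diagnose(browser_status: int, bot_status: int, bot_headers: Dict[str, str]) -> str:
    """Compare a browser-UA fetch with a bot-UA fetch of the same URL."""
    if bot_status < 400:
        return "bot allowed"
    if browser_status >= 400:
        return "blocked for everyone"
    vendor = edge_vendor(bot_headers)
    if vendor:
        return f"bot-only block, likely at the edge ({vendor})"
    return "bot-only block, likely at the origin"


# The two curl responses shown above, as (status, headers).
print(diagnose(200, 403, {"server": "cloudflare", "cf-ray": "8b9c2f1e4a8d3c12-FRA"}))
```

The vendor guess is only a hint about where to look first. An origin behind Cloudflare that returns its own 403 will carry the same headers, which is why the browser-versus-bot contrast comes first in the function.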

Layer 3: origin or application block

The rest happens at your server. Less common than edge blocks. Easier to hide:

  • Custom user-agent filtering. Someone added if (ua.includes("Bot")) return 403 to middleware years ago. It catches GPTBot along with everything else.
  • Rate limiting. Per-IP limits hit AI crawlers harder than human traffic because the crawler IPs are concentrated in a handful of datacenters.
  • Geo-blocking. AI bots fetch from regions your geo rules don't trust.
  • JS-rendering invisibility. 200 OK, empty body, model walks away with nothing. Worth its own section, coming up.
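
The first bullet is easy to reproduce. Here is a Python rendering of that middleware check (the original is JavaScript; `naive_filter` is a name invented for this sketch), run against a handful of user-agent tokens:

```python
def naive_filter(user_agent: str) -> int:
    """Python equivalent of `if (ua.includes("Bot")) return 403`."""
    # Like String.prototype.includes, the `in` operator is case-sensitive.
    return 403 if "Bot" in user_agent else 200


for ua in ("GPTBot", "ClaudeBot", "PerplexityBot", "ChatGPT-User", "Googlebot"):
    print(f"{ua:15} -> {naive_filter(ua)}")
```

The case-sensitive match blocks GPTBot, ClaudeBot, and PerplexityBot, while ChatGPT-User and Googlebot (lowercase "bot") pass. A filter like this does not express a policy. It blocks whatever happens to match the spelling.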

How to test: curl your page with each AI user-agent and read the response body. Don't just check the status code. A 200 with no content is a 200 that means nothing to a language model.

The bots that matter in 2026

Most "AI crawler" lists copy each other and never explain what each bot actually does. Here's the practical version, sorted by what it costs you to block each one.

USER-AGENT              · PURPOSE                          · BLOCKING IMPACT                       · SEVERITY
─────────────────────────────────────────────────────────────────────────────────────────────────────────────
GPTBot                  · OpenAI training crawler          · Excluded from future GPT training     · low–medium *
ChatGPT-User            · ChatGPT live retrieval           · Invisible in ChatGPT answers          · CRITICAL
OAI-SearchBot           · ChatGPT Search index             · Excluded from ChatGPT Search          · HIGH
ClaudeBot               · Anthropic training crawler       · Excluded from Claude training         · low–medium *
Claude-User             · Claude live retrieval            · Invisible in Claude answers           · CRITICAL
anthropic-ai            · Legacy Anthropic token           · Same as ClaudeBot (older configs)     · low–medium *
PerplexityBot           · Perplexity index                 · Excluded from Perplexity              · HIGH
Perplexity-User         · Perplexity live retrieval        · Invisible to Perplexity queries       · CRITICAL
Google-Extended         · Gemini training + grounding      · Excluded from Gemini (not Search)     · HIGH
Applebot-Extended       · Apple Intelligence training      · Excluded from Apple AI features       · LOW
meta-externalagent      · Meta AI training / retrieval     · Excluded from Meta AI                 · MEDIUM
Bytespider              · ByteDance crawler                · Excluded from ByteDance AI products   · LOW

* Blocking a training crawler is a legitimate choice; lots of sites opt out and consider that fine. Blocking a live retrieval crawler is almost always an accident that destroys your AI visibility.

This distinction is the only AI crawler concept that actually matters. Everything else is footnotes. Two categories, opposite blast radius (and yes, the naming is genuinely awful — ChatGPT-User and GPTBot sound interchangeable, they aren't):

  • Training crawlers (GPTBot, ClaudeBot, Google-Extended, Applebot-Extended). They index your content for future model training. Opting out keeps you out of training data. Your live AI visibility doesn't move.
  • Live-retrieval crawlers (ChatGPT-User, Claude-User, Perplexity-User). They fetch a page right now because a human asked a question that needs it. Blocking these is what makes you invisible in the answer.

Every "should I block AI?" debate that skips this distinction is wasted oxygen. You can opt out of training and still show up in answers. The configuration is just different.
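
As a concrete version of "opt out of training, stay in answers", the sketch below generates a robots.txt with one group per bot and checks it with Python's standard-library parser. `build_robots` and `can_fetch` are helpers invented for this example, and the two bot lists simply mirror the bullets above. Remember that robots.txt only states the policy; the edge and origin layers still have to let the live-retrieval bots through.

```python
from urllib.robotparser import RobotFileParser

TRAINING = ["GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended"]
LIVE = ["ChatGPT-User", "Claude-User", "Perplexity-User"]


def build_robots(block, allow):
    """One group per bot: Disallow for `block`, Allow for `allow`."""
    groups = [f"User-agent: {name}\nDisallow: /\n" for name in block]
    groups += [f"User-agent: {name}\nAllow: /\n" for name in allow]
    return "\n".join(groups)  # blank line between groups


def can_fetch(robots_txt, agent, path="/"):
    """Ask the standard-library parser whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


ROBOTS = build_robots(block=TRAINING, allow=LIVE)
print(ROBOTS)
print({name: can_fetch(ROBOTS, name) for name in TRAINING + LIVE})
```

Bots with no matching group, Googlebot included, stay allowed because the file has no `User-agent: *` group.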

The Cloudflare default block problem

In July 2024, Cloudflare shipped a one-click "Block AI Scrapers and Crawlers" toggle on every plan, including free. In July 2025 it went further and began blocking AI crawlers by default on newly added domains. Either way, the block happens at the edge, runs before your origin sees the request, and bypasses robots.txt entirely. (Cloudflare announcement)

This single setting is likely responsible for more silent AI invisibility in 2026 than every misconfigured robots.txt combined. Three things make it especially destructive:

  1. It's on by default for many zones. Anyone who created a Cloudflare site in the last 18 months may have it enabled without knowing.
  2. Standard tools can't see it. Cloudflare blocks before your origin runs. robots.txt is served by your origin. Parsers only see what the origin says — they're talking to a server that has no idea the conversation happened.
  3. It blocks live retrieval too. It doesn't distinguish training crawlers from live-retrieval ones. ChatGPT-User gets the same 403 as GPTBot.

Picture Cloudflare as a bouncer at the door. The bouncer checks the user-agent on the ID and decides whether to let the request through. Your robots.txt is a sign on the wall inside the building. The bouncer never reads it. The bot never gets close enough to.

Anyone running a Cloudflare zone created since mid-2024 who hasn't checked their AI Audit settings should assume the bots are blocked until they've proven otherwise.

How to verify

Run the curl tests from the Layer 2 section above. If you see server: cloudflare with a 403 on bot user-agents but a 200 on a regular browser UA, this is what's happening:

# Browser UA — passes
$ curl -A "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36" -I https://your-site.com
HTTP/2 200

# AI bot UA — blocked at the edge
$ curl -A "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)" -I https://your-site.com
HTTP/2 403
server: cloudflare
cf-mitigated: challenge

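To confirm in the dashboard: Cloudflare → Security → Bots → Configure. Look at the AI Audit / "Block AI Scrapers and Crawlers" toggle and the Super Bot Fight Mode settings. Cloudflare renames and moves these settings from time to time, so search the dashboard for "AI" if the path doesn't match.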

How to fix

Three options, in increasing order of granularity:

1. Turn the global AI block off
   → Cloudflare → Security → Bots → turn off "Block AI Scrapers and Crawlers"
   → Use this if you want to be discoverable in all AI surfaces.

2. Allow specific bots, block the rest
   → Cloudflare → Security → WAF → Create a custom rule
   → Match: http.user_agent contains "ChatGPT-User"
   → Action: Skip
   → Useful if you want to block training but allow live retrieval.

3. Use the verified-bot allowlist
   → Cloudflare maintains a list of verified AI bots that bypass blocks.
   → Settings → Bots → Verified Bots → review which AI categories you trust.

The same pattern shows up across edge providers. AWS WAF has managed rule groups that block AI bots by user-agent (AWS-AWSManagedRulesBotControlRuleSet). Fastly customers have written custom VCL to do the same. If you're on any CDN, the question to ask is: is anything filtering by user-agent before my origin sees the request?

The JS-rendering trap

There's a fourth way to be invisible that isn't technically a block, and it's worse because every diagnostic lies. Your server returns 200. Your headers look healthy. The bot fetches your page and walks away with nothing.

AI crawlers don't run JavaScript. They read the initial HTML payload and stop. (Yes, Google's been rendering JS for search since 2019. The AI crawlers haven't caught up. They probably won't soon.) If your site is a client-rendered SPA, the bot is reading an empty <body> and a div#root that never gets populated.

# Server returns 200, but the body is empty
$ curl -s -A "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)" https://your-spa.com | wc -c
1247
# 1247 bytes — basically just the shell. The actual content is rendered by JS.

If you want to be more rigorous:

# Pipe the response through a text extractor and count actual content
$ curl -s -A "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)" https://your-spa.com \
    | sed 's/<[^>]*>//g' | tr -s '[:space:]' ' ' | wc -w
14
# 14 words of content visible to GPTBot. The page actually has 1,200.

Or just open dev tools, disable JavaScript, and reload. If your content disappears, the AI crawlers are seeing the same blank page.
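
The sed pipeline above also counts text inside script and style tags, which inflates the number on pages that inline JSON or CSS. A slightly stricter sketch with Python's built-in HTML parser is below. `VisibleText`, `visible_word_count`, and the two sample documents are made up for illustration, and the count still includes the title tag.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect words outside <script>, <style>, and <template> elements."""

    SKIP = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self.words = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.words.extend(data.split())


def visible_word_count(html: str) -> int:
    """Count whitespace-separated words a non-JS crawler could read."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return len(parser.words)


SPA_SHELL = (
    "<!doctype html><html><head><title>My App</title>"
    '<script>window.__DATA__ = {"plans": 3};</script></head>'
    '<body><div id="root"></div><script src="/main.js"></script></body></html>'
)
RENDERED = "<html><body><h1>Pricing</h1><p>Three plans, billed monthly.</p></body></html>"

print(visible_word_count(SPA_SHELL), visible_word_count(RENDERED))
```

Feed it the body you fetched with a bot user-agent. There is no universal threshold, but a count in the low double digits on a page you know has hundreds of words is the SPA-shell signature.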

The fix is server-rendering or static generation. In a Next.js App Router app, server components and SSG routes work by default. The trap isn't 'use client' itself: client components still get server-rendered on first request. It's client-side data fetching (useEffect, useQuery, useSWR), which produces SSR'd HTML with the shell rendered but no content. Move data fetching server-side and keep 'use client' at the leaf component that needs interactivity. Astro (static by default), Remix, and SvelteKit (both SSR by default) put content in the initial HTML unless you opt out. The pattern that breaks is the pure SPA: CRA, Vite without SSR, anything that ships an empty index.html and renders everything in the browser.

Not a quick fix. But if you've ruled out robots.txt and edge blocks and you're still invisible, this is probably what's happening. The bot can't see your content because there isn't any to see.

The opt-outs that actually matter

Not every AI bot deserves the same answer. Treating "block AI" as a binary instead of a per-bot judgment call is how you end up either too permissive (free training data, no upside) or too restrictive (invisible in answers you wanted to be in). A short opinionated breakdown:

Always allow live-retrieval bots. ChatGPT-User, Claude-User, Perplexity-User. There is no downside. These fetch your page only when a human is actively asking a question that points at your content. Blocking them is a self-inflicted wound.

Allow training bots only if you want to be in training data. GPTBot, ClaudeBot, anthropic-ai. Opting out is a legitimate choice that more sites are making, especially publishers and SaaS companies who'd rather not have their docs used as gradient updates. Your live AI visibility isn't affected either way.

Google-Extended is the complicated one. It isn't a separate crawler: it's a robots.txt token that tells Google whether content fetched by its regular crawlers may be used for Gemini training and grounding. Opting out keeps you out of Gemini's answers, which gets more costly as Gemini usage grows. It's separate from Googlebot: disallowing it has zero effect on your regular Google rankings, and AI Overviews, being a Search feature, follow your Googlebot and snippet controls rather than this token. Lots of sites that thought they were opting out of "AI" tank their Gemini visibility and change nothing about their search traffic. Which is dumb.

Applebot-Extended, meta-externalagent, Bytespider are lower-stakes. These bots feed AI surfaces with much smaller market share. Decide on principle, not blast radius.

The framing that helps: every AI bot is either a customer (live retrieval, sends users back to you) or a vendor (training, builds models that may or may not link back to you). Most sites should let the customers in.

How to test everything in 30 seconds

Testing all three layers manually is tedious. You need a curl loop with ten user-agents, a parser for cf-ray and similar headers, a body-size heuristic for the JS-rendering trap, and a way to cross-reference robots.txt rules. Every time another "AI crawler checker" parses a robots.txt and pronounces it fine, we lose a little hope. We built the AI Crawler Checker to do all of that in one scan.

It fetches your page as ten of the AI bots from the table above. Reports the real HTTP response, not the robots.txt claim. Flags Cloudflare and WAF-level blocks via response headers. Runs the body-content heuristic to catch SPA shells. If you've been getting clean reports from the other AI crawler tools and you're still not in ChatGPT, this is what you want to run.
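
If you'd rather see the moving parts first, the manual loop looks roughly like this. It is a minimal sketch and not how the checker is implemented: `fetch_as`, `scan`, and the 2048-byte threshold are assumptions for illustration, and apart from the GPTBot string used earlier in this post the user-agent values are bare tokens rather than the full strings the real crawlers send.

```python
import urllib.error
import urllib.request

BOT_UAS = {
    "GPTBot": "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
    "ChatGPT-User": "ChatGPT-User",        # bare token (assumption)
    "ClaudeBot": "ClaudeBot",              # bare token (assumption)
    "Claude-User": "Claude-User",          # bare token (assumption)
    "PerplexityBot": "PerplexityBot",      # bare token (assumption)
    "Perplexity-User": "Perplexity-User",  # bare token (assumption)
}


def fetch_as(url, user_agent, timeout=10.0):
    """Fetch `url` with the given User-Agent; return (status, headers, body)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, dict(response.headers), response.read()
    except urllib.error.HTTPError as error:  # 4xx/5xx still carry headers and a body
        return error.code, dict(error.headers), error.read()


def scan(url, fetch=fetch_as, min_bytes=2048):
    """Return a one-line verdict per bot. `min_bytes` is an arbitrary threshold."""
    report = {}
    for name, user_agent in BOT_UAS.items():
        status, _headers, body = fetch(url, user_agent)
        if status >= 400:
            report[name] = f"blocked ({status})"
        elif len(body) < min_bytes:
            report[name] = f"suspiciously small body ({len(body)} bytes)"
        else:
            report[name] = "ok"
    return report

# Usage (makes real network requests):
#     for bot, verdict in scan("https://your-site.com").items():
#         print(f"{bot:16} {verdict}")
```

Spoofing the user-agent from your own machine only exercises rules that match on the user-agent string. Blocks keyed on IP ranges or bot verification can behave differently for the real crawlers, so treat a clean result as necessary, not sufficient.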

While you're there, it's worth running the rest of the pre-launch SEO checklist. AI visibility issues tend to cluster with normal indexing issues. The same staging environment that ships a restrictive robots.txt to production also tends to ship noindex tags (47-day case study) and broken sitemaps. The full check registry lists everything LintPage scans for.

The 30-second version

Configuring robots.txt right in 2026 keeps you from being trivially invisible. It doesn't make you visible. The failure mode killing AI visibility for most sites isn't a missing robots.txt directive. It's an edge-level block they didn't know was on, or a JS-rendered page the crawler can't read. If robots.txt is the only thing you test, you're checking the layer where almost nothing actually goes wrong.

Test the live fetch. Run it under every bot UA you care about. And don't just check the status code — read what came back. Your site might be one Cloudflare toggle away from being invisible to half the web in 2027 — and that toggle is already there, waiting.

About the author

Marius Orzaru, Founder, LintPage (BludeskSoft)

I built LintPage after a single stray noindex tag slipped into production and quietly cost us 47 days of organic traffic. It now runs the 60 automated checks I wish we had run before that deploy.

FAQ

Is my site blocked from ChatGPT?
Possibly, and your robots.txt won't tell you. ChatGPT uses two crawlers: GPTBot (training) and ChatGPT-User (live retrieval when a user asks a question). Blocking ChatGPT-User makes you invisible in ChatGPT answers even if your robots.txt is permissive. The most common cause in 2026 is an edge-level block at Cloudflare, AWS WAF, or Fastly that runs before robots.txt is read. Test by running curl with the ChatGPT-User user-agent against your site and checking for a 403 response with a cf-ray or similar CDN header.

How do I allow AI crawlers on my website?
Three layers need to be checked. First, your robots.txt should explicitly allow the AI user-agents you want: GPTBot, ChatGPT-User, ClaudeBot, Claude-User, PerplexityBot, Perplexity-User, Google-Extended. Second, your CDN or WAF must not be blocking AI bots at the edge; on Cloudflare, check Security → Bots → AI Audit. Third, if your site is a JavaScript SPA, server-render your content so bots can read it without executing JS.

Why is GPTBot returning 403 even though robots.txt allows it?
Because something between OpenAI and your origin server is blocking the request before robots.txt is consulted. The most common culprit is Cloudflare's "Block AI Scrapers and Crawlers" toggle, which launched in mid-2024 and became the default for newly added domains in mid-2025. Other causes include AWS WAF managed bot rules, Fastly VCL filters, custom middleware that rejects bot user-agents, and aggressive per-IP rate limits. If curl returns a 403 for the bot user-agent and a 200 for a browser user-agent, and the response carries server: cloudflare or a cf-ray header, start there.

What's the difference between GPTBot and ChatGPT-User?
GPTBot is OpenAI's training crawler. It collects content that may be used to train future models. Opting out of GPTBot is a legitimate choice and is increasingly common. ChatGPT-User is the live retrieval agent that fetches a specific page when a ChatGPT user asks a question that requires it. Blocking ChatGPT-User has nothing to do with training data; it just makes you invisible in ChatGPT answers. The same training-vs-retrieval split applies to ClaudeBot vs Claude-User and PerplexityBot vs Perplexity-User.

Should I block AI crawlers?
It depends on which bot. Live-retrieval bots (ChatGPT-User, Claude-User, Perplexity-User) should almost always be allowed: they only fetch your page when a human is actively asking about your content, and blocking them is purely a self-inflicted wound. Training crawlers (GPTBot, ClaudeBot) are a legitimate choice either way: opting out keeps your content out of training data without affecting your AI search visibility. Google-Extended is a special case. It governs whether your content is used for Gemini training and grounding, it does not affect Google Search, and AI Overviews follow your normal Googlebot controls.

How do I test if Claude can read my website?
Use curl with the Claude-User user-agent to fetch your page and inspect the response status and body. A 200 response with meaningful body content means Claude can read your page. A 403 means something is blocking the request, usually your CDN. A 200 with an empty or near-empty body means your site is a client-rendered SPA and Claude is seeing the JS shell rather than your actual content. The LintPage AI Crawler Checker runs all three of these tests for ten AI bots at once.
