Robots.txt AI Bot Checker

Check whether GPTBot, ClaudeBot, PerplexityBot, Google-Extended, and 8 more AI crawlers can access any domain. Results in seconds, with primary docs cited.

What this tool does (60 seconds)

Enter any domain. We fetch its robots.txt file and parse it against the 12 AI crawlers operated by OpenAI, Anthropic, Perplexity, Google, Apple, ByteDance, and Meta. You see a clear table of which AI bots are allowed, partially blocked, or fully blocked. Every bot links to its primary vendor documentation. No login, no rate limits for reasonable use, no data stored.
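A rough sketch of this kind of check, not the tool's actual implementation: it parses a robots.txt string with Python's standard-library `urllib.robotparser` and labels each bot by probing a few paths. The `PROBE_PATHS` list, the `classify` helper, and the sample file are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Bot names come from the table on this page; the probe paths are hypothetical.
AI_BOTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot"]
PROBE_PATHS = ["/", "/blog/", "/docs/", "/private/"]


def classify(robots_txt, agent, paths=PROBE_PATHS):
    """Label a bot by how many of the probe paths it may fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = [parser.can_fetch(agent, path) for path in paths]
    if all(allowed):
        return "allowed"
    if not any(allowed):
        return "fully blocked"
    return "partially blocked"


# Hypothetical robots.txt used only to exercise the three outcomes.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /private/

User-agent: *
Allow: /
"""

report = {bot: classify(SAMPLE, bot) for bot in AI_BOTS}
```

Probing a fixed path list is a simplification: a bot is only reported as partially blocked if one of the probed paths happens to hit a Disallow rule.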

The 12 AI crawlers this tool checks (verified June 2026)

Every crawler links to its vendor's primary documentation — verify before publishing robots.txt changes. AI vendors update crawler names and purposes periodically.

Bot name	Vendor	Purpose	Notes + primary doc
GPTBot	OpenAI	Training (used to improve OpenAI models)	Distinct from OAI-SearchBot and ChatGPT-User: blocking GPTBot does NOT block live ChatGPT search retrieval. docs →
OAI-SearchBot	OpenAI	Live search retrieval for ChatGPT Search	Separate from GPTBot. If you want to be cited by ChatGPT Search, allow OAI-SearchBot. docs →
ChatGPT-User	OpenAI	User-fetched browsing (when a ChatGPT user asks ChatGPT to read a specific URL)	Blocking ChatGPT-User stops ChatGPT from fetching your page when a user asks it to read your URL. docs →
ClaudeBot	Anthropic	Training (general crawler for Claude model training)	Anthropic distinguishes 3 crawlers: ClaudeBot (training), Claude-User (user-fetched), Claude-SearchBot (live search). docs →
Claude-User	Anthropic	User-fetched browsing inside Claude.ai	Mirrors ChatGPT-User. Blocking stops Claude from fetching your page when a user asks it to read your URL. docs →
Claude-SearchBot	Anthropic	Live search retrieval for Claude (when Claude searches the web in real time)	Added 2024-2025 as Claude expanded live search capability. Allow to be cited in Claude answers. docs →
PerplexityBot	Perplexity AI	Crawl + index for Perplexity answers	Perplexity also operates Perplexity-User for user-initiated fetches. docs →
Perplexity-User	Perplexity AI	User-fetched browsing	Per Perplexity docs, Perplexity-User does not respect robots.txt by design (treats as user-driven). Blocking via robots.txt may not work; use firewall rules instead. docs →
Google-Extended	Google	Opt-out flag for Gemini & Vertex AI training	NOT a separate crawler. It's a control token. Blocking Google-Extended does not affect Googlebot indexing — your search rankings are safe. docs →
Applebot-Extended	Apple	Opt-out flag for Apple Intelligence training	Mirrors Google-Extended pattern. Added 2024. Does not affect Applebot indexing for Siri/Spotlight. docs →
Bytespider	ByteDance (TikTok)	Training for ByteDance LLMs (Doubao)	Among the most aggressive crawlers. Widely blocked by publishers concerned about training without compensation. docs →
Meta-ExternalAgent	Meta	Training for Llama models	Added 2024. Distinct from facebookexternalhit (link previews) and Meta-ExternalFetcher. docs →

What is robots.txt?

Robots.txt is a plain-text file at the root of your domain that tells crawlers which parts of your site they may access. It lives at https://yourdomain.com/robots.txt, always at the root and never in a subdirectory. The file uses two kinds of directives:

User-agent: — which crawler the rules apply to (or * for all)
Disallow: / Allow: — which paths they may or may not fetch

Originally defined in 1994 (the Robots Exclusion Protocol by Martijn Koster), it was formalized as an IETF standard in 2022 as RFC 9309. Robots.txt is advisory: it's a request, not enforcement. Well-behaved crawlers (Googlebot, Bingbot, GPTBot, ClaudeBot, PerplexityBot) respect it. Malicious crawlers ignore it entirely. For genuine blocking, use server-level rules (Cloudflare Bot Fight Mode, firewall rules, IP-based blocks).

Example robots.txt

# Allow all standard crawlers
User-agent: *
Allow: /

# Block AI training crawlers
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Bytespider
User-agent: Meta-ExternalAgent
Disallow: /

# Allow AI live-retrieval bots (for citation visibility)
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
Allow: /

# Sitemap declaration (only documented hook for AI bot sitemap discovery)
Sitemap: https://yourdomain.com/sitemap.xml
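If you maintain the bot lists in code, a file shaped like the example above can be generated and sanity-checked. The sketch below is one possible approach; `build_robots_txt` is a hypothetical helper, and the round-trip check uses Python's standard-library parser rather than any vendor's crawler logic.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(blocked, allowed, sitemap_url):
    """Render a robots.txt that mirrors the layout of the example above."""
    lines = ["User-agent: *", "Allow: /", ""]
    if blocked:
        # Stacked User-agent lines share the single rule that follows them.
        lines += [f"User-agent: {bot}" for bot in blocked]
        lines += ["Disallow: /", ""]
    if allowed:
        lines += [f"User-agent: {bot}" for bot in allowed]
        lines += ["Allow: /", ""]
    lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(
    blocked=["GPTBot", "ClaudeBot", "Bytespider", "Meta-ExternalAgent"],
    allowed=["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"],
    sitemap_url="https://yourdomain.com/sitemap.xml",
)

# Round-trip check: parse the generated text and query a few bots.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
gptbot_ok = parser.can_fetch("GPTBot", "/")
searchbot_ok = parser.can_fetch("OAI-SearchBot", "/")
```

Here `gptbot_ok` comes out false and `searchbot_ok` true, matching the intent of the example file.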

Should you block AI bots? The honest answer

Three honest scenarios. There is no single right answer — it depends on your business model and whether AI citation traffic is a net positive for you.

Block — Paywalled news, premium research, original journalism

If your business model depends on subscription revenue or syndication fees, allowing GPTBot, ClaudeBot, and Bytespider to train on your archive lets them compete with you using your own work. Major publishers (NYT, Reuters, BBC) have moved toward selective blocking + licensing deals. Block training bots (GPTBot, ClaudeBot, Bytespider, Meta-ExternalAgent, Google-Extended, Applebot-Extended); selectively allow live-retrieval bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot) if you want AI citation traffic.

Allow — B2B SaaS, agencies, consultants wanting AI citation traffic

If being cited in ChatGPT, Perplexity, or Claude answers drives qualified inbound, blocking AI crawlers is shooting yourself in the foot. Forrester 2026 B2B Buyer Journey: ~84% of B2B buyers consult AI assistants before talking to vendors. Allow GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-SearchBot, Claude-User, PerplexityBot. Block Bytespider and Meta-ExternalAgent only if you have specific concerns (low-quality training, scrape volume).

Conditional — E-commerce, marketplaces, mid-funnel content sites

AI citation drives discovery but can also commoditize product comparisons. Allow live-retrieval bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot) so your category pages can be cited. Consider blocking training bots (GPTBot, ClaudeBot, Bytespider) selectively for product pages where you don't want competitors' AI assistants quoting your specs. Test with the tool above on competitor sites — most large e-commerce brands have NOT blocked AI bots as of mid-2026.

8 common robots.txt mistakes (and what fixes them)

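Thinking User-agent: * blocks AI bots

It does, but only for bots that have no group of their own. A crawler follows the most specific User-agent group that names it and ignores the `User-agent: *` group entirely; the two are not merged. If you have both a `User-agent: *` group with `Disallow: /` and a `User-agent: GPTBot` group with `Allow: /`, GPTBot is allowed. Order doesn't matter; specificity does.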
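The specificity rule can be checked with a short script. This is a minimal sketch using Python's standard-library `urllib.robotparser`; the `can_fetch` wrapper and the two sample files are assumptions for illustration, and the standard-library parser matches agent names by substring, which is slightly looser than RFC 9309.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt, agent, path="/"):
    """Parse a robots.txt string and ask whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# Same two groups, written in opposite order.
WILDCARD_FIRST = """\
User-agent: *
Disallow: /

User-agent: GPTBot
Allow: /
"""

SPECIFIC_FIRST = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /
"""

results = {
    name: (can_fetch(text, "GPTBot"), can_fetch(text, "ClaudeBot"))
    for name, text in [("wildcard_first", WILDCARD_FIRST),
                       ("specific_first", SPECIFIC_FIRST)]
}
```

Both orderings give the same answer: GPTBot is allowed through its own group, while ClaudeBot falls back to the wildcard group and is blocked.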

Blocking GPTBot but forgetting OAI-SearchBot

GPTBot is OpenAI's training crawler. OAI-SearchBot is OpenAI's live retrieval crawler for ChatGPT Search. They're separate. If you want to block training but allow citations in ChatGPT Search, block GPTBot only and allow OAI-SearchBot explicitly.

Blocking Google-Extended expecting it to affect Gemini

Google-Extended is a control flag, not a crawler. It tells Google not to use your content for Gemini and Vertex AI training. It does NOT block Gemini from citing you in live answers — that's a separate signal. And it does NOT affect Googlebot search indexing.

Trying to block Perplexity-User via robots.txt

Per Perplexity's own docs, Perplexity-User does not respect robots.txt by design (it treats user-initiated fetches as out of scope for robots.txt). Use server-level rules (Cloudflare bot fight mode, firewall rules) if you genuinely need to block it.
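As an illustration of a server-level rule, here is a minimal WSGI middleware sketch that rejects requests by User-Agent substring. The `block_agents` wrapper, the token list, and the demo app are hypothetical; User-Agent headers can be spoofed, so a production firewall rule would usually also check the vendor's published IP ranges where available.

```python
# Hypothetical block list: lowercase substrings of the User-Agent header.
BLOCKED_AGENT_TOKENS = ("perplexity-user",)


def block_agents(app, tokens=BLOCKED_AGENT_TOKENS):
    """Wrap a WSGI app and answer 403 for matching User-Agent headers."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in tokens):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware


def demo_app(environ, start_response):
    """Stand-in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]


def status_for(user_agent):
    """Return the status line the wrapped demo app sends for a User-Agent."""
    seen = []
    wrapped = block_agents(demo_app)
    wrapped({"HTTP_USER_AGENT": user_agent},
            lambda status, headers: seen.append(status))
    return seen[0]
```

The same idea maps onto a CDN or firewall rule; the middleware form is only meant to show the matching logic.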

Relying on Crawl-delay to slow AI bots

Crawl-delay is not part of RFC 9309, so crawlers are free to ignore it. OpenAI explicitly ignores Crawl-delay in robots.txt for GPTBot, and most other AI crawlers ignore it too. If you need rate limiting, use server-side throttling (Cloudflare, Vercel firewall rules), not robots.txt.
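Server-side throttling can take many forms. As one hedged example, the sketch below keeps a sliding window of request timestamps per User-Agent; the `AgentRateLimiter` class and its parameters are hypothetical, and a real deployment would more likely rely on the CDN's or firewall's built-in rate limiting.

```python
from collections import defaultdict, deque


class AgentRateLimiter:
    """Allow at most `max_requests` per agent within `window_seconds`."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # agent -> timestamps of allowed hits

    def allow(self, agent, now):
        """Return True if the request at time `now` (seconds) may proceed."""
        hits = self._hits[agent.lower()]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


limiter = AgentRateLimiter(max_requests=2, window_seconds=10)
decisions = [
    limiter.allow("GPTBot", 0.0),
    limiter.allow("GPTBot", 1.0),
    limiter.allow("GPTBot", 2.0),   # third hit inside the window
    limiter.allow("GPTBot", 10.5),  # first hit has aged out by now
]
```

Passing the timestamp in explicitly keeps the sketch deterministic; a live server would pass `time.monotonic()` instead.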

Robots.txt blocks GPTBot but your site has llms.txt allowing everything

Robots.txt is access control; llms.txt is content curation. They serve different purposes. Make them consistent: if you block GPTBot in robots.txt, don't publish an llms.txt that points GPTBot at your premium content.

No Sitemap directive: AI bots have no documented hook to find your sitemap

Add `Sitemap: https://yourdomain.com/sitemap.xml` to robots.txt. This is the only documented hook for AI crawlers (GPTBot, ClaudeBot, PerplexityBot) to discover sitemaps. Without it they may miss large portions of your site.
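To confirm the directive is present, the Sitemap lines can be read back out of the file. A minimal sketch, assuming Python 3.8 or later (where `RobotFileParser.site_maps()` is available); `declared_sitemaps` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser


def declared_sitemaps(robots_txt):
    """Return the Sitemap URLs declared in a robots.txt string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


WITH_SITEMAP = """\
User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
"""

WITHOUT_SITEMAP = """\
User-agent: *
Allow: /
"""

found = declared_sitemaps(WITH_SITEMAP)
missing = declared_sitemaps(WITHOUT_SITEMAP)
```

An empty list is the signal to add the directive.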

Robots.txt returns 5xx instead of 404 when missing

If you have no robots.txt, your server should return 404 (treated as 'allow everything'). A 5xx response is not equivalent: RFC 9309 tells crawlers to treat an unreachable robots.txt as a complete disallow, so compliant crawlers may stop crawling your site until the error clears. Verify with our tool above.
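The status-code handling in RFC 9309 can be summarized as a small lookup. This is a simplified sketch: `policy_for_status` and its return labels are hypothetical names, and it leaves out details such as redirect limits and caching that the RFC also covers.

```python
def policy_for_status(status):
    """Map the HTTP status of /robots.txt to a crawl policy (RFC 9309, simplified)."""
    if 200 <= status <= 299:
        return "parse"            # success: read and apply the rules
    if 300 <= status <= 399:
        return "follow-redirect"  # follow the redirect, then re-evaluate
    if 400 <= status <= 499:
        return "allow-all"        # "unavailable": crawler may access everything
    if 500 <= status <= 599:
        return "disallow-all"     # "unreachable": assume complete disallow
    return "unknown"


summary = {code: policy_for_status(code) for code in (200, 301, 404, 503)}
```

The gap between 404 and 503 is the point here: the first opens the whole site to crawlers, the second closes it.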

Robots.txt vs llms.txt vs sitemap.xml

Three files at your domain root, three different purposes. Most well-configured 2026 sites have all three.

Dimension	robots.txt	llms.txt	sitemap.xml
Purpose	Access control	Content curation for LLMs	URL discovery
Status	IETF RFC 9309 (2022)	Emerging spec (llmstxt.org, 2024)	Sitemap Protocol 0.9 (2008)
Adoption	Universal	Partial (Anthropic referenced; OpenAI no formal commitment)	Universal
Format	Plain text (User-agent / Allow / Disallow)	Markdown	XML
Location	/robots.txt	/llms.txt	/sitemap.xml (declared in robots.txt)
Affects	What crawlers may fetch	What LLMs prioritize	Which URLs crawlers discover

Frequently asked questions

What is robots.txt?

Robots.txt is a plain-text file at the root of your domain (https://yourdomain.com/robots.txt) that tells web crawlers which parts of your site they may access. It uses two directives: User-agent (which crawler) and Disallow / Allow (what they can fetch). Originally defined in 1994 (the Robots Exclusion Protocol), it was formalized as an IETF standard in 2022 (RFC 9309). Robots.txt is advisory — it's a request, not enforcement. Well-behaved crawlers (Googlebot, Bingbot, GPTBot, ClaudeBot, PerplexityBot) respect it. Malicious crawlers ignore it.

What's the difference between GPTBot, OAI-SearchBot, and ChatGPT-User?

GPTBot is OpenAI's training crawler — it fetches pages to improve future OpenAI models. OAI-SearchBot is OpenAI's live retrieval crawler for ChatGPT Search — it fetches pages in real time to answer user queries. ChatGPT-User is the user-fetched browsing agent — triggered when a ChatGPT user asks ChatGPT to read a specific URL. They're three separate crawlers with different purposes. Block GPTBot to opt out of training; allow OAI-SearchBot and ChatGPT-User to remain visible in ChatGPT answers. Source: platform.openai.com/docs/bots.

What's the difference between ClaudeBot, Claude-User, and Claude-SearchBot?

Anthropic operates three crawlers, mirroring OpenAI's pattern. ClaudeBot is training; Claude-User is user-fetched browsing inside Claude.ai; Claude-SearchBot is live search retrieval added through 2024-2025 as Claude expanded web search capability. They're listed at support.anthropic.com/en/articles/8896518. To be cited in Claude answers, allow Claude-SearchBot and Claude-User. If you want to opt out of training only, block ClaudeBot but allow the other two.

Why would I block AI crawlers?

Three legitimate reasons. (1) Compensation: your content is the product (paywalled news, premium research, syndication), and you don't want AI vendors training on it without paying — the NYT, Reuters, and BBC pattern. (2) Quality control: low-quality crawlers (Bytespider has been flagged for aggressive scraping) can degrade site performance or scrape at undesirable volumes. (3) Brand control: in highly regulated industries (healthcare, finance), you may not want AI-paraphrased versions of your content circulating without your context. For most B2B SaaS and agency sites, blocking AI bots is a net loss because it eliminates AI citation traffic.

Does blocking AI bots hurt my Google SEO rankings?

No. Googlebot (search indexing) and Google-Extended (Gemini AI training) are separate. Blocking Google-Extended does not affect your position in Google search results; Google has stated this explicitly in their crawler docs. Same logic for GPTBot (separate from Googlebot), ClaudeBot, and PerplexityBot: they have nothing to do with traditional search ranking. The only SEO-relevant crawlers are Googlebot, Bingbot, and YandexBot (depending on your market). AI crawlers affect AI citation visibility, not search ranking.

If a bot isn't listed in my robots.txt, is it allowed?

Yes, by default. The Robots Exclusion Protocol treats absence as permission. If robots.txt doesn't mention a User-agent AND has no `User-agent: *` Disallow block matching the requested path, the bot is implicitly allowed to crawl. Only explicit Disallow directives block crawling. This means if you want to block GPTBot, you must add an explicit `User-agent: GPTBot` + `Disallow: /` block — relying on `User-agent: *` works but is overbroad (it blocks everyone).

How do I block a specific AI bot?

Add this to your robots.txt:

User-agent: GPTBot
Disallow: /

Replace `GPTBot` with the exact User-agent name. RFC 9309 specifies case-insensitive matching, but some implementations are stricter, so use the exact casing from the vendor's docs. Place the file at the root of your domain at https://yourdomain.com/robots.txt. You can stack multiple User-agent lines above one rule:

User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Bytespider
Disallow: /

Use our tool above to verify the rules took effect.

How do I allow only specific paths to AI bots?

Use targeted Allow + Disallow combinations:

User-agent: GPTBot
Allow: /docs/
Allow: /blog/
Disallow: /

This allows GPTBot to crawl /docs/ and /blog/ but blocks everything else, because under RFC 9309 the longest matching path wins. Useful for sites that want their public content indexed by AI but want to protect customer dashboards, paywalled archives, or pricing pages.
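RFC 9309 resolves overlapping Allow and Disallow rules by picking the longest matching path, with Allow winning a tie. The sketch below implements that rule for plain path prefixes only; `is_allowed` is a hypothetical helper and it deliberately ignores the `*` and `$` wildcards. Some parsers, Python's built-in `urllib.robotparser` among them, apply the first matching rule in file order instead, so rule order can matter there.

```python
def is_allowed(rules, path):
    """Longest-match evaluation over (directive, prefix) pairs.

    Simplification: prefixes are literal; `*` and `$` are not interpreted.
    """
    best_length = -1
    best_allow = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        allow = directive.lower() == "allow"
        longer = len(prefix) > best_length
        tie_won_by_allow = len(prefix) == best_length and allow
        if longer or tie_won_by_allow:
            best_length = len(prefix)
            best_allow = allow
    return best_allow


# The GPTBot group from the answer above.
GPTBOT_RULES = [
    ("Allow", "/docs/"),
    ("Allow", "/blog/"),
    ("Disallow", "/"),
]

checks = {path: is_allowed(GPTBOT_RULES, path)
          for path in ("/docs/api", "/blog/post-1", "/pricing", "/")}
```

Reversing `GPTBOT_RULES` gives the same results, which is the practical benefit of longest-match over first-match evaluation.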

What's the difference between robots.txt, llms.txt, and sitemap.xml?

Robots.txt is access control — tells crawlers what they may fetch. Llms.txt is content curation — tells LLMs which content matters most (emerging spec; partial adoption). Sitemap.xml is URL discovery — lists every page worth crawling. They're complementary. Most well-configured sites have all three. Use our /tools/sitemap-checker for sitemap.xml validation and /tools/llms-txt-generator to create an llms.txt.

How often should I update robots.txt?

Whenever you change site structure, launch a new section, or want to update AI bot policy. AI vendors add new crawlers periodically (Anthropic added Claude-SearchBot through 2024-2025; Apple added Applebot-Extended in 2024). Check vendor docs every 3-6 months. Also re-test with our tool whenever you change robots.txt: a typo in a User-agent name silently turns a block into a no-op, which is a common source of accidentally public content.
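Typos in User-agent names can be caught with a fuzzy comparison against the known bot list. A minimal sketch using the standard-library `difflib`; the `find_suspect_agents` helper and the 0.8 similarity cutoff are assumptions chosen for illustration, not a tested threshold.

```python
import difflib

# The 12 names from the table on this page.
KNOWN_BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "Claude-User",
    "Claude-SearchBot", "PerplexityBot", "Perplexity-User", "Google-Extended",
    "Applebot-Extended", "Bytespider", "Meta-ExternalAgent",
]


def find_suspect_agents(robots_txt, cutoff=0.8):
    """Map each near-miss User-agent value to the known name it resembles."""
    known_lower = {name.lower(): name for name in KNOWN_BOTS}
    suspects = {}
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        field, sep, value = line.partition(":")
        if not sep or field.strip().lower() != "user-agent":
            continue
        agent = value.strip()
        if agent == "*" or agent.lower() in known_lower:
            continue  # wildcard or an exact (case-insensitive) match
        close = difflib.get_close_matches(
            agent.lower(), list(known_lower), n=1, cutoff=cutoff)
        if close:
            suspects[agent] = known_lower[close[0]]
    return suspects


# Hypothetical file with one misspelled name.
SAMPLE = """\
User-agent: GPT-Bot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Googlebot
Allow: /
"""

suspects = find_suspect_agents(SAMPLE)
```

Unrelated crawlers such as Googlebot fall below the cutoff and are left alone; only names that closely resemble a known AI bot are flagged.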

Sources

IETF RFC 9309 — Robots Exclusion Protocol (2022). datatracker.ietf.org/doc/html/rfc9309
OpenAI bot documentation — GPTBot, OAI-SearchBot, ChatGPT-User. platform.openai.com/docs/bots
Anthropic crawler documentation — ClaudeBot, Claude-User, Claude-SearchBot. support.anthropic.com
Perplexity bot documentation — PerplexityBot, Perplexity-User. docs.perplexity.ai/guides/bots
Google crawler documentation — Googlebot, Google-Extended. developers.google.com
Apple Applebot documentation — Applebot, Applebot-Extended (2024). support.apple.com/en-us/119829
ByteDance Bytespider documentation. support.tiktok.com
Meta crawler documentation — Meta-ExternalAgent. developers.facebook.com
Forrester 2026 B2B Buyer Journey research — ~84% of B2B buyers consult AI assistants before vendors. forrester.com
llmstxt.org — proposed llms.txt spec (Jeremy Howard / Answer.AI, 2024). llmstxt.org

Related free tools

Sitemap Checker: 10 spec-grounded sitemap.xml checks
Meta Tags Extractor: see every meta tag on a page
Twitter / X Card Validator: visual preview + 10 checks
JSON-LD Schema Generator: Article, FAQ, Product
llms.txt Generator: AI-curation file
AI Search Visibility Audit: 250+ checks per page

Full AI SEO audit

Want to know what AI engines actually say about you?

The bot checker tells you what's allowed. TurboAudit tells you whether AI engines actually mention you — and what to fix to get cited. 5 free audits, no credit card.