Deep Dive

AI Crawlers & robots.txt: Complete Reference Guide

All 14 AI crawlers, their roles (training vs. citation), and how to configure robots.txt to control AI crawler access without blocking citations.

TurboAudit Team · February 18, 2026 · 11 min read

The #1 robots.txt mistake

Blocking GPTBot does NOT stop ChatGPT from citing your pages. GPTBot collects training data. OAI-SearchBot powers live citations. They are separate crawlers with separate purposes.

Complete AI Crawler Reference

As of 2026, at least 14 distinct AI crawlers from 7 major AI companies regularly fetch web pages for training, citations, and real-time answers. The most consequential are listed below.

Training Crawlers

Block these to prevent your content from being used in AI model training. Blocking them does NOT affect citations.

GPTBot (OpenAI): training data
Google-Extended (Google): AI training (NOT AI Overviews)
CCBot (Common Crawl): open training datasets
FacebookBot (Meta): Llama model training
Bytespider (ByteDance): TikTok/Douyin AI

Citation Crawlers

Allow these to enable AI citations. Blocking them removes your pages from AI answer results.

OAI-SearchBot (OpenAI): ChatGPT live citations via Bing
ChatGPT-User (OpenAI): user-triggered browsing
PerplexityBot (Perplexity AI): real-time answer generation
ClaudeBot (Anthropic): web search for Claude.ai
YouBot (You.com): You.com AI search
Googlebot (Google): powers AI Overviews via search index

Training Crawlers vs. Citation Crawlers

Training crawlers (GPTBot, Google-Extended, CCBot, FacebookBot, Bytespider) collect data to train AI models. Blocking them prevents your content from being baked into AI model weights — but does NOT prevent your page from being cited in live AI answers. Citation/search crawlers (OAI-SearchBot, ChatGPT-User, PerplexityBot, ClaudeBot, YouBot) power live AI answers. Blocking these directly prevents your page from appearing in AI citations.

The Critical Mistake: Blocking GPTBot vs. OAI-SearchBot

The most common AI crawler mistake: site owners block GPTBot thinking it stops ChatGPT citations. It doesn't. GPTBot is for training data; OAI-SearchBot is what powers ChatGPT's live search citations (via Bing's index). Blocking GPTBot only affects whether your content is included in future model training — not whether ChatGPT cites you today. Most sites should allow OAI-SearchBot, PerplexityBot, and ClaudeBot to maintain AI citation eligibility.
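To see the distinction in code, here is a minimal sketch using Python's standard-library robots.txt parser (the page path is a placeholder): a policy that disallows only GPTBot leaves OAI-SearchBot free to fetch the same page.

from urllib import robotparser

# A policy that blocks only the training crawler
policy = [
    "User-agent: GPTBot",
    "Disallow: /",
]

rp = robotparser.RobotFileParser()
rp.parse(policy)

print(rp.can_fetch("GPTBot", "/article"))         # False: training crawl blocked
print(rp.can_fetch("OAI-SearchBot", "/article"))  # True: citation fetches unaffected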

robots.txt Configuration Examples

Allow All (Recommended for most sites)
User-agent: *
Disallow: /private/
Disallow: /admin/

# All AI crawlers inherit from User-agent: * above
# No specific rules needed to allow all
Sitemap: https://yourdomain.com/sitemap.xml

Allow Citations, Block Training (for publishers with IP concerns)
# Block AI training crawlers (IP protection)
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Allow AI citation crawlers (these power AI answers)
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

How to Check Your Current robots.txt

1. Navigate to yourdomain.com/robots.txt in your browser.
2. Use Ctrl+F to search for: GPTBot, OAI-SearchBot, PerplexityBot, ClaudeBot.
3. If found: check whether the next line is 'Disallow: /' (blocked) or 'Allow: /' (allowed).
4. If not found: the crawler follows the User-agent: * rule.
5. Verify with the robots.txt report in Google Search Console, or script the check as sketched below.
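A minimal script for the same check, using Python's standard-library robotparser (swap in your own domain; the crawler list mirrors the reference above):

from urllib import robotparser

SITE = "https://yourdomain.com"  # replace with your domain
CRAWLERS = [
    "GPTBot", "Google-Extended", "CCBot",   # training
    "OAI-SearchBot", "ChatGPT-User",        # citations
    "PerplexityBot", "ClaudeBot", "Googlebot",
]

rp = robotparser.RobotFileParser()
rp.set_url(f"{SITE}/robots.txt")
rp.read()  # fetches and parses the live file

for agent in CRAWLERS:
    verdict = "allowed" if rp.can_fetch(agent, f"{SITE}/") else "BLOCKED"
    print(f"{agent:16} {verdict}")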

Common robots.txt Mistakes

Blocking GPTBot thinking it stops ChatGPT citations

GPTBot is for training data only. OAI-SearchBot powers ChatGPT citations. Blocking GPTBot has no effect on citations.

Confusing robots.txt Disallow with meta noindex

Disallow prevents crawling entirely; noindex allows crawling but prevents indexing. Either one can remove your pages from AI citations, through different mechanisms.

Using 'User-agent: * Disallow: /' and forgetting to Allow specific crawlers

Crawlers obey the most specific User-agent group that matches them, so a dedicated group for a citation crawler overrides the wildcard. Add explicit Allow groups for the citation crawlers you want alongside the wildcard Disallow, as in the sketch below.
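A sketch of the fix (crawler names from the reference above; everything else stays blocked):

# Dedicated groups for citation crawlers override the wildcard
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /

# Everyone else is blocked
User-agent: *
Disallow: /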

Assuming robots.txt changes propagate instantly

Crawlers cache robots.txt for 24-48 hours. Test with Google Search Console after making changes.

AI Crawler Crawl Rates and Behavior

Most AI citation crawlers recrawl pages every 7-30 days. PerplexityBot is more aggressive due to real-time search requirements. Google AI Overviews use standard Googlebot crawl data — no separate crawler needed; optimizing for Googlebot optimizes for AI Overviews. OAI-SearchBot frequency correlates with how often your content is requested in ChatGPT queries — high-citation-probability pages get recrawled more frequently. TurboAudit's Indexability check verifies all major AI crawlers are permitted in your robots.txt.

Frequently Asked Questions

Does blocking GPTBot stop ChatGPT from citing my pages?

No — blocking GPTBot only affects whether your content is included in future OpenAI model training. ChatGPT's live search answers (via ChatGPT Search) are powered by OAI-SearchBot, which fetches pages through Bing's index in real time. Many publishers block GPTBot for IP reasons while keeping OAI-SearchBot allowed — this is the correct approach if you want AI citations without contributing to model training.

Can I block AI crawlers from specific pages only?

Yes. robots.txt supports path-specific rules: a 'User-agent: GPTBot' group with 'Disallow: /private/' blocks GPTBot from /private/ while allowing it on all other pages. You can also use a robots meta tag (e.g. <meta name="robots" content="noindex">) on specific pages, though this is less standardized across AI crawlers than the robots.txt approach. For fine-grained control, use path-specific Disallow rules in robots.txt, as in the sketch below.
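A minimal path-specific sketch (the paths are placeholders):

# GPTBot is blocked from these paths only; the rest of the site stays open to it
User-agent: GPTBot
Disallow: /private/
Disallow: /premium/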

How do I know which AI crawlers are visiting my site?

Check your server access logs for user-agent strings matching: 'GPTBot', 'OAI-SearchBot', 'PerplexityBot', 'ClaudeBot', 'Google-Extended'. Most hosting dashboards (Cloudflare, Vercel Analytics, Nginx access logs) capture these; a minimal scanning script is sketched below. You can also use a honeypot URL: create a page not linked from your site, add it only to your sitemap, and see which crawlers find it via the sitemap — AI crawlers that respect sitemaps will show up in logs.
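A minimal log-scanning sketch in Python, assuming a standard Nginx/Apache access log (the log path is a placeholder; adjust to your setup):

from collections import Counter

AI_AGENTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "PerplexityBot",
             "ClaudeBot", "Google-Extended", "CCBot", "Bytespider"]

hits = Counter()
with open("/var/log/nginx/access.log") as log:  # placeholder path
    for line in log:
        for agent in AI_AGENTS:
            if agent in line:  # user-agent strings appear verbatim in each entry
                hits[agent] += 1

for agent, count in hits.most_common():
    print(f"{agent:16} {count}")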

What happens if I block all AI crawlers?

Your pages will not appear in AI-generated answers from ChatGPT, Perplexity, Claude, or Google AI Overviews. Some AI systems (like Google AI Overviews) use Googlebot data — blocking Googlebot also blocks Google Search, which is usually not intended. For most sites, blocking all AI crawlers significantly reduces AI visibility without a meaningful benefit. If IP protection is the concern, blocking training crawlers (GPTBot, Google-Extended, CCBot) while allowing citation crawlers is the right balance.

Does blocking Google-Extended remove my site from Google AI Overviews?

No — Google-Extended controls AI training data only. Google AI Overviews use standard Googlebot crawl data from Google's existing search index. Blocking Google-Extended does not affect your eligibility for AI Overviews citation. Blocking Googlebot does affect AI Overviews (since Overviews draw from the search index). This is a common confusion: Google-Extended and Googlebot have completely separate purposes.
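A sketch of that separation in robots.txt: training blocked, Search and AI Overviews untouched.

# Blocks use of your content for Google AI training only
User-agent: Google-Extended
Disallow: /

# No Googlebot rules needed: it keeps crawling for Search and AI Overviews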

Should I block CCBot?

Blocking CCBot is reasonable if you don't want your content in open training datasets (used by many open-source AI models). It does not affect any major commercial AI citation systems — ChatGPT, Perplexity, Google AI Overviews, and Claude all use their own crawlers. Blocking CCBot has no negative effect on AI search visibility. It's an optional choice based on whether you want to contribute to open-source AI training data.

Can AI crawlers access login-gated or paywalled content?

AI crawlers generally cannot authenticate. Login-gated content (behind a sign-in wall) is not accessible to any AI crawler, regardless of robots.txt settings. This means paywalled content, members-only sections, and private dashboards are not crawled and will never appear in AI citations. If you want AI to cite specific content, it must be publicly accessible. This also means your robots.txt rules for authenticated paths are largely irrelevant — the crawler can't get past the login anyway.

What is Crawl-delay, and should I set it?

Crawl-delay tells crawlers to wait N seconds between requests. Syntax: 'Crawl-delay: 5' (5 seconds between requests). Google ignores Crawl-delay; Bing and some AI crawlers respect it. Use it if AI crawlers are generating measurable server load — check with your hosting provider. For most sites, AI crawler traffic is negligible and Crawl-delay is unnecessary. Setting a very high Crawl-delay (60+) effectively blocks crawlers without a Disallow. A sketch appears below.
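A sketch of per-crawler Crawl-delay directives (the values are illustrative; Google ignores this directive entirely):

User-agent: PerplexityBot
Crawl-delay: 5

User-agent: CCBot
Crawl-delay: 10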

Coming Soon

Audit Your AI Search Visibility

See exactly how AI systems view your content and what to fix. Join the waitlist to get early access.

3 free audits · No credit card · Early access