The #1 robots.txt mistake
Blocking GPTBot does NOT stop ChatGPT from citing your pages. GPTBot collects training data. OAI-SearchBot powers live citations. They are separate crawlers with separate purposes.
Complete AI Crawler Reference
As of 2026, at least 14 distinct AI crawlers from 7 major AI companies regularly fetch web pages for training, citations, and real-time answers.
Training Crawlers
Block these to prevent your content from being used in AI model training. Blocking them does NOT affect citations.
Crawler | Operator | Purpose
GPTBot | OpenAI | Training data
Google-Extended | Google | AI training (NOT AI Overviews)
CCBot | Common Crawl | Open training datasets
FacebookBot | Meta | Llama model training
Bytespider | ByteDance | TikTok/Douyin AI
Citation Crawlers
Allow these to enable AI citations. Blocking them removes your pages from AI answer results.
Crawler | Operator | Purpose
OAI-SearchBot | OpenAI | ChatGPT live citations via Bing
ChatGPT-User | OpenAI | User-triggered browsing
PerplexityBot | Perplexity AI | Real-time answer generation
ClaudeBot | Anthropic | Web search for Claude.ai
YouBot | You.com | You.com AI search
Googlebot | Google | Powers AI Overviews via search index
Training Crawlers vs. Citation Crawlers
Training crawlers (GPTBot, Google-Extended, CCBot, FacebookBot, Bytespider) collect data to train AI models. Blocking them prevents your content from being baked into AI model weights — but does NOT prevent your page from being cited in live AI answers. Citation/search crawlers (OAI-SearchBot, ChatGPT-User, PerplexityBot, ClaudeBot, YouBot) power live AI answers. Blocking these directly prevents your page from appearing in AI citations.
The Critical Mistake: Blocking GPTBot vs. OAI-SearchBot
The most common AI crawler mistake: site owners block GPTBot thinking it stops ChatGPT citations. It doesn't. GPTBot is for training data; OAI-SearchBot is what powers ChatGPT's live search citations (via Bing's index). Blocking GPTBot only affects whether your content is included in future model training — not whether ChatGPT cites you today. Most sites should allow OAI-SearchBot, PerplexityBot, and ClaudeBot to maintain AI citation eligibility.
robots.txt Configuration Examples
Allow all crawlers (default):

User-agent: *
Disallow: /private/
Disallow: /admin/

# All AI crawlers inherit from User-agent: * above
# No specific rules needed to allow all

Sitemap: https://yourdomain.com/sitemap.xml
Block training crawlers, allow citation crawlers:

# Block AI training crawlers (IP protection)
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Allow AI citation crawlers (these power AI answers)
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
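To sanity-check a split policy like the one above before deploying it, you can run it through Python's standard-library robots.txt parser. A minimal sketch (example.com is a placeholder; urllib.robotparser uses simplified first-match semantics, so treat this as a quick check rather than a guarantee of how each crawler behaves):

import urllib.robotparser

# The split policy from the example above, pasted in as a string for testing
rules = """
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/guide"))         # False: training crawler blocked
print(rp.can_fetch("OAI-SearchBot", "https://example.com/guide"))  # True: citation crawler allowed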
How to Check Your Current robots.txt
Navigate to yourdomain.com/robots.txt in your browser
Use Ctrl+F to search for: GPTBot, OAI-SearchBot, PerplexityBot, ClaudeBot
If found: check if the next line is 'Disallow: /' (blocked) or 'Allow: /' (allowed)
If not found: the crawler follows the User-agent: * rule
Verify with the robots.txt report in Google Search Console
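If you would rather script these steps, Python's urllib.robotparser can fetch a live robots.txt and answer the same question per crawler. A sketch, with yourdomain.com standing in for your real site:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://yourdomain.com/robots.txt")  # replace with your domain
rp.read()  # fetches and parses the live file

for bot in ["GPTBot", "OAI-SearchBot", "PerplexityBot", "ClaudeBot"]:
    verdict = "allowed" if rp.can_fetch(bot, "https://yourdomain.com/") else "blocked"
    print(f"{bot}: {verdict}")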
Common robots.txt Mistakes
Blocking GPTBot thinking it stops ChatGPT citations
GPTBot is for training data only. OAI-SearchBot powers ChatGPT citations. Blocking GPTBot has no effect on citations.
Confusing robots.txt Disallow with meta noindex
Disallow prevents crawling entirely. noindex prevents indexing but allows crawling. Both block AI citations via different mechanisms.
Using 'User-agent: * Disallow: /' and forgetting to Allow specific crawlers
A crawler obeys the most specific User-agent group that matches it and ignores the wildcard group entirely, so a blanket 'Disallow: /' silently applies to every crawler without its own group. Add a dedicated group with an explicit 'Allow: /' for each citation crawler you want to keep; placement relative to the wildcard block doesn't matter. See the sketch after this list.
Assuming robots.txt changes propagate instantly
Crawlers cache robots.txt for 24-48 hours. Test with Google Search Console after making changes.
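The group-precedence point (the third mistake above) is easy to verify with the standard-library parser: a crawler with its own User-agent group never falls back to the wildcard group. A minimal sketch with placeholder URLs:

import urllib.robotparser

# Wildcard block plus one dedicated group for a citation crawler
rules = """
User-agent: *
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/post"))         # False: no own group, falls back to *
print(rp.can_fetch("OAI-SearchBot", "https://example.com/post"))  # True: its own group takes precedence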
AI Crawler Crawl Rates and Behavior
Most AI citation crawlers recrawl pages every 7-30 days. PerplexityBot is more aggressive due to real-time search requirements. Google AI Overviews use standard Googlebot crawl data — no separate crawler needed; optimizing for Googlebot optimizes for AI Overviews. OAI-SearchBot frequency correlates with how often your content is requested in ChatGPT queries — high-citation-probability pages get recrawled more frequently. TurboAudit's Indexability check verifies all major AI crawlers are permitted in your robots.txt.
Frequently Asked Questions
Does blocking GPTBot stop ChatGPT from citing my pages?
No. Blocking GPTBot only affects whether your content is included in future OpenAI model training. ChatGPT's live search answers (via ChatGPT Search) are powered by OAI-SearchBot, which fetches pages through Bing's index in real time. Many publishers block GPTBot for IP reasons while keeping OAI-SearchBot allowed; this is the correct approach if you want AI citations without contributing to model training.
Can I block AI crawlers from specific pages instead of the whole site?
Yes. robots.txt supports path-specific rules: a 'User-agent: GPTBot' group with 'Disallow: /private/' blocks GPTBot from /private/ while allowing it on all other pages. You can also use a crawler-specific robots meta tag on individual pages (for example, <meta name="GPTBot" content="noindex">), though meta tag support is less standardized across AI crawlers than the robots.txt approach. For fine-grained control, use path-specific Disallow rules in robots.txt; a quick way to test them follows below.
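The path-specific behavior can be checked the same way as the earlier examples. A quick sketch (example.com is a placeholder):

import urllib.robotparser

rules = """
User-agent: GPTBot
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/private/report"))  # False: under the blocked path
print(rp.can_fetch("GPTBot", "https://example.com/blog/post"))       # True: everything else still allowed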
How do I know which AI crawlers are visiting my site?
Check your server access logs for user-agent strings matching 'GPTBot', 'OAI-SearchBot', 'PerplexityBot', 'ClaudeBot', or 'Google-Extended'. Most hosting platforms and log tools (Cloudflare, Vercel Analytics, raw Nginx access logs) capture these. You can also use a honeypot URL: create a page that is not linked from anywhere on your site, add it only to your sitemap, and see which crawlers find it via the sitemap; AI crawlers that respect sitemaps will show up in your logs.
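For the log-checking approach, a short Python script can tally AI crawler visits from a raw access log. A sketch; the log path is an assumption (adjust for your server), and matching on user-agent substrings is a heuristic, since user-agent strings can be spoofed:

import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"  # assumption: adjust to your server's access log

AI_BOTS = ["GPTBot", "OAI-SearchBot", "PerplexityBot", "ClaudeBot", "Google-Extended"]
pattern = re.compile("|".join(AI_BOTS))

hits = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = pattern.search(line)
        if match:
            hits[match.group(0)] += 1

for bot, count in hits.most_common():
    print(f"{bot}: {count} requests")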
What happens if I block all AI crawlers?
Your pages will not appear in AI-generated answers from ChatGPT, Perplexity, Claude, or Google AI Overviews. Some AI systems (like Google AI Overviews) use Googlebot data, so blocking Googlebot also blocks Google Search, which is usually not intended. For most sites, blocking all AI crawlers significantly reduces AI visibility without a meaningful benefit. If IP protection is the concern, blocking training crawlers (GPTBot, Google-Extended, CCBot) while allowing citation crawlers is the right balance.
Does blocking Google-Extended remove my site from Google AI Overviews?
No: Google-Extended controls AI training data only. Google AI Overviews use standard Googlebot crawl data from Google's existing search index. Blocking Google-Extended does not affect your eligibility for AI Overviews citation. Blocking Googlebot does affect AI Overviews (since Overviews draw from the search index). This is a common confusion: Google-Extended and Googlebot have completely separate purposes.
Should I block CCBot?
Blocking CCBot is reasonable if you don't want your content in open training datasets (used by many open-source AI models). It does not affect any major commercial AI citation system: ChatGPT, Perplexity, Google AI Overviews, and Claude all use their own crawlers. Blocking CCBot has no negative effect on AI search visibility. It's an optional choice based on whether you want to contribute to open-source AI training data.
Can AI crawlers access login-gated or paywalled content?
AI crawlers generally cannot authenticate. Login-gated content (behind a sign-in wall) is not accessible to any AI crawler, regardless of robots.txt settings. This means paywalled content, members-only sections, and private dashboards are not crawled and will never appear in AI citations. If you want AI to cite specific content, it must be publicly accessible. It also means your robots.txt rules for authenticated paths are largely irrelevant: the crawler can't get past the login anyway.
What is Crawl-delay, and should I use it?
Crawl-delay tells crawlers to wait N seconds between requests. Syntax: 'Crawl-delay: 5' (5 seconds between requests). Google ignores Crawl-delay; Bing and some AI crawlers respect it. Use it if AI crawlers are generating measurable server load (check with your hosting provider). For most sites, AI crawler traffic is negligible and Crawl-delay is unnecessary. Setting a very high Crawl-delay (60+) throttles crawlers so heavily that it effectively blocks them without a Disallow.
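Whether a Crawl-delay rule parses the way you expect can also be checked with the standard-library parser, which exposes a crawl_delay() method. A minimal sketch:

import urllib.robotparser

rules = """
User-agent: PerplexityBot
Crawl-delay: 5
Disallow:
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.crawl_delay("PerplexityBot"))  # 5: the delay applies to this group
print(rp.crawl_delay("GPTBot"))         # None: no matching group sets a delay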
Audit Your AI Search Visibility
See exactly how AI systems view your content and what to fix. Join the waitlist to get early access.