Deep Dive

Indexability & AI Crawl Access: Complete Guide

Robots.txt, canonical tags, noindex, redirect chains, HTTPS — every technical signal that determines whether AI systems can access and process your page.

TurboAudit Team · February 18, 2026 · 12 min

What Is AI Indexability?

31% of audited pages block at least one major AI crawler

Nearly one in three pages has a disqualifying technical issue that prevents AI from seeing the content before any content quality evaluation even begins.

AI indexability is the set of technical conditions that determine whether an AI crawler can access, fetch, and parse your page content. TurboAudit runs 8 checks. A single failure can block AI access entirely.

1. robots.txt AI Crawler Access (BLOCKER)
2. HTTP Status Code (BLOCKER)
3. Canonical Tag (BLOCKER)
4. JavaScript Rendering (HIGH)
5. robots Meta Tag (BLOCKER)
6. Redirect Chains (MEDIUM)
7. HTTPS and Mixed Content (MEDIUM)
8. Page Load Speed (MEDIUM)

Check 1: robots.txt AI Crawler Access

BLOCKER

The robots.txt file tells crawlers which paths they can and cannot access. When a robots.txt rule disallows an AI crawler from your page’s path, that crawler will not request the page at all — your content is completely inaccessible to it.

The major AI crawlers you need to explicitly permit are:

- **GPTBot** — OpenAI’s primary training and browsing crawler
- **OAI-SearchBot** — OpenAI’s real-time search crawler for ChatGPT search
- **Claude-SearchBot** (also ClaudeBot) — Anthropic’s web crawler
- **PerplexityBot** — Perplexity AI’s crawler
- **Google-Extended** — Google’s crawler for Gemini and AI Overviews training
- **Amazonbot** — Amazon’s crawler for Alexa and AI product features

**BLOCKER condition:** Any rule matching Disallow: / for these crawlers blocks the entire site. More targeted rules (e.g., Disallow: /admin/) are acceptable but must be audited to confirm they do not cover important pages.

**Pass condition:** The robots.txt either omits the crawler’s user-agent (meaning it follows the * default, ideally Allow: /) or explicitly sets Allow: / for each named crawler.

**Common trap:** Many SEO plugins auto-add protective robots.txt rules that inadvertently block AI crawlers. Vercel, WordPress with Yoast, and some Shopify themes have added blanket Disallow rules for all non-Google crawlers.

**Pass:** All major AI crawlers allowed or not disallowed

**Fail:** Disallow: / for GPTBot, PerplexityBot, or other AI crawlers

How to Audit Your robots.txt

Fetch your robots.txt directly at yourdomain.com/robots.txt. Check each AI crawler’s user-agent explicitly.

**WordPress:** Some security plugins add Disallow: / for unknown bots. Open the Yoast SEO or RankMath robots.txt editor and inspect the full file.

**Next.js:** If you use the App Router robots.ts export, verify the output by visiting /robots.txt in production. The disallow array must not include / for AI crawlers.

**Shopify:** Use the robots.txt.liquid file to add explicit Allow rules for each AI crawler user-agent.

After any change, wait 24–48 hours for crawlers to re-fetch the robots.txt before assuming the fix is live.
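If you are on the Next.js App Router, a robots.ts along these lines produces a robots.txt that mirrors the template later in this guide. This is a minimal sketch, assuming the disallowed paths and sitemap URL are placeholders you would replace with your own:

app/robots.ts
import type { MetadataRoute } from 'next'

// Sketch: keep the * default open and add an explicit Allow: / entry
// for each AI crawler. Disallowed paths below are illustrative.
export default function robots(): MetadataRoute.Robots {
  const aiCrawlers = [
    'GPTBot',
    'OAI-SearchBot',
    'Claude-SearchBot',
    'ClaudeBot',
    'PerplexityBot',
    'Google-Extended',
    'Amazonbot',
  ]
  return {
    rules: [
      // Default rule for all other crawlers
      { userAgent: '*', allow: '/', disallow: ['/admin/', '/api/'] },
      // Explicit permission for each named AI crawler
      ...aiCrawlers.map((userAgent) => ({ userAgent, allow: '/' })),
    ],
    sitemap: 'https://yourdomain.com/sitemap.xml',
  }
}

After deploying, confirm the generated output by visiting /robots.txt in production, as described above.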


Check 2: HTTP Status Code

BLOCKER

Every page you want AI to index must return HTTP status code 200 OK. AI crawlers interpret the HTTP response code the same way Google does — a non-200 status communicates a problem with the page’s availability.

**Status code categories:**

- **200 OK** — Pass. Page is available and can be crawled.
- **301/302** — Redirect. AI crawlers follow a single redirect but lose some indexability signal.
- **404 Not Found** — Fail. AI crawler records the page as unavailable.
- **410 Gone** — Fail. Explicit signal that the page has been permanently removed.
- **403 Forbidden** — BLOCKER. Crawler is denied access; content is invisible.
- **429 Too Many Requests** — BLOCKER. Crawler is being rate-limited and cannot access the page.
- **500/503 Server Error** — BLOCKER. Server failure prevents crawling.

**Soft 404s** are a special case: the server returns 200 OK but the page content is an error message. AI crawlers that parse these pages extract error text instead of useful content.

**Pass example:** Server returns HTTP/2 200 with the page’s actual content.

**Fail example:** Server returns HTTP/2 200 but the page body contains an error message — a soft 404.

**Pass:** Page returns HTTP 200 OK with actual content

**Fail:** 403, 404, 429, 5xx, or soft 404 (200 with error content)
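To catch both hard failures and soft 404s in one pass, you can request the page and inspect the status code and the body together. A minimal sketch in TypeScript (Node 18+ for the built-in fetch); the error-phrase list is an assumption, so tune it to your own error templates:

check-status.ts
// Sketch: flag non-200 responses and 200 responses whose body is an error page.
const ERROR_PHRASES = ['page not found', 'no longer available', 'no results found']

async function checkPage(url: string): Promise<string> {
  const res = await fetch(url, { redirect: 'follow' })
  if (res.status !== 200) return `FAIL: HTTP ${res.status}`
  const body = (await res.text()).toLowerCase()
  // A 200 whose body matches an error template is a soft 404
  if (ERROR_PHRASES.some((phrase) => body.includes(phrase))) {
    return 'FAIL: soft 404 (200 with error content)'
  }
  return 'PASS: 200 with real content'
}

checkPage('https://yourdomain.com/blog/article-a').then(console.log)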


Check 3: Canonical Tag

BLOCKER

The canonical tag (<link rel="canonical">) tells crawlers which URL is the authoritative version of a page. For AI indexability, the canonical tag must point back to the page’s own URL — a self-referencing canonical.

**Three canonical failure modes:**

**1. Canonical pointing to a different page.** If /blog/article-a has a canonical pointing to /blog/article-b, the AI crawler treats /blog/article-a as duplicate content. Your original URL gets zero indexability credit.

**2. Missing canonical tag.** Without a canonical, AI crawlers must infer the authoritative URL. If your page is accessible via multiple URLs, crawlers may split indexability between URL variants.

**3. Duplicate canonical tags.** Two or more canonical link elements in the same head. Crawlers treat this as contradictory signals and may ignore both.

**Pass example:** A self-referencing canonical matching the actual page URL — https://yourdomain.com/blog/article-a.

**Fail example:** A canonical pointing to a different URL — https://yourdomain.com/blog/article-b — declaring this page a duplicate.

**Pass:** Self-referencing canonical matching the actual page URL

**Fail:** Canonical pointing to a different page, missing, or duplicated

Dynamic Pages and Canonical Tags

E-commerce sites with faceted navigation often generate hundreds of URL variants (e.g., /shoes?color=red&size=10). Every variant should have a canonical pointing to the base URL (/shoes) to consolidate indexability. In Next.js, use the metadata export’s alternates.canonical property to set canonical tags programmatically per page. This ensures the canonical is always accurate and never hardcoded to a staging URL.
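A minimal sketch of that alternates.canonical approach, assuming a hypothetical blog route (the domain is a placeholder and exact params typing varies by Next.js version):

app/blog/[slug]/page.tsx (metadata excerpt)
import type { Metadata } from 'next'

// Sketch: derive a self-referencing canonical from the route params so it
// can never point at a staging host or at a different article.
export async function generateMetadata(
  { params }: { params: { slug: string } }
): Promise<Metadata> {
  return {
    metadataBase: new URL('https://yourdomain.com'),
    alternates: { canonical: `/blog/${params.slug}` },
  }
}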


Check 4: JavaScript Rendering

HIGH

A 2025 SearchVIU study found that 69% of AI crawlers cannot execute JavaScript. When an AI crawler fetches your page, it receives the raw HTML sent by your server. If your main content is inserted into the DOM by JavaScript after the initial load, those crawlers see an empty shell — not your content.

This is the “rendering gap.” It is the single highest-impact technical issue for AI indexability, and it disproportionately affects sites built with modern JavaScript frameworks.

**Affected frameworks (client-rendered by default):**

- Create React App (CRA)
- Vite + React without SSR configured
- Single-page Angular apps without Angular Universal

**Not affected (server-rendered by default):**

- Next.js App Router with Server Components — content in initial HTML
- Nuxt.js (Vue) with SSR enabled
- Astro — content in initial HTML by default
- Traditional WordPress / PHP

**How to check:** Right-click your page and select View Page Source (not Inspect Element). Search for a sentence from your main content. If it’s not in the source HTML, AI crawlers cannot see it.

**The fix:** Use Server-Side Rendering (SSR) or Static Site Generation (SSG) for all content you want AI to index. In Next.js, keep content in Server Components and avoid useEffect-based data fetching for primary content (a Server Component sketch follows the Pass/Fail summary below).

**Pass:** Main content present in HTML source without JavaScript

**Fail:** Content only appears after JavaScript executes (rendering gap)
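To illustrate the Server Component fix described above, here is a minimal sketch that fetches the article on the server so its text ships in the initial HTML. The getArticle helper, route, and data shape are hypothetical:

app/blog/[slug]/page.tsx (rendering excerpt)
// Server Component (no 'use client'): the article body is rendered on the
// server and included in the initial HTML, so non-JS crawlers can read it.
import { getArticle } from '@/lib/articles' // hypothetical data helper

export default async function ArticlePage(
  { params }: { params: { slug: string } }
) {
  const article = await getArticle(params.slug) // runs on the server, not in useEffect
  return (
    <article>
      <h1>{article.title}</h1>
      <div dangerouslySetInnerHTML={{ __html: article.html }} />
    </article>
  )
}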

Partial JavaScript Issues

Even with SSR enabled, partial rendering issues can still block AI indexability:

- **Tabs and accordions:** Content inside collapsed tabs often is not in the initial HTML — it loads on interaction. Use progressive disclosure patterns with CSS-controlled visibility.
- **Infinite scroll:** Only the first batch of results is in the initial HTML. AI sees only that first batch.
- **Lazy-loaded text:** Images are fine to lazy-load. Text content behind an IntersectionObserver may not be in the initial HTML.
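One way to keep accordion content in the initial HTML is the native details/summary element: the collapsed text is server-rendered and merely hidden by the browser, not injected on click. A minimal sketch of a hypothetical component:

Faq.tsx
// Sketch: accordion whose body text stays in the server-rendered HTML.
export function Faq({ question, answer }: { question: string; answer: string }) {
  return (
    <details>
      <summary>{question}</summary>
      {/* Present in View Page Source even while visually collapsed */}
      <p>{answer}</p>
    </details>
  )
}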


Check 5: robots Meta Tag

BLOCKER

The robots meta tag lives in your page’s head element and provides page-level crawl and indexing directives. Unlike robots.txt (which controls crawler access), the robots meta tag controls what the crawler can do with the page content after accessing it.

Three directives are BLOCKERS for AI indexability:

**noindex** — Tells crawlers not to include this page in their index. For AI systems, this means the page will not be cited. Intended for staging pages and admin panels — catastrophic on content pages. Usage: <meta name="robots" content="noindex">

**nosnippet** — Tells crawlers not to extract content from the page. This effectively prevents citation even if the page is indexed. Usage: <meta name="robots" content="nosnippet">

**max-snippet:0** — Sets the allowed snippet length to zero characters, preventing all content extraction. Usage: <meta name="robots" content="max-snippet:0">

**Pass example:** <meta name="robots" content="index, follow">

**Common cause of accidental noindex:** CMS preview modes, staging environments whose robots tags are copied to production, or A/B testing tools that inject noindex during experiments.

**Pass:** meta robots = index, follow (or omitted)

**Fail:** noindex, nosnippet, or max-snippet:0 present
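In Next.js, the page-level robots directives can be set from the metadata export. A minimal sketch of the passing configuration; verify that no parent layout.tsx overrides it with noindex:

app/blog/[slug]/page.tsx (robots excerpt)
import type { Metadata } from 'next'

// Sketch: explicit index/follow so the rendered page carries
// <meta name="robots" content="index, follow">.
export const metadata: Metadata = {
  robots: {
    index: true,
    follow: true,
  },
}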


Check 6: Redirect Chains

MEDIUM

A redirect chain occurs when URL A redirects to URL B, which redirects to URL C, which redirects to URL D. Each hop adds latency, reduces indexability signal transfer, and increases the probability a crawler abandons the chain.

**Single redirect (acceptable):** A → B. A single 301 redirect transfers the majority of indexability signal. Acceptable for URL consolidation.

**Redirect chain (problematic):** A → B → C → D. Each additional hop loses approximately 15–20% of indexability signal. AI crawlers typically follow a maximum of 5 redirects before abandoning.

**Redirect loop (BLOCKER):** A → B → A (or longer loops). The crawler continues until the redirect limit is reached. The page is effectively inaccessible.

**Common sources of redirect chains:**

- HTTP → HTTPS plus www → non-www handled as two separate redirects instead of one merged redirect
- Old URL → intermediate URL → new URL during site migrations
- Affiliate or tracking redirect → marketing redirect → final page

**The fix:** Audit all redirect chains with Screaming Frog or an online redirect checker. Collapse each chain to a single direct redirect from the original URL to the final destination.

**Pass:** Single redirect (A → B) or no redirect

**Fail:** Chain of 2+ redirects or a redirect loop
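Redirect chains can also be traced by following redirects one hop at a time with redirect: 'manual' and counting hops. A minimal sketch for Node 18+, whose fetch exposes the Location header on manual redirects (browsers return an opaque response instead); the 5-hop cap mirrors the crawler limit mentioned above:

redirect-chain.ts
// Sketch: follow redirects manually, collect each hop, and detect loops.
async function traceRedirects(url: string, maxHops = 5): Promise<string[]> {
  const hops: string[] = [url]
  let current = url
  for (let i = 0; i < maxHops; i++) {
    const res = await fetch(current, { redirect: 'manual' })
    const location = res.headers.get('location')
    if (res.status < 300 || res.status >= 400 || !location) break
    current = new URL(location, current).toString() // resolve relative Location headers
    if (hops.includes(current)) { hops.push(current); break } // redirect loop detected
    hops.push(current)
  }
  return hops // hops.length - 1 redirects; 2 or more means a chain worth collapsing
}

traceRedirects('http://yourdomain.com/old-page').then((hops) => console.log(hops.join(' → ')))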


Check 7: HTTPS and Mixed Content

MEDIUM

AI crawlers, like modern browsers, require HTTPS for secure content delivery. HTTP pages are flagged as insecure by crawlers and browsers alike, and some AI systems deprioritize HTTP-only pages as a trust signal failure.

**HTTPS check:**

- **Pass:** Page loads at https://yourdomain.com/... with a valid SSL certificate
- **Fail:** Page loads at http://yourdomain.com/... (unencrypted)

**Mixed content:** Mixed content occurs when an HTTPS page loads resources (images, scripts, stylesheets, iframes) from HTTP URLs. This signals to crawlers that the page is not fully secure. Mixed content is especially common on:

- Sites migrated from HTTP to HTTPS without updating hardcoded resource URLs
- Pages embedding third-party widgets that only serve resources over HTTP
- Blog posts with images uploaded during the HTTP era of the site

**How to check:** In Chrome DevTools, open the Console tab. Mixed content errors appear as warnings: the page was loaded over HTTPS but requested an insecure resource.

**The fix:** Update all resource URLs to HTTPS. For WordPress, the Better Search Replace plugin can batch-update HTTP URLs across the database.

**Pass:** Page fully served over HTTPS with no HTTP resource loads

**Fail:** HTTP page or HTTPS page with mixed HTTP resources
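As a rough complement to the DevTools check, you can fetch the HTML and look for http:// URLs in src and href attributes. A heuristic sketch only: it will also match plain http:// links, which are not mixed content, so review the output by hand:

mixed-content.ts
// Sketch: flag http:// resource URLs referenced in an HTTPS page's HTML.
async function findInsecureUrls(pageUrl: string): Promise<string[]> {
  const html = await (await fetch(pageUrl)).text()
  // Heuristic: src/href attributes that start with http:// rather than https://
  const matches = html.match(/(?:src|href)=["']http:\/\/[^"']+["']/gi) ?? []
  return [...new Set(matches)]
}

findInsecureUrls('https://yourdomain.com/blog/article-a')
  .then((hits) => console.log(hits.length ? hits : 'No insecure resource URLs found'))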


Check 8: Page Load Speed

MEDIUM

AI crawlers simulate a mobile browser environment and have a finite budget for how long they wait for a page to respond. If your page does not return content within approximately 1.8 seconds, the crawler records a timeout and treats the page as unavailable.

**Speed targets for AI indexability:**

- **Time to First Byte (TTFB):** Under 600ms — the server must respond quickly
- **Full page response time:** Under 1.8 seconds — content must be in the initial response
- **Largest Contentful Paint (LCP):** Under 2.5 seconds — Google’s AI Overviews correlates citation rates with LCP

**Common causes of slow AI crawl times:**

- Unoptimized server response time (shared hosting, no CDN)
- Large uncompressed HTML payloads (pages over 500KB)
- Render-blocking scripts in the head that delay HTML delivery
- SSR with expensive uncached database queries on each request

**How to measure:** Use Google PageSpeed Insights (free) to check your TTFB and LCP.

**Quick wins:**

1. Add a CDN (Cloudflare’s free tier reduces TTFB for most pages)
2. Enable gzip or Brotli compression on your server
3. Move non-critical scripts to defer or async loading
4. Cache server-side rendered pages at the CDN edge

**Pass:** TTFB under 600ms, full response under 1.8 seconds

**Fail:** Response time over 1.8 seconds causes an AI crawler timeout
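TTFB and full response time can also be sampled with a short script: fetch resolves when the response headers arrive, which approximates TTFB, and reading the body gives the full response time. A minimal sketch; your network conditions will differ from a crawler's, so treat the numbers as a rough signal rather than a verdict:

ttfb.ts
// Sketch: rough TTFB and full-response timing using the built-in fetch (Node 18+).
async function timePage(url: string): Promise<void> {
  const start = performance.now()
  const res = await fetch(url)
  const ttfb = performance.now() - start   // time until response headers arrived
  await res.text()
  const total = performance.now() - start  // time until the body was fully read
  console.log(
    `TTFB ${ttfb.toFixed(0)}ms (target < 600ms), total ${total.toFixed(0)}ms (target < 1800ms)`
  )
}

timePage('https://yourdomain.com/')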

Common Mistakes by Platform

Different platforms have distinct failure patterns for AI indexability. Here are the most common mistakes by platform, with specific fixes.


WordPress

  • Check Settings > Reading: disable "Discourage search engines"
  • Review robots.txt in Yoast SEO or RankMath for Disallow rules
  • Verify category/tag archive pages are not noindex
  • Use View Page Source to confirm page builder content is in HTML
  • Run Really Simple SSL and verify DevTools Console shows no mixed content

Next.js / React

  • Visit /robots.txt on production to verify disallow rules
  • Ensure primary content stays in Server Components
  • Check layout.tsx robots metadata does not cascade noindex to content pages
  • Use View Source to verify SSR content is in initial HTML
  • Confirm notFound() only fires for genuinely missing resources

Shopify

  • Edit robots.txt.liquid to add Allow: / for GPTBot, OAI-SearchBot, PerplexityBot, Claude-SearchBot, Google-Extended
  • Verify paginated collection pages have canonical pointing to page 1
  • Audit third-party apps for client-side content injection
  • Check that product descriptions appear in HTML source

The robots.txt Template for AI Crawlers

Use this template as your baseline. Place the file at your domain root (yourdomain.com/robots.txt). Replace private path prefixes with the actual paths on your site.

/robots.txt
# robots.txt — AI Crawler Permissions Template
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /account/
Disallow: /api/

User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Amazonbot
Allow: /

# Sitemap
Sitemap: https://yourdomain.com/sitemap.xml

Do not add Disallow: / for any AI crawler unless you have a specific legal or business reason to block that crawler from your content.

Frequently Asked Questions

**Does blocking GPTBot hurt my Google search rankings?**

Blocking GPTBot does not directly affect your Google search rankings. GPTBot is OpenAI’s crawler and is separate from Googlebot and Google-Extended. Blocking GPTBot means your content won’t be used in ChatGPT’s responses or training data, reducing your AI search visibility on OpenAI’s platforms. Your Google rankings are governed by Googlebot and Google-Extended, not GPTBot.

**What is the difference between GPTBot and OAI-SearchBot?**

GPTBot is OpenAI’s primary crawler used for training data collection and periodic content updates. OAI-SearchBot is OpenAI’s real-time search crawler, activated when a ChatGPT user performs a live web search. OAI-SearchBot fetches pages on-demand to provide up-to-date answers. Both must be allowed in your robots.txt for full OpenAI platform visibility — blocking GPTBot blocks training data, blocking OAI-SearchBot blocks real-time search results.

**Can I allow some AI crawlers and block others?**

Yes. robots.txt user-agent rules are per-crawler. You can allow GPTBot and PerplexityBot while blocking Google-Extended if you have concerns about Google using your content for Gemini training. Each user-agent rule is independent. Just ensure each crawler you want to allow has either no explicit rule (and the * default is Allow: /) or an explicit Allow: / directive.

**How do I check whether my robots.txt is blocking AI crawlers?**

Three methods: (1) Fetch your robots.txt at yourdomain.com/robots.txt and manually check each AI crawler’s user-agent entry. (2) Check your server access logs for 403 responses to requests from GPTBot, PerplexityBot, or ClaudeBot user-agents. (3) Use TurboAudit’s Indexability check, which fetches your robots.txt and tests each AI crawler’s access rules against your URL automatically.
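For method (1), a small script can fetch robots.txt and report which AI user-agents carry a blanket Disallow: /. This is a simplified sketch that ignores Allow overrides and path-specific rules, so treat its output as a starting point:

robots-check.ts
// Sketch: list AI crawlers whose robots.txt record contains Disallow: /.
const AI_AGENTS = ['GPTBot', 'OAI-SearchBot', 'Claude-SearchBot', 'ClaudeBot', 'PerplexityBot', 'Google-Extended', 'Amazonbot']

async function fullyBlockedAgents(domain: string): Promise<string[]> {
  const txt = await (await fetch(`https://${domain}/robots.txt`)).text()
  const blocked = new Set<string>()
  let group: string[] = []  // user-agents named in the current record
  let inRules = false       // whether a rule line has followed those user-agents
  for (const raw of txt.split('\n')) {
    const line = raw.split('#')[0].trim()
    const match = line.match(/^([A-Za-z-]+)\s*:\s*(.*)$/)
    if (!match) continue
    const key = match[1].toLowerCase()
    const value = match[2].trim()
    if (key === 'user-agent') {
      if (inRules) { group = []; inRules = false } // a new record starts
      group.push(value.toLowerCase())
    } else {
      inRules = true
      if (key === 'disallow' && value === '/') group.forEach((agent) => blocked.add(agent))
    }
  }
  return AI_AGENTS.filter((agent) => blocked.has(agent.toLowerCase()))
}

fullyBlockedAgents('yourdomain.com').then((agents) =>
  console.log(agents.length ? `Fully blocked: ${agents.join(', ')}` : 'No AI crawlers fully blocked')
)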

**Do JavaScript rendering issues affect AI crawlers and Google equally?**

It affects both, but Google has more sophisticated JavaScript rendering than most AI crawlers. Google uses a deferred rendering queue — it crawls the initial HTML immediately, then schedules a second pass with a headless Chromium browser. Most AI crawlers (GPTBot, PerplexityBot, ClaudeBot) have no deferred rendering — they see only the initial HTML. So JavaScript rendering issues block AI crawlers immediately and completely, while Google eventually sees the content with a delay.

**What is a soft 404?**

A soft 404 is a page that returns HTTP status 200 OK but displays error content — messages like “Page not found,” “Product no longer available,” or “No results found.” The server technically reports success, but the content is an error. AI crawlers may extract and cite the error message as if it were real content, or skip the page entirely. Fix soft 404s by returning a proper 404 for missing content and a 410 for permanently deleted content.

**How often do AI crawlers recrawl a page after a fix?**

Recrawl frequency varies by crawler and page importance. GPTBot and OAI-SearchBot recrawl high-traffic pages every few days; less important pages may be recrawled every 2–4 weeks. PerplexityBot tends to recrawl on-demand when users search for related topics. Google-Extended follows a schedule similar to Googlebot. After fixing an indexability issue, expect 1–2 weeks before the fix is reflected in AI citations.

**Does page speed affect AI visibility?**

Yes, in two ways. First, a slow response of over 1.8 seconds can cause AI crawlers to time out during the crawl, recording the page as unavailable. Second, Google’s AI Overviews uses Core Web Vitals as a quality signal — pages with poor LCP scores are less likely to be selected as AI Overview sources. Pages with TTFB under 600ms are crawled successfully 97% of the time; pages over 2 seconds see crawl timeouts in roughly 12% of attempts.

Coming Soon

Audit Your AI Search Visibility

See exactly how AI systems view your content and what to fix. Join the waitlist to get early access.

3 free audits · No credit card · Early access