Sitemap Checker

Free XML sitemap validator with 10 spec-grounded checks. Verifies Sitemap Protocol 0.9, Google's 50K URL / 50 MB caps, W3C lastmod, path-match, and entity escaping.

Defaults to /sitemap.xml if no path is given.

What this tool checks

All 10 checks are grounded in the documented Sitemap Protocol 0.9 specification (sitemaps.org) and Google's published Search Central constraints. Every flag links back to the underlying spec.

  1. XML root element

    <urlset> for regular sitemaps, <sitemapindex> for index files.

  2. Content-Type header

    Should be application/xml or text/xml. An empty header is tolerated; any other value triggers a warning.

  3. Sitemap namespace

    Confirms xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" per the spec.

  4. Entry count

    Compares the entry count against Google's cap of 50,000 URLs per file. A sitemap over the cap must be split into several files referenced from a sitemap index.

  5. Uncompressed size

    Google caps each file at 50 MB uncompressed. Larger files must be split.

  6. Absolute URLs

    Spec requires fully-qualified absolute URLs in <loc>. No relative paths.

  7. Path-match (single host)

    Google requires the sitemap and its listed URLs to share a host (unless cross-host verification is set up).

  8. <lastmod> W3C Datetime

    Validates the date format against the W3C Datetime spec. Invalid values are discounted by Google.

  9. <priority>/<changefreq>

    Flagged as informational only: Google ignores these tags.

  10. XML entity escaping

    Detects unescaped & characters in URLs, the most common XML parse error.
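Several of the structural checks above (root element, Content-Type, namespace, entry count) can be approximated with the Python standard library. The following is a minimal sketch under stated assumptions, not this tool's actual implementation: the function names classify_content_type and inspect_sitemap and the 'ok' / 'tolerated' / 'warn' labels are hypothetical and simply mirror the descriptions in the list.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_ENTRIES = 50_000  # per-file URL cap cited in check 4


def classify_content_type(header: str) -> str:
    """Return 'ok', 'tolerated' (empty header), or 'warn' (hypothetical labels)."""
    media_type = header.split(";", 1)[0].strip().lower()  # drop "; charset=..."
    if not media_type:
        return "tolerated"
    return "ok" if media_type in ("application/xml", "text/xml") else "warn"


def inspect_sitemap(xml_text: str) -> dict:
    """Report root kind, namespace validity, and entry count for one document."""
    root = ET.fromstring(xml_text)
    # ElementTree renders namespaced tags as "{namespace}localname".
    if root.tag.startswith("{"):
        namespace, local_name = root.tag[1:].split("}", 1)
    else:
        namespace, local_name = "", root.tag
    # <urlset> holds <url> entries; <sitemapindex> holds <sitemap> entries.
    child_name = {"urlset": "url", "sitemapindex": "sitemap"}.get(local_name)
    entries = []
    if child_name:
        child_tag = f"{{{namespace}}}{child_name}" if namespace else child_name
        entries = root.findall(child_tag)
    return {
        "root": local_name,
        "root_ok": child_name is not None,
        "namespace_ok": namespace == SITEMAP_NS,
        "entry_count": len(entries),
        "within_cap": len(entries) <= MAX_ENTRIES,
    }
```

Parsing the whole document in memory is fine for a sketch; a production checker handling files near the 50 MB cap would more likely stream the XML.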

Common sitemap errors (and what fixes them)

Couldn't fetch / sitemap returned 404

Almost always one of: robots.txt blocks the sitemap path, sitemap doesn't exist at /sitemap.xml, the server returns 5xx, or the path was renamed (e.g., /sitemap_index.xml). Declare the correct path in robots.txt via the Sitemap: directive.

Relative URLs in <loc>

Spec requires absolute URLs like https://site.com/page, not /page. Most CMS sitemap generators do this correctly; hand-rolled sitemaps are the common offender.
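A hand-rolled generator can guard against this with a quick check before writing each <loc>. The sketch below is illustrative only; is_absolute_url is a hypothetical helper, and restricting the scheme to http and https is an assumption rather than something the spec text above states.

```python
from urllib.parse import urlsplit


def is_absolute_url(loc: str) -> bool:
    """True when loc has both a scheme and a host, e.g. https://site.com/page."""
    parts = urlsplit(loc.strip())
    # Assumption: only http(s) URLs belong in a web sitemap.
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

With this helper, https://site.com/page passes, while /page, site.com/page, and the protocol-relative //site.com/page are all rejected.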

Unescaped & in query-string URLs

URLs like ?id=1&cat=2 must be written as <loc>https://site.com/page?id=1&amp;cat=2</loc>. Bare & breaks XML parsing.
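As a rough illustration, the standard library's xml.sax.saxutils.escape handles this when generating a sitemap, and a small regular expression can spot bare ampersands in an existing one. The helper names loc_element and has_bare_ampersand are hypothetical.

```python
import re
from xml.sax.saxutils import escape

# Matches "&" that does not start a known entity or character reference.
BARE_AMPERSAND = re.compile(r"&(?!(?:amp|lt|gt|apos|quot|#\d+|#x[0-9A-Fa-f]+);)")

# escape() covers &, <, > by default; quotes are added explicitly here.
QUOTE_ENTITIES = {"'": "&apos;", '"': "&quot;"}


def loc_element(url: str) -> str:
    """Wrap a raw (not yet escaped) URL in a <loc> element."""
    return f"<loc>{escape(url, QUOTE_ENTITIES)}</loc>"


def has_bare_ampersand(xml_fragment: str) -> bool:
    """True when the fragment contains an & that would break XML parsing."""
    return BARE_AMPERSAND.search(xml_fragment) is not None
```

Note that loc_element expects an unescaped URL; passing text that already contains &amp; would escape it a second time.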

Stale <lastmod> values

Google uses <lastmod> only when it considers the value 'consistently and verifiably accurate'; otherwise it discounts it. A sitemap regenerated quarterly with hand-written lastmod values is almost always inaccurate, so automate regeneration on publish.
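One way to keep <lastmod> honest is to derive it from a stored modification timestamp instead of typing it by hand. The sketch below assumes the timestamp is available as Unix seconds (for example from a CMS record or a file's mtime); format_lastmod and is_w3c_datetime are hypothetical helpers, and the pattern checks the W3C Datetime shape only, not calendar validity such as February 30.

```python
import re
from datetime import datetime, timezone

# W3C Datetime: YYYY, YYYY-MM, YYYY-MM-DD, or a date plus time with a
# mandatory timezone designator (Z or +hh:mm / -hh:mm).
W3C_DATETIME = re.compile(
    r"\d{4}"
    r"(?:-(?:0[1-9]|1[0-2])"
    r"(?:-(?:0[1-9]|[12]\d|3[01])"
    r"(?:T(?:[01]\d|2[0-3]):[0-5]\d(?::[0-5]\d(?:\.\d+)?)?"
    r"(?:Z|[+-](?:[01]\d|2[0-3]):[0-5]\d))?)?)?"
)


def format_lastmod(unix_seconds: float) -> str:
    """Render a modification timestamp as a UTC W3C Datetime string."""
    moment = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    return moment.isoformat(timespec="seconds")


def is_w3c_datetime(value: str) -> bool:
    """True when value has one of the W3C Datetime shapes."""
    return W3C_DATETIME.fullmatch(value.strip()) is not None
```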

Listing non-canonical or redirected URLs

Only include canonical URLs that return 200 — not pages that redirect, return 404, or have a different canonical tag pointing elsewhere.

Sitemap not declared in robots.txt

Add Sitemap: https://yourdomain.com/sitemap.xml to robots.txt. Optional but recommended — and it's the only documented hook AI crawlers (GPTBot, ClaudeBot, PerplexityBot) have for discovering sitemaps.
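To confirm the directive is actually present, the robots.txt body can be scanned for Sitemap: lines. This is a simplified sketch: sitemap_directives is a hypothetical helper that treats the field name case-insensitively and strips trailing # comments, which is an assumption about how lenient a reader should be.

```python
def sitemap_directives(robots_txt: str) -> list[str]:
    """Return every URL declared with a Sitemap: line, in file order."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep and key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls
```

An empty result for a site that does have a sitemap is the situation this section describes.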

Using <priority> and <changefreq> expecting them to influence ranking

Google explicitly ignores both. Not an error per se — just a wasted optimization effort. Remove on next regeneration to reduce file size.

Frequently asked questions

What does this sitemap checker actually validate?

10 checks per the documented Sitemap Protocol 0.9 spec (sitemaps.org) and Google's published constraints: (1) XML root element (<urlset> or <sitemapindex>), (2) Content-Type header, (3) Sitemap namespace, (4) Entry count vs Google's 50,000 cap, (5) Uncompressed size vs 50 MB cap, (6) Absolute URLs, (7) Single-host path-match, (8) <lastmod> W3C Datetime validity, (9) <priority>/<changefreq> usage flagged as informational (Google ignores these), (10) XML entity escaping. Each check links to the underlying spec or Google Search Central documentation.

Does Google still use <priority> and <changefreq>?

No. Google Search Central explicitly states it ignores both <priority> and <changefreq>. Many 2026 SEO guides still recommend setting these values — that advice is stale. Our checker flags their usage as informational (not an error) so you can clean them up at your next sitemap regeneration. <lastmod> IS still used by Google, but only when Google considers it 'consistently and verifiably accurate' — stale lastmod values are discounted.

What are Google's actual sitemap limits?

50,000 URLs per file, 50 MB uncompressed per file (Google Search Central, verified June 2026). A sitemap index can reference up to 50,000 child sitemaps, each with its own 50K/50MB cap. News sitemaps have a separate hard cap of 1,000 URLs. Gzip compression is allowed; the 50 MB limit still applies to the uncompressed size. UTF-8 encoding is required. URLs must be fully-qualified absolute URLs with all special characters entity-escaped (& → &amp;).
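Because the size limit applies to the uncompressed bytes, a gzipped sitemap has to be decompressed before it is measured. The sketch below is an assumption-laden illustration: uncompressed_size and within_size_cap are hypothetical helpers, 50 MB is taken as 52,428,800 bytes (the figure sitemaps.org gives), and gzip is detected by its two-byte magic number.

```python
import gzip

MAX_BYTES = 52_428_800  # 50 MB, measured on the uncompressed document
GZIP_MAGIC = b"\x1f\x8b"


def uncompressed_size(payload: bytes) -> int:
    """Size in bytes after gzip decompression, if the payload is gzipped."""
    if payload[:2] == GZIP_MAGIC:
        payload = gzip.decompress(payload)
    return len(payload)


def within_size_cap(payload: bytes) -> bool:
    """True when the uncompressed sitemap fits the 50 MB cap."""
    return uncompressed_size(payload) <= MAX_BYTES
```

Decompressing fully in memory keeps the example short; for untrusted input, a streaming read with an upper bound would be the safer design.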

Do AI crawlers like GPTBot and ClaudeBot use sitemap.xml?

Not officially documented. GPTBot (OpenAI), ClaudeBot (Anthropic), and PerplexityBot all obey robots.txt — and robots.txt can include a Sitemap: directive. That's the only documented hook for AI crawlers to discover your sitemap. Whether each AI crawler actually fetches and uses sitemap.xml the way Googlebot does is not publicly confirmed by OpenAI, Anthropic, or Perplexity. The safest practice is to declare Sitemap: in robots.txt and keep the sitemap reachable — if AI crawlers want to use it, they can. Check whether AI crawlers can access your site with our AI Bot Checker.

What's the difference between a sitemap and a sitemap index?

A regular sitemap (<urlset>) lists individual page URLs. A sitemap index (<sitemapindex>) lists other sitemaps, which is useful for large sites with more than 50,000 URLs (the per-file Google cap). For example, an ecommerce site with 500,000 products would need at least 10 child sitemaps, each holding up to 50,000 URLs, all referenced from one sitemap-index.xml at the root. This checker auto-detects which kind it found and reports accordingly.
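The split described here is simple arithmetic plus one small index document. The following sketch shows one possible shape; chunk_urls and build_sitemap_index are hypothetical helpers, and the shop.example URLs are placeholders.

```python
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000


def chunk_urls(urls: list[str], size: int = MAX_URLS_PER_FILE) -> list[list[str]]:
    """Split a URL list into consecutive groups of at most `size` URLs."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(child_sitemap_urls: list[str]) -> str:
    """Render a <sitemapindex> document that references each child sitemap."""
    entries = "".join(
        f"  <sitemap><loc>{escape(url)}</loc></sitemap>\n"
        for url in child_sitemap_urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<sitemapindex xmlns="{SITEMAP_NS}">\n{entries}</sitemapindex>\n'
    )


# Usage: 500,000 product URLs split into 10 files of 50,000 each.
products = [f"https://shop.example/p/{n}" for n in range(500_000)]
chunks = chunk_urls(products)
index_xml = build_sitemap_index(
    [f"https://shop.example/sitemap-{n}.xml" for n in range(1, len(chunks) + 1)]
)
```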

Why does the checker say my sitemap is on the wrong host?

Google's path-match rule: a sitemap can only list URLs at the same protocol + host as the sitemap itself, unless you've set up Search Console cross-site verification. A sitemap at https://example.com/sitemap.xml listing URLs at https://blog.example.com/* will trigger this warning, because Google ignores the cross-host entries in that sitemap. Fix: either move the sitemap to the same host as the URLs, or set up cross-site verification in Search Console.
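The comparison behind this warning can be sketched as a protocol-plus-host equality test. same_origin is a hypothetical helper, and it deliberately ignores the cross-site verification exception, which cannot be determined from the URLs alone.

```python
from urllib.parse import urlsplit


def same_origin(sitemap_url: str, loc: str) -> bool:
    """True when loc shares the sitemap's protocol and host (including port)."""
    a, b = urlsplit(sitemap_url), urlsplit(loc)
    return (a.scheme.lower(), a.netloc.lower()) == (b.scheme.lower(), b.netloc.lower())
```

For a sitemap at https://example.com/sitemap.xml, a URL under https://example.com/ passes, while https://blog.example.com/post and http://example.com/page both fail.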

What schema namespace should my sitemap use?

Sitemap Protocol 0.9 namespace: xmlns="http://www.sitemaps.org/schemas/sitemap/0.9". Last updated November 21, 2016 — no newer protocol exists. The same namespace applies to both <urlset> and <sitemapindex> roots. For specialized sitemaps (image, video, news), additional namespaces are layered on top — xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" for image sitemaps, for example.
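To show how the namespaces layer in practice, here is a sketch that emits a <urlset> with optional image entries using xml.etree.ElementTree. build_urlset and its (loc, image URLs) input shape are assumptions made for illustration, and the example.com URLs are placeholders.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"

# Serialize the sitemap namespace as the default xmlns and images as image:.
# Note: register_namespace changes process-wide ElementTree state.
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("image", IMAGE_NS)


def build_urlset(pages: list[tuple[str, list[str]]]) -> str:
    """Build a <urlset> from (page URL, image URLs) pairs."""
    root = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, image_urls in pages:
        url = ET.SubElement(root, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        for image_url in image_urls:
            image = ET.SubElement(url, f"{{{IMAGE_NS}}}image")
            ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = image_url
    # ElementTree escapes &, <, > in text nodes during serialization.
    return ET.tostring(root, encoding="unicode")
```

The image namespace declaration only appears in the output when at least one image element is present, so plain sitemaps carry just the 0.9 namespace.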

How often should I update my sitemap?

Automatically, every time a page is published or significantly updated. Static or quarterly-regenerated sitemaps with stale <lastmod> values get discounted by Google. Most modern CMSes and site frameworks regenerate sitemaps on each publish (WordPress, Next.js, Shopify, Webflow, Squarespace). If you're hand-rolling a sitemap, set up an automated build that runs at deploy time. Stale sitemaps are a common cause of indexability issues.

Full AI SEO audit

Want a full AI search visibility audit?

A valid sitemap is one signal of 250+ TurboAudit checks. We audit page-level AI citation readiness across 7 dimensions anchored to the Princeton GEO paper. 5 free audits, no credit card.
