
How to Configure Robots.txt and XML Sitemaps for Maximum SEO Crawl Efficiency

How-To · 8 min read · February 1, 2025

Robots.txt and XML sitemaps are the two technical SEO files that directly communicate with search engine crawlers. They serve complementary functions: robots.txt tells crawlers where not to go; sitemaps tell crawlers where your important content lives. Together, they give you direct influence over how search engines discover and prioritize your pages.

Most websites set these up once and never revisit them — which is a mistake. As your site grows, adds new URL structures, changes platforms, or needs to opt out of AI training crawlers, your robots.txt and sitemaps need to evolve. This guide walks through everything from initial setup to advanced use cases like sitemap indexes, image sitemaps, and blocking AI training bots.

Understanding the distinction between crawling and indexing is essential before touching either file. Robots.txt controls crawling — which URLs bots are permitted to request from your server. Noindex meta tags control indexing — whether a crawled page is included in search results. They are different levers for different problems, and confusing them leads to common SEO mistakes.
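
For example, the following robots.txt lines keep compliant bots from requesting anything under /private/ (an illustrative path), while the meta tag shown after them lets bots crawl a page but asks them not to index it:

User-agent: *
Disallow: /private/

<meta name="robots" content="noindex">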

Robots.txt Fundamentals

The robots.txt file is a plain text file placed at the root of your domain (yourdomain.com/robots.txt). Crawlers check this file before accessing any other URL on your site. The file uses a simple directive syntax:

User-agent: Googlebot
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

User-agent: *
Disallow: /?s=
Disallow: /tag/
Disallow: /page/

Sitemap: https://yourdomain.com/sitemap.xml

Syntax rules:

  • Each rule set begins with one or more User-agent: lines identifying which bot(s) the rules apply to
  • User-agent: * applies to all bots not mentioned by name above
  • Disallow: with an empty value means allow everything (no restriction)
  • Disallow: / means block access to the entire site
  • Rules are case-sensitive: /Admin and /admin are different paths
  • Comments start with # and are ignored by parsers
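
A short example that combines these rules, with illustrative paths:

# Bingbot may crawl everything (empty Disallow means no restriction)
User-agent: Bingbot
Disallow:

# All other bots: /Admin/ is blocked, but /admin/ is not (paths are case-sensitive)
User-agent: *
Disallow: /Admin/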

Order of rules:

Crawlers obey only the most specific User-agent group that matches them. If a group is addressed to Googlebot by name, Googlebot follows that group's rules and ignores the * wildcard group entirely; bots with no named group fall back to the wildcard rules. Within a group, when Allow and Disallow rules conflict, Google follows the rule with the more specific (longer) path, which is why the admin-ajax.php Allow line in the example above overrides the broader /wp-admin/ Disallow.
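
For example, given the file below, Googlebot may still crawl /tag/ pages: only the group addressed to it applies, and that group does not mention /tag/ (the paths are illustrative):

User-agent: Googlebot
Disallow: /wp-admin/

User-agent: *
Disallow: /wp-admin/
Disallow: /tag/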

What to Block in Robots.txt — and What to Allow

Deciding what to disallow requires understanding which URLs have SEO value and which consume crawl budget without contributing to rankings. Crawl budget — the number of pages Googlebot will crawl on your site in a given time period — is finite; it becomes a real constraint mainly on large sites with many thousands of URLs, but sites of any size benefit from not spending crawls on worthless URLs. A combined example robots.txt follows the three lists below.

Always disallow (no SEO value):

  • /wp-admin/ — WordPress admin interface
  • /?s= — WordPress search result pages (thin, near-duplicate content)
  • /cart/ and /checkout/ — E-commerce transaction pages
  • /account/ — User account pages
  • /api/ — API endpoints (often noisy and machine-readable)
  • Staging subdirectories if served from the same domain

Conditionally disallow (evaluate per site):

  • /tag/ — Tag archive pages (valuable if tags are keyword-focused; block if they are thin)
  • /author/ — Author archives (block on single-author blogs; allow on multi-author publications)
  • /page/ — Paginated archives (Googlebot can crawl pagination, but it consumes crawl budget on low-traffic sites)

Never disallow:

  • Your main content URLs
  • Your sitemap URL (counterproductive to block the file that helps Googlebot discover pages)
  • CSS and JavaScript files (blocking these prevents Google from rendering your pages correctly)
  • Images that you want indexed in Google Image Search
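
Putting the always-disallow rules together, a reasonable starting point for a WordPress or e-commerce site might look like the following; treat it as a sketch and adjust the paths to your own URL structure, since not every path applies to every platform:

User-agent: *
Disallow: /wp-admin/
Disallow: /?s=
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /api/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://yourdomain.com/sitemap.xml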

Blocking AI Training Crawlers

Since 2023, AI companies have deployed training crawlers that systematically scrape web content to train large language models. Unlike search engine crawlers that bring traffic in exchange for indexing, AI training crawlers consume server bandwidth and copy content without providing any direct benefit to the site owner.

Known AI training crawler user agents:

  • GPTBot — OpenAI's training crawler
  • CCBot — Common Crawl (used by many AI research projects and datasets)
  • Claude-Web — Anthropic's crawler
  • Google-Extended — Google's AI training opt-out token (not a separate crawler; it controls whether content fetched by Google's crawlers may be used to train its AI models)
  • PerplexityBot — Perplexity AI
  • Omgilibot — Omgili / Webz.io
  • Diffbot — Diffbot AI data extraction

Robots.txt rules to block all major AI training bots:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

Important: These rules only affect crawlers that respect robots.txt. Reputable AI labs honor it, but many scrapers do not. Blocking the known compliant crawlers here is effective; for persistent non-compliant scrapers, you need additional layers such as server-level user-agent filtering, rate limiting, or IP blocking at the CDN.
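
As one illustration, here is a minimal nginx sketch that refuses requests whose User-Agent header matches known scraper names. The bot names are examples only; in practice you would build and maintain the list from your own server logs:

# Illustrative nginx config (http context), e.g. /etc/nginx/conf.d/block-scrapers.conf
# Map the User-Agent header to a flag; ~* makes the match case-insensitive
map $http_user_agent $blocked_bot {
    default        0;
    ~*Bytespider   1;
    ~*CCBot        1;
}

server {
    listen 80;
    server_name yourdomain.com;

    # Refuse matched bots outright
    if ($blocked_bot) {
        return 403;
    }

    # ... rest of your site configuration ...
}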


XML Sitemap Structure and Best Practices

An XML sitemap lists the URLs on your site that you want search engines to know about and provides optional metadata — last modification date, change frequency, and relative priority — as hints for crawlers. Note that Google has stated it ignores the changefreq and priority values and uses lastmod only when it is kept accurate, so treat these fields as optional hints rather than ranking levers.

Basic XML sitemap structure:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://yourdomain.com/</loc>
    <lastmod>2025-01-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://yourdomain.com/blog/post-title</loc>
    <lastmod>2025-01-10</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.7</priority>
  </url>
</urlset>

Sitemap limits:

  • Maximum 50,000 URLs per sitemap file
  • Maximum 50 MB uncompressed per file
  • For larger sites, use a sitemap index file pointing to multiple individual sitemaps

Sitemap index structure:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://yourdomain.com/sitemap-pages.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://yourdomain.com/sitemap-posts.xml</loc>
  </sitemap>
</sitemapindex>

Include only canonical, indexable URLs. Exclude pages with noindex tags, redirects, 404 pages, duplicate content, and any URL blocked in robots.txt (which would be contradictory). Quality over quantity applies — a sitemap with 1,000 high-quality pages is more effective than one with 10,000 mixed-quality pages.


Submitting Your Sitemap to Google and Bing

Placing a sitemap on your server makes it available, but proactively submitting it to search engine webmaster tools ensures it is processed promptly.

Google Search Console submission:

1. Go to search.google.com/search-console

2. Select your property

3. Click Sitemaps in the left navigation

4. Enter the URL path of your sitemap (e.g., sitemap.xml) and click Submit

5. Refresh after a few minutes to see the submission status and URL counts

Bing Webmaster Tools submission:

1. Go to bing.com/webmasters

2. Select your site

3. Click Sitemaps in the left navigation

4. Click Submit Sitemap and enter your sitemap URL

Robots.txt sitemap reference (automated discovery):

Add a Sitemap: directive to your robots.txt file:

Sitemap: https://yourdomain.com/sitemap.xml

This allows any crawler that reads robots.txt to discover your sitemap automatically — not just Googlebot and Bingbot. Useful for ensuring coverage across all compliant crawlers.
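
The directive may appear more than once, and it can point to a sitemap index file. For example, a site that splits its sitemaps might list both:

Sitemap: https://yourdomain.com/sitemap-pages.xml
Sitemap: https://yourdomain.com/sitemap-posts.xml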

Monitoring sitemap health:

After submission, watch Google Search Console for:

  • Errors in sitemap parsing (malformed XML, invalid URLs)
  • Discovered URLs count (Google shows how many URLs it found)
  • Indexed URLs vs discovered URLs (a large gap indicates indexing issues on those pages)

Frequently Asked Questions

What is the difference between robots.txt and a noindex tag?
Robots.txt controls crawling — it tells bots whether they are allowed to request a URL from your server. A noindex meta tag controls indexing — it tells crawlers they may visit the page but should not include it in search results. A page blocked by robots.txt can still be indexed if it has inbound links. For pages you want excluded from search results with certainty, use noindex rather than robots.txt.
How often should I update my XML sitemap?
Update your sitemap whenever you add, remove, or significantly update pages. Most CMS platforms and SEO plugins update sitemaps automatically when content changes. For manually maintained sitemaps, update the lastmod date whenever you publish new content and remove URLs for deleted or redirected pages.
Does a larger sitemap always mean better SEO?
No. Only include URLs in your sitemap that are canonical, indexable, and have genuine content value. Submitting thousands of low-quality pages inflates your sitemap without improving crawl efficiency. Overall site quality influences how Google allocates crawl budget, and a clean sitemap of high-quality URLs helps that budget get spent on the pages that matter.
Can robots.txt block a page from being indexed?
Robots.txt blocking prevents crawling but does not guarantee a page will not be indexed. If a blocked page has inbound links from other indexed pages, Google can index the URL with a generic snippet without crawling it. To reliably prevent indexing, use a noindex meta tag on the page — which requires the page to be crawlable so the tag can be read.
Should my sitemap and robots.txt be consistent with each other?
Yes. Never include a URL in your sitemap that is also blocked in robots.txt — submitting a URL you are simultaneously blocking sends conflicting signals and wastes crawl budget. Google may choose to crawl the sitemap-listed URL anyway to resolve the conflict. Audit both files together periodically to ensure they are consistent.

Summary

Robots.txt and XML sitemaps are the crawl control layer of technical SEO — they shape how search engines discover and prioritize your content. Most sites configure these once and achieve a functional baseline, but as your site grows, a periodic audit of both files pays dividends in improved crawl efficiency and faster content discovery.

The key discipline is consistency: every URL in your sitemap should be accessible (not blocked in robots.txt), canonical, and indexable. Every URL you want excluded from search results should use noindex rather than robots.txt blocking for reliable results. Keep your sitemap current as content changes. Reference your sitemap in robots.txt for maximum crawler discovery. Submit to Google Search Console and monitor the results. These fundamentals, maintained consistently, give your site the best possible foundation for technical SEO.
