
Block AI Training Crawlers with Robots.txt


Generate robots.txt rules to block GPTBot, CCBot, Claude-Web, and other AI training crawlers from scraping your website content.


Settings guide

Block all AI crawlers at once:

Add a User-agent: * group with Disallow: / as a catch-all, then add explicit groups allowing search engine crawlers. Crawlers follow the most specific group that matches their user agent, so Googlebot obeys its own group and ignores the catch-all. This approach blocks any new AI bots without requiring future updates.
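
A minimal sketch of this approach, assuming you only want Google and Bing allowed back in (add further search crawlers as needed):

```
# Catch-all: block every bot that has no more specific group below.
User-agent: *
Disallow: /

# Search crawlers match their own groups and ignore the catch-all.
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /
```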

Block specific AI bots only:

Add individual User-agent: GPTBot blocks if you want to allow some crawlers while blocking others. This lets you block AI training bots while still allowing Google, Bing, and other search crawlers full access.
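
For example, a file that blocks only OpenAI's training crawler and leaves everything else untouched:

```
# Block GPTBot only; all other crawlers are unaffected.
User-agent: GPTBot
Disallow: /
```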

Common AI crawler user agents to block (combined into a single robots.txt block after this list):

  • GPTBot — OpenAI's training crawler
  • CCBot — Common Crawl (used by many AI projects)
  • Claude-Web — Anthropic's crawler
  • Google-Extended — Google's AI training opt-out bot
  • PerplexityBot — Perplexity AI
  • Omgilibot — Omgili / Webz.io AI data
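
Expressed as robots.txt directives, the list above becomes the block below. Per the Robots Exclusion Protocol, several User-agent lines can share one rule set, so a single Disallow covers all six:

```
User-agent: GPTBot
User-agent: CCBot
User-agent: Claude-Web
User-agent: Google-Extended
User-agent: PerplexityBot
User-agent: Omgilibot
Disallow: /
```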

Keep Googlebot and Bingbot allowed:

Never ship a blanket Disallow: / under User-agent: * without explicit allow groups for search crawlers, or you will block Googlebot and Bingbot along with the AI bots. If in doubt, block only the specific AI user agents to preserve your organic search visibility.

Format comparison

Robots.txt vs server-level blocking:

Robots.txt is the polite opt-out that compliant crawlers respect. Server-level blocking (via Cloudflare rules, nginx deny directives, or WAF rules) works regardless of bot compliance. Use robots.txt first — it is easier to deploy and covers major AI labs. Add server-level blocks for scrapers that ignore robots.txt.
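
As a rough nginx sketch of the server-level approach (the bot names mirror the list above; note this matches the User-Agent header, which determined scrapers can spoof):

```
# In the http {} context: flag requests whose User-Agent matches known AI bots.
map $http_user_agent $block_ai_bot {
    default         0;
    ~*GPTBot        1;
    ~*CCBot         1;
    ~*Claude-Web    1;
    ~*PerplexityBot 1;
}

server {
    # Refuse flagged requests regardless of robots.txt compliance.
    if ($block_ai_bot) {
        return 403;
    }
}
```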

Blocking AI crawlers vs blocking all crawlers:

A blanket Disallow: / for all user agents blocks Googlebot too, destroying your organic rankings. Target only the specific AI user agent strings. This gives you AI training opt-out without any impact on search engine crawling.

How it works

1. Select AI bots to block

Choose from the list of known AI training crawlers: GPTBot, CCBot, Claude-Web, Google-Extended, and others.

2. Configure search engine access

Verify that Googlebot, Bingbot, and other search crawlers remain in the allow list.

3. Add your sitemap URL

Include your sitemap location at the bottom of the file so search crawlers can find all your pages.
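
The directive is a single line at the end of the file; the URL below is a placeholder for your real sitemap location:

```
Sitemap: https://yourdomain.com/sitemap.xml
```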

4. Download and deploy

Download the generated robots.txt and upload it to your domain root at yourdomain.com/robots.txt.
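
Once uploaded, a quick fetch confirms the file is live at the root (using the same placeholder domain):

```
curl -s https://yourdomain.com/robots.txt
```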

About this format

AI companies train large language models on content crawled from the public web. OpenAI's GPTBot, Common Crawl's CCBot, Anthropic's Claude-Web, Google's Google-Extended, and a growing number of other AI training crawlers collect your content without contributing traffic or compensation. Many publishers and website owners want to opt out.

The Robots Exclusion Protocol allows you to specify which bots can access your site and which paths they may crawl. AI training crawlers — when they respect robots.txt — will honor `Disallow: /` directives targeting their user agent name. This generator produces a robots.txt file with properly formatted `User-agent` and `Disallow` blocks for all major AI training crawlers, ready to deploy at your domain root.

Note that robots.txt compliance is voluntary. Compliant crawlers like GPTBot and Google-Extended honor these rules; less scrupulous scrapers do not. Treat robots.txt as the first and easiest line of defense, and add server-level IP blocks or CDN-level rate limiting for stronger protection against non-compliant scrapers.

Frequently asked questions

Does blocking GPTBot affect my Google search rankings?
No. GPTBot is OpenAI's training crawler and is completely separate from Googlebot. Blocking GPTBot in robots.txt has zero effect on Google Search crawling, indexing, or your organic rankings. The two bots operate independently under different user agent strings. Only add Disallow rules under User-agent: Googlebot if you intentionally want to restrict Google Search from specific paths on your site.
Are AI companies required to respect robots.txt?
There is no legal requirement, but major AI labs have committed to respecting robots.txt. OpenAI, Anthropic, and Google have all stated their training crawlers honor Disallow directives. Less established scrapers and data brokers may not comply — for those, server-level blocking is more effective.
Can I block AI crawlers but still appear in AI search products?
Blocking training crawlers prevents your content from being used in future model training, but it does not remove your content from models already trained on it. Some AI search products use live web search rather than pre-trained data — blocking training crawlers does not necessarily block real-time AI search results.
How do I check if my robots.txt is blocking AI bots correctly?
Use the robots.txt report in Google Search Console to validate your file syntax (the standalone robots.txt Tester has been retired). To verify specific bot blocking, check your server access logs for requests from the user agent strings you have blocked; compliant bots should stop appearing after the robots.txt is deployed.
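
A quick log check might look like this; the log path is an assumption, so adjust it to your server setup:

```
# Count recent requests identifying as GPTBot; this should trend toward zero.
grep -c "GPTBot" /var/log/nginx/access.log
```
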
What happens if I accidentally block Googlebot with my robots.txt?
Googlebot respects Disallow rules and will stop crawling blocked paths. Pages that cannot be crawled may be dropped from Google's index over time. If you block Googlebot accidentally, fix the robots.txt immediately and use Google Search Console's URL Inspection tool to request recrawling of affected pages.
