Block AI Training Crawlers with Robots.txt
Generate robots.txt rules to block GPTBot, CCBot, Claude-Web, and other AI training crawlers from scraping your website content.
Settings guide
Block all AI crawlers at once:
Add a `User-agent: *` group with `Disallow: /` as a catch-all, then add explicit groups that allow search engine crawlers. Compliant crawlers obey only the most specific group matching their user agent, so Googlebot follows its own allow group rather than the wildcard block. This approach blocks any new AI bots without requiring future updates.
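A minimal sketch of the catch-all approach (add an allow group for every search crawler you rely on):

```
# Block everything by default
User-agent: *
Disallow: /

# Explicit allow groups for search crawlers; a compliant bot
# uses only the group that best matches its user agent
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /
```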
Block specific AI bots only:
Add an individual block such as `User-agent: GPTBot` for each crawler you want to stop. This lets you block AI training bots while still allowing Google, Bing, and other search crawlers full access, with no wildcard rule to maintain.
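A minimal sketch of blocking one specific bot (an empty `Disallow:` value means "allow everything"):

```
# Block OpenAI's training crawler only
User-agent: GPTBot
Disallow: /

# Everyone else, including Googlebot and Bingbot, stays allowed
User-agent: *
Disallow:
```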
Common AI crawler user agents to block:
- GPTBot: OpenAI's training crawler
- CCBot: Common Crawl (used by many AI projects)
- Claude-Web: Anthropic's crawler (newer Anthropic crawls identify as ClaudeBot)
- Google-Extended: Google's AI-training opt-out token, honored by Google's crawlers
- PerplexityBot: Perplexity AI
- Omgilibot: Omgili / Webz.io AI data
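Putting the list above together, a robots.txt that blocks only these AI crawlers while leaving search engines untouched might look like this (the sitemap URL is a placeholder for your own):

```
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Omgilibot
Disallow: /

Sitemap: https://yourdomain.com/sitemap.xml
```

No `User-agent: *` group appears, so unlisted crawlers keep their default full access.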
Keep Googlebot and Bingbot allowed:
If you add a global `Disallow: /` under `User-agent: *` without explicit allow groups for Googlebot, Bingbot, and other search crawlers, you will block search indexing along with the AI bots. When in doubt, block only the specific AI user agents to preserve your organic search visibility.
Format comparison
Robots.txt vs server-level blocking:
Robots.txt is the polite opt-out that compliant crawlers respect. Server-level blocking (via Cloudflare rules, nginx deny directives, or WAF rules) works regardless of bot compliance. Use robots.txt first — it is easier to deploy and covers major AI labs. Add server-level blocks for scrapers that ignore robots.txt.
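As a sketch of the server-level approach, an nginx configuration can refuse requests whose User-Agent header matches known AI crawlers (the `$ai_crawler` variable name is arbitrary, and the bot list is illustrative):

```nginx
# In the http{} block: flag requests from known AI crawlers
# by case-insensitive regex match on the User-Agent header
map $http_user_agent $ai_crawler {
    default         0;
    ~*GPTBot        1;
    ~*CCBot         1;
    ~*Claude-Web    1;
    ~*PerplexityBot 1;
}

# In the server{} block: refuse flagged requests outright
if ($ai_crawler) {
    return 403;
}
```

Unlike robots.txt, this works even for scrapers that never read the file, though it cannot catch bots that spoof a browser User-Agent.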
Blocking AI crawlers vs blocking all crawlers:
A blanket `Disallow: /` for all user agents, with no explicit allow groups, blocks Googlebot too, destroying your organic rankings. Target only the specific AI user agent strings. This gives you an AI training opt-out without any impact on search engine crawling.
How it works
Select AI bots to block
Choose from the list of known AI training crawlers: GPTBot, CCBot, Claude-Web, Google-Extended, and others.
Configure search engine access
Verify that Googlebot, Bingbot, and other search crawlers remain in the allow list.
Add your sitemap URL
Include your sitemap location at the bottom of the file so search crawlers can find all your pages.
Download and deploy
Download the generated robots.txt and upload it to your domain root at yourdomain.com/robots.txt.
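After deploying, you can sanity-check the rules locally. A sketch using Python's standard `urllib.robotparser`, parsing file content directly rather than fetching it (the rules shown are an illustrative example, not this generator's exact output):

```python
from urllib.robotparser import RobotFileParser

# Example rules: block GPTBot, allow everyone else
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# GPTBot matches its own group and is blocked from every path
print(parser.can_fetch("GPTBot", "https://yourdomain.com/article"))    # False

# Googlebot falls through to the wildcard group and is allowed
print(parser.can_fetch("Googlebot", "https://yourdomain.com/article")) # True
```

Swapping in your real file's contents lets you confirm each search crawler is still allowed before the change goes live.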
About this format
AI companies train large language models on content crawled from the public web. OpenAI's GPTBot, Common Crawl's CCBot, Anthropic's Claude-Web, Google's Google-Extended, and a growing number of other AI training crawlers index your content without contributing traffic or compensation. Many publishers and website owners want to opt out.
The Robots Exclusion Protocol allows you to specify which bots can access your site and which paths they may crawl. AI training crawlers — when they respect robots.txt — will honor `Disallow: /` directives targeting their user agent name. This generator produces a robots.txt file with properly formatted `User-agent` and `Disallow` blocks for all major AI training crawlers, ready to deploy at your domain root.
Note that robots.txt compliance is voluntary. Compliant crawlers such as GPTBot and Google-Extended honor these rules; less scrupulous scrapers do not. Robots.txt is the first and easiest line of defense — for stronger protection against non-compliant scrapers, add server-level IP blocks or rate limiting at the CDN.