Robots.txt AI Crawler Blocker: Stop GPTBot, ClaudeBot, Google-Extended, Perplexity & 17 Other AI Bots

by People Are Geek
May 27, 2026
in Security Tools, SEO Tools

Local robots.txt generator for AI crawlers

Build a robots.txt file that blocks the AI training crawlers and in-product fetchers used by ChatGPT, Claude, Gemini, Perplexity, Meta AI, Apple Intelligence, Common Crawl and 15 other AI bots. Choose which ones to block, keep search engines like Googlebot and Bingbot crawling, add your sitemap URL and custom paths, then copy the file or download it. The generator runs entirely in your browser; no data is sent anywhere.

robots.txt is voluntary. The major AI vendors (OpenAI, Anthropic, Google, Perplexity, Apple, Meta) state that they honour it. For hard enforcement, also add the server-level blocks (.htaccess or Nginx) that this tool generates.

What an AI crawler blocker does for your site

Generative AI systems collect web content in two distinct ways. Training crawlers, like OpenAI’s GPTBot, Anthropic’s ClaudeBot and Common Crawl’s CCBot, sweep the open web to build the datasets that future model versions will learn from. Google-Extended belongs to the same group, although it is a robots.txt control token read by Google’s existing crawlers rather than a separate crawler. In-product agents, like ChatGPT-User, Perplexity-User and Claude’s web fetcher, do live retrieval when a chatbot needs an up-to-date page to answer a user prompt. Each kind of bot has its own user-agent name and its own purpose, and many of them now honour robots.txt as the standard opt-out signal.

This generator builds a robots.txt file that addresses every documented AI bot we know about as of 2026: GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, claude-web, anthropic-ai, Google-Extended, Applebot-Extended, PerplexityBot, Perplexity-User, Meta-ExternalAgent, FacebookBot, Bytespider, CCBot, Amazonbot, Diffbot, omgilibot, YouBot, Kagibot, Cohere-AI, Timpibot and a few legacy variants. You toggle each bot individually or use a preset like “Block AI training” or “EU AIA Article 4(3) opt-out”. The tool keeps Googlebot, Bingbot and other search crawlers untouched by default so your organic visibility stays intact.
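As a rough illustration of the kind of file described above, here is a minimal Python sketch. It is not the tool's own code: the function name `build_robots_txt`, the `AI_BOTS` subset and the example.com sitemap URL are assumptions made for the example.

```python
# Hypothetical helper for illustration only; not the generator's actual code.
AI_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"]  # subset of the list above


def build_robots_txt(blocked_bots, sitemap_url=None, scope="/"):
    """Return robots.txt text that disallows `scope` for each named bot."""
    lines = []
    for bot in blocked_bots:
        lines.append(f"User-agent: {bot}")
        lines.append(f"Disallow: {scope}")
        lines.append("")  # blank line separates groups
    # Every other crawler (Googlebot, Bingbot, ...) falls through to this group.
    lines.append("User-agent: *")
    lines.append("Allow: /")
    if sitemap_url:
        lines.append("")
        lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


print(build_robots_txt(AI_BOTS, sitemap_url="https://example.com/sitemap.xml"))
```

Search crawlers are never named in the output, so they match only the catch-all group and stay allowed.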

How AI crawler blocking with robots.txt works

The Robots Exclusion Standard, originally drafted in 1994 and formalised in RFC 9309 in 2022, defines a plain-text file at /robots.txt on the root of a domain. The file lists one or more User-agent blocks, each followed by Allow and Disallow directives. When a crawler arrives, it fetches the file, finds the block that matches its user-agent name, and obeys the rules. The mechanism is voluntary: there is no technical enforcement, only a public convention that reputable crawler operators follow. The largest AI vendors publish their bot names and have committed to honouring robots.txt.
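The matching step can be tried out with Python's standard-library `urllib.robotparser`. This is a sketch of how one compliant parser reads the file, not a statement about how any particular AI crawler is implemented; the example.com URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# GPTBot matches its own group; Googlebot falls back to the "*" group.
gptbot_allowed = parser.can_fetch("GPTBot", "https://example.com/blog/post")
googlebot_allowed = parser.can_fetch("Googlebot", "https://example.com/blog/post")
print(gptbot_allowed, googlebot_allowed)  # False True
```

Note that this parser applies the first matching rule within a group, whereas RFC 9309 specifies the most specific (longest) match, so results can differ for files that mix overlapping Allow and Disallow paths.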

  1. Pick the bots you want to block. Training crawlers are the obvious target if you want to opt out of being used as model fuel. In-product fetchers matter if you do not want your live pages summarised by other people’s chatbots.
  2. Decide on the path scope. Disallow: / blocks the entire site for that user-agent. You can also block sub-paths only, for example /blog/ or /archive/, and leave the rest crawlable.
  3. Keep search engines crawling by not adding their user-agents to the block list. Googlebot, Bingbot, DuckDuckBot, and others should remain explicitly or implicitly allowed.
  4. Add server-level enforcement when the legal or commercial stakes are high. The Nginx and Apache blocks generated by this tool reject the listed user-agents with HTTP 403 even if they ignore robots.txt.
  5. Deploy and verify by uploading the file to your web root and fetching https://yourdomain.com/robots.txt. The tool’s verification panel suggests the exact lines to check.
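Steps 2 and 5 can be checked offline before deploying. The sketch below is an assumption-laden example, not the tool's verification panel: the `audit` function, the bot lists and the example.com URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser


def audit(robots_text, must_block, must_allow, url):
    """Return a list of mismatches between the file and the intended policy."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    problems = []
    for bot in must_block:
        if parser.can_fetch(bot, url):
            problems.append(f"{bot} can still fetch {url}")
    for bot in must_allow:
        if not parser.can_fetch(bot, url):
            problems.append(f"{bot} is blocked from {url}")
    return problems


# Sub-path scope: only /blog/ is closed to the two training crawlers.
SCOPED = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /blog/
"""

print(audit(SCOPED, ["GPTBot", "CCBot"], ["Googlebot"], "https://example.com/blog/post"))
print(audit(SCOPED, ["GPTBot"], [], "https://example.com/pricing"))
```

The first call prints an empty list because the policy holds for /blog/. The second reports GPTBot, which shows that a sub-path rule leaves the rest of the site crawlable. To check the deployed file instead of a string, `parser.set_url(...)` followed by `parser.read()` could replace the `parse` call.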

Common use cases for blocking AI crawlers

  • Publisher protecting editorial work. If your business model depends on visits to your articles, having a language model that summarises everything you publish without sending traffic back is a long-term threat. Blocking training crawlers is the cleanest signal that your content is not free training material.
  • SaaS hiding paid documentation. Knowledge bases that sit behind a paywall or a login should not show up in scraped training corpora. Blocking GPTBot and CCBot reduces the risk that paying customers’ answers leak into a public model.
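  • EU rightholder exercising Article 4(3) opt-out. Article 4(3) of the 2019 EU Copyright Directive lets rightholders reserve their content from text and data mining, and for content published online that reservation has to be machine-readable. The EU AI Act requires providers of general-purpose AI models to identify and comply with such reservations. The “EU AIA opt-out” preset on this page produces the user-agent blocks that match what major AI vendors have announced they will respect.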
  • Brand keeping AI answers consistent. Some brands prefer that AI products reference their authoritative help centre through search rather than caching their content. Blocking in-product fetchers, while also keeping training crawlers off, is one signal that says “use my official channels”.
  • Internal staging or low-quality content. A draft blog or staging environment should not leak into model training. Adding the AI bot block on top of a generic Disallow: / keeps things tidy even if the staging environment becomes public by accident.
  • Compliance with internal policy. Some organisations require a documented opt-out signal as part of their data governance. Even if enforcement is imperfect, having the file is part of meeting the policy.

Limitations and privacy notes

Blocking AI crawlers with robots.txt is the right first step but it is not bulletproof. The file is a public request, not a technical barrier. Bots that ignore the standard, scrape through residential proxies, or download content via a non-AI intermediary (like Common Crawl on behalf of a downstream model) will still see your pages. Old training datasets that already contain your content are not affected by a future robots.txt change. Different AI vendors interpret the same user-agent name slightly differently, and some tokens like Google-Extended only opt the site out of training without removing it from Google Search.
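Because the file is only a request, a server-side check on the User-Agent header is the usual complement. The sketch below shows the idea as a Python WSGI middleware; it is not the Nginx or Apache output of this tool, and `BLOCKED_SUBSTRINGS`, `block_ai_bots` and `demo_app` are hypothetical names chosen for the example.

```python
# Illustrative subset; matched case-insensitively against the User-Agent header.
BLOCKED_SUBSTRINGS = ("gptbot", "claudebot", "ccbot", "bytespider")


def block_ai_bots(app):
    """Wrap a WSGI app and answer 403 when the User-Agent matches the list."""
    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(name in agent for name in BLOCKED_SUBSTRINGS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware


def demo_app(environ, start_response):
    """Stand-in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]


application = block_ai_bots(demo_app)
```

This only stops clients that identify themselves honestly, since the header is trivially spoofed. It also cannot cover Google-Extended, which is a robots.txt control token and does not appear as a request user-agent.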

This tool runs in your browser. The user-agent list, the sitemap URL, the custom paths and the existing robots.txt you paste are processed locally. Nothing is sent to PeopleAreGeek or to any third party. The bot reference table is shipped with the page; we update it when vendors publish new user-agents.

Frequently asked questions

Will blocking GPTBot hurt my Google ranking?

No. GPTBot is OpenAI’s training crawler, separate from Googlebot. Blocking GPTBot only opts you out of being used in future OpenAI model training. Your Google Search ranking depends on Googlebot, which you should leave allowed. Google-Extended is a separate Google robots.txt token that controls AI training opt-out without affecting search visibility.

Do AI vendors actually honour robots.txt?

The major reputable vendors do. OpenAI, Anthropic, Google, Apple, Perplexity, Meta and Common Crawl have all publicly committed to honouring the file. They have business reasons to comply: a reputation for ignoring robots.txt would invite litigation and breach contractual commitments to publishers. Smaller or anonymous scraper operators are a different story.

What is the difference between training crawlers and in-product fetchers?

Training crawlers gather pages to build datasets that future model versions learn from. They visit at scale, follow links, and respect crawl rates. In-product fetchers like ChatGPT-User or Perplexity-User fetch a single page on demand when a user asks an AI for the contents of a specific URL. Both can be blocked independently using their separate user-agent names.

What is the EU AIA Article 4(3) opt-out about?

Article 4(3) belongs to the 2019 EU Copyright Directive, not to the AI Act itself. It lets rightholders reserve their content from text and data mining, and the EU AI Act requires providers of general-purpose AI models to identify and comply with those reservations. Rightholders can express their opt-out in a machine-readable way, and robots.txt user-agent blocks are the most widely supported expression. Major AI vendors that train models on EU data have said they will honour user-agent-based opt-outs as one valid signal.

Should I also block Common Crawl (CCBot)?

If your goal is to opt out of AI training, yes. Common Crawl is a free public dataset that many open-source AI projects use as a base. Blocking CCBot prevents your content from ending up in that dataset. Common Crawl honours robots.txt and excludes blocked sites from future crawls.

Can I block AI bots without blocking my own RSS readers and link previews?

Yes. RSS readers and link-preview agents on Slack, Discord, Twitter and LinkedIn use their own distinct user-agents that are not in this generator. Blocking AI bots by user-agent name leaves those alone. If you want to be extra careful, review the generator output and the bot reference table before deploying.

Related tools and resources

Robots.txt is one signal in a wider site posture. The tools below help validate the file, test crawler access, audit headers and check sitemap visibility.

  • Robots.txt Tester
  • Robots.txt Generator (general)
  • Robots Meta Checker
  • Sitemap Checker
  • .htaccess Generator
  • HTTP Headers Checker
  • Indexability Checker