Robots.txt AI Crawler Blocker: Stop GPTBot, ClaudeBot, Google-Extended, Perplexity & 17 Other AI Bots

by People Are Geek
June 14, 2026
in Security Tools, SEO Tools

Local robots.txt generator for AI crawlers

Another month, another AI crawler in my logs. I got sick of hand-editing robots.txt every single time, so I built this instead. It writes the blocks for the training crawlers and the in-product fetchers behind ChatGPT, Claude, Gemini, Perplexity, Meta AI, Apple Intelligence, Common Crawl, plus 15 other bots. Tick the ones you want gone. Leave Googlebot and Bingbot alone so you don’t quietly torch your search traffic, drop in your sitemap and whatever custom paths you’ve got, then copy or download. Runs in your browser, all of it. I never see your site and nothing leaves the page.

robots.txt is voluntary. The reputable AI vendors honour it (OpenAI, Anthropic, Google, Perplexity, Apple, Meta). Want it actually enforced? Bolt on the server-level blocks too, the .htaccess or Nginx ones generated above.

What an AI crawler blocker does for your site

Most people miss this. AI grabs your pages two completely different ways, and you might only care about one of them. Training crawlers, like OpenAI’s GPTBot, Anthropic’s ClaudeBot, Google-Extended and Common Crawl’s CCBot, vacuum up the open web to build the datasets the next model version learns from. In-product agents are a different animal. ChatGPT-User, Perplexity-User, Claude’s web fetcher and the rest pull a single page in real time, right when someone asks a chatbot about it. Every bot carries its own user-agent string, doing its own narrow job, and the reputable ones now read robots.txt as the opt-out signal.

So this generator covers every AI bot I could find a documented user-agent for as of 2026: GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, claude-web, anthropic-ai, Google-Extended, Applebot-Extended, PerplexityBot, Perplexity-User, Meta-ExternalAgent, FacebookBot, Bytespider, CCBot, Amazonbot, Diffbot, omgilibot, YouBot, Kagibot, Cohere-AI, Timpibot, and a couple of legacy names that still show up. Flip them one by one. Or grab a preset like “Block AI training” or “EU AIA Article 4(3) opt-out” and tweak from there. Googlebot, Bingbot and the other search crawlers stay untouched by default, because nuking your own organic traffic is the last thing anybody wants.

How AI crawler blocking with robots.txt works

None of this is new tech. The Robots Exclusion Standard goes back to 1994, and it finally got written up properly as RFC 9309 in 2022. It’s a plain-text file, nothing fancier, living at /robots.txt at the root of your domain. Inside you’ve got one or more User-agent blocks, each carrying its own Allow and Disallow lines. A crawler shows up, reads the file, finds the block with its name on it, then does what it’s told. Here’s the catch though: it’s an honour system. Nothing physically forces a bot to obey. It’s a convention, and only the well-behaved operators bother following one. Upside is, the big AI vendors all publish their bot names and have said, on the record, that they’ll respect the file.
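
Here's a minimal sketch of that lookup, using Python's standard urllib.robotparser. The sample file and the example.com URLs are made up for illustration, and Python's parser is only one implementation, so a real crawler may match names a little differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out one AI crawler, leave everyone else alone.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and ask whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# GPTBot finds the block with its name on it; Googlebot falls through to "*".
print(is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/blog/post"))     # False
print(is_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/blog/post"))  # True
```

Keep in mind this only shows what a polite crawler would conclude. Nothing in it stops a bot that never reads the file.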

  1. Pick the bots you want to block. Don’t want to be model fuel? Training crawlers are your target. If it’s the live “summarise this page” behaviour that bugs you, go after the in-product fetchers instead. Different fight entirely.
  2. Decide on the path scope. Disallow: / shuts the whole site to that user-agent. Or fence off one corner, say /blog/ or /archive/, and let them roam the rest.
  3. Keep search engines crawling by just leaving their user-agents off the block list. Googlebot, Bingbot, DuckDuckBot and the others stay allowed whether you spell it out or not.
  4. Add server-level enforcement when there’s real money or legal weight on the line. The Nginx and Apache blocks this tool spits out hit the listed user-agents with a hard HTTP 403, even the ones that pretend robots.txt doesn’t exist.
  5. Deploy and verify. Drop the file in your web root, then actually load https://yourdomain.com/robots.txt in a browser and read what comes back. Don’t just assume the upload took. A stale cache has burned me more than once, honestly.
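
Steps 1 to 3 above can be sketched in a few lines of Python. This is not the code behind the tool, just a rough illustration of the output it aims for. The function name build_robots_txt, the bot selection and the sitemap URL are all hypothetical.

```python
def build_robots_txt(blocked_bots, paths=("/",), sitemap=None):
    """Return robots.txt text that disallows `paths` for each bot in `blocked_bots`.

    Search crawlers stay allowed simply because they are never named here.
    """
    lines = []
    for bot in blocked_bots:
        lines.append(f"User-agent: {bot}")
        lines.extend(f"Disallow: {path}" for path in paths)
        lines.append("")  # a blank line separates the groups
    if sitemap:
        lines.append(f"Sitemap: {sitemap}")
    return "\n".join(lines).rstrip("\n") + "\n"


# Example: fence off two corners of the site for two training crawlers.
print(build_robots_txt(
    ["GPTBot", "CCBot"],
    paths=["/blog/", "/archive/"],
    sitemap="https://example.com/sitemap.xml",  # placeholder URL
))
```

Calling it with the default paths=("/",) gives the whole-site Disallow: / from step 2. Steps 4 and 5 still happen on the server, so treat the string this returns as a draft until you have loaded the live file yourself.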

Common use cases for blocking AI crawlers

  • Publisher protecting editorial work. You live off the clicks to your articles. So a model that swallows everything you write and answers in your place, never sending one reader your way, is a slow bleed. Blocking training crawlers is the bluntest way to say your work isn’t free fuel.
  • SaaS hiding paid documentation. Docs behind a paywall or a login have no business showing up in a scraped training corpus. Block GPTBot and CCBot and you cut the odds that what your customers paid for ends up answered, for free, inside someone else’s chatbot.
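  • EU rightholder exercising Article 4(3) opt-out. Article 4(3) of the 2019 EU Copyright Directive lets rightholders reserve their text-and-data-mining rights, as long as they do it in a machine-readable way, and the EU AI Act requires providers of general-purpose AI models to respect that reservation. The “EU AIA opt-out” preset here writes exactly the user-agent blocks the major vendors have said they’ll honour.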
  • Brand keeping AI answers consistent. Maybe you’d rather AI products send people to your real help centre via search than parrot some cached copy of it. Blocking the in-product fetchers while leaving the training crawlers alone is a quiet way of saying “use my official channels”.
  • Internal staging or low-quality content. A half-written blog or a staging box should never bleed into model training. Layer the AI bot block on top of a plain Disallow: / and you’re covered even if that staging environment leaks to the public one day. It happens.
  • Compliance with internal policy. Some shops just need a documented opt-out signal to keep their data governance people happy. Enforcement might be shaky. Having the file on disk still ticks half the box.

Limitations and privacy notes

Let me be straight with you. robots.txt is the right first move, but it’s not a wall. It’s a polite request, full stop. Anything that ignores the standard still walks right in, whether it’s scraping through residential proxies or pulling your pages via a middleman (Common Crawl feeding some downstream model, say). And it does nothing about the past. If your content got baked into a dataset last year, a fresh robots.txt won’t claw it back out. Vendors read the same user-agent name a little differently too, and a few like Google-Extended only pull you out of training while leaving you fully in Google Search. Honestly, I’d treat robots.txt as necessary but not sufficient. Maybe that’s overly cautious, but I’ve stopped assuming any single signal makes me invisible.

One more thing, since people ask. This runs entirely in your browser. The bot list, your sitemap URL, those custom paths, the existing robots.txt you paste in, none of it leaves your machine. Nothing goes to PeopleAreGeek or anyone else. The reference table ships with the page, and I refresh it whenever a vendor publishes a new user-agent.

Frequently asked questions

Will blocking GPTBot hurt my Google ranking?

No. This is the worry I hear most, by far. GPTBot is OpenAI’s training crawler and it has nothing to do with Googlebot. Block it and the only thing that changes is OpenAI won’t use you to train its next model. Your ranking rides on Googlebot, which you’ve left allowed. Want out of Google’s AI training as well? That’s a separate user-agent, Google-Extended, and it won’t touch your search visibility either.

Do AI vendors actually honour robots.txt?

The big, named ones do. OpenAI, Anthropic, Google, Apple, Perplexity, Meta and Common Crawl have all said out loud that they’ll respect the file. They’ve got skin in the game too: a public reputation for ignoring robots.txt is a lawsuit magnet, and it torches the deals they’ve signed with publishers. The anonymous scrapers and fly-by-night operators? Don’t count on those. The server-level blocks are there for that exact crowd.
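
The tool's server-level blocks are Nginx and Apache rules. To show the idea without guessing at their exact syntax, here is the same check written as a small Python WSGI wrapper instead. Everything in it is an assumption for illustration: the names block_ai_bots, hello_app and status_for, the list of blocked agents, and the User-Agent strings in the demo.

```python
BLOCKED_AGENTS = ("GPTBot", "CCBot", "Bytespider")  # example selection, not a recommendation


def block_ai_bots(app, blocked=BLOCKED_AGENTS):
    """Wrap a WSGI app so requests from listed user-agents get 403 Forbidden."""
    needles = tuple(name.lower() for name in blocked)

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(needle in user_agent for needle in needles):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return app(environ, start_response)

    return middleware


def hello_app(environ, start_response):
    """Stand-in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]


def status_for(app, user_agent):
    """Call a WSGI app with a fake request and return the status line."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    list(app({"HTTP_USER_AGENT": user_agent}, start_response))
    return captured["status"]


wrapped = block_ai_bots(hello_app)
print(status_for(wrapped, "Mozilla/5.0 (compatible; GPTBot/1.0)"))  # 403 Forbidden
print(status_for(wrapped, "Mozilla/5.0 (X11; Linux x86_64)"))       # 200 OK
```

A check like this only catches bots that announce themselves. A scraper sending a browser's User-Agent sails straight past it.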

What is the difference between training crawlers and in-product fetchers?

Training crawlers are the bulk operation. They sweep pages at scale, follow your links, and pace themselves to build the datasets a future model learns from. In-product fetchers like ChatGPT-User or Perplexity-User are surgical instead. One page, right now, because a user asked an AI to read that specific URL. They carry different user-agent names, so you can block one and keep the other if that’s what you’re after.
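
A rough sketch of "block one, keep the other", assuming a hand-made category table that holds only the bots this article names on each side. The BOT_CATEGORIES mapping and the block_category helper are hypothetical. The check at the end uses Python's standard urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

# Category assignments taken from the examples in this article; extend as needed.
BOT_CATEGORIES = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "Google-Extended": "training",
    "CCBot": "training",
    "ChatGPT-User": "in-product",
    "Perplexity-User": "in-product",
}


def block_category(category: str) -> str:
    """Return one robots.txt group that disallows every bot in `category`."""
    names = [bot for bot, kind in BOT_CATEGORIES.items() if kind == category]
    if not names:
        raise ValueError(f"no bots in category {category!r}")
    # Several User-agent lines may share one set of rules.
    lines = [f"User-agent: {name}" for name in names]
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


robots_txt = block_category("training")
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("GPTBot", "/article"))        # False
print(parser.can_fetch("ChatGPT-User", "/article"))  # True
```

Swap "training" for "in-product" and the result flips: the live fetchers get blocked and the training crawlers are left alone.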

What is the EU AIA Article 4(3) opt-out about?

Article 4(3) is the text-and-data-mining opt-out from the 2019 EU Copyright Directive, and the EU AI Act requires providers of general-purpose AI models to respect it. The gist? Rightholders are allowed to say “no” in a way machines can read, and a robots.txt user-agent block is the format most vendors actually support. The big AI companies training on EU data have agreed to treat those blocks as one valid way of opting out. It’s no magic legal shield. It’s just the signal the industry settled on, which is better than shouting into a void.

Should I also block Common Crawl (CCBot)?

Serious about staying out of AI training? Then yes, and honestly it might be the single most important one to catch. Common Crawl is a free public archive that loads of open-source models start from, so your content there has a way of quietly turning up everywhere downstream. Block CCBot and you keep yourself out of that dataset. Common Crawl plays by robots.txt and drops blocked sites from future crawls.

Can I block AI bots without blocking my own RSS readers and link previews?

Yep. That’s the whole point of blocking by name. RSS readers and the link-preview bots on Slack, Discord, Twitter and LinkedIn run their own separate user-agents, and not one of them is in this generator. So when you block the AI bots, those previews keep working fine. Cautious type? Skim the generated output and the bot reference table before you ship. Thirty seconds, and you won’t get a nasty surprise in the morning.

Sources & further reading

  • RFC 9309, Robots Exclusion Protocol
  • Google, robots.txt introduction

Related tools and resources

A robots.txt is one piece of how your site faces crawlers. Just one. These are the tools I reach for to sanity-check the file, see what a crawler actually gets back, poke at the headers, then confirm the sitemap is pulling its weight.

  • Robots.txt Tester
  • Robots.txt Generator (general)
  • Robots Meta Checker
  • Sitemap Checker
  • .htaccess Generator
  • HTTP Headers Checker
  • Indexability Checker

People Are Geek

I'm Stephane, a network and systems engineer with over 15 years of hands-on experience on production infrastructure, virtualization (ESXi, Proxmox), networking, and self-hosting. Earlier in my career I built and ran a Linux resource site that became a well-known reference for sysadmins. Today I focus on cybersecurity, and I also work as a technical trainer, teaching networking and security to people who do it for a living. Everything on People Are Geek comes from real-world practice, not theory. I build every tool on this site myself, and I write about what I've actually deployed, broken, and fixed. If it's here, I've used it.
