
Robots.txt Tester: Crawler Rules, Winning Path Match and Sitemap Audit

by People Are Geek
June 14, 2026
in Online Tools, SEO Tools

Live robots.txt fetch and path rule simulation

Reading a robots.txt by eye is where mistakes happen. This tool fetches the live file server-side, pulls out the sitemap lines and the crawler groups, runs a few paths through for whatever crawler token you pick, and tells you which group got picked and which allow or disallow rule actually won. No more squinting at a wall of text and hoping you read it right.

The path simulator copies Google-style group selection and path precedence for the common rules. It gets you close. For anything you really care about, still confirm with your actual crawler tools and how the live host behaves.

What a robots.txt tester should make clear

Robots.txt looks simple. That’s the trap. You skim it, you think you get it, and then you’ve misread a nested allow as a blanket block. A real file mixes a general User-agent: * group with a more specific crawler group, throws an allow exception inside a broader disallow, scatters a sitemap line or two around, and buries the part that actually matters under a comment. So a tester worth using has to show you both things at once: the raw file you can read line by line, and the plain verdict for the one path you came here to check.

Here’s what the tool does. It hits the live file at the domain root through the backend, parses out the crawler groups, runs several paths in one shot, then says which rule won for the token you gave it. Honestly, the moment it earns its keep is after you’ve touched something: a WordPress update, a theme or plugin that quietly rewrites the virtual robots output, a sitemap plugin swap, a migration. Or when a Search Console report flags some URL as blocked and you have no idea why.
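To make the "parses out the crawler groups" step concrete, here is a minimal sketch of that kind of parser. It is not the tool's backend code: the names `Group` and `parse_robots` are made up for illustration, and it only understands User-agent, Allow, Disallow and Sitemap lines.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """One robots.txt group: its user-agent tokens and its rules."""
    agents: list = field(default_factory=list)
    rules: list = field(default_factory=list)  # (directive, path) pairs


def parse_robots(text):
    """Split robots.txt text into crawler groups and sitemap URLs."""
    groups = []
    sitemaps = []
    current = None
    expecting_agents = False  # True while reading consecutive User-agent lines

    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()

        if key == "user-agent":
            if not expecting_agents:
                # A User-agent line after rules starts a new group.
                current = Group()
                groups.append(current)
                expecting_agents = True
            current.agents.append(value.lower())
        elif key in ("allow", "disallow"):
            expecting_agents = False
            if current is not None:  # rules before any User-agent are ignored
                current.rules.append((key, value))
        elif key == "sitemap":
            sitemaps.append(value)  # sitemap lines belong to no group
    return groups, sitemaps
```

Feed it the raw text of a fetched file and you get back a list of groups plus the declared sitemap URLs, which is the structure the rest of the checks work from.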

Robots.txt controls crawling, not everything about indexing

A robots rule tells a well-behaved crawler whether it’s allowed to request a path. That’s it. It isn’t a privacy wall, and it sure isn’t a clean delete button for a page. Block a URL from crawling and you may have just stopped the engine from ever reading the canonical or robots meta tag sitting on that page. So for actual index cleanup, match the signal to the job: redirects when content moved, a noindex tag on stuff you’ll still let them fetch but don’t want indexed, the right status code when something’s gone. Robots rules are for crawl access, full stop.

How crawler and path matching are read here

The simulator is built around the one decision a technical SEO actually needs answered. It grabs the most specific crawler group matching the token you typed, folds in any groups that are equally specific, then works through the allow and disallow rules. Longest match wins. When two equally specific rules fight, allow takes it. And yeah, it handles the * wildcard and the $ end marker that modern engines use when they parse robots files.

  • Type the exact host you want to audit. Robots rules are bound to the scheme, host, and port of the file that got fetched, nothing else.
  • Throw a real public path at it, plus an admin or search path, plus whatever URL got reported as blocked.
  • Read the winning rule. Not the rule count, the actual winner.
  • Glance at the declared sitemap URLs and make sure they still parse.
  • Keep the raw output on screen when you’re comparing the live file against what a plugin thinks it set.
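
The precedence described in this section (longest match wins, allow wins a tie, * and $ are honored) can be sketched in a few lines. This is an illustration under assumptions, not the simulator's real code: `pattern_to_regex`, `winning_rule` and `is_allowed` are hypothetical names, "longest" is measured as pattern length in characters, and rules are plain (directive, pattern) pairs.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots path pattern: * matches any run, a trailing $ anchors the end."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def winning_rule(rules, path):
    """Return the (directive, pattern) that decides `path`, or None if nothing matches.

    The longest matching pattern wins; on equal length, allow beats disallow.
    """
    best = None
    for directive, pattern in rules:
        if not pattern:  # an empty Allow/Disallow value matches nothing
            continue
        if pattern_to_regex(pattern).match(path) is None:
            continue
        if (
            best is None
            or len(pattern) > len(best[1])
            or (len(pattern) == len(best[1]) and directive == "allow")
        ):
            best = (directive, pattern)
    return best


def is_allowed(rules, path):
    """No matching rule means the path is allowed."""
    winner = winning_rule(rules, path)
    return winner is None or winner[0] == "allow"


rules = [
    ("disallow", "/wp-admin/"),
    ("allow", "/wp-admin/admin-ajax.php"),
    ("disallow", "/*.pdf$"),
]
for path in ("/wp-admin/admin-ajax.php", "/wp-admin/options.php", "/files/report.pdf"):
    print(path, winning_rule(rules, path))
```

Returning the winning rule rather than a bare yes or no is the point: the verdict is only useful when you can see which line produced it.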

WordPress robots checks worth doing

On a public WordPress site you’ll almost always see /wp-admin/ blocked with an allow carved out for /wp-admin/admin-ajax.php. Fine. That’s normal, and it proves nothing about the rest of your crawl setup. Test your money pages too, the articles and tools that actually matter. Then the search or parameter patterns you meant to limit. Then sitemap discovery. And don’t forget whatever your security plugin or host quietly injected when you weren’t looking.

Good technical SEO habits around robots.txt

  1. Re-fetch the live file any time you’ve touched an SEO plugin, a sitemap, a cache, or done a migration.
  2. When the rules look split, test the exact same URL twice: once as Googlebot, once as the generic star group.
  3. Don’t block the CSS and JS crawlers need to render a public page. Not without a really good reason, anyway.
  4. Keep sitemap lines absolute and current. Stale ones are easy to forget.
  5. Pair the robots check with an indexability and canonical look on the pages you actually want ranking.
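
Why does habit 2 matter? Because Googlebot and a generic crawler can land in different groups and never see each other's rules. Here is a rough sketch of group selection. It assumes a simplified model: an exact, case-insensitive match on the crawler token, with the * groups as the fallback. The name `select_rules` and the (agents, rules) pair layout are invented for this example, and real crawlers may match tokens more loosely.

```python
def select_rules(groups, token):
    """Pick the rules that apply to `token`.

    `groups` is a list of (agent_names, rules) pairs. Groups naming the
    crawler are merged; the * groups are used only when none names it.
    """
    token = token.lower()
    named, generic = [], []
    matched_by_name = False
    for agents, rules in groups:
        lowered = [agent.lower() for agent in agents]
        if token in lowered:
            matched_by_name = True
            named.extend(rules)
        elif "*" in lowered:
            generic.extend(rules)
    # A named group with zero rules still wins: it means "no restrictions".
    return named if matched_by_name else generic


groups = [
    (["*"], [("disallow", "/private/")]),
    (["Googlebot"], [("disallow", "/search")]),
    (["googlebot"], [("allow", "/search/help")]),
]
print(select_rules(groups, "Googlebot"))  # the two googlebot groups, merged
print(select_rules(groups, "Bingbot"))    # falls back to the * group
```

Notice that in this example Googlebot never sees the /private/ rule at all, which is exactly the kind of split the double test is meant to surface.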

Common questions

Does an allowed robots result guarantee indexing?

Nope. All it does is clear one doubt: the crawler can reach the thing. The URL still has to earn it. Useful content, a healthy status code, a canonical that makes sense, some internal links pointing at it, and no stray noindex undoing the whole effort.

Is a blocked path always a mistake?

Not at all. Admin pages, carts, search results, duplicate or private workflow URLs often get blocked on purpose. The real question is simpler: does blocking that path line up with what you want the site to do?

Why test several paths at once?

Because robots files love broad rules with one narrow exception buried inside. Put a public page, a blocked area, and the exception path next to each other and the pattern just clicks. Much harder to fool yourself that way.

Does robots.txt stop a page from being indexed?

No, and this one trips people up constantly. It blocks crawling, not indexing. A disallowed URL can still land in the index without a snippet if other pages link to it. Want it gone? Allow the crawl, then serve a noindex.

What is the difference between Disallow and noindex?

Disallow stops the crawl. Noindex (a meta tag or a header) tells the engine not to index. Here’s the catch most people miss: block a page with disallow and the crawler never reaches the noindex you put on it, so the page you wanted gone just sits there.
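
A small sketch of that catch, using only the Python standard library. The helper names `has_noindex` and `noindex_takes_effect` are hypothetical, and the check is deliberately simple: it looks for a plain noindex token in an X-Robots-Tag header or a robots meta tag and ignores crawler-specific forms such as "googlebot: noindex".

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the content of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            self.values.append(attributes.get("content", ""))


def has_noindex(headers, html):
    """True if an X-Robots-Tag header or a robots meta tag carries noindex."""
    directives = [
        value for name, value in headers.items() if name.lower() == "x-robots-tag"
    ]
    finder = RobotsMetaFinder()
    finder.feed(html)
    directives.extend(finder.values)
    tokens = {t.strip().lower() for d in directives for t in d.split(",")}
    return "noindex" in tokens


def noindex_takes_effect(crawl_allowed, headers, html):
    """A noindex only counts if the crawler is allowed to fetch the page."""
    return crawl_allowed and has_noindex(headers, html)
```

Pass in the crawl verdict for the URL along with the response headers and body: a page that is disallowed and carries noindex comes back False, which is the combination to go hunting for.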

Where must robots.txt live?

Right at the host root, exactly /robots.txt. Stick it in a subfolder and it’s ignored, plain and simple. Oh, and every subdomain needs its own.
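
If you want to work out which robots.txt governs a given page, this sketch derives it from the page URL with the standard library's `urllib.parse`. The function name `robots_url` is just an illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL that governs `page_url`."""
    parts = urlsplit(page_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not an absolute URL: {page_url!r}")
    # Keep scheme and netloc (host plus any port); replace path, query, fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://example.com/blog/post?x=1#top"))
print(robots_url("https://shop.example.com/cart"))
```

The second call shows the subdomain point: shop.example.com gets its own file, separate from the one on example.com.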

Sources & further reading

  • RFC 9309: Robots Exclusion Protocol
  • Google: robots.txt introduction
People Are Geek

I'm Stephane, a network and systems engineer with over 15 years of hands-on experience on production infrastructure, virtualization (ESXi, Proxmox), networking, and self-hosting. Earlier in my career I built and ran a Linux resource site that became a well-known reference for sysadmins. Today I focus on cybersecurity, and I also work as a technical trainer, teaching networking and security to people who do it for a living. Everything on People Are Geek comes from real-world practice, not theory. I build every tool on this site myself, and I write about what I've actually deployed, broken, and fixed. If it's here, I've used it.
