Indexability Checker: Status, Robots.txt, Noindex, Canonical, Sitemap and Crawl Signals
Pages drop out of Google for the dumbest reasons. A stray noindex nobody remembers adding. Or some robots.txt rule a contractor left behind in 2019. So I throw a live URL in here and it pulls the HTTP status, the robots meta, the X-Robots-Tag header and the canonical from the first request, plus a robots.txt check on the path. One look and you can see if something’s quietly blocking the page.
Indexable does not mean indexed
This checker only hunts for the technical blockers. Your page can be crawlable and indexable and Google still leaves it out, because the content’s thin or nothing links to it. I run this first to clear the technical suspects. Then I go fix the content and the internal links. Honestly, that’s where the real problem usually hides.
Signals checked
- The HTTP status. You want a clean 200 on a page meant for the index. A redirect or a 404 sitting here is a red flag.
- The robots meta and the X-Robots-Tag header. Neither should say noindex. That header gets forgotten constantly, since it lives in the response, not the HTML.
- Robots.txt. It shouldn’t block the path for Googlebot. One sloppy disallow can wipe out a whole section of the site.
- The canonical. It has to point at the version you actually want indexed, not some duplicate or a URL dragging query parameters around.
- Discoverability. Google has to find the page somehow first, whether that’s your sitemap or a real internal link pointing at it.
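If you want to see roughly what reading those on-page signals looks like, here’s a minimal Python sketch. It is not this tool’s actual code. The names `HeadSignals`, `has_noindex` and `read_signals` are made up for illustration, and it assumes you already have the status code, the response headers and the served HTML from a single request. It only covers status, robots meta, X-Robots-Tag and canonical, and it does no JavaScript rendering.

```python
from html.parser import HTMLParser


class HeadSignals(HTMLParser):
    """Collects robots meta directives and the canonical href from served HTML."""

    def __init__(self):
        super().__init__()
        self.robots = []       # content values of <meta name="robots"> or "googlebot"
        self.canonical = None  # first <link rel="canonical"> href seen

    def handle_starttag(self, tag, attrs):
        a = {k: (v or "") for k, v in attrs}
        if tag == "meta" and a.get("name", "").lower() in ("robots", "googlebot"):
            self.robots.append(a.get("content", ""))
        elif tag == "link" and "canonical" in a.get("rel", "").lower().split():
            if self.canonical is None:
                self.canonical = a.get("href")


def has_noindex(value):
    """True if a robots directive string contains noindex (or its shorthand 'none')."""
    # Treat ':' as a separator too, so "googlebot: noindex" in a header is caught.
    tokens = [t.strip() for t in value.lower().replace(":", ",").split(",")]
    return "noindex" in tokens or "none" in tokens


def read_signals(status, headers, html):
    """Return the on-page signals from one response (no JavaScript rendering)."""
    parser = HeadSignals()
    parser.feed(html)
    # Header names are case-insensitive, so normalise them before the lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    return {
        "status_ok": status == 200,
        "meta_noindex": any(has_noindex(r) for r in parser.robots),
        "header_noindex": has_noindex(lowered.get("x-robots-tag", "")),
        "canonical": parser.canonical,
    }


page = (
    '<html><head><meta name="robots" content="index, follow">'
    '<link rel="canonical" href="https://example.com/a"></head></html>'
)
signals = read_signals(200, {"X-Robots-Tag": "noindex"}, page)
print(signals)
```

The sample page is the classic trap: the HTML says index, the header says noindex, and you would never spot it by viewing source.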
Frequently asked questions
What makes a page non-indexable?
A handful of usual suspects. A noindex (meta robots or the X-Robots-Tag header), an HTTP status that isn’t 200, a canonical pointing off somewhere else, or a login wall. Here’s the one that trips people up. A robots.txt disallow blocks crawling, which is not the same as blocking indexing. Different problem entirely, so I treat it as its own signal.
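Here’s a small Python sketch of how I’d keep those two problems apart in a report. Everything in it (`Signals`, `diagnose`, the field names) is hypothetical, not the tool’s real data model, and it assumes the individual checks have already been done elsewhere. Login walls are left out because detecting them reliably takes more than a flag.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    status: int                # HTTP status of the page
    noindex: bool              # from meta robots or the X-Robots-Tag header
    canonical_elsewhere: bool  # canonical points at a different URL
    robots_txt_blocked: bool   # robots.txt disallows the path


def diagnose(s: Signals) -> dict[str, list[str]]:
    """Sort findings into crawl blockers and index blockers."""
    crawl, index = [], []
    if s.robots_txt_blocked:
        # Disallow stops the fetch; it is not an indexing directive.
        crawl.append("robots.txt disallow")
    if s.status != 200:
        index.append(f"HTTP status {s.status}")
    if s.noindex:
        index.append("noindex directive")
    if s.canonical_elsewhere:
        index.append("canonical points to another URL")
    return {"crawl_blockers": crawl, "index_blockers": index}


report = diagnose(Signals(status=200, noindex=True,
                          canonical_elsewhere=False, robots_txt_blocked=True))
print(report)
```

When both lists come back non-empty, like in this example, that’s the combination to worry about: the noindex is sitting on a page the crawler is not allowed to fetch.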
Does a robots.txt disallow remove a page from Google?
No, and that answer surprises people constantly. Disallow stops the crawl. But if other pages link to that URL, Google can still list it, just with no snippet, that sad “no information is available” blurb. So if you genuinely want it gone, do the opposite of what feels right. Allow the crawl so Googlebot can reach the page, then serve a noindex. You have to let it in before it’ll agree to leave.
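To check whether a path is disallowed before you go hunting for a noindex, a rough Python sketch with the standard library’s `urllib.robotparser` does the job. The robots.txt content and the `is_crawlable` helper are made-up examples. One caveat: as far as I know, Python’s parser doesn’t follow Google’s wildcard and longest-match rules, so treat it as an approximation and not Googlebot’s exact verdict.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (an assumption for illustration).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_crawlable(robots_txt, url, agent="Googlebot"):
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())  # parse text directly, no network call
    return rp.can_fetch(agent, url)


print(is_crawlable(ROBOTS_TXT, "https://example.com/private/report.html"))
print(is_crawlable(ROBOTS_TXT, "https://example.com/blog/post"))
```

If the first call says the path is blocked and the page also carries a noindex, the noindex can’t be read until the disallow comes off.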
What is the difference between noindex and canonical?
They feel similar. They’re really not. Noindex is a flat no, keep this page out of the index, full stop. A canonical is softer, just a hint that says these pages are basically the same, treat this one as the master copy. With a canonical the page still gets crawled and can still surface. Maybe it’s just me, but I think most people reach for noindex when a plain canonical would’ve done the job. So my rule: noindex when I want a page gone, canonical when I’ve got near-duplicates and only need Google to pick a winner.
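If you’re checking canonicals by hand, the first question is just whether the tag points back at the page itself or somewhere else. Here’s a minimal Python sketch of that comparison. The helper names and the status labels are my own invention, and the normalisation is deliberately light (lowercase scheme and host, drop the fragment), so it won’t catch things like trailing-slash or tracking-parameter differences.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def normalise(url):
    """Lowercase scheme and host, drop the fragment, default an empty path to '/'."""
    p = urlsplit(url)
    return urlunsplit((p.scheme.lower(), p.netloc.lower(), p.path or "/", p.query, ""))


def canonical_status(page_url, canonical_href):
    """Classify a canonical as missing, self, elsewhere, or elsewhere-with-query."""
    if not canonical_href:
        return "missing"
    target = urljoin(page_url, canonical_href)  # resolve relative hrefs
    if normalise(target) == normalise(page_url):
        return "self"
    if urlsplit(target).query:
        # The canonical itself carries query parameters, which is rarely intended.
        return "elsewhere-with-query"
    return "elsewhere"


print(canonical_status("https://example.com/shoes?color=red", "/shoes"))
print(canonical_status("https://example.com/shoes", "https://example.com/shoes?ref=nav"))
```

The first example is the healthy near-duplicate case: a filtered URL handing the win to the clean one. The second is the kind I’d go fix.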
Why is my page indexable but still not indexed?
Indexable just means nothing’s actively blocking it. That’s the floor, not a promise. Google still gets the final say. It weighs whether it even found the page, whether the crawl is worth the budget, and whether the content holds up against the near-duplicate it might already have. When I want the actual answer instead of guessing, I drop the URL into Search Console’s URL Inspection. It shows you the exact coverage state straight from Google, no interpretation needed.
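The "did Google even find it" part has one piece you can check yourself: is the URL actually in your sitemap? A rough Python sketch, assuming a standard sitemaps.org `urlset` file you’ve already downloaded. The sample XML and the `urls_in_sitemap` helper are made up, and it compares exact strings, so a trailing slash or an http/https mismatch will read as missing. Run it on a sitemap index and you’ll get the child sitemap URLs back, not page URLs.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Example sitemap content (an assumption for illustration).
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/post</loc></url>
</urlset>"""


def urls_in_sitemap(xml_text):
    """Return the set of <loc> values found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text}


listed = urls_in_sitemap(SAMPLE)
print("https://example.com/blog/post" in listed)
print("https://example.com/orphan" in listed)
```

A URL missing from the sitemap isn’t doomed, since an internal link works too, but a page with neither is one I’d stop expecting Google to stumble over.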
Does this tool render JavaScript?
It doesn’t. It reads the served HTML and the response headers, exactly what comes back on that first request. Which matters more than it sounds. If JavaScript injects your meta robots or canonical after load, what Google eventually renders can drift from what you see here. So when a page leans on JS for that stuff, I go confirm the final state in Search Console’s URL Inspection, because that one actually renders the page the way Google does.