Local AI token estimator
Paste a prompt. Or an article, or a wall of code. The page guesses how many tokens it’ll burn across GPT-5.5, GPT-5 mini, Claude Opus 4.7, Claude Sonnet 4.6, Claude Haiku 4, Gemini 3 Pro and Gemini 3 Flash, then lines up the counts side by side. You get a per-run cost, a read on how snugly the text sits inside each model’s context window (they range from 256K all the way up to 2M), plus a rough picture of the tokenisation so you can spot where a bloated prompt quietly leaks budget. None of it leaves your browser.
These are estimates, built from each provider’s published tokenisation profile. For exact GPT numbers, use OpenAI’s tiktoken library; for Claude, use Anthropic’s official token-counting API. Your text stays put: this page sends it nowhere.
What is a token counter and why it matters
Models don’t read the way you and I do. They don’t see words. Before a GPT, Claude or Gemini prompt gets processed, a tokenizer chops the text into tokens. A token might be a short word, a fragment of a longer one, a punctuation mark, a clump of digits, a newline, or, in a less common script, a single Unicode character or even a fragment of one. Everything after that (reading, generating, billing) happens per token. So if you don’t know roughly how many tokens your prompt and its reply will eat, you’re guessing at the cost. Worse, you’re risking a silent truncation that lops off the one paragraph that mattered.
What this thing does: you paste text, and it runs a per-provider estimator that tries to mirror how the GPT-4o, Claude and Gemini tokenizers actually behave, then works out how that same content would be billed and stored on each model. I built it to lean cautious: it overcounts a little on tiny snippets, which honestly is the direction you want to err when you’re planning production spend. The exception is dense code, where it can undercount a little, so leave yourself some headroom there. And the text never leaves the browser, which matters a lot when what you’re pasting is a secret, customer data, or a feature nobody’s announced yet.
How tokenisation works across GPT-5, Claude 4 and Gemini 3
OpenAI runs on byte-pair encoding, BPE for short. The GPT-5 family inherits the o200k_base-style vocabulary introduced with GPT-4o, plus some extra entries that make it leaner on code and non-English text. Anthropic’s Claude 4 models (Opus 4.7, Sonnet 4.6 and Haiku 4) use their own BPE tokenizer, also tuned for code and other languages, which usually lands within a few percent of the OpenAI counts on comparable text. Gemini 3, both Pro and Flash, goes a different route: a SentencePiece-style tokenizer with a separate vocabulary. The upshot? It sometimes carves the exact same input into more tokens, and you’ll feel it most on punctuation-heavy text and emoji.
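To make byte-pair encoding concrete, here is a toy version of the classic BPE training loop: count adjacent symbol pairs, merge the most frequent one, repeat. It’s a sketch under simplifying assumptions. It works on characters rather than bytes, learns from a made-up ten-word corpus, and the function names are illustrative. Real tokenizers such as o200k_base work on bytes with vocabularies of roughly 200,000 entries, so don’t read the output as what any production model would produce.

```python
from collections import Counter


def apply_merge(symbols: tuple[str, ...], pair: tuple[str, str]) -> tuple[str, ...]:
    """Fuse every adjacent occurrence of `pair` into one symbol, left to right."""
    out: list[str] = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_merges(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    # Each word starts as a sequence of single characters, weighted by frequency.
    words = Counter(tuple(word) for word in corpus)
    merges: list[tuple[str, str]] = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = pairs.most_common(1)[0][0]  # ties go to the pair counted first
        merges.append(best)
        merged = Counter()
        for symbols, freq in words.items():
            merged[apply_merge(symbols, best)] += freq
        words = merged
    return merges


def encode(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Split a word into tokens by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


corpus = ["token"] * 5 + ["tokens"] * 3 + ["open"] * 2
merges = learn_merges(corpus, num_merges=4)
print(merges)                    # [('e', 'n'), ('t', 'o'), ('to', 'k'), ('tok', 'en')]
print(encode("tokens", merges))  # ['token', 's']
print(encode("open", merges))    # ['o', 'p', 'en']
```

The frequent word collapses into a single symbol while the rarer one stays in pieces. Two tokenizers trained on different corpora learn different merges, which is why identical text can count differently across providers.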
This page doesn’t run any of those tokenizers. It approximates them in five steps:
- Normalise the text so encoded characters, line endings and that stray whitespace nobody notices all match what the API actually receives.
- Detect common chunks: words, numbers, code symbols, line breaks, the odd rare character.
- Apply per-category cost. Short everyday words tend to be one token. Longer ones split into several. Code symbols and punctuation are usually a token apiece, line breaks count too, and Unicode-heavy text can chew through a few tokens per single character.
- Apply a model-specific multiplier to account for how BPE and SentencePiece vocabularies disagree.
- Project context fit and cost. Take the input estimate, add the expected output, fold in the model’s published pricing, and you’ve got a per-run figure that scales more or less linearly up to monthly volume.
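The first four steps above can be sketched in a few dozen lines. Everything numeric here is an assumption for illustration: the chunk classes in `CHUNK_RE`, the per-class costs in `chunk_cost` and the `MODEL_MULTIPLIER` values are hypothetical, not the weights this page actually uses. The fifth step is plain arithmetic on the result.

```python
import math
import re
import unicodedata

# Step 2 (hypothetical chunk classes): the named group that matches picks the cost rule.
CHUNK_RE = re.compile(
    r"(?P<newline>\n)"
    r"|(?P<word>[A-Za-z]+)"
    r"|(?P<number>[0-9]+)"
    r"|(?P<space>[ \t]+)"
    r"|(?P<punct>[!-/:-@\[-`{-~])"
    r"|(?P<other>.)"
)

# Step 4 (hypothetical): stand-ins for how far each vocabulary family drifts.
MODEL_MULTIPLIER = {"gpt": 1.00, "claude": 1.03, "gemini": 1.08}


def normalise(text: str) -> str:
    """Step 1: fold compatibility characters and unify line endings."""
    text = unicodedata.normalize("NFKC", text)
    return text.replace("\r\n", "\n").replace("\r", "\n")


def chunk_cost(kind: str, chunk: str) -> int:
    """Step 3: assumed token cost per chunk class (illustrative weights only)."""
    if kind == "word":
        return math.ceil(len(chunk) / 6)  # short words cost 1, long words split
    if kind == "number":
        return math.ceil(len(chunk) / 3)  # assume digits are grouped in threes
    if kind == "space":
        return 0 if len(chunk) == 1 else 1  # a single space usually merges into the next word
    if kind in ("newline", "punct"):
        return 1
    # Non-ASCII: assume about one token per UTF-8 byte beyond the first.
    return max(1, len(chunk.encode("utf-8")) - 1)


def estimate_tokens(text: str, family: str) -> int:
    """Steps 1 to 4: normalise, chunk, cost each chunk, apply the family multiplier."""
    base = sum(chunk_cost(m.lastgroup, m.group()) for m in CHUNK_RE.finditer(normalise(text)))
    return math.ceil(base * MODEL_MULTIPLIER[family])  # round up to lean cautious


sample = "Hello, world!\n"
for family in MODEL_MULTIPLIER:
    print(family, estimate_tokens(sample, family))
```

Rounding up after the multiplier is one simple way to bias toward overcounting, which is the safer error when budgeting.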
Common use cases for a token counter
- Cost forecasting before a launch. Multiply the per-run cost (input plus output, each at its own rate) by the monthly run count you’re projecting, and you’ll see fast whether a feature flies on a premium model like Opus 4.7 or GPT-5.5, or whether it only pencils out on something smaller like Haiku 4 or Gemini 3 Flash. A 1,500-token system prompt fired 100,000 times a month is 150 million input tokens before a single user message is counted. That climbs faster than most teams brace for.
- Prompt budget design. When a system prompt, a retrieved knowledge chunk and a user message are all fighting for the same window, knowing each piece’s token count lets you carve up the budget on purpose. Beats discovering at runtime that your document got truncated.
- Model selection. A 50,000-token input is nothing on Opus 4.7 (1M window) or Gemini 3 Pro (2M). Drop it into an older model with a 32K window and it won’t fit at all without chunking or summarising first. The fit chart here shows that gap at a glance.
- Auditing long content pipelines. Articles, transcripts and PDFs almost always tokenise larger than you’d guess. Run your longest typical documents through and you’ll find out whether a summarisation pipeline blows past the context limit on the outliers.
- RAG and embeddings sizing. Retrieval-augmented generation (RAG) chops documents into chunks of a target token size. Paste a representative chunk and you can confirm the splitter’s behaving before you embed thousands of pages.
- Educational walk-throughs. Show a non-technical stakeholder that the same paragraph is 80 tokens on one model and 92 on another, and suddenly the pricing decision is a lot easier to defend.
- Latency planning. Output tokens usually drive response time more than input does. A decent guess at how long the reply runs helps you dodge timeouts on the big generations.
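For the RAG and embeddings sizing item above, here is the kind of fixed-size splitter with overlap you’d be checking. It’s a sketch, assuming the document has already been tokenised into a list; the `t0` to `t9` strings are placeholders for real tokens, `chunk_tokens` is an illustrative name, and the size and overlap values are hypothetical. Many production splitters also respect sentence or paragraph boundaries, which this one ignores.

```python
def chunk_tokens(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `size`, each repeating `overlap` tokens from the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end, so skip a tiny duplicate tail
    return chunks


# Placeholder tokens standing in for a real tokenizer's output.
tokens = [f"t{i}" for i in range(10)]
chunks = chunk_tokens(tokens, size=4, overlap=1)
for chunk in chunks:
    print(len(chunk), chunk)

# Sanity checks worth running on a real sample before embedding at scale.
assert all(len(c) <= 4 for c in chunks)
assert all(a[-1:] == b[:1] for a, b in zip(chunks, chunks[1:]))  # one-token overlap
```

Checking the maximum chunk length and the overlap on a real sample catches the common splitter bugs (off-by-one windows, missing overlap) before they get baked into an index.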
Limitations and privacy notes
Let’s be clear: this is an estimator, not anybody’s official tokenizer. The real byte-pair encoding behind GPT-4o and the SentencePiece encoding Gemini leans on are both deterministic, but running them properly means shipping multi-megabyte vocabulary files into your browser. That would drag the page down for everyone, so I didn’t. What’s here matches the official counts within a few percent on ordinary English, code and mixed content. It can wander further off on dense emoji strings, unusual scripts and very short snippets. When you need the number to be exact before you bill, reach for OpenAI’s tiktoken library, Anthropic’s token-counting API, or the countTokens method in Google’s Gemini API.
Everything’s processed inside your browser. Nothing goes to PeopleAreGeek, nothing goes to a third party. Paste your prompts, system messages, retrieval chunks and real production samples: none of it makes a network round trip. One caveat on the pricing: the cost estimator uses published 2026 list rates, so it leaves out enterprise discounts, batch APIs and cached-input pricing, all of which can cut the bill substantially once you’re running at real volume.
Frequently asked questions
Why does the same text use a different number of tokens on each model?
Because each provider trains its own tokenizer on its own vocabulary. For a single word the split might differ by just one token. Sounds tiny. But it piles up across a long input. GPT-5 and Claude 4 are usually the leanest on mixed languages and stay within a few percent of each other. Gemini 3 spends a touch more on punctuation-heavy text and emoji.
How accurate is this estimator compared to the official tokenizer?
On normal English prose and code, it sits within roughly three to five percent of tiktoken and the Anthropic counter. Honestly that’s close enough for most planning. It drifts more on emoji bursts, weird scripts, or really short snippets. When the call is final billing or a hard context-limit decision, run the actual tokenizer of whatever provider you ship to.
Do output tokens cost the same as input tokens?
Nope. Most providers bill output tokens at several times the input rate; on current list prices the multiple is often somewhere around four to eight. The estimator here uses the right per-direction price for each model, so even a tiny prompt with a long, rambling reply gets costed fairly.
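To see how per-direction pricing plays out, here is a sketch with hypothetical rates of $1 per million input tokens and $5 per million output tokens. Those are placeholders, not any provider’s list price, and `run_cost_usd` and `monthly_cost_usd` are illustrative names. It also projects the 1,500-token system prompt fired 100,000 times a month from the cost-forecasting item, assuming cost scales linearly with volume and ignoring caching and batch discounts.

```python
def run_cost_usd(input_tokens: int, output_tokens: int,
                 input_usd_per_million: float, output_usd_per_million: float) -> float:
    """Cost of one call, pricing input and output tokens at their own rates."""
    return (input_tokens * input_usd_per_million
            + output_tokens * output_usd_per_million) / 1_000_000


def monthly_cost_usd(per_run_usd: float, runs_per_month: int) -> float:
    """Linear projection to monthly volume; ignores caching and batch discounts."""
    return per_run_usd * runs_per_month


IN_RATE, OUT_RATE = 1.00, 5.00  # hypothetical USD per million tokens

# A short prompt with a long reply: the output side dominates the bill.
input_part = run_cost_usd(200, 0, IN_RATE, OUT_RATE)
output_part = run_cost_usd(0, 1_500, IN_RATE, OUT_RATE)
print(f"input ${input_part:.4f}, output ${output_part:.4f}")  # input $0.0002, output $0.0075

# The 1,500-token system prompt alone, fired 100,000 times a month.
per_run = run_cost_usd(1_500, 0, IN_RATE, OUT_RATE)
print(f"${monthly_cost_usd(per_run, 100_000):,.2f} per month")  # $150.00 per month
```

Swap in the published rates for the models you’re comparing; the structure stays the same.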
Does the page send my text to a server to count it?
No. The whole estimator runs in your browser. Paste secrets, customer messages, live production prompts, whatever, and none of it leaves your machine. Once the page has loaded, it even works offline.
What is a context window and how does it relate to tokens?
It’s the ceiling on how many tokens a model can hold in one call. And it counts everything: the system prompt, the chat history, retrieved chunks, the user message, plus the model’s own reply. GPT-5.5 takes 400,000 tokens. Claude Opus 4.7 takes 1,000,000. Gemini 3 Pro goes up to 2,000,000. The fit chart shows what slice of each window your text would eat.
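The fit chart boils down to a percentage, and because the reply shares the window, it’s worth reserving room for it up front. This sketch uses the window sizes quoted above; the function names, the 8,000-token reply reservation and the prompt component sizes are hypothetical examples.

```python
# Context windows as quoted in this article.
CONTEXT_WINDOWS = {
    "GPT-5.5": 400_000,
    "Claude Opus 4.7": 1_000_000,
    "Gemini 3 Pro": 2_000_000,
}


def window_usage_pct(prompt_tokens: int, reserved_output: int, window: int) -> float:
    """Percent of the window used once room for the reply is set aside (over 100 means no fit)."""
    return 100 * (prompt_tokens + reserved_output) / window


def retrieval_budget(window: int, system: int, history: int, user: int,
                     reserved_output: int) -> int:
    """Tokens left for retrieved chunks after every other occupant of the window."""
    return max(0, window - system - history - user - reserved_output)


# Hypothetical call: a 50,000-token prompt with 8,000 tokens reserved for the reply.
for model, window in CONTEXT_WINDOWS.items():
    print(f"{model}: {window_usage_pct(50_000, 8_000, window):.1f}% of the window")

print(retrieval_budget(400_000, system=1_500, history=6_000, user=500,
                       reserved_output=8_000))  # 384000
```

Budgeting retrieval as whatever is left after the fixed pieces and the reply reservation is one straightforward design; it makes truncation a planning decision rather than a runtime surprise.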
How can I reduce a token bill without losing quality?
A few things that actually work. Cut the repetitive boilerplate out of your system prompts. Park the stable instructions in a cached-prompt feature if the provider offers one. Lean on Markdown or structured JSON instead of wordy natural-language framing. Summarise retrieved chunks before you send them. And for routing or classification steps, just use a smaller model. Shave 20 percent off a prompt and at production volume that’s real money, not rounding.