I pay for all three. Out of my own pocket, and I hammer their APIs at work every day, so no, this isn’t a spec-sheet drive-by. Three families left standing: ChatGPT (OpenAI, GPT-5 turbo / GPT-5 mini), Claude (Anthropic, Opus 4.8 / Sonnet 4.6), Gemini (Google, 3.0 Pro / 3.0 Flash). And here’s the part nobody loves saying out loud: they’re all good now. Show me an abstract leaderboard and I’ll shrug. The gaps only surface when you put a real job in front of them, and even then they’re smaller than the marketing wants you to believe. So that’s what I did. Ten jobs I actually run, the one I’d open first for each, plus a cheat-sheet at the end keyed to whatever eats your day. Shipping code. Wrangling data. Writing, or just keeping ops alive.
Contents
- Methodology: models tested, samples, criteria
- Case 1: Code generation (Python, TypeScript, refactor)
- Case 2: Research and document synthesis
- Case 3: Long-form writing (article, narrative)
- Case 4: Data analysis (CSV, SQL, charts)
- Case 5: Vision multimodal (screenshots, diagrams)
- Case 6: Audio multimodal (transcription, voice)
- Case 7: Tool use and function calling
- Case 8: Long context (200k to 2M tokens)
- Case 9: Cost per million tokens
- Case 10: Latency and throughput (TTFT, tokens/sec)
- Overall verdict and decision grid
- FAQ
Methodology: models tested, samples, criteria
I tested each family at both ends, the flagship and the cheap one. That means GPT-5 turbo and GPT-5 mini from OpenAI; Claude Opus 4.8 (standard, plus the fast mode that dropped May 28, 2026) and Claude Sonnet 4.6 from Anthropic; and Gemini 3.0 Pro and Gemini 3.0 Flash from Google. Everything went through the public APIs, not the chat UIs. Temperature pinned at 0.3, bumped to 0.7 for the creative prompts. Three runs each, median kept, because anyone who’s shipped against these things knows a single lucky run tells you exactly nothing.
I didn’t grade on closed benchmarks. I graded on one thing: could I ship the output? That really breaks into four questions I actually lose sleep over. Did it get the answer right and complete, so I’m not rewriting half of it? Did it do that repeatably across all three runs, or did it just get lucky once? What did it cost me in real input plus output tokens? And how long did I sit there waiting (p95 latency, first token and total)? Those weights swing hard by job. Writing a story? I’ll wait all afternoon. A chatbot answering a customer, though, and every extra second is a person already tabbing away.
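The "three runs, median kept" rule is easy to wire into a small harness. What follows is a minimal sketch, not the harness used for this article: `median_of_runs` and `fake_model` are hypothetical names, and `fake_model` is a stub standing in for a real API call.

```python
import statistics
import time
from typing import Callable


def median_of_runs(call_model: Callable[[str], str], prompt: str, runs: int = 3) -> dict:
    """Send the same prompt several times and keep the median wall-clock time."""
    timings = []
    outputs = []
    for _ in range(runs):
        start = time.perf_counter()
        outputs.append(call_model(prompt))
        timings.append(time.perf_counter() - start)
    return {
        "median_seconds": statistics.median(timings),
        "outputs": outputs,
        # Crude repeatability signal: did every run return the same text?
        "all_identical": len(set(outputs)) == 1,
    }


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a provider call; swap in a real client here."""
    return prompt.upper()


result = median_of_runs(fake_model, "parse this csv")
print(result["median_seconds"], result["all_identical"])
```

Exact string equality is a blunt repeatability check. With temperature above zero, real outputs will differ in wording, so in practice you would compare them against your own pass/fail criteria instead.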
Case 1: Code generation (Python, TypeScript, refactor)
Verdict: Claude wins
Prompt: “Implement a TypeScript function that parses a malformed CSV (mixed separators, nested quotes, empty lines), returns an array of typed objects and handles errors line by line with a structured report.”
Case 2: Research and document synthesis
Verdict: Gemini wins
Prompt: “Synthesise the 12 provided documents (academic papers + industry reports on LLMs, 380 pages total) into 5 main themes with precise citations (paragraph or page number).”
Case 3: Long-form writing (article, narrative)
Verdict: Claude wins
Prompt: “Write a 2200-word article on the history of CPU architectures for a technical general audience: lively tone, concrete examples, narrative transitions, no bullet lists.”
Case 4: Data analysis (CSV, SQL, charts)
Verdict: GPT-5 wins
Prompt: “Here is a 50,000-line CSV (server logs). Identify the 5 main anomalies, propose a SQL query to filter them, and generate a summary chart.”
Case 5: Vision multimodal (screenshots, diagrams)
Verdict: Claude wins
Prompt: “Here is a complex Grafana dashboard screenshot. Describe what is abnormal (3 visible alerts, 1 metric declining, 2 missing graphs) and propose actions.”
Case 6: Audio multimodal (transcription, voice)
Verdict: Gemini wins
Prompt: “Transcribe 30 minutes of audio (English meeting, 4 speakers, technical terminology), with speaker diarization and a bullet summary.”
Case 7: Tool use and function calling
Verdict: GPT-5 wins
Prompt: “With these 8 defined tools (search_web, read_file, write_file, execute_sql, send_email, etc.), run the task: ‘find the last 3 churned customers, send them a re-engagement email, log the result’.”
Case 8: Long context (200k to 2M tokens)
Verdict: Gemini wins
Prompt: “Here is an entire codebase of 1.4M tokens (an average Django project). Find the function responsible for tax calculation, explain its logic, and propose a refactor in under 200 lines.”
Case 9: Cost per million tokens
Here’s the real per-million-token damage, public rates as of May 2026. I split flagships from cheap models on purpose. You don’t reach for them on the same jobs, and shoving them into one list just hides where the money actually leaks out.
| Model | Input $/M | Output $/M | Max context |
|---|---|---|---|
| GPT-5 turbo | $5.00 | $15.00 | 256k (1M tier+) |
| GPT-5 mini | $0.40 | $1.60 | 200k |
| Claude Opus 4.8 (standard) | $5.00 | $25.00 | 200k |
| Claude Opus 4.8 (fast mode) | $10.00 | $50.00 | 200k |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 200k |
| Gemini 3.0 Pro | $3.50 | $14.00 | 2M |
| Gemini 3.0 Flash | $0.30 | $1.20 | 1M |
Want the number for your actual workload, not some per-million abstraction? Run it through our AI cost calculator. It already knows the rates for all six models above, plus Mistral and Cohere.
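For a quick back-of-envelope version, the per-million rates in the table convert to a per-request cost with one line of arithmetic. This is a minimal sketch: the rates are copied from the table above, while `request_cost` and the 8,000-in / 1,000-out request size are illustrative assumptions.

```python
# USD per million tokens (input, output), copied from the table above.
RATES = {
    "GPT-5 turbo": (5.00, 15.00),
    "GPT-5 mini": (0.40, 1.60),
    "Claude Opus 4.8 (standard)": (5.00, 25.00),
    "Claude Opus 4.8 (fast mode)": (10.00, 50.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Gemini 3.0 Pro": (3.50, 14.00),
    "Gemini 3.0 Flash": (0.30, 1.20),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the listed public rates."""
    rate_in, rate_out = RATES[model]
    return (input_tokens * rate_in + output_tokens * rate_out) / 1_000_000


# Hypothetical request size: 8,000 input tokens, 1,000 output tokens.
for name in RATES:
    print(f"{name}: ${request_cost(name, 8_000, 1_000):.4f}")
```

At that request size, Opus 4.8 standard comes to about $0.065 per call, Sonnet 4.6 to about $0.039, and Gemini 3.0 Flash to about $0.0036. Multiply by your daily call volume to see where the money actually goes.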
Case 10: Latency and throughput (TTFT, tokens/sec)
Most of the time latency doesn’t matter. People obsess over it anyway. It genuinely matters in two places, and that’s about it. One, anything a human waits on live, chat or voice, where they feel every single beat. Two, agent chains, where each step stacks its delay onto the one before, and ten “fast enough” calls add up to a spinner nobody sticks around for.
| Model | Median TTFT | Output tokens/sec |
|---|---|---|
| GPT-5 turbo | 520 ms | ~85 tok/s |
| GPT-5 mini | 280 ms | ~140 tok/s |
| Claude Opus 4.8 (standard) | 620 ms | ~70 tok/s |
| Claude Opus 4.8 (fast mode) | 320 ms | ~175 tok/s |
| Claude Sonnet 4.6 | 420 ms | ~95 tok/s |
| Gemini 3.0 Pro | 580 ms | ~78 tok/s |
| Gemini 3.0 Flash | 180 ms | ~210 tok/s |
| Llama 4 405B via Groq | 120 ms | ~750 tok/s |
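To see how those two columns combine, here is a rough sketch that estimates total response time as TTFT plus output length divided by throughput, then stacks it across a sequential agent chain. The figures come from the table above; `response_seconds`, `chain_seconds`, and the 300-token, 10-step scenario are assumptions for illustration, and the model ignores tool execution time and network jitter.

```python
# (median TTFT in ms, output tokens per second), copied from the table above.
LATENCY = {
    "GPT-5 turbo": (520, 85),
    "GPT-5 mini": (280, 140),
    "Claude Opus 4.8 (standard)": (620, 70),
    "Claude Opus 4.8 (fast mode)": (320, 175),
    "Claude Sonnet 4.6": (420, 95),
    "Gemini 3.0 Pro": (580, 78),
    "Gemini 3.0 Flash": (180, 210),
    "Llama 4 405B via Groq": (120, 750),
}


def response_seconds(model: str, output_tokens: int) -> float:
    """Estimated time to a complete response: first token plus generation."""
    ttft_ms, tokens_per_sec = LATENCY[model]
    return ttft_ms / 1000 + output_tokens / tokens_per_sec


def chain_seconds(model: str, steps: int, tokens_per_step: int) -> float:
    """Sequential agent chain: every step waits for the previous one."""
    return steps * response_seconds(model, tokens_per_step)


# Hypothetical chain: 10 steps of 300 output tokens each.
for name in ("Claude Opus 4.8 (standard)", "Claude Opus 4.8 (fast mode)"):
    print(f"{name}: {chain_seconds(name, 10, 300):.1f} s")
```

Under those assumptions the ten-step chain takes roughly 49 seconds on Opus 4.8 standard and roughly 20 seconds in fast mode, which is the stacking effect described above.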
Overall verdict and decision grid
Came here for me to crown one winner? Sorry. There isn’t one, and anyone handing you a single answer in 2026 is selling you something. What I actually do is route. Pick the model per job, by whichever one wins on quality and cost and latency for that one specific thing. The grid below is where I’d start you. Defaults, not gospel. Your traffic mix will shove a few of these cells around, probably within the first week.
| Profile / dominant task | Default choice | Alternative |
|---|---|---|
| Developer (code, refactor) | Claude Sonnet 4.6 | GPT-5 turbo (multi-file) |
| Research / document synthesis | Gemini 3.0 Pro (grounding) | Claude Opus 4.8 (regulated) |
| Content creation (article, narrative) | Claude Opus 4.8 | GPT-5 turbo |
| End-to-end data analysis | GPT-5 turbo (Code Interpreter) | Gemini 3.0 Pro |
| Vision (screenshots, technical OCR) | Claude Opus 4.8 | Gemini 3.0 Pro |
| Audio / transcription / diarization | Gemini 3.0 Pro | GPT-5 + Whisper |
| Multi-tool agent, function calling | GPT-5 turbo | Claude Sonnet 4.6 |
| Long context (codebase, books) | Gemini 3.0 Pro (2M) | Claude Opus 4.8 (200k quality) |
| Mass generation (classification, extraction) | Gemini 3.0 Flash | GPT-5 mini |
| Streaming UX at very low latency | Llama 4 405B via Groq / Gemini 3.0 Flash | GPT-5 mini |
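If you want the grid in code, the simplest form is a lookup with a fallback to the alternative column. This is a sketch only: the task keys (`"code"`, `"research"`, and so on) and the `pick_model` helper are hypothetical names, the parenthetical notes from the grid are dropped, and the streaming row is left out because it lists two defaults.

```python
# task key -> (default choice, alternative), taken from the grid above.
GRID = {
    "code": ("Claude Sonnet 4.6", "GPT-5 turbo"),
    "research": ("Gemini 3.0 Pro", "Claude Opus 4.8"),
    "content": ("Claude Opus 4.8", "GPT-5 turbo"),
    "data_analysis": ("GPT-5 turbo", "Gemini 3.0 Pro"),
    "vision": ("Claude Opus 4.8", "Gemini 3.0 Pro"),
    "audio": ("Gemini 3.0 Pro", "GPT-5 + Whisper"),
    "agent": ("GPT-5 turbo", "Claude Sonnet 4.6"),
    "long_context": ("Gemini 3.0 Pro", "Claude Opus 4.8"),
    "mass_generation": ("Gemini 3.0 Flash", "GPT-5 mini"),
}


def pick_model(task: str, unavailable: frozenset = frozenset()) -> str:
    """Return the default for a task, or the alternative if the default is out."""
    default, alternative = GRID[task]
    if default not in unavailable:
        return default
    if alternative not in unavailable:
        return alternative
    raise LookupError(f"no available model for task {task!r}")


print(pick_model("code"))
print(pick_model("code", unavailable=frozenset({"Claude Sonnet 4.6"})))
```

Keeping the grid as plain data means that when your traffic mix shifts a cell, the change is a one-line edit rather than a code change.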
Test the three side by side?
Stop reading my opinions. Go test yours. Our AI API Compatibility Tester takes your OpenAI request and spits out ready-to-paste code for Anthropic, Gemini, plus four other providers, so you can fire your real prompt at all three without rewriting a line.
What changed at the May 2026 update
I refreshed this on May 29, the morning after Claude Opus 4.8 landed, and three of the changes deserve your attention if you’re planning a 2026 stack. The big one for me is the new “fast mode” sitting right beside standard: 2.5x the throughput for a third of the price the old fast tiers wanted. In an agent chain, per-step latency is often the whole difference between the chain finishing and the user just bailing, so it changes which calls I’m willing to make at all. Then there’s dynamic workflows, in research preview inside Claude Code, where the agent plans the job, fans out hundreds of parallel sub-agents in one session, then checks its own homework before reporting back. And third, an effort control slider (low, default, extra, max) that’s live across the subscription tiers and lets you dial quality against speed without rerouting to a different model entirely. Anthropic’s own numbers put agentic coding at 69.2 percent, up from 64.3 on Opus 4.7, and multi-disciplinary reasoning with tools at 57.9, up from 54.7. Their benchmarks, so, grain of salt. The jump still tracks with what I’ve felt using it, for whatever that’s worth.
What to watch in late 2026
A few fights are taking shape for the back half of the year. Deep reasoning, the hidden chain-of-thought stuff (o3, Claude Extended Thinking, Gemini Deep Think), now shows up in every flagship, so expect the whole pecking order to get re-litigated on that one axis alone. Native multimodal, where audio and video and image generation all live inside one model, is sprinting at Google with Veo 3 and over at OpenAI with GPT-5 Vision and Sora. And long-running autonomous agents, sessions that run for hours and fire dozens of tools while managing their own memory, are the thing Anthropic and OpenAI keep insisting is the priority. Watch the Q3 announcements; that’s where I’d put my money.
FAQ
Which model should I use if I am starting out and only want one subscription?
One subscription and a quiet life? For chat, a bit of writing, the occasional snippet of code, ChatGPT Plus and Claude Pro feel about the same day to day. Pick whichever UI you actually enjoy looking at. Code is the one place I’d split them. If you live in an editor most days, Claude Pro is what I’d hand you. And if your work is mostly long documents plus chasing things across the web, I’d nudge you toward Gemini Advanced instead, purely for the Google grounding.
Is Claude Opus 4.8 still significantly more expensive than Sonnet 4.6?
Not the way it used to. Standard Opus 4.8 runs $5 / $25 per million tokens against Sonnet 4.6 at $3 / $15. Call it 1.7x, where the old Opus tiers stung you a full 5x. So my routing shifted. Sonnet 4.6 handles everyday code and writing. Opus 4.8 standard gets the call the moment a task is worth a small premium for its sharper judgement. And Opus 4.8 fast mode ($10 / $50, 2.5x throughput) shows up when I’m in an agent loop and latency hurts more than the invoice does.
Will prices keep falling through 2026?
They will, and the pattern’s held steady enough that I’d put money on it. Today’s flagships shed maybe 30 percent by year-end and quietly slide down into the “Sonnet/Pro” tier, while the shiny new flagship lands costing two to three times more. The rule of thumb I plan around: whatever you’re running in production is roughly six months old and costs about half the latest flagship. Size your prompt-engineering effort to match. Don’t pour three weeks into squeezing a model you’ll demote by Christmas.
How do production teams that use several models actually route?
Most teams I know either roll their own router in the app or lean on OpenRouter, Portkey or LiteLLM to handle it. The call usually comes down to a handful of things: what kind of task it is (code, vision, long context), how much it actually matters (a customer is staring at the screen versus a batch job nobody watches), and what it’s allowed to cost. The bit that surprises people: a cheap little classifier, Gemini Flash or GPT-5 mini, often makes the decision about which expensive model the real request even gets handed to.
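Those three inputs (task type, how much it matters, what it may cost) fit in a very small router. The sketch below is an assumption-heavy illustration, not how OpenRouter, Portkey or LiteLLM work internally: `classify_task` is a keyword stub standing in for the cheap classifier model, the candidate pairings are my own picks, and the prices are the output rates from the cost table.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt: str
    live_user: bool               # is a customer staring at the screen?
    max_output_usd_per_m: float   # budget cap on the output rate


def classify_task(prompt: str) -> str:
    """Keyword stub; in production a cheap model would make this call."""
    lowered = prompt.lower()
    if "screenshot" in lowered:
        return "vision"
    if "refactor" in lowered or "function" in lowered:
        return "code"
    return "general"


# Hypothetical candidates per task, strongest first: (model, output $/M).
CANDIDATES = {
    "code": [("Claude Sonnet 4.6", 15.00), ("GPT-5 mini", 1.60)],
    "vision": [("Claude Opus 4.8 (standard)", 25.00), ("Gemini 3.0 Pro", 14.00)],
    "general": [("GPT-5 turbo", 15.00), ("Gemini 3.0 Flash", 1.20)],
}


def route(request: Request) -> str:
    """Batch work goes to the cheapest candidate; live work gets the best one in budget."""
    candidates = CANDIDATES[classify_task(request.prompt)]
    if not request.live_user:
        return candidates[-1][0]
    for model, output_price in candidates:
        if output_price <= request.max_output_usd_per_m:
            return model
    return candidates[-1][0]  # nothing fits the cap: fall back to the cheapest


print(route(Request("Refactor this function", live_user=True, max_output_usd_per_m=20.0)))
print(route(Request("Describe this screenshot", live_user=True, max_output_usd_per_m=15.0)))
```

The first call lands on Claude Sonnet 4.6. The second skips Opus because of the budget cap and lands on Gemini 3.0 Pro.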
Is my data used to train the models?
On the paid APIs (OpenAI, Anthropic, Google Cloud Vertex) no, not by default, and the terms spell out the opt-out. The free web interfaces are murkier. On the ChatGPT.com free tier and Gemini.google.com you’re usually opted in unless you go dig the setting out yourself, so, go do that today. And if it’s professional or regulated work, don’t lean on a toggle at all. Go through the API, or an Enterprise plan that contractually swears your data never gets trained on. Get it in writing.
Mistral, Llama, DeepSeek: still relevant in 2026?
Absolutely, if you’ve got the right niche for them. Mistral Large 3 is the one I’d raise the moment European data sovereignty or an on-prem deployment hits the table. Llama 4 405B on Groq stays untouchable on latency for streaming UX. Said it above, saying it again. And DeepSeek-R3 reasons genuinely well for what it costs, which is close to nothing. None of them flat-out replaces the three American flagships. I still keep every one of them in the toolbox for the jobs they happen to win.













