XTTS v2 vs ElevenLabs comes down to one honest trade: do you want a voice that runs free on your own machine, or the easiest path to the most natural output on the market? We have used both for real work, narration and quick voice clones, and neither is a clean win. XTTS v2 is open source, runs offline, costs nothing per word, and clones a voice from a few seconds of audio. ElevenLabs sounds a touch more human, ships in more languages, answers faster, and gives you clear commercial rights the moment you pay. The catch most comparisons skip is licensing, and it is the part that decides it for a lot of people. Here is the whole picture, with no fake prices.
The short answer
Pick XTTS v2 if you want a voice that runs free and offline on your own GPU, keeps every word private, and clones from a few seconds of audio, and your use is personal or research, because the model weights are non-commercial. Pick ElevenLabs if you are publishing or monetizing: it sounds a little more human, answers faster, speaks more languages, and hands you clear commercial rights the moment you pay.
We came to this the way most people do: needing a voice for something (a video, a tutorial, a prototype that talks) and finding the field had narrowed to two names that keep coming up. Those names are ElevenLabs, the cloud service everyone benchmarks against, and XTTS v2, the open-source model from Coqui that you run yourself. They solve the same problem from opposite ends. One is a subscription that just works; the other is a free download that asks for a GPU and an afternoon. We have shipped real audio with both, so this is the comparison we wish we had read first. Prices on the ElevenLabs side shift, so we will describe the shape of the plans and send you to their page for the current numbers rather than feed you figures that age badly.
The quick verdict
If you are skimming, here is the short version, by who you are.
| You are... | The pick | Why |
|---|---|---|
| A hobbyist, student or researcher | XTTS v2 | Free, private, offline, unlimited generation on your own GPU |
| Publishing or monetizing content | ElevenLabs | Top quality, zero setup, and a clear commercial license |
| Building a real-time app | ElevenLabs (Flash) | Sub-100ms latency that local engines cannot match |
| Privacy-bound (no cloud allowed) | XTTS v2 | Nothing ever leaves your machine, full stop |
| Shipping a paid product on a budget | Neither, look wider | XTTS weights are non-commercial; use Piper or Kokoro instead |
That last row is the one that surprises people, and we get into it below, because it is the detail that quietly decides a lot of projects.
Side by side
The same comparison as a table, across the things that actually decide it. Read the licensing row twice.
| | XTTS v2 (Coqui) | ElevenLabs |
|---|---|---|
| Where it runs | Your hardware, fully offline | The cloud, internet required |
| Cost | Free to run (you pay in GPU) | Free eval tier, then paid plans by usage |
| Output quality | Excellent, roughly 94% of ElevenLabs in independent listening tests | The benchmark, slightly more natural and expressive |
| Languages | 17 | Around 32 on Multilingual v2, more on the newest model |
| Voice cloning | From ~6s of audio, 15 to 30s is better | Instant clone from ~1 min, pro clone from a recording session |
| Latency | Depends on your GPU, grows with text length | Very low, the Flash model answers in around 75ms |
| Privacy | Total, no audio leaves your machine | Audio is processed on their servers |
| Commercial use | Non-commercial license on the weights, and no one left to buy a commercial one from | Included on any paid plan |
| Setup effort | Install, a GPU, some tinkering | Sign up and type |
Where ElevenLabs wins
There is a reason it is the name everyone measures against. The output from its Multilingual v2 model, and the newer, more expressive model above it, is genuinely hard to tell from a real recording at normal listening speed. The emotional range is wider, the consistency across a long script is steadier, and you get there with zero setup: you paste text and it talks. It speaks more languages out of the box, and for anything interactive its Flash model answers in well under a tenth of a second, which no local engine on consumer hardware will match. And when you pay, the commercial rights are spelled out and yours, which matters more than people realize until a client asks.
The honest cost is exactly that: it is a paid, cloud service. Your text and audio go to their servers, so it is a non-starter where data cannot leave the building. The free tier is for trying it, not shipping with it: no commercial use, an attribution requirement, and no instant voice clone. And because billing is by usage credits, a heavy month costs real money. You are buying quality, speed and convenience, and they are genuinely worth buying, but you are renting them.
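For a sense of what "zero setup" looks like from code rather than the web app, here is a minimal sketch of a text-to-speech request using only the Python standard library. Treat the details as assumptions to check against the ElevenLabs API reference: the endpoint path, the `xi-api-key` header, the `eleven_multilingual_v2` model ID and the MP3 response are what we understand the public REST API to use, and `build_tts_request` and `synthesize` are our own hypothetical helper names. The voice ID and API key are placeholders you supply.

```python
import json
import urllib.request

# Assumption: endpoint, header name and model ID follow ElevenLabs' public
# REST documentation; verify them before relying on this sketch.
API_BASE = "https://api.elevenlabs.io/v1"


def build_tts_request(text, voice_id, api_key, model_id="eleven_multilingual_v2"):
    """Build (but do not send) a text-to-speech request."""
    body = json.dumps({"text": text, "model_id": model_id}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/text-to-speech/{voice_id}",
        data=body,
        method="POST",
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
    )


def synthesize(text, voice_id, api_key, out_path="line.mp3"):
    """Send the request and save the returned audio bytes.

    This calls the live service, so it spends usage credits.
    """
    request = build_tts_request(text, voice_id, api_key)
    with urllib.request.urlopen(request) as response:
        audio = response.read()  # assumed to be MP3 audio by default
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path
```

Keeping the request builder separate from the network call lets you inspect exactly what would be sent, and which text leaves your machine, before a single credit is spent.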
Where XTTS v2 wins
XTTS v2 wins on everything the cloud cannot give you. It is free to run, so once you have the hardware there is no per-word meter ticking, which changes how you work: you regenerate a line twenty times without thinking about cost. It runs entirely offline, so it is the obvious answer when privacy or air-gapping is a hard requirement: nothing you type or clone ever leaves your machine. It clones a voice from as little as six seconds of audio across 17 languages. And the quality is the surprise: independent listening tests put it at around 94 percent of ElevenLabs, with the cloud service holding only a small lead on emotion and consistency. For a lot of narration, that gap is hard to hear.
The price you pay is in hardware and effort. You want an NVIDIA GPU for anything near real time (the model runs on a CPU, but slowly), and the install is a real step rather than a sign-up form. And then there is the licensing, which deserves its own section.
The licensing catch nobody mentions
This is the part that turns a clear win into a careful one, and most comparisons skip it. The XTTS v2 model weights ship under the Coqui Public Model License, which is non-commercial. The code around the model is the permissive MPL 2.0, so people assume the whole thing is free to use commercially. It is not: the weights, the part that actually makes the voice, are the restricted bit.
It gets sharper. Coqui, the company that made XTTS and was the only party who could sell you a commercial license, shut down in January 2024. So there is now no legal path to a commercial license at all, because the seller no longer exists. The model stays free to download and use, and an actively maintained community fork keeps it running on current Python and PyTorch, but the license on the weights did not change. The honest reading in 2026 is simple: use XTTS v2 for personal, research and non-commercial work, and do not build it into a product you sell.
If you need a local engine you can actually ship, the good news is there are several whose weights are genuinely permissive: Piper, Kokoro (Apache 2.0) and StyleTTS 2 (MIT) are the names to look at. They are the right tool when the use is commercial and you still want everything local. And if you want XTTS-level polish with no licensing homework at all, a paid ElevenLabs plan is, again, the clean answer.
Running XTTS v2 locally
If XTTS v2 is your pick, here is how little it takes to get a voice out of it. Install the maintained community fork, point it at a short voice sample, and pick one of the 17 languages.
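A minimal sketch of those three steps, under some assumptions: we take the maintained fork to be the `coqui-tts` package on PyPI (installed with `pip install coqui-tts`), which keeps the original `TTS.api` interface, and `clone_and_speak` and `check_language` are our own illustrative helper names, not part of the library. The first run downloads the model weights and may ask you to accept the non-commercial license.

```python
# The 17 language codes XTTS v2 accepts.
XTTS_LANGUAGES = {
    "en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru",
    "nl", "cs", "ar", "zh-cn", "ja", "hu", "ko", "hi",
}

MODEL_NAME = "tts_models/multilingual/multi-dataset/xtts_v2"


def check_language(code):
    """Return the normalized language code, or raise if XTTS v2 lacks it."""
    code = code.lower()
    if code not in XTTS_LANGUAGES:
        raise ValueError(f"XTTS v2 does not support language {code!r}")
    return code


def clone_and_speak(text, speaker_wav, language="en", out_path="output.wav"):
    """Clone the voice in `speaker_wav` and speak `text` into `out_path`."""
    language = check_language(language)

    # Imported lazily so the helper above works without the heavy dependencies.
    import torch
    from TTS.api import TTS

    # Use the GPU when one is available; the CPU works but is slow.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tts = TTS(MODEL_NAME).to(device)  # downloads the weights on first use
    tts.tts_to_file(
        text=text,
        speaker_wav=speaker_wav,
        language=language,
        file_path=out_path,
    )
    return out_path
```

A call would look like `clone_and_speak("Hello there.", "my_voice.wav", language="en")`, where `my_voice.wav` is a placeholder for your own sample. Loading the model inside the function keeps the sketch short; for a batch of lines you would load it once and reuse it.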
The quality of your clone rides almost entirely on the sample. A clean, consistent 15 to 30 second recording, one voice, no music, no room echo, beats a longer messy one every time. That is true for ElevenLabs too, so if you are going to clone a voice, it is worth recording the sample properly once.
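A quick sanity check on the sample before you clone can save a wasted run. The sketch below uses Python's standard `wave` module to read the duration, sample rate and channel count of a WAV file and grade the length against the 6 second minimum and the 15 to 30 second range mentioned above. The function name `inspect_sample` and the verdict labels are our own, and it cannot hear music or room echo, so it complements listening to the file rather than replacing it.

```python
import wave


def inspect_sample(path, min_seconds=6.0, ideal_range=(15.0, 30.0)):
    """Report basic facts about a WAV voice sample and grade its length."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        channels = wav.getnchannels()

    seconds = frames / float(rate)

    if seconds < min_seconds:
        verdict = "too short"   # below the ~6 s minimum
    elif seconds < ideal_range[0]:
        verdict = "usable"      # works, but a longer sample clones better
    elif seconds <= ideal_range[1]:
        verdict = "ideal"       # the 15 to 30 s range
    else:
        verdict = "long"        # consider trimming to the cleanest stretch

    return {
        "seconds": seconds,
        "sample_rate": rate,
        "channels": channels,
        "verdict": verdict,
    }
```

The `wave` module only reads uncompressed WAV, so an MP3 or M4A recording would need converting first.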
So which should you choose?
Strip it all back and it is two clean cases. If your work is personal, research, or anything where the audio cannot touch the cloud, and you have a GPU, XTTS v2 is a genuinely excellent free voice, as long as you respect the non-commercial license. If you are publishing or monetizing, want the most natural result with no setup, need many languages or low latency, or simply want commercial rights you do not have to think about, ElevenLabs earns its subscription.
For most people reading this who are making something to put out into the world, that points at ElevenLabs, with XTTS v2 as the brilliant free option for everything private and personal. And if you are shipping a paid product on a tight budget, remember the third door: a permissively licensed local engine like Piper or Kokoro. The right voice depends entirely on what you are doing with it, which is exactly how it should be.
Sources and further reading
- ElevenLabs, models and the differences between them
- ElevenLabs pricing
- Coqui XTTS v2 model and license on Hugging Face
- The maintained community fork of Coqui TTS
Frequently asked questions
Is XTTS v2 as good as ElevenLabs?
It is closer than you would expect. In independent listening tests XTTS v2 lands at roughly 94 percent of ElevenLabs quality, with ElevenLabs keeping a small edge on emotional range and consistency. For a lot of narration and voice-clone work the gap is hard to hear at normal listening speed. Where ElevenLabs still pulls clearly ahead is latency and its professional voice clone, which is trained from a real recording session and beats any local engine. So XTTS v2 is good enough for most jobs; ElevenLabs is the safer pick when the output has to be flawless.
Can I use XTTS v2 commercially?
Treat it as non-commercial. The XTTS v2 model weights ship under the Coqui Public Model License, which forbids commercial use, even though the surrounding code is the permissive MPL 2.0. Coqui, the company, shut down in January 2024, so there is no longer anyone to sell you a commercial license. If you need a local engine you can ship in a paid product, look at Piper, Kokoro or StyleTTS 2, whose weights are permissive (MIT or Apache). For commercial work with XTTS-level polish and no licensing risk, a paid ElevenLabs plan is the clean answer.
Is ElevenLabs free?
There is a free tier, but it is for evaluation only. It gives you a small monthly credit budget and no commercial rights, it requires you to attribute ElevenLabs in anything public, and instant voice cloning is switched off. The moment you want to publish or monetize, you need a paid plan, which is also where the commercial license and voice cloning unlock. So you can test it for nothing, but you cannot legally ship free-tier audio in a product or a monetized video.
How much audio do I need to clone a voice?
For XTTS v2, as little as 6 seconds works, though 15 to 30 seconds of clean audio gives a noticeably better clone. ElevenLabs' instant voice clone wants about a minute and lands in the same quality range as XTTS v2. Its professional voice clone is a different tier: it asks for around 30 minutes or more of studio-grade recording and produces a clone that exceeds any local engine. The cleaner and more consistent your sample, the better every one of these gets.
Do I need a GPU to run XTTS v2?
For usable speed, yes. XTTS v2 will run on a CPU, but generation is slow enough that it stops being fun for anything beyond a test. A consumer NVIDIA GPU brings it down to near real time. That is the hidden cost of the free engine: you are trading a subscription for hardware and setup. ElevenLabs needs none of that, since everything runs on their servers, which is the whole point of paying for it.
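"Near real time" is easy to put a number on. The usual measure is the real-time factor: seconds spent generating divided by seconds of audio produced, so a value at or below 1.0 means the engine keeps up with playback. The sketch below is a generic way to measure it on your own hardware; `real_time_factor` and `time_call` are our own helper names, and you would pass in whatever synthesis function you actually use.

```python
import time


def real_time_factor(synthesis_seconds, audio_seconds):
    """Seconds of compute per second of audio; 1.0 or less keeps up with playback."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def time_call(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

For example, a line that takes 5 seconds to generate and plays for 10 seconds gives `real_time_factor(5.0, 10.0)`, which is 0.5. Time a few lines of different lengths, since generation time grows with the text.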