Pocket TTS

A high-quality TTS with voice cloning that runs on CPU.

13 January 2026

We present Pocket TTS, a 100M-parameter text-to-speech model with voice cloning abilities. The model is small enough to run in real time easily on your laptop’s CPU. To try it out locally, install uv and run:

uvx pocket-tts serve

You can also use uvx pocket-tts generate for a CLI interface. Add --help to see what options are available. See the repo for more info. Or try the online demo:

Kyutai Pocket TTS

You can also clone the voice from any audio sample by using our repo. You can find more voices in our voices repository. We recommend cleaning the sample before using it with Pocket TTS.

The current state of the art in text-to-speech is split in two: on one side, LLM-based approaches with models of around 1B parameters, like Kyutai TTS 1.6B. These models are powerful enough to model any voice, emotion, and acoustic conditions, but they are too bulky to run without a GPU.

On the other side, we have small specialized models like the 82M-parameter Kokoro TTS: lightweight but less flexible. Kokoro has a fixed set of voices and relies on a handcrafted pipeline that converts the text to phonemes before passing them to the model. These two choices make the modeling task much easier, allowing for a smaller model, but they also make the pipeline deterministic and limit the set of available voices.

With Pocket TTS, we manage to bridge this gap and obtain a tiny model that runs on CPU without sacrificing the advantages of the larger models.

As usual with Kyutai releases, we open-source Pocket TTS under the MIT license. To maximize reproducibility, we trained the model on public English datasets only. We’re excited to see how far the method can be pushed with additional private data.

Pocket TTS can faithfully reproduce the voice of a given sample, needing only around 5 seconds of audio input, meaning you can easily use it with any voice you like. The color of the voice, its emotion, accent, and cadence, as well as acoustic conditions like reverb and the microphone, are all captured accurately and reproduced by the TTS.

You can use our existing library of voices or provide your own voice sample.

Evaluation

We evaluate Pocket TTS on the LibriSpeech test-clean set following the same protocol as F5-TTS, with the difference that we cleaned the audio input using Adobe Enhance Speech to obtain high-quality 24 kHz audio.

We compare against four baselines: F5-TTS, DSM (Kyutai TTS 1.6B), Chatterbox Turbo, and Kokoro TTS. We report three metrics: Word Error Rate (WER) using Whisper-large-v3, as well as the results of a pairwise human evaluation for audio quality and speaker similarity. For audio quality, we ask raters “Which of the two audio clips has the best audio quality?”, and for speaker similarity, we ask “Which of the two audio clips sounds more similar to the reference audio clip in terms of voice characteristics?” and provide the voice prompt as a reference.
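To make the WER metric concrete, here is a minimal, illustrative sketch of word error rate as Levenshtein edit distance over whitespace tokens. In the actual evaluation the hypothesis would be a Whisper-large-v3 transcript of the generated audio; the function and variable names below are our own, not part of the evaluation code.

```python
# Hedged sketch: WER = word-level edit distance / reference length.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# one substitution ("the" -> "a") over 6 reference words
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))  # 0.1666...
```

Real evaluations additionally normalize text (casing, punctuation, numbers) before scoring, which can change WER noticeably.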

As seen in the table below, Pocket TTS has the lowest Word Error Rate (tied with Kyutai TTS 1.6B), better Audio Quality than the ground truth, F5-TTS, and DSM, and Speaker Similarity on par with the ground truth, all while being significantly smaller than its competitors and, apart from Kokoro (which cannot clone voices), the only one that runs faster than real time on CPU.

| Model | Param size (generative model only) | Word Error Rate (↓) | Audio Quality (ELO) (↑) | Speaker Sim (ELO) (↑) | Faster than real-time on CPU |
| --- | --- | --- | --- | --- | --- |
| F5 TTS | 336M | 2.21 | 1949 ± 27 | 1946 ± 26 | ✗ |
| Kyutai TTS 1.6B | 750M | 1.84 | 1959 ± 25 | 2037 ± 21 | ✗ |
| Chatterbox Turbo | 350M | 3.24 | 2055 ± 23 | 2012 ± 22 | ✗ |
| Kokoro | 82M | 1.93 | No voice cloning | No voice cloning | ✓ |
| Pocket TTS (ours) | 100M | 1.84 | 2016 ± 25 | 1898 ± 26 | ✓ |

To assess practical performance, we measured the latency of each TTS system on a common laptop with an Intel Core Ultra 7 165H CPU and on a MacBook Air with an Apple M3 CPU. Pocket TTS and Kokoro are the only ones that run in real time on these CPUs, and by a significant margin; the others are far from real time.

Architecture

The core components of Pocket TTS, allowing for its small size but high performance, come from the research described in our recent paper, Continuous Audio Language Models.

As described in our codec tutorial, the standard way to model audio for text-to-speech applications is to use a neural audio codec to convert audio to discrete tokens, then autoregressively predict continuations of these token sequences with a transformer, and finally decode those back into audio.

In our previous text-to-speech release, we used a so-called RQ-transformer to get audio tokens from the backbone. But when attempting to shrink the model, this becomes a computational bottleneck, as it is very challenging to make the RQ-transformer smaller without sacrificing quality.

With Pocket TTS, we remove this bottleneck by entirely avoiding discrete tokens and instead having the model predict sequences of continuous latents directly. This may seem like a simple change, but a number of tricks and optimizations are needed to make it work. We provide a technical overview here; for more details, please refer to the Continuous Audio Language Models paper. A v3 of the paper with more details on Pocket TTS is coming soon.

Diagram of the Pocket TTS architecture

Neural audio codec

Our codec is based on Mimi, the neural audio codec we designed for Moshi and later used in our delayed streams modeling paper. The primary difference is that Mimi compresses the audio into discrete tokens, whereas here we use continuous latents, regularized to follow a normal distribution as is done in standard VAE training. Like in Mimi, to enforce semanticity of the representations, we distill WavLM into the inner latent representation of our codec with a cosine similarity loss. Mimi applies this loss only to the first RVQ level, but here, since there is no RVQ, we apply the distillation loss to the entire latent representation.
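The distillation objective described above can be sketched as a cosine-similarity loss between codec latents and frozen WavLM features. This is a minimal illustration in plain Python; the frame lists, dimensions, and function name are our own, not the actual Pocket TTS code.

```python
import math

# Hedged sketch: 1 - mean cosine similarity between codec latent frames
# and the corresponding (frozen) WavLM feature frames.
def cosine_distillation_loss(latents, wavlm_feats):
    """Each argument is a list of frames (lists of floats) of equal length."""
    total = 0.0
    for z, w in zip(latents, wavlm_feats):
        dot = sum(a * b for a, b in zip(z, w))
        nz = math.sqrt(sum(a * a for a in z)) + 1e-8  # avoid division by zero
        nw = math.sqrt(sum(b * b for b in w)) + 1e-8
        total += dot / (nz * nw)
    return 1.0 - total / len(latents)  # 0 when perfectly aligned

frames = [[0.5, -1.0, 2.0], [1.0, 1.0, 0.0]]
print(cosine_distillation_loss(frames, frames))                             # ~0 (aligned)
print(cosine_distillation_loss(frames, [[-a for a in f] for f in frames]))  # ~2 (opposite)
```

In Mimi this loss touches only the first RVQ level; here it covers the whole latent vector, since there is no quantizer to split it into levels.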

Generative Model

We train the model to predict continuations for $(\mathbf{x}^1, \ldots, \mathbf{x}^{S-1})$, the sequence of continuous latent vectors produced by the codec’s encoder.

We build on the Masked Autoregressive (MAR) framework by employing a causal transformer backbone $T_{\theta}$ that outputs $(\mathbf{z}^2, \ldots, \mathbf{z}^S)$ and conditions an MLP sampler to build the next continuous latents $(\mathbf{x}^2, \ldots, \mathbf{x}^S)$. In MAR, the sampler is a diffusion model, but here we use a Lagrangian Self-Distillation (LSD) loss to natively enable 1-step sampling. At step $i$, the sampler outputs the prediction for the next latent $\mathbf{x}^{i+1}$, which, at inference time, is autoregressively fed back to the model.
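The inference loop described above can be sketched in a few lines: the backbone maps the latent history to a conditioning vector, a one-step sampler turns that vector plus Gaussian noise into the next latent, and the result is fed back in. The two stub functions below are stand-ins for the actual networks; shapes and names are illustrative only.

```python
import random

DIM = 4  # illustrative latent dimension

def backbone_step(history):
    # Stand-in for the causal transformer: here it just averages the history.
    return [sum(x[d] for x in history) / len(history) for d in range(DIM)]

def lsd_head(z, eps):
    # Stand-in for the 1-step LSD sampler: a single map of (z, noise) -> latent.
    return [zi + 0.1 * ei for zi, ei in zip(z, eps)]

def generate(prompt_latents, n_steps, rng):
    seq = list(prompt_latents)
    for _ in range(n_steps):
        z = backbone_step(seq)                        # conditioning z^s
        eps = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
        x_next = lsd_head(z, eps)                     # one sampler call per frame
        seq.append(x_next)                            # autoregressive feedback
    return seq

rng = random.Random(0)
out = generate([[0.0] * DIM], n_steps=8, rng=rng)
print(len(out))  # 1 prompt frame + 8 generated frames = 9
```

The key contrast with MAR is that `lsd_head` is called once per frame, instead of iterating a multi-step diffusion sampler.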

Voice and text conditioning

To condition the model on the text to say and the voice to say it with, we prefix the generated audio with a few-second voice prompt followed by the text to say. The audio is encoded using the neural audio codec encoder, and the text is embedded using a SentencePiece tokenizer.
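The ordering of that conditioning sequence can be sketched as follows; the element names are placeholders, and the real model works with embedding vectors rather than strings.

```python
# Hedged sketch: the model sees one flat sequence of
# [encoded voice prompt] + [text token embeddings], and generates the
# audio latents as the continuation of that prefix.
def build_prefix(voice_latents, text_token_embeddings):
    return list(voice_latents) + list(text_token_embeddings)

prefix = build_prefix(["v1", "v2", "v3"], ["t1", "t2"])
print(prefix)  # ['v1', 'v2', 'v3', 't1', 't2']
```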

Model size breakdown

In total, the generative model (causal transformer + MLP head) has 90M parameters and the codec’s decoder has 10M, adding up to 100M parameters. There is also the codec’s 18M-parameter encoder, which is only used once to encode a given voice sample; afterwards, we can keep the embedding in memory to generate different audio from the same voice.

Data

The model is trained purely on publicly released data. Specifically, the dataset is composed of AMI, EARNINGS22, GIGASpeech, SPGISpeech, TED-LIUM, VoxPopuli, LibriHeavy, and Emilia. These datasets are all in English and add up to 88k hours of audio.

Scientific contributions

We employ several strategies to train this new model with continuous latents in an efficient manner:

Head batch multiplier

The training is bottlenecked by the transformer backbone that generates the conditioning variable $\mathbf{z}^s$. To address this, we introduce the Head Batch Multiplier, which amortizes this cost by reusing $\mathbf{z}^s$ multiple times per training step. Specifically, for each input sequence, we compute $\mathbf{z}^s$ once and use it across $N$ loss computations, each with independently sampled LSD noise levels $s$, $t$ and Gaussian noise $\epsilon$. This not only improves efficiency but also stabilizes training by averaging the loss over multiple samples. We use $N=8$.
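The amortization can be sketched as follows: one expensive backbone output is reused across N cheap head-loss evaluations, each with fresh noise draws. The `head_loss` function below is a stand-in for the actual LSD objective; all names and values are illustrative.

```python
import random

N = 8  # head batch multiplier, as in the post

def head_loss(z, target, s, t, eps):
    # Stand-in loss: any function of the backbone output z, the target
    # latent, the sampled noise levels (s, t), and Gaussian noise eps.
    return sum((zi - xi + 0.01 * (s - t) * ei) ** 2
               for zi, xi, ei in zip(z, target, eps)) / len(z)

def amortized_loss(z, target, rng):
    # z is computed ONCE by the backbone; the head loss is evaluated N times
    # with independent noise draws, and the results are averaged.
    losses = []
    for _ in range(N):
        s, t = rng.random(), rng.random()
        eps = [rng.gauss(0.0, 1.0) for _ in z]
        losses.append(head_loss(z, target, s, t, eps))
    return sum(losses) / N  # averaging also reduces gradient variance

rng = random.Random(0)
print(amortized_loss([0.5] * 16, [0.4] * 16, rng))
```

Since the head is a small MLP, these N extra evaluations cost far less than N extra backbone forward passes would.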

Gaussian Temperature Sampling

Sampling strategies, such as temperature sampling, can have a significant positive impact on generation quality in the discrete setup, particularly for speech. To replicate this behavior in the continuous domain, we introduce a dedicated sampling heuristic that results in comparable gains.

Specifically, we reduce the variance of the Gaussian noise passed to the LSD head. Applying a temperature of $\tau$ is mathematically equivalent to multiplying the standard deviation by $\sqrt{\tau}$. We found that a temperature of 0.7 brought good results in practice.
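Concretely, applying temperature in the continuous domain is just a rescaling of the sampler's input noise; the empirical variance of the scaled noise matches the temperature. A minimal sketch (names are illustrative):

```python
import random
import statistics

def sample_noise(dim, tau, rng):
    # Temperature tau = multiply the standard deviation by sqrt(tau),
    # so the variance of the noise becomes tau instead of 1.
    scale = tau ** 0.5
    return [rng.gauss(0.0, scale) for _ in range(dim)]

rng = random.Random(0)
noise = sample_noise(100_000, tau=0.7, rng=rng)
print(statistics.variance(noise))  # close to tau = 0.7
```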

Latent Classifier-Free Guidance

Similarly, Classifier-Free Guidance (CFG) is known to improve the generation quality of conditioned generative models. It can be – and is generally – applied for diffusion and flow matching models on the sampling trajectory as well as on the logits of autoregressive language models.

However, CFG cannot be applied on the trajectory of 1-step flow models. Intuitively, this is because the output space cannot be used for interpolation/extrapolation. For instance, if we had a flow model that predicts a waveform directly, we would be interpolating between two waveforms, which just amounts to layering the sounds over each other. Here we predict latents for a neural audio codec, but the same intuition holds.

Instead, we apply the CFG on the outputs of the causal transformer backbone. Formally, given a conditioning $C$ and a CFG coefficient $\alpha$, we compute, for every step $s$ of the sequence, $\mathbf{z}^s_{CFG} = \mathbf{z}^s_{\emptyset} + \alpha(\mathbf{z}^s_C - \mathbf{z}^s_{\emptyset})$, where $\mathbf{z}^s_C$ is the output of the conditioned forward pass and $\mathbf{z}^s_{\emptyset}$ that of the unconditioned forward pass. We then generate $\mathbf{x}^s$ with the LSD head conditioned on $\mathbf{z}^s_{CFG}$.
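The guidance formula above is a one-liner once the two backbone outputs are available; with $\alpha > 1$ it extrapolates past the conditioned output rather than interpolating. A minimal sketch with illustrative names:

```python
ALPHA = 1.5  # CFG coefficient reported in the post

def latent_cfg(z_cond, z_uncond, alpha=ALPHA):
    # z_cfg = z_uncond + alpha * (z_cond - z_uncond)
    return [u + alpha * (c - u) for c, u in zip(z_cond, z_uncond)]

print(latent_cfg([1.0, 2.0], [0.0, 0.0]))  # [1.5, 3.0]
```

Note this requires two backbone forward passes per step (conditioned and unconditioned), which is the cost the distillation below removes.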

We call this method Latent CFG, as it operates on the latent variable $\mathbf{z}^s$ instead of the model output. It is somewhat surprising that this works at all, because the modified latents could be completely out-of-distribution for the Flow Head, but we find that it significantly improves performance. In practice, we use $\alpha=1.5$.

We discovered Latent CFG independently, but we note that a similar idea appears in the video-to-audio literature with SoundReactor.

Distillation

Once we have trained a model with a set CFG coefficient $\alpha$, we can use it as a teacher and distill it into a model that generates with this coefficient without doubling the batch size. The distilled model keeps a frozen copy of the teacher’s MLP head. The distillation objective for the student is to output, from its backbone, a $\mathbf{z}^s_{distill}$ that matches the $\mathbf{z}^s_{CFG}$ of the guided teacher under an L2 loss. We observe that the student can remain accurate even with fewer layers than the teacher, which enabled us to pair a 24-layer teacher with a student of a mere 6 layers.
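The distillation target and loss can be sketched as follows: the teacher runs twice (conditioned and unconditioned) to build the guided latent, and the student's single backbone output is regressed onto it with an L2 loss. Names are illustrative, not the training code.

```python
ALPHA = 1.5  # the teacher's fixed CFG coefficient

def teacher_guided_z(z_cond, z_uncond, alpha=ALPHA):
    # Two teacher forward passes combined into one guided target.
    return [u + alpha * (c - u) for c, u in zip(z_cond, z_uncond)]

def distill_loss(z_student, z_cond, z_uncond):
    # L2 between the student's single-pass output and the guided target.
    target = teacher_guided_z(z_cond, z_uncond)
    return sum((s - t) ** 2 for s, t in zip(z_student, target)) / len(target)

# A student that exactly matches the guided target has zero loss.
target = teacher_guided_z([1.0, 2.0], [0.0, 0.0])
print(distill_loss(target, [1.0, 2.0], [0.0, 0.0]))  # 0.0
```

At inference, the student needs only one backbone pass per step, so the batch-doubling cost of CFG disappears.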

Authors

Manu Orsini*, Simon Rouard*, Gabriel De Marmiesse*, Václav Volhejn, Neil Zeghidour, Alexandre Défossez

*equal contribution
