Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Wang, Chengyi; Chen, Sanyuan; Wu, Yu; Zhang, Ziqiang; Zhou, Long; Liu, Shujie; Chen, Zhuo; Liu, Yanqing; Wang, Huaming; Li, Jinyu; He, Lei; Zhao, Sheng; Wei, Furu

doi:10.48550/arXiv.2301.02111

[Submitted on 5 Jan 2023]

Abstract: We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, which is hundreds of times larger than existing systems. Vall-E exhibits emergent in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experimental results show that Vall-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find that Vall-E can preserve the speaker's emotion and the acoustic environment of the acoustic prompt in synthesis. See this https URL for demos of our work.
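
The abstract frames TTS as conditional language modeling over discrete codec codes. The following is a minimal toy sketch of that framing only, not the authors' implementation: the conditioning context is the text tokens followed by the codes of the enrolled recording, and new codes are sampled one at a time. The names `synthesize_codes`, `toy_next_token_probs`, `CODEBOOK_SIZE`, and `EOS` are hypothetical, the codebook size is an assumed value, and the uniform distribution merely stands in for a trained model.

```python
import random
from typing import Callable, List, Sequence

CODEBOOK_SIZE = 1024  # assumed value for illustration; not stated in the abstract
EOS = CODEBOOK_SIZE   # hypothetical end-of-sequence token id


def toy_next_token_probs(prefix: Sequence[int]) -> List[float]:
    """Placeholder for a trained model: uniform over all codes plus EOS."""
    n = CODEBOOK_SIZE + 1
    return [1.0 / n] * n


def synthesize_codes(
    text_tokens: Sequence[int],
    prompt_codes: Sequence[int],
    next_token_probs: Callable[[Sequence[int]], List[float]] = toy_next_token_probs,
    max_new_tokens: int = 50,
    seed: int = 0,
) -> List[int]:
    """Autoregressively sample codec codes conditioned on text and a prompt."""
    rng = random.Random(seed)
    # Conditioning context: text tokens, then codes of the enrolled recording.
    # (This toy shares one id space; a real system would embed them separately.)
    context = list(text_tokens) + list(prompt_codes)
    generated: List[int] = []
    for _ in range(max_new_tokens):
        probs = next_token_probs(context + generated)
        token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        if token == EOS:
            break
        generated.append(token)
    return generated


if __name__ == "__main__":
    codes = synthesize_codes(text_tokens=[5, 9, 2], prompt_codes=[100, 200, 300])
    print(len(codes), codes[:5])
```

In a real pipeline the sampled codes would be passed to the codec's decoder to produce a waveform; that step is omitted here because the abstract does not describe it.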

Comments:	Work in progress
Subjects:	Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2301.02111 [cs.CL]
	(or arXiv:2301.02111v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2301.02111

Submission history

From: Yu Wu
[v1] Thu, 5 Jan 2023 15:37:15 UTC (818 KB)
