Closed
Description
This new model seems suitable for integration: https://github.com/edwko/OuteTTS
We should add a very minimalistic example for generating audio with it. Ideally, we will implement the (audio tokens) -> (wav) from scratch.
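As a rough illustration of the last stage of an (audio tokens) -> (wav) path, here is a minimal sketch of serializing decoded samples as a WAV file. It assumes the decoder has already produced mono float samples in [-1, 1]; the helper names (`put_le`, `put_tag`, `encode_wav_pcm16`) are hypothetical and not part of llama.cpp, and the sample rate is whatever the codec actually emits.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <string>
#include <vector>

// Append an unsigned integer in little-endian byte order, as RIFF requires.
static void put_le(std::vector<uint8_t> & out, uint32_t value, int n_bytes) {
    for (int i = 0; i < n_bytes; ++i) {
        out.push_back(static_cast<uint8_t>((value >> (8 * i)) & 0xFF));
    }
}

// Append a four-character chunk tag such as "RIFF" or "data".
static void put_tag(std::vector<uint8_t> & out, const std::string & tag) {
    out.insert(out.end(), tag.begin(), tag.end());
}

// Hypothetical helper: serialize mono float samples in [-1, 1] as a
// 16-bit PCM WAV byte buffer (44-byte header followed by the samples).
std::vector<uint8_t> encode_wav_pcm16(const std::vector<float> & samples, uint32_t sample_rate) {
    const uint32_t data_size = static_cast<uint32_t>(samples.size() * 2);
    std::vector<uint8_t> out;
    out.reserve(44 + data_size);

    put_tag(out, "RIFF");
    put_le(out, 36 + data_size, 4);   // size of everything after this field
    put_tag(out, "WAVE");
    put_tag(out, "fmt ");
    put_le(out, 16, 4);               // size of the fmt chunk body
    put_le(out, 1, 2);                // audio format: 1 = integer PCM
    put_le(out, 1, 2);                // channel count: mono
    put_le(out, sample_rate, 4);
    put_le(out, sample_rate * 2, 4);  // byte rate = rate * channels * 2 bytes
    put_le(out, 2, 2);                // block align = channels * 2 bytes
    put_le(out, 16, 2);               // bits per sample
    put_tag(out, "data");
    put_le(out, data_size, 4);

    for (float s : samples) {
        // Clamp first so out-of-range decoder output cannot overflow int16.
        const float clamped = std::max(-1.0f, std::min(1.0f, s));
        const int16_t pcm = static_cast<int16_t>(std::lround(clamped * 32767.0f));
        put_le(out, static_cast<uint16_t>(pcm), 2);
    }
    return out;
}
```

The returned buffer can be written to disk in binary mode as-is. The hard part, turning audio tokens into those float samples, is not covered by this sketch.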
Activity
JohannesGaessler commented on Nov 5, 2024
Do you have any opinions regarding if and how TTS should be integrated into the server? Make it directly part of the HTTP server? Run another server to which the llama.cpp server in turn sends requests? (I think the first approach would be more suitable for multimodal models; the second would be more modular.)
ggerganov commented on Nov 5, 2024
Not yet. Seems like the biggest question is how to implement the WavTokenizer. If it's too complex, it might have to live in a separate project? With its own server? Not sure.
Pinging @PABannier as they have experience with encodec.cpp and, to my understanding, WavTokenizer is something similar to Encodec?
ngxson commented on Nov 5, 2024
A bit off-topic, but having some kind of `audio-tokenizer.cpp` inside `llama.cpp` would be a very big deal. It could potentially unlock whole pipelines such as TTS, speech-to-text (ASR), and speech-to-speech.
bachittle commented on Nov 6, 2024
The paper mentions Encodec a lot, and says it follows the same paradigm in using a VQ-GAN: https://arxiv.org/pdf/2408.16532 . It is definitely feasible to implement an audio tokenizer here.
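For readers unfamiliar with the VQ side of this, here is a minimal sketch of the step a vector-quantized decoder typically starts with: replacing each token id with its codebook vector before the convolutional layers run. It is an illustration only, not WavTokenizer's actual code; the function name `lookup_codes`, the row-major layout, and the dimensions are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: map each audio token id to its codebook row.
// `codebook` is assumed row-major, with `dim` floats per entry.
// Returns tokens.size() * dim floats, one embedding per token, in order.
std::vector<float> lookup_codes(const std::vector<float> & codebook,
                                size_t dim,
                                const std::vector<int32_t> & tokens) {
    assert(dim > 0 && codebook.size() % dim == 0);
    const size_t n_codes = codebook.size() / dim;

    std::vector<float> out;
    out.reserve(tokens.size() * dim);
    for (int32_t id : tokens) {
        // An id outside the codebook means the LLM emitted a non-audio token.
        assert(id >= 0 && static_cast<size_t>(id) < n_codes);
        const float * row = codebook.data() + static_cast<size_t>(id) * dim;
        out.insert(out.end(), row, row + dim);
    }
    return out;
}
```

In a real integration the codebook would come from the converted model weights, and the resulting embeddings would feed the decoder network that produces the waveform.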
PABannier commented on Nov 6, 2024
+1 to @ngxson: tokens to WAV is a big step that bridges the gap between LLMs and TTS models. Encodec is one of those models, and a lot of neural codecs are derived from Encodec (see Vocos for example).
Happy to explain in greater detail what I did and to help integrate Encodec (or a similar model) into llama.cpp. As an example of how Encodec integrates after LLMs, you can check Bark.cpp.
FYI, I'm in the process of upstreaming a bunch of Metal kernels to ggml which come in very handy for supporting Encodec (`ggml_conv_transpose_1d`, `ggml_elu`, `ggml_argmax`, `ggml_set_i32`, etc.).
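To make the math behind three of those operator names concrete, here is a plain C++ reference sketch of a 1D transposed convolution, ELU, and argmax. This is not ggml code and does not use the ggml API; it assumes a single channel, no padding, no dilation, and ELU with alpha = 1, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <iterator>
#include <vector>

// Single-channel 1D transposed convolution (upsampling), no padding or dilation.
// Each input sample scatters a scaled copy of the kernel into the output,
// so the output length is (n_in - 1) * stride + n_kernel.
std::vector<float> conv_transpose_1d(const std::vector<float> & input,
                                     const std::vector<float> & kernel,
                                     size_t stride) {
    assert(stride > 0 && !kernel.empty());
    if (input.empty()) {
        return {};
    }
    std::vector<float> out((input.size() - 1) * stride + kernel.size(), 0.0f);
    for (size_t i = 0; i < input.size(); ++i) {
        for (size_t k = 0; k < kernel.size(); ++k) {
            out[i * stride + k] += input[i] * kernel[k];
        }
    }
    return out;
}

// ELU activation with alpha = 1: identity for x > 0, exp(x) - 1 otherwise.
float elu(float x) {
    return x > 0.0f ? x : std::expm1(x);
}

// Index of the largest element (first one wins on ties).
size_t argmax(const std::vector<float> & values) {
    assert(!values.empty());
    return static_cast<size_t>(
        std::distance(values.begin(), std::max_element(values.begin(), values.end())));
}
```

For example, an input of {1, 2} with kernel {1, 1, 1} and stride 2 yields {1, 1, 3, 2, 2}: the two kernel copies overlap at index 2. A codec decoder stacks multi-channel versions of this operation to upsample from token rate to audio sample rate.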