Quick intro
xVASynth is an AI-based app for creating new voice lines using neural speech synthesis. The app loads models individually trained on voices from several Bethesda games. It gives users control over details such as the pitch and duration of individual letters, which lets them shape emotion and emphasis. To see it in action, watch this short intro/tutorial video, narrated by various supported voices:
Other games
Discord: https://discord.gg/nv7c6E2TzV
Patreon: https://www.patreon.com/xvasynth
Twitter: @dan_ruta
Note: To keep things fair, avoid using the tool in an offensive/explicit manner. Wherever you can, make it obvious in descriptions that the voice samples are generated, and not from the original voice actors. Any issues you cause with this are on you.
Introduction
xVASynth (or [SK]VASynth, for [Skyrim] voices) wraps around FastPitch [1] models trained on datasets compiled from in-game voice-acted lines. The strength of this model is the artistic control it gives over the generated audio. Once you generate the audio from your text prompt, you can adjust the pitch and durations using the editor:
The use of neural speech synthesis leads to natural-sounding voices, which is very difficult to achieve with more traditional methods that concatenate existing data. It also means new vocabulary can be generated, beyond what the voice actors have already read out.
There are several potential use cases for this tool:
- Creating voice lines for new quest mods
- Creating machinima
- Give voices to custom follower mods, or expand existing followers with more life
- Expand/edit/fix existing quests' stories/dialogue options
- Add variety to in-game voices by adding packs of voices from other games (e.g. add Fallout/Oblivion voices to Skyrim voice lists)
- Add new vocabulary to voices not already found in scripts (cannot easily be done with audio splicing)
- (Fallout 4) Add new player names to name lists
- Enhance vanilla quests by adding more lore/explanations in conversations
- Make English translations for mods voiced in other languages
- (Morrowind) Add voice to unvoiced areas of the game
- Change the voice used by a character (if you believe a different voice would suit them better)
- memes
- probably more...
Installation
You may need to install the Microsoft Visual C++ Redistributable if you don't already have it.
The tool is not tied to the game files, so it can go anywhere. To make it compatible with Vortex however, I've placed it in the Data folder. I'd recommend installing the files manually, until I figure out the Vortex paths.
Download the main file, and place its contents wherever you want. Launch the app by double clicking the xVASynth.exe file (you may want to create a shortcut to this somewhere convenient). To install individual voice models, place the contents of the downloaded zip files (the "resources" folder) into the same folder as the xVASynth.exe file, where there is already a "resources" folder - like you would install a texture mod into the game folder.
To confirm, you should see 3 files (a .json, a .pt, and a .wav file) all found in <your xVASynth install directory>/resources/app/models/<game>/ (where <game> is skyrim, for models on this page).
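As a rough way to run the check described above, the following Python sketch looks in the models folder for at least one file of each expected type. The function name `missing_model_extensions` and the example install path are hypothetical, and the check only looks at file extensions, not at whether the files are valid.

```python
from pathlib import Path

# File types the install notes say should be present for a voice model.
EXPECTED_EXTENSIONS = (".json", ".pt", ".wav")


def missing_model_extensions(install_dir, game="skyrim"):
    """Return the expected extensions that have no file in the models folder.

    An empty list means at least one file of each kind was found.
    """
    models_dir = Path(install_dir) / "resources" / "app" / "models" / game
    if not models_dir.is_dir():
        # The folder is absent entirely, so every expected file type is missing.
        return list(EXPECTED_EXTENSIONS)
    found = {p.suffix.lower() for p in models_dir.iterdir() if p.is_file()}
    return [ext for ext in EXPECTED_EXTENSIONS if ext not in found]


if __name__ == "__main__":
    # Hypothetical install location; replace with your own.
    missing = missing_model_extensions(r"C:\Games\xVASynth")
    print("Missing file types:", missing or "none")
```

If the list printed is not empty, the zip contents most likely ended up in the wrong folder.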
Important: Make sure you click "Allow" if Windows asks you for permission to run the Python server. I use a local HTTP server to enable communication between the Python code (for the AI models) and the JavaScript code (for the Electron front-end). If there are any issues, check the server.log file (located next to xVASynth.exe) - there should be an error at the end, which I'll need to see to help with the issue.
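Since the error is expected at the end of server.log, a small Python sketch like the one below can pull out the last few lines to copy into a bug report. The helper name `tail_log`, the default of 20 lines, and the example path are assumptions for illustration; the log's actual format is not described here.

```python
from collections import deque
from pathlib import Path


def tail_log(log_path, num_lines=20):
    """Return the last `num_lines` lines of a text file as a list of strings."""
    path = Path(log_path)
    if not path.is_file():
        return []
    with path.open("r", encoding="utf-8", errors="replace") as handle:
        # deque with maxlen keeps only the final lines while reading the file once.
        return [line.rstrip("\n") for line in deque(handle, maxlen=num_lines)]


if __name__ == "__main__":
    # Hypothetical install location; server.log sits next to xVASynth.exe.
    for line in tail_log(r"C:\Games\xVASynth\server.log"):
        print(line)
```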
Check the faq channel on Discord if you run into any issues, or send me a message on the bugs channel if your issue is not fixed by its contents.
For Skyrim, the voices trained so far are as follows ("Track" the mod for updates):
- Astrid / Valerica
- FemaleCommander
- FemaleCondescending
- FemaleElfHaughty
- FemaleNord
- FemaleNordEvenToned
- FemaleSultry
- FemaleYoungEager
- MaleGuard
- Serana
- Adril Arano / Jiub
- MaleDunmer
- MaleSoldier
- MaleNord
- MaleNordCommander
- MaleYoungEager
- Miraak
Green represents good quality, yellow means OK quality (but might need a good deal of playing with the input to get something good), and red means currently unsuccessful.
Tips
- If you have an NVIDIA GPU, you can enable GPU inference for much faster speeds. You need to install the CUDA dependencies yourself first, as a prerequisite (tested with version 10.1; run "nvcc -V" in cmd to verify correct installation)
- If you are using only the CPU for inference, tick the "Quick and dirty" checkbox whenever you can (except for the final generation of the final audio). This option is much quicker, as it uses HiFi-GAN [2] instead of WaveGlow [3] for audio generation, but the quality is lower. Leaving it ticked also speeds up app start-up.
- Aim to generate audio between 1 and 5 seconds long. Audio much longer than this starts breaking down in quality. If you have a really long sentence, try to break it down into separate clauses and splice the audio clips together in a tool like Audacity
- You can right-click voices in the panel on the left to hear a quick preview of the voice
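For the tip about breaking long sentences into clauses, here is a minimal Python sketch of one way to prepare the text before pasting each piece into the app. The function name `split_into_clauses` and the choice of punctuation marks are assumptions; it does not estimate audio length, so you still need to judge whether each piece falls in the 1 to 5 second range.

```python
import re


def split_into_clauses(text):
    """Split text after clause-ending punctuation (. , ; : ! ?), keeping the marks."""
    parts = re.split(r"(?<=[.,;:!?])\s+", text.strip())
    return [part for part in parts if part]


if __name__ == "__main__":
    sentence = "Stop right there, criminal scum! Nobody breaks the law on my watch."
    for clause in split_into_clauses(sentence):
        print(clause)
```

Each resulting clause can then be generated separately and the clips joined afterwards in an audio editor.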
Downstream uses
If you make anything with this tool (mod or otherwise), let me know and I will include it here.
Stop right there criminal scum: https://www.nexusmods.com/skyrimspecialedition/mods/44181
A mod to add the infamous Oblivion line to Skyrim guards
Future Plans
Generally, the plan is to keep going down the fairly long list of voices remaining in Bethesda's games. I do plan on returning to some of the voices already released, to improve them with further training or re-training.
At the moment, male voices are lower quality than female ones, because the AI model pipeline is built around female speakers. I am working on improving that, but I've so far been unsuccessful with the hardware I'm using.
The FastPitch [1] model trains using output from the Tacotron2 [4] model as pre-processed input. The issue here is that the pre-trained Tacotron2 model is trained on the LJSpeech dataset [5], which contains a female speaker. When this Tacotron2 model is applied to female speakers, it works quite well, but applying it to male speakers mostly fails, meaning FastPitch training is unsuccessful. Training FastPitch is relatively easy, but Tacotron2 is quite difficult and requires a lot of VRAM (which I do not have on my GPU) to work well.
There are quite a few voices left to train (across all games). You can track/vote on further progress of the models being trained on my Patreon page (which I'm hoping will help fund the new hardware, so I can make more/better models), or on the GitHub page, at https://github.com/DanRuta/xVA-Synth.
Support
The best support is using the tool, making something cool with it, and letting me know about it! Or spreading the word to anyone who may get some use/fun out of this. Join the Discord here, and let me know if you have any ideas/suggestions, show off something you made, or just chat about this:
https://discord.gg/nv7c6E2TzV
Training models for 150+ voices takes a very long time, and I'm running ~5 year old hardware. The v1.0 release took about 2.5 years (though ~2y of that was spent reading research papers and compiling datasets).
To get things moving along faster, I need to save money for some time, to invest in a new GPU with higher VRAM capacity (for quality) and speed. The previously mentioned issue with training Tacotron2 might also just need a GPU with more VRAM. I work on this out of personal interest, but if people wish to expedite this hardware investment, you can boost my (PhD) student budget through support on my Patreon, or direct donations. I've tried to provide some incentives at a few tiers, such as development updates, early access, and votes on the order of upcoming voices.
Special thanks:
- Rachel Wiles
- minermanb
- Billyro
- Baki Balcioglu
- Flipdark95
- Beto
- Harsh
- Pseudo Immortal
- TrueBlue
- My Best Friend Is A Squid
- Agito Rivers
- Thuggysmurf
- radbeetle
- Five More Minutes
- Ryan W
- Laura Almeida
- Alexandra Whitton
- Zelda Hadley
- Cookie
- Aluraine
- batteryjar
- Vahzah Vulom
- Eir
Adrian Łańcucki for FastPitch and the helpful discussions on GitHub.
All the amazing researchers behind the many tools and models I've used in creating this.
References
[1] FastPitch - https://arxiv.org/abs/2006.06873
[2] HiFi-GAN - https://arxiv.org/abs/2010.05646
[3] WaveGlow - https://arxiv.org/abs/1811.00002
[4] Tacotron2 - https://arxiv.org/abs/1712.05884
[5] LJSpeech - https://keithito.com/LJ-Speech-Dataset/
Changelog:
v1.0.5
- Bug fixes
v1.0.4
- Bug fixes
v1.0.3
- Improved the UI - reorganized the buttons to group the letter-specific tasks on the top bar, and sequence-wide settings on the bottom bar
- Added a "Reset letter" button
- Made it more clear which letter is being edited by highlighting it in red
- Added an option in the settings for auto-playing the audio when it's generated
- Fixed an issue where question marks in the input broke file saving
v1.0.2
- Fixed some filepath issues causing infinite loading
v1.0.1
- Added updates menu