VITS model gives bad results (training an italian tts model) #4017
UPDATE: that wasn't a phonemes problem. The training wasn't loading because I didn't use the Colab GPU. So the problem is all about the configuration, I guess? Now I set
I kept training the model and it has now done 33k steps. The avg_loss_1 curve on TensorBoard is converging to 25, but I'm noticing it's improving very slowly. Is there a way to use Colab's GPU efficiently? Am I doing the training right, given my configuration? How many steps should I run to get a good VITS model?
I stopped training the model because I thought the dataset might not be sufficient (~8 hours of speech). I'll try a bigger dataset such as MLS. Additionally, I came to a conclusion: since the TTS model generates mel spectrograms (not sure about that), I need to train an Italian vocoder model, but I don't really know how to do that for a specific language. Any comment is appreciated.
VITS outputs audio directly because it trains its own vocoder internally; you don't need to train a separate one.
You saved me a lot of time, thank you. By the way, I'm now building the transcripts for the MLS dataset, deleting all the files with a male voice (I actually want to train a female voice model), but the thing is:
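The filtering step described above can be sketched with the standard library alone. This is a minimal sketch, assuming MLS's usual layout: a `metainfo.txt` with pipe-separated columns whose second field is the speaker's gender, and a `transcripts.txt` whose lines start with a `<speaker>_<book>_<utterance>` ID followed by a tab and the text. Adjust the column positions to your actual copy of the dataset.

```python
# Sketch: keep only female-speaker lines from an MLS-style transcripts file.
# Assumes "SPEAKER | GENDER | ..." rows in metainfo.txt and
# "<speaker>_<book>_<utt>\t<text>" rows in transcripts.txt (verify locally).

def female_speakers(metainfo_lines):
    """Collect speaker IDs whose gender column is 'F'."""
    ids = set()
    for line in metainfo_lines:
        parts = [p.strip() for p in line.split("|")]
        if len(parts) >= 2 and parts[1].upper() == "F":
            ids.add(parts[0])
    return ids

def filter_transcripts(transcript_lines, keep_ids):
    """Yield transcript lines whose utterance ID belongs to a kept speaker."""
    for line in transcript_lines:
        utt_id = line.split("\t", 1)[0]
        speaker = utt_id.split("_", 1)[0]
        if speaker in keep_ids:
            yield line

# Tiny synthetic example (not real MLS data):
meta = ["1234 | F | train | 10.0", "5678 | M | train | 12.0"]
trans = ["1234_001_000\tciao mondo", "5678_002_000\tbuongiorno"]
kept = list(filter_transcripts(trans, female_speakers(meta)))
print(kept)  # only the line from speaker 1234 remains
```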
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.
Describe the bug
Hi everyone. I'm new to the world of ML, so I'm not used to training AI models.
I really want to create my own TTS model using Coqui's VITS trainer, so I've done a lot of research about it. I configured some dataset parameters and configuration functions and then started training. For the training I used almost 10 hours of audio spoken in Italian. After training I tried the model, but the result is not just bad, it's FAIRLY bad... The model doesn't even "speak" a language. Here is an example input sentence:
"input_text": "Oh, finalmente sei arrivato fin qui. Non è affatto comune che un semplice essere umano riesca a penetrare così profondamente nella mia dimora. Scarlet Devil Mansion non è un posto per i deboli di cuore, lo sapevi?"
(I do not recommend listening to the audio at full volume.)
generated_audio.mp4
The voice in the audio actually comes from an RVC model. I imported the model into a program that first runs TTS and then applies the weights of an RVC model to the generated audio. It's not an RVC problem, because I've used this program with the same RVC model and other TTS models (mostly in English and one in Italian) and they work well, especially the English ones.
To Reproduce
Here's my configuration:
Dataset config:
Dataset format:
Audio:
Characters:
General config:
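The config sections above came through empty in this copy of the issue. For orientation only, a minimal Coqui-TTS VITS configuration for an Italian LJSpeech-style dataset might look like the following sketch; the paths, batch sizes, and run name are placeholder assumptions, not the reporter's actual values:

```python
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig

# Hypothetical dataset location: metadata.csv in LJSpeech format.
dataset_config = BaseDatasetConfig(
    formatter="ljspeech",
    meta_file_train="metadata.csv",
    path="/content/dataset/",
    language="it",
)

config = VitsConfig(
    run_name="vits_italian",
    batch_size=16,                 # lower this on a 4 GB GPU
    eval_batch_size=8,
    epochs=1000,
    text_cleaner="multilingual_cleaners",
    use_phonemes=True,
    phoneme_language="it",         # espeak's Italian voice
    phoneme_cache_path="/content/phoneme_cache",
    output_path="/content/output",
    datasets=[dataset_config],
)
```

This is a configuration fragment rather than a runnable script; it would be passed to Coqui's `Trainer` together with the loaded samples and model.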
Expected behavior
No response
Logs
No response
Environment
- TTS version: 0.22.0
- Python version: 3.10.9
- OS: Windows
- CUDA version: 11.8
- GPU: GTX 1650 with 4GB of VRAM

All the libraries were installed via pip.
Additional context
Additionally, after a few days I tried to use espeak phonemes, but the trainer.fit() function gets stuck at the beginning with this output:
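A hang when phonemes are enabled is often worth cross-checking against whether an eSpeak binary is actually installed and on PATH, since the phonemizer shells out to it. A minimal standard-library check (the binary names are the common ones, `espeak-ng` and `espeak`; your system may differ):

```python
# Sketch: verify an eSpeak binary is reachable before enabling use_phonemes.
import shutil

def find_espeak():
    """Return the path of the first espeak/espeak-ng binary on PATH, else None."""
    for name in ("espeak-ng", "espeak"):
        path = shutil.which(name)
        if path:
            return path
    return None

print(find_espeak() or "espeak not found - install it or disable phonemes")
```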