Train or Fine Tune VITS on (theoretically) Any Language | Train Multi-Speaker Model | Train YourTTS

VITS Multispeaker English Training and Fine Tuning Notebook:

VITS Alternate Language Training and Fine Tuning Notebook:

YourTTS Training and Fine Tuning notebook:

Updated the YourTTS and VITS multi-speaker English-language notebooks. The new notebook is for training a VITS model with languages other than English.

In this one I take a look at alternate-language training of a VITS model using Coqui TTS on Google Colab. I trained a Spanish-speaking model on mostly-blind sample data. I don't speak Spanish, so I can't evaluate it, but it started sounding pretty good for what it was.

Then I review some of the changes and differences in the multi-speaker VITS notebook and the YourTTS notebook.

Other videos:

RTFM:
Comments

Woah! It's really really useful for Spanish training. Thank you!

blakusp

In the YourTTS paper, they train for 140k steps and then fine-tune for 50k steps with SCL enabled. Not sure if you are doing this also.

Making multi-language models with YourTTS is probably the only thing you are missing, and you are guaranteed to have every person using Coqui on you, since the documentation is rather lacking, to say the least.

Even with lots of technical knowledge it was still a struggle setting this up before I found your notebooks.
Seriously, thank you for your efforts.

DestinyHax_YT

Hi, very nice video. Does anyone know if there is a version of YourTTS that works well in Spanish? The Coqui TTS model seems to accept only English, French, and Portuguese.

javierdiez

Hi,

If I am new to training models and what you showed is too complicated, where do I need to start in order to understand what you are describing in this video?

ŁukaszMadajczyk

Does anyone have an issue with one of the last steps (Run trainer)? It keeps giving me the error: TypeError: object of type 'NoneType' has no len(). I'm running a single speaker in the Czech language. I've set everything up for Czech (cs), but this step will not work no matter what I try.
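(Editor's note, a hedged guess: a "NoneType has no len()" at the Run trainer step often means the dataset loader produced no samples, e.g. because metadata.csv lines do not match the formatter's expected `id|text` layout. A quick stdlib check, with a function name of my own invention, might be:)

```python
def check_metadata(path, sep="|", min_fields=2):
    """Return line numbers in a metadata file whose field count is too low.

    The ljspeech formatter expects at least 'id|text' per line; lines that
    split into fewer fields are a common cause of empty/None sample lists.
    """
    bad = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if len(line.rstrip("\n").split(sep)) < min_fields:
                bad.append(lineno)
    return bad
```

Running it over your metadata.csv before launching the trainer shows whether any lines will be rejected.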

TheRonoxcz

Hello. I am trying to create a TTS with a Japanese voice, referring to your wonderful video. I've heard a lot about RVC, but I don't know much about VITS. Is it possible to make a TTS with a Japanese voice using the method shown in the video? (I don't even know what a pre-trained model means.) Thanks!

Jeaho

How can I solve this error when installing in Spanish?
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.12.0 requires numpy<1.24,>=1.22, but you have numpy 1.21.6 which is incompatible.
tensorflow 2.12.0 requires protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.20.3, but you have protobuf 3.19.6 which is incompatible.
panel 0.14.4 requires bokeh<2.5.0,>=2.4.0, but you have bokeh 1.4.0 which is incompatible.
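(Editor's note: the first conflict is self-describing — tensorflow 2.12.0 wants numpy in the range [1.22, 1.24) but 1.21.6 is installed, so pinning into that range, e.g. `pip install "numpy>=1.22,<1.24"`, usually quiets it; whether the notebook still runs afterwards depends on what pinned 1.21.6 in the first place. A tiny stdlib sketch of the comparison the resolver is doing:)

```python
def parse(version):
    # "1.21.6" -> (1, 21, 6); enough for plain numeric version strings
    return tuple(int(part) for part in version.split("."))

def in_range(version, floor, ceiling):
    # Models a requirement of the form: floor <= version < ceiling
    return parse(floor) <= parse(version) < parse(ceiling)

print(in_range("1.21.6", "1.22", "1.24"))  # False -> the reported conflict
print(in_range("1.23.5", "1.22", "1.24"))  # True  -> a compatible pin
```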

jeilyamv

Could you please make a video for tamil text to speech with my own voice.

cready

How can I fix the following error?


@nano
ModuleNotFoundError Traceback (most recent call last)
in <cell line: 1>()
----> 1 from transformers import WhisperProcessor,
2 options = dict(language=whisper_lang, beam_size=5, best_of=5)
3 transcribe_options = dict(task="transcribe", **options)
4

ModuleNotFoundError: No module named 'transformers'

tiemposrevelados

What about the text tokenizer? Shouldn't it be separate for different languages?
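(Editor's note: broadly yes — character-based models need a symbol table covering the target language's alphabet, and phoneme-based ones need the right espeak language code. A minimal, purely illustrative character-tokenizer sketch — this is not Coqui's actual class — shows why an English-only symbol set silently fails on other alphabets:)

```python
class CharTokenizer:
    """Toy character-level tokenizer; symbols must cover the language."""

    def __init__(self, symbols):
        self.sym_to_id = {s: i for i, s in enumerate(symbols)}

    def encode(self, text):
        # Characters outside the table are dropped here; a real pipeline
        # should warn instead, since silent drops degrade pronunciation.
        return [self.sym_to_id[c] for c in text if c in self.sym_to_id]

# ASCII letters plus Spanish accented vowels, n-tilde, u-diaeresis, space
spanish = CharTokenizer(list("abcdefghijklmnopqrstuvwxyzáéíóúñü "))
print(spanish.encode("año"))  # [0, 31, 14] -- 'ñ' is in the table
```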

pranilpatil

Hello, thank you for your video. I need help... I used the alternate language training notebook and only edited the dataset formatter (ljspeech) and the phoneme language (Bulgarian). When I try to synthesize with the model I get an error. I did not run the processing options because my dataset is already processed, and I did not run TensorFlow.

tts --text "Първите обитатели на териториите са били хомо сапиенс." \
--model_path \
--config_path \
--out_path output.wav

Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.7/bin/tts", line 11, in <module>
load_entry_point('TTS', 'console_scripts', 'tts')()
File "/Users/dennis/Desktop/AI/TTS/TTS/bin/synthesize.py", line 439, in main
reference_speaker_name=args.reference_speaker_idx,
File "/Users/dennis/Desktop/AI/TTS/TTS/utils/synthesizer.py", line 384, in tts
language_id=language_id,
File "/Users/dennis/Desktop/AI/TTS/TTS/tts/utils/synthesis.py", line 220, in synthesis
language_id=language_id,
File "/Users/dennis/Desktop/AI/TTS/TTS/tts/utils/synthesis.py", line 58, in run_model_torch
"language_ids": language_id,
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/Users/dennis/Desktop/AI/TTS/TTS/tts/models/vits.py", line 1161, in inference
o = self.waveform_decoder((z * y_mask)[:, :, : self.max_inference_len], g=g)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/Users/dennis/Desktop/AI/TTS/TTS/vocoder/models/hifigan_generator.py", line 250, in forward
o = o + self.cond_layer(g)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 313, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 310, in _conv_forward
self.padding, self.dilation, self.groups)
TypeError: conv1d() received an invalid combination of arguments - got (NoneType, Parameter, Parameter, tuple, tuple, tuple, int), but expected one of:
* (Tensor input, Tensor weight, Tensor bias, tuple of ints stride, tuple of ints padding, tuple of ints dilation, int groups)
didn't match because some of the arguments have invalid types: (NoneType, Parameter, Parameter, tuple, tuple, tuple, int)
* (Tensor input, Tensor weight, Tensor bias, tuple of ints stride, str padding, tuple of ints dilation, int groups)
didn't match because some of the arguments have invalid types: (NoneType, Parameter, Parameter, tuple, tuple, tuple, int)
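(Editor's note, hedged: the conv1d call fails because the conditioning tensor `g` reaching the vocoder is None. That usually means the checkpoint was trained as multi-speaker but the CLI was called without a speaker, so passing `--speaker_idx` — or `--speaker_wav` for YourTTS — may help. A sketch of a complete invocation; the paths and speaker name below are hypothetical placeholders, not values from the original post:)

```shell
# Hypothetical paths and speaker name -- substitute your own run's files.
tts --text "Първите обитатели на териториите са били хомо сапиенс." \
    --model_path ./bg_run/best_model.pth \
    --config_path ./bg_run/config.json \
    --speaker_idx "speaker_00" \
    --out_path output.wav
```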

onenoone

Hi, do you think it is possible to use this tutorial for fine-tuning in Latin American Spanish? Thanks

paulaortegariera

Hello all, I just want to ask: if someone wants to train on a regional language such as Telugu, Bengali, or especially Hindi, where can they get the pretrained model weights?

prateekkumarsingh

Hello, my English is very poor, so I don't understand much. Where do we enter the audio files? I would be glad if you made a flashy video.

okru

Have you tried fine-tuning using the whole VCTK dataset plus a new speaker?

jazza

How would I go about increasing the kHz (higher than 16k)? Since it sounds, I would say, "too bad".
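(Editor's note: the sample rate is fixed at training time by the audio config, e.g. `sample_rate: 16000`. Resampling 16 kHz output upward only interpolates and cannot recover high-frequency detail, so the usual route is retraining or fine-tuning on 22.05 kHz or 44.1 kHz data. A trivial sketch of the relationship:)

```python
def n_samples(duration_s, sample_rate_hz):
    # A clip's sample count is duration times rate: a higher-rate model
    # must generate more samples per second, and must be trained on
    # audio recorded (or resampled) at that same rate.
    return int(duration_s * sample_rate_hz)

print(n_samples(1.0, 16000))  # 16000 samples per second of audio
print(n_samples(1.0, 22050))  # 22050 samples per second of audio
```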

julin

How can I solve this error?

3 from trainer import Trainer, TrainerArgs
4
----> 5 from import BaseDatasetConfig
6 from TTS.tts.configs.vits_config import VitsConfig
7 from TTS.tts.datasets import load_tts_samples

ModuleNotFoundError: No module named 'TTS.tts'

jeilyamv

Please make a practical video on Hindi-language voice cloning for multiple speakers.

Can you clarify one thing for me: do we have to repeat the whole process for multi-speaker, or what do we have to do? I'm not getting it correctly.

I hope you will help me understand.

I am glad for this video

Thanks dear🥰🥰

shailendrarathore

Hello dear nanonomad, please make a practical video on voice cloning a specific person in the Hindi language.

I'm getting some issues.

I've tried nearly 38 times but get no output.

Also, let me know how to fine-tune the first dirty output so that it sounds natural.

With regards to Mr. nanonomad.

Please help and give it a try 🙏🙏

shailendrarathore

Hi, we are trying to fine-tune with Hindi audio. Each epoch takes approximately 2.5 hours. Can you share the machine configuration used to create this, and the time it took to fine-tune the model? Thanks!

AvinashTulasi