I'm trying to understand the GTA part of your paper, which seems to have a huge influence, and I'm unsure whether I understood it correctly. This much I understood: you have two networks, one which maps a source speaker mel spectrogram, a target speaker mel spectrogram, and the transcription to a transformed spectrogram, and a vocoder which maps the transformed spectrogram to a waveform.
You first train the first network. Then, instead of converting the waveform to a mel spectrogram and using that as input to train the vocoder, you pass the audio through your proposed network and use its output as the input for vocoder training. Is that correct?
Hi.
Yes, as you understood: we feed the same mel spectrogram as both the source and the target into the first network, which then reconstructs the original mel, since this is the same situation as during training.
We then use that reconstructed mel spectrogram as the input to train the vocoder.
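To make the data flow concrete, here is a minimal sketch of the GTA pair generation described above. All function names are hypothetical stand-ins (the real conversion network and mel extraction come from the paper's code); the stubs only mimic the shapes and roles of the real components:

```python
import numpy as np

N_MELS, HOP = 80, 256  # assumed feature settings, not the paper's exact config

def compute_mel(wav):
    """Stand-in for real mel extraction: returns one frame per hop."""
    n_frames = len(wav) // HOP
    rng = np.random.default_rng(0)  # deterministic dummy features
    return rng.standard_normal((n_frames, N_MELS))

def conversion_model(src_mel, tgt_mel, text):
    """Stub for the trained conversion network. With src == tgt it acts
    as an autoencoder: the output is a reconstruction of the input mel
    (here simulated as the input plus a small reconstruction error)."""
    assert src_mel.shape == tgt_mel.shape
    return src_mel + 0.01 * np.ones_like(src_mel)

def make_gta_pair(wav, text):
    """One GTA training pair for the vocoder: the target is the original
    waveform, but the input is the mel *reconstructed* by the conversion
    network, not the ground-truth mel computed from the audio."""
    mel = compute_mel(wav)
    # Same mel as source and target -> the network reconstructs it.
    gta_mel = conversion_model(src_mel=mel, tgt_mel=mel, text=text)
    return gta_mel, wav

wav = np.zeros(HOP * 100)  # dummy 100-frame utterance
gta_mel, target_wav = make_gta_pair(wav, text="hello")
print(gta_mel.shape)  # (100, 80)
```

The point of training on `gta_mel` rather than the ground-truth mel is that the vocoder then sees, at training time, the same slightly imperfect spectrograms it will receive from the conversion network at inference time.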
Thank you :)