Adapting TTS models For New Speakers using Transfer Learning

Paarth Neekhara¹, Jason Li², Boris Ginsburg²

¹ University of California San Diego
² NVIDIA

[Code and Notebook for TTS finetuning]

We present sound examples for the experiments in our paper, Adapting TTS models For New Speakers using Transfer Learning. In this work, we adapt a single-speaker TTS system to new speakers using only a few minutes of training data. The baseline TTS model is trained on speaker 8051 (Female) of the HiFiTTS dataset and is adapted to speakers 92 (Female) and 6097 (Male) using two finetuning techniques. We first present the original speaker's audio samples, followed by the synthesis results for the two target speakers.

Original Speaker Samples

The original TTS model is trained on more than 27 hours of paired speech and text from speaker 8051. The spectrogram-synthesis model is FastPitch with a learnable alignment module, and the vocoder is HiFiGAN. The real validation samples, along with audio synthesized from the validation text, are presented in the table below.

Speaker | Real Validation Sample | Synthesized (>27 hrs, Train from Scratch)
[Audio samples for speaker 8051]
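
For readers who want to try this pipeline, the following minimal sketch assumes the NVIDIA NeMo toolkit, which provides FastPitch and HiFiGAN implementations. The checkpoint names refer to NeMo's public English models, used here for illustration; they are not the exact checkpoints trained in the paper.

    import soundfile as sf
    import torch
    from nemo.collections.tts.models import FastPitchModel, HifiGanModel

    # Load pretrained checkpoints. The names below are NeMo's public
    # English models, shown for illustration; the paper's own checkpoints
    # (trained on HiFiTTS speaker 8051) are not publicly named here.
    spec_model = FastPitchModel.from_pretrained("tts_en_fastpitch").eval()
    vocoder = HifiGanModel.from_pretrained("tts_hifigan").eval()

    with torch.no_grad():
        # Text -> tokens -> mel spectrogram -> waveform.
        tokens = spec_model.parse("The quick brown fox jumps over the lazy dog.")
        spectrogram = spec_model.generate_spectrogram(tokens=tokens)
        audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

    # The sample rate depends on the checkpoint (22.05 kHz for these models).
    sf.write("sample.wav", audio.squeeze().cpu().numpy(), samplerate=22050)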

Direct Finetuning

In this finetuning approach, we finetune all parameters of the pre-trained TTS models directly on the new speaker's data. Finetuning the spectrogram-synthesis model requires paired text and speech from the new speaker, while the vocoder requires only the speaker's speech recordings, from which the spectrogram and waveform pairs are generated. The real validation samples, along with audio synthesized from the validation text, are presented in the table below.

Target Speaker | Real Validation Sample | Synthesized (1 min Finetuning) | Synthesized (5 min Finetuning) | Synthesized (30 min Finetuning) | Synthesized (60 min Finetuning) | Synthesized (>27 hrs, Train from Scratch)
[Audio samples for target speakers 92 and 6097]
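
Conceptually, direct finetuning resumes training of every model parameter from the pretrained checkpoint, usually with a reduced learning rate. The sketch below illustrates the shape of that loop in plain PyTorch; the toy network and random tensors are stand-ins for FastPitch and the new speaker's data, not the paper's actual training code.

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset

    # Toy stand-in for the pretrained spectrogram-synthesis model; in
    # practice the pretrained FastPitch checkpoint would be restored here.
    model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80))

    # Stand-in for the new speaker's (input, target-spectrogram) pairs.
    new_speaker_data = TensorDataset(torch.randn(64, 80), torch.randn(64, 80))
    loader = DataLoader(new_speaker_data, batch_size=16, shuffle=True)

    # All parameters remain trainable; a small learning rate limits drift
    # from the pretrained weights when the new speaker's data is scarce.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    model.train()
    for epoch in range(10):
        for inputs, targets in loader:
            optimizer.zero_grad()
            loss = nn.functional.mse_loss(model(inputs), targets)
            loss.backward()
            optimizer.step()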

Mixed Finetuning

Direct finetuning can result in overfitting or catastrophic forgetting when the amount of training data for the new speaker is very limited. To address this challenge, we explore another transfer learning method in which we mix the original speaker's data with the new speaker's data during finetuning. In this setting, we assume that we have ample training samples for the original speaker, while the number of samples for the new speaker is limited. We create a data-loading pipeline that samples an equal number of examples from the original and the new speaker in each mini-batch. We add a speaker embedding layer to the spectrogram-synthesis module to make it suitable for multi-speaker training. The vocoder architecture is unchanged, and the model is finetuned on spectrogram and audio pairs from both speakers. The real validation samples, along with audio synthesized from the validation text, are presented in the table below.

Target Speaker | Real Validation Sample | Synthesized (1 min Finetuning) | Synthesized (5 min Finetuning) | Synthesized (30 min Finetuning) | Synthesized (60 min Finetuning) | Synthesized (>27 hrs, Train from Scratch)
[Audio samples for target speakers 92 and 6097]
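
The balanced batching described above can be sketched in plain PyTorch as follows. The datasets are random stand-ins for the two speakers' training pairs, and the 384-dimensional speaker embedding is an assumption for illustration, not the paper's exact configuration.

    import itertools
    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset

    # Stand-ins for the two speakers' data: the original speaker's set is
    # large, the new speaker's set is small. Labels are speaker IDs.
    original = TensorDataset(torch.randn(1000, 80), torch.zeros(1000, dtype=torch.long))
    new_spk = TensorDataset(torch.randn(24, 80), torch.ones(24, dtype=torch.long))

    half_batch = 8  # each speaker contributes half of every mini-batch
    orig_loader = DataLoader(original, batch_size=half_batch, shuffle=True)
    new_loader = DataLoader(new_spk, batch_size=half_batch, shuffle=True)

    # Speaker embedding layer added to the spectrogram-synthesis model so
    # it can condition on speaker identity (dimension 384 is an assumption).
    speaker_emb = nn.Embedding(num_embeddings=2, embedding_dim=384)

    # The small new-speaker loader is cycled so it keeps pace with the
    # large original-speaker loader across the epoch.
    for (x_orig, id_orig), (x_new, id_new) in zip(orig_loader, itertools.cycle(new_loader)):
        features = torch.cat([x_orig, x_new])       # balanced mixed batch
        speaker_ids = torch.cat([id_orig, id_new])  # 0 = original, 1 = new
        cond = speaker_emb(speaker_ids)             # per-example speaker vector
        # `features` and `cond` would feed the synthesis model's training step.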