ACE-VC Training and Inference Instructions

[Paper] [Project Page]


Setup

The ACE-VC code is now available in the main NeMo repository. To get started, clone the repository and install NeMo from source.


Step 1: Train the Speech Representation Extractor (SRE)

To train the SRE model, use the examples/tts/ssl_tts.py script as follows:

python examples/tts/ssl_tts.py \
model.train_ds.manifest_speaker_verification_fp=PATH_TO_SV_TRAIN.json \
model.train_ds.manifest_content_fp=PATH_TO_LIBRI_TTS_TRAIN.json \
model.train_ds.sup_data_path=PATH_TO_STORE_TRAIN_SUPDATA \
model.validation_ds.manifest_speaker_verification_fp=PATH_TO_SV_VAL.json \
model.validation_ds.manifest_content_fp=PATH_TO_LIBRI_TTS_VAL.json \
model.validation_ds.sup_data_path=PATH_TO_STORE_VAL_SUPDATA \
exp_manager.exp_dir=PATH_TO_EXPERIMENT_DIR \
exp_manager.name=EXP_NAME \
model.train_ds.batch_size_content=16 \
model.train_ds.batch_size_sv=64 \
model.train_ds.num_workers_sv=12 \
model.train_ds.num_workers_content=12 \
model.validation_ds.batch_size_content=16 \
model.validation_ds.batch_size_sv=64 \
trainer.devices=-1 # number of GPUs; -1 uses all available GPUs

The above script loads a pretrained Conformer-SSL model and then fine-tunes it as the SRE in a multi-task manner, with one head for speaker verification and one for content (text) recognition. Two sets of manifest files are required, one for each task.
To reproduce the results in the paper, train the speaker verification head on the VoxCeleb2 dataset (~6000 speakers). The speaker labels in the speaker verification manifests should be integers from 0 to 5993. The text field can be set to "dummy", since text is ignored when training the speaker verification head. Entries in PATH_TO_SV_TRAIN.json and PATH_TO_SV_VAL.json should look like:

{"audio_filepath": "VoxCelebData/dev/aac/id02311/9I73r0AF300/00097.wav", "offset": 0, "duration": 6.08, "speaker": 5, "text": "dummy"}
{"audio_filepath": "VoxCelebData/dev/aac/id07352/uUO_-H6_nNc/00217.wav", "offset": 0, "duration": 8.40, "speaker": 13, "text": "dummy"}
{"audio_filepath": "VoxCelebData/dev/aac/id06771/mk-jzpaOQqo/00356.wav", "offset": 0, "duration": 9.47, "speaker": 3, "text": "dummy"}
...
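The entries above can be generated with a short helper along the following lines. This is only an illustrative sketch (the directory layout, output path, and use of soundfile for durations are assumptions, not part of the NeMo scripts); it maps the VoxCeleb2 speaker folders to contiguous integer labels starting at 0:

import json
from pathlib import Path

import soundfile as sf  # used only to read durations

voxceleb_root = Path("VoxCelebData/dev/aac")  # hypothetical location of VoxCeleb2 audio
out_manifest = "PATH_TO_SV_TRAIN.json"        # placeholder name from the command above

# Map VoxCeleb2 speaker folders (id00012, id00015, ...) to contiguous labels 0..N-1
speaker_dirs = sorted(d.name for d in voxceleb_root.iterdir() if d.is_dir())
speaker_to_idx = {spk: idx for idx, spk in enumerate(speaker_dirs)}

with open(out_manifest, "w") as fout:
    # Assumes the original m4a files have already been converted to wav,
    # as in the example entries above.
    for wav_path in voxceleb_root.rglob("*.wav"):
        speaker = wav_path.relative_to(voxceleb_root).parts[0]
        entry = {
            "audio_filepath": str(wav_path),
            "offset": 0,
            "duration": round(sf.info(str(wav_path)).duration, 2),
            "speaker": speaker_to_idx[speaker],
            "text": "dummy",  # text is ignored for the speaker verification head
        }
        fout.write(json.dumps(entry) + "\n")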

The content head is trained on the LibriTTS train-clean-360 subset. Entries in the PATH_TO_LIBRI_TTS_TRAIN.json and PATH_TO_LIBRI_TTS_VAL.json should look like:

{"audio_filepath": "LibriTTS/train-clean-360/2204/131732/2204_131732_000007_000003.wav", "text": "When I came into the box, the orchestra played the 'Star Spangled Banner,' and all the people in the house arose; whereupon I was very much embarrassed.", "speaker_id": "2204", "duration": 7.82, "speaker": 177}
{"audio_filepath": "LibriTTS/train-clean-360/6258/49755/6258_49755_000029_000011.wav", "text": "He rode to Southampton, that he might find a ship equipped for sea, and so came to Barfleur.", "speaker_id": "6258", "duration": 5.36, "speaker": 591}
{"audio_filepath": "LibriTTS/train-clean-360/6458/61323/6458_61323_000080_000000.wav", "text": "Give me a draught from thy palms, O Finn, Son of my king for my succour, For my life and my dwelling.", "speaker_id": "6458", "duration": 6.12, "speaker": 616}
...

We use around 500 files each in the validation sets for the speaker verification and content heads. We train the Conformer SRE model for 37 epochs.
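If you are preparing the LibriTTS manifests yourself, a script along these lines can produce entries in the format above. It is a sketch under assumptions (the LibriTTS root path, the use of the per-utterance .normalized.txt transcripts, and a random hold-out of roughly 500 validation entries are illustrative choices, not prescribed by NeMo):

import json
import random
from pathlib import Path

import soundfile as sf

libritts_root = Path("LibriTTS/train-clean-360")  # hypothetical location
num_val = 500                                     # roughly matches the setup above

# Map LibriTTS speaker directories (2204, 6258, ...) to contiguous labels 0..N-1
speakers = sorted(d.name for d in libritts_root.iterdir() if d.is_dir())
speaker_to_idx = {spk: idx for idx, spk in enumerate(speakers)}

entries = []
for wav_path in libritts_root.rglob("*.wav"):
    txt_path = wav_path.with_suffix(".normalized.txt")  # LibriTTS per-utterance transcript
    if not txt_path.exists():
        continue
    speaker_id = wav_path.relative_to(libritts_root).parts[0]
    entries.append({
        "audio_filepath": str(wav_path),
        "text": txt_path.read_text().strip(),
        "speaker_id": speaker_id,
        "duration": round(sf.info(str(wav_path)).duration, 2),
        "speaker": speaker_to_idx[speaker_id],
    })

# Shuffle once and hold out a small validation set
random.seed(0)
random.shuffle(entries)
splits = {"PATH_TO_LIBRI_TTS_VAL.json": entries[:num_val],
          "PATH_TO_LIBRI_TTS_TRAIN.json": entries[num_val:]}
for path, subset in splits.items():
    with open(path, "w") as fout:
        for entry in subset:
            fout.write(json.dumps(entry) + "\n")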


Step 2: Generate Supplementary Data (Embeddings) for Training the Synthesizer

Generate supplementary data for training the synthesizer as follows:

python scripts/ssl_tts/make_supdata.py \
--ssl_model_ckpt_path PATH_TO_TRAINED_SRE_CKPT_FROM_STEP1.ckpt \
--manifest_paths "PATH_TO_LIBRI_TTS_TRAIN.json,PATH_TO_LIBRI_TTS_VAL.json" \
--sup_data_dir PATH_TO_DIR_TO_SAVE_THE_EMBEDDINGS \
--use_unique_tokens 1 # set to 1 to allow duration control

The above script first extracts the content and speaker embeddings and saves them in PATH_TO_DIR_TO_SAVE_THE_EMBEDDINGS. It then extracts pitch contours using multiprocessing. PATH_TO_LIBRI_TTS_TRAIN.json and PATH_TO_LIBRI_TTS_VAL.json should have the same format as in Step 1.
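The pitch_mean and pitch_std arguments used in Step 3 are statistics of the training data's pitch; the values shown there correspond to the LibriTTS setup described in this guide. If you train on a different dataset, a rough estimate can be computed directly from the audio, for example with librosa's pyin (a minimal sketch; this is not the extractor used by make_supdata.py, and a subset of the manifest is usually enough):

import json
import random

import librosa
import numpy as np

manifest_path = "PATH_TO_LIBRI_TTS_TRAIN.json"  # content manifest from Step 1

with open(manifest_path) as f:
    lines = f.readlines()

random.seed(0)
all_f0 = []
for line in random.sample(lines, k=min(500, len(lines))):  # a subset is enough for an estimate
    entry = json.loads(line)
    audio, sr = librosa.load(entry["audio_filepath"], sr=None)
    f0, voiced_flag, _ = librosa.pyin(
        audio,
        fmin=librosa.note_to_hz("C2"),  # ~65 Hz
        fmax=librosa.note_to_hz("C7"),  # ~2093 Hz
        sr=sr,
    )
    all_f0.append(f0[voiced_flag])  # keep only voiced frames

all_f0 = np.concatenate(all_f0)
print(f"pitch_mean={all_f0.mean():.2f} pitch_std={all_f0.std():.2f}")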


Step 3: Train the FastPitch SSL synthesizer

The next step is to train the FastPitch SSL synthesizer, which converts the SSL representations into mel-spectrograms. Train the FastPitch SSL model using the examples/tts/fastpitch_ssl.py script as follows:

python examples/tts/fastpitch_ssl.py \
pitch_mean=212.35 \
pitch_std=38.75 \
train_dataset=PATH_TO_TRAIN_MANIFEST.json \
validation_datasets=PATH_TO_VAL_MANIFEST.json \
hifi_ckpt_path=PATH_TO_TRAINED_HIFIGAN_VOCODER.ckpt \
model.train_ds.dataloader_params.batch_size=32 \
model.validation_ds.dataloader_params.batch_size=32 \
exp_manager.exp_dir=PATH_TO_EXPDIR \
exp_manager.name=EXP_NAME \
sup_data_dir=PATH_TO_DIR_TO_SAVE_THE_EMBEDDINGS \
trainer.devices=-1

In the command above, sup_data_dir should point to the embeddings directory generated in Step 2, and trainer.devices=-1 uses all available GPUs. The script requires a HiFiGAN checkpoint trained on LibriTTS data so that audio can be synthesized from the generated mel-spectrograms and monitored during training. The HiFiGAN model can be trained with the default arguments of the examples/tts/hifigan.py script on LibriTTS data; we trained HiFiGAN for around 330 epochs on LibriTTS train-clean-360 for the experiments in our paper. We train the FastPitch synthesizer model for 604 epochs.



Step 4: Voice Conversion Inference

To voice convert a single file:

python scripts/ssl_tts/ssl_tts_vc.py \
--fastpitch_ckpt_path PATH_TO_TRAINED_CKPT_FROM_STEP3.ckpt \
--ssl_model_ckpt_path PATH_TO_TRAINED_CKPT_FROM_STEP1.ckpt \
--hifi_ckpt_path PATH_TO_TRAINED_HIFIGAN_CKPT.ckpt \
--source_audio_path PATH_TO_SOURCE_CONTENT_AUDIO.wav \
--target_audio_path PATH_TO_TARGET_SPEAKER_AUDIO.wav \
--out_path PATH_FOR_CONVERTED_AUDIO.wav \
--compute_pitch 1 \
--compute_duration 0 \
--use_unique_tokens 0

Set compute_pitch to 1 to let the script compute the pitch contour for the converted audio. Set compute_duration to 0 to keep the source durations, or to 1 to adapt them; use_unique_tokens should be 0 when compute_duration is 0 and 1 otherwise.

To voice convert multiple pairs, prepare a pairs.txt file in which each line contains the source audio path, one or more target speaker audio paths separated by commas, and the output path, with the three fields separated by semicolons:

/data/source1.wav;/data/target1_example1.wav,/data/target1_example2.wav;/data/output1.wav
/data/source2.wav;/data/target2_example1.wav,/data/target2_example2.wav;/data/output2.wav
...
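If you have many conversion pairs, a small helper like the following (the paths are just the placeholders from the example above) can write pairs.txt in this format:

# Hypothetical helper to write pairs.txt in the format shown above.
# Each line: <source wav>;<comma-separated target speaker wavs>;<output wav>
pairs = [
    ("/data/source1.wav",
     ["/data/target1_example1.wav", "/data/target1_example2.wav"],
     "/data/output1.wav"),
    ("/data/source2.wav",
     ["/data/target2_example1.wav", "/data/target2_example2.wav"],
     "/data/output2.wav"),
]

with open("pairs.txt", "w") as fout:
    for source, targets, output in pairs:
        fout.write(f"{source};{','.join(targets)};{output}\n")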

And then run the script as follows:

python scripts/ssl_tts/ssl_tts_vc.py \
--fastpitch_ckpt_path PATH_TO_TRAINED_CKPT_FROM_STEP3.ckpt \
--ssl_model_ckpt_path PATH_TO_TRAINED_CKPT_FROM_STEP1.ckpt \
--hifi_ckpt_path PATH_TO_TRAINED_HIFIGAN_CKPT.ckpt \
--source_target_out_pairs pairs.txt \
--compute_pitch 1 \
--compute_duration 0 \
--use_unique_tokens 0

The compute_pitch, compute_duration, and use_unique_tokens flags behave as described above.
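Once the batch run finishes, a quick check like this (reading the output paths back out of pairs.txt and inspecting them with soundfile) confirms that the converted files were written with sensible durations:

import soundfile as sf

# Report the duration and sample rate of each converted file listed in pairs.txt
with open("pairs.txt") as f:
    for line in f:
        source, targets, out_path = line.strip().split(";")
        info = sf.info(out_path)
        print(f"{out_path}: {info.duration:.2f} s at {info.samplerate} Hz")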



Citing our Work

@inproceedings{acevc2023,
  title={ACE-VC: Adaptive and Controllable Voice Conversion using Explicitly Disentangled Self-supervised Speech Representations},
  author={Hussain, S. and Neekhara, P. and Huang, J. and Li, J. and Ginsburg, B.},
  booktitle={ICASSP},
  year={2023},
  organization={IEEE}
}