We present audio examples for our paper, ACE-VC. To perform zero-shot voice conversion, our synthesis model combines the content embedding of a given source utterance with the speaker embedding of the target speaker; both embeddings are derived from our Speech Representation Extractor. ACE-VC can perform voice conversion in two modes: Adapt and Mimic.
To perform voice conversion for speakers not seen during training, we randomly select 10 male and 10 female speakers from the dev-clean subset of the LibriTTS dataset as our target speakers. Next, we choose 10 random source utterances from the remaining speakers and convert each of them to each of the 20 target speakers. We present a few audio examples for this experiment in the table below.
Conversion Type | Source Utterance | Target Speaker | ACE-VC (Adapt) | ACE-VC (Mimic) |
---|---|---|---|---|
Male to Female | Going back to camp I procured a light, and after hooping and hallowing for a long time, I heard another groan. This time much louder than before. | |||
Female to Female | When quite crisp, they are ready for use. | |||
Female to Male | Various dishes are frequently ornamented and garnished with its graceful leaves and these are sometimes boiled in soups. Although it is more usually confined to English cookery... | |||
Male to Male | Going back to camp I procured a light, and after hooping and hallowing for a long time, I heard another groan. This time much louder than before. | |||
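The unseen-speaker setup described above (10 male and 10 female target speakers, 10 source utterances from the remaining speakers, every source converted to every target) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' evaluation script: the function name `sample_vc_pairs`, the `speakers` and `utterances_by_speaker` dictionaries, and the fixed seed are all assumptions made for the example.

```python
import random


def sample_vc_pairs(speakers, utterances_by_speaker,
                    n_per_gender=10, n_sources=10, seed=0):
    """Build (source_speaker, source_utterance, target_speaker) triples.

    speakers: dict mapping speaker id -> "M" or "F" (hypothetical layout).
    utterances_by_speaker: dict mapping speaker id -> list of utterance ids.
    """
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    males = sorted(s for s, g in speakers.items() if g == "M")
    females = sorted(s for s, g in speakers.items() if g == "F")

    # Randomly pick the target speakers, balanced by gender.
    targets = rng.sample(males, n_per_gender) + rng.sample(females, n_per_gender)
    target_set = set(targets)

    # Source utterances are drawn only from the remaining speakers.
    pool = [(s, u)
            for s in sorted(speakers) if s not in target_set
            for u in utterances_by_speaker[s]]
    sources = rng.sample(pool, n_sources)

    # Convert every source utterance to every target speaker.
    return [(src, utt, tgt) for (src, utt) in sources for tgt in targets]
```

With the default arguments this yields 10 x 20 = 200 conversion pairs, and no source utterance comes from a target speaker.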
To perform voice conversion for seen speakers, we use the hold-out utterances of speakers seen during training. Similar to the unseen speaker scenario, we select 10 male and 10 female speakers as the target speakers and choose source utterances from other speakers.
Conversion Type | Source Utterance | Target Speaker | ACE-VC (Adapt) | ACE-VC (Mimic) |
---|---|---|---|---|
Male to Female | I am convinced that it is as natural for a human being to swim as it is for a duck. | |||
Female to Female | Here lived Las Casas, a priest who was the Indians' greatest champion in the early days and who was said to be the father of African slavery in the new world. | |||
Female to Male | Here lived Las Casas, a priest who was the Indians' greatest champion in the early days and who was said to be the father of African slavery in the new world. | |||
Female to Male | Shoot him through the right elbow if he makes one sour move. | |||
Male to Male | This is quite sudden said the scare crow. | |||
We present audio examples for the same pairs of source and target audio using different voice conversion techniques, including our own. We use the Adapt mode for our technique (ACE-VC). We produce audio examples for the other techniques using the voice conversion inference scripts provided in their respective GitHub repositories.
Conversion Type | Source Utterance | Target Speaker | MediumVC | S3PRL-VC | YourTTS | ACE-VC (Ours) |
---|---|---|---|---|---|---|
Male to Female | What then must be the state of the less known and more distant parts of the island. | |||||
Male to Female | The Sacred Lock of Hair. Reincarnation and the Converse of Spirits. | |||||
Female to Male | When quite crisp, they are ready for use. | |||||
Female to Male | Randall decided on leaving her | |||||
[Figure with two panels: Seen Speakers and Unseen Speakers]
We present audio examples where the source utterances come from expressive/emotional speech. We use the ADEPT dataset for these examples. The source utterances are taken from the expressive audio of the two speakers in the dataset, and their neutral utterances are used to derive the speaker embeddings. Neither the male nor the female speaker was seen during training.
Conversion Type | Source Utterance | Target Speaker | ACE-VC (Adapt) | ACE-VC (Mimic) |
---|---|---|---|---|
Male to Female | The leaves are changing colors. | |||
Male to Female | This report is due tomorrow. | |||
Female to Male | She is being fired. | |||
Female to Male | Look at that puppy | |||
The ACE-VC synthesizer allows control over the pace/duration of the synthesized utterances by changing the target duration for each time-step. We can slow down or speed up the overall speaking rate, or apply more fine-grained control. In the following table we present audio examples for speeding up and slowing down the synthesized utterance.
Conversion Type | Source Utterance | Target Speaker | Same Pace | Fast Pace (1.5X) | Slow Pace (0.7X) |
---|---|---|---|---|---|
Male to Female | Going back to camp I procured a light, and after hooping and hallowing for a long time, I heard another groan. This time much louder than before. | ||||
Female to Male | Various dishes are frequently ornamented and garnished with its graceful leaves and these are sometimes boiled in soups. Although it is more usually confined to English cookery... | ||||
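As a rough illustration of the pace control described above, the per-time-step target durations can be rescaled before synthesis. The sketch below is not the ACE-VC implementation: the function name `scale_durations`, the representation of durations as integer frame counts, the one-frame minimum, and the reading of the pace factor as a speaking-rate multiplier (so 1.5X shortens durations and 0.7X lengthens them) are all assumptions made for the example.

```python
def scale_durations(durations, pace=1.0):
    """Rescale per-time-step durations (assumed to be frame counts).

    pace may be a single factor for the whole utterance or a sequence
    with one factor per time-step for fine-grained control.
    A pace above 1 speeds speech up; a pace below 1 slows it down.
    """
    if isinstance(pace, (int, float)):
        pace = [pace] * len(durations)
    if len(pace) != len(durations):
        raise ValueError("need one pace factor per time-step")
    if any(p <= 0 for p in pace):
        raise ValueError("pace factors must be positive")
    # Dividing by the pace shortens durations when speaking faster.
    # Each step keeps at least one frame so that no content is dropped.
    return [max(1, round(d / p)) for d, p in zip(durations, pace)]


fast = scale_durations([3, 6, 9], pace=1.5)               # [2, 4, 6]
slow = scale_durations([7, 14], pace=0.7)                 # [10, 20]
mixed = scale_durations([4, 4, 4], pace=[1.0, 2.0, 0.5])  # [4, 2, 8]
```

Passing a per-step sequence of factors corresponds to the fine-grained control mentioned in the text, for example slowing down only one part of the utterance.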
The ACE-VC synthesizer allows control over the pitch contour (fundamental frequency) of the synthesized speech. We can apply fine-grained control over the modulation of the pitch contour or simply scale the whole contour by a constant factor. In the table below we present audio examples obtained by scaling the reference pitch contour by a factor.
Conversion Type | Source Utterance | Target Speaker | Same Pitch (1X) | Higher Pitch (3X) | Lower Pitch (0.5X) |
---|---|---|---|---|---|
Female to Male | Various dishes are frequently ornamented and garnished with its graceful leaves and these are sometimes boiled in soups. Although it is more usually confined to English cookery... | ||||
Male to Female | Going back to camp I procured a light, and after hooping and hallowing for a long time, I heard another groan. This time much louder than before. | ||||
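Scaling a pitch contour by a factor, as in the examples above, amounts to multiplying each frame's value by that factor. The following is a minimal sketch under stated assumptions, not the authors' code: the function name `scale_pitch` is hypothetical, and the contour is assumed to be a list of per-frame F0 values in which 0 marks an unvoiced frame. Whether ACE-VC scales a raw or a normalized contour is not stated on this page.

```python
def scale_pitch(f0, factor):
    """Scale a per-frame pitch contour.

    f0: per-frame fundamental frequency values; 0 is assumed to mark
        unvoiced frames, which stay at 0 after multiplication.
    factor: a single positive number, or one positive number per frame
        for fine-grained modulation of the contour.
    """
    if isinstance(factor, (int, float)):
        factor = [factor] * len(f0)
    if len(factor) != len(f0):
        raise ValueError("need one factor per frame")
    if any(k <= 0 for k in factor):
        raise ValueError("scale factors must be positive")
    return [f * k for f, k in zip(f0, factor)]


contour = [0.0, 100.0, 200.0]
lower = scale_pitch(contour, 0.5)   # [0.0, 50.0, 100.0]
higher = scale_pitch(contour, 3)    # [0.0, 300.0, 600.0]
```

A per-frame list of factors gives the fine-grained variant, for instance raising the pitch only toward the end of an utterance.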
@inproceedings{acevc2023,
title={ACE-VC: Adaptive and Controllable Voice Conversion using Explicitly Disentangled Self-supervised Speech Representations},
author={Hussain, S. and Neekhara, P. and Huang, J. and Li, J. and Ginsburg, B.},
booktitle={ICASSP},
year={2023},
organization={IEEE}
}