Audio demo: speaker adaptation based on phone-level speaker embeddings
Here we provide audio samples generated with different speaker embedding methods. The abbreviations are explained below:
- OracleVocode: original acoustic features resynthesized by the vocoder
- Xvec: uses an x-vector as the speaker embedding
- UttEmb: uses reference audio to obtain an utterance-level speaker embedding with a reference encoder
- Attentron: uses reference audio to obtain frame-level speaker embeddings with an attention-based reference encoder
- PhnEmb(proposed): uses a predictor to obtain phone-level speaker embeddings
- +Adapt: also updates the LSTM in the decoder
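
The methods above differ mainly in the granularity of the speaker embedding: one vector per utterance (Xvec, UttEmb) versus one vector per phone (PhnEmb). The sketch below only illustrates that difference in shape. It is not the implementation behind these samples: the function names, the toy values, and the choice of concatenation as the conditioning mechanism are all assumptions made for illustration.

```python
def broadcast_utterance_embedding(utt_emb, num_phones):
    """Repeat a single utterance-level embedding once per phone."""
    return [list(utt_emb) for _ in range(num_phones)]


def condition_on_speaker(phone_encodings, speaker_embs):
    """Concatenate one speaker embedding onto each phone encoding.

    Concatenation is an assumed conditioning mechanism, used here
    only to make the difference in granularity visible.
    """
    if len(phone_encodings) != len(speaker_embs):
        raise ValueError("need exactly one speaker embedding per phone")
    return [enc + emb for enc, emb in zip(phone_encodings, speaker_embs)]


# Toy text-encoder outputs for a 3-phone input (hypothetical values).
phone_encodings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

# Utterance level: the same speaker vector is attached to every phone.
utt_emb = [1.0, -1.0]
utt_conditioned = condition_on_speaker(
    phone_encodings, broadcast_utterance_embedding(utt_emb, len(phone_encodings))
)

# Phone level: each phone gets its own speaker vector (hypothetical values
# standing in for the output of a predictor).
phn_embs = [[1.0, -1.0], [0.9, -0.8], [1.1, -1.2]]
phn_conditioned = condition_on_speaker(phone_encodings, phn_embs)
```

In the utterance-level case the speaker part of every row is identical, while in the phone-level case it can vary from phone to phone.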
test speaker1 `KHW`

| Text | - | - | - |
| --- | --- | --- | --- |
| OracleVocode | | | |
| Xvec | | | |
| UttEmb | | | |
| Attentron | | | |
| PhnEmb(proposed) | | | |

test speaker1 `KHW` with adaptation

| Text | - | - | - |
| --- | --- | --- | --- |
| OracleVocode | | | |
| Xvec+Adapt | | | |
| Attentron+Adapt | | | |
| UttEmb+Adapt | | | |
| PhnEmb+Adapt(proposed) | | | |

test speaker2 `HJX`

| Text | - | - | - |
| --- | --- | --- | --- |
| OracleVocode | | | |
| Xvec | | | |
| UttEmb | | | |
| Attentron | | | |
| PhnEmb(proposed) | | | |

test speaker2 `HJX` with adaptation

| Text | - | - | - |
| --- | --- | --- | --- |
| OracleVocode | | | |
| Xvec+Adapt | | | |
| UttEmb+Adapt | | | |
| Attentron+Adapt | | | |
| PhnEmb+Adapt(proposed) | | | |