One-shot voice conversion with disentangled representations by leveraging phonetic posteriorgrams

Author(s)
Mohammadi, Seyed Hamidreza; Kim, Taehwan
Issued Date
2019-09
DOI
10.21437/Interspeech.2019-1798
URI
https://scholarworks.unist.ac.kr/handle/201301/79313
Citation
Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 704-708
Abstract
We propose a voice conversion model with disentangled representations that converts speech from an arbitrary source speaker to an arbitrary target speaker. Voice conversion is the task of converting a spoken utterance of a source speaker so that it sounds as if spoken by a target speaker. Most prior work requires knowing the source speaker, the target speaker, or both at training time, using either a parallel or a non-parallel corpus. Instead, we study voice conversion on non-parallel speech corpora in a one-shot learning setting: we convert arbitrary sentences from an arbitrary source speaker to a target speaker given only one or a few target-speaker training utterances. To achieve this, we propose to use disentangled representations of speaker identity and linguistic content. We use a recurrent neural network (RNN) encoder to obtain the speaker embedding and phonetic posteriorgrams as the linguistic content encoding, along with an RNN decoder to generate the converted utterances. Our model is simpler than prior approaches, requiring neither adversarial training nor a hierarchical design, and is therefore more efficient. In subjective tests, our approach achieved significantly better similarity scores than the baseline.
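
As a rough illustration of the architecture the abstract describes, the sketch below wires up an RNN speaker encoder, phonetic posteriorgrams (PPGs) as the linguistic features, and an RNN decoder conditioned on both. It is a minimal sketch under my own assumptions: the paper does not specify PyTorch, the module names (SpeakerEncoder, Decoder), or any of the dimensions used here, and the PPG extractor (a pretrained speaker-independent ASR model) is stubbed out with random inputs.

```python
# Hypothetical sketch of the encoder-decoder voice conversion model from the
# abstract. Framework choice, module names, and dimensions are assumptions;
# the paper specifies only an RNN speaker encoder, PPGs as linguistic
# features, and an RNN decoder that generates converted acoustic features.
import torch
import torch.nn as nn


class SpeakerEncoder(nn.Module):
    """RNN mapping a reference utterance (mel frames) to a speaker embedding."""

    def __init__(self, n_mels=80, hidden=256, emb_dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, mels):              # mels: (batch, frames, n_mels)
        _, h = self.rnn(mels)             # final hidden state: (1, batch, hidden)
        return self.proj(h.squeeze(0))    # speaker embedding: (batch, emb_dim)


class Decoder(nn.Module):
    """RNN converting PPG frames plus a speaker embedding into mel frames."""

    def __init__(self, n_phones=144, emb_dim=128, hidden=512, n_mels=80):
        super().__init__()
        self.rnn = nn.GRU(n_phones + emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_mels)

    def forward(self, ppg, spk_emb):      # ppg: (batch, frames, n_phones)
        # Broadcast the time-invariant speaker embedding to every frame, so
        # the decoder sees linguistic content and speaker identity jointly.
        spk = spk_emb.unsqueeze(1).expand(-1, ppg.size(1), -1)
        h, _ = self.rnn(torch.cat([ppg, spk], dim=-1))
        return self.out(h)                # converted mels: (batch, frames, n_mels)


# One-shot conversion: a single target-speaker utterance yields the speaker
# embedding; PPGs extracted from the source utterance (by a pretrained
# speaker-independent ASR model, not shown) carry the linguistic content.
enc, dec = SpeakerEncoder(), Decoder()
target_ref = torch.randn(1, 200, 80)               # one target-speaker utterance
source_ppg = torch.softmax(torch.randn(1, 350, 144), dim=-1)
converted = dec(source_ppg, enc(target_ref))       # (1, 350, 80)
```

Because the PPGs come from a speaker-independent recognizer, they are largely free of speaker identity, which is what lets a single conditioning vector from the encoder swap the voice without parallel data or adversarial training.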
Publisher
International Speech Communication Association
ISSN
2308-457X


Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.