File Download

There are no files associated with this item.

Related Researcher

Kim, Taehwan (김태환)


Full metadata record

dc.citation.conferencePlace: AU
dc.citation.conferencePlace: Graz
dc.citation.endPage: 708
dc.citation.startPage: 704
dc.citation.title: Annual Conference of the International Speech Communication Association
dc.contributor.author: Mohammadi, Seyed Hamidreza
dc.contributor.author: Kim, Taehwan
dc.date.accessioned: 2024-01-31T23:40:51Z
dc.date.available: 2024-01-31T23:40:51Z
dc.date.created: 2021-09-01
dc.date.issued: 2019-09
dc.description.abstract: We propose a voice conversion model that maps an arbitrary source speaker to an arbitrary target speaker using disentangled representations. Voice conversion is the task of converting a spoken utterance from the voice of a source speaker to that of a target speaker. Most prior work requires knowing the source speaker, the target speaker, or both at training time, using either a parallel or a non-parallel corpus. Instead, we study voice conversion on non-parallel speech corpora in a one-shot learning setting: we convert arbitrary sentences from an arbitrary source speaker to a target speaker given only one or a few training utterances of that target speaker. To achieve this, we propose to use disentangled representations of speaker identity and linguistic content. We use a recurrent neural network (RNN) encoder to produce the speaker embedding and phonetic posteriorgrams to encode the linguistic content, together with an RNN decoder that generates the converted utterance. Ours is a simpler model, without adversarial training or a hierarchical design, and is therefore more efficient. In subjective tests, our approach achieved significantly better similarity to the target speaker than the baseline. (A minimal architectural sketch follows this record.)
dc.identifier.bibliographicCitation: Annual Conference of the International Speech Communication Association, pp. 704-708
dc.identifier.doi: 10.21437/Interspeech.2019-1798
dc.identifier.issn: 2308-457X
dc.identifier.scopusid: 2-s2.0-85074730037
dc.identifier.uri: https://scholarworks.unist.ac.kr/handle/201301/79313
dc.language: English
dc.publisher: International Speech Communication Association
dc.title: One-shot voice conversion with disentangled representations by leveraging phonetic posteriorgrams
dc.type: Conference Paper
dc.date.conferenceDate: 2019-09-15
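
The abstract describes a three-part pipeline: an RNN speaker encoder, phonetic posteriorgrams (PPGs) for linguistic content, and an RNN decoder. Below is a minimal sketch of that architecture in PyTorch. It is not the authors' implementation; all dimensions (80 mel bins, 144-dimensional PPGs, a 128-dimensional speaker embedding) and layer choices are illustrative assumptions, and the PPG extractor and vocoder are omitted.

```python
# Minimal sketch (not the authors' code) of the architecture in the abstract:
# RNN speaker encoder + PPG linguistic features + RNN decoder.
# All dimensions and layer choices below are illustrative assumptions.
import torch
import torch.nn as nn


class SpeakerEncoder(nn.Module):
    """Summarizes a reference utterance (mel frames) into one speaker embedding."""

    def __init__(self, n_mels: int = 80, hidden: int = 256, emb_dim: int = 128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, mels: torch.Tensor) -> torch.Tensor:
        # mels: (batch, time, n_mels); the final hidden state serves as a
        # fixed-length encoding of speaker identity.
        _, h = self.rnn(mels)
        return self.proj(h[-1])  # (batch, emb_dim)


class Decoder(nn.Module):
    """Generates target-speaker mel frames from PPGs plus the speaker embedding."""

    def __init__(self, ppg_dim: int = 144, emb_dim: int = 128,
                 hidden: int = 512, n_mels: int = 80):
        super().__init__()
        self.rnn = nn.GRU(ppg_dim + emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_mels)

    def forward(self, ppg: torch.Tensor, spk: torch.Tensor) -> torch.Tensor:
        # Broadcast the speaker embedding across all PPG frames so every
        # decoding step sees both linguistic content and target identity.
        spk_seq = spk.unsqueeze(1).expand(-1, ppg.size(1), -1)
        h, _ = self.rnn(torch.cat([ppg, spk_seq], dim=-1))
        return self.out(h)  # (batch, time, n_mels)


# One-shot conversion: PPGs come from the *source* utterance (content), the
# speaker embedding from one or a few *target* utterances (identity).
encoder, decoder = SpeakerEncoder(), Decoder()
src_ppg = torch.randn(1, 200, 144)   # PPG frames of the source utterance
tgt_ref = torch.randn(1, 120, 80)    # a single reference utterance of the target
converted = decoder(src_ppg, encoder(tgt_ref))
print(converted.shape)               # torch.Size([1, 200, 80])
```

Because speaker identity enters only through the embedding and content only through speaker-independent PPGs, swapping the reference utterance is, under these assumptions, all that is needed to convert to a new, unseen speaker.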


Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.