

Full metadata record

DC Field Value Language
dc.citation.endPage 1619 -
dc.citation.number 5 -
dc.citation.startPage 1605 -
dc.citation.title IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE -
dc.citation.volume 43 -
dc.contributor.author Senocak, Arda -
dc.contributor.author Oh, Tae-Hyun -
dc.contributor.author Kim, Junsik -
dc.contributor.author Yang, Ming-Hsuan -
dc.contributor.author Kweon, In So -
dc.date.accessioned 2025-09-03T14:00:01Z -
dc.date.available 2025-09-03T14:00:01Z -
dc.date.created 2025-09-03 -
dc.date.issued 2021-05 -
dc.description.abstract Visual events in daily life are usually accompanied by sounds. Can machines learn to correlate a visual scene with its sound, and localize the sound source, merely by observation as humans do? To investigate this empirical learnability, we first present a novel unsupervised algorithm for localizing sound sources in visual scenes. To this end, we develop a two-stream network that handles each modality with an attention mechanism; the network naturally reveals the localized response in the scene without human annotation. In addition, we build a new sound source dataset for performance evaluation. Our empirical evaluation shows, however, that the unsupervised method draws false conclusions in some cases, and that these cannot be corrected without human prior knowledge owing to the well-known mismatch between correlation and causality. To address this issue, we extend our network to supervised and semi-supervised settings via a simple modification, enabled by the general architecture of the two-stream network. We show that the false conclusions can be effectively corrected even with a small amount of supervision, i.e., in a semi-supervised setup. Furthermore, we demonstrate the versatility of the learned audio and visual embeddings for cross-modal content alignment, and extend the proposed algorithm to a new application: sound-saliency-based automatic camera-view panning in 360-degree videos. -
dc.identifier.bibliographicCitation IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, v.43, no.5, pp.1605 - 1619 -
dc.identifier.doi 10.1109/TPAMI.2019.2952095 -
dc.identifier.issn 0162-8828 -
dc.identifier.uri https://scholarworks.unist.ac.kr/handle/201301/87864 -
dc.identifier.wosid 000637533800009 -
dc.language English -
dc.publisher IEEE COMPUTER SOC -
dc.title Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications -
dc.type Article -
dc.description.isOpenAccess FALSE -
dc.relation.journalWebOfScienceCategory Computer Science, Artificial Intelligence; Engineering, Electrical & Electronic -
dc.relation.journalResearchArea Computer Science; Engineering -
dc.type.docType Article -
dc.description.journalRegisteredClass scie -
dc.description.journalRegisteredClass scopus -
dc.subject.keywordAuthor Videos -
dc.subject.keywordAuthor Task analysis -
dc.subject.keywordAuthor Correlation -
dc.subject.keywordAuthor Deep learning -
dc.subject.keywordAuthor Network architecture -
dc.subject.keywordAuthor Unsupervised learning -
dc.subject.keywordAuthor Audio-visual learning -
dc.subject.keywordAuthor sound localization -
dc.subject.keywordAuthor self-supervision -
dc.subject.keywordAuthor multi-modal learning -
dc.subject.keywordAuthor cross-modal retrieval -
dc.subject.keywordAuthor Visualization -
dc.subject.keywordPlus IDENTIFICATION -
dc.subject.keywordPlus SEARCH -
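The core mechanism described in the abstract — attending over visual locations with an audio embedding to reveal a localized response — can be sketched roughly as follows. This is a minimal NumPy illustration; the shapes, variable names, and cosine-similarity/softmax choice are assumptions for exposition, not the paper's exact network:

```python
import numpy as np

def localize(visual_feats: np.ndarray, audio_emb: np.ndarray) -> np.ndarray:
    """Attention-style sound localization sketch (hypothetical).

    Compares an audio embedding against each spatial visual feature
    via cosine similarity, then normalizes over locations with a
    softmax to produce a localization heatmap.

    visual_feats: (H, W, D) spatial feature map from a visual stream
    audio_emb:    (D,) embedding from an audio stream
    returns:      (H, W) heatmap summing to 1
    """
    # L2-normalize so the dot product becomes cosine similarity
    v = visual_feats / (np.linalg.norm(visual_feats, axis=-1, keepdims=True) + 1e-8)
    a = audio_emb / (np.linalg.norm(audio_emb) + 1e-8)
    sim = v @ a                              # (H, W) similarities
    exp = np.exp(sim - sim.max())            # stable softmax over locations
    return exp / exp.sum()

# Toy usage with random features standing in for real network outputs
rng = np.random.default_rng(0)
heat = localize(rng.standard_normal((7, 7, 512)), rng.standard_normal(512))
```

In an unsupervised setup of this kind, the heatmap is shaped only by the audio-visual correspondence objective, which is consistent with the abstract's point that correlation alone can localize the wrong (merely co-occurring) object without some supervision.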

Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.