

Full metadata record

DC Field Value Language
dc.citation.endPage 1619 -
dc.citation.number 5 -
dc.citation.startPage 1605 -
dc.citation.title IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE -
dc.citation.volume 43 -
dc.contributor.author Senocak, Arda -
dc.contributor.author Oh, Tae-Hyun -
dc.contributor.author Kim, Junsik -
dc.contributor.author Yang, Ming-Hsuan -
dc.contributor.author Kweon, In So -
dc.date.accessioned 2025-09-03T14:00:01Z -
dc.date.available 2025-09-03T14:00:01Z -
dc.date.created 2025-09-03 -
dc.date.issued 2021-05 -
dc.description.abstract Visual events in daily life are usually accompanied by sounds. Can machines learn to correlate a visual scene with its sound, and localize the sound source, merely by observation as humans do? To investigate this empirical learnability, we first present a novel unsupervised algorithm for localizing sound sources in visual scenes. To this end, we develop a two-stream network that handles each modality with an attention mechanism; the network naturally reveals the localized response in the scene without human annotation. In addition, we build a new sound source dataset for performance evaluation. Our empirical evaluation shows, however, that the unsupervised method draws false conclusions in some cases, and that these cannot be corrected without human prior knowledge owing to the well-known mismatch between correlation and causality. To address this issue, we extend our network to supervised and semi-supervised settings via a simple modification, enabled by the general architecture of the two-stream network. We show that the false conclusions can be effectively corrected even with a small amount of supervision, i.e., in a semi-supervised setup. Furthermore, we demonstrate the versatility of the learned audio and visual embeddings for cross-modal content alignment, and extend the proposed algorithm to a new application: sound-saliency-based automatic camera-view panning in 360-degree videos. -
dc.identifier.bibliographicCitation IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, v.43, no.5, pp.1605 - 1619 -
dc.identifier.doi 10.1109/TPAMI.2019.2952095 -
dc.identifier.issn 0162-8828 -
dc.identifier.uri https://scholarworks.unist.ac.kr/handle/201301/87864 -
dc.identifier.wosid 000637533800009 -
dc.language English -
dc.publisher IEEE COMPUTER SOC -
dc.title Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications -
dc.type Article -
dc.description.isOpenAccess FALSE -
dc.relation.journalWebOfScienceCategory Computer Science, Artificial Intelligence; Engineering, Electrical & Electronic -
dc.relation.journalResearchArea Computer Science; Engineering -
dc.type.docType Article -
dc.description.journalRegisteredClass scie -
dc.description.journalRegisteredClass scopus -
dc.subject.keywordAuthor Videos -
dc.subject.keywordAuthor Task analysis -
dc.subject.keywordAuthor Correlation -
dc.subject.keywordAuthor Deep learning -
dc.subject.keywordAuthor Network architecture -
dc.subject.keywordAuthor Unsupervised learning -
dc.subject.keywordAuthor Audio-visual learning -
dc.subject.keywordAuthor sound localization -
dc.subject.keywordAuthor self-supervision -
dc.subject.keywordAuthor multi-modal learning -
dc.subject.keywordAuthor cross-modal retrieval -
dc.subject.keywordAuthor Visualization -
dc.subject.keywordPlus IDENTIFICATION -
dc.subject.keywordPlus SEARCH -
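The core mechanism described in the abstract — attending over visual locations with an audio embedding to reveal a localized response — can be sketched roughly as follows. This is a minimal NumPy illustration; the shapes, variable names, and cosine-similarity/softmax choice are assumptions for exposition, not the paper's exact network:

```python
import numpy as np

def localize(visual_feats: np.ndarray, audio_emb: np.ndarray) -> np.ndarray:
    """Attention-style sound localization sketch (hypothetical).

    Compares an audio embedding against each spatial visual feature
    via cosine similarity, then normalizes over locations with a
    softmax to produce a localization heatmap.

    visual_feats: (H, W, D) spatial feature map from a visual stream
    audio_emb:    (D,) embedding from an audio stream
    returns:      (H, W) heatmap summing to 1
    """
    # L2-normalize so the dot product becomes cosine similarity
    v = visual_feats / (np.linalg.norm(visual_feats, axis=-1, keepdims=True) + 1e-8)
    a = audio_emb / (np.linalg.norm(audio_emb) + 1e-8)
    sim = v @ a                              # (H, W) similarities
    exp = np.exp(sim - sim.max())            # stable softmax over locations
    return exp / exp.sum()

# Toy usage with random features standing in for real network outputs
rng = np.random.default_rng(0)
heat = localize(rng.standard_normal((7, 7, 512)), rng.standard_normal(512))
```

In an unsupervised setup of this kind, the heatmap is shaped only by the audio-visual correspondence objective, which is consistent with the abstract's point that correlation alone can localize the wrong (merely co-occurring) object without some supervision.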

Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.