

Full metadata record

DC Field Value Language
dc.citation.endPage 7659 -
dc.citation.number 9 -
dc.citation.startPage 7643 -
dc.citation.title IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE -
dc.citation.volume 47 -
dc.contributor.author Senocak, Arda -
dc.contributor.author Ryu, Hyeonggon -
dc.contributor.author Kim, Junsik -
dc.contributor.author Oh, Tae-Hyun -
dc.contributor.author Pfister, Hanspeter -
dc.contributor.author Chung, Joon Son -
dc.date.accessioned 2025-09-03T14:00:00Z -
dc.date.available 2025-09-03T14:00:00Z -
dc.date.created 2025-09-03 -
dc.date.issued 2025-09 -
dc.description.abstract Recent studies on learning-based sound source localization have primarily focused on localization performance. However, prior work and existing benchmarks often overlook a crucial aspect: cross-modal interaction, which is essential for interactive sound source localization. This interaction is vital for understanding semantically matched or mismatched audio-visual events, such as silent objects or true sound sources among multiple objects. In this work, we comprehensively examine the cross-modal interaction of existing methods, benchmarks, evaluation metrics, and cross-modal understanding tasks. We identify the overlooked points of previous studies and make several contributions to address them. First, we propose a learning framework that incorporates retrieval-based and hand-crafted augmentation techniques, enhancing cross-modal interaction through cross-modal alignment. Second, we introduce new evaluation metrics to accurately and rigorously assess localization methods, focusing on both localization performance and cross-modal interaction. Third, to thoroughly analyze interactive sound source localization, we present a new semi-synthetic benchmark with diverse categorical combinations. Finally, we evaluate both interactive sound source localization and auxiliary cross-modal retrieval tasks, benchmarking competing methods alongside our own. Our new benchmark and evaluation metrics reveal that previous methods struggle with interactive sound source localization tasks, largely due to their limited cross-modal interaction capabilities. Our method, which features enhanced cross-modal alignment, demonstrates superior sound source localization and cross-modal interaction performance. This work provides the most comprehensive analysis of sound source localization to date, with extensive validation of competing methods on both existing and new benchmarks using both new and standard evaluation metrics. -
dc.identifier.bibliographicCitation IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, v.47, no.9, pp.7643 - 7659 -
dc.identifier.doi 10.1109/TPAMI.2025.3573994 -
dc.identifier.issn 1939-3539 -
dc.identifier.scopusid 2-s2.0-105006557418 -
dc.identifier.uri https://scholarworks.unist.ac.kr/handle/201301/87862 -
dc.identifier.wosid 001547707900015 -
dc.language English -
dc.publisher IEEE COMPUTER SOC -
dc.title Toward Interactive Sound Source Localization: Better Align Sight and Sound! -
dc.type Article -
dc.description.isOpenAccess FALSE -
dc.relation.journalWebOfScienceCategory Computer Science, Artificial Intelligence; Engineering, Electrical & Electronic -
dc.relation.journalResearchArea Computer Science; Engineering -
dc.type.docType Article -
dc.description.journalRegisteredClass scie -
dc.description.journalRegisteredClass scopus -
dc.subject.keywordAuthor Benchmark testing -
dc.subject.keywordAuthor Visualization -
dc.subject.keywordAuthor Measurement -
dc.subject.keywordAuthor Semantics -
dc.subject.keywordAuthor Contrastive learning -
dc.subject.keywordAuthor Cross modal retrieval -
dc.subject.keywordAuthor Representation learning -
dc.subject.keywordAuthor Training -
dc.subject.keywordAuthor Dogs -
dc.subject.keywordAuthor Audio-visual learning -
dc.subject.keywordAuthor sound source localization -
dc.subject.keywordAuthor self-supervision -
dc.subject.keywordAuthor multi-modal learning -
dc.subject.keywordAuthor cross-modal retrieval -
dc.subject.keywordAuthor Location awareness -
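The abstract describes enhancing cross-modal interaction through contrastive cross-modal alignment between audio and visual embeddings. The paper's actual architecture is not reproduced in this record, so the following is only a minimal sketch of the generic technique named in the abstract and keywords (symmetric InfoNCE contrastive loss over paired audio-visual embeddings); the function names, batch layout, and temperature value are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def cross_modal_infonce(audio_emb, visual_emb, tau=0.07):
    """Symmetric InfoNCE loss over a batch of paired audio/visual embeddings.

    Rows of `audio_emb` and `visual_emb` are assumed to be matched pairs, so
    matched pairs sit on the diagonal of the similarity matrix; the loss pulls
    them together and pushes mismatched (off-diagonal) pairs apart.
    """
    # L2-normalize so the dot product is cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    logits = a @ v.T / tau  # (B, B) similarity matrix, temperature-scaled
    idx = np.arange(len(a))
    # Audio-to-visual retrieval (rows) and visual-to-audio retrieval (columns).
    loss_a2v = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_v2a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_a2v + loss_v2a) / 2
```

With perfectly aligned embeddings (e.g. identical orthonormal rows in both modalities) the loss is near zero, while permuting one modality's rows drives it up — the property such an alignment objective exploits to separate true sound sources from semantically mismatched or silent objects.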

Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.