File Download

There are no files associated with this item.

Related Researcher

Kim, Hyounghun (김형훈)


Full metadata record

DC Field Value Language
dc.citation.conferencePlace ZZ -
dc.citation.conferencePlace Online -
dc.citation.title Annual Meeting of the Association for Computational Linguistics -
dc.contributor.author Kim, Hyounghun -
dc.contributor.author Tang, Zineng -
dc.contributor.author Bansal, Mohit -
dc.date.accessioned 2024-01-31T23:06:25Z -
dc.date.available 2024-01-31T23:06:25Z -
dc.date.created 2022-10-21 -
dc.date.issued 2020-07-06 -
dc.description.abstract Videos convey rich information. Dynamic spatio-temporal relationships between people/objects and diverse multimodal events are present in a video clip. Hence, it is important to develop automated models that can accurately extract such information from videos. Answering questions on videos is one of the tasks that can evaluate such AI abilities. In this paper, we propose a video question answering model which effectively integrates multimodal input sources and finds the temporally relevant information to answer questions. Specifically, we first employ dense image captions to help identify objects and their detailed salient regions and actions, and hence give the model useful extra information (in explicit textual format to allow easier matching) for answering questions. Moreover, our model also comprises dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates which pass more relevant information to the classifier. Finally, we also cast the frame selection problem as a multi-label classification task and introduce two loss functions, In-and-Out Frame Score Margin (IOFSM) and Balanced Binary Cross-Entropy (BBCE), to better supervise the model with human importance annotations. We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin (74.09% versus 70.52%). We also present several word, object, and frame level visualization studies. -
dc.identifier.bibliographicCitation Annual Meeting of the Association for Computational Linguistics -
dc.identifier.doi 10.18653/v1/2020.acl-main.435 -
dc.identifier.uri https://scholarworks.unist.ac.kr/handle/201301/78455 -
dc.publisher Annual Meeting of the Association for Computational Linguistics -
dc.title Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA -
dc.type Conference Paper -
dc.date.conferenceDate 2020-07-06 -
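
The abstract above casts frame selection as a multi-label classification task supervised with a Balanced Binary Cross-Entropy (BBCE) loss over per-frame scores. As a rough illustration only, the sketch below shows a generic class-balanced binary cross-entropy in Python (PyTorch); the function name `balanced_bce`, the tensor shapes, and the exact balancing scheme are assumptions made for this sketch, not the authors' released implementation.

```python
# Minimal sketch (assumed formulation): a class-balanced binary cross-entropy
# over per-frame scores, illustrating "frame selection as multi-label
# classification" from the abstract. Names and shapes are hypothetical.
import torch

def balanced_bce(frame_logits: torch.Tensor, in_answer: torch.Tensor) -> torch.Tensor:
    """frame_logits: (batch, num_frames) raw scores; in_answer: (batch, num_frames) 0/1 labels."""
    probs = torch.sigmoid(frame_logits)
    eps = 1e-8
    pos = in_answer.float()
    neg = 1.0 - pos
    # Average the positive and negative BCE terms separately, so the many
    # out-of-answer frames do not dominate the few in-answer frames.
    pos_loss = -(pos * torch.log(probs + eps)).sum() / pos.sum().clamp(min=1.0)
    neg_loss = -(neg * torch.log(1.0 - probs + eps)).sum() / neg.sum().clamp(min=1.0)
    return pos_loss + neg_loss

# Example usage with random tensors:
logits = torch.randn(2, 10)                # scores for 10 frames in 2 clips
labels = (torch.rand(2, 10) > 0.7).long()  # 1 = annotated as an important frame
loss = balanced_bce(logits, labels)
```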


Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.