
Question-aware Caption Refinement for Video Question Answering

Author(s)
Ki, Youngbin
Advisor
Kim, Taehwan
Issued Date
2025-08
URI
https://scholarworks.unist.ac.kr/handle/201301/88266
http://unist.dcollection.net/common/orgView/200000905240
Abstract
While recent VideoQA studies use captions converted from frames as the main source for LLM reasoning, they primarily focus on selecting key frames and often overlook the content of the captions themselves. In this work, we hypothesize that caption content directly influences the reasoning process of the LLM. To validate this hypothesis, we establish an evaluation setting that isolates the effect of caption content, and our findings show that general captions frequently lack question-relevant information and sometimes even hinder reasoning. To address this issue, we propose a question-aware caption refinement framework that extracts question-related events and event-specific visual elements and incorporates them into refined captions. Extensive experiments across multiple datasets and baselines demonstrate that our refined captions consistently improve over general captions on both commonsense and non-commonsense questions. For non-commonsense questions in particular, our method improves accuracy by 11.8% on NExT-QA and 14.6% on IntentQA. These results empirically validate our hypothesis and highlight the importance of aligning caption content with the intent of the question to enable accurate and robust reasoning in VideoQA.
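The refinement framework summarized in the abstract could be sketched, in outline, as the following hypothetical pipeline. All names, prompts, and the toy stand-in LLM below are illustrative assumptions, not the thesis's actual implementation; a real system would call an actual language model and caption generator.

```python
# Hypothetical sketch: question-aware caption refinement for VideoQA.
# `llm` is any callable mapping a prompt string to a response string.

def extract_events(question, llm):
    """Ask the model which events the question is about (illustrative prompt)."""
    return llm(f"List the events this question asks about: {question}")

def refine_caption(caption, events, llm):
    """Rewrite one frame caption so it foregrounds question-related events."""
    return llm(f"Rewrite this caption to emphasize {events}: {caption}")

def refine_captions(question, captions, llm):
    """Refine every frame caption with respect to the question's events."""
    events = extract_events(question, llm)
    return [refine_caption(c, events, llm) for c in captions]

# Toy stand-in for an LLM, used only so the sketch runs: it echoes back
# the text after the final ": " in the prompt.
def toy_llm(prompt):
    return prompt.rsplit(": ", 1)[1]

refined = refine_captions(
    "Why does the man jump?",
    ["a man runs", "a man jumps"],
    toy_llm,
)
```

The refined captions would then replace the general captions as the LLM's reasoning context when answering the question.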
Publisher
Ulsan National Institute of Science and Technology
Degree
Master
Major
Graduate School of Artificial Intelligence
Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.