
Question-aware Caption Refinement for Video Question Answering

Author(s)
Ki, Youngbin
Advisor
Kim, Taehwan
Issued Date
2025-08
URI
https://scholarworks.unist.ac.kr/handle/201301/88266
http://unist.dcollection.net/common/orgView/200000905240
Abstract
While recent VideoQA studies use captions converted from frames as the main source for LLM reasoning, they primarily focus on selecting key frames and often overlook the content of the captions themselves. In this work, we hypothesize that caption content directly influences the reasoning process of the LLM. To validate this hypothesis, we establish an evaluation setting that isolates the effect of caption content, and our findings show that general captions frequently lack question-relevant information and sometimes even hinder reasoning. To address this issue, we propose a question-aware caption refinement framework that extracts question-related events and event-specific visual elements and incorporates them into refined captions. Extensive experiments across multiple datasets and baselines demonstrate that our refined captions consistently improve over general captions on both commonsense and non-commonsense questions. For non-commonsense questions in particular, our method improves accuracy by 11.8% on NExT-QA and 14.6% on IntentQA. These results empirically validate our hypothesis and highlight the importance of aligning caption content with the intent of the question to enable accurate and robust reasoning in VideoQA.
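The refinement framework summarized in the abstract could be sketched, in outline, as the following hypothetical pipeline. All names, prompts, and the toy stand-in LLM below are illustrative assumptions, not the thesis's actual implementation; a real system would call an actual language model and caption generator.

```python
# Hypothetical sketch: question-aware caption refinement for VideoQA.
# `llm` is any callable mapping a prompt string to a response string.

def extract_events(question, llm):
    """Ask the model which events the question is about (illustrative prompt)."""
    return llm(f"List the events this question asks about: {question}")

def refine_caption(caption, events, llm):
    """Rewrite one frame caption so it foregrounds question-related events."""
    return llm(f"Rewrite this caption to emphasize {events}: {caption}")

def refine_captions(question, captions, llm):
    """Refine every frame caption with respect to the question's events."""
    events = extract_events(question, llm)
    return [refine_caption(c, events, llm) for c in captions]

# Toy stand-in for an LLM, used only so the sketch runs: it echoes back
# the text after the final ": " in the prompt.
def toy_llm(prompt):
    return prompt.rsplit(": ", 1)[1]

refined = refine_captions(
    "Why does the man jump?",
    ["a man runs", "a man jumps"],
    toy_llm,
)
```

The refined captions would then replace the general captions as the LLM's reasoning context when answering the question.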
Publisher
Ulsan National Institute of Science and Technology
Degree
Master
Major
Graduate School of Artificial Intelligence
Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.