File Download

There are no files associated with this item.

  • Find it @ UNIST can give you direct access to the published full text of this article. (UNISTARs only)
Related Researcher

Ahn, Hyemin (안혜민)

Full metadata record

DC Field Value Language
dc.citation.startPage 103741 -
dc.citation.title COMPUTER VISION AND IMAGE UNDERSTANDING -
dc.citation.volume 233 -
dc.contributor.author Ni, Zhifan -
dc.contributor.author Mascaro, Esteve Valls -
dc.contributor.author Ahn, Hyemin -
dc.contributor.author Lee, Dongheui -
dc.date.accessioned 2023-12-21T11:48:39Z -
dc.date.available 2023-12-21T11:48:39Z -
dc.date.created 2023-07-26 -
dc.date.issued 2023-08 -
dc.description.abstract Understanding human-object interactions (HOIs) in a video is essential to fully comprehend a visual scene. This line of research has been addressed by detecting HOIs from images and, more recently, from videos. However, the video-based HOI anticipation task in the third-person view remains understudied. In this paper, we design a framework to detect current HOIs and anticipate future HOIs in videos. We propose to leverage human gaze information, since people often fixate on an object before interacting with it. These gaze features, together with the scene context and the visual appearances of human-object pairs, are fused through a spatio-temporal transformer. To evaluate the model on the HOI anticipation task in a multi-person scenario, we propose a set of person-wise multi-label metrics. Our model is trained and validated on the VidHOI dataset, which contains videos capturing daily life and is currently the largest video HOI dataset. Experimental results on the HOI detection task show that our approach outperforms the baseline by a large relative margin of 36.3%. Moreover, we conduct an extensive ablation study to demonstrate the effectiveness of our modifications and extensions to the spatio-temporal transformer. Our code is publicly available at https://github.com/nizhf/hoi-prediction-gaze-transformer. -
dc.identifier.bibliographicCitation COMPUTER VISION AND IMAGE UNDERSTANDING, v.233, pp.103741 -
dc.identifier.doi 10.1016/j.cviu.2023.103741 -
dc.identifier.issn 1077-3142 -
dc.identifier.scopusid 2-s2.0-85161302203 -
dc.identifier.uri https://scholarworks.unist.ac.kr/handle/201301/65165 -
dc.identifier.wosid 001019055200001 -
dc.language English -
dc.publisher ACADEMIC PRESS INC ELSEVIER SCIENCE -
dc.title Human-object interaction prediction in videos through gaze following -
dc.type Article -
dc.description.isOpenAccess FALSE -
dc.relation.journalWebOfScienceCategory Computer Science, Artificial Intelligence; Engineering, Electrical & Electronic -
dc.relation.journalResearchArea Computer Science; Engineering -
dc.type.docType Article -
dc.description.journalRegisteredClass scie -
dc.description.journalRegisteredClass scopus -
dc.subject.keywordAuthor Human-object interaction prediction -
dc.subject.keywordAuthor Semantic scene understanding -
dc.subject.keywordAuthor Spatial-temporal transformer -
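
The abstract above describes fusing human gaze features with scene context and the appearance of human-object pairs through a spatio-temporal transformer, and treating interaction prediction as a person-wise multi-label problem. The sketch below is a minimal, hypothetical illustration of that general idea in PyTorch; the class name, feature dimensions, fusion scheme, and number of interaction classes are assumptions for illustration only and do not reflect the authors' actual architecture, which is published at https://github.com/nizhf/hoi-prediction-gaze-transformer.

```python
# Hypothetical sketch: gaze-aware HOI prediction with a temporal transformer.
# NOT the authors' implementation; all names and dimensions are assumptions.
import torch
import torch.nn as nn


class GazeHOITransformerSketch(nn.Module):
    """Fuses per-frame human/object appearance, spatial (box), and gaze
    features over a short window, then scores interaction classes
    (multi-label) for one human-object pair."""

    def __init__(self, appearance_dim=1024, spatial_dim=8, gaze_dim=64,
                 d_model=256, num_heads=4, num_layers=2, num_interactions=25):
        super().__init__()
        # Project the concatenated pair features into the transformer width.
        pair_dim = 2 * appearance_dim + spatial_dim + gaze_dim
        self.input_proj = nn.Linear(pair_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=num_heads, dim_feedforward=4 * d_model,
            batch_first=True)
        # Temporal encoder: self-attention across the frames of the window.
        self.temporal_encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        self.classifier = nn.Linear(d_model, num_interactions)

    def forward(self, human_feat, object_feat, spatial_feat, gaze_feat):
        # All inputs: (batch, num_frames, feature_dim) for one human-object pair.
        x = torch.cat([human_feat, object_feat, spatial_feat, gaze_feat], dim=-1)
        x = self.input_proj(x)        # (batch, frames, d_model)
        x = self.temporal_encoder(x)  # temporal self-attention over frames
        # Read out the last observed frame to score current (detection) or
        # upcoming (anticipation) interactions; returns per-class logits.
        return self.classifier(x[:, -1])


if __name__ == "__main__":
    model = GazeHOITransformerSketch()
    B, T = 2, 5  # batch of 2 human-object pairs, 5-frame windows
    logits = model(torch.randn(B, T, 1024), torch.randn(B, T, 1024),
                   torch.randn(B, T, 8), torch.randn(B, T, 64))
    probs = torch.sigmoid(logits)  # multi-label interaction probabilities
    print(probs.shape)             # torch.Size([2, 25])
```

In this sketch the last frame's encoded representation feeds a single multi-label head; whether the actual model uses that readout, a different pooling, or separate heads for detection and anticipation is not specified here and should be checked against the linked repository.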


Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.