File Download

There are no files associated with this item.

  • Find it @ UNIST can give you direct access to the published full text of this article. (UNISTARs only)
Related Researcher

Ahn, Hyemin (안혜민)

Full metadata record

DC Field Value Language
dc.citation.startPage 103741 -
dc.citation.title COMPUTER VISION AND IMAGE UNDERSTANDING -
dc.citation.volume 233 -
dc.contributor.author Ni, Zhifan -
dc.contributor.author Mascaro, Esteve Valls -
dc.contributor.author Ahn, Hyemin -
dc.contributor.author Lee, Dongheui -
dc.date.accessioned 2023-12-21T11:48:39Z -
dc.date.available 2023-12-21T11:48:39Z -
dc.date.created 2023-07-26 -
dc.date.issued 2023-08 -
dc.description.abstract Understanding human-object interactions (HOIs) in a video is essential to fully comprehend a visual scene. This line of research has been addressed by detecting HOIs from images and, more recently, from videos. However, the video-based HOI anticipation task in the third-person view remains understudied. In this paper, we design a framework to detect current HOIs and anticipate future HOIs in videos. We propose to leverage human gaze information, since people often fixate on an object before interacting with it. These gaze features, together with the scene context and the visual appearances of human-object pairs, are fused through a spatio-temporal transformer. To evaluate the model on the HOI anticipation task in a multi-person scenario, we propose a set of person-wise multi-label metrics. Our model is trained and validated on the VidHOI dataset, which contains videos capturing daily life and is currently the largest video HOI dataset. Experimental results on the HOI detection task show that our approach outperforms the baseline by a large relative margin of 36.3%. Moreover, we conduct an extensive ablation study to demonstrate the effectiveness of our modifications and extensions to the spatio-temporal transformer. Our code is publicly available at https://github.com/nizhf/hoi-prediction-gaze-transformer. -
dc.identifier.bibliographicCitation COMPUTER VISION AND IMAGE UNDERSTANDING, v.233, pp.103741 -
dc.identifier.doi 10.1016/j.cviu.2023.103741 -
dc.identifier.issn 1077-3142 -
dc.identifier.scopusid 2-s2.0-85161302203 -
dc.identifier.uri https://scholarworks.unist.ac.kr/handle/201301/65165 -
dc.identifier.wosid 001019055200001 -
dc.language English -
dc.publisher ACADEMIC PRESS INC ELSEVIER SCIENCE -
dc.title Human-object interaction prediction in videos through gaze following -
dc.type Article -
dc.description.isOpenAccess FALSE -
dc.relation.journalWebOfScienceCategory Computer Science, Artificial Intelligence; Engineering, Electrical & Electronic -
dc.relation.journalResearchArea Computer Science; Engineering -
dc.type.docType Article -
dc.description.journalRegisteredClass scie -
dc.description.journalRegisteredClass scopus -
dc.subject.keywordAuthor Human-object interaction prediction -
dc.subject.keywordAuthor Semantic scene understanding -
dc.subject.keywordAuthor Spatial-temporal transformer -
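
The abstract above describes fusing human gaze features with scene context and the appearance of human-object pairs through a spatio-temporal transformer, and treating interaction prediction as a person-wise multi-label problem. The sketch below is a minimal, hypothetical illustration of that general idea in PyTorch; the class name, feature dimensions, fusion scheme, and number of interaction classes are assumptions for illustration only and do not reflect the authors' actual architecture, which is published at https://github.com/nizhf/hoi-prediction-gaze-transformer.

```python
# Hypothetical sketch: gaze-aware HOI prediction with a temporal transformer.
# NOT the authors' implementation; all names and dimensions are assumptions.
import torch
import torch.nn as nn


class GazeHOITransformerSketch(nn.Module):
    """Fuses per-frame human/object appearance, spatial (box), and gaze
    features over a short window, then scores interaction classes
    (multi-label) for one human-object pair."""

    def __init__(self, appearance_dim=1024, spatial_dim=8, gaze_dim=64,
                 d_model=256, num_heads=4, num_layers=2, num_interactions=25):
        super().__init__()
        # Project the concatenated pair features into the transformer width.
        pair_dim = 2 * appearance_dim + spatial_dim + gaze_dim
        self.input_proj = nn.Linear(pair_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=num_heads, dim_feedforward=4 * d_model,
            batch_first=True)
        # Temporal encoder: self-attention across the frames of the window.
        self.temporal_encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        self.classifier = nn.Linear(d_model, num_interactions)

    def forward(self, human_feat, object_feat, spatial_feat, gaze_feat):
        # All inputs: (batch, num_frames, feature_dim) for one human-object pair.
        x = torch.cat([human_feat, object_feat, spatial_feat, gaze_feat], dim=-1)
        x = self.input_proj(x)        # (batch, frames, d_model)
        x = self.temporal_encoder(x)  # temporal self-attention over frames
        # Read out the last observed frame to score current (detection) or
        # upcoming (anticipation) interactions; returns per-class logits.
        return self.classifier(x[:, -1])


if __name__ == "__main__":
    model = GazeHOITransformerSketch()
    B, T = 2, 5  # batch of 2 human-object pairs, 5-frame windows
    logits = model(torch.randn(B, T, 1024), torch.randn(B, T, 1024),
                   torch.randn(B, T, 8), torch.randn(B, T, 64))
    probs = torch.sigmoid(logits)  # multi-label interaction probabilities
    print(probs.shape)             # torch.Size([2, 25])
```

In this sketch the last frame's encoded representation feeds a single multi-label head; whether the actual model uses that readout, a different pooling, or separate heads for detection and anticipation is not specified here and should be checked against the linked repository.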


Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.