QORT-Former: Query-optimized Real-time Transformer for Understanding Two Hands Manipulating Objects

Ismayilzada, Elkhan; Sayem, MD Khalequzzaman Chowdhury; Tiruneh, Yihalem Yimolal; Chowdhury, Mubarrat Tajoar; Boboev, Muhammadjon; Baek, Seungryul

doi:10.1609/aaai.v39i4.32407

Scholarworks@UNIST

Related Researcher

백승렬

Baek, Seungryul: UNIST VISION AND LEARNING LAB.

Full metadata record

DC Field	Value	Language
dc.citation.conferencePlace	US	-
dc.citation.endPage	3903	-
dc.citation.startPage	3895	-
dc.citation.title	AAAI Conference on Artificial Intelligence	-
dc.contributor.author	Ismayilzada, Elkhan	-
dc.contributor.author	Sayem, MD Khalequzzaman Chowdhury	-
dc.contributor.author	Tiruneh, Yihalem Yimolal	-
dc.contributor.author	Chowdhury, Mubarrat Tajoar	-
dc.contributor.author	Boboev, Muhammadjon	-
dc.contributor.author	Baek, Seungryul	-
dc.date.accessioned	2025-12-01T16:03:34Z	-
dc.date.available	2025-12-01T16:03:34Z	-
dc.date.created	2025-11-29	-
dc.date.issued	2025-02-28	-
dc.description.abstract	Significant advancements have been achieved in understanding the poses and interactions of two hands manipulating an object. The emergence of augmented reality (AR) and virtual reality (VR) technologies has heightened the demand for real-time performance in these applications. However, current state-of-the-art models often achieve promising results at the expense of substantial computational overhead. In this paper, we present a query-optimized real-time Transformer (QORT-Former), the first Transformer-based real-time framework for 3D pose estimation of two hands and an object. We first limit the number of queries and decoders to meet the efficiency requirement. Given a limited number of queries and decoders, we propose to optimize the queries that are fed to the Transformer decoder to secure better accuracy: (1) we divide queries into three types (a left hand query, a right hand query and an object query), and we enhance query features (2) by using the contact information between hands and an object and (3) by using a three-step update of enhanced image and query features with respect to one another. With the proposed methods, we achieve real-time pose estimation performance using just 108 queries and 1 decoder (53.5 FPS on an RTX 3090 Ti GPU). Our method also excels in accuracy, surpassing state-of-the-art results on the H2O dataset by 17.6% (left hand), 22.8% (right hand), and 27.2% (object), as well as on the FPHA dataset by 5.3% (right hand) and 10.4% (object). Additionally, it sets the state of the art in interaction recognition, maintaining real-time efficiency with an off-the-shelf action recognition module.	-
dc.identifier.bibliographicCitation	AAAI Conference on Artificial Intelligence, pp.3895 - 3903	-
dc.identifier.doi	10.1609/aaai.v39i4.32407	-
dc.identifier.uri	https://scholarworks.unist.ac.kr/handle/201301/88738	-
dc.language	English	-
dc.publisher	Association for the Advancement of Artificial Intelligence	-
dc.title	QORT-Former: Query-optimized Real-time Transformer for Understanding Two Hands Manipulating Objects	-
dc.type	Conference Paper	-
dc.date.conferenceDate	2025-02-25	-
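
The abstract describes a budget of 108 queries divided into three types and a single decoder, running at 53.5 FPS. The sketch below illustrates that setup only in rough outline, and it is not the authors' implementation. The even 36/36/36 split, the feature dimension, the random features, and the helper names (`split_queries`, `cross_attend`, `decode_once`) are all assumptions made for illustration; the record does not state the per-type query counts or the decoder internals. The contact-based enhancement and the three-step update are not modelled here.

```python
import math
import random

NUM_QUERIES = 108   # stated in the abstract
FPS = 53.5          # stated in the abstract (RTX 3090 Ti)
QUERY_TYPES = ("left_hand", "right_hand", "object")  # stated in the abstract


def split_queries(num_queries, types):
    """Assign a contiguous index range to each query type.

    Assumption: an even split across types (not given in the record).
    """
    per_type, remainder = divmod(num_queries, len(types))
    if remainder:
        raise ValueError("query count must divide evenly across types")
    return {
        name: range(i * per_type, (i + 1) * per_type)
        for i, name in enumerate(types)
    }


def cross_attend(query, feats):
    """Single-head scaled dot-product attention of one query over image features."""
    scale = math.sqrt(len(query))
    scores = [sum(q * f for q, f in zip(query, feat)) / scale for feat in feats]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * feat[d] for w, feat in zip(weights, feats)) / total
        for d in range(len(query))
    ]


def decode_once(queries, feats):
    """One decoder pass: each query is updated with a residual attention output."""
    return [
        [q + a for q, a in zip(query, cross_attend(query, feats))]
        for query in queries
    ]


rng = random.Random(0)
DIM = 8  # hypothetical feature dimension
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]
image_feats = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(16)]

groups = split_queries(NUM_QUERIES, QUERY_TYPES)
updated = decode_once(queries, image_feats)
left_hand_out = [updated[i] for i in groups["left_hand"]]

# Per-frame time budget implied by the reported frame rate.
frame_budget_ms = 1000.0 / FPS
```

Under these assumptions each query type owns 36 of the 108 decoder outputs, and the reported 53.5 FPS corresponds to a budget of roughly 18.7 ms per frame.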
