Semi-supervised multi-modal video action recognition with audio source localization guided mixup

Author(s)
Kang, Seok Un
Advisor
Kim, Tae Hwan
Issued Date
2024-02
URI
https://scholarworks.unist.ac.kr/handle/201301/82146 http://unist.dcollection.net/common/orgView/200000743485
Abstract
Video action recognition, the task of understanding what happens in a video, is challenging but important. However, labeling video is costly, so semi-supervised learning (SSL) has been studied as a way to improve performance when only a small amount of labeled data is available. Prior work on semi-supervised video action recognition has mostly relied on a single modality, visuals. Because video is inherently multi-modal, using both visuals and audio is desirable and could improve performance further, but this direction has not been well explored. We therefore propose audio-visual SSL for video action recognition, which uses the visual and audio modalities together, even in the challenging setting where very little labeled data is available. In addition, to make the most of the information in both audio and video, we propose a novel audio source localization-guided mixup method that considers the inter-modal relations between the video and audio modalities. Experiments on the UCF-51, Kinetics-400, and VGGSound datasets show the superior performance of the proposed audio-visual SSL action recognition and audio source localization-guided mixup.
Publisher
Ulsan National Institute of Science and Technology
Degree
Master
Major
Graduate School of Artificial Intelligence