| dc.contributor.advisor |
Kim, Tae Hwan |
- |
| dc.contributor.author |
Kang, Seok Un |
- |
| dc.date.accessioned |
2024-04-11T15:19:56Z |
- |
| dc.date.available |
2024-04-11T15:19:56Z |
- |
| dc.date.issued |
2024-02 |
- |
| dc.description.abstract |
Video action recognition, the task of understanding what is happening in a video, is challenging but important. However, labeling video is costly, so semi-supervised learning (SSL) has been studied to improve performance when only a small amount of labeled data is available. Prior studies on semi-supervised video action recognition have mostly used a single modality, the visual stream. Video is inherently multi-modal, however, so using both visuals and audio is desirable and could improve performance further, yet this direction has not been well explored. We therefore propose audio-visual SSL for video action recognition, which uses the visual and audio modalities together even in the challenging setting where only a few labeled examples are available. In addition, to make the most of the information in audio and video, we propose a novel audio source localization-guided mixup method that considers the inter-modal relations between the video and audio modalities. Experiments on the UCF-51, Kinetics-400, and VGGSound datasets demonstrate the superior performance of the proposed audio-visual SSL action recognition approach and of the audio source localization-guided mixup. |
- |
| dc.description.degree |
Master |
- |
| dc.description |
Graduate School of Artificial Intelligence |
- |
| dc.identifier.uri |
https://scholarworks.unist.ac.kr/handle/201301/82146 |
- |
| dc.identifier.uri |
http://unist.dcollection.net/common/orgView/200000743485 |
- |
| dc.language |
ENG |
- |
| dc.publisher |
Ulsan National Institute of Science and Technology |
- |
| dc.rights.embargoReleaseDate |
9999-12-31 |
- |
| dc.rights.embargoReleaseTerms |
9999-12-31 |
- |
| dc.title |
Semi-supervised multi-modal video action recognition with audio source localization guided mixup |
- |
| dc.type |
Thesis |
- |