
Towards Efficient Multilingual Multimodal Fusion: A Contrastive Learning Approach Using Machine-Translation

Author(s)
Kim, Jongeun
Advisor
Kim, Taehwan
Issued Date
2024-08
URI
https://scholarworks.unist.ac.kr/handle/201301/84192
http://unist.dcollection.net/common/orgView/200000813131
Abstract
Prior research has focused on multilingual text and images in zero-shot settings due to the lack of multilingual image-text pair data. To handle multilingual multimodal inputs directly, we instead introduce an Efficient Multilingual Multimodal Fusion (EMMF) network trained on machine-translated datasets. The multilingual and multimodal projected representations are aligned contrastively while being adjusted in an autoregressive manner. Experiments on the xGQA dataset demonstrate that our model aligns representations more successfully than previous zero-shot methods and shows qualitative improvements over similar approaches.
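The thesis itself is not reproduced in this record, but the contrastive alignment of projected representations described in the abstract is typically realized with a symmetric InfoNCE-style objective over paired embeddings. The sketch below is illustrative only: the function name, temperature value, and use of NumPy are assumptions, not the actual EMMF implementation.

```python
import numpy as np

def info_nce_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss over a batch of paired
    embeddings (row i of each array is a matching pair). Hypothetical
    sketch; the real EMMF objective and hyperparameters may differ."""
    # L2-normalize rows so the dot product equals cosine similarity
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature      # (batch, batch) similarity matrix
    labels = np.arange(len(logits))     # matching pairs lie on the diagonal

    def xent(lg):
        # cross-entropy of each row against its diagonal (positive) entry
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average of text-to-image and image-to-text directions
    return 0.5 * (xent(logits) + xent(logits.T))

# Perfectly paired embeddings should score a lower loss than mismatched ones
rng = np.random.default_rng(0)
a = rng.normal(size=(4, 8))
aligned = info_nce_loss(a, a)
shuffled = info_nce_loss(a, a[::-1])
```

In this formulation, each text embedding is pulled toward its paired image embedding and pushed away from the other images in the batch, which is one standard way representations from two modalities (or two languages) are made to share a joint space.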
Publisher
Ulsan National Institute of Science and Technology
Degree
Master
Major
Graduate School of Artificial Intelligence
Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.