Prior research has focused on multilingual text and images in zero-shot settings due to the lack of multilingual image-text pair data. In contrast, to handle multilingual multimodal inputs directly, we introduce an Efficient Multilingual Multimodal Fusion (EMMF) network trained on machine-translated datasets. The projected multilingual and multimodal representations are aligned contrastively in conjunction with an autoregressive objective. Experiments on the xGQA dataset demonstrate that our model aligns representations more successfully than previous zero-shot methods and shows qualitative improvements over similar approaches.
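As a rough illustration of the alignment objective summarized above, the following is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss between projected text and image embeddings, combined with an autoregressive term. The function names, the temperature, and the loss weighting `lambda_c` are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(text_proj, image_proj, temperature=0.07):
    """Symmetric InfoNCE loss between projected text and image embeddings.

    text_proj, image_proj: (batch, dim) tensors from hypothetical
    multilingual-text and image projection heads.
    """
    text_proj = F.normalize(text_proj, dim=-1)
    image_proj = F.normalize(image_proj, dim=-1)
    logits = text_proj @ image_proj.t() / temperature  # (batch, batch)
    # Matching pairs lie on the diagonal; in-batch items act as negatives.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_t2i = F.cross_entropy(logits, targets)
    loss_i2t = F.cross_entropy(logits.t(), targets)
    return (loss_t2i + loss_i2t) / 2

def joint_loss(text_proj, image_proj, ar_loss, lambda_c=1.0):
    """Combine contrastive alignment with an autoregressive LM loss.

    ar_loss would come from the decoder's next-token prediction;
    lambda_c is an assumed weighting hyperparameter.
    """
    return ar_loss + lambda_c * contrastive_alignment_loss(text_proj, image_proj)
```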