Full metadata record

DC Field Value Language
dc.contributor.advisor Kim, Taehwan -
dc.contributor.author Kim, Jongeun -
dc.date.accessioned 2024-10-14T13:50:41Z -
dc.date.available 2024-10-14T13:50:41Z -
dc.date.issued 2024-08 -
dc.description.abstract Prior research has focused on multilingual text and images in zero-shot settings due to the lack of multilingual image-text pair data. To handle multilingual multimodal input directly, we instead introduce an Efficient Multilingual Multimodal Fusion (EMMF) network trained on machine-translated datasets. The multilingual and multimodal projected representations are learned contrastively and aligned in an autoregressive manner. Experiments on the xGQA dataset demonstrate that our model aligns representations more successfully than previous zero-shot methods and shows qualitative improvements over similar methods. -
dc.description.degree Master -
dc.description Graduate School of Artificial Intelligence -
dc.identifier.uri https://scholarworks.unist.ac.kr/handle/201301/84192 -
dc.identifier.uri http://unist.dcollection.net/common/orgView/200000813131 -
dc.language ENG -
dc.publisher Ulsan National Institute of Science and Technology -
dc.title Towards Efficient Multilingual Multimodal Fusion: A Contrastive Learning Approach Using Machine-Translation -
dc.type Thesis -
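The abstract describes aligning multilingual text and image representations with a contrastive objective. As a rough illustration of that idea (not the thesis's actual EMMF implementation — the function name, loss form, and temperature value below are assumptions), a symmetric InfoNCE-style loss over paired projected embeddings can be sketched as:

```python
import numpy as np

def contrastive_alignment_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss aligning paired embeddings.

    Rows of text_emb and image_emb are assumed to be paired (the i-th
    text matches the i-th image). Illustrative only, not EMMF's code.
    """
    # L2-normalize so dot products become cosine similarities
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature      # (batch, batch) similarity matrix
    labels = np.arange(len(logits))     # matching pairs lie on the diagonal

    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()      # cross-entropy vs. diagonal

    # average the text-to-image and image-to-text directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this loss pulls each text embedding toward its paired image embedding and pushes it away from the other images in the batch, which is the alignment effect the abstract attributes to the contrastive stage.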
Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.