| DC Field | Value | Language |
| --- | --- | --- |
| dc.contributor.advisor | Kim, Taehwan | - |
| dc.contributor.author | Kim, Jongeun | - |
| dc.date.accessioned | 2024-10-14T13:50:41Z | - |
| dc.date.available | 2024-10-14T13:50:41Z | - |
| dc.date.issued | 2024-08 | - |
| dc.description.abstract | Prior research has focused on multilingual text and images in zero-shot settings due to the lack of multilingual image-text pair data. In contrast, to handle multilingual multimodal inputs directly, we introduce an Efficient Multilingual Multimodal Fusion (EMMF) network trained on machine-translated datasets. The multilingual and multimodal projected representations are aligned contrastively, alongside an autoregressive objective. Experiments on the xGQA dataset demonstrate that our model aligns representations more successfully than previous zero-shot methods and shows qualitative improvements over similar methods. | - |
| dc.description.degree | Master | - |
| dc.description | Graduate School of Artificial Intelligence | - |
| dc.identifier.uri | https://scholarworks.unist.ac.kr/handle/201301/84192 | - |
| dc.identifier.uri | http://unist.dcollection.net/common/orgView/200000813131 | - |
| dc.language | ENG | - |
| dc.publisher | Ulsan National Institute of Science and Technology | - |
| dc.title | Towards Efficient Multilingual Multimodal Fusion: A Contrastive Learning Approach Using Machine-Translation | - |
| dc.type | Thesis | - |