
A Study on the Application of Multi-modal Learning for Real-World Challenges using Prototype

Author(s)
Sohn, Wonho
Advisor
Lim, Chiehyeon
Issued Date
2026-02
URI
https://scholarworks.unist.ac.kr/handle/201301/90973
http://unist.dcollection.net/common/orgView/200000965089
Abstract
Modern information systems increasingly capture heterogeneous signals across multiple modalities, including images, text, audio/video, and structured logs, creating strong incentives for multi-modal learning. By integrating complementary information and enforcing cross-modal consistency, multi-modal models can learn unified representations that encode complementary signals, thereby improving accuracy, robustness, and generalization. However, deploying these methods beyond curated benchmarks introduces additional real-world requirements that are not addressed by predictive performance alone. Real-world applications must remain feasible as datasets grow continuously, remain resilient to incomplete, noisy, or partially missing inputs, and provide checkable rationales for decisions.

This dissertation advances a unified perspective that pursues two objectives: (i) strengthening multi-modal representations to improve downstream performance via heterogeneous integration, and (ii) introducing prototypes as a complementary design principle that mitigates real-world constraints. In the first study, on fashion e-commerce, we propose MDL-FR, an end-to-end framework that integrates visual and textual data and learns style prototypes capturing high-level structure, enabling style-aware outfit generation beyond compatibility-only recommendation. In the second study, on single-cell multi-omics integration, we propose CPG-AE, which replaces dense cell–cell interactions with a sparse cell–prototype graph and combines prototype-mediated message passing with a multi-modal fusion autoencoder to learn coherent joint embeddings. Across both domains, experimental results show that multi-modal architectures improve task performance, while learned prototypes provide compact anchors that enhance scalability, robustness to incomplete data, and evidence-oriented validation beyond aggregate metrics.
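The scalability claim above rests on routing interactions through a small set of prototypes instead of a dense cell–cell graph. The following is a minimal NumPy sketch of that idea only; the sizes, nearest-prototype assignment rule, and mixing weight `alpha` are illustrative assumptions, not the dissertation's CPG-AE implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embeddings: N cells, K prototypes (K << N), d dimensions.
N, K, d = 200, 8, 16
cells = rng.normal(size=(N, d))
prototypes = rng.normal(size=(K, d))

# Hard-assign each cell to its nearest prototype. This induces a
# sparse bipartite cell-prototype graph with N edges, versus the
# O(N^2) edges of a dense cell-cell interaction graph.
dists = np.linalg.norm(cells[:, None, :] - prototypes[None, :, :], axis=-1)
assign = dists.argmin(axis=1)                 # (N,) prototype index per cell

# One round of prototype-mediated message passing:
# cell -> prototype (mean of assigned cells), then prototype -> cell.
proto_msg = np.zeros_like(prototypes)
np.add.at(proto_msg, assign, cells)           # unbuffered scatter-add per prototype
counts = np.bincount(assign, minlength=K)
proto_msg /= np.maximum(counts, 1)[:, None]   # guard empty prototypes

# Each cell mixes in its prototype's aggregate; alpha is an assumed weight.
alpha = 0.5
cells_updated = (1 - alpha) * cells + alpha * proto_msg[assign]
```

Because each update touches only N cell–prototype edges, the cost per round is O(N·K·d) rather than O(N²·d), which is what makes the approach feasible as the dataset grows.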
Publisher
Ulsan National Institute of Science and Technology
Degree
Doctor
Major
Department of Industrial Engineering


Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.