
A Study on the Application of Multi-modal Learning for Real-World Challenges using Prototype

Author(s)
Sohn, Wonho
Advisor
Lim, Chiehyeon
Issued Date
2026-02
URI
https://scholarworks.unist.ac.kr/handle/201301/90973
http://unist.dcollection.net/common/orgView/200000965089
Abstract
Modern information systems increasingly capture heterogeneous signals across multiple modalities, including images, text, audio/video, and structured logs, creating strong incentives for multi-modal learning. By integrating complementary information and enforcing cross-modal consistency, multi-modal models can learn unified representations that encode complementary signals, thereby improving accuracy, robustness, and generalization. However, deploying these methods beyond curated benchmarks introduces additional real-world requirements that are not addressed by predictive performance alone. Real-world applications must remain feasible as datasets grow continuously, remain resilient to incomplete, noisy, or partially missing inputs, and provide checkable rationales for decisions.

This dissertation advances a unified perspective that pursues two objectives: (i) strengthening multi-modal representations to improve downstream performance via heterogeneous integration, and (ii) introducing prototypes as a complementary design principle that mitigates real-world constraints. In the first study, on fashion e-commerce, we propose MDL-FR, an end-to-end framework that integrates visual and textual data and learns style prototypes capturing high-level structure, enabling style-aware outfit generation beyond compatibility-only recommendation. In the second study, on single-cell multi-omics integration, we propose CPG-AE, which replaces dense cell–cell interactions with a sparse cell–prototype graph and combines prototype-mediated message passing with a multi-modal fusion autoencoder to learn coherent joint embeddings. Across both domains, experimental results show that multi-modal architectures improve task performance, while learned prototypes provide compact anchors that enhance scalability, robustness to incomplete data, and evidence-oriented validation beyond aggregate metrics.
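The scalability claim above rests on routing interactions through a small set of prototypes instead of a dense cell–cell graph. The following is a minimal NumPy sketch of that idea only; the sizes, nearest-prototype assignment rule, and mixing weight `alpha` are illustrative assumptions, not the dissertation's CPG-AE implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embeddings: N cells, K prototypes (K << N), d dimensions.
N, K, d = 200, 8, 16
cells = rng.normal(size=(N, d))
prototypes = rng.normal(size=(K, d))

# Hard-assign each cell to its nearest prototype. This induces a
# sparse bipartite cell-prototype graph with N edges, versus the
# O(N^2) edges of a dense cell-cell interaction graph.
dists = np.linalg.norm(cells[:, None, :] - prototypes[None, :, :], axis=-1)
assign = dists.argmin(axis=1)                 # (N,) prototype index per cell

# One round of prototype-mediated message passing:
# cell -> prototype (mean of assigned cells), then prototype -> cell.
proto_msg = np.zeros_like(prototypes)
np.add.at(proto_msg, assign, cells)           # unbuffered scatter-add per prototype
counts = np.bincount(assign, minlength=K)
proto_msg /= np.maximum(counts, 1)[:, None]   # guard empty prototypes

# Each cell mixes in its prototype's aggregate; alpha is an assumed weight.
alpha = 0.5
cells_updated = (1 - alpha) * cells + alpha * proto_msg[assign]
```

Because each update touches only N cell–prototype edges, the cost per round is O(N·K·d) rather than O(N²·d), which is what makes the approach feasible as the dataset grows.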
Publisher
Ulsan National Institute of Science and Technology
Degree
Doctor
Major
Department of Industrial Engineering


Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.