| dc.description.abstract |
Modern information systems increasingly capture heterogeneous signals in multiple modalities, including images, text, audio/video, and structured logs, creating strong incentives for multi-modal learning. By integrating complementary information and enforcing cross-modal consistency, multi-modal models can learn unified representations that exploit this complementarity, thereby improving accuracy, robustness, and generalization. However, deploying these methods beyond curated benchmarks introduces additional real-world requirements that are not addressed by predictive performance alone. Real-world applications must remain feasible as datasets continuously grow, remain resilient to incomplete, noisy, or partially missing inputs, and provide checkable rationales for decisions.
This dissertation advances a unified perspective that pursues two objectives: (i) strengthening multi-modal representations to improve downstream performance via heterogeneous integration, and (ii) introducing prototypes as a complementary design principle to mitigate real-world constraints. In the first study, on fashion e-commerce, we propose MDL-FR, an end-to-end framework that integrates visual and textual data and learns style prototypes capturing high-level structure, enabling style-aware outfit generation beyond compatibility-only recommendation. In the second study, on single-cell multi-omics integration, we propose CPG-AE, which replaces dense cell–cell interactions with a sparse cell–prototype graph and combines prototype-mediated message passing with a multi-modal fusion autoencoder to learn coherent joint embeddings. Across both domains, experimental results show that multi-modal architectures improve task performance, while learned prototypes provide compact anchors that enhance scalability, robustness to incomplete data, and evidence-oriented validation beyond aggregate metrics. |
- |