File Download

There are no files associated with this item.

  • Find it @ UNIST can give you direct access to the published full text of this article. (UNISTARs only)


Full metadata record

dc.contributor.advisor: Kim, Hyounghun
dc.contributor.author: Oh, Jaeho
dc.date.accessioned: 2025-04-04T13:51:36Z
dc.date.available: 2025-04-04T13:51:36Z
dc.date.issued: 2025-02
dc.description.abstract: Contrastive Language-Image Pre-training (CLIP) has proven pivotal in extracting significant content information from images across a variety of tasks. By aligning textual and visual modalities, it facilitates comprehensive image understanding, capturing even minute details, including those extraneous to specific tasks. However, CLIP lacks reasoning capabilities and cannot be directly guided through text instructions. In contrast, large-scale pre-training and instruction tuning have demonstrated success in developing general-purpose language models with extensive competencies. Yet, the integration of these capabilities into image representation tasks remains underexplored. In this paper, we propose an efficient and innovative method for representing images in alignment with open-ended instructions by utilizing a pre-trained image encoder and a frozen text-only large language model (LLM). Our approach maps between the embedding spaces of these components, enabling the seamless integration of textual and visual information. The method takes images and corresponding open text instructions as input, producing image representations consistent with those instructions. By keeping both the LLM and vision encoder frozen and training only three lightweight linear layers, InstructionClip achieves remarkable efficiency, leveraging just 10,000 generated data points containing three types of semantic relational information: action, color, and background. Unlike traditional models, InstructionClip introduces a mechanism that enables it to adapt to user-provided instructions, even those unseen during training. By focusing on relationships such as object actions, image backgrounds, and object colors, the model successfully demonstrates its ability to interpret complex and specific search intents. We construct a controlled evaluation dataset tailored to these relationships, allowing us to rigorously assess the model's performance under instruction-driven scenarios. Through systematic experiments on our test dataset, InstructionClip shows a clear advantage over baseline models like CLIP, which primarily rely on visual similarity. The results highlight InstructionClip's ability to retrieve images that adhere to semantic relations explicitly described in the instructions, showcasing its flexibility and adaptability. This work underscores the potential for instruction-driven multimodal systems to be applied in real-world scenarios where user-defined control over image representations is critical. By addressing the limitations of existing approaches, InstructionClip opens up new possibilities for rich, multi-faceted interactions with image data guided by user-defined instructions.
dc.description.degree: Master
dc.description: Graduate School of Artificial Intelligence
dc.identifier.uri: https://scholarworks.unist.ac.kr/handle/201301/86590
dc.identifier.uri: http://unist.dcollection.net/common/orgView/200000868966
dc.language: ENG
dc.publisher: Ulsan National Institute of Science and Technology
dc.rights.embargoReleaseDate: 9999-12-31
dc.rights.embargoReleaseTerms: 9999-12-31
dc.subject: Computer vision
dc.subject: NLP
dc.title.alternative: 제어 가능한 표현 (Controllable Representation)
dc.title: CONTROLLABLE REPRESENTATION: TOWARDS FINE-GRAINED FOCUSING IMAGE FROM INSTRUCTION
dc.type: Thesis
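
The abstract describes the model in prose: a frozen CLIP vision encoder and a frozen text-only LLM, bridged by three trainable linear layers that map between their embedding spaces. The sketch below shows one plausible arrangement of such a bridge; the class name, dimensions, and layer placement are illustrative assumptions, not the thesis's actual implementation.

# A minimal, hypothetical sketch of the setup described in the abstract.
# The vision encoder and LLM stay frozen; only the three linear layers
# below would receive gradients. All names and dimensions are assumed.
import torch
import torch.nn as nn

class InstructionBridge(nn.Module):
    """Three trainable linear layers bridging frozen encoders."""
    def __init__(self, vision_dim=768, llm_dim=4096, rep_dim=768):
        super().__init__()
        self.vision_to_llm = nn.Linear(vision_dim, llm_dim)  # image embedding -> LLM space
        self.instr_to_llm = nn.Linear(llm_dim, llm_dim)      # instruction embedding -> shared space
        self.llm_to_rep = nn.Linear(llm_dim, rep_dim)        # fused embedding -> final representation

    def forward(self, image_emb, instr_emb):
        # image_emb: (B, vision_dim) from the frozen CLIP image encoder
        # instr_emb: (B, llm_dim) from the frozen LLM encoding the instruction
        fused = self.vision_to_llm(image_emb) + self.instr_to_llm(instr_emb)
        return self.llm_to_rep(torch.tanh(fused))

# Toy usage with random stand-ins for the frozen encoders' outputs.
bridge = InstructionBridge()
img = torch.randn(2, 768)     # would come from CLIP's image encoder
instr = torch.randn(2, 4096)  # would come from the frozen LLM
rep = bridge(img, instr)      # instruction-conditioned image representation
print(rep.shape)              # torch.Size([2, 768])

Because only the bridge parameters are trained, a setup like this stays lightweight, which is consistent with the abstract's claim of training on just 10,000 generated examples.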

Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.