File Download

There are no files associated with this item.

  • Find it @ UNIST can give you direct access to the published full text of this article. (UNISTARs only)


Full metadata record

dc.contributor.advisor: Kim, Hyounghun
dc.contributor.author: Oh, Jaeho
dc.date.accessioned: 2025-04-04T13:51:36Z
dc.date.available: 2025-04-04T13:51:36Z
dc.date.issued: 2025-02
dc.description.abstract: Contrastive Language-Image Pre-training (CLIP) has proven pivotal in extracting significant content information from images across a variety of tasks. By aligning textual and visual modalities, it facilitates comprehensive image understanding, capturing even minute details, including those extraneous to specific tasks. However, CLIP lacks reasoning capabilities and cannot be directly guided through text instructions. In contrast, large-scale pre-training and instruction tuning have demonstrated success in developing general-purpose language models with extensive competencies. Yet, the integration of these capabilities into image representation tasks remains underexplored. In this paper, we propose an efficient and innovative method for representing images in alignment with open-ended instructions by utilizing a pre-trained image encoder and a frozen text-only large language model (LLM). Our approach maps between the embedding spaces of these components, enabling the seamless integration of textual and visual information. The method takes images and corresponding open text instructions as input, producing image representations consistent with those instructions. By keeping both the LLM and vision encoder frozen and training only three lightweight linear layers, InstructionClip achieves remarkable efficiency, leveraging just 10,000 generated data points containing three types of semantic relational information: action, color, and background. Unlike traditional models, InstructionClip introduces a mechanism that enables it to adapt to user-provided instructions, even those unseen during training. By focusing on relationships such as object actions, image backgrounds, and object colors, the model successfully demonstrates its ability to interpret complex and specific search intents. We construct a controlled evaluation dataset tailored to these relationships, allowing us to rigorously assess the model's performance under instruction-driven scenarios. Through systematic experiments on our test dataset, InstructionClip shows a clear advantage over baseline models like CLIP, which primarily rely on visual similarity. The results highlight InstructionClip's ability to retrieve images that adhere to semantic relations explicitly described in the instructions, showcasing its flexibility and adaptability. This work underscores the potential for instruction-driven multimodal systems to be applied in real-world scenarios where user-defined control over image representations is critical. By addressing the limitations of existing approaches, InstructionClip opens up new possibilities for rich, multi-faceted interactions with image data guided by user-defined instructions.
dc.description.degree: Master
dc.description: Graduate School of Artificial Intelligence
dc.identifier.uri: https://scholarworks.unist.ac.kr/handle/201301/86590
dc.identifier.uri: http://unist.dcollection.net/common/orgView/200000868966
dc.language: ENG
dc.publisher: Ulsan National Institute of Science and Technology
dc.rights.embargoReleaseDate: 9999-12-31
dc.rights.embargoReleaseTerms: 9999-12-31
dc.subject: Computer vision
dc.subject: NLP
dc.title.alternative: 제어 가능한 표현 (Controllable Representation)
dc.title: CONTROLLABLE REPRESENTATION: TOWARDS FINE-GRAINED FOCUSING IMAGE FROM INSTRUCTION
dc.type: Thesis
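
The abstract describes the model in prose: a frozen CLIP vision encoder and a frozen text-only LLM, bridged by three trainable linear layers that map between their embedding spaces. The sketch below shows one plausible arrangement of such a bridge; the class name, dimensions, and layer placement are illustrative assumptions, not the thesis's actual implementation.

# A minimal, hypothetical sketch of the setup described in the abstract.
# The vision encoder and LLM stay frozen; only the three linear layers
# below would receive gradients. All names and dimensions are assumed.
import torch
import torch.nn as nn

class InstructionBridge(nn.Module):
    """Three trainable linear layers bridging frozen encoders."""
    def __init__(self, vision_dim=768, llm_dim=4096, rep_dim=768):
        super().__init__()
        self.vision_to_llm = nn.Linear(vision_dim, llm_dim)  # image embedding -> LLM space
        self.instr_to_llm = nn.Linear(llm_dim, llm_dim)      # instruction embedding -> shared space
        self.llm_to_rep = nn.Linear(llm_dim, rep_dim)        # fused embedding -> final representation

    def forward(self, image_emb, instr_emb):
        # image_emb: (B, vision_dim) from the frozen CLIP image encoder
        # instr_emb: (B, llm_dim) from the frozen LLM encoding the instruction
        fused = self.vision_to_llm(image_emb) + self.instr_to_llm(instr_emb)
        return self.llm_to_rep(torch.tanh(fused))

# Toy usage with random stand-ins for the frozen encoders' outputs.
bridge = InstructionBridge()
img = torch.randn(2, 768)     # would come from CLIP's image encoder
instr = torch.randn(2, 4096)  # would come from the frozen LLM
rep = bridge(img, instr)      # instruction-conditioned image representation
print(rep.shape)              # torch.Size([2, 768])

Because only the bridge parameters are trained, a setup like this stays lightweight, which is consistent with the abstract's claim of training on just 10,000 generated examples.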

Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.