| DC Field | Value | Language |
| --- | --- | --- |
| dc.citation.conferencePlace | US | - |
| dc.citation.title | Empirical Methods in Natural Language Processing | - |
| dc.contributor.author | Yoon, Hyungjun | - |
| dc.contributor.author | Tolera, Biniyam Aschalew | - |
| dc.contributor.author | Gong, Taesik | - |
| dc.contributor.author | Lee, Kimin | - |
| dc.contributor.author | Lee, Sung-Ju | - |
| dc.date.accessioned | 2024-12-02T12:05:06Z | - |
| dc.date.available | 2024-12-02T12:05:06Z | - |
| dc.date.created | 2024-11-30 | - |
| dc.date.issued | 2024-11-12 | - |
| dc.description.abstract | Large language models (LLMs) have demonstrated exceptional abilities across various domains. However, utilizing LLMs for ubiquitous sensing applications remains challenging as existing text-prompt methods show significant performance degradation when handling long sensor data sequences. We propose a visual prompting approach for sensor data using multimodal LLMs (MLLMs). We design a visual prompt that directs MLLMs to utilize visualized sensor data alongside the target sensory task descriptions. Additionally, we introduce a visualization generator that automates the creation of optimal visualizations tailored to a given sensory task, eliminating the need for prior task-specific knowledge. We evaluated our approach on nine sensory tasks involving four sensing modalities, achieving an average of 10% higher accuracy than text-based prompts and reducing token costs by 15.8×. Our findings highlight the effectiveness and cost-efficiency of visual prompts with MLLMs for various sensory tasks. The source code is available at https://github.com/diamond264/ByMyEyes. | - |
| dc.identifier.bibliographicCitation | Empirical Methods in Natural Language Processing | - |
| dc.identifier.uri | https://scholarworks.unist.ac.kr/handle/201301/84658 | - |
| dc.publisher | EMNLP | - |
| dc.title | By My Eyes: Grounding Multimodal Large Language Models with Sensor Data via Visual Prompting | - |
| dc.type | Conference Paper | - |
| dc.date.conferenceDate | 2024-11-12 | - |
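
For illustration only, the sketch below shows the general visual-prompting idea summarized in the abstract: render a sensor time series as an image and pair it with a sensory-task description for a multimodal LLM. The function names (`visualize_sensor_data`, `build_visual_prompt`), the synthetic signal, and the message structure are assumptions made for this example and are not the paper's implementation; refer to the linked repository for the authors' code.

```python
# Minimal sketch, assuming matplotlib for rendering and a generic
# text+image message format. Names and structure are illustrative,
# not the paper's API.
import base64
import io

import matplotlib.pyplot as plt
import numpy as np


def visualize_sensor_data(signal: np.ndarray, sampling_rate: int) -> bytes:
    """Render a 1-D sensor signal (e.g., one accelerometer axis) as a PNG plot."""
    t = np.arange(len(signal)) / sampling_rate
    fig, ax = plt.subplots(figsize=(6, 2))
    ax.plot(t, signal)
    ax.set_xlabel("time (s)")
    ax.set_ylabel("amplitude")
    buf = io.BytesIO()
    fig.savefig(buf, format="png", bbox_inches="tight")
    plt.close(fig)
    return buf.getvalue()


def build_visual_prompt(image_png: bytes, task_description: str) -> dict:
    """Pack the rendered image and the task description into a generic
    multimodal message; adapt the structure to whichever MLLM API you use."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": task_description},
            {"type": "image", "data": base64.b64encode(image_png).decode("ascii")},
        ],
    }


if __name__ == "__main__":
    # Synthetic signal standing in for real sensor data.
    rng = np.random.default_rng(0)
    signal = np.sin(np.linspace(0, 20 * np.pi, 1000)) + 0.1 * rng.standard_normal(1000)
    prompt = build_visual_prompt(
        visualize_sensor_data(signal, sampling_rate=100),
        "Classify the activity shown in this accelerometer trace "
        "(e.g., walking, running, sitting).",
    )
```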