Optimizing Data Preparation Pipelines for Predictive Process Monitoring with Large Language Models

Kim, Yeonsu

Scholarworks@UNIST

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Comuzzi, Marco	-
dc.contributor.author	Kim, Yeonsu	-
dc.date.accessioned	2025-04-04T13:49:43Z	-
dc.date.available	2025-04-04T13:49:43Z	-
dc.date.issued	2025-02	-
dc.description.abstract	This research investigates the application of Large Language Models (LLMs) to enhance data preparation pipelines in Predictive Process Monitoring (PPM). PPM, a critical tool for analyzing event logs to predict future process behaviors, often suffers from issues such as missing values, semantic inconsistencies, and data noise. The study demonstrates the potential of LLMs to address these imperfections by leveraging their contextual understanding to improve data quality and predictive accuracy. The proposed LLM-driven pipeline integrates steps such as contextual transformation, information extraction, and text normalization, evaluated on two domain-specific datasets, Credit and Pub. Experimental results highlight the effectiveness of LLM-based imputation in handling semantic variability, particularly for homonym transformations, where performance metrics such as BERTScore and F1-scores show significant improvements. However, the study also identifies limitations, notably reduced performance under high synonym transformation levels and domain-specific linguistic complexities, especially in the Pub dataset. Comparative analysis reveals that LLMs excel in scenarios requiring semantic understanding, offering advantages over traditional rule-based imputation methods in certain contexts. The research emphasizes the complementary potential of combining LLMs with classic approaches, suggesting hybrid models for robust and scalable data preparation pipelines. This study contributes to the growing field of process mining by showcasing the feasibility of integrating advanced LLMs into PPM workflows. Future research directions include domain-specific fine-tuning, lightweight model development, and hybrid frameworks to optimize both automation and interpretability. Ultimately, these advancements aim to bridge the gap between raw data imperfections and actionable process insights, driving efficiency and accuracy in predictive analytics.	-
dc.description.degree	Master	-
dc.description	Department of Industrial Engineering	-
dc.identifier.uri	https://scholarworks.unist.ac.kr/handle/201301/86487	-
dc.identifier.uri	http://unist.dcollection.net/common/orgView/200000865871	-
dc.language	ENG	-
dc.publisher	Ulsan National Institute of Science and Technology	-
dc.subject	Process Mining	-
dc.title	Optimizing Data Preparation Pipelines for Predictive Process Monitoring with Large Language Models	-
dc.type	Thesis	-


Tel : 052-217-1403 / Email : scholarworks@unist.ac.kr

ScholarWorks@UNIST was established as an OAK Project for the National Library of Korea.