This research investigates the application of Large Language Models (LLMs) to enhance data preparation pipelines in Predictive Process Monitoring (PPM). PPM, a critical tool for analyzing event logs to predict future process behaviors, often suffers from issues such as missing values, semantic inconsistencies, and data noise. The study demonstrates the potential of LLMs to address these imperfections by leveraging their contextual understanding to improve data quality and predictive accuracy.

The proposed LLM-driven pipeline integrates steps such as contextual transformation, information extraction, and text normalization, evaluated on two domain-specific datasets, Credit and Pub. Experimental results highlight the effectiveness of LLM-based imputation in handling semantic variability, particularly for homonym transformations, where performance metrics such as BERTScore and F1-scores show significant improvements. However, the study also identifies limitations, notably reduced performance under high synonym transformation levels and domain-specific linguistic complexities, especially in the Pub dataset.

Comparative analysis reveals that LLMs excel in scenarios requiring semantic understanding, offering advantages over traditional rule-based imputation methods in certain contexts. The research emphasizes the complementary potential of combining LLMs with classic approaches, suggesting hybrid models for robust and scalable data preparation pipelines.

This study contributes to the growing field of process mining by showcasing the feasibility of integrating advanced LLMs into PPM workflows. Future research directions include domain-specific fine-tuning, lightweight model development, and hybrid frameworks to optimize both automation and interpretability. Ultimately, these advancements aim to bridge the gap between raw data imperfections and actionable process insights, driving efficiency and accuracy in predictive analytics.
Publisher: Ulsan National Institute of Science and Technology