ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, v.161, pp.112200
Abstract
High-cost patients incur disproportionately high medical expenses, and identifying them proactively is crucial for effective healthcare management. Previous research has focused on identifying high-cost patients based on overall expenditure, while studies analyzing them in the context of specific diseases have been lacking. This study addresses this gap by using data from the National Health Insurance Service (NHIS) of South Korea, spanning 2015 to 2019, to develop predictive models for identifying high-cost patients. We trained models on data from 880,000 individuals to predict high-cost patients in 2019, using resource-efficient machine learning algorithms that minimize computational overhead, such as Extreme Gradient Boosting (XGBoost), Random Forest (RF), and Neural Networks (NN), and applying undersampling to handle class imbalance. We focused on the six major disease categories that account for the highest medical expenditures in South Korea: diseases of the musculoskeletal system (DMS), circulatory system (DCS), eye and ear (DEA-DEM), digestive system (DDS), genitourinary system (DGS), and respiratory system (DRS). Disease-specific analyses revealed important predictive factors that were not apparent in aggregate analyses. For example, hemoglobin levels emerged as a crucial predictor for DCS, while body mass index (BMI) proved essential for DMS prediction. These findings improve our understanding of the factors contributing to high medical costs and provide a foundational framework for healthcare providers and policymakers to develop more targeted and effective health management strategies.
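To illustrate the undersampling step mentioned above, the following is a minimal sketch of random undersampling, in which majority-class records are dropped at random until each class matches the size of the smallest class. The abstract does not state which undersampling variant was used, so random undersampling is an assumption here. The function name `random_undersample`, the toy data, and the 10% high-cost rate are hypothetical and are not taken from the NHIS data.

```python
import random
from collections import Counter


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows so that every class has as many rows as the smallest class.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    # Every class is reduced to the size of the smallest class.
    target = min(len(indices) for indices in by_class.values())

    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, target))
    kept.sort()  # preserve the original row order

    return [rows[i] for i in kept], [labels[i] for i in kept]


# Synthetic toy data (assumption): label 1 = high-cost patient, 0 = other.
labels = [1] * 10 + [0] * 90
rows = [[i, i % 7] for i in range(100)]  # first feature doubles as a row id

X_bal, y_bal = random_undersample(rows, labels, seed=42)
balanced_counts = Counter(y_bal)
print(balanced_counts)
```

The balanced sample could then be passed to any of the classifiers named in the abstract. Undersampling should be applied to the training split only, so that evaluation reflects the true class distribution.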