Although remote sensing combined with machine learning techniques can effectively monitor harmful algal blooms, its application is often limited by data availability. The synergistic impacts of rapid urbanization and climate change contribute to the unprecedented occurrence of severe algal blooms, and models need sufficient high-concentration data to be trained successfully on such events. In this study, we evaluated the feasibility of integrating datasets from two different watersheds to estimate chlorophyll-a (Chl-a) concentrations using machine learning models with Sentinel-2 imagery. Six machine learning models were trained on three datasets: the original dataset, consisting of data from the Nakdong (ND) River, and two augmented datasets, namely an integrated dataset combining the Geum (GE) and ND rivers (GEND) and an ND dataset resampled using the synthetic minority oversampling technique for regression with Gaussian noise (ND-SMOGN). Models trained on the augmented datasets, GEND and ND-SMOGN, successfully addressed the underestimation of the sample with the highest Chl-a concentration. Among the six algorithms, the multilayer perceptron with an attention mechanism achieved the highest performance across all indicators, with coefficient of determination (R2) and root mean square error (RMSE) values of 0.93 and 2.76, respectively. Model interpretations revealed that models trained on GEND assigned high importance to bands B03 (560 nm) through B05 (705 nm), in line with the optical characteristics of Chl-a, whereas models trained on ND and ND-SMOGN also emphasized less relevant bands. This study provides insights into improving model performance and understanding the impacts of data availability, and it can inform the development of more accurate and reliable environmental management practices.
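For readers unfamiliar with the two indicators reported above, the following is a minimal sketch of how R2 and RMSE are conventionally computed from observed and predicted Chl-a values. The function names and the sample values are hypothetical and chosen only for illustration; they are not taken from the study, and the study's own evaluation code may differ.

```python
import math

def r2_score(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares between observations and predictions
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares around the observed mean
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def rmse(observed, predicted):
    """Root mean square error, in the same units as the inputs."""
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    return math.sqrt(mse)

# Hypothetical Chl-a values (illustrative only, not data from the study)
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 3.0, 4.0, 5.0]

print("R2:", r2_score(observed, predicted))
print("RMSE:", rmse(observed, predicted))
```

Because RMSE squares each error before averaging, a single badly underestimated high-concentration sample can dominate the score, which is one reason the treatment of extreme Chl-a values matters when comparing training datasets.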