JOURNAL OF CHEMICAL INFORMATION AND MODELING, v.65, no.22, pp.12155 - 12160
Abstract
Artificial intelligence (AI) is reshaping computational science, but AI-driven workflows routinely span heterogeneous tasks executed across diverse high-performance computing (HPC) systems. We introduce DPDispatcher, an open-source Python framework for scalable, fault-tolerant task scheduling in such environments with an emphasis on lightweight submission, automatic retries, and robust resumption. DPDispatcher separates connection and file-staging concerns from scheduler control, supports multiple HPC job managers, and provides both local and secure shell (SSH) backends. DPDispatcher has been adopted by more than ten scientific packages. Representative use cases include active learning for machine-learning potentials, free-energy and thermodynamic integration workflows, large-scale materials screening, and large language model (LLM)-driven agents that launch HPC computations. Across these settings, DPDispatcher reduces operational overhead and error rates while improving portability and automation for reliable, high-throughput scientific computing.
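The separation described above, with connection handling kept apart from scheduler control and combined with automatic retries and resumption, can be illustrated with a minimal Python sketch. This is not DPDispatcher's actual API. The names `LocalContext`, `ShellScheduler`, `run_tasks`, `record_path`, and `max_retries` are hypothetical and chosen only for illustration. File staging and real batch-system submission are omitted. The sketch assumes that a task succeeds when its command returns exit code 0, and that finished tasks are recorded in a small JSON file.

```python
import json
import subprocess
from pathlib import Path


class LocalContext:
    """Hypothetical connection layer: executes a command on this machine.
    An SSH variant would expose the same run() method."""

    def run(self, argv):
        # Returns the exit code of the launched process.
        return subprocess.run(argv).returncode


class ShellScheduler:
    """Hypothetical scheduler layer: turns a task into the command to launch.
    A batch-system variant would wrap the command in a submission call."""

    def build(self, task_argv):
        return list(task_argv)


def run_tasks(tasks, context, scheduler, record_path, max_retries=3):
    """Run named tasks with retries; skip tasks already recorded as finished."""
    record_path = Path(record_path)
    # Resumption: load the names of tasks finished in an earlier run, if any.
    done = set(json.loads(record_path.read_text())) if record_path.exists() else set()
    for name, argv in tasks.items():
        if name in done:
            continue
        for _attempt in range(max_retries):
            if context.run(scheduler.build(argv)) == 0:
                done.add(name)
                # Persist progress after every success so a crash loses little.
                record_path.write_text(json.dumps(sorted(done)))
                break
        else:
            # The loop ended without a successful attempt.
            raise RuntimeError(f"task {name!r} failed after {max_retries} attempts")
    return done
```

Because `run_tasks` only calls `context.run` and `scheduler.build`, either layer can be replaced without touching the other, which is the portability argument the abstract makes. Calling the function a second time with the same record file launches nothing for tasks that already finished.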