dc.citation.conferencePlace |
US |
- |
dc.citation.conferencePlace |
Boston |
- |
dc.citation.title |
USENIX Symposium on Networked Systems Design and Implementation |
- |
dc.contributor.author |
Gu, Juncheng |
- |
dc.contributor.author |
Chowdhury, Mosharaf |
- |
dc.contributor.author |
Shin, Kang G. |
- |
dc.contributor.author |
Zhu, Yibo |
- |
dc.contributor.author |
Jeon, Myeongjae |
- |
dc.contributor.author |
Qian, Junjie |
- |
dc.contributor.author |
Liu, Hongqiang |
- |
dc.contributor.author |
Guo, Chuanxiong |
- |
dc.date.accessioned |
2024-02-01T00:38:07Z |
- |
dc.date.available |
2024-02-01T00:38:07Z |
- |
dc.date.created |
2019-12-18 |
- |
dc.date.issued |
2019-02-26 |
- |
dc.description.abstract |
Deep learning (DL) training jobs bring some unique challenges to existing cluster managers, such as unpredictable training times, an all-or-nothing execution model, and inflexibility in GPU sharing. Our analysis of a large GPU cluster in production shows that existing big data schedulers cause long queueing delays and low overall performance. We present Tiresias, a GPU cluster manager tailored for distributed DL training jobs, which efficiently schedules and places DL jobs to reduce their job completion times (JCTs). Given that a DL job’s execution time is often unpredictable, we propose two scheduling algorithms – Discretized Two-Dimensional Gittins index relies on partial information and Discretized Two-Dimensional LAS is information-agnostic – that aim to minimize the average JCT. Additionally, we describe when the consolidated placement constraint can be relaxed, and present a placement algorithm to leverage these observations without any user input. Experiments on the Michigan ConFlux cluster with 60 P100 GPUs and large-scale trace-driven simulations show that Tiresias improves the average JCT by up to 5.5× over an Apache YARN-based resource manager used in production. More importantly, Tiresias’s performance is comparable to that of solutions assuming perfect knowledge. |
- |
dc.identifier.bibliographicCitation |
USENIX Symposium on Networked Systems Design and Implementation |
- |
dc.identifier.scopusid |
2-s2.0-85066897682 |
- |
dc.identifier.uri |
https://scholarworks.unist.ac.kr/handle/201301/80114 |
- |
dc.publisher |
USENIX |
- |
dc.title |
Tiresias: A GPU Cluster Manager for Distributed Deep Learning |
- |
dc.type |
Conference Paper |
- |
dc.date.conferenceDate |
2019-02-26 |
- |