Tiresias: A GPU Cluster Manager for Distributed Deep Learning

Gu, Juncheng; Chowdhury, Mosharaf; Shin, Kang G.; Zhu, Yibo; Jeon, Myeongjae; Qian, Junjie; Liu, Hongqiang; Guo, Chuanxiong

Scholarworks@UNIST

Full metadata record

DC Field	Value	Language
dc.citation.conferencePlace	US	-
dc.citation.conferencePlace	Boston	-
dc.citation.title	USENIX Symposium on Networked Systems Design and Implementation	-
dc.contributor.author	Gu, Juncheng	-
dc.contributor.author	Chowdhury, Mosharaf	-
dc.contributor.author	Shin, Kang G.	-
dc.contributor.author	Zhu, Yibo	-
dc.contributor.author	Jeon, Myeongjae	-
dc.contributor.author	Qian, Junjie	-
dc.contributor.author	Liu, Hongqiang	-
dc.contributor.author	Guo, Chuanxiong	-
dc.date.accessioned	2024-02-01T00:38:07Z	-
dc.date.available	2024-02-01T00:38:07Z	-
dc.date.created	2019-12-18	-
dc.date.issued	2019-02-26	-
dc.description.abstract	Deep learning (DL) training jobs bring some unique challenges to existing cluster managers, such as unpredictable training times, an all-or-nothing execution model, and inflexibility in GPU sharing. Our analysis of a large GPU cluster in production shows that existing big data schedulers cause long queueing delays and low overall performance. We present Tiresias, a GPU cluster manager tailored for distributed DL training jobs, which efficiently schedules and places DL jobs to reduce their job completion times (JCTs). Given that a DL job’s execution time is often unpredictable, we propose two scheduling algorithms – Discretized Two-Dimensional Gittins index relies on partial information and Discretized Two-Dimensional LAS is information-agnostic – that aim to minimize the average JCT. Additionally, we describe when the consolidated placement constraint can be relaxed, and present a placement algorithm to leverage these observations without any user input. Experiments on the Michigan ConFlux cluster with 60 P100 GPUs and large-scale trace-driven simulations show that Tiresias improves the average JCT by up to 5.5× over an Apache YARN-based resource manager used in production. More importantly, Tiresias’s performance is comparable to that of solutions assuming perfect knowledge.	-
dc.identifier.bibliographicCitation	USENIX Symposium on Networked Systems Design and Implementation	-
dc.identifier.scopusid	2-s2.0-85066897682	-
dc.identifier.uri	https://scholarworks.unist.ac.kr/handle/201301/80114	-
dc.publisher	USENIX	-
dc.title	Tiresias: A GPU Cluster Manager for Distributed Deep Learning	-
dc.type	Conference Paper	-
dc.date.conferenceDate	2019-02-26	-
