Full metadata record

DC Field Value Language
dc.citation.conferencePlace US -
dc.citation.conferencePlace Boston -
dc.citation.title USENIX Symposium on Networked Systems Design and Implementation -
dc.contributor.author Gu, Juncheng -
dc.contributor.author Chowdhury, Mosharaf -
dc.contributor.author Shin, Kang G. -
dc.contributor.author Zhu, Yibo -
dc.contributor.author Jeon, Myeongjae -
dc.contributor.author Qian, Junjie -
dc.contributor.author Liu, Hongqiang -
dc.contributor.author Guo, Chuanxiong -
dc.date.accessioned 2024-02-01T00:38:07Z -
dc.date.available 2024-02-01T00:38:07Z -
dc.date.created 2019-12-18 -
dc.date.issued 2019-02-26 -
dc.description.abstract Deep learning (DL) training jobs bring some unique challenges to existing cluster managers, such as unpredictable training times, an all-or-nothing execution model, and inflexibility in GPU sharing. Our analysis of a large GPU cluster in production shows that existing big data schedulers cause long queueing delays and low overall performance.
We present Tiresias, a GPU cluster manager tailored for distributed DL training jobs, which efficiently schedules and places DL jobs to reduce their job completion times (JCTs). Given that a DL job’s execution time is often unpredictable, we propose two scheduling algorithms – Discretized Two-Dimensional Gittins index relies on partial information and Discretized Two-Dimensional LAS is information-agnostic – that aim to minimize the average JCT. Additionally, we describe when the consolidated placement constraint can be relaxed, and present a placement algorithm to leverage these observations without any user input. Experiments on the Michigan ConFlux cluster with 60 P100 GPUs and large-scale trace-driven simulations show that Tiresias improves the average JCT by up to 5.5× over an Apache YARN-based resource manager used in production. More importantly, Tiresias’s performance is comparable to that of solutions assuming perfect knowledge.
-
dc.identifier.bibliographicCitation USENIX Symposium on Networked Systems Design and Implementation -
dc.identifier.scopusid 2-s2.0-85066897682 -
dc.identifier.uri https://scholarworks.unist.ac.kr/handle/201301/80114 -
dc.publisher USENIX -
dc.title Tiresias: A GPU Cluster Manager for Distributed Deep Learning -
dc.type Conference Paper -
dc.date.conferenceDate 2019-02-26 -
