Sibylla: To Retry or Not To Retry on Deep Learning Job Failure

Kim, Taeyoon; Jeong, Suyeon; Lee, Jongseop; Lee, Soobee; Jeon, Myeongjae

Full metadata record

DC Field	Value	Language
dc.citation.conferencePlace	US	-
dc.citation.conferencePlace	Carlsbad, CA	-
dc.citation.title	USENIX Annual Technical Conference	-
dc.contributor.author	Kim, Taeyoon	-
dc.contributor.author	Jeong, Suyeon	-
dc.contributor.author	Lee, Jongseop	-
dc.contributor.author	Lee, Soobee	-
dc.contributor.author	Jeon, Myeongjae	-
dc.date.accessioned	2024-01-31T20:08:52Z	-
dc.date.available	2024-01-31T20:08:52Z	-
dc.date.created	2022-07-18	-
dc.date.issued	2022-07-11	-
dc.description.abstract	GPUs are highly contended resources in shared clusters for deep learning (DL) training. However, our analysis of a real-world trace reveals that a non-negligible number of jobs running on the cluster fail and are blindly retried by the job scheduler. Unfortunately, these job failures often repeat and waste GPU resources, limiting effective GPU utilization across the cluster. In this paper, we introduce Sibylla, which informs the scheduler whether an observed DL training failure will repeat upon retry. Sibylla employs an RNN-based machine learning model that trains on the stdout and stderr logs of failed jobs and can be continuously updated on new log messages without hand-constructed labels for the new training samples. With Sibylla, the job scheduler is learning-enhanced: it retries a failed job only when the retry is highly likely to succeed. We evaluate the effectiveness of Sibylla under a variety of scenarios using trace-driven simulations. Sibylla improves cluster utilization and reduces job completion time (JCT) by up to 15%.	-
dc.identifier.bibliographicCitation	USENIX Annual Technical Conference	-
dc.identifier.uri	https://scholarworks.unist.ac.kr/handle/201301/75713	-
dc.identifier.url	https://www.usenix.org/conference/atc22/presentation/kim-taeyoon	-
dc.language	English	-
dc.publisher	USENIX	-
dc.title	Sibylla: To Retry or Not To Retry on Deep Learning Job Failure	-
dc.type	Conference Paper	-
dc.date.conferenceDate	2022-07-11	-
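
The retry decision described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the keyword-based predictor below is a toy stand-in for the RNN model, and every name here (`FailedJob`, `toy_retry_success_probability`, `should_retry`, the `threshold` and `max_retries` parameters, and the marker strings) is a hypothetical chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Callable

# Assumption: a few log substrings that suggest a deterministic failure.
# The paper learns this from stdout/stderr logs with an RNN; this tuple is
# only a toy substitute so the example stays self-contained.
FATAL_MARKERS = ("SyntaxError", "ModuleNotFoundError", "No such file or directory")


def toy_retry_success_probability(log_text: str) -> float:
    """Return a made-up probability that retrying the job would succeed."""
    if any(marker in log_text for marker in FATAL_MARKERS):
        return 0.05  # failure looks deterministic, so a retry likely repeats it
    return 0.9  # failure looks transient


@dataclass
class FailedJob:
    job_id: str
    log_text: str  # concatenated stdout and stderr of the failed run
    retries_so_far: int = 0


def should_retry(
    job: FailedJob,
    predictor: Callable[[str], float],
    threshold: float = 0.8,  # hypothetical cutoff for "highly likely to succeed"
    max_retries: int = 3,  # hypothetical cap on retries per job
) -> bool:
    """Retry only when the predictor rates success as likely enough."""
    if job.retries_so_far >= max_retries:
        return False
    return predictor(job.log_text) >= threshold
```

A scheduler built along these lines would call `should_retry` on each failure instead of retrying blindly, and any model that maps log text to a probability could replace the toy predictor.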
