Related Researcher

전명재

Jeon, Myeongjae
OMNIA

Full metadata record

DC Field Value Language
dc.citation.conferencePlace SV -
dc.citation.conferencePlace Ljubljana -
dc.citation.endPage 181 -
dc.citation.startPage 179 -
dc.citation.title International World Wide Web Conference -
dc.contributor.author Qian, Junjie -
dc.contributor.author Kim, Taeyoon -
dc.contributor.author Jeon, Myeongjae -
dc.date.accessioned 2024-01-31T22:07:02Z -
dc.date.available 2024-01-31T22:07:02Z -
dc.date.created 2021-09-10 -
dc.date.issued 2021-04-19 -
dc.description.abstract Recent advances in deep learning technologies have made GPU clusters popular as training platforms. In this paper, we study reliability issues, focusing on training job failures, by analyzing logs collected from deep learning workloads running on a large-scale production GPU cluster. These failures fall largely into two categories, infrastructure and user, based on their sources, and arise from diverse causes. With insights obtained from the failure analysis, we suggest several ways to improve the stability of shared GPU clusters designed for DL training and to optimize user experience by reducing failure occurrences. -
dc.identifier.bibliographicCitation International World Wide Web Conference, pp.179 - 181 -
dc.identifier.doi 10.1145/3442442.3452056 -
dc.identifier.scopusid 2-s2.0-85107703409 -
dc.identifier.uri https://scholarworks.unist.ac.kr/handle/201301/77539 -
dc.identifier.url https://dl.acm.org/doi/10.1145/3442442.3452056 -
dc.identifier.wosid 000749534900025 -
dc.language English -
dc.publisher Association for Computing Machinery, Inc -
dc.title Reliability of Large Scale GPU Clusters for Deep Learning Workloads -
dc.type Conference Paper -
dc.date.conferenceDate 2021-04-19 -
