Related Researcher

Jeon, Myeongjae
Research Interests
  • Parallel/distributed processing of deep learning workloads, Real-time stream data analytics at cloud/IoT scale, Public/private blockchain

Reliability of Large Scale GPU Clusters for Deep Learning Workloads

DC Field Value Language
dc.contributor.author Qian, Junjie ko
dc.contributor.author Kim, Taeyoon ko
dc.contributor.author Jeon, Myeongjae ko
dc.date.available 2021-10-01T01:43:38Z -
dc.date.created 2021-09-10 ko
dc.date.issued 2021-04-19 ko
dc.identifier.citation International World Wide Web Conference, pp.179 - 181 ko
dc.identifier.uri https://scholarworks.unist.ac.kr/handle/201301/54031 -
dc.description.abstract Recent advances in deep learning technologies have made GPU clusters popular as training platforms. In this paper, we study reliability issues, focusing on training job failures, by analyzing logs collected from deep learning workloads running on a large-scale GPU cluster in production. These failures fall largely into two categories, infrastructure and user, based on their sources, and have diverse underlying causes. Using insights from the failure analysis, we suggest several ways to improve the stability of shared GPU clusters designed for DL training and to improve the user experience by reducing failure occurrences. ko
dc.language English ko
dc.publisher Association for Computing Machinery, Inc ko
dc.title Reliability of Large Scale GPU Clusters for Deep Learning Workloads ko
dc.type CONFERENCE ko
dc.identifier.scopusid 2-s2.0-85107703409 ko
dc.identifier.wosid 000749534900025 ko
dc.type.rims CONF ko
dc.identifier.doi 10.1145/3442442.3452056 ko
dc.identifier.url https://dl.acm.org/doi/10.1145/3442442.3452056 ko
Appears in Collections:
CSE_Conference Papers

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
