Reliability of Large Scale GPU Clusters for Deep Learning Workloads
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Qian, Junjie | ko |
dc.contributor.author | Kim, Taeyoon | ko |
dc.contributor.author | Jeon, Myeongjae | ko |
dc.date.available | 2021-10-01T01:43:38Z | - |
dc.date.created | 2021-09-10 | ko |
dc.date.issued | 2021-04-19 | ko |
dc.identifier.citation | International World Wide Web Conference, pp.179 - 181 | ko |
dc.identifier.uri | https://scholarworks.unist.ac.kr/handle/201301/54031 | - |
dc.description.abstract | Recent advances in deep learning technologies have made GPU clusters popular as training platforms. In this paper, we study reliability issues, focusing on training job failures, by analyzing logs collected from deep learning workloads running on a large-scale production GPU cluster. These failures fall largely into two categories, infrastructure and user, based on their sources, and have diverse causes. With insights from the failure analysis, we suggest several ways to improve the stability of shared GPU clusters designed for DL training and to optimize user experience by reducing failure occurrences. | ko |
dc.language | English | ko |
dc.publisher | Association for Computing Machinery, Inc | ko |
dc.title | Reliability of Large Scale GPU Clusters for Deep Learning Workloads | ko |
dc.type | CONFERENCE | ko |
dc.identifier.scopusid | 2-s2.0-85107703409 | ko |
dc.identifier.wosid | 000749534900025 | ko |
dc.type.rims | CONF | ko |
dc.identifier.doi | 10.1145/3442442.3452056 | ko |
dc.identifier.url | https://dl.acm.org/doi/10.1145/3442442.3452056 | ko |