NANOSCIENCE AND NANOTECHNOLOGY LETTERS, v.9, no.8, pp.1241 - 1245
Abstract
Low-power processors have emerged as an alternative for supercomputers and cloud computers to reduce their energy consumption. Many low-power processors are designed and used for mobile devices, such as phones, tablets, and laptops, and their reliability is commonly evaluated based on mobile use conditions. Supercomputers, on the other hand, are often used for high-performance computing (HPC), including complex calculations, graphics processing, and large-scale modeling and simulations, which generate heavy workloads and stress conditions. Therefore, the reliability of low-power processors in supercomputing systems may be a concern. This paper assessed the lifetime of low-power processors used in HPC environments. A failure modes, mechanisms, and effects analysis (FMMEA) identified thermo-mechanical fatigue at board-level solder joints as a critical failure mechanism of low-power processors. An HPC use condition was estimated based on the supercomputer use statistics collected from a supercomputing center. Experimental studies were conducted to correlate the use conditions with changes in the package temperature. A physics-of-failure-based reliability model was used to estimate the lifetime of low-power processors under multiple load ratios with or without active cooling. Recommendations for reliable operation of low-power processor-based supercomputers were presented.