Deterministic neutron transport codes are a vital technology for high-fidelity reactor simulations, and many such codes employ the method of characteristics (MOC) to solve the neutron transport equation. With the growth of graphics processing unit (GPU) computing power, offloading the most time-consuming parts of these codes to GPUs has become practical. Among currently available GPU-enabled MOC codes, nTRACER employs a 2D/1D MOC solver, and ThorMOC uses a 3D MOC/diamond difference (DD) method. Both codes demonstrate solid GPU capabilities, but several aspects of their implementations leave room for improvement. STREAM is the Ulsan National Institute of Science and Technology (UNIST) Computational Reactor Physics and Experiment (CORE) Laboratory's code based on an in-house 3D MOC/DD method. STREAM has long been an attractive target for acceleration techniques because its computational scheme is more accurate but slower. The main challenge in offloading STREAM to GPUs is the high on-board memory demand: at least 48 GB of memory is required for a single 3D fuel assembly (FA). Further challenges are the sequential axial sweeping based on the Gauss-Seidel scheme and the radial domain decomposition, which limits the angular flux update to one FA per iteration before incoming and outgoing angular fluxes are synchronized. In this thesis, a GPU-enabled version of the code, named STREAM3D-GPU, is presented. It is based on a newly introduced axially decomposed, GPU-enabled 3D MOC/DD scheme: a modified version of the original 3D MOC/DD method, restructured and optimized for GPU execution. The scheme scales to any number of GPU cards with a significantly reduced on-board memory demand, not exceeding 4 GB per MPI process per GPU card.
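The memory benefit of axial decomposition can be illustrated with a minimal sketch. The plane counts, rank layout, and helper names below are illustrative assumptions, not the actual STREAM3D-GPU implementation: each MPI rank owns a contiguous slab of axial planes and drives one GPU card, so the per-device footprint shrinks roughly in proportion to the number of ranks.

```python
# Illustrative sketch of axial domain decomposition: contiguous slabs of
# axial planes are assigned to MPI ranks (one GPU card per rank). All
# names and numbers are assumptions for illustration only.

def axial_slab(n_planes: int, n_ranks: int, rank: int) -> range:
    """Return the contiguous range of axial plane indices owned by `rank`."""
    base, extra = divmod(n_planes, n_ranks)
    # Ranks below `extra` receive one additional plane so every plane is owned.
    start = rank * base + min(rank, extra)
    size = base + (1 if rank < extra else 0)
    return range(start, start + size)

def per_rank_memory_gb(total_gb: float, n_planes: int,
                       n_ranks: int, rank: int) -> float:
    """Rough per-rank memory if storage scales with the owned axial planes."""
    return total_gb * len(axial_slab(n_planes, n_ranks, rank)) / n_planes

# Hypothetical example: 48 GB of MOC data over 24 axial planes, 16 ranks.
slabs = [axial_slab(24, 16, r) for r in range(16)]
assert sum(len(s) for s in slabs) == 24        # every plane is owned exactly once
print([list(s) for s in slabs[:3]])            # first three slabs
print(per_rank_memory_gb(48.0, 24, 16, 0))     # GB on the largest rank
```

Under these assumed numbers the heaviest rank holds 4 GB, in line with the per-process bound quoted above; the actual memory layout of STREAM3D-GPU is described in the body of the thesis.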
As a result, STREAM3D-GPU can simulate large pressurized water reactor (PWR) depletion problems even on entry-level consumer-grade GPUs with as little as 6 GB of on-board memory. Performance-wise, STREAM3D-GPU was also found superior to the original CPU-optimized STREAM. For the OPR-1000 quarter-core depletion problem with thermal-hydraulic (TH) feedback, two GPU nodes with 8 GPU cards and 64 CPU cores each outperformed 8 CPU nodes with 256 CPU cores by more than a factor of two in total execution time. This result was achieved despite compiling the CPU parts of STREAM3D-GPU with nvfortran, which was found noticeably slower than the gfortran build used for the CPU reference. The introduced GPU-enabled MOC solver showed linear scalability with the number of employed GPUs, reaching over 15 times faster execution with 16 GPU cards and over 24 times faster execution with 24 GPU cards compared to 256 CPU cores. From a cost-efficiency standpoint, STREAM3D-GPU demonstrated a significant advantage over its CPU counterpart, primarily due to the shorter MOC and overall runtimes achieved on a more densely packed GPU system. A single GPU node with 8 GPUs requires 3.9 times less capital investment than a CPU-based cluster offering similar computational performance. When monthly expenses such as power consumption are considered, the GPU setup again showed a clear benefit, achieving over 4.4 times greater efficiency in both single-node and multi-node GPU configurations for the same calculation. On a per-device basis, each GPU card delivered computational performance comparable to at least 240 CPU cores, highlighting the efficiency of the GPU-enabled MOC implementation presented in this thesis.
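The per-device equivalence quoted above follows from simple arithmetic on the reported speedups against the 256-core CPU reference; the sketch below reproduces it, treating the reported speedups as exact lower bounds (the helper name is an illustrative assumption):

```python
# Back-of-the-envelope check of the per-GPU CPU-core equivalence,
# using the speedups vs. 256 CPU cores reported in the abstract.

def cpu_cores_per_gpu(speedup: float, n_gpus: int, ref_cores: int = 256) -> float:
    """CPU cores whose aggregate throughput one GPU card matches."""
    return speedup * ref_cores / n_gpus

print(cpu_cores_per_gpu(15.0, 16))  # 16 GPUs, >15x faster than 256 cores
print(cpu_cores_per_gpu(24.0, 24))  # 24 GPUs, >24x faster than 256 cores
```

Both configurations give at least 240 cores per GPU card, consistent with the "at least 240 CPU cores" figure stated above.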
Publisher: Ulsan National Institute of Science and Technology