Deterministic neutron transport codes are a vital modern technology for high-fidelity reactor simulations. Many versions of such codes currently employ a method of characteristics (MOC) for solving the neutron transport equation. With the rise of graphics processing unit (GPU) computing power, offloading the most time-consuming parts of codes to GPU became possible. In this study, we present our novel GPU-enabled code, STREAM3D-GPU, with an MOC module offloaded to the GPU. It became possible due to a newly introduced, axially decomposed GPU-enabled three-dimensional (3D) MOC/diamond difference (DD) solver. This solver allowed us to reduce the GPU memory burden from 50 GB per MPI process to below 4.5 GB, thus enabling modern consumer-grade GPUs for large reactor calculations. We confirmed it for a 3D OPR-1000 quarter-core model with thermal-hydraulic feedback. We found that the axially decomposed GPU-enabled 3D MOC/DD solver in STREAM3D-GPU was faster by 22.1 times when using 8 GPU cards and by 40.5 times when using 16 GPU cards compared to a 128-core reference solution. Therefore, we estimated that one GPU card was equal to 324-353 parallel CPU cores. CPU nodes of comparable performance would cost 2.77-3.54 times more than a GPU system and consume 5.4-6.5 times more energy.