Restrict Register Per Thread in CUDA #4574

YqWangcfd · 2025-07-18T08:42:09Z

YqWangcfd
Jul 18, 2025

How to restrict the maximum register usage per thread in amrex, when I are trying to profile the GPU program?
I have tried to add a line 'CUDA_MAXREGCOUNT := 128' in the GNUmakefile but it seemed not to work.

There are still output information like 'ptxas info : Used 240 registers, 592 bytes cmem[0]' during compilation.

Answered by WeiqunZhang

Jul 20, 2025

I thought maxrregcount is a hard ceiling for nvcc. But it is not. https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#maxrregcount-amount-maxrregcount

A value less than the minimum registers required by ABI will be bumped up by the compiler to ABI minimum limit.

I guess for some of the kernels it's impossible to run without bumping up the register counts. Then there is probably nothing you can do.

View full answer

WeiqunZhang · 2025-07-19T00:15:54Z

WeiqunZhang
Jul 19, 2025
Maintainer

It seems to work for me .

2 replies

WeiqunZhang Jul 19, 2025
Maintainer

Maybe there there was a typo. You can try to set it from command line. make USE_CUDA=TRUE CUDA_MAXREGCOUNT=128. You can also use make help to examine the flags.

$ make USE_CUDA=TRUE CUDA_MAXREGCOUNT=128 help | grep "maxrregcount=128"
    CXXFLAGS      = -ccbin=g++ -Xcompiler='-Werror=return-type -g1 -O3 -finline-limit=43210 -std=c++17  -pthread  -std=c++17' --std=c++17 -Wno-deprecated-gpu-targets -m64 -m
axrregcount=128 --expt-relaxed-constexpr --expt-extended-lambda --forward-unknown-to-host-compiler -Xcudafe --diag_suppress=esa_on_defaulted_function_ignored -Xcudafe --diag
_suppress=implicit_return_from_non_void_function --Werror cross-execution-space-call -lineinfo --ptxas-options=-O3 --ptxas-options=-v --use_fast_math  --Werror ext-lambda-ca
ptures-this --display-error-number --diag-error 20092 --relocatable-device-code=true --generate-code arch=compute_60,code=sm_60 -x cu -c
    CFLAGS        = -ccbin=g++ -Xcompiler='-Werror=return-type -g1 -O3 -finline-limit=43210 -std=c++17  -pthread  -std=c++17' --std=c++17 -Wno-deprecated-gpu-targets -m64 -m
axrregcount=128 --expt-relaxed-constexpr --expt-extended-lambda --forward-unknown-to-host-compiler -Xcudafe --diag_suppress=esa_on_defaulted_function_ignored -Xcudafe --diag
_suppress=implicit_return_from_non_void_function --Werror cross-execution-space-call -lineinfo --ptxas-options=-O3 --ptxas-options=-v --use_fast_math  --Werror ext-lambda-ca
ptures-this --display-error-number --diag-error 20092 --relocatable-device-code=true --generate-code arch=compute_60,code=sm_60 -x cu -c
    LINKFLAGS     = -Wno-deprecated-gpu-targets -m64 -maxrregcount=128 --expt-relaxed-constexpr --expt-extended-lambda --forward-unknown-to-host-compiler -Xcudafe --diag_sup
press=esa_on_defaulted_function_ignored -Xcudafe --diag_suppress=implicit_return_from_non_void_function --Werror cross-execution-space-call -lineinfo --ptxas-options=-O3 --p
txas-options=-v --use_fast_math  --Werror ext-lambda-captures-this --display-error-number --diag-error 20092 --relocatable-device-code=true --generate-code arch=compute_60,c
ode=sm_60 -ccbin=g++ -Xcompiler='-Werror=return-type -g1 -O3 -finline-limit=43210 -std=c++17  -pthread  -std=c++17' --std=c++17

YqWangcfd Jul 19, 2025
Author

Yes... By running make USE_CUDA=TRUE CUDA_MAXREGCOUNT=128 help | grep "maxrregcount=128", my outputs are just like yours.
However, there is still >128 per thread register in the ptxas info.

I also noticed that in AMReX there is a mechanism named as "launch__bounds"

amrex/Src/Base/AMReX_GpuLaunchGlobal.H

Line 24 in 7c99975

__launch_bounds__(amrex_launch_bounds_max_threads, min_blocks)

, where it restricts by default the minimum CUDA block per SM to one. Does this play an effective role in tolerating the register usage?

If it is, could please provide a Usage Example on how to use this mechanism to enforcedly lower down the per thread register?

WeiqunZhang · 2025-07-19T18:04:00Z

WeiqunZhang
Jul 19, 2025
Maintainer

I still think you had a typo.

There are still output information like 'ptxas info : Used 240 registers, 592 bytes cmem[0]' during compilation.

Could you provide more detail so that we can see exactly what flags nvcc gets? For example, I can see

nvcc ... -maxrregcount=128 ....
ptxas info    : Overriding maximum register limit 256 for '_ZN3cub17CUB_200802_SM_60011EmptyKernelIvEEvv' with  128 of maxrregcount option

4 replies

WeiqunZhang Jul 19, 2025
Maintainer

It seems that there are two possibilities. (1) nvcc never saw -maxrregcount=128 due to issues in makefile. (2) nvcc does not treat it as hard ceiling. I am not 100% sure. But I believe it was (1).

YqWangcfd Jul 20, 2025
Author

Thanks for your response.

When I run make USE_CUDA=TRUE CUDA_MAXREGCOUNT=128 help | grep "maxrregcount=128", I got

 make USE_CUDA=TRUE CUDA_MAXREGCOUNT=128 help | grep "maxrregcount=128"
    CXXFLAGS      = -ccbin=g++ -Xcompiler=' -Werror=return-type -g -O3 -finline-limit=43210 -std=c++17  -pthread  -std=c++17' --std=c++17 -Wno-deprecated-gpu-targets -m64 -maxrregcount=128 --expt-relaxed-constexpr --expt-extended-lambda --forward-unknown-to-host-compiler -Xcudafe --diag_suppress=esa_on_defaulted_function_ignored -Xcudafe --diag_suppress=implicit_return_from_non_void_function --Werror cross-execution-space-call -lineinfo --ptxas-options=-O3 --ptxas-options=-v --use_fast_math  --Werror ext-lambda-captures-this --display-error-number --diag-error 20092 --generate-code arch=compute_80,code=sm_80 -x cu -dc -Xptxas --disable-optimizer-constants
    CFLAGS        = -ccbin=g++ -Xcompiler=' -Werror=return-type -g -O3 -finline-limit=43210 -std=c++17  -pthread  -std=c++17' --std=c++17 -Wno-deprecated-gpu-targets -m64 -maxrregcount=128 --expt-relaxed-constexpr --expt-extended-lambda --forward-unknown-to-host-compiler -Xcudafe --diag_suppress=esa_on_defaulted_function_ignored -Xcudafe --diag_suppress=implicit_return_from_non_void_function --Werror cross-execution-space-call -lineinfo --ptxas-options=-O3 --ptxas-options=-v --use_fast_math  --Werror ext-lambda-captures-this --display-error-number --diag-error 20092 --generate-code arch=compute_80,code=sm_80 -x cu -dc
    LINKFLAGS     = -Wno-deprecated-gpu-targets -m64 -maxrregcount=128 --expt-relaxed-constexpr --expt-extended-lambda --forward-unknown-to-host-compiler -Xcudafe --diag_suppress=esa_on_defaulted_function_ignored -Xcudafe --diag_suppress=implicit_return_from_non_void_function --Werror cross-execution-space-call -lineinfo --ptxas-options=-O3 --ptxas-options=-v --use_fast_math  --Werror ext-lambda-captures-this --display-error-number --diag-error 20092 --generate-code arch=compute_80,code=sm_80 -ccbin=g++ -Xcompiler=' -Werror=return-type -g -O3 -finline-limit=43210 -std=c++17  -pthread  -std=c++17' --std=c++17

By compiling using make USE_CUDA=TRUE CUDA_MAXREGCOUNT=128, I presume this flag is passed to nvcc

nvcc -MMD -MP -ccbin=g++ -Xcompiler=' -Werror=return-type -g -O3 -finline-limit=43210 -std=c++17  -pthread  -std=c++17' --std=c++17 -Wno-deprecated-gpu-targets -m64 -maxrregcount=128 ...

But I cannot see the register limit override in the ptxas info like yours. Mine is like

ptxas info    : Used 255 registers, 1096 bytes cmem[0]
ptxas info    : Compiling entry function '_ZN5amrex13launch_globalILi256EZNS_6launchILi256EZNS_9ReduceOpsIJNS_11ReduceOpSumEEE4evalIiNS_10ReduceDataIJiEEEZNS_6Reduce3SumIiiZN4pele7physics9reactions11ReactorRK645reactEPdSE_SE_SE_RdSF_iP11CUstream_stEUliE0_vEET_T0_RKT1_SJ_EUliE_vEEvT_RT0_RKT1_EUlvE_EEvimSH_RKSK_EUlvE_EEvSK_' for 'sm_80'
ptxas info    : Function properties for _ZN5amrex13launch_globalILi256EZNS_6launchILi256EZNS_9ReduceOpsIJNS_11ReduceOpSumEEE4evalIiNS_10ReduceDataIJiEEEZNS_6Reduce3SumIiiZN4pele7physics9reactions11ReactorRK645reactEPdSE_SE_SE_RdSF_iP11CUstream_stEUliE0_vEET_T0_RKT1_SJ_EUliE_vEEvT_RT0_RKT1_EUlvE_EEvimSH_RKSK_EUlvE_EEvSK_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

WeiqunZhang Jul 20, 2025
Maintainer

I presume this flag is passed to nvcc
nvcc -MMD -MP -ccbin=g++ -Xcompiler=' -Werror=return-type -g -O3 -finline-limit=43210 -std=c++17 -pthread -std=c++17' --std=c++17 -Wno-deprecated-gpu-targets -m64 -maxrregcount=128 ...

Is that what you actually saw or what you think what happened?

WeiqunZhang Jul 20, 2025
Maintainer

Looks like you are compiling PelePhysics. My guess is you are compiling from a code that recursively calls make to compile PelelPhysics and not all flags are forwarded. If that is the case, the simplest approach might be modify the default value in your amrex source code.

YqWangcfd · 2025-07-20T02:51:55Z

YqWangcfd
Jul 20, 2025
Author

I have tried another case within a workspace without PelePhysics.

In this case, I have hard-coded the per thread register limit in nvcc from
NVCC_FLAGS = -Wno-deprecated-gpu-targets -m64 -maxrregcount=$(CUDA_MAXREGCOUNT) --expt-relaxed-constexpr --expt-extended-lambda --forward-unknown-to-host-compiler
to
NVCC_FLAGS = -Wno-deprecated-gpu-targets -m64 -maxrregcount=32 --expt-relaxed-constexpr --expt-extended-lambda --forward-unknown-to-host-compiler.

However, the issue does not get improved at all. It reports as

ptxas info    : Used 38 registers, 424 bytes cmem[0]
ptxas info    : Compiling entry function '_ZN5amrex13launch_globalILi256EZNS_11ParallelForILi256EZNS_12experimental6detail11ParallelForILi128ENS_8FabArrayINS_9FArrayBoxEEEZNS_3AddIS6_vEEvRNS5_IT_EERKSA_iiiRKNS_9IntVectNDILi2EEEEUliiiiiE_EENSt9enable_ifIXsr5amrex10IsFabArrayIT0_vEE5valueEvE4typeERKSK_SH_iSH_bRKT1_EUliiiE_Li2EEENSJ_IXsr5amrex19MaybeDeviceRunnableISK_vEE5valueEvE4typeERKNS_3Gpu10KernelInfoERKNS_5BoxNDIXT1_EEESO_EUlvE_EEvSK_' for 'sm_80'

4 replies

WeiqunZhang Jul 20, 2025
Maintainer

May I ask again?

Could you provide more detail so that we can see exactly what flags nvcc gets?

The most important missing information is the lines above ptaxas info where we can see exactly what flags are used by nvcc. It's possible that NVCC_FLAGS you set somehow was not used by nvcc.

YqWangcfd Jul 20, 2025
Author

Sorry for the delay.

The nvcc flags are as follows, if I specify 128 as the maximum per thread register.
nvcc -MMD -MP -ccbin=g++ -Xcompiler=' -Werror=return-type -g -O3 -finline-limit=43210 -std=c++17 -pthread -std=c++17' --std=c++17 -Wno-deprecated-gpu-targets -m64 -maxrregcount=128 --expt-relaxed-constexpr --expt-extended-lambda --forward-unknown-to-host-compiler -Xcudafe --diag_suppress=esa_on_defaulted_function_ignored -Xcudafe --diag_suppress=implicit_return_from_non_void_function --Werror cross-execution-space-call -lineinfo --ptxas-options=-O3 --use_fast_math --Werror ext-lambda-captures-this --display-error-number --diag-error 20092 --generate-code arch=compute_80,code=sm_80 -x cu -dc -Xptxas --disable-optimizer-constants ...

It looks like I successfully passed '128' to the parameter ‘-maxrregcount’, but the ptxas info indicates in many kernels/device functions the register usage spills out of this value.
I wonder how you got this result

nvcc ... -maxrregcount=128 ....
ptxas info : Overriding maximum register limit 256 for '_ZN3cub17CUB_200802_SM_60011EmptyKernelIvEEvv' with 128 of maxrregcount option

. Maybe I could have it tested on my own PC and then debug my program step by step.

WeiqunZhang Jul 20, 2025
Maintainer

Could you do the following? make clean, make ...... >& make.ou, and then upload the make.ou file.

I was compiling a test in amrex/Tests/.

YqWangcfd Jul 20, 2025
Author

Sure. Please see the attached. I have used

make -j10 CUDA_MAXREGCOUNT=32 USE_CUDA=TRUE >& make.txt

in the command line.

make.txt

WeiqunZhang · 2025-07-20T07:51:24Z

WeiqunZhang
Jul 20, 2025
Maintainer

I thought maxrregcount is a hard ceiling for nvcc. But it is not. https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#maxrregcount-amount-maxrregcount

A value less than the minimum registers required by ABI will be bumped up by the compiler to ABI minimum limit.

I guess for some of the kernels it's impossible to run without bumping up the register counts. Then there is probably nothing you can do.

0 replies

YqWangcfd · 2025-08-20T15:05:06Z

YqWangcfd
Aug 20, 2025
Author

Hi, back here again for some help.

I have encountered significant register pressure when using the per-thread model on GPUs, resulting in register spilling and reduced concurrency, by using the performance portable ParallelFor function https://amrex-codes.github.io/amrex/docs_html/Basics.html#parallelfor:~:text=%23ifdef%20AMREX_USE_OMP%0A%23pragma%20omp%20parallel%20if,j%2Ck)%3B%0A%20%20%20%20%7D)%3B%0A%20%20%7D.

I wonder if AMReX has had some existing approaches for solving this kind of problems, e.g., a per-block model or even a per-warp model in Ref. (https://dl.acm.org/doi/10.1145/2555243.2555258).

Thanks in advance!

2 replies

WeiqunZhang Aug 20, 2025
Maintainer

There is no concept of warp or block for CPUs. So you will need to write GPU specific code, using CUDA directly or amrex::launch.

amrex/Src/Base/AMReX_GpuLaunchFunctsG.H

Line 734 in 7886010

template<typename L>

YqWangcfd Aug 21, 2025
Author

Thanks. I will have a try.

Restrict Register Per Thread in CUDA #4574

Uh oh!

YqWangcfd Jul 18, 2025

Replies: 5 comments · 12 replies

Uh oh!

WeiqunZhang Jul 19, 2025 Maintainer

Uh oh!

WeiqunZhang Jul 19, 2025 Maintainer

Uh oh!

Uh oh!

YqWangcfd Jul 19, 2025 Author

Uh oh!

WeiqunZhang Jul 19, 2025 Maintainer

Uh oh!

WeiqunZhang Jul 19, 2025 Maintainer

Uh oh!

YqWangcfd Jul 20, 2025 Author

Uh oh!

WeiqunZhang Jul 20, 2025 Maintainer

Uh oh!

WeiqunZhang Jul 20, 2025 Maintainer

Uh oh!

YqWangcfd Jul 20, 2025 Author

Uh oh!

Uh oh!

WeiqunZhang Jul 20, 2025 Maintainer

Uh oh!

YqWangcfd Jul 20, 2025 Author

Uh oh!

WeiqunZhang Jul 20, 2025 Maintainer

Uh oh!

YqWangcfd Jul 20, 2025 Author

Uh oh!

WeiqunZhang Jul 20, 2025 Maintainer

Uh oh!

YqWangcfd Aug 20, 2025 Author

Uh oh!

WeiqunZhang Aug 20, 2025 Maintainer

Uh oh!

YqWangcfd Aug 21, 2025 Author

YqWangcfd
Jul 18, 2025

Replies: 5 comments 12 replies

WeiqunZhang
Jul 19, 2025
Maintainer

WeiqunZhang Jul 19, 2025
Maintainer

YqWangcfd Jul 19, 2025
Author

WeiqunZhang
Jul 19, 2025
Maintainer

WeiqunZhang Jul 19, 2025
Maintainer

YqWangcfd Jul 20, 2025
Author

WeiqunZhang Jul 20, 2025
Maintainer

WeiqunZhang Jul 20, 2025
Maintainer

YqWangcfd
Jul 20, 2025
Author

WeiqunZhang Jul 20, 2025
Maintainer

YqWangcfd Jul 20, 2025
Author

WeiqunZhang Jul 20, 2025
Maintainer

YqWangcfd Jul 20, 2025
Author

WeiqunZhang
Jul 20, 2025
Maintainer

YqWangcfd
Aug 20, 2025
Author

WeiqunZhang Aug 20, 2025
Maintainer

YqWangcfd Aug 21, 2025
Author