
benchmark_inference: Add CLI option to enable thunder CUDAGraph Transform#2697

Merged
kshitij12345 merged 9 commits into main from ksh/bench-inf-cudagraph
Nov 4, 2025
Conversation

@kshitij12345
Collaborator

@kshitij12345 kshitij12345 commented Oct 27, 2025

Command

python thunder/benchmarks/benchmark_inference.py --input-length 32 --output-length 3 --mode thunder --num-iterations 10 --enable-thunder-cudagraph

NOTE: Need to revert 13f7171

Running the above command leads to the following error (it appears to fail during FusionDefinition execution):

Traceback (most recent call last):
  File "/opt/pytorch/lightning-thunder/thunder/transforms/cudagraph.py", line 130, in build_cuda_graph
    static_outputs = fn(*static_inputs)
                     ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 121, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/lightning-thunder/thunder/executors/torchex.py", line 169, in no_autocast_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "thunder.CUDAGraph5_39", line 116, in CUDAGraph5
  File "/opt/pytorch/lightning-thunder/thunder/executors/nvfuserex_impl.py", line 566, in __call__
    return fd.execute(
           ^^^^^^^^^^^
  File "/opt/pytorch/nvfuser/python/nvfuser_direct/__init__.py", line 318, in execute
    return self.fec.execute(
           ^^^^^^^^^^^^^^^^^
RuntimeError: 
Error from segmentation group 3: CUDA error: operation not permitted when stream is capturing
Search for `cudaErrorStreamCaptureUnsupported' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from memcpy_and_sync at /opt/pytorch/pytorch/c10/cuda/CUDAFunctions.h:106 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x88 (0x74e88646d008 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5cb0a (0x74e8bc9b2b0a in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1c8 (0x74e8bc9b27c8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)

@kshitij12345 kshitij12345 marked this pull request as ready for review October 27, 2025 13:54
Collaborator

@mattteochen mattteochen left a comment


Thank you @kshitij12345

@wujingyue
Collaborator

cc @mdavis36

@tbqh

tbqh commented Oct 28, 2025

This seems to be broken with different parameters:
python thunder/benchmarks/benchmark_inference.py --input-length 4096 --output-length 4 --mode thunder --enable-nv-linear --warmup-iterations 2 --num-iterations 2 --enable-thunder-cudagraph

Causes:

  File "/opt/pytorch/nvfuser/lightning-thunder/thunder/benchmarks/benchmark_inference.py", line 737, in <module>
    main()
  File "/opt/pytorch/nvfuser/lightning-thunder/thunder/benchmarks/benchmark_inference.py", line 719, in main
    benchmark = InferenceBenchmark(config)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/nvfuser/lightning-thunder/thunder/benchmarks/benchmark_inference.py", line 279, in __init__
    self.model = self._compile_model(model)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/nvfuser/lightning-thunder/thunder/benchmarks/benchmark_inference.py", line 308, in _compile_model
    return thunderfx(model, **self._thunder_jit_options)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/nvfuser/lightning-thunder/thunder/benchmarks/benchmark_inference.py", line 298, in _thunder_jit_options
    res["transforms"].append(CUDAGraphTransform())
    ~~~^^^^^^^^^^^^^^
KeyError: 'transforms'

Edit:
Looks like it's just the --enable-nv-linear flag. Is that flag expected to be incompatible with --enable-thunder-cudagraph?
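The traceback above shows the options builder doing `res["transforms"].append(CUDAGraphTransform())` against a dict that has no "transforms" entry yet when --enable-nv-linear builds the options first. A minimal sketch of the kind of guard that avoids the KeyError (names and structure here are illustrative, not the actual patch in this PR):

```python
# Hypothetical reconstruction of the options-building logic in
# benchmark_inference.py. The stand-in class avoids importing thunder.
class CUDAGraphTransform:  # stand-in for thunder.transforms.cudagraph.CUDAGraphTransform
    pass

def build_thunder_jit_options(enable_nv_linear: bool, enable_cudagraph: bool) -> dict:
    res: dict = {}
    if enable_nv_linear:
        # Assumed: this path populates other options but never creates
        # res["transforms"], which is what triggered the KeyError.
        res["nv_enable_linear"] = True
    if enable_cudagraph:
        # res["transforms"].append(...) would raise KeyError here;
        # setdefault creates the list on first use instead.
        res.setdefault("transforms", []).append(CUDAGraphTransform())
    return res

opts = build_thunder_jit_options(enable_nv_linear=True, enable_cudagraph=True)
print(len(opts["transforms"]))  # 1
```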

@kshitij12345
Collaborator Author

@tbqh Thanks for reporting; I have pushed a patch to fix the issue.

@mattteochen
Collaborator

Just an FYI for anyone testing this: the transform + NVIDIA/Fuser#5434 (comment) are expected to work on Blackwell.

@kshitij12345 kshitij12345 enabled auto-merge (squash) October 30, 2025 11:10
@kshitij12345
Collaborator Author

Ping @KaelanDt for review

Collaborator

@KaelanDt KaelanDt left a comment


thank you @kshitij12345

@tbqh

tbqh commented Nov 4, 2025

Thanks for the --enable-nv-linear fix, this PR is working well.

Before:

Time Between Output Tokens (TBOT): 7.25 ms
Prefill Time: 15.68 ms
Decode Time: 7.25 ms

After:

Time Between Output Tokens (TBOT): 4.47 ms
Prefill Time: 15.27 ms
Decode Time: 4.47 ms
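Working out the improvement from the numbers above: decode time (and TBOT) drops from 7.25 ms to 4.47 ms per token with the CUDAGraph transform enabled, while prefill is essentially unchanged.

```python
# Quick arithmetic on the reported benchmark numbers.
before_ms, after_ms = 7.25, 4.47
speedup = before_ms / after_ms
print(f"decode speedup: {speedup:.2f}x")  # decode speedup: 1.62x
```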

@kshitij12345 kshitij12345 merged commit 77261d1 into main Nov 4, 2025
51 checks passed
@kshitij12345 kshitij12345 deleted the ksh/bench-inf-cudagraph branch November 4, 2025 10:29