
[BugFix] Fix CUDA graph capture for Bounded spec projection #3453

Merged
vmoens merged 1 commit into main from fix/cudagraph-bounded-spec on Feb 6, 2026


@vmoens
Collaborator

@vmoens vmoens commented Feb 6, 2026

Summary

Fix CUDA graph capture compatibility for Bounded spec by caching device-specific bounds tensors.

Problem:

  • Bounded._project() and Bounded.is_in() called .to(device) on low/high bounds during every forward pass
  • This created DeviceCopy operations incompatible with CUDA graph capture
  • Resulted in "operation not permitted when stream is capturing" errors
  • Also caused graph partitioning warnings, reducing performance gains

Solution:

  • Add a _get_space_bounds(device) helper that caches the (low, high) tensors per device
  • The helper short-circuits when the spec device matches the input device (the common case)
  • During warmup the cache is populated; during capture and replay the cached tensors are reused, so no device copy is issued
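
The caching idea can be sketched roughly as follows. This is a simplified illustration, not the code merged in this PR: the class name BoundedSketch and the _bounds_cache attribute are hypothetical, and only the method name _get_space_bounds comes from the description above.

```python
import torch


class BoundedSketch:
    """Hypothetical stand-in for a bounded spec; not TorchRL's Bounded class."""

    def __init__(self, low, high, shape, device="cpu"):
        self.device = torch.device(device)
        self.low = torch.full(shape, float(low), device=self.device)
        self.high = torch.full(shape, float(high), device=self.device)
        # Assumed cache layout: torch.device -> (low, high) on that device.
        self._bounds_cache = {}

    def _get_space_bounds(self, device):
        device = torch.device(device)
        # Common case: the input lives on the spec's device, so nothing is copied.
        if device == self.device:
            return self.low, self.high
        cached = self._bounds_cache.get(device)
        if cached is None:
            # The device copy happens only on the first call for this device
            # (the warmup run); later calls reuse the stored tensors.
            cached = (self.low.to(device), self.high.to(device))
            self._bounds_cache[device] = cached
        return cached

    def project(self, x):
        low, high = self._get_space_bounds(x.device)
        # Clamp element-wise into [low, high].
        return torch.minimum(torch.maximum(x, low), high)

    def is_in(self, x):
        low, high = self._get_space_bounds(x.device)
        return bool(((x >= low) & (x <= high)).all())


spec = BoundedSketch(low=-1, high=1, shape=(3,))
projected = spec.project(torch.tensor([-2.0, 0.5, 3.0]))
print(projected, spec.is_in(projected))
```

Because the only .to(device) call sits behind the cache miss, a warmup call on each device is enough to keep copies out of the captured region in this sketch.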

Changes

  • Added Bounded._get_space_bounds() method for lazy per-device bounds caching
  • Updated Bounded._project() to use cached bounds
  • Updated Bounded.is_in() to use cached bounds

Test plan

Verified on cluster with CUDA graph capture:

import torch
from torchrl.data.tensor_specs import Bounded
from tensordict.nn import CudaGraphModule

spec = Bounded(low=-1, high=1, shape=(4,), device="cuda")
action = torch.randn(4, device="cuda")

def project_fn(x):
    return spec.project(x)

# Previously failed, now works:
cgm = CudaGraphModule(project_fn)
result = cgm(action)  # Success!

Also tested with Dreamer collector policy compilation with cudagraphs=True:

Setting up CudaGraphModule with stream <torch.cuda.Stream device=cuda:1 ...> on device cuda:1.
Registering CUDA graph...
CUDA graph successfully registered.


@pytorch-bot

pytorch-bot bot commented Feb 6, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/3453

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed label Feb 6, 2026
@github-actions github-actions bot added the BugFix label Feb 6, 2026
@github-actions
Contributor

github-actions bot commented Feb 6, 2026

$\color{#D29922}\textsf{\Large⚠\kern{0.2cm}\normalsize Warning}$ Result of CPU Benchmark Tests

Total Benchmarks: 173. Improved: $\large\color{#35bf28}20$. Worsened: $\large\color{#d91a1a}11$.

Name Max Mean Ops Ops on Repo HEAD Change
test_tensor_to_bytestream_speed[pickle] 80.5809μs 78.9921μs 12.6595 KOps/s 11.6371 KOps/s $\textbf{\color{#35bf28}+8.79\%}$
test_tensor_to_bytestream_speed[torch.save] 0.1377ms 0.1370ms 7.2997 KOps/s 6.7917 KOps/s $\textbf{\color{#35bf28}+7.48\%}$
test_tensor_to_bytestream_speed[untyped_storage] 0.1079s 0.1076s 9.2924 Ops/s 9.4897 Ops/s $\color{#d91a1a}-2.08\%$
test_tensor_to_bytestream_speed[numpy] 2.6082μs 2.5970μs 385.0640 KOps/s 402.0182 KOps/s $\color{#d91a1a}-4.22\%$
test_tensor_to_bytestream_speed[safetensors] 37.3257μs 37.1357μs 26.9283 KOps/s 25.0581 KOps/s $\textbf{\color{#35bf28}+7.46\%}$
test_simple 0.5470s 0.5455s 1.8333 Ops/s 1.7477 Ops/s $\color{#35bf28}+4.90\%$
test_transformed 1.1264s 1.1256s 0.8884 Ops/s 0.8631 Ops/s $\color{#35bf28}+2.93\%$
test_serial 1.7303s 1.7279s 0.5787 Ops/s 0.5726 Ops/s $\color{#35bf28}+1.07\%$
test_parallel 1.2177s 1.1348s 0.8812 Ops/s 0.8920 Ops/s $\color{#d91a1a}-1.21\%$
test_step_mdp_speed[True-True-True-True-True] 0.3638ms 44.9211μs 22.2612 KOps/s 21.8800 KOps/s $\color{#35bf28}+1.74\%$
test_step_mdp_speed[True-True-True-True-False] 71.3710μs 25.2363μs 39.6254 KOps/s 38.9279 KOps/s $\color{#35bf28}+1.79\%$
test_step_mdp_speed[True-True-True-False-True] 57.6810μs 25.3270μs 39.4835 KOps/s 38.2800 KOps/s $\color{#35bf28}+3.14\%$
test_step_mdp_speed[True-True-True-False-False] 45.0600μs 14.0030μs 71.4134 KOps/s 69.6196 KOps/s $\color{#35bf28}+2.58\%$
test_step_mdp_speed[True-True-False-True-True] 81.9510μs 48.1505μs 20.7682 KOps/s 20.0827 KOps/s $\color{#35bf28}+3.41\%$
test_step_mdp_speed[True-True-False-True-False] 51.0710μs 27.9832μs 35.7358 KOps/s 34.8855 KOps/s $\color{#35bf28}+2.44\%$
test_step_mdp_speed[True-True-False-False-True] 59.5310μs 27.7796μs 35.9976 KOps/s 34.4416 KOps/s $\color{#35bf28}+4.52\%$
test_step_mdp_speed[True-True-False-False-False] 42.3010μs 16.8897μs 59.2076 KOps/s 58.5434 KOps/s $\color{#35bf28}+1.13\%$
test_step_mdp_speed[True-False-True-True-True] 86.5910μs 51.1771μs 19.5400 KOps/s 19.3611 KOps/s $\color{#35bf28}+0.92\%$
test_step_mdp_speed[True-False-True-True-False] 60.8110μs 31.2678μs 31.9817 KOps/s 31.5930 KOps/s $\color{#35bf28}+1.23\%$
test_step_mdp_speed[True-False-True-False-True] 61.5910μs 27.7601μs 36.0229 KOps/s 34.7790 KOps/s $\color{#35bf28}+3.58\%$
test_step_mdp_speed[True-False-True-False-False] 63.2010μs 16.8616μs 59.3065 KOps/s 59.1894 KOps/s $\color{#35bf28}+0.20\%$
test_step_mdp_speed[True-False-False-True-True] 85.2010μs 53.0959μs 18.8338 KOps/s 18.4784 KOps/s $\color{#35bf28}+1.92\%$
test_step_mdp_speed[True-False-False-True-False] 80.4300μs 33.6475μs 29.7199 KOps/s 29.4746 KOps/s $\color{#35bf28}+0.83\%$
test_step_mdp_speed[True-False-False-False-True] 61.5410μs 30.1534μs 33.1637 KOps/s 31.8420 KOps/s $\color{#35bf28}+4.15\%$
test_step_mdp_speed[True-False-False-False-False] 51.8210μs 19.4856μs 51.3200 KOps/s 50.7352 KOps/s $\color{#35bf28}+1.15\%$
test_step_mdp_speed[False-True-True-True-True] 82.5310μs 51.6291μs 19.3689 KOps/s 19.4976 KOps/s $\color{#d91a1a}-0.66\%$
test_step_mdp_speed[False-True-True-True-False] 58.5610μs 30.8902μs 32.3727 KOps/s 31.9485 KOps/s $\color{#35bf28}+1.33\%$
test_step_mdp_speed[False-True-True-False-True] 2.2935ms 31.7173μs 31.5285 KOps/s 30.2624 KOps/s $\color{#35bf28}+4.18\%$
test_step_mdp_speed[False-True-True-False-False] 45.1000μs 18.2347μs 54.8404 KOps/s 53.5381 KOps/s $\color{#35bf28}+2.43\%$
test_step_mdp_speed[False-True-False-True-True] 0.1263ms 53.6207μs 18.6495 KOps/s 18.9634 KOps/s $\color{#d91a1a}-1.66\%$
test_step_mdp_speed[False-True-False-True-False] 58.2400μs 32.9109μs 30.3851 KOps/s 29.2680 KOps/s $\color{#35bf28}+3.82\%$
test_step_mdp_speed[False-True-False-False-True] 73.8610μs 33.6910μs 29.6815 KOps/s 28.1893 KOps/s $\textbf{\color{#35bf28}+5.29\%}$
test_step_mdp_speed[False-True-False-False-False] 92.4410μs 20.3448μs 49.1525 KOps/s 46.6230 KOps/s $\textbf{\color{#35bf28}+5.43\%}$
test_step_mdp_speed[False-False-True-True-True] 92.4610μs 55.5075μs 18.0156 KOps/s 17.5472 KOps/s $\color{#35bf28}+2.67\%$
test_step_mdp_speed[False-False-True-True-False] 70.5800μs 36.0511μs 27.7384 KOps/s 26.7778 KOps/s $\color{#35bf28}+3.59\%$
test_step_mdp_speed[False-False-True-False-True] 65.3910μs 33.6804μs 29.6908 KOps/s 28.4676 KOps/s $\color{#35bf28}+4.30\%$
test_step_mdp_speed[False-False-True-False-False] 49.7400μs 20.6319μs 48.4687 KOps/s 46.7144 KOps/s $\color{#35bf28}+3.76\%$
test_step_mdp_speed[False-False-False-True-True] 0.1076ms 58.2924μs 17.1549 KOps/s 16.5785 KOps/s $\color{#35bf28}+3.48\%$
test_step_mdp_speed[False-False-False-True-False] 69.1510μs 37.9169μs 26.3735 KOps/s 25.3213 KOps/s $\color{#35bf28}+4.16\%$
test_step_mdp_speed[False-False-False-False-True] 68.3410μs 35.7408μs 27.9793 KOps/s 26.5293 KOps/s $\textbf{\color{#35bf28}+5.47\%}$
test_step_mdp_speed[False-False-False-False-False] 62.3500μs 22.9963μs 43.4853 KOps/s 41.6360 KOps/s $\color{#35bf28}+4.44\%$
test_non_tensor_env_rollout_speed[1000-single-True] 0.8653s 0.7717s 1.2959 Ops/s 1.2920 Ops/s $\color{#35bf28}+0.30\%$
test_non_tensor_env_rollout_speed[1000-single-False] 0.7297s 0.6331s 1.5796 Ops/s 1.5701 Ops/s $\color{#35bf28}+0.60\%$
test_non_tensor_env_rollout_speed[1000-serial-no-buffers-True] 1.7687s 1.6824s 0.5944 Ops/s 0.5923 Ops/s $\color{#35bf28}+0.35\%$
test_non_tensor_env_rollout_speed[1000-serial-no-buffers-False] 1.5337s 1.4535s 0.6880 Ops/s 0.6807 Ops/s $\color{#35bf28}+1.08\%$
test_non_tensor_env_rollout_speed[1000-serial-buffers-True] 1.9948s 1.9180s 0.5214 Ops/s 0.5139 Ops/s $\color{#35bf28}+1.45\%$
test_non_tensor_env_rollout_speed[1000-serial-buffers-False] 1.7765s 1.7001s 0.5882 Ops/s 0.5811 Ops/s $\color{#35bf28}+1.23\%$
test_non_tensor_env_rollout_speed[1000-parallel-no-buffers-True] 4.7832s 4.6828s 0.2135 Ops/s 0.2169 Ops/s $\color{#d91a1a}-1.54\%$
test_non_tensor_env_rollout_speed[1000-parallel-no-buffers-False] 4.5785s 4.4187s 0.2263 Ops/s 0.2259 Ops/s $\color{#35bf28}+0.16\%$
test_non_tensor_env_rollout_speed[1000-parallel-buffers-True] 2.0394s 1.9920s 0.5020 Ops/s 0.5069 Ops/s $\color{#d91a1a}-0.96\%$
test_non_tensor_env_rollout_speed[1000-parallel-buffers-False] 1.7496s 1.6717s 0.5982 Ops/s 0.5856 Ops/s $\color{#35bf28}+2.15\%$
test_values[generalized_advantage_estimate-True-True] 10.0069ms 9.7818ms 102.2309 Ops/s 99.8056 Ops/s $\color{#35bf28}+2.43\%$
test_values[vec_generalized_advantage_estimate-True-True] 17.4525ms 11.5988ms 86.2158 Ops/s 90.6036 Ops/s $\color{#d91a1a}-4.84\%$
test_values[td0_return_estimate-False-False] 0.2027ms 0.1188ms 8.4164 KOps/s 7.7133 KOps/s $\textbf{\color{#35bf28}+9.11\%}$
test_values[td1_return_estimate-False-False] 29.1659ms 27.5651ms 36.2777 Ops/s 35.9825 Ops/s $\color{#35bf28}+0.82\%$
test_values[vec_td1_return_estimate-False-False] 11.3724ms 11.0814ms 90.2413 Ops/s 90.6618 Ops/s $\color{#d91a1a}-0.46\%$
test_values[td_lambda_return_estimate-True-False] 42.5650ms 40.4529ms 24.7201 Ops/s 24.1942 Ops/s $\color{#35bf28}+2.17\%$
test_values[vec_td_lambda_return_estimate-True-False] 17.4115ms 11.2180ms 89.1422 Ops/s 90.9569 Ops/s $\color{#d91a1a}-2.00\%$
test_gae_speed[generalized_advantage_estimate-False-1-512] 8.7156ms 8.5973ms 116.3162 Ops/s 113.4057 Ops/s $\color{#35bf28}+2.57\%$
test_gae_speed[vec_generalized_advantage_estimate-True-1-512] 1.7357ms 1.5159ms 659.6680 Ops/s 659.6805 Ops/s $-0.00\%$
test_gae_speed[vec_generalized_advantage_estimate-False-1-512] 0.4768ms 0.4212ms 2.3743 KOps/s 2.4072 KOps/s $\color{#d91a1a}-1.37\%$
test_gae_speed[vec_generalized_advantage_estimate-True-32-512] 29.8366ms 29.1868ms 34.2621 Ops/s 38.8551 Ops/s $\textbf{\color{#d91a1a}-11.82\%}$
test_gae_speed[vec_generalized_advantage_estimate-False-32-512] 1.7832ms 1.6897ms 591.8383 Ops/s 585.7160 Ops/s $\color{#35bf28}+1.05\%$
test_dqn_speed[False-None] 1.5261ms 1.4033ms 712.6258 Ops/s 700.1961 Ops/s $\color{#35bf28}+1.78\%$
test_dqn_speed[False-backward] 2.0243ms 1.9063ms 524.5833 Ops/s 502.4356 Ops/s $\color{#35bf28}+4.41\%$
test_dqn_speed[True-None] 0.6522ms 0.5460ms 1.8315 KOps/s 1.7760 KOps/s $\color{#35bf28}+3.13\%$
test_dqn_speed[True-backward] 1.0309ms 0.9882ms 1.0119 KOps/s 981.7778 Ops/s $\color{#35bf28}+3.07\%$
test_dqn_speed[reduce-overhead-None] 0.6811ms 0.5347ms 1.8702 KOps/s 1.8259 KOps/s $\color{#35bf28}+2.43\%$
test_ddpg_speed[False-None] 3.2028ms 2.8192ms 354.7090 Ops/s 345.8403 Ops/s $\color{#35bf28}+2.56\%$
test_ddpg_speed[False-backward] 4.1077ms 4.0413ms 247.4448 Ops/s 241.5704 Ops/s $\color{#35bf28}+2.43\%$
test_ddpg_speed[True-None] 1.8113ms 1.3945ms 717.0798 Ops/s 705.3566 Ops/s $\color{#35bf28}+1.66\%$
test_ddpg_speed[True-backward] 2.7729ms 2.3655ms 422.7481 Ops/s 406.0860 Ops/s $\color{#35bf28}+4.10\%$
test_ddpg_speed[reduce-overhead-None] 1.8270ms 1.3850ms 722.0054 Ops/s 703.1804 Ops/s $\color{#35bf28}+2.68\%$
test_sac_speed[False-None] 8.5666ms 8.0122ms 124.8090 Ops/s 124.3772 Ops/s $\color{#35bf28}+0.35\%$
test_sac_speed[False-backward] 11.7921ms 11.2914ms 88.5631 Ops/s 87.9772 Ops/s $\color{#35bf28}+0.67\%$
test_sac_speed[True-None] 2.4605ms 2.1354ms 468.3048 Ops/s 453.9554 Ops/s $\color{#35bf28}+3.16\%$
test_sac_speed[True-backward] 4.2184ms 4.0200ms 248.7568 Ops/s 242.9608 Ops/s $\color{#35bf28}+2.39\%$
test_sac_speed[reduce-overhead-None] 2.5454ms 2.1224ms 471.1677 Ops/s 459.9326 Ops/s $\color{#35bf28}+2.44\%$
test_redq_speed[False-None] 11.3200ms 10.3051ms 97.0393 Ops/s 95.8953 Ops/s $\color{#35bf28}+1.19\%$
test_redq_speed[False-backward] 18.8371ms 17.7598ms 56.3070 Ops/s 55.8332 Ops/s $\color{#35bf28}+0.85\%$
test_redq_speed[True-None] 5.9899ms 4.5258ms 220.9543 Ops/s 223.4449 Ops/s $\color{#d91a1a}-1.11\%$
test_redq_speed[True-backward] 9.6941ms 9.4166ms 106.1958 Ops/s 101.2607 Ops/s $\color{#35bf28}+4.87\%$
test_redq_speed[reduce-overhead-None] 4.7732ms 4.3286ms 231.0217 Ops/s 223.4898 Ops/s $\color{#35bf28}+3.37\%$
test_redq_deprec_speed[False-None] 11.0585ms 10.6407ms 93.9787 Ops/s 91.4058 Ops/s $\color{#35bf28}+2.81\%$
test_redq_deprec_speed[False-backward] 15.6256ms 15.2090ms 65.7507 Ops/s 64.4627 Ops/s $\color{#35bf28}+2.00\%$
test_redq_deprec_speed[True-None] 3.9245ms 3.5460ms 282.0086 Ops/s 263.0305 Ops/s $\textbf{\color{#35bf28}+7.22\%}$
test_redq_deprec_speed[True-backward] 7.5444ms 7.2500ms 137.9305 Ops/s 135.7290 Ops/s $\color{#35bf28}+1.62\%$
test_redq_deprec_speed[reduce-overhead-None] 3.8558ms 3.5211ms 283.9993 Ops/s 268.2141 Ops/s $\textbf{\color{#35bf28}+5.89\%}$
test_td3_speed[False-None] 8.1069ms 7.9362ms 126.0046 Ops/s 123.5921 Ops/s $\color{#35bf28}+1.95\%$
test_td3_speed[False-backward] 11.2379ms 10.7550ms 92.9803 Ops/s 91.7588 Ops/s $\color{#35bf28}+1.33\%$
test_td3_speed[True-None] 1.8810ms 1.8167ms 550.4364 Ops/s 541.3208 Ops/s $\color{#35bf28}+1.68\%$
test_td3_speed[True-backward] 3.6713ms 3.5404ms 282.4523 Ops/s 239.6334 Ops/s $\textbf{\color{#35bf28}+17.87\%}$
test_td3_speed[reduce-overhead-None] 1.7863ms 1.7605ms 568.0276 Ops/s 538.3055 Ops/s $\textbf{\color{#35bf28}+5.52\%}$
test_cql_speed[False-None] 28.3830ms 25.7850ms 38.7823 Ops/s 38.8273 Ops/s $\color{#d91a1a}-0.12\%$
test_cql_speed[False-backward] 38.1860ms 34.8535ms 28.6916 Ops/s 27.4262 Ops/s $\color{#35bf28}+4.61\%$
test_cql_speed[True-None] 12.3737ms 12.0845ms 82.7507 Ops/s 81.5096 Ops/s $\color{#35bf28}+1.52\%$
test_cql_speed[True-backward] 17.9461ms 17.3390ms 57.6734 Ops/s 56.5462 Ops/s $\color{#35bf28}+1.99\%$
test_cql_speed[reduce-overhead-None] 12.4868ms 12.0715ms 82.8398 Ops/s 65.7883 Ops/s $\textbf{\color{#35bf28}+25.92\%}$
test_a2c_speed[False-None] 5.6003ms 5.3459ms 187.0577 Ops/s 181.0824 Ops/s $\color{#35bf28}+3.30\%$
test_a2c_speed[False-backward] 11.9137ms 11.6834ms 85.5913 Ops/s 84.2902 Ops/s $\color{#35bf28}+1.54\%$
test_a2c_speed[True-None] 3.8251ms 3.7070ms 269.7605 Ops/s 268.4232 Ops/s $\color{#35bf28}+0.50\%$
test_a2c_speed[True-backward] 8.6862ms 8.4762ms 117.9780 Ops/s 115.8618 Ops/s $\color{#35bf28}+1.83\%$
test_a2c_speed[reduce-overhead-None] 3.8454ms 3.6824ms 271.5625 Ops/s 269.6632 Ops/s $\color{#35bf28}+0.70\%$
test_ppo_speed[False-None] 6.0020ms 5.8032ms 172.3197 Ops/s 166.5460 Ops/s $\color{#35bf28}+3.47\%$
test_ppo_speed[False-backward] 12.5083ms 12.2108ms 81.8948 Ops/s 80.4794 Ops/s $\color{#35bf28}+1.76\%$
test_ppo_speed[True-None] 3.7870ms 3.6190ms 276.3198 Ops/s 271.3793 Ops/s $\color{#35bf28}+1.82\%$
test_ppo_speed[True-backward] 8.5983ms 8.3234ms 120.1434 Ops/s 113.6067 Ops/s $\textbf{\color{#35bf28}+5.75\%}$
test_ppo_speed[reduce-overhead-None] 3.7278ms 3.5815ms 279.2121 Ops/s 275.0553 Ops/s $\color{#35bf28}+1.51\%$
test_reinforce_speed[False-None] 4.8203ms 4.4613ms 224.1520 Ops/s 217.2272 Ops/s $\color{#35bf28}+3.19\%$
test_reinforce_speed[False-backward] 7.3808ms 7.1732ms 139.4085 Ops/s 135.3076 Ops/s $\color{#35bf28}+3.03\%$
test_reinforce_speed[True-None] 2.9905ms 2.8710ms 348.3145 Ops/s 334.9090 Ops/s $\color{#35bf28}+4.00\%$
test_reinforce_speed[True-backward] 7.7496ms 7.5476ms 132.4930 Ops/s 129.0363 Ops/s $\color{#35bf28}+2.68\%$
test_reinforce_speed[reduce-overhead-None] 3.0804ms 2.8378ms 352.3818 Ops/s 346.7866 Ops/s $\color{#35bf28}+1.61\%$
test_iql_speed[False-None] 24.3619ms 19.8706ms 50.3257 Ops/s 48.5029 Ops/s $\color{#35bf28}+3.76\%$
test_iql_speed[False-backward] 35.0545ms 29.9515ms 33.3873 Ops/s 32.6657 Ops/s $\color{#35bf28}+2.21\%$
test_iql_speed[True-None] 8.6605ms 8.3890ms 119.2038 Ops/s 114.6129 Ops/s $\color{#35bf28}+4.01\%$
test_iql_speed[True-backward] 16.6562ms 16.2960ms 61.3648 Ops/s 60.2092 Ops/s $\color{#35bf28}+1.92\%$
test_iql_speed[reduce-overhead-None] 8.6370ms 8.4542ms 118.2838 Ops/s 116.3588 Ops/s $\color{#35bf28}+1.65\%$
test_rb_sample[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] 6.2812ms 6.0572ms 165.0936 Ops/s 160.4248 Ops/s $\color{#35bf28}+2.91\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 1.9934ms 0.3730ms 2.6810 KOps/s 3.3451 KOps/s $\textbf{\color{#d91a1a}-19.85\%}$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 0.5618ms 0.3530ms 2.8327 KOps/s 3.6748 KOps/s $\textbf{\color{#d91a1a}-22.91\%}$
test_rb_sample[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 6.3041ms 5.8275ms 171.6008 Ops/s 165.9663 Ops/s $\color{#35bf28}+3.39\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 0.7252ms 0.3670ms 2.7245 KOps/s 3.3849 KOps/s $\textbf{\color{#d91a1a}-19.51\%}$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 0.5524ms 0.3380ms 2.9590 KOps/s 3.3680 KOps/s $\textbf{\color{#d91a1a}-12.14\%}$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-sampler6-10000] 1.8245ms 1.4729ms 678.9137 Ops/s 768.0558 Ops/s $\textbf{\color{#d91a1a}-11.61\%}$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-sampler7-10000] 1.6308ms 1.3467ms 742.5386 Ops/s 842.1466 Ops/s $\textbf{\color{#d91a1a}-11.83\%}$
test_rb_sample[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] 9.3469ms 6.0829ms 164.3963 Ops/s 161.1935 Ops/s $\color{#35bf28}+1.99\%$
test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 2.1588ms 0.4781ms 2.0918 KOps/s 2.1941 KOps/s $\color{#d91a1a}-4.66\%$
test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 0.7001ms 0.4464ms 2.2401 KOps/s 2.2873 KOps/s $\color{#d91a1a}-2.06\%$
test_rb_iterate[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] 6.0156ms 5.8367ms 171.3309 Ops/s 165.6921 Ops/s $\color{#35bf28}+3.40\%$
test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 0.9161ms 0.3831ms 2.6101 KOps/s 3.4619 KOps/s $\textbf{\color{#d91a1a}-24.60\%}$
test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 0.5846ms 0.3698ms 2.7044 KOps/s 3.4652 KOps/s $\textbf{\color{#d91a1a}-21.96\%}$
test_rb_iterate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 6.0238ms 5.7777ms 173.0795 Ops/s 165.9388 Ops/s $\color{#35bf28}+4.30\%$
test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 0.6974ms 0.3223ms 3.1025 KOps/s 3.0490 KOps/s $\color{#35bf28}+1.75\%$
test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 0.4782ms 0.2595ms 3.8534 KOps/s 3.4277 KOps/s $\textbf{\color{#35bf28}+12.42\%}$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] 6.3832ms 5.9526ms 167.9947 Ops/s 161.5447 Ops/s $\color{#35bf28}+3.99\%$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 2.4433ms 0.4908ms 2.0377 KOps/s 2.0570 KOps/s $\color{#d91a1a}-0.94\%$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 0.7373ms 0.4701ms 2.1271 KOps/s 2.2204 KOps/s $\color{#d91a1a}-4.20\%$
test_rb_populate[TensorDictReplayBuffer-ListStorage-RandomSampler-400] 6.4872ms 5.0233ms 199.0738 Ops/s 51.6161 Ops/s $\textbf{\color{#35bf28}+285.68\%}$
test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-400] 8.2882ms 1.8763ms 532.9567 Ops/s 506.7679 Ops/s $\textbf{\color{#35bf28}+5.17\%}$
test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400] 10.4253ms 1.2399ms 806.4983 Ops/s 836.3193 Ops/s $\color{#d91a1a}-3.57\%$
test_rb_populate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-400] 7.4608ms 5.0504ms 198.0055 Ops/s 198.0838 Ops/s $\color{#d91a1a}-0.04\%$
test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-400] 3.8857ms 1.7349ms 576.3884 Ops/s 483.9157 Ops/s $\textbf{\color{#35bf28}+19.11\%}$
test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-400] 1.0572ms 0.8666ms 1.1540 KOps/s 897.0640 Ops/s $\textbf{\color{#35bf28}+28.64\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-ListStorage-None-400] 0.5488s 16.2108ms 61.6874 Ops/s 191.0213 Ops/s $\textbf{\color{#d91a1a}-67.71\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400] 4.1800ms 1.8853ms 530.4325 Ops/s 507.6167 Ops/s $\color{#35bf28}+4.49\%$
test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400] 2.3652ms 1.0572ms 945.9093 Ops/s 989.1316 Ops/s $\color{#d91a1a}-4.37\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-True] 38.2750ms 35.8165ms 27.9201 Ops/s 28.0391 Ops/s $\color{#d91a1a}-0.42\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-False] 19.8035ms 17.8224ms 56.1090 Ops/s 55.1490 Ops/s $\color{#35bf28}+1.74\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-100000-10000-100-True] 40.3777ms 36.8985ms 27.1013 Ops/s 26.5672 Ops/s $\color{#35bf28}+2.01\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-100000-10000-100-False] 20.2045ms 18.6084ms 53.7391 Ops/s 53.7863 Ops/s $\color{#d91a1a}-0.09\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-1000000-10000-100-True] 41.7219ms 39.5666ms 25.2738 Ops/s 25.2211 Ops/s $\color{#35bf28}+0.21\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-1000000-10000-100-False] 21.9014ms 20.4003ms 49.0189 Ops/s 50.4907 Ops/s $\color{#d91a1a}-2.92\%$
test_storage_write_lazystack[50-img_shape0-small] 0.8987ms 0.2259ms 4.4267 KOps/s 4.3286 KOps/s $\color{#35bf28}+2.27\%$
test_storage_write_lazystack[100-img_shape1-atari] 1.5550ms 1.3839ms 722.5846 Ops/s 726.1910 Ops/s $\color{#d91a1a}-0.50\%$
test_storage_write_lazystack[100-img_shape2-large_img] 2.5208ms 2.3660ms 422.6620 Ops/s 420.4097 Ops/s $\color{#35bf28}+0.54\%$
test_storage_write_lazystack[200-img_shape3-large_batch] 3.0401ms 2.8521ms 350.6217 Ops/s 347.8934 Ops/s $\color{#35bf28}+0.78\%$
test_storage_write_contiguous[50-img_shape0-small] 0.2232ms 0.1315ms 7.6036 KOps/s 7.4847 KOps/s $\color{#35bf28}+1.59\%$
test_storage_write_contiguous[100-img_shape1-atari] 0.3445ms 0.1788ms 5.5940 KOps/s 5.4920 KOps/s $\color{#35bf28}+1.86\%$
test_storage_write_contiguous[100-img_shape2-large_img] 1.8992ms 1.7356ms 576.1811 Ops/s 582.3595 Ops/s $\color{#d91a1a}-1.06\%$
test_storage_write_contiguous[200-img_shape3-large_batch] 1.4572ms 1.2826ms 779.6718 Ops/s 791.5148 Ops/s $\color{#d91a1a}-1.50\%$
test_collector_stack_then_write[50-img_shape0-small] 1.3026ms 1.1129ms 898.5173 Ops/s 896.9222 Ops/s $\color{#35bf28}+0.18\%$
test_collector_stack_then_write[100-img_shape1-atari] 3.6531ms 3.5304ms 283.2562 Ops/s 273.0237 Ops/s $\color{#35bf28}+3.75\%$
test_collector_stack_then_write[100-img_shape2-large_img] 5.9710ms 5.4534ms 183.3719 Ops/s 173.7980 Ops/s $\textbf{\color{#35bf28}+5.51\%}$
test_collector_stack_then_write[200-img_shape3-large_batch] 7.2855ms 6.9891ms 143.0796 Ops/s 146.2636 Ops/s $\color{#d91a1a}-2.18\%$
test_collector_lazystack_then_write[50-img_shape0-small] 0.4439ms 0.2738ms 3.6521 KOps/s 3.4781 KOps/s $\textbf{\color{#35bf28}+5.00\%}$
test_collector_lazystack_then_write[100-img_shape1-atari] 1.6742ms 1.5001ms 666.6434 Ops/s 656.3810 Ops/s $\color{#35bf28}+1.56\%$
test_collector_lazystack_then_write[100-img_shape2-large_img] 2.7745ms 2.4858ms 402.2885 Ops/s 398.0225 Ops/s $\color{#35bf28}+1.07\%$
test_collector_lazystack_then_write[200-img_shape3-large_batch] 3.3991ms 3.0791ms 324.7697 Ops/s 315.8878 Ops/s $\color{#35bf28}+2.81\%$
test_collector_without_rb[100-img_shape0-atari] 34.3655ms 33.5346ms 29.8199 Ops/s 29.5949 Ops/s $\color{#35bf28}+0.76\%$
test_collector_without_rb[200-img_shape1-large_batch] 66.4780ms 66.1934ms 15.1072 Ops/s 14.9139 Ops/s $\color{#35bf28}+1.30\%$
test_collector_with_rb[100-img_shape0-atari] 0.5823s 58.9650ms 16.9592 Ops/s 25.7437 Ops/s $\textbf{\color{#d91a1a}-34.12\%}$
test_collector_with_rb[200-img_shape1-large_batch] 76.5379ms 75.4594ms 13.2522 Ops/s 13.1631 Ops/s $\color{#35bf28}+0.68\%$
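
The Change column appears to be the relative difference between the Ops and Ops on Repo HEAD columns, with both expressed in the same unit. A minimal sketch under that assumption (the function name relative_change is hypothetical), checked against three rows of the table above:

```python
def relative_change(ops, ops_head):
    """Percent change of this PR's throughput relative to repo HEAD."""
    return (ops - ops_head) / ops_head * 100.0


# test_tensor_to_bytestream_speed[pickle]: 12.6595 vs 11.6371 KOps/s
print(f"{relative_change(12.6595, 11.6371):+.2f}%")  # +8.79%

# test_tensor_to_bytestream_speed[untyped_storage]: 9.2924 vs 9.4897 Ops/s
print(f"{relative_change(9.2924, 9.4897):+.2f}%")  # -2.08%

# A row mixing units: 1.1540 KOps/s vs 897.0640 Ops/s, converted to Ops/s first
print(f"{relative_change(1154.0, 897.0640):+.2f}%")  # +28.64%
```

Since the table values are rounded, a recomputed figure may differ from the reported one in the last digit for some rows.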

@github-actions
Contributor

github-actions bot commented Feb 6, 2026

$\color{#D29922}\textsf{\Large⚠\kern{0.2cm}\normalsize Warning}$ Result of GPU Benchmark Tests

Total Benchmarks: 172. Improved: $\large\color{#35bf28}24$. Worsened: $\large\color{#d91a1a}8$.

Name Max Mean Ops Ops on Repo HEAD Change
test_tensor_to_bytestream_speed[pickle] 83.8439μs 81.1762μs 12.3189 KOps/s 11.5663 KOps/s $\textbf{\color{#35bf28}+6.51\%}$
test_tensor_to_bytestream_speed[torch.save] 0.1423ms 0.1415ms 7.0696 KOps/s 6.7212 KOps/s $\textbf{\color{#35bf28}+5.18\%}$
test_tensor_to_bytestream_speed[untyped_storage] 0.1110s 0.1103s 9.0702 Ops/s 8.7079 Ops/s $\color{#35bf28}+4.16\%$
test_tensor_to_bytestream_speed[numpy] 2.6482μs 2.6439μs 378.2309 KOps/s 365.4122 KOps/s $\color{#35bf28}+3.51\%$
test_tensor_to_bytestream_speed[safetensors] 37.3413μs 37.0379μs 26.9994 KOps/s 25.3141 KOps/s $\textbf{\color{#35bf28}+6.66\%}$
test_simple 0.8226s 0.8130s 1.2300 Ops/s 1.1828 Ops/s $\color{#35bf28}+4.00\%$
test_transformed 1.5661s 1.4751s 0.6779 Ops/s 0.6741 Ops/s $\color{#35bf28}+0.57\%$
test_serial 2.4597s 2.3694s 0.4221 Ops/s 0.4143 Ops/s $\color{#35bf28}+1.87\%$
test_parallel 2.0827s 1.9680s 0.5081 Ops/s 0.4883 Ops/s $\color{#35bf28}+4.07\%$
test_step_mdp_speed[True-True-True-True-True] 0.1992ms 46.5797μs 21.4686 KOps/s 21.8135 KOps/s $\color{#d91a1a}-1.58\%$
test_step_mdp_speed[True-True-True-True-False] 61.9810μs 26.5108μs 37.7205 KOps/s 39.3866 KOps/s $\color{#d91a1a}-4.23\%$
test_step_mdp_speed[True-True-True-False-True] 0.1281ms 25.3309μs 39.4775 KOps/s 39.3049 KOps/s $\color{#35bf28}+0.44\%$
test_step_mdp_speed[True-True-True-False-False] 54.3710μs 14.3841μs 69.5210 KOps/s 71.4162 KOps/s $\color{#d91a1a}-2.65\%$
test_step_mdp_speed[True-True-False-True-True] 0.1098ms 50.1952μs 19.9222 KOps/s 20.4486 KOps/s $\color{#d91a1a}-2.57\%$
test_step_mdp_speed[True-True-False-True-False] 54.0300μs 28.9053μs 34.5957 KOps/s 35.2578 KOps/s $\color{#d91a1a}-1.88\%$
test_step_mdp_speed[True-True-False-False-True] 81.3710μs 28.4015μs 35.2094 KOps/s 34.9884 KOps/s $\color{#35bf28}+0.63\%$
test_step_mdp_speed[True-True-False-False-False] 47.1310μs 17.4332μs 57.3618 KOps/s 58.7148 KOps/s $\color{#d91a1a}-2.30\%$
test_step_mdp_speed[True-False-True-True-True] 81.8310μs 52.9587μs 18.8826 KOps/s 19.0766 KOps/s $\color{#d91a1a}-1.02\%$
test_step_mdp_speed[True-False-True-True-False] 63.3210μs 32.2318μs 31.0252 KOps/s 31.6307 KOps/s $\color{#d91a1a}-1.91\%$
test_step_mdp_speed[True-False-True-False-True] 58.8100μs 28.1733μs 35.4946 KOps/s 35.2835 KOps/s $\color{#35bf28}+0.60\%$
test_step_mdp_speed[True-False-True-False-False] 71.8410μs 17.2251μs 58.0548 KOps/s 57.9122 KOps/s $\color{#35bf28}+0.25\%$
test_step_mdp_speed[True-False-False-True-True] 96.8820μs 54.5887μs 18.3188 KOps/s 18.1864 KOps/s $\color{#35bf28}+0.73\%$
test_step_mdp_speed[True-False-False-True-False] 80.0610μs 35.3174μs 28.3146 KOps/s 29.1347 KOps/s $\color{#d91a1a}-2.81\%$
test_step_mdp_speed[True-False-False-False-True] 81.6910μs 31.5103μs 31.7356 KOps/s 32.1958 KOps/s $\color{#d91a1a}-1.43\%$
test_step_mdp_speed[True-False-False-False-False] 71.0010μs 20.6166μs 48.5045 KOps/s 50.8459 KOps/s $\color{#d91a1a}-4.60\%$
test_step_mdp_speed[False-True-True-True-True] 81.5210μs 53.4163μs 18.7209 KOps/s 19.1310 KOps/s $\color{#d91a1a}-2.14\%$
test_step_mdp_speed[False-True-True-True-False] 62.9710μs 32.8047μs 30.4835 KOps/s 31.9907 KOps/s $\color{#d91a1a}-4.71\%$
test_step_mdp_speed[False-True-True-False-True] 2.2786ms 32.6194μs 30.6566 KOps/s 30.0929 KOps/s $\color{#35bf28}+1.87\%$
test_step_mdp_speed[False-True-True-False-False] 48.6510μs 18.9717μs 52.7100 KOps/s 52.1717 KOps/s $\color{#35bf28}+1.03\%$
test_step_mdp_speed[False-True-False-True-True] 0.1198ms 54.9282μs 18.2056 KOps/s 18.0987 KOps/s $\color{#35bf28}+0.59\%$
test_step_mdp_speed[False-True-False-True-False] 84.6610μs 34.8193μs 28.7197 KOps/s 28.9214 KOps/s $\color{#d91a1a}-0.70\%$
test_step_mdp_speed[False-True-False-False-True] 0.1015ms 35.4178μs 28.2344 KOps/s 28.1843 KOps/s $\color{#35bf28}+0.18\%$
test_step_mdp_speed[False-True-False-False-False] 61.6710μs 21.7937μs 45.8847 KOps/s 45.8678 KOps/s $\color{#35bf28}+0.04\%$
test_step_mdp_speed[False-False-True-True-True] 0.1010ms 58.4514μs 17.1082 KOps/s 17.4935 KOps/s $\color{#d91a1a}-2.20\%$
test_step_mdp_speed[False-False-True-True-False] 76.6710μs 38.2471μs 26.1458 KOps/s 26.5462 KOps/s $\color{#d91a1a}-1.51\%$
test_step_mdp_speed[False-False-True-False-True] 90.7710μs 34.6955μs 28.8222 KOps/s 27.6802 KOps/s $\color{#35bf28}+4.13\%$
test_step_mdp_speed[False-False-True-False-False] 53.7700μs 21.6123μs 46.2699 KOps/s 46.8705 KOps/s $\color{#d91a1a}-1.28\%$
test_step_mdp_speed[False-False-False-True-True] 0.4932ms 59.7051μs 16.7490 KOps/s 16.9028 KOps/s $\color{#d91a1a}-0.91\%$
test_step_mdp_speed[False-False-False-True-False] 77.9710μs 40.1223μs 24.9238 KOps/s 25.2091 KOps/s $\color{#d91a1a}-1.13\%$
test_step_mdp_speed[False-False-False-False-True] 0.4577ms 37.7144μs 26.5150 KOps/s 26.8653 KOps/s $\color{#d91a1a}-1.30\%$
test_step_mdp_speed[False-False-False-False-False] 0.4642ms 24.4620μs 40.8797 KOps/s 41.4279 KOps/s $\color{#d91a1a}-1.32\%$
test_non_tensor_env_rollout_speed[1000-single-True] 0.8739s 0.7834s 1.2766 Ops/s 1.2792 Ops/s $\color{#d91a1a}-0.20\%$
test_non_tensor_env_rollout_speed[1000-single-False] 0.7363s 0.6401s 1.5622 Ops/s 1.5517 Ops/s $\color{#35bf28}+0.68\%$
test_non_tensor_env_rollout_speed[1000-serial-no-buffers-True] 1.7768s 1.6958s 0.5897 Ops/s 0.5880 Ops/s $\color{#35bf28}+0.29\%$
test_non_tensor_env_rollout_speed[1000-serial-no-buffers-False] 1.5425s 1.4694s 0.6805 Ops/s 0.6761 Ops/s $\color{#35bf28}+0.66\%$
test_non_tensor_env_rollout_speed[1000-serial-buffers-True] 2.0200s 1.9426s 0.5148 Ops/s 0.5121 Ops/s $\color{#35bf28}+0.53\%$
test_non_tensor_env_rollout_speed[1000-serial-buffers-False] 1.8051s 1.7213s 0.5810 Ops/s 0.5781 Ops/s $\color{#35bf28}+0.50\%$
test_non_tensor_env_rollout_speed[1000-parallel-no-buffers-True] 4.7060s 4.6312s 0.2159 Ops/s 0.2117 Ops/s $\color{#35bf28}+2.01\%$
test_non_tensor_env_rollout_speed[1000-parallel-no-buffers-False] 4.5680s 4.4963s 0.2224 Ops/s 0.2238 Ops/s $\color{#d91a1a}-0.64\%$
test_non_tensor_env_rollout_speed[1000-parallel-buffers-True] 2.0602s 1.9862s 0.5035 Ops/s 0.5038 Ops/s $\color{#d91a1a}-0.06\%$
test_non_tensor_env_rollout_speed[1000-parallel-buffers-False] 1.7688s 1.6806s 0.5950 Ops/s 0.5846 Ops/s $\color{#35bf28}+1.77\%$
test_values[generalized_advantage_estimate-True-True] 21.5569ms 21.0887ms 47.4188 Ops/s 47.1892 Ops/s $\color{#35bf28}+0.49\%$
test_values[vec_generalized_advantage_estimate-True-True] 0.1328s 3.5916ms 278.4245 Ops/s 259.0226 Ops/s $\textbf{\color{#35bf28}+7.49\%}$
test_values[td0_return_estimate-False-False] 0.1124ms 85.6226μs 11.6792 KOps/s 11.6824 KOps/s $\color{#d91a1a}-0.03\%$
test_values[td1_return_estimate-False-False] 50.4791ms 49.7804ms 20.0882 Ops/s 19.9915 Ops/s $\color{#35bf28}+0.48\%$
test_values[vec_td1_return_estimate-False-False] 1.3089ms 1.1045ms 905.3724 Ops/s 900.5397 Ops/s $\color{#35bf28}+0.54\%$
test_values[td_lambda_return_estimate-True-False] 82.7304ms 81.7123ms 12.2381 Ops/s 12.1019 Ops/s $\color{#35bf28}+1.12\%$
test_values[vec_td_lambda_return_estimate-True-False] 1.2784ms 1.1020ms 907.4202 Ops/s 904.0775 Ops/s $\color{#35bf28}+0.37\%$
test_gae_speed[generalized_advantage_estimate-False-1-512] 21.5184ms 21.3186ms 46.9074 Ops/s 46.6643 Ops/s $\color{#35bf28}+0.52\%$
test_gae_speed[vec_generalized_advantage_estimate-True-1-512] 1.0419ms 0.7766ms 1.2877 KOps/s 1.2852 KOps/s $\color{#35bf28}+0.19\%$
test_gae_speed[vec_generalized_advantage_estimate-False-1-512] 0.7653ms 0.6950ms 1.4388 KOps/s 1.3559 KOps/s $\textbf{\color{#35bf28}+6.11\%}$
test_gae_speed[vec_generalized_advantage_estimate-True-32-512] 1.5604ms 1.5070ms 663.5836 Ops/s 659.6050 Ops/s $\color{#35bf28}+0.60\%$
test_gae_speed[vec_generalized_advantage_estimate-False-32-512] 0.7777ms 0.7115ms 1.4054 KOps/s 1.3865 KOps/s $\color{#35bf28}+1.37\%$
test_dqn_speed[False-None] 1.6347ms 1.5655ms 638.7631 Ops/s 629.4472 Ops/s $\color{#35bf28}+1.48\%$
test_dqn_speed[False-backward] 2.5265ms 2.2237ms 449.7103 Ops/s 441.0059 Ops/s $\color{#35bf28}+1.97\%$
test_dqn_speed[True-None] 0.6893ms 0.5654ms 1.7686 KOps/s 1.7271 KOps/s $\color{#35bf28}+2.41\%$
test_dqn_speed[True-backward] 1.2656ms 1.2182ms 820.8593 Ops/s 817.0723 Ops/s $\color{#35bf28}+0.46\%$
test_dqn_speed[reduce-overhead-None] 0.6657ms 0.5921ms 1.6890 KOps/s 1.6224 KOps/s $\color{#35bf28}+4.11\%$
test_ddpg_speed[False-None] 3.3333ms 2.9611ms 337.7180 Ops/s 330.5641 Ops/s $\color{#35bf28}+2.16\%$
test_ddpg_speed[False-backward] 4.8256ms 4.4219ms 226.1449 Ops/s 222.8785 Ops/s $\color{#35bf28}+1.47\%$
test_ddpg_speed[True-None] 1.4326ms 1.3378ms 747.4867 Ops/s 745.0031 Ops/s $\color{#35bf28}+0.33\%$
test_ddpg_speed[True-backward] 2.6401ms 2.5573ms 391.0376 Ops/s 389.3255 Ops/s $\color{#35bf28}+0.44\%$
test_ddpg_speed[reduce-overhead-None] 1.4715ms 1.3665ms 731.7876 Ops/s 728.4990 Ops/s $\color{#35bf28}+0.45\%$
test_sac_speed[False-None] 9.9836ms 8.6625ms 115.4406 Ops/s 115.5112 Ops/s $\color{#d91a1a}-0.06\%$
test_sac_speed[False-backward] 12.2043ms 11.6584ms 85.7749 Ops/s 83.0591 Ops/s $\color{#35bf28}+3.27\%$
test_sac_speed[True-None] 2.0584ms 1.8439ms 542.3392 Ops/s 537.5706 Ops/s $\color{#35bf28}+0.89\%$
test_sac_speed[True-backward] 3.5643ms 3.4861ms 286.8499 Ops/s 271.1755 Ops/s $\textbf{\color{#35bf28}+5.78\%}$
test_sac_speed[reduce-overhead-None] 20.0095ms 11.1976ms 89.3048 Ops/s 89.2234 Ops/s $\color{#35bf28}+0.09\%$
test_redq_deprec_speed[False-None] 10.0381ms 9.5150ms 105.0976 Ops/s 104.2512 Ops/s $\color{#35bf28}+0.81\%$
test_redq_deprec_speed[False-backward] 13.2718ms 12.7031ms 78.7210 Ops/s 76.0140 Ops/s $\color{#35bf28}+3.56\%$
test_redq_deprec_speed[True-None] 2.7070ms 2.5628ms 390.1981 Ops/s 383.8010 Ops/s $\color{#35bf28}+1.67\%$
test_redq_deprec_speed[True-backward] 4.2899ms 4.1626ms 240.2335 Ops/s 225.9285 Ops/s $\textbf{\color{#35bf28}+6.33\%}$
test_redq_deprec_speed[reduce-overhead-None] 16.4456ms 10.1015ms 98.9954 Ops/s 99.5453 Ops/s $\color{#d91a1a}-0.55\%$
test_td3_speed[False-None] 8.6773ms 8.4453ms 118.4089 Ops/s 111.8853 Ops/s $\textbf{\color{#35bf28}+5.83\%}$
test_td3_speed[False-backward] 11.3444ms 10.8824ms 91.8913 Ops/s 89.8804 Ops/s $\color{#35bf28}+2.24\%$
test_td3_speed[True-None] 1.6834ms 1.6511ms 605.6514 Ops/s 591.9918 Ops/s $\color{#35bf28}+2.31\%$
test_td3_speed[True-backward] 3.2144ms 3.1487ms 317.5925 Ops/s 297.9734 Ops/s $\textbf{\color{#35bf28}+6.58\%}$
test_td3_speed[reduce-overhead-None] 73.8719ms 25.4426ms 39.3041 Ops/s 39.1117 Ops/s $\color{#35bf28}+0.49\%$
test_cql_speed[False-None] 18.3187ms 17.7283ms 56.4071 Ops/s 56.0474 Ops/s $\color{#35bf28}+0.64\%$
test_cql_speed[False-backward] 24.0049ms 23.1262ms 43.2411 Ops/s 42.2238 Ops/s $\color{#35bf28}+2.41\%$
test_cql_speed[True-None] 3.4022ms 3.2847ms 304.4439 Ops/s 281.9197 Ops/s $\textbf{\color{#35bf28}+7.99\%}$
test_cql_speed[True-backward] 5.9864ms 5.3843ms 185.7245 Ops/s 174.3002 Ops/s $\textbf{\color{#35bf28}+6.55\%}$
test_cql_speed[reduce-overhead-None] 19.4274ms 12.0839ms 82.7548 Ops/s 83.2283 Ops/s $\color{#d91a1a}-0.57\%$
test_a2c_speed[False-None] 4.0207ms 3.3729ms 296.4771 Ops/s 297.7744 Ops/s $\color{#d91a1a}-0.44\%$
test_a2c_speed[False-backward] 6.8347ms 6.3870ms 156.5683 Ops/s 150.6744 Ops/s $\color{#35bf28}+3.91\%$
test_a2c_speed[True-None] 1.4639ms 1.3458ms 743.0430 Ops/s 729.7385 Ops/s $\color{#35bf28}+1.82\%$
test_a2c_speed[True-backward] 3.4883ms 3.1834ms 314.1265 Ops/s 311.7182 Ops/s $\color{#35bf28}+0.77\%$
test_a2c_speed[reduce-overhead-None] 1.1787ms 1.0055ms 994.4896 Ops/s 994.8987 Ops/s $\color{#d91a1a}-0.04\%$
test_ppo_speed[False-None] 4.0942ms 3.9480ms 253.2955 Ops/s 251.7072 Ops/s $\color{#35bf28}+0.63\%$
test_ppo_speed[False-backward] 7.9923ms 7.3911ms 135.2979 Ops/s 138.5974 Ops/s $\color{#d91a1a}-2.38\%$
test_ppo_speed[True-None] 1.5776ms 1.4531ms 688.2011 Ops/s 687.1273 Ops/s $\color{#35bf28}+0.16\%$
test_ppo_speed[True-backward] 3.3208ms 3.2852ms 304.3984 Ops/s 298.1566 Ops/s $\color{#35bf28}+2.09\%$
test_ppo_speed[reduce-overhead-None] 1.1980ms 1.0546ms 948.2294 Ops/s 909.4386 Ops/s $\color{#35bf28}+4.27\%$
test_reinforce_speed[False-None] 2.5740ms 2.3434ms 426.7338 Ops/s 420.1233 Ops/s $\color{#35bf28}+1.57\%$
test_reinforce_speed[False-backward] 3.8486ms 3.4031ms 293.8454 Ops/s 282.2635 Ops/s $\color{#35bf28}+4.10\%$
test_reinforce_speed[True-None] 1.4141ms 1.3199ms 757.6539 Ops/s 770.5362 Ops/s $\color{#d91a1a}-1.67\%$
test_reinforce_speed[True-backward] 2.9805ms 2.9327ms 340.9860 Ops/s 331.0514 Ops/s $\color{#35bf28}+3.00\%$
test_reinforce_speed[reduce-overhead-None] 0.4529s 10.4586ms 95.6150 Ops/s 104.0287 Ops/s $\textbf{\color{#d91a1a}-8.09\%}$
test_iql_speed[False-None] 9.9704ms 9.6652ms 103.4636 Ops/s 103.0250 Ops/s $\color{#35bf28}+0.43\%$
test_iql_speed[False-backward] 13.9227ms 13.4837ms 74.1635 Ops/s 74.2332 Ops/s $\color{#d91a1a}-0.09\%$
test_iql_speed[True-None] 2.3601ms 2.2153ms 451.3968 Ops/s 447.2570 Ops/s $\color{#35bf28}+0.93\%$
test_iql_speed[True-backward] 5.0471ms 4.7660ms 209.8203 Ops/s 200.4161 Ops/s $\color{#35bf28}+4.69\%$
test_iql_speed[reduce-overhead-None] 18.3044ms 10.7188ms 93.2939 Ops/s 94.5986 Ops/s $\color{#d91a1a}-1.38\%$
test_rb_sample[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] 6.3897ms 6.1399ms 162.8699 Ops/s 159.8555 Ops/s $\color{#35bf28}+1.89\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 0.8874ms 0.3818ms 2.6193 KOps/s 3.1647 KOps/s $\textbf{\color{#d91a1a}-17.23\%}$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 0.6379ms 0.3692ms 2.7087 KOps/s 3.3158 KOps/s $\textbf{\color{#d91a1a}-18.31\%}$
test_rb_sample[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 6.3004ms 6.0133ms 166.2982 Ops/s 166.0873 Ops/s $\color{#35bf28}+0.13\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 2.3832ms 0.3187ms 3.1380 KOps/s 2.8614 KOps/s $\textbf{\color{#35bf28}+9.67\%}$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 0.7169ms 0.2946ms 3.3945 KOps/s 3.2311 KOps/s $\textbf{\color{#35bf28}+5.06\%}$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-sampler6-10000] 1.7522ms 1.4380ms 695.3886 Ops/s 707.6580 Ops/s $\color{#d91a1a}-1.73\%$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-sampler7-10000] 1.4332ms 1.2179ms 821.1017 Ops/s 731.5519 Ops/s $\textbf{\color{#35bf28}+12.24\%}$
test_rb_sample[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] 6.4089ms 6.1994ms 161.3060 Ops/s 161.5305 Ops/s $\color{#d91a1a}-0.14\%$
test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 0.9332ms 0.4755ms 2.1032 KOps/s 2.2899 KOps/s $\textbf{\color{#d91a1a}-8.15\%}$
test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 0.7953ms 0.5260ms 1.9013 KOps/s 2.1942 KOps/s $\textbf{\color{#d91a1a}-13.35\%}$
test_rb_iterate[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] 6.2132ms 6.0526ms 165.2173 Ops/s 165.7334 Ops/s $\color{#d91a1a}-0.31\%$
test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 0.8876ms 0.3216ms 3.1094 KOps/s 3.1321 KOps/s $\color{#d91a1a}-0.72\%$
test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 0.5244ms 0.3166ms 3.1587 KOps/s 2.8617 KOps/s $\textbf{\color{#35bf28}+10.38\%}$
test_rb_iterate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 6.3058ms 5.9935ms 166.8480 Ops/s 168.1736 Ops/s $\color{#d91a1a}-0.79\%$
test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 1.8377ms 0.3167ms 3.1578 KOps/s 2.7970 KOps/s $\textbf{\color{#35bf28}+12.90\%}$
test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 0.5753ms 0.3212ms 3.1136 KOps/s 3.1344 KOps/s $\color{#d91a1a}-0.66\%$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] 6.3354ms 6.1976ms 161.3538 Ops/s 161.3699 Ops/s $-0.01\%$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 1.9631ms 0.4661ms 2.1455 KOps/s 2.1897 KOps/s $\color{#d91a1a}-2.02\%$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 0.7029ms 0.4892ms 2.0442 KOps/s 2.0041 KOps/s $\color{#35bf28}+2.00\%$
test_rb_populate[TensorDictReplayBuffer-ListStorage-RandomSampler-400] 0.6500s 18.1184ms 55.1925 Ops/s 191.4180 Ops/s $\textbf{\color{#d91a1a}-71.17\%}$
test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-400] 8.3424ms 2.0434ms 489.3779 Ops/s 495.3038 Ops/s $\color{#d91a1a}-1.20\%$
test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400] 9.4998ms 1.3066ms 765.3587 Ops/s 993.2289 Ops/s $\textbf{\color{#d91a1a}-22.94\%}$
test_rb_populate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-400] 7.1461ms 5.2460ms 190.6200 Ops/s 190.5996 Ops/s $\color{#35bf28}+0.01\%$
test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-400] 3.9628ms 1.8132ms 551.5059 Ops/s 498.7986 Ops/s $\textbf{\color{#35bf28}+10.57\%}$
test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-400] 1.1093ms 0.9266ms 1.0792 KOps/s 777.6168 Ops/s $\textbf{\color{#35bf28}+38.78\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-ListStorage-None-400] 0.6007s 17.3173ms 57.7458 Ops/s 48.1891 Ops/s $\textbf{\color{#35bf28}+19.83\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400] 4.2563ms 1.9569ms 511.0154 Ops/s 455.9667 Ops/s $\textbf{\color{#35bf28}+12.07\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400] 2.2058ms 1.1473ms 871.6345 Ops/s 876.3190 Ops/s $\color{#d91a1a}-0.53\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-True] 39.4911ms 37.0136ms 27.0171 Ops/s 26.3768 Ops/s $\color{#35bf28}+2.43\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-False] 19.6466ms 18.2195ms 54.8864 Ops/s 54.5077 Ops/s $\color{#35bf28}+0.69\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-100000-10000-100-True] 40.8604ms 37.6575ms 26.5552 Ops/s 25.9542 Ops/s $\color{#35bf28}+2.32\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-100000-10000-100-False] 20.7622ms 18.6228ms 53.6976 Ops/s 53.0594 Ops/s $\color{#35bf28}+1.20\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-1000000-10000-100-True] 42.4893ms 39.9210ms 25.0495 Ops/s 24.7443 Ops/s $\color{#35bf28}+1.23\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-1000000-10000-100-False] 22.0778ms 20.8876ms 47.8753 Ops/s 48.7144 Ops/s $\color{#d91a1a}-1.72\%$
test_storage_write_lazystack[50-img_shape0-small] 0.9271ms 0.2373ms 4.2148 KOps/s 4.4948 KOps/s $\textbf{\color{#d91a1a}-6.23\%}$
test_storage_write_lazystack[100-img_shape1-atari] 1.6624ms 1.4179ms 705.2898 Ops/s 691.4757 Ops/s $\color{#35bf28}+2.00\%$
test_storage_write_lazystack[100-img_shape2-large_img] 2.7204ms 2.2947ms 435.7922 Ops/s 432.1262 Ops/s $\color{#35bf28}+0.85\%$
test_storage_write_lazystack[200-img_shape3-large_batch] 3.1208ms 2.9520ms 338.7558 Ops/s 337.1903 Ops/s $\color{#35bf28}+0.46\%$
test_storage_write_contiguous[50-img_shape0-small] 0.5015ms 0.1514ms 6.6054 KOps/s 6.4509 KOps/s $\color{#35bf28}+2.40\%$
test_storage_write_contiguous[100-img_shape1-atari] 0.3807ms 0.2087ms 4.7915 KOps/s 4.4692 KOps/s $\textbf{\color{#35bf28}+7.21\%}$
test_storage_write_contiguous[100-img_shape2-large_img] 2.0897ms 1.8167ms 550.4414 Ops/s 557.3680 Ops/s $\color{#d91a1a}-1.24\%$
test_storage_write_contiguous[200-img_shape3-large_batch] 1.5532ms 1.3418ms 745.2599 Ops/s 776.6210 Ops/s $\color{#d91a1a}-4.04\%$
test_collector_stack_then_write[50-img_shape0-small] 1.3742ms 1.1788ms 848.3427 Ops/s 856.6189 Ops/s $\color{#d91a1a}-0.97\%$
test_collector_stack_then_write[100-img_shape1-atari] 3.8650ms 3.6995ms 270.3038 Ops/s 274.1923 Ops/s $\color{#d91a1a}-1.42\%$
test_collector_stack_then_write[100-img_shape2-large_img] 11.3122ms 5.7723ms 173.2400 Ops/s 172.1791 Ops/s $\color{#35bf28}+0.62\%$
test_collector_stack_then_write[200-img_shape3-large_batch] 7.3110ms 7.1071ms 140.7035 Ops/s 133.3587 Ops/s $\textbf{\color{#35bf28}+5.51\%}$
test_collector_lazystack_then_write[50-img_shape0-small] 0.4622ms 0.2759ms 3.6248 KOps/s 3.4394 KOps/s $\textbf{\color{#35bf28}+5.39\%}$
test_collector_lazystack_then_write[100-img_shape1-atari] 1.6829ms 1.4843ms 673.7138 Ops/s 636.2263 Ops/s $\textbf{\color{#35bf28}+5.89\%}$
test_collector_lazystack_then_write[100-img_shape2-large_img] 2.9868ms 2.4061ms 415.6152 Ops/s 415.5374 Ops/s $\color{#35bf28}+0.02\%$
test_collector_lazystack_then_write[200-img_shape3-large_batch] 3.3923ms 3.1736ms 315.1027 Ops/s 313.3269 Ops/s $\color{#35bf28}+0.57\%$
test_collector_without_rb[100-img_shape0-atari] 35.3399ms 34.8313ms 28.7098 Ops/s 28.6965 Ops/s $\color{#35bf28}+0.05\%$
test_collector_without_rb[200-img_shape1-large_batch] 69.1895ms 68.2090ms 14.6608 Ops/s 14.2748 Ops/s $\color{#35bf28}+2.70\%$
test_collector_with_rb[100-img_shape0-atari] 40.9820ms 39.4420ms 25.3537 Ops/s 25.0612 Ops/s $\color{#35bf28}+1.17\%$
test_collector_with_rb[200-img_shape1-large_batch] 78.5831ms 77.7333ms 12.8645 Ops/s 12.9219 Ops/s $\color{#d91a1a}-0.44\%$
test_collector_without_rb_cuda[100-img_shape0-atari] 60.2530ms 58.4997ms 17.0941 Ops/s 17.2069 Ops/s $\color{#d91a1a}-0.66\%$
test_collector_without_rb_cuda[200-img_shape1-large_batch] 0.1183s 0.1155s 8.6561 Ops/s 8.5729 Ops/s $\color{#35bf28}+0.97\%$
test_collector_with_rb_cuda[100-img_shape0-atari] 61.5590ms 59.8362ms 16.7123 Ops/s 16.4630 Ops/s $\color{#35bf28}+1.51\%$
test_collector_with_rb_cuda[200-img_shape1-large_batch] 0.1226s 0.1204s 8.3027 Ops/s 8.3909 Ops/s $\color{#d91a1a}-1.05\%$

Cache device-specific bounds tensors in Bounded._get_space_bounds() to
avoid .to(device) calls during CUDA graph capture.

Previously, Bounded._project() and Bounded.is_in() called .to(device)
on low/high bounds during every forward pass. This created DeviceCopy
operations that are incompatible with CUDA graph capture, causing:
- "operation not permitted when stream is capturing" errors
- Graph partitioning warnings, which reduce the performance gains from CUDA graphs

The fix adds lazy per-device caching: during warmup, the cache is
populated with .to() results. During capture and replay, cached tensors
are returned directly, avoiding the problematic device copies.
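
The lazy per-device caching described above can be sketched roughly as follows. This is an illustration of the pattern only, not the code merged in this PR: `StandInTensor` and `BoundsCache` are hypothetical names, and the stand-in models just `.device` and `.to()` so that the number of device copies can be observed without a GPU.

```python
from dataclasses import dataclass, field


@dataclass
class StandInTensor:
    """Hypothetical stand-in for a tensor: only `.device` and `.to()` are modelled."""

    value: float
    device: str
    copies: list = field(default_factory=list)  # shared log of simulated device copies

    def to(self, device):
        if device == self.device:
            return self
        self.copies.append(device)  # record the (simulated) device copy
        return StandInTensor(self.value, device, self.copies)


class BoundsCache:
    """Hypothetical helper that caches (low, high) per target device."""

    def __init__(self, low, high):
        self.low, self.high = low, high
        self._cache = {}  # device -> (low, high)

    def get(self, device):
        # Short-circuit: the bounds already live on the requested device.
        if device == self.low.device:
            return self.low, self.high
        bounds = self._cache.get(device)
        if bounds is None:
            # First call (warmup): pay for the copies once and remember them.
            bounds = (self.low.to(device), self.high.to(device))
            self._cache[device] = bounds
        return bounds


log = []
cache = BoundsCache(StandInTensor(-1.0, "cpu", log), StandInTensor(1.0, "cpu", log))
cache.get("cuda:0")  # warmup: two copies are recorded
cache.get("cuda:0")  # later calls: cache hit, no new copies
print(log)  # ['cuda:0', 'cuda:0']
```

The point of the pattern is that the copies happen only on the first call for a given device, so a call made during capture or replay returns existing objects and issues no new device transfer.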

The cache is excluded from serialization via __getstate__ to avoid
pickling CUDA tensors, which could cause issues when loading on
different devices or machines.
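
As a rough sketch of how a cache can be kept out of the pickled state via `__getstate__`: the class and attribute names below (`CachedSpec`, `_bounds_cache`) are assumptions made for illustration and may not match the names used in the actual change.

```python
import pickle


class CachedSpec:
    """Hypothetical spec-like object holding a per-device cache."""

    def __init__(self, low, high):
        self.low, self.high = low, high
        self._bounds_cache = {}  # device -> (low, high); rebuilt lazily

    def __getstate__(self):
        # Copy the instance dict and reset the cache so that device-bound
        # entries never reach the pickle stream.
        state = self.__dict__.copy()
        state["_bounds_cache"] = {}
        return state


spec = CachedSpec(-1.0, 1.0)
spec._bounds_cache["cuda:0"] = ("low on cuda:0", "high on cuda:0")

restored = pickle.loads(pickle.dumps(spec))
print(restored._bounds_cache)  # {}
print(len(spec._bounds_cache))  # 1 (the live object keeps its cache)
```

Because the cache is only an optimization, dropping it on serialization loses nothing: the restored object repopulates it on first use on whatever device it is loaded on.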

Co-authored-by: Cursor <[email protected]>
@vmoens vmoens force-pushed the fix/cudagraph-bounded-spec branch from 3926f01 to 06b14cd Compare February 6, 2026 08:39
@vmoens vmoens merged commit fca4e76 into main Feb 6, 2026
64 of 86 checks passed

Labels

BugFix CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.