[Perf] ParallelEnv: fast-path device transfer in step_and_maybe_reset#3458

Merged
vmoens merged 2 commits into gh/vmoens/219/base from gh/vmoens/219/head
Feb 7, 2026

Collaborator

vmoens commented Feb 6, 2026

Stack from ghstack (oldest at bottom):

Optimise the output-reading phase of step_and_maybe_reset when the
shared-memory device and the target device are both known and differ
(the common CPU-shared -> CUDA case).

  • When shared_device is not None and shared_device != device: use a
    single td.to(device) instead of _fast_apply with a per-tensor device
    check. Since .to() already creates new tensors when the device
    changes, the extra .clone() is unnecessary.
  • Keep the _fast_apply fallback for the mixed-device case.
  • Move _sync_w2m() into a conditional so that it is only called when a
    cross-device transfer actually happened.

Co-authored-by: Cursor [email protected]

[ghstack-poisoned]

pytorch-bot bot commented Feb 6, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/3458

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit fbd36ac with merge base ab49b59:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions bot added the Performance label (Performance issue or suggestion for improvement) Feb 6, 2026
meta-cla bot added the CLA Signed label (This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.) Feb 6, 2026
[ghstack-poisoned]
Contributor

github-actions bot commented Feb 6, 2026

$\color{#D29922}\textsf{\Large⚠\kern{0.2cm}\normalsize Warning}$ Result of CPU Benchmark Tests

Total Benchmarks: 173. Improved: $\large\color{#35bf28}10$. Worsened: $\large\color{#d91a1a}18$.

Detailed results:
Name Max Mean Ops Ops on Repo HEAD Change
test_tensor_to_bytestream_speed[pickle] 79.6692μs 77.9475μs 12.8291 KOps/s 12.4608 KOps/s $\color{#35bf28}+2.96\%$
test_tensor_to_bytestream_speed[torch.save] 0.1350ms 0.1338ms 7.4716 KOps/s 7.4192 KOps/s $\color{#35bf28}+0.71\%$
test_tensor_to_bytestream_speed[untyped_storage] 0.1034s 0.1029s 9.7151 Ops/s 9.4630 Ops/s $\color{#35bf28}+2.66\%$
test_tensor_to_bytestream_speed[numpy] 2.5026μs 2.4950μs 400.7975 KOps/s 411.2053 KOps/s $\color{#d91a1a}-2.53\%$
test_tensor_to_bytestream_speed[safetensors] 36.6250μs 36.4308μs 27.4493 KOps/s 28.0176 KOps/s $\color{#d91a1a}-2.03\%$
test_simple 0.5349s 0.5336s 1.8740 Ops/s 1.7884 Ops/s $\color{#35bf28}+4.78\%$
test_transformed 1.2342s 1.1398s 0.8774 Ops/s 0.8867 Ops/s $\color{#d91a1a}-1.05\%$
test_serial 1.6457s 1.6359s 0.6113 Ops/s 0.6045 Ops/s $\color{#35bf28}+1.13\%$
test_parallel 1.1388s 1.0326s 0.9685 Ops/s 0.9595 Ops/s $\color{#35bf28}+0.93\%$
test_step_mdp_speed[True-True-True-True-True] 0.1444ms 42.4077μs 23.5806 KOps/s 22.8043 KOps/s $\color{#35bf28}+3.40\%$
test_step_mdp_speed[True-True-True-True-False] 59.2430μs 24.2949μs 41.1609 KOps/s 40.1519 KOps/s $\color{#35bf28}+2.51\%$
test_step_mdp_speed[True-True-True-False-True] 49.2520μs 24.1839μs 41.3497 KOps/s 41.0573 KOps/s $\color{#35bf28}+0.71\%$
test_step_mdp_speed[True-True-True-False-False] 45.2520μs 13.4230μs 74.4989 KOps/s 73.8747 KOps/s $\color{#35bf28}+0.84\%$
test_step_mdp_speed[True-True-False-True-True] 89.3550μs 45.8145μs 21.8272 KOps/s 21.6723 KOps/s $\color{#35bf28}+0.71\%$
test_step_mdp_speed[True-True-False-True-False] 53.9930μs 26.3342μs 37.9735 KOps/s 36.6220 KOps/s $\color{#35bf28}+3.69\%$
test_step_mdp_speed[True-True-False-False-True] 65.5840μs 26.3149μs 38.0013 KOps/s 37.4400 KOps/s $\color{#35bf28}+1.50\%$
test_step_mdp_speed[True-True-False-False-False] 50.3520μs 16.1666μs 61.8560 KOps/s 61.3442 KOps/s $\color{#35bf28}+0.83\%$
test_step_mdp_speed[True-False-True-True-True] 88.1940μs 49.7113μs 20.1162 KOps/s 20.4208 KOps/s $\color{#d91a1a}-1.49\%$
test_step_mdp_speed[True-False-True-True-False] 58.2630μs 29.5847μs 33.8012 KOps/s 33.0411 KOps/s $\color{#35bf28}+2.30\%$
test_step_mdp_speed[True-False-True-False-True] 67.4540μs 27.3006μs 36.6292 KOps/s 37.4326 KOps/s $\color{#d91a1a}-2.15\%$
test_step_mdp_speed[True-False-True-False-False] 39.5920μs 16.1264μs 62.0102 KOps/s 61.2831 KOps/s $\color{#35bf28}+1.19\%$
test_step_mdp_speed[True-False-False-True-True] 88.4840μs 51.3285μs 19.4824 KOps/s 19.3171 KOps/s $\color{#35bf28}+0.86\%$
test_step_mdp_speed[True-False-False-True-False] 64.0930μs 31.2260μs 32.0246 KOps/s 30.3707 KOps/s $\textbf{\color{#35bf28}+5.45\%}$
test_step_mdp_speed[True-False-False-False-True] 63.2040μs 29.3154μs 34.1117 KOps/s 33.8551 KOps/s $\color{#35bf28}+0.76\%$
test_step_mdp_speed[True-False-False-False-False] 55.0920μs 18.9501μs 52.7703 KOps/s 52.6343 KOps/s $\color{#35bf28}+0.26\%$
test_step_mdp_speed[False-True-True-True-True] 94.5140μs 49.6910μs 20.1244 KOps/s 20.4505 KOps/s $\color{#d91a1a}-1.59\%$
test_step_mdp_speed[False-True-True-True-False] 70.3740μs 29.0420μs 34.4329 KOps/s 33.1163 KOps/s $\color{#35bf28}+3.98\%$
test_step_mdp_speed[False-True-True-False-True] 2.4272ms 31.3209μs 31.9276 KOps/s 31.9722 KOps/s $\color{#d91a1a}-0.14\%$
test_step_mdp_speed[False-True-True-False-False] 41.4720μs 18.1100μs 55.2181 KOps/s 55.6443 KOps/s $\color{#d91a1a}-0.77\%$
test_step_mdp_speed[False-True-False-True-True] 93.7440μs 51.5562μs 19.3963 KOps/s 19.1971 KOps/s $\color{#35bf28}+1.04\%$
test_step_mdp_speed[False-True-False-True-False] 58.8330μs 32.3998μs 30.8644 KOps/s 29.9179 KOps/s $\color{#35bf28}+3.16\%$
test_step_mdp_speed[False-True-False-False-True] 69.8440μs 33.3225μs 30.0097 KOps/s 30.1544 KOps/s $\color{#d91a1a}-0.48\%$
test_step_mdp_speed[False-True-False-False-False] 49.6820μs 20.1732μs 49.5707 KOps/s 47.8866 KOps/s $\color{#35bf28}+3.52\%$
test_step_mdp_speed[False-False-True-True-True] 90.3650μs 54.8598μs 18.2283 KOps/s 18.1527 KOps/s $\color{#35bf28}+0.42\%$
test_step_mdp_speed[False-False-True-True-False] 69.8930μs 34.9549μs 28.6083 KOps/s 27.7935 KOps/s $\color{#35bf28}+2.93\%$
test_step_mdp_speed[False-False-True-False-True] 80.5530μs 33.7250μs 29.6516 KOps/s 29.6125 KOps/s $\color{#35bf28}+0.13\%$
test_step_mdp_speed[False-False-True-False-False] 52.3830μs 20.8657μs 47.9255 KOps/s 48.0495 KOps/s $\color{#d91a1a}-0.26\%$
test_step_mdp_speed[False-False-False-True-True] 0.1048ms 56.5161μs 17.6941 KOps/s 17.6051 KOps/s $\color{#35bf28}+0.51\%$
test_step_mdp_speed[False-False-False-True-False] 69.8330μs 37.3332μs 26.7858 KOps/s 26.2823 KOps/s $\color{#35bf28}+1.92\%$
test_step_mdp_speed[False-False-False-False-True] 67.1730μs 35.3855μs 28.2601 KOps/s 27.7153 KOps/s $\color{#35bf28}+1.97\%$
test_step_mdp_speed[False-False-False-False-False] 58.7830μs 22.8091μs 43.8421 KOps/s 43.7439 KOps/s $\color{#35bf28}+0.22\%$
test_non_tensor_env_rollout_speed[1000-single-True] 0.8559s 0.7536s 1.3270 Ops/s 1.3269 Ops/s $\color{#35bf28}+0.01\%$
test_non_tensor_env_rollout_speed[1000-single-False] 0.7187s 0.6208s 1.6109 Ops/s 1.6135 Ops/s $\color{#d91a1a}-0.16\%$
test_non_tensor_env_rollout_speed[1000-serial-no-buffers-True] 1.7250s 1.6475s 0.6070 Ops/s 0.6065 Ops/s $\color{#35bf28}+0.07\%$
test_non_tensor_env_rollout_speed[1000-serial-no-buffers-False] 1.5042s 1.4254s 0.7015 Ops/s 0.7016 Ops/s $-0.01\%$
test_non_tensor_env_rollout_speed[1000-serial-buffers-True] 1.9705s 1.8883s 0.5296 Ops/s 0.5267 Ops/s $\color{#35bf28}+0.54\%$
test_non_tensor_env_rollout_speed[1000-serial-buffers-False] 1.7456s 1.6693s 0.5990 Ops/s 0.5943 Ops/s $\color{#35bf28}+0.81\%$
test_non_tensor_env_rollout_speed[1000-parallel-no-buffers-True] 4.6265s 4.5720s 0.2187 Ops/s 0.2214 Ops/s $\color{#d91a1a}-1.19\%$
test_non_tensor_env_rollout_speed[1000-parallel-no-buffers-False] 4.4134s 4.3793s 0.2283 Ops/s 0.2275 Ops/s $\color{#35bf28}+0.37\%$
test_non_tensor_env_rollout_speed[1000-parallel-buffers-True] 1.9515s 1.8531s 0.5396 Ops/s 0.5115 Ops/s $\textbf{\color{#35bf28}+5.50\%}$
test_non_tensor_env_rollout_speed[1000-parallel-buffers-False] 1.6841s 1.5989s 0.6254 Ops/s 0.6311 Ops/s $\color{#d91a1a}-0.90\%$
test_values[generalized_advantage_estimate-True-True] 9.7836ms 9.6252ms 103.8937 Ops/s 102.7139 Ops/s $\color{#35bf28}+1.15\%$
test_values[vec_generalized_advantage_estimate-True-True] 19.9723ms 17.4518ms 57.3008 Ops/s 55.9789 Ops/s $\color{#35bf28}+2.36\%$
test_values[td0_return_estimate-False-False] 0.2308ms 0.1252ms 7.9844 KOps/s 7.9000 KOps/s $\color{#35bf28}+1.07\%$
test_values[td1_return_estimate-False-False] 26.4377ms 25.9841ms 38.4850 Ops/s 38.1172 Ops/s $\color{#35bf28}+0.97\%$
test_values[vec_td1_return_estimate-False-False] 18.5143ms 17.5894ms 56.8523 Ops/s 55.7263 Ops/s $\color{#35bf28}+2.02\%$
test_values[td_lambda_return_estimate-True-False] 38.8565ms 38.4967ms 25.9762 Ops/s 25.5848 Ops/s $\color{#35bf28}+1.53\%$
test_values[vec_td_lambda_return_estimate-True-False] 18.7263ms 17.5847ms 56.8676 Ops/s 56.2933 Ops/s $\color{#35bf28}+1.02\%$
test_gae_speed[generalized_advantage_estimate-False-1-512] 8.7018ms 8.5146ms 117.4458 Ops/s 117.0619 Ops/s $\color{#35bf28}+0.33\%$
test_gae_speed[vec_generalized_advantage_estimate-True-1-512] 1.9149ms 1.4357ms 696.5108 Ops/s 643.0232 Ops/s $\textbf{\color{#35bf28}+8.32\%}$
test_gae_speed[vec_generalized_advantage_estimate-False-1-512] 0.5331ms 0.4123ms 2.4257 KOps/s 2.3745 KOps/s $\color{#35bf28}+2.15\%$
test_gae_speed[vec_generalized_advantage_estimate-True-32-512] 38.0940ms 34.2844ms 29.1678 Ops/s 31.9422 Ops/s $\textbf{\color{#d91a1a}-8.69\%}$
test_gae_speed[vec_generalized_advantage_estimate-False-32-512] 2.1665ms 1.7219ms 580.7655 Ops/s 586.4523 Ops/s $\color{#d91a1a}-0.97\%$
test_dqn_speed[False-None] 1.5047ms 1.3580ms 736.3681 Ops/s 719.6174 Ops/s $\color{#35bf28}+2.33\%$
test_dqn_speed[False-backward] 1.9221ms 1.8712ms 534.4129 Ops/s 520.4364 Ops/s $\color{#35bf28}+2.69\%$
test_dqn_speed[True-None] 1.0711ms 0.5367ms 1.8633 KOps/s 1.7590 KOps/s $\textbf{\color{#35bf28}+5.93\%}$
test_dqn_speed[True-backward] 1.0326ms 0.9880ms 1.0121 KOps/s 908.8394 Ops/s $\textbf{\color{#35bf28}+11.36\%}$
test_dqn_speed[reduce-overhead-None] 0.6315ms 0.5252ms 1.9040 KOps/s 1.8520 KOps/s $\color{#35bf28}+2.81\%$
test_ddpg_speed[False-None] 3.1630ms 2.7769ms 360.1096 Ops/s 366.8750 Ops/s $\color{#d91a1a}-1.84\%$
test_ddpg_speed[False-backward] 4.1396ms 4.0007ms 249.9570 Ops/s 251.1995 Ops/s $\color{#d91a1a}-0.49\%$
test_ddpg_speed[True-None] 1.5530ms 1.3850ms 722.0227 Ops/s 704.3901 Ops/s $\color{#35bf28}+2.50\%$
test_ddpg_speed[True-backward] 2.4030ms 2.3588ms 423.9515 Ops/s 421.3855 Ops/s $\color{#35bf28}+0.61\%$
test_ddpg_speed[reduce-overhead-None] 1.6103ms 1.3788ms 725.2847 Ops/s 722.2930 Ops/s $\color{#35bf28}+0.41\%$
test_sac_speed[False-None] 8.4472ms 7.8887ms 126.7636 Ops/s 126.3622 Ops/s $\color{#35bf28}+0.32\%$
test_sac_speed[False-backward] 11.6807ms 11.1550ms 89.6457 Ops/s 89.7176 Ops/s $\color{#d91a1a}-0.08\%$
test_sac_speed[True-None] 2.3109ms 2.1398ms 467.3243 Ops/s 452.8961 Ops/s $\color{#35bf28}+3.19\%$
test_sac_speed[True-backward] 4.1259ms 4.0056ms 249.6491 Ops/s 247.7975 Ops/s $\color{#35bf28}+0.75\%$
test_sac_speed[reduce-overhead-None] 2.5114ms 2.1274ms 470.0502 Ops/s 451.0358 Ops/s $\color{#35bf28}+4.22\%$
test_redq_speed[False-None] 10.9571ms 10.3404ms 96.7081 Ops/s 94.0680 Ops/s $\color{#35bf28}+2.81\%$
test_redq_speed[False-backward] 21.1370ms 17.8732ms 55.9498 Ops/s 58.1914 Ops/s $\color{#d91a1a}-3.85\%$
test_redq_speed[True-None] 4.7080ms 4.3575ms 229.4869 Ops/s 233.4952 Ops/s $\color{#d91a1a}-1.72\%$
test_redq_speed[True-backward] 10.0526ms 9.7874ms 102.1724 Ops/s 106.3557 Ops/s $\color{#d91a1a}-3.93\%$
test_redq_speed[reduce-overhead-None] 4.8358ms 4.4267ms 225.9009 Ops/s 229.2342 Ops/s $\color{#d91a1a}-1.45\%$
test_redq_deprec_speed[False-None] 11.4406ms 10.9557ms 91.2765 Ops/s 93.8165 Ops/s $\color{#d91a1a}-2.71\%$
test_redq_deprec_speed[False-backward] 16.3431ms 15.8397ms 63.1326 Ops/s 65.3660 Ops/s $\color{#d91a1a}-3.42\%$
test_redq_deprec_speed[True-None] 3.9539ms 3.6685ms 272.5886 Ops/s 276.8422 Ops/s $\color{#d91a1a}-1.54\%$
test_redq_deprec_speed[True-backward] 7.7721ms 7.5325ms 132.7576 Ops/s 132.2917 Ops/s $\color{#35bf28}+0.35\%$
test_redq_deprec_speed[reduce-overhead-None] 4.0160ms 3.6286ms 275.5846 Ops/s 280.3592 Ops/s $\color{#d91a1a}-1.70\%$
test_td3_speed[False-None] 8.4857ms 8.1268ms 123.0490 Ops/s 127.9793 Ops/s $\color{#d91a1a}-3.85\%$
test_td3_speed[False-backward] 11.3370ms 10.8253ms 92.3760 Ops/s 93.8692 Ops/s $\color{#d91a1a}-1.59\%$
test_td3_speed[True-None] 1.8734ms 1.8325ms 545.7151 Ops/s 553.6461 Ops/s $\color{#d91a1a}-1.43\%$
test_td3_speed[True-backward] 3.7296ms 3.6377ms 274.8996 Ops/s 251.4885 Ops/s $\textbf{\color{#35bf28}+9.31\%}$
test_td3_speed[reduce-overhead-None] 1.8335ms 1.7875ms 559.4409 Ops/s 558.7315 Ops/s $\color{#35bf28}+0.13\%$
test_cql_speed[False-None] 28.1171ms 25.7215ms 38.8779 Ops/s 39.5531 Ops/s $\color{#d91a1a}-1.71\%$
test_cql_speed[False-backward] 37.7470ms 34.9665ms 28.5988 Ops/s 28.9108 Ops/s $\color{#d91a1a}-1.08\%$
test_cql_speed[True-None] 12.7156ms 12.3408ms 81.0320 Ops/s 80.2824 Ops/s $\color{#35bf28}+0.93\%$
test_cql_speed[True-backward] 18.8840ms 18.3804ms 54.4058 Ops/s 56.5348 Ops/s $\color{#d91a1a}-3.77\%$
test_cql_speed[reduce-overhead-None] 12.6358ms 12.3643ms 80.8777 Ops/s 80.9382 Ops/s $\color{#d91a1a}-0.07\%$
test_a2c_speed[False-None] 5.6001ms 5.2970ms 188.7879 Ops/s 192.9837 Ops/s $\color{#d91a1a}-2.17\%$
test_a2c_speed[False-backward] 12.0435ms 11.7763ms 84.9163 Ops/s 85.8763 Ops/s $\color{#d91a1a}-1.12\%$
test_a2c_speed[True-None] 4.0743ms 3.6855ms 271.3360 Ops/s 282.0821 Ops/s $\color{#d91a1a}-3.81\%$
test_a2c_speed[True-backward] 8.7649ms 8.5557ms 116.8812 Ops/s 113.0888 Ops/s $\color{#35bf28}+3.35\%$
test_a2c_speed[reduce-overhead-None] 4.1630ms 3.7101ms 269.5370 Ops/s 270.6534 Ops/s $\color{#d91a1a}-0.41\%$
test_ppo_speed[False-None] 6.2237ms 5.9214ms 168.8788 Ops/s 174.5965 Ops/s $\color{#d91a1a}-3.27\%$
test_ppo_speed[False-backward] 12.9143ms 12.6021ms 79.3516 Ops/s 81.9815 Ops/s $\color{#d91a1a}-3.21\%$
test_ppo_speed[True-None] 4.0235ms 3.6531ms 273.7376 Ops/s 275.6998 Ops/s $\color{#d91a1a}-0.71\%$
test_ppo_speed[True-backward] 8.7936ms 8.4534ms 118.2962 Ops/s 120.4557 Ops/s $\color{#d91a1a}-1.79\%$
test_ppo_speed[reduce-overhead-None] 3.7893ms 3.6076ms 277.1945 Ops/s 275.2368 Ops/s $\color{#35bf28}+0.71\%$
test_reinforce_speed[False-None] 4.9983ms 4.5216ms 221.1610 Ops/s 221.5471 Ops/s $\color{#d91a1a}-0.17\%$
test_reinforce_speed[False-backward] 7.5683ms 7.3612ms 135.8483 Ops/s 137.1550 Ops/s $\color{#d91a1a}-0.95\%$
test_reinforce_speed[True-None] 3.2599ms 2.8657ms 348.9522 Ops/s 344.6285 Ops/s $\color{#35bf28}+1.25\%$
test_reinforce_speed[True-backward] 7.9996ms 7.7570ms 128.9163 Ops/s 130.3903 Ops/s $\color{#d91a1a}-1.13\%$
test_reinforce_speed[reduce-overhead-None] 3.2933ms 2.8770ms 347.5852 Ops/s 344.0248 Ops/s $\color{#35bf28}+1.03\%$
test_iql_speed[False-None] 23.3709ms 19.5683ms 51.1031 Ops/s 50.2563 Ops/s $\color{#35bf28}+1.68\%$
test_iql_speed[False-backward] 36.4573ms 30.4454ms 32.8457 Ops/s 33.2830 Ops/s $\color{#d91a1a}-1.31\%$
test_iql_speed[True-None] 9.1074ms 8.5445ms 117.0338 Ops/s 116.0876 Ops/s $\color{#35bf28}+0.82\%$
test_iql_speed[True-backward] 17.0279ms 16.7774ms 59.6040 Ops/s 60.6434 Ops/s $\color{#d91a1a}-1.71\%$
test_iql_speed[reduce-overhead-None] 9.9940ms 8.5899ms 116.4154 Ops/s 113.1260 Ops/s $\color{#35bf28}+2.91\%$
test_rb_sample[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] 5.9811ms 5.8348ms 171.3849 Ops/s 170.2374 Ops/s $\color{#35bf28}+0.67\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 2.7233ms 0.3718ms 2.6893 KOps/s 3.5998 KOps/s $\textbf{\color{#d91a1a}-25.29\%}$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 0.5539ms 0.2986ms 3.3491 KOps/s 3.8635 KOps/s $\textbf{\color{#d91a1a}-13.31\%}$
test_rb_sample[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 5.9908ms 5.6236ms 177.8206 Ops/s 176.4864 Ops/s $\color{#35bf28}+0.76\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 0.8983ms 0.3267ms 3.0607 KOps/s 3.6901 KOps/s $\textbf{\color{#d91a1a}-17.06\%}$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 0.6354ms 0.2860ms 3.4963 KOps/s 3.9302 KOps/s $\textbf{\color{#d91a1a}-11.04\%}$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-sampler6-10000] 1.6200ms 1.3398ms 746.3673 Ops/s 820.6186 Ops/s $\textbf{\color{#d91a1a}-9.05\%}$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-sampler7-10000] 1.6783ms 1.2864ms 777.3415 Ops/s 878.4665 Ops/s $\textbf{\color{#d91a1a}-11.51\%}$
test_rb_sample[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] 9.4219ms 5.8556ms 170.7762 Ops/s 173.3214 Ops/s $\color{#d91a1a}-1.47\%$
test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 1.0613ms 0.4677ms 2.1381 KOps/s 2.3484 KOps/s $\textbf{\color{#d91a1a}-8.95\%}$
test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 0.8191ms 0.4809ms 2.0796 KOps/s 2.4992 KOps/s $\textbf{\color{#d91a1a}-16.79\%}$
test_rb_iterate[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] 5.8217ms 5.6501ms 176.9890 Ops/s 177.1088 Ops/s $\color{#d91a1a}-0.07\%$
test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 1.8505ms 0.3621ms 2.7620 KOps/s 3.3654 KOps/s $\textbf{\color{#d91a1a}-17.93\%}$
test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 0.5378ms 0.3508ms 2.8503 KOps/s 3.7923 KOps/s $\textbf{\color{#d91a1a}-24.84\%}$
test_rb_iterate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 5.8438ms 5.5778ms 179.2806 Ops/s 177.7050 Ops/s $\color{#35bf28}+0.89\%$
test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 1.8205ms 0.3608ms 2.7713 KOps/s 3.6426 KOps/s $\textbf{\color{#d91a1a}-23.92\%}$
test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 0.5586ms 0.3450ms 2.8984 KOps/s 2.9173 KOps/s $\color{#d91a1a}-0.65\%$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] 5.8418ms 5.7338ms 174.4053 Ops/s 173.3857 Ops/s $\color{#35bf28}+0.59\%$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 0.9416ms 0.5067ms 1.9736 KOps/s 2.2528 KOps/s $\textbf{\color{#d91a1a}-12.39\%}$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 0.6684ms 0.4903ms 2.0397 KOps/s 2.2289 KOps/s $\textbf{\color{#d91a1a}-8.49\%}$
test_rb_populate[TensorDictReplayBuffer-ListStorage-RandomSampler-400] 6.3856ms 4.9675ms 201.3099 Ops/s 59.0645 Ops/s $\textbf{\color{#35bf28}+240.83\%}$
test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-400] 5.2321ms 2.1518ms 464.7323 Ops/s 511.5933 Ops/s $\textbf{\color{#d91a1a}-9.16\%}$
test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400] 2.1538ms 1.2032ms 831.0991 Ops/s 1.1344 KOps/s $\textbf{\color{#d91a1a}-26.73\%}$
test_rb_populate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-400] 0.5531s 16.1676ms 61.8520 Ops/s 196.4381 Ops/s $\textbf{\color{#d91a1a}-68.51\%}$
test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-400] 4.2261ms 1.7955ms 556.9581 Ops/s 533.1057 Ops/s $\color{#35bf28}+4.47\%$
test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-400] 1.0446ms 0.8517ms 1.1741 KOps/s 792.0274 Ops/s $\textbf{\color{#35bf28}+48.24\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-ListStorage-None-400] 9.4670ms 5.2038ms 192.1689 Ops/s 59.6910 Ops/s $\textbf{\color{#35bf28}+221.94\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400] 9.0958ms 2.0641ms 484.4639 Ops/s 528.5379 Ops/s $\textbf{\color{#d91a1a}-8.34\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400] 1.2263ms 0.9956ms 1.0044 KOps/s 958.6802 Ops/s $\color{#35bf28}+4.77\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-True] 37.0664ms 34.9089ms 28.6460 Ops/s 28.5139 Ops/s $\color{#35bf28}+0.46\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-False] 19.2390ms 17.6433ms 56.6788 Ops/s 56.4156 Ops/s $\color{#35bf28}+0.47\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-100000-10000-100-True] 38.9538ms 36.4270ms 27.4522 Ops/s 27.6354 Ops/s $\color{#d91a1a}-0.66\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-100000-10000-100-False] 19.6444ms 18.0549ms 55.3868 Ops/s 54.9392 Ops/s $\color{#35bf28}+0.81\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-1000000-10000-100-True] 39.2744ms 37.7138ms 26.5155 Ops/s 26.4797 Ops/s $\color{#35bf28}+0.14\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-1000000-10000-100-False] 20.5928ms 19.3631ms 51.6445 Ops/s 51.4096 Ops/s $\color{#35bf28}+0.46\%$
test_storage_write_lazystack[50-img_shape0-small] 0.8371ms 0.2146ms 4.6600 KOps/s 4.5355 KOps/s $\color{#35bf28}+2.75\%$
test_storage_write_lazystack[100-img_shape1-atari] 1.7745ms 1.4174ms 705.5208 Ops/s 715.1122 Ops/s $\color{#d91a1a}-1.34\%$
test_storage_write_lazystack[100-img_shape2-large_img] 2.7466ms 2.3238ms 430.3206 Ops/s 416.5543 Ops/s $\color{#35bf28}+3.30\%$
test_storage_write_lazystack[200-img_shape3-large_batch] 3.0803ms 2.9234ms 342.0659 Ops/s 341.5278 Ops/s $\color{#35bf28}+0.16\%$
test_storage_write_contiguous[50-img_shape0-small] 0.2078ms 0.1300ms 7.6922 KOps/s 7.6976 KOps/s $\color{#d91a1a}-0.07\%$
test_storage_write_contiguous[100-img_shape1-atari] 0.3282ms 0.1810ms 5.5236 KOps/s 5.1793 KOps/s $\textbf{\color{#35bf28}+6.65\%}$
test_storage_write_contiguous[100-img_shape2-large_img] 1.9403ms 1.7644ms 566.7723 Ops/s 573.5394 Ops/s $\color{#d91a1a}-1.18\%$
test_storage_write_contiguous[200-img_shape3-large_batch] 1.5837ms 1.2805ms 780.9634 Ops/s 764.8914 Ops/s $\color{#35bf28}+2.10\%$
test_collector_stack_then_write[50-img_shape0-small] 1.2418ms 1.0863ms 920.5940 Ops/s 916.8938 Ops/s $\color{#35bf28}+0.40\%$
test_collector_stack_then_write[100-img_shape1-atari] 3.6867ms 3.5071ms 285.1383 Ops/s 286.4685 Ops/s $\color{#d91a1a}-0.46\%$
test_collector_stack_then_write[100-img_shape2-large_img] 11.0627ms 5.6249ms 177.7824 Ops/s 180.7879 Ops/s $\color{#d91a1a}-1.66\%$
test_collector_stack_then_write[200-img_shape3-large_batch] 7.0356ms 6.8929ms 145.0758 Ops/s 141.5778 Ops/s $\color{#35bf28}+2.47\%$
test_collector_lazystack_then_write[50-img_shape0-small] 0.4243ms 0.2693ms 3.7136 KOps/s 3.7026 KOps/s $\color{#35bf28}+0.30\%$
test_collector_lazystack_then_write[100-img_shape1-atari] 1.6989ms 1.5296ms 653.7488 Ops/s 668.6264 Ops/s $\color{#d91a1a}-2.23\%$
test_collector_lazystack_then_write[100-img_shape2-large_img] 2.8282ms 2.4438ms 409.1915 Ops/s 399.4314 Ops/s $\color{#35bf28}+2.44\%$
test_collector_lazystack_then_write[200-img_shape3-large_batch] 3.2904ms 3.1412ms 318.3507 Ops/s 320.5043 Ops/s $\color{#d91a1a}-0.67\%$
test_collector_without_rb[100-img_shape0-atari] 33.7639ms 33.1667ms 30.1508 Ops/s 30.3116 Ops/s $\color{#d91a1a}-0.53\%$
test_collector_without_rb[200-img_shape1-large_batch] 66.9921ms 65.4223ms 15.2853 Ops/s 15.3618 Ops/s $\color{#d91a1a}-0.50\%$
test_collector_with_rb[100-img_shape0-atari] 38.5026ms 37.6547ms 26.5571 Ops/s 26.5951 Ops/s $\color{#d91a1a}-0.14\%$
test_collector_with_rb[200-img_shape1-large_batch] 74.4906ms 73.6932ms 13.5698 Ops/s 13.6345 Ops/s $\color{#d91a1a}-0.47\%$
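
For reading the table above: the Change column appears to be the relative difference between the Ops column and the Ops on Repo HEAD column, and the bold rows (as well as the Improved and Worsened totals) seem to correspond to changes larger than 5% in either direction. Both points are inferences from the numbers shown here, not documented behaviour of the benchmark bot. A minimal sketch under those assumptions, with hypothetical helper names:

```python
# Factors for the unit suffixes that appear in the two "Ops" columns.
UNIT_FACTORS = {"Ops/s": 1.0, "KOps/s": 1e3}


def to_ops_per_second(value, unit):
    """Normalise a throughput figure to plain Ops/s."""
    return value * UNIT_FACTORS[unit]


def relative_change(ops_pr, ops_head):
    """Percent change of the PR throughput relative to repo HEAD."""
    return (ops_pr - ops_head) / ops_head * 100.0


def classify(change_pct, threshold_pct=5.0):
    # threshold_pct is an assumption inferred from which rows are bold.
    if change_pct > threshold_pct:
        return "improved"
    if change_pct < -threshold_pct:
        return "worsened"
    return "unchanged"


# test_tensor_to_bytestream_speed[pickle]: 12.8291 KOps/s vs 12.4608 KOps/s
pickle_change = relative_change(
    to_ops_per_second(12.8291, "KOps/s"),
    to_ops_per_second(12.4608, "KOps/s"),
)
# test_dqn_speed[True-backward]: 1.0121 KOps/s vs 908.8394 Ops/s (mixed units)
dqn_change = relative_change(
    to_ops_per_second(1.0121, "KOps/s"),
    to_ops_per_second(908.8394, "Ops/s"),
)

print(f"{pickle_change:+.2f}%", classify(pickle_change))  # +2.96% unchanged
print(f"{dqn_change:+.2f}%", classify(dqn_change))  # +11.36% improved
```

Most rows fall well inside the assumed threshold, which is worth keeping in mind when judging whether a given delta reflects the change in this PR or run-to-run noise.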

Contributor

github-actions bot commented Feb 6, 2026

$\color{#D29922}\textsf{\Large⚠\kern{0.2cm}\normalsize Warning}$ Result of GPU Benchmark Tests

Total Benchmarks: 172. Improved: $\large\color{#35bf28}15$. Worsened: $\large\color{#d91a1a}11$.

Detailed results:
Name Max Mean Ops Ops on Repo HEAD Change
test_tensor_to_bytestream_speed[pickle] 83.9278μs 81.6100μs 12.2534 KOps/s 12.4390 KOps/s $\color{#d91a1a}-1.49\%$
test_tensor_to_bytestream_speed[torch.save] 0.1413ms 0.1408ms 7.1042 KOps/s 7.1823 KOps/s $\color{#d91a1a}-1.09\%$
test_tensor_to_bytestream_speed[untyped_storage] 0.1091s 0.1088s 9.1940 Ops/s 9.1406 Ops/s $\color{#35bf28}+0.58\%$
test_tensor_to_bytestream_speed[numpy] 2.7011μs 2.6956μs 370.9682 KOps/s 372.4892 KOps/s $\color{#d91a1a}-0.41\%$
test_tensor_to_bytestream_speed[safetensors] 37.4997μs 37.2678μs 26.8328 KOps/s 26.1360 KOps/s $\color{#35bf28}+2.67\%$
test_simple 0.8022s 0.7978s 1.2534 Ops/s 1.2236 Ops/s $\color{#35bf28}+2.44\%$
test_transformed 1.5459s 1.4466s 0.6913 Ops/s 0.6872 Ops/s $\color{#35bf28}+0.59\%$
test_serial 2.3907s 2.3101s 0.4329 Ops/s 0.4322 Ops/s $\color{#35bf28}+0.16\%$
test_parallel 1.9089s 1.8084s 0.5530 Ops/s 0.5539 Ops/s $\color{#d91a1a}-0.17\%$
test_step_mdp_speed[True-True-True-True-True] 0.2568ms 45.3964μs 22.0282 KOps/s 21.9010 KOps/s $\color{#35bf28}+0.58\%$
test_step_mdp_speed[True-True-True-True-False] 49.7110μs 25.5063μs 39.2060 KOps/s 39.7021 KOps/s $\color{#d91a1a}-1.25\%$
test_step_mdp_speed[True-True-True-False-True] 65.1920μs 24.8562μs 40.2314 KOps/s 39.5159 KOps/s $\color{#35bf28}+1.81\%$
test_step_mdp_speed[True-True-True-False-False] 64.0710μs 14.1194μs 70.8247 KOps/s 71.0671 KOps/s $\color{#d91a1a}-0.34\%$
test_step_mdp_speed[True-True-False-True-True] 83.6020μs 47.7708μs 20.9333 KOps/s 20.8730 KOps/s $\color{#35bf28}+0.29\%$
test_step_mdp_speed[True-True-False-True-False] 68.0720μs 27.8268μs 35.9366 KOps/s 35.8279 KOps/s $\color{#35bf28}+0.30\%$
test_step_mdp_speed[True-True-False-False-True] 61.4710μs 27.3773μs 36.5266 KOps/s 36.6631 KOps/s $\color{#d91a1a}-0.37\%$
test_step_mdp_speed[True-True-False-False-False] 54.6410μs 16.6506μs 60.0581 KOps/s 60.0576 KOps/s $+0.00\%$
test_step_mdp_speed[True-False-True-True-True] 78.7610μs 50.2292μs 19.9087 KOps/s 20.0103 KOps/s $\color{#d91a1a}-0.51\%$
test_step_mdp_speed[True-False-True-True-False] 63.7620μs 31.0479μs 32.2083 KOps/s 32.5707 KOps/s $\color{#d91a1a}-1.11\%$
test_step_mdp_speed[True-False-True-False-True] 68.4510μs 27.5946μs 36.2390 KOps/s 36.1455 KOps/s $\color{#35bf28}+0.26\%$
test_step_mdp_speed[True-False-True-False-False] 43.0010μs 16.7498μs 59.7023 KOps/s 59.1109 KOps/s $\color{#35bf28}+1.00\%$
test_step_mdp_speed[True-False-False-True-True] 90.2720μs 52.3084μs 19.1174 KOps/s 18.4665 KOps/s $\color{#35bf28}+3.52\%$
test_step_mdp_speed[True-False-False-True-False] 65.6120μs 32.9826μs 30.3190 KOps/s 29.2894 KOps/s $\color{#35bf28}+3.52\%$
test_step_mdp_speed[True-False-False-False-True] 67.4320μs 29.8287μs 33.5248 KOps/s 33.0590 KOps/s $\color{#35bf28}+1.41\%$
test_step_mdp_speed[True-False-False-False-False] 44.5210μs 19.2390μs 51.9779 KOps/s 50.3711 KOps/s $\color{#35bf28}+3.19\%$
test_step_mdp_speed[False-True-True-True-True] 84.4320μs 51.1544μs 19.5487 KOps/s 19.9300 KOps/s $\color{#d91a1a}-1.91\%$
test_step_mdp_speed[False-True-True-True-False] 62.1210μs 30.8492μs 32.4158 KOps/s 32.0654 KOps/s $\color{#35bf28}+1.09\%$
test_step_mdp_speed[False-True-True-False-True] 2.3246ms 32.2782μs 30.9807 KOps/s 31.1233 KOps/s $\color{#d91a1a}-0.46\%$
test_step_mdp_speed[False-True-True-False-False] 49.2110μs 18.4495μs 54.2020 KOps/s 53.9321 KOps/s $\color{#35bf28}+0.50\%$
test_step_mdp_speed[False-True-False-True-True] 83.2820μs 53.4655μs 18.7037 KOps/s 18.8229 KOps/s $\color{#d91a1a}-0.63\%$
test_step_mdp_speed[False-True-False-True-False] 69.0110μs 33.2397μs 30.0845 KOps/s 29.9010 KOps/s $\color{#35bf28}+0.61\%$
test_step_mdp_speed[False-True-False-False-True] 77.3910μs 34.4001μs 29.0697 KOps/s 29.8660 KOps/s $\color{#d91a1a}-2.67\%$
test_step_mdp_speed[False-True-False-False-False] 58.5010μs 20.8886μs 47.8729 KOps/s 47.4674 KOps/s $\color{#35bf28}+0.85\%$
test_step_mdp_speed[False-False-True-True-True] 93.8020μs 56.4749μs 17.7070 KOps/s 17.9712 KOps/s $\color{#d91a1a}-1.47\%$
test_step_mdp_speed[False-False-True-True-False] 65.6310μs 36.5024μs 27.3955 KOps/s 27.4907 KOps/s $\color{#d91a1a}-0.35\%$
test_step_mdp_speed[False-False-True-False-True] 72.3610μs 34.0082μs 29.4047 KOps/s 28.9270 KOps/s $\color{#35bf28}+1.65\%$
test_step_mdp_speed[False-False-True-False-False] 63.7210μs 21.1617μs 47.2552 KOps/s 47.6593 KOps/s $\color{#d91a1a}-0.85\%$
test_step_mdp_speed[False-False-False-True-True] 97.1320μs 58.0753μs 17.2190 KOps/s 17.3696 KOps/s $\color{#d91a1a}-0.87\%$
test_step_mdp_speed[False-False-False-True-False] 68.0710μs 39.0007μs 25.6406 KOps/s 25.8101 KOps/s $\color{#d91a1a}-0.66\%$
test_step_mdp_speed[False-False-False-False-True] 75.6810μs 36.0346μs 27.7511 KOps/s 27.3769 KOps/s $\color{#35bf28}+1.37\%$
test_step_mdp_speed[False-False-False-False-False] 46.8910μs 23.8776μs 41.8802 KOps/s 41.7286 KOps/s $\color{#35bf28}+0.36\%$
test_non_tensor_env_rollout_speed[1000-single-True] 0.8677s 0.7684s 1.3014 Ops/s 1.3028 Ops/s $\color{#d91a1a}-0.11\%$
test_non_tensor_env_rollout_speed[1000-single-False] 0.7301s 0.6334s 1.5789 Ops/s 1.5854 Ops/s $\color{#d91a1a}-0.41\%$
test_non_tensor_env_rollout_speed[1000-serial-no-buffers-True] 1.7567s 1.6741s 0.5973 Ops/s 0.5945 Ops/s $\color{#35bf28}+0.47\%$
test_non_tensor_env_rollout_speed[1000-serial-no-buffers-False] 1.5272s 1.4508s 0.6893 Ops/s 0.6849 Ops/s $\color{#35bf28}+0.63\%$
test_non_tensor_env_rollout_speed[1000-serial-buffers-True] 1.9947s 1.9162s 0.5219 Ops/s 0.5159 Ops/s $\color{#35bf28}+1.15\%$
test_non_tensor_env_rollout_speed[1000-serial-buffers-False] 1.7751s 1.6898s 0.5918 Ops/s 0.5875 Ops/s $\color{#35bf28}+0.73\%$
test_non_tensor_env_rollout_speed[1000-parallel-no-buffers-True] 4.7951s 4.6458s 0.2152 Ops/s 0.2161 Ops/s $\color{#d91a1a}-0.42\%$
test_non_tensor_env_rollout_speed[1000-parallel-no-buffers-False] 4.6379s 4.4896s 0.2227 Ops/s 0.2225 Ops/s $\color{#35bf28}+0.12\%$
test_non_tensor_env_rollout_speed[1000-parallel-buffers-True] 1.9814s 1.8819s 0.5314 Ops/s 0.5216 Ops/s $\color{#35bf28}+1.88\%$
test_non_tensor_env_rollout_speed[1000-parallel-buffers-False] 1.7196s 1.6172s 0.6184 Ops/s 0.6164 Ops/s $\color{#35bf28}+0.31\%$
test_values[generalized_advantage_estimate-True-True] 20.6052ms 20.0933ms 49.7680 Ops/s 49.7009 Ops/s $\color{#35bf28}+0.13\%$
test_values[vec_generalized_advantage_estimate-True-True] 0.1445s 3.7984ms 263.2655 Ops/s 269.6266 Ops/s $\color{#d91a1a}-2.36\%$
test_values[td0_return_estimate-False-False] 0.1070ms 81.5733μs 12.2589 KOps/s 12.1981 KOps/s $\color{#35bf28}+0.50\%$
test_values[td1_return_estimate-False-False] 49.3758ms 47.4641ms 21.0686 Ops/s 20.9779 Ops/s $\color{#35bf28}+0.43\%$
test_values[vec_td1_return_estimate-False-False] 1.2836ms 1.0745ms 930.6544 Ops/s 924.5530 Ops/s $\color{#35bf28}+0.66\%$
test_values[td_lambda_return_estimate-True-False] 81.2111ms 78.1385ms 12.7978 Ops/s 12.7387 Ops/s $\color{#35bf28}+0.46\%$
test_values[vec_td_lambda_return_estimate-True-False] 1.2582ms 1.0709ms 933.7830 Ops/s 927.2909 Ops/s $\color{#35bf28}+0.70\%$
test_gae_speed[generalized_advantage_estimate-False-1-512] 20.6400ms 20.3010ms 49.2586 Ops/s 48.8625 Ops/s $\color{#35bf28}+0.81\%$
test_gae_speed[vec_generalized_advantage_estimate-True-1-512] 1.0161ms 0.7445ms 1.3432 KOps/s 1.3315 KOps/s $\color{#35bf28}+0.87\%$
test_gae_speed[vec_generalized_advantage_estimate-False-1-512] 0.7173ms 0.6666ms 1.5002 KOps/s 1.4864 KOps/s $\color{#35bf28}+0.93\%$
test_gae_speed[vec_generalized_advantage_estimate-True-32-512] 1.5706ms 1.4838ms 673.9261 Ops/s 674.6681 Ops/s $\color{#d91a1a}-0.11\%$
test_gae_speed[vec_generalized_advantage_estimate-False-32-512] 0.7292ms 0.6827ms 1.4648 KOps/s 1.4530 KOps/s $\color{#35bf28}+0.81\%$
test_dqn_speed[False-None] 1.6357ms 1.5273ms 654.7505 Ops/s 654.4231 Ops/s $\color{#35bf28}+0.05\%$
test_dqn_speed[False-backward] 2.1934ms 2.1481ms 465.5320 Ops/s 464.8767 Ops/s $\color{#35bf28}+0.14\%$
test_dqn_speed[True-None] 0.7221ms 0.5644ms 1.7718 KOps/s 1.7529 KOps/s $\color{#35bf28}+1.08\%$
test_dqn_speed[True-backward] 1.3254ms 1.2113ms 825.5448 Ops/s 902.1882 Ops/s $\textbf{\color{#d91a1a}-8.50\%}$
test_dqn_speed[reduce-overhead-None] 0.6483ms 0.5863ms 1.7055 KOps/s 1.6475 KOps/s $\color{#35bf28}+3.52\%$
test_ddpg_speed[False-None] 3.2684ms 2.8713ms 348.2720 Ops/s 345.6010 Ops/s $\color{#35bf28}+0.77\%$
test_ddpg_speed[False-backward] 4.6371ms 4.2636ms 234.5438 Ops/s 242.3560 Ops/s $\color{#d91a1a}-3.22\%$
test_ddpg_speed[True-None] 1.5078ms 1.3284ms 752.7580 Ops/s 747.9624 Ops/s $\color{#35bf28}+0.64\%$
test_ddpg_speed[True-backward] 2.6184ms 2.5349ms 394.4929 Ops/s 416.8307 Ops/s $\textbf{\color{#d91a1a}-5.36\%}$
test_ddpg_speed[reduce-overhead-None] 1.4895ms 1.3581ms 736.3285 Ops/s 737.3089 Ops/s $\color{#d91a1a}-0.13\%$
test_sac_speed[False-None] 8.7908ms 8.2569ms 121.1110 Ops/s 121.2842 Ops/s $\color{#d91a1a}-0.14\%$
test_sac_speed[False-backward] 11.8945ms 11.3710ms 87.9428 Ops/s 89.6973 Ops/s $\color{#d91a1a}-1.96\%$
test_sac_speed[True-None] 1.9660ms 1.8412ms 543.1228 Ops/s 545.0603 Ops/s $\color{#d91a1a}-0.36\%$
test_sac_speed[True-backward] 3.6537ms 3.5870ms 278.7814 Ops/s 274.1408 Ops/s $\color{#35bf28}+1.69\%$
test_sac_speed[reduce-overhead-None] 18.5390ms 10.7757ms 92.8015 Ops/s 80.7939 Ops/s $\textbf{\color{#35bf28}+14.86\%}$
test_redq_deprec_speed[False-None] 10.0329ms 9.2026ms 108.6650 Ops/s 106.3963 Ops/s $\color{#35bf28}+2.13\%$
test_redq_deprec_speed[False-backward] 12.9545ms 12.4773ms 80.1457 Ops/s 78.7387 Ops/s $\color{#35bf28}+1.79\%$
test_redq_deprec_speed[True-None] 2.7228ms 2.5510ms 391.9972 Ops/s 395.3414 Ops/s $\color{#d91a1a}-0.85\%$
test_redq_deprec_speed[True-backward] 4.6510ms 4.2910ms 233.0451 Ops/s 235.4562 Ops/s $\color{#d91a1a}-1.02\%$
test_redq_deprec_speed[reduce-overhead-None] 16.2935ms 9.8528ms 101.4939 Ops/s 100.5334 Ops/s $\color{#35bf28}+0.96\%$
test_td3_speed[False-None] 8.2025ms 8.0920ms 123.5782 Ops/s 122.7390 Ops/s $\color{#35bf28}+0.68\%$
test_td3_speed[False-backward] 11.0353ms 10.5865ms 94.4602 Ops/s 95.0467 Ops/s $\color{#d91a1a}-0.62\%$
test_td3_speed[True-None] 1.6868ms 1.6572ms 603.4258 Ops/s 603.0164 Ops/s $\color{#35bf28}+0.07\%$
test_td3_speed[True-backward] 3.3126ms 3.2521ms 307.4892 Ops/s 314.3987 Ops/s $\color{#d91a1a}-2.20\%$
test_td3_speed[reduce-overhead-None] 83.4214ms 24.8254ms 40.2813 Ops/s 39.8455 Ops/s $\color{#35bf28}+1.09\%$
test_cql_speed[False-None] 17.3609ms 17.1205ms 58.4094 Ops/s 57.8359 Ops/s $\color{#35bf28}+0.99\%$
test_cql_speed[False-backward] 23.3668ms 22.5260ms 44.3931 Ops/s 44.6785 Ops/s $\color{#d91a1a}-0.64\%$
test_cql_speed[True-None] 3.9117ms 3.3351ms 299.8371 Ops/s 301.1543 Ops/s $\color{#d91a1a}-0.44\%$
test_cql_speed[True-backward] 5.8753ms 5.5186ms 181.2057 Ops/s 184.5320 Ops/s $\color{#d91a1a}-1.80\%$
test_cql_speed[reduce-overhead-None] 18.9448ms 12.0463ms 83.0128 Ops/s 83.5172 Ops/s $\color{#d91a1a}-0.60\%$
test_a2c_speed[False-None] 4.2679ms 3.2321ms 309.3919 Ops/s 309.1116 Ops/s $\color{#35bf28}+0.09\%$
test_a2c_speed[False-backward] 6.5833ms 6.2274ms 160.5814 Ops/s 158.7673 Ops/s $\color{#35bf28}+1.14\%$
test_a2c_speed[True-None] 1.4250ms 1.3536ms 738.7562 Ops/s 734.4902 Ops/s $\color{#35bf28}+0.58\%$
test_a2c_speed[True-backward] 3.8978ms 3.1324ms 319.2437 Ops/s 319.9171 Ops/s $\color{#d91a1a}-0.21\%$
test_a2c_speed[reduce-overhead-None] 1.0833ms 0.9963ms 1.0038 KOps/s 1.0208 KOps/s $\color{#d91a1a}-1.67\%$
test_ppo_speed[False-None] 3.9445ms 3.7950ms 263.5017 Ops/s 260.6994 Ops/s $\color{#35bf28}+1.07\%$
test_ppo_speed[False-backward] 7.3520ms 6.9834ms 143.1964 Ops/s 143.0382 Ops/s $\color{#35bf28}+0.11\%$
test_ppo_speed[True-None] 1.5231ms 1.4503ms 689.5038 Ops/s 699.4508 Ops/s $\color{#d91a1a}-1.42\%$
test_ppo_speed[True-backward] 3.3485ms 3.2833ms 304.5745 Ops/s 318.8231 Ops/s $\color{#d91a1a}-4.47\%$
test_ppo_speed[reduce-overhead-None] 1.5063ms 1.0573ms 945.8311 Ops/s 940.0537 Ops/s $\color{#35bf28}+0.61\%$
test_reinforce_speed[False-None] 2.6755ms 2.2490ms 444.6417 Ops/s 435.8025 Ops/s $\color{#35bf28}+2.03\%$
test_reinforce_speed[False-backward] 3.8233ms 3.3667ms 297.0238 Ops/s 294.1458 Ops/s $\color{#35bf28}+0.98\%$
test_reinforce_speed[True-None] 1.4044ms 1.3079ms 764.5825 Ops/s 764.3224 Ops/s $\color{#35bf28}+0.03\%$
test_reinforce_speed[True-backward] 3.1580ms 3.0445ms 328.4616 Ops/s 327.5412 Ops/s $\color{#35bf28}+0.28\%$
test_reinforce_speed[reduce-overhead-None] 17.5608ms 9.6211ms 103.9386 Ops/s 102.8966 Ops/s $\color{#35bf28}+1.01\%$
test_iql_speed[False-None] 9.8896ms 9.2863ms 107.6851 Ops/s 106.4654 Ops/s $\color{#35bf28}+1.15\%$
test_iql_speed[False-backward] 13.0039ms 12.8202ms 78.0017 Ops/s 77.0028 Ops/s $\color{#35bf28}+1.30\%$
test_iql_speed[True-None] 2.6280ms 2.2110ms 452.2741 Ops/s 450.4900 Ops/s $\color{#35bf28}+0.40\%$
test_iql_speed[True-backward] 5.0469ms 4.8977ms 204.1757 Ops/s 203.1529 Ops/s $\color{#35bf28}+0.50\%$
test_iql_speed[reduce-overhead-None] 17.8182ms 10.4342ms 95.8386 Ops/s 92.8798 Ops/s $\color{#35bf28}+3.19\%$
test_rb_sample[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] 6.1263ms 5.9767ms 167.3170 Ops/s 167.4734 Ops/s $\color{#d91a1a}-0.09\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 0.7976ms 0.3643ms 2.7453 KOps/s 3.6209 KOps/s $\textbf{\color{#d91a1a}-24.18\%}$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 0.6310ms 0.2922ms 3.4225 KOps/s 3.8394 KOps/s $\textbf{\color{#d91a1a}-10.86\%}$
test_rb_sample[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 6.3806ms 5.7910ms 172.6811 Ops/s 175.2266 Ops/s $\color{#d91a1a}-1.45\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 0.9243ms 0.3259ms 3.0686 KOps/s 2.7888 KOps/s $\textbf{\color{#35bf28}+10.03\%}$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 0.5642ms 0.3147ms 3.1781 KOps/s 2.9437 KOps/s $\textbf{\color{#35bf28}+7.96\%}$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-sampler6-10000] 1.6878ms 1.2916ms 774.2198 Ops/s 731.6972 Ops/s $\textbf{\color{#35bf28}+5.81\%}$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-sampler7-10000] 1.6659ms 1.1426ms 875.2053 Ops/s 772.2239 Ops/s $\textbf{\color{#35bf28}+13.34\%}$
test_rb_sample[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] 6.4050ms 5.9770ms 167.3080 Ops/s 165.9263 Ops/s $\color{#35bf28}+0.83\%$
test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 1.0433ms 0.4504ms 2.2205 KOps/s 2.0584 KOps/s $\textbf{\color{#35bf28}+7.87\%}$
test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 0.6146ms 0.4248ms 2.3539 KOps/s 2.1735 KOps/s $\textbf{\color{#35bf28}+8.30\%}$
test_rb_iterate[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] 6.2966ms 5.8692ms 170.3809 Ops/s 168.6440 Ops/s $\color{#35bf28}+1.03\%$
test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 0.9388ms 0.3597ms 2.7802 KOps/s 3.5726 KOps/s $\textbf{\color{#d91a1a}-22.18\%}$
test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 0.5707ms 0.3446ms 2.9017 KOps/s 3.3050 KOps/s $\textbf{\color{#d91a1a}-12.20\%}$
test_rb_iterate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 6.3700ms 5.7991ms 172.4398 Ops/s 170.6877 Ops/s $\color{#35bf28}+1.03\%$
test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 1.0841ms 0.2742ms 3.6475 KOps/s 2.9914 KOps/s $\textbf{\color{#35bf28}+21.94\%}$
test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 0.4547ms 0.2610ms 3.8316 KOps/s 3.2264 KOps/s $\textbf{\color{#35bf28}+18.76\%}$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] 6.4368ms 6.0036ms 166.5678 Ops/s 165.7261 Ops/s $\color{#35bf28}+0.51\%$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 0.5644s 1.3123ms 762.0095 Ops/s 2.1024 KOps/s $\textbf{\color{#d91a1a}-63.76\%}$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 0.6746ms 0.4894ms 2.0435 KOps/s 2.4272 KOps/s $\textbf{\color{#d91a1a}-15.81\%}$
test_rb_populate[TensorDictReplayBuffer-ListStorage-RandomSampler-400] 6.5774ms 5.1157ms 195.4784 Ops/s 196.0941 Ops/s $\color{#d91a1a}-0.31\%$
test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-400] 8.5087ms 1.9715ms 507.2308 Ops/s 439.6075 Ops/s $\textbf{\color{#35bf28}+15.38\%}$
test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400] 9.7156ms 1.2871ms 776.9621 Ops/s 998.7247 Ops/s $\textbf{\color{#d91a1a}-22.20\%}$
test_rb_populate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-400] 6.9391ms 5.0270ms 198.9249 Ops/s 49.8469 Ops/s $\textbf{\color{#35bf28}+299.07\%}$
test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-400] 4.0630ms 1.8018ms 555.0058 Ops/s 479.8795 Ops/s $\textbf{\color{#35bf28}+15.66\%}$
test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-400] 1.0415ms 0.9119ms 1.0966 KOps/s 810.0927 Ops/s $\textbf{\color{#35bf28}+35.36\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-ListStorage-None-400] 0.5375s 16.0212ms 62.4174 Ops/s 185.7242 Ops/s $\textbf{\color{#d91a1a}-66.39\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400] 4.1428ms 1.9396ms 515.5643 Ops/s 466.1791 Ops/s $\textbf{\color{#35bf28}+10.59\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400] 2.2430ms 1.1219ms 891.3702 Ops/s 919.0262 Ops/s $\color{#d91a1a}-3.01\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-True] 37.4228ms 35.1599ms 28.4415 Ops/s 28.0314 Ops/s $\color{#35bf28}+1.46\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-False] 19.3446ms 17.7536ms 56.3266 Ops/s 55.7369 Ops/s $\color{#35bf28}+1.06\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-100000-10000-100-True] 40.2737ms 36.7617ms 27.2022 Ops/s 27.1048 Ops/s $\color{#35bf28}+0.36\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-100000-10000-100-False] 19.6009ms 17.9238ms 55.7916 Ops/s 55.0019 Ops/s $\color{#35bf28}+1.44\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-1000000-10000-100-True] 40.2746ms 38.4836ms 25.9851 Ops/s 25.7638 Ops/s $\color{#35bf28}+0.86\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-1000000-10000-100-False] 20.7312ms 19.5503ms 51.1501 Ops/s 50.9672 Ops/s $\color{#35bf28}+0.36\%$
test_storage_write_lazystack[50-img_shape0-small] 0.8738ms 0.2183ms 4.5815 KOps/s 4.6464 KOps/s $\color{#d91a1a}-1.40\%$
test_storage_write_lazystack[100-img_shape1-atari] 1.5600ms 1.4119ms 708.2603 Ops/s 708.5095 Ops/s $\color{#d91a1a}-0.04\%$
test_storage_write_lazystack[100-img_shape2-large_img] 2.5242ms 2.2994ms 434.8922 Ops/s 424.1399 Ops/s $\color{#35bf28}+2.54\%$
test_storage_write_lazystack[200-img_shape3-large_batch] 3.3289ms 2.8933ms 345.6202 Ops/s 342.6124 Ops/s $\color{#35bf28}+0.88\%$
test_storage_write_contiguous[50-img_shape0-small] 0.6349ms 0.1604ms 6.2363 KOps/s 6.1801 KOps/s $\color{#35bf28}+0.91\%$
test_storage_write_contiguous[100-img_shape1-atari] 0.3890ms 0.2200ms 4.5449 KOps/s 4.3750 KOps/s $\color{#35bf28}+3.88\%$
test_storage_write_contiguous[100-img_shape2-large_img] 1.8597ms 1.7040ms 586.8392 Ops/s 547.3499 Ops/s $\textbf{\color{#35bf28}+7.21\%}$
test_storage_write_contiguous[200-img_shape3-large_batch] 1.5455ms 1.3859ms 721.5325 Ops/s 731.6624 Ops/s $\color{#d91a1a}-1.38\%$
test_collector_stack_then_write[50-img_shape0-small] 1.5775ms 1.1599ms 862.1680 Ops/s 871.5197 Ops/s $\color{#d91a1a}-1.07\%$
test_collector_stack_then_write[100-img_shape1-atari] 4.1063ms 3.5830ms 279.0939 Ops/s 277.5560 Ops/s $\color{#35bf28}+0.55\%$
test_collector_stack_then_write[100-img_shape2-large_img] 6.2708ms 5.8211ms 171.7894 Ops/s 171.8828 Ops/s $\color{#d91a1a}-0.05\%$
test_collector_stack_then_write[200-img_shape3-large_batch] 7.7923ms 7.3700ms 135.6847 Ops/s 136.4020 Ops/s $\color{#d91a1a}-0.53\%$
test_collector_lazystack_then_write[50-img_shape0-small] 0.4162ms 0.2703ms 3.6996 KOps/s 3.6741 KOps/s $\color{#35bf28}+0.69\%$
test_collector_lazystack_then_write[100-img_shape1-atari] 1.9716ms 1.5255ms 655.5331 Ops/s 653.4148 Ops/s $\color{#35bf28}+0.32\%$
test_collector_lazystack_then_write[100-img_shape2-large_img] 2.8831ms 2.3910ms 418.2402 Ops/s 434.9448 Ops/s $\color{#d91a1a}-3.84\%$
test_collector_lazystack_then_write[200-img_shape3-large_batch] 3.5477ms 3.1027ms 322.3037 Ops/s 318.9266 Ops/s $\color{#35bf28}+1.06\%$
test_collector_without_rb[100-img_shape0-atari] 34.7905ms 33.6178ms 29.7462 Ops/s 29.5790 Ops/s $\color{#35bf28}+0.57\%$
test_collector_without_rb[200-img_shape1-large_batch] 67.4346ms 66.1010ms 15.1284 Ops/s 15.0111 Ops/s $\color{#35bf28}+0.78\%$
test_collector_with_rb[100-img_shape0-atari] 39.7313ms 37.9364ms 26.3599 Ops/s 26.3699 Ops/s $\color{#d91a1a}-0.04\%$
test_collector_with_rb[200-img_shape1-large_batch] 75.6723ms 73.8230ms 13.5459 Ops/s 13.1207 Ops/s $\color{#35bf28}+3.24\%$
test_collector_without_rb_cuda[100-img_shape0-atari] 0.7618s 95.2890ms 10.4944 Ops/s 17.3836 Ops/s $\textbf{\color{#d91a1a}-39.63\%}$
test_collector_without_rb_cuda[200-img_shape1-large_batch] 0.1155s 0.1125s 8.8853 Ops/s 8.7001 Ops/s $\color{#35bf28}+2.13\%$
test_collector_with_rb_cuda[100-img_shape0-atari] 61.5329ms 58.4443ms 17.1103 Ops/s 16.8825 Ops/s $\color{#35bf28}+1.35\%$
test_collector_with_rb_cuda[200-img_shape1-large_batch] 0.1169s 0.1159s 8.6256 Ops/s 8.4938 Ops/s $\color{#35bf28}+1.55\%$

vmoens added a commit that referenced this pull request Feb 7, 2026
Optimise the output-reading phase of step_and_maybe_reset when shared
memory and target device are both known and different (the common
CPU-shared -> CUDA case).

- When shared_device is not None and shared_device != device: use a
  single td.to(device) instead of _fast_apply with per-tensor check.
  Since .to() already creates new tensors, the extra .clone() is
  unnecessary.
- Keep the _fast_apply fallback for the mixed-device case.
- Move _sync_w2m() into a conditional - only called when a cross-device
  transfer actually happened.
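
The branching described in the list above can be sketched in plain Python. This is a minimal illustration and not the TorchRL implementation: `FakeTensor`, `read_outputs`, and the `sync` callback are hypothetical stand-ins for real tensors, the output-reading phase of `step_and_maybe_reset`, and `_sync_w2m()` respectively, and a dict comprehension stands in for the single `td.to(device)` call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a tensor: a payload plus a device tag."""
    data: List[float]
    device: str

    def to(self, device: str) -> "FakeTensor":
        # Like a real cross-device copy, this always builds a new object.
        return FakeTensor(list(self.data), device)

    def clone(self) -> "FakeTensor":
        return FakeTensor(list(self.data), self.device)


def read_outputs(
    shared: Dict[str, FakeTensor],
    device: str,
    shared_device: Optional[str],
    sync: Callable[[], None],
) -> Dict[str, FakeTensor]:
    """Copy worker outputs out of the shared buffers (illustrative only)."""
    if shared_device is not None and shared_device != device:
        # Fast path: every buffer lives on one known device that differs
        # from the target, so move everything without per-tensor checks.
        # The move already yields new objects, so no clone is needed.
        out = {key: value.to(device) for key, value in shared.items()}
        transferred = True
    else:
        # Fallback: buffers may sit on mixed devices, so check each one.
        out = {}
        transferred = False
        for key, value in shared.items():
            if value.device != device:
                out[key] = value.to(device)
                transferred = True
            else:
                # Same device: clone so the shared buffer is not aliased.
                out[key] = value.clone()
    if transferred:
        # Synchronise only when a cross-device transfer actually happened.
        sync()
    return out
```

In the fast path the sync callback always runs, while in the fallback it runs only if at least one entry was moved, which mirrors the third bullet of the commit message.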

Co-authored-by: Cursor <[email protected]>
ghstack-source-id: 07aba16
Pull-Request: #3458
@vmoens vmoens merged commit fbd36ac into gh/vmoens/219/base Feb 7, 2026
115 of 116 checks passed
@vmoens vmoens deleted the gh/vmoens/219/head branch February 7, 2026 15:39

Labels

CLA Signed: This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Performance: Performance issue or suggestion for improvement

1 participant