[Perf] ParallelEnv: replace mp.Event with shared-memory done flags #3457

Merged: vmoens merged 2 commits into gh/vmoens/218/base from gh/vmoens/218/head on Feb 7, 2026


@vmoens (Collaborator) commented Feb 6, 2026

Stack from ghstack (oldest at bottom):

Replace multiprocessing.Event (futex-based syscalls) with
multiprocessing.RawArray shared-memory byte flags for worker-to-parent
completion signaling on the hot path (step_and_maybe_reset).

  • _start_workers: creates shm_done_flags RawArray, passes to workers
  • _wait_for_workers: spin-polls done_flags instead of Event.wait()
  • Worker: _signal_done() closure writes shm_done_flags[idx]=1
  • _shutdown_workers: uses _wait_for_workers instead of Event.wait()
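
The signaling scheme in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `_worker`, `wait_for_workers`, `run_demo`, the `"step"`/`"close"` commands, and the timeout are hypothetical names and choices. Only `RawArray`, `Pipe`, and `Process` are real `multiprocessing` APIs. In this sketch commands still travel over pipes, and only the completion signal goes through shared memory.

```python
import multiprocessing as mp
import time


def _worker(idx, done_flags, conn):
    """Hypothetical worker: handle one command at a time, then raise its flag."""
    while True:
        cmd = conn.recv()
        # ... a real worker would step its environment here ...
        done_flags[idx] = 1  # plain store into shared memory
        if cmd == "close":
            return


def wait_for_workers(done_flags, timeout=10.0):
    """Spin-poll until every flag is 1, then reset all flags to 0."""
    n = len(done_flags)
    deadline = time.monotonic() + timeout
    while not all(done_flags[i] for i in range(n)):
        if time.monotonic() > deadline:
            raise TimeoutError("workers did not signal completion in time")
    for i in range(n):
        done_flags[i] = 0


def run_demo(num_workers=4, num_steps=3, start_method=None):
    """Drive the workers for a few rounds; return the number of completed rounds."""
    ctx = mp.get_context(start_method)  # None selects the platform default
    done_flags = ctx.RawArray("B", num_workers)  # zero-initialised unsigned bytes
    pipes, procs = [], []
    for idx in range(num_workers):
        parent_conn, child_conn = ctx.Pipe()
        proc = ctx.Process(
            target=_worker, args=(idx, done_flags, child_conn), daemon=True
        )
        proc.start()
        pipes.append(parent_conn)
        procs.append(proc)

    rounds = 0
    for _ in range(num_steps):
        for conn in pipes:
            conn.send("step")
        wait_for_workers(done_flags)  # flags are cleared before the next send
        rounds += 1

    # Shutdown reuses the same flag-based wait instead of a separate Event.
    for conn in pipes:
        conn.send("close")
    wait_for_workers(done_flags)
    for proc in procs:
        proc.join(timeout=5.0)
    return rounds
```

Calling `run_demo()` under an `if __name__ == "__main__":` guard returns the number of completed rounds. The parent clears the flags before sending the next command, so a worker can never raise a flag for a round that has not started. The cost of this design is that the parent occupies a core while it spins, which suits short waits such as a single environment step.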

Measured impact:

  • 10% FPS improvement (7,737 -> 8,509 fps) on H200 with 8 workers
  • 28% reduction in penv.wait_for_workers overhead (2,622us -> 1,891us)
  • ParallelEnv.close() fixed from 80s timeout to ~0.9s
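
As a quick sanity check, the rounded percentages above follow from the raw figures in the same bullets. The helper name `pct_change` is made up for this sketch.

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Throughput: 7,737 -> 8,509 fps
fps_gain = pct_change(7737, 8509)

# penv.wait_for_workers overhead: 2,622us -> 1,891us (a reduction, so negate)
wait_reduction = -pct_change(2622, 1891)

print(f"FPS improvement: {fps_gain:.1f}%")
print(f"wait_for_workers reduction: {wait_reduction:.1f}%")
```

Both values round to the figures quoted in the list (about 10% and about 28%).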

Co-authored-by: Cursor [email protected]

[ghstack-poisoned]
pytorch-bot bot commented Feb 6, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/3457

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit 1643a0b with merge base ab49b59:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla bot added the "CLA Signed" label Feb 6, 2026
github-actions bot added the "Performance" label Feb 6, 2026
[ghstack-poisoned]
github-actions bot (Contributor) commented Feb 6, 2026

$\color{#D29922}\textsf{\Large⚠\kern{0.2cm}\normalsize Warning}$ Result of CPU Benchmark Tests

Total Benchmarks: 173. Improved: $\large\color{#35bf28}20$. Worsened: $\large\color{#d91a1a}9$.

Detailed results (columns, in order: Name, Max, Mean, Ops, Ops on Repo HEAD, Change):
test_tensor_to_bytestream_speed[pickle] 81.0465μs 80.1412μs 12.4780 KOps/s 12.3741 KOps/s $\color{#35bf28}+0.84\%$
test_tensor_to_bytestream_speed[torch.save] 0.1407ms 0.1400ms 7.1427 KOps/s 7.0182 KOps/s $\color{#35bf28}+1.77\%$
test_tensor_to_bytestream_speed[untyped_storage] 0.1178s 0.1175s 8.5126 Ops/s 8.4855 Ops/s $\color{#35bf28}+0.32\%$
test_tensor_to_bytestream_speed[numpy] 2.5658μs 2.5600μs 390.6195 KOps/s 394.9232 KOps/s $\color{#d91a1a}-1.09\%$
test_tensor_to_bytestream_speed[safetensors] 37.6820μs 37.3161μs 26.7981 KOps/s 26.2366 KOps/s $\color{#35bf28}+2.14\%$
test_simple 0.5529s 0.5517s 1.8125 Ops/s 1.7310 Ops/s $\color{#35bf28}+4.71\%$
test_transformed 1.2510s 1.1548s 0.8660 Ops/s 0.8554 Ops/s $\color{#35bf28}+1.23\%$
test_serial 1.7015s 1.6961s 0.5896 Ops/s 0.5770 Ops/s $\color{#35bf28}+2.18\%$
test_parallel 1.1461s 1.0526s 0.9500 Ops/s 0.9411 Ops/s $\color{#35bf28}+0.95\%$
test_step_mdp_speed[True-True-True-True-True] 87.4210μs 44.4850μs 22.4795 KOps/s 21.7650 KOps/s $\color{#35bf28}+3.28\%$
test_step_mdp_speed[True-True-True-True-False] 0.4357ms 25.0102μs 39.9837 KOps/s 38.3738 KOps/s $\color{#35bf28}+4.20\%$
test_step_mdp_speed[True-True-True-False-True] 0.4337ms 25.6375μs 39.0054 KOps/s 39.2977 KOps/s $\color{#d91a1a}-0.74\%$
test_step_mdp_speed[True-True-True-False-False] 40.1310μs 13.6844μs 73.0762 KOps/s 71.3605 KOps/s $\color{#35bf28}+2.40\%$
test_step_mdp_speed[True-True-False-True-True] 0.4736ms 47.7857μs 20.9267 KOps/s 20.6553 KOps/s $\color{#35bf28}+1.31\%$
test_step_mdp_speed[True-True-False-True-False] 0.4433ms 27.6432μs 36.1753 KOps/s 35.4633 KOps/s $\color{#35bf28}+2.01\%$
test_step_mdp_speed[True-True-False-False-True] 0.4407ms 27.4630μs 36.4127 KOps/s 34.7211 KOps/s $\color{#35bf28}+4.87\%$
test_step_mdp_speed[True-True-False-False-False] 52.3900μs 16.4380μs 60.8348 KOps/s 58.8232 KOps/s $\color{#35bf28}+3.42\%$
test_step_mdp_speed[True-False-True-True-True] 0.4677ms 50.0795μs 19.9683 KOps/s 19.3590 KOps/s $\color{#35bf28}+3.15\%$
test_step_mdp_speed[True-False-True-True-False] 0.4521ms 30.4839μs 32.8042 KOps/s 31.7955 KOps/s $\color{#35bf28}+3.17\%$
test_step_mdp_speed[True-False-True-False-True] 0.4519ms 27.9433μs 35.7867 KOps/s 34.9332 KOps/s $\color{#35bf28}+2.44\%$
test_step_mdp_speed[True-False-True-False-False] 46.4600μs 16.5974μs 60.2503 KOps/s 58.4306 KOps/s $\color{#35bf28}+3.11\%$
test_step_mdp_speed[True-False-False-True-True] 0.4714ms 52.7507μs 18.9571 KOps/s 18.3440 KOps/s $\color{#35bf28}+3.34\%$
test_step_mdp_speed[True-False-False-True-False] 0.4516ms 33.1597μs 30.1571 KOps/s 29.0350 KOps/s $\color{#35bf28}+3.86\%$
test_step_mdp_speed[True-False-False-False-True] 0.4498ms 30.2158μs 33.0953 KOps/s 31.9561 KOps/s $\color{#35bf28}+3.56\%$
test_step_mdp_speed[True-False-False-False-False] 51.1910μs 19.2299μs 52.0024 KOps/s 50.0472 KOps/s $\color{#35bf28}+3.91\%$
test_step_mdp_speed[False-True-True-True-True] 97.3910μs 49.8318μs 20.0675 KOps/s 19.6765 KOps/s $\color{#35bf28}+1.99\%$
test_step_mdp_speed[False-True-True-True-False] 62.8310μs 30.6069μs 32.6723 KOps/s 31.7326 KOps/s $\color{#35bf28}+2.96\%$
test_step_mdp_speed[False-True-True-False-True] 2.4280ms 31.8613μs 31.3861 KOps/s 31.3191 KOps/s $\color{#35bf28}+0.21\%$
test_step_mdp_speed[False-True-True-False-False] 46.6910μs 18.3906μs 54.3755 KOps/s 53.4349 KOps/s $\color{#35bf28}+1.76\%$
test_step_mdp_speed[False-True-False-True-True] 87.5210μs 53.4194μs 18.7198 KOps/s 18.5525 KOps/s $\color{#35bf28}+0.90\%$
test_step_mdp_speed[False-True-False-True-False] 60.8410μs 33.2997μs 30.0303 KOps/s 29.0813 KOps/s $\color{#35bf28}+3.26\%$
test_step_mdp_speed[False-True-False-False-True] 64.8410μs 34.2821μs 29.1698 KOps/s 28.5283 KOps/s $\color{#35bf28}+2.25\%$
test_step_mdp_speed[False-True-False-False-False] 61.8100μs 20.9360μs 47.7645 KOps/s 46.3734 KOps/s $\color{#35bf28}+3.00\%$
test_step_mdp_speed[False-False-True-True-True] 92.8710μs 55.8433μs 17.9072 KOps/s 17.6481 KOps/s $\color{#35bf28}+1.47\%$
test_step_mdp_speed[False-False-True-True-False] 90.2310μs 36.1535μs 27.6598 KOps/s 26.8893 KOps/s $\color{#35bf28}+2.87\%$
test_step_mdp_speed[False-False-True-False-True] 68.4400μs 34.1711μs 29.2645 KOps/s 28.7717 KOps/s $\color{#35bf28}+1.71\%$
test_step_mdp_speed[False-False-True-False-False] 52.9200μs 21.0896μs 47.4167 KOps/s 46.4705 KOps/s $\color{#35bf28}+2.04\%$
test_step_mdp_speed[False-False-False-True-True] 90.6610μs 58.8598μs 16.9895 KOps/s 16.9048 KOps/s $\color{#35bf28}+0.50\%$
test_step_mdp_speed[False-False-False-True-False] 66.3510μs 38.5631μs 25.9315 KOps/s 25.1312 KOps/s $\color{#35bf28}+3.18\%$
test_step_mdp_speed[False-False-False-False-True] 79.9510μs 36.2237μs 27.6063 KOps/s 26.8043 KOps/s $\color{#35bf28}+2.99\%$
test_step_mdp_speed[False-False-False-False-False] 51.2900μs 23.1204μs 43.2519 KOps/s 41.5762 KOps/s $\color{#35bf28}+4.03\%$
test_non_tensor_env_rollout_speed[1000-single-True] 0.8746s 0.7766s 1.2877 Ops/s 1.2772 Ops/s $\color{#35bf28}+0.82\%$
test_non_tensor_env_rollout_speed[1000-single-False] 0.7373s 0.6377s 1.5681 Ops/s 1.5602 Ops/s $\color{#35bf28}+0.50\%$
test_non_tensor_env_rollout_speed[1000-serial-no-buffers-True] 1.8039s 1.7110s 0.5845 Ops/s 0.5847 Ops/s $\color{#d91a1a}-0.04\%$
test_non_tensor_env_rollout_speed[1000-serial-no-buffers-False] 1.5396s 1.4622s 0.6839 Ops/s 0.6768 Ops/s $\color{#35bf28}+1.05\%$
test_non_tensor_env_rollout_speed[1000-serial-buffers-True] 2.0339s 1.9448s 0.5142 Ops/s 0.5107 Ops/s $\color{#35bf28}+0.69\%$
test_non_tensor_env_rollout_speed[1000-serial-buffers-False] 1.7953s 1.7170s 0.5824 Ops/s 0.5792 Ops/s $\color{#35bf28}+0.56\%$
test_non_tensor_env_rollout_speed[1000-parallel-no-buffers-True] 4.7964s 4.6849s 0.2135 Ops/s 0.2108 Ops/s $\color{#35bf28}+1.28\%$
test_non_tensor_env_rollout_speed[1000-parallel-no-buffers-False] 4.5404s 4.4691s 0.2238 Ops/s 0.2211 Ops/s $\color{#35bf28}+1.20\%$
test_non_tensor_env_rollout_speed[1000-parallel-buffers-True] 1.9801s 1.9109s 0.5233 Ops/s 0.5188 Ops/s $\color{#35bf28}+0.87\%$
test_non_tensor_env_rollout_speed[1000-parallel-buffers-False] 1.7845s 1.6277s 0.6144 Ops/s 0.6131 Ops/s $\color{#35bf28}+0.21\%$
test_values[generalized_advantage_estimate-True-True] 12.1420ms 10.9418ms 91.3923 Ops/s 94.4008 Ops/s $\color{#d91a1a}-3.19\%$
test_values[vec_generalized_advantage_estimate-True-True] 19.8166ms 13.7035ms 72.9741 Ops/s 55.9455 Ops/s $\textbf{\color{#35bf28}+30.44\%}$
test_values[td0_return_estimate-False-False] 0.2402ms 0.1360ms 7.3552 KOps/s 7.7533 KOps/s $\textbf{\color{#d91a1a}-5.13\%}$
test_values[td1_return_estimate-False-False] 29.6793ms 28.8758ms 34.6310 Ops/s 35.3518 Ops/s $\color{#d91a1a}-2.04\%$
test_values[vec_td1_return_estimate-False-False] 18.4331ms 15.1262ms 66.1105 Ops/s 54.8850 Ops/s $\textbf{\color{#35bf28}+20.45\%}$
test_values[td_lambda_return_estimate-True-False] 43.1755ms 42.6642ms 23.4388 Ops/s 23.6781 Ops/s $\color{#d91a1a}-1.01\%$
test_values[vec_td_lambda_return_estimate-True-False] 18.0284ms 12.3827ms 80.7578 Ops/s 55.2984 Ops/s $\textbf{\color{#35bf28}+46.04\%}$
test_gae_speed[generalized_advantage_estimate-False-1-512] 9.5956ms 9.4844ms 105.4365 Ops/s 105.6019 Ops/s $\color{#d91a1a}-0.16\%$
test_gae_speed[vec_generalized_advantage_estimate-True-1-512] 1.7349ms 1.5792ms 633.2303 Ops/s 653.8908 Ops/s $\color{#d91a1a}-3.16\%$
test_gae_speed[vec_generalized_advantage_estimate-False-1-512] 0.4937ms 0.4381ms 2.2824 KOps/s 2.2962 KOps/s $\color{#d91a1a}-0.60\%$
test_gae_speed[vec_generalized_advantage_estimate-True-32-512] 30.4718ms 29.9030ms 33.4415 Ops/s 31.9575 Ops/s $\color{#35bf28}+4.64\%$
test_gae_speed[vec_generalized_advantage_estimate-False-32-512] 1.8257ms 1.7145ms 583.2589 Ops/s 581.8817 Ops/s $\color{#35bf28}+0.24\%$
test_dqn_speed[False-None] 1.6580ms 1.4053ms 711.6048 Ops/s 705.0253 Ops/s $\color{#35bf28}+0.93\%$
test_dqn_speed[False-backward] 1.9885ms 1.9228ms 520.0775 Ops/s 513.6461 Ops/s $\color{#35bf28}+1.25\%$
test_dqn_speed[True-None] 0.6680ms 0.5483ms 1.8239 KOps/s 1.7496 KOps/s $\color{#35bf28}+4.25\%$
test_dqn_speed[True-backward] 1.0424ms 1.0154ms 984.8253 Ops/s 890.3364 Ops/s $\textbf{\color{#35bf28}+10.61\%}$
test_dqn_speed[reduce-overhead-None] 0.6549ms 0.5471ms 1.8279 KOps/s 1.7906 KOps/s $\color{#35bf28}+2.08\%$
test_ddpg_speed[False-None] 3.2707ms 2.8732ms 348.0419 Ops/s 349.2912 Ops/s $\color{#d91a1a}-0.36\%$
test_ddpg_speed[False-backward] 4.2245ms 4.0994ms 243.9391 Ops/s 245.8617 Ops/s $\color{#d91a1a}-0.78\%$
test_ddpg_speed[True-None] 1.5608ms 1.4108ms 708.8270 Ops/s 678.0647 Ops/s $\color{#35bf28}+4.54\%$
test_ddpg_speed[True-backward] 2.4897ms 2.4319ms 411.2035 Ops/s 379.6240 Ops/s $\textbf{\color{#35bf28}+8.32\%}$
test_ddpg_speed[reduce-overhead-None] 1.5674ms 1.4295ms 699.5505 Ops/s 690.8347 Ops/s $\color{#35bf28}+1.26\%$
test_sac_speed[False-None] 8.7360ms 8.0534ms 124.1709 Ops/s 120.9643 Ops/s $\color{#35bf28}+2.65\%$
test_sac_speed[False-backward] 11.8521ms 11.2811ms 88.6436 Ops/s 87.8442 Ops/s $\color{#35bf28}+0.91\%$
test_sac_speed[True-None] 2.3124ms 2.2002ms 454.5059 Ops/s 457.2900 Ops/s $\color{#d91a1a}-0.61\%$
test_sac_speed[True-backward] 4.1551ms 4.0431ms 247.3379 Ops/s 203.0807 Ops/s $\textbf{\color{#35bf28}+21.79\%}$
test_sac_speed[reduce-overhead-None] 2.3258ms 2.1814ms 458.4134 Ops/s 445.3448 Ops/s $\color{#35bf28}+2.93\%$
test_redq_speed[False-None] 10.8304ms 10.2997ms 97.0902 Ops/s 87.9269 Ops/s $\textbf{\color{#35bf28}+10.42\%}$
test_redq_speed[False-backward] 18.3138ms 17.5713ms 56.9111 Ops/s 55.3423 Ops/s $\color{#35bf28}+2.83\%$
test_redq_speed[True-None] 4.6406ms 4.4605ms 224.1893 Ops/s 221.5255 Ops/s $\color{#35bf28}+1.20\%$
test_redq_speed[True-backward] 9.9633ms 9.6097ms 104.0618 Ops/s 98.8074 Ops/s $\textbf{\color{#35bf28}+5.32\%}$
test_redq_speed[reduce-overhead-None] 4.6995ms 4.4872ms 222.8539 Ops/s 224.1008 Ops/s $\color{#d91a1a}-0.56\%$
test_redq_deprec_speed[False-None] 11.4638ms 10.9529ms 91.3003 Ops/s 89.5019 Ops/s $\color{#35bf28}+2.01\%$
test_redq_deprec_speed[False-backward] 16.0887ms 15.7838ms 63.3563 Ops/s 62.4139 Ops/s $\color{#35bf28}+1.51\%$
test_redq_deprec_speed[True-None] 4.0751ms 3.7104ms 269.5095 Ops/s 268.4680 Ops/s $\color{#35bf28}+0.39\%$
test_redq_deprec_speed[True-backward] 7.8902ms 7.6742ms 130.3065 Ops/s 129.3405 Ops/s $\color{#35bf28}+0.75\%$
test_redq_deprec_speed[reduce-overhead-None] 4.0286ms 3.6696ms 272.5124 Ops/s 270.7919 Ops/s $\color{#35bf28}+0.64\%$
test_td3_speed[False-None] 8.2193ms 8.0979ms 123.4884 Ops/s 123.8560 Ops/s $\color{#d91a1a}-0.30\%$
test_td3_speed[False-backward] 11.4823ms 10.9679ms 91.1750 Ops/s 90.1581 Ops/s $\color{#35bf28}+1.13\%$
test_td3_speed[True-None] 1.9005ms 1.8641ms 536.4539 Ops/s 540.1615 Ops/s $\color{#d91a1a}-0.69\%$
test_td3_speed[True-backward] 3.8317ms 3.6945ms 270.6738 Ops/s 250.5550 Ops/s $\textbf{\color{#35bf28}+8.03\%}$
test_td3_speed[reduce-overhead-None] 1.8583ms 1.8289ms 546.7881 Ops/s 548.4806 Ops/s $\color{#d91a1a}-0.31\%$
test_cql_speed[False-None] 28.8438ms 26.0854ms 38.3356 Ops/s 38.0473 Ops/s $\color{#35bf28}+0.76\%$
test_cql_speed[False-backward] 35.8561ms 35.1597ms 28.4416 Ops/s 27.8238 Ops/s $\color{#35bf28}+2.22\%$
test_cql_speed[True-None] 13.4990ms 12.3675ms 80.8568 Ops/s 79.7581 Ops/s $\color{#35bf28}+1.38\%$
test_cql_speed[True-backward] 18.4289ms 17.7859ms 56.2243 Ops/s 54.9368 Ops/s $\color{#35bf28}+2.34\%$
test_cql_speed[reduce-overhead-None] 12.9161ms 12.5227ms 79.8548 Ops/s 78.8935 Ops/s $\color{#35bf28}+1.22\%$
test_a2c_speed[False-None] 5.8576ms 5.4651ms 182.9799 Ops/s 185.7094 Ops/s $\color{#d91a1a}-1.47\%$
test_a2c_speed[False-backward] 12.5219ms 11.8998ms 84.0352 Ops/s 83.2200 Ops/s $\color{#35bf28}+0.98\%$
test_a2c_speed[True-None] 4.0082ms 3.7266ms 268.3445 Ops/s 258.9982 Ops/s $\color{#35bf28}+3.61\%$
test_a2c_speed[True-backward] 9.0050ms 8.6398ms 115.7433 Ops/s 110.1452 Ops/s $\textbf{\color{#35bf28}+5.08\%}$
test_a2c_speed[reduce-overhead-None] 3.9105ms 3.7783ms 264.6687 Ops/s 261.5434 Ops/s $\color{#35bf28}+1.19\%$
test_ppo_speed[False-None] 6.2680ms 5.9764ms 167.3240 Ops/s 169.2697 Ops/s $\color{#d91a1a}-1.15\%$
test_ppo_speed[False-backward] 12.8040ms 12.5112ms 79.9283 Ops/s 78.5734 Ops/s $\color{#35bf28}+1.72\%$
test_ppo_speed[True-None] 3.8658ms 3.6738ms 272.1977 Ops/s 263.8098 Ops/s $\color{#35bf28}+3.18\%$
test_ppo_speed[True-backward] 8.7197ms 8.5105ms 117.5013 Ops/s 114.6995 Ops/s $\color{#35bf28}+2.44\%$
test_ppo_speed[reduce-overhead-None] 3.9583ms 3.7775ms 264.7246 Ops/s 270.8703 Ops/s $\color{#d91a1a}-2.27\%$
test_reinforce_speed[False-None] 4.9557ms 4.6816ms 213.6000 Ops/s 220.2278 Ops/s $\color{#d91a1a}-3.01\%$
test_reinforce_speed[False-backward] 7.8066ms 7.5222ms 132.9392 Ops/s 135.3226 Ops/s $\color{#d91a1a}-1.76\%$
test_reinforce_speed[True-None] 3.0265ms 2.8622ms 349.3772 Ops/s 332.5637 Ops/s $\textbf{\color{#35bf28}+5.06\%}$
test_reinforce_speed[True-backward] 8.0508ms 7.7478ms 129.0695 Ops/s 127.5424 Ops/s $\color{#35bf28}+1.20\%$
test_reinforce_speed[reduce-overhead-None] 3.2158ms 2.8725ms 348.1294 Ops/s 342.2515 Ops/s $\color{#35bf28}+1.72\%$
test_iql_speed[False-None] 25.5059ms 20.3149ms 49.2248 Ops/s 47.7275 Ops/s $\color{#35bf28}+3.14\%$
test_iql_speed[False-backward] 36.1447ms 30.5996ms 32.6802 Ops/s 32.4758 Ops/s $\color{#35bf28}+0.63\%$
test_iql_speed[True-None] 8.7730ms 8.5094ms 117.5164 Ops/s 113.3086 Ops/s $\color{#35bf28}+3.71\%$
test_iql_speed[True-backward] 16.8638ms 16.5818ms 60.3071 Ops/s 58.8871 Ops/s $\color{#35bf28}+2.41\%$
test_iql_speed[reduce-overhead-None] 8.7732ms 8.6163ms 116.0585 Ops/s 114.9414 Ops/s $\color{#35bf28}+0.97\%$
test_rb_sample[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] 6.2927ms 6.1215ms 163.3595 Ops/s 162.5048 Ops/s $\color{#35bf28}+0.53\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 2.8361ms 0.3842ms 2.6031 KOps/s 3.2467 KOps/s $\textbf{\color{#d91a1a}-19.82\%}$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 0.6886ms 0.3743ms 2.6715 KOps/s 3.5466 KOps/s $\textbf{\color{#d91a1a}-24.67\%}$
test_rb_sample[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 6.0536ms 5.7918ms 172.6576 Ops/s 168.7717 Ops/s $\color{#35bf28}+2.30\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 2.2549ms 0.3130ms 3.1947 KOps/s 3.5230 KOps/s $\textbf{\color{#d91a1a}-9.32\%}$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 0.6289ms 0.3083ms 3.2435 KOps/s 3.7462 KOps/s $\textbf{\color{#d91a1a}-13.42\%}$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-sampler6-10000] 1.5875ms 1.3268ms 753.6763 Ops/s 779.3732 Ops/s $\color{#d91a1a}-3.30\%$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-sampler7-10000] 1.5372ms 1.2809ms 780.6889 Ops/s 830.2367 Ops/s $\textbf{\color{#d91a1a}-5.97\%}$
test_rb_sample[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] 9.5047ms 6.1143ms 163.5499 Ops/s 166.4931 Ops/s $\color{#d91a1a}-1.77\%$
test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 0.9773ms 0.4376ms 2.2854 KOps/s 2.2923 KOps/s $\color{#d91a1a}-0.30\%$
test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 0.6771ms 0.4401ms 2.2723 KOps/s 2.3912 KOps/s $\color{#d91a1a}-4.97\%$
test_rb_iterate[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] 6.1574ms 5.8739ms 170.2454 Ops/s 168.1950 Ops/s $\color{#35bf28}+1.22\%$
test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 2.0946ms 0.3063ms 3.2648 KOps/s 3.1040 KOps/s $\textbf{\color{#35bf28}+5.18\%}$
test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 0.5129ms 0.2887ms 3.4644 KOps/s 3.2120 KOps/s $\textbf{\color{#35bf28}+7.86\%}$
test_rb_iterate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 6.1032ms 5.8292ms 171.5510 Ops/s 169.5318 Ops/s $\color{#35bf28}+1.19\%$
test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 2.1580ms 0.3343ms 2.9916 KOps/s 2.7968 KOps/s $\textbf{\color{#35bf28}+6.97\%}$
test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 0.5699ms 0.3130ms 3.1946 KOps/s 2.7473 KOps/s $\textbf{\color{#35bf28}+16.28\%}$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] 6.1190ms 5.9906ms 166.9287 Ops/s 165.9599 Ops/s $\color{#35bf28}+0.58\%$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 1.8426ms 0.5095ms 1.9628 KOps/s 2.0519 KOps/s $\color{#d91a1a}-4.34\%$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 0.7349ms 0.4919ms 2.0329 KOps/s 2.3189 KOps/s $\textbf{\color{#d91a1a}-12.34\%}$
test_rb_populate[TensorDictReplayBuffer-ListStorage-RandomSampler-400] 6.4438ms 5.0442ms 198.2471 Ops/s 57.0051 Ops/s $\textbf{\color{#35bf28}+247.77\%}$
test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-400] 4.2254ms 2.0537ms 486.9346 Ops/s 504.7626 Ops/s $\color{#d91a1a}-3.53\%$
test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400] 3.2490ms 0.9202ms 1.0867 KOps/s 1.1455 KOps/s $\textbf{\color{#d91a1a}-5.13\%}$
test_rb_populate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-400] 0.5427s 15.8711ms 63.0077 Ops/s 194.3943 Ops/s $\textbf{\color{#d91a1a}-67.59\%}$
test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-400] 12.8461ms 1.9755ms 506.1908 Ops/s 511.7632 Ops/s $\color{#d91a1a}-1.09\%$
test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-400] 7.1460ms 1.2166ms 821.9544 Ops/s 771.7766 Ops/s $\textbf{\color{#35bf28}+6.50\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-ListStorage-None-400] 6.6902ms 5.1823ms 192.9656 Ops/s 59.4087 Ops/s $\textbf{\color{#35bf28}+224.81\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400] 6.0448ms 1.9468ms 513.6626 Ops/s 463.1263 Ops/s $\textbf{\color{#35bf28}+10.91\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400] 1.1993ms 1.0188ms 981.5095 Ops/s 971.2701 Ops/s $\color{#35bf28}+1.05\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-True] 39.0299ms 36.3277ms 27.5272 Ops/s 27.3445 Ops/s $\color{#35bf28}+0.67\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-False] 20.9415ms 18.7068ms 53.4566 Ops/s 53.6611 Ops/s $\color{#d91a1a}-0.38\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-100000-10000-100-True] 40.0860ms 37.4539ms 26.6995 Ops/s 26.6721 Ops/s $\color{#35bf28}+0.10\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-100000-10000-100-False] 20.5218ms 18.6830ms 53.5247 Ops/s 53.0189 Ops/s $\color{#35bf28}+0.95\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-1000000-10000-100-True] 40.7692ms 39.1663ms 25.5322 Ops/s 25.0887 Ops/s $\color{#35bf28}+1.77\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-1000000-10000-100-False] 21.5623ms 20.2601ms 49.3580 Ops/s 48.9594 Ops/s $\color{#35bf28}+0.81\%$
test_storage_write_lazystack[50-img_shape0-small] 0.8605ms 0.2193ms 4.5598 KOps/s 4.5437 KOps/s $\color{#35bf28}+0.35\%$
test_storage_write_lazystack[100-img_shape1-atari] 1.6060ms 1.4038ms 712.3432 Ops/s 720.3182 Ops/s $\color{#d91a1a}-1.11\%$
test_storage_write_lazystack[100-img_shape2-large_img] 2.6216ms 2.3095ms 432.9929 Ops/s 429.0231 Ops/s $\color{#35bf28}+0.93\%$
test_storage_write_lazystack[200-img_shape3-large_batch] 3.1310ms 2.9628ms 337.5164 Ops/s 335.8331 Ops/s $\color{#35bf28}+0.50\%$
test_storage_write_contiguous[50-img_shape0-small] 0.2457ms 0.1351ms 7.4001 KOps/s 7.1221 KOps/s $\color{#35bf28}+3.90\%$
test_storage_write_contiguous[100-img_shape1-atari] 0.3489ms 0.2068ms 4.8350 KOps/s 4.7050 KOps/s $\color{#35bf28}+2.76\%$
test_storage_write_contiguous[100-img_shape2-large_img] 1.9507ms 1.7646ms 566.7126 Ops/s 552.7737 Ops/s $\color{#35bf28}+2.52\%$
test_storage_write_contiguous[200-img_shape3-large_batch] 1.4366ms 1.2943ms 772.6049 Ops/s 781.0504 Ops/s $\color{#d91a1a}-1.08\%$
test_collector_stack_then_write[50-img_shape0-small] 1.2487ms 1.1323ms 883.1777 Ops/s 880.2170 Ops/s $\color{#35bf28}+0.34\%$
test_collector_stack_then_write[100-img_shape1-atari] 3.7212ms 3.5174ms 284.3044 Ops/s 270.2823 Ops/s $\textbf{\color{#35bf28}+5.19\%}$
test_collector_stack_then_write[100-img_shape2-large_img] 11.1869ms 5.7361ms 174.3335 Ops/s 177.9472 Ops/s $\color{#d91a1a}-2.03\%$
test_collector_stack_then_write[200-img_shape3-large_batch] 7.4777ms 7.0511ms 141.8220 Ops/s 142.0237 Ops/s $\color{#d91a1a}-0.14\%$
test_collector_lazystack_then_write[50-img_shape0-small] 0.4299ms 0.2740ms 3.6494 KOps/s 3.6159 KOps/s $\color{#35bf28}+0.93\%$
test_collector_lazystack_then_write[100-img_shape1-atari] 1.8194ms 1.5363ms 650.8994 Ops/s 654.9538 Ops/s $\color{#d91a1a}-0.62\%$
test_collector_lazystack_then_write[100-img_shape2-large_img] 2.6531ms 2.4547ms 407.3782 Ops/s 412.1187 Ops/s $\color{#d91a1a}-1.15\%$
test_collector_lazystack_then_write[200-img_shape3-large_batch] 3.2677ms 3.1290ms 319.5905 Ops/s 319.6383 Ops/s $\color{#d91a1a}-0.01\%$
test_collector_without_rb[100-img_shape0-atari] 34.6747ms 34.1786ms 29.2581 Ops/s 28.9553 Ops/s $\color{#35bf28}+1.05\%$
test_collector_without_rb[200-img_shape1-large_batch] 67.4536ms 67.1721ms 14.8871 Ops/s 14.7717 Ops/s $\color{#35bf28}+0.78\%$
test_collector_with_rb[100-img_shape0-atari] 39.3759ms 38.7225ms 25.8247 Ops/s 25.4660 Ops/s $\color{#35bf28}+1.41\%$
test_collector_with_rb[200-img_shape1-large_batch] 76.3121ms 75.8431ms 13.1851 Ops/s 13.1130 Ops/s $\color{#35bf28}+0.55\%$
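
The Change column in the table above appears to be the relative difference between the Ops value on this branch and the Ops value on repo HEAD. The sketch below reproduces a few rows under that assumption. `change_pct` is a hypothetical helper, not part of the benchmark tooling, and both inputs must be in the same unit (Ops/s or KOps/s).

```python
def change_pct(ops, ops_head):
    """Relative throughput change versus repo HEAD, in percent."""
    return (ops / ops_head - 1.0) * 100.0


# (name, Ops on this branch, Ops on repo HEAD) copied from the table above
rows = [
    ("test_tensor_to_bytestream_speed[pickle]", 12.4780, 12.3741),
    ("test_rb_populate[...-ListStorage-RandomSampler-400]", 198.2471, 57.0051),
    ("test_rb_populate[...-ListStorage-SamplerWithoutReplacement-400]", 63.0077, 194.3943),
]

for name, ops, ops_head in rows:
    print(f"{name}: {change_pct(ops, ops_head):+.2f}%")
```

The two `test_rb_populate` rows swing by a large margin in opposite directions. They exercise replay buffers rather than `ParallelEnv`, so they are unlikely to be caused by this change.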

github-actions bot (Contributor) commented Feb 6, 2026

$\color{#D29922}\textsf{\Large⚠\kern{0.2cm}\normalsize Warning}$ Result of GPU Benchmark Tests

Total Benchmarks: 172. Improved: $\large\color{#35bf28}16$. Worsened: $\large\color{#d91a1a}11$.

Detailed results (columns, in order: Name, Max, Mean, Ops, Ops on Repo HEAD, Change):
test_tensor_to_bytestream_speed[pickle] 80.8382μs 79.6934μs 12.5481 KOps/s 12.3866 KOps/s $\color{#35bf28}+1.30\%$
test_tensor_to_bytestream_speed[torch.save] 0.1403ms 0.1397ms 7.1589 KOps/s 7.2104 KOps/s $\color{#d91a1a}-0.72\%$
test_tensor_to_bytestream_speed[untyped_storage] 0.1090s 0.1084s 9.2256 Ops/s 9.2166 Ops/s $\color{#35bf28}+0.10\%$
test_tensor_to_bytestream_speed[numpy] 2.6605μs 2.6423μs 378.4554 KOps/s 398.4340 KOps/s $\textbf{\color{#d91a1a}-5.01\%}$
test_tensor_to_bytestream_speed[safetensors] 37.7355μs 37.3775μs 26.7540 KOps/s 26.9845 KOps/s $\color{#d91a1a}-0.85\%$
test_simple 0.7943s 0.7936s 1.2601 Ops/s 1.2158 Ops/s $\color{#35bf28}+3.64\%$
test_transformed 1.5400s 1.4468s 0.6912 Ops/s 0.6849 Ops/s $\color{#35bf28}+0.92\%$
test_serial 2.4023s 2.3091s 0.4331 Ops/s 0.4282 Ops/s $\color{#35bf28}+1.13\%$
test_parallel 1.9112s 1.8155s 0.5508 Ops/s 0.5588 Ops/s $\color{#d91a1a}-1.43\%$
test_step_mdp_speed[True-True-True-True-True] 0.3509ms 45.2075μs 22.1202 KOps/s 22.2553 KOps/s $\color{#d91a1a}-0.61\%$
test_step_mdp_speed[True-True-True-True-False] 52.5530μs 26.0523μs 38.3843 KOps/s 39.6968 KOps/s $\color{#d91a1a}-3.31\%$
test_step_mdp_speed[True-True-True-False-True] 51.4030μs 25.2723μs 39.5690 KOps/s 40.2784 KOps/s $\color{#d91a1a}-1.76\%$
test_step_mdp_speed[True-True-True-False-False] 42.0720μs 13.9047μs 71.9182 KOps/s 72.2986 KOps/s $\color{#d91a1a}-0.53\%$
test_step_mdp_speed[True-True-False-True-True] 85.9540μs 48.6854μs 20.5400 KOps/s 21.1257 KOps/s $\color{#d91a1a}-2.77\%$
test_step_mdp_speed[True-True-False-True-False] 60.2120μs 28.4222μs 35.1838 KOps/s 35.9445 KOps/s $\color{#d91a1a}-2.12\%$
test_step_mdp_speed[True-True-False-False-True] 57.3330μs 28.1594μs 35.5121 KOps/s 36.0964 KOps/s $\color{#d91a1a}-1.62\%$
test_step_mdp_speed[True-True-False-False-False] 48.5620μs 16.7845μs 59.5789 KOps/s 60.1937 KOps/s $\color{#d91a1a}-1.02\%$
test_step_mdp_speed[True-False-True-True-True] 98.5450μs 50.9300μs 19.6348 KOps/s 20.0581 KOps/s $\color{#d91a1a}-2.11\%$
test_step_mdp_speed[True-False-True-True-False] 60.5630μs 31.4052μs 31.8418 KOps/s 32.6044 KOps/s $\color{#d91a1a}-2.34\%$
test_step_mdp_speed[True-False-True-False-True] 58.3130μs 27.9525μs 35.7750 KOps/s 36.0097 KOps/s $\color{#d91a1a}-0.65\%$
test_step_mdp_speed[True-False-True-False-False] 47.8930μs 16.6840μs 59.9378 KOps/s 60.0176 KOps/s $\color{#d91a1a}-0.13\%$
test_step_mdp_speed[True-False-False-True-True] 87.5840μs 54.0836μs 18.4899 KOps/s 18.9614 KOps/s $\color{#d91a1a}-2.49\%$
test_step_mdp_speed[True-False-False-True-False] 64.8330μs 34.0842μs 29.3391 KOps/s 29.9009 KOps/s $\color{#d91a1a}-1.88\%$
test_step_mdp_speed[True-False-False-False-True] 74.4630μs 30.8246μs 32.4416 KOps/s 33.4409 KOps/s $\color{#d91a1a}-2.99\%$
test_step_mdp_speed[True-False-False-False-False] 47.6920μs 19.3347μs 51.7205 KOps/s 51.0028 KOps/s $\color{#35bf28}+1.41\%$
test_step_mdp_speed[False-True-True-True-True] 91.6340μs 50.8954μs 19.6481 KOps/s 20.0800 KOps/s $\color{#d91a1a}-2.15\%$
test_step_mdp_speed[False-True-True-True-False] 62.9040μs 30.7334μs 32.5379 KOps/s 31.8054 KOps/s $\color{#35bf28}+2.30\%$
test_step_mdp_speed[False-True-True-False-True] 2.5204ms 34.0583μs 29.3615 KOps/s 31.8804 KOps/s $\textbf{\color{#d91a1a}-7.90\%}$
test_step_mdp_speed[False-True-True-False-False] 46.7630μs 18.6173μs 53.7136 KOps/s 54.2945 KOps/s $\color{#d91a1a}-1.07\%$
test_step_mdp_speed[False-True-False-True-True] 90.5040μs 54.9290μs 18.2053 KOps/s 19.0179 KOps/s $\color{#d91a1a}-4.27\%$
test_step_mdp_speed[False-True-False-True-False] 73.6740μs 34.1416μs 29.2898 KOps/s 29.6728 KOps/s $\color{#d91a1a}-1.29\%$
test_step_mdp_speed[False-True-False-False-True] 77.2540μs 34.9535μs 28.6095 KOps/s 29.7091 KOps/s $\color{#d91a1a}-3.70\%$
test_step_mdp_speed[False-True-False-False-False] 52.6630μs 21.1213μs 47.3455 KOps/s 47.3338 KOps/s $\color{#35bf28}+0.02\%$
test_step_mdp_speed[False-False-True-True-True] 91.9950μs 56.5261μs 17.6909 KOps/s 17.7258 KOps/s $\color{#d91a1a}-0.20\%$
test_step_mdp_speed[False-False-True-True-False] 69.6840μs 36.8453μs 27.1405 KOps/s 27.2548 KOps/s $\color{#d91a1a}-0.42\%$
test_step_mdp_speed[False-False-True-False-True] 68.3830μs 34.1380μs 29.2928 KOps/s 29.5216 KOps/s $\color{#d91a1a}-0.77\%$
test_step_mdp_speed[False-False-True-False-False] 56.8530μs 20.7002μs 48.3088 KOps/s 47.6112 KOps/s $\color{#35bf28}+1.47\%$
test_step_mdp_speed[False-False-False-True-True] 0.1105ms 58.0139μs 17.2373 KOps/s 17.2646 KOps/s $\color{#d91a1a}-0.16\%$
test_step_mdp_speed[False-False-False-True-False] 72.3030μs 39.0746μs 25.5921 KOps/s 25.9743 KOps/s $\color{#d91a1a}-1.47\%$
test_step_mdp_speed[False-False-False-False-True] 68.2230μs 36.8050μs 27.1702 KOps/s 28.0622 KOps/s $\color{#d91a1a}-3.18\%$
test_step_mdp_speed[False-False-False-False-False] 65.8430μs 23.5263μs 42.5057 KOps/s 41.9800 KOps/s $\color{#35bf28}+1.25\%$
test_non_tensor_env_rollout_speed[1000-single-True] 0.8707s 0.7650s 1.3072 Ops/s 1.2996 Ops/s $\color{#35bf28}+0.58\%$
test_non_tensor_env_rollout_speed[1000-single-False] 0.7266s 0.6316s 1.5833 Ops/s 1.5741 Ops/s $\color{#35bf28}+0.58\%$
test_non_tensor_env_rollout_speed[1000-serial-no-buffers-True] 1.7508s 1.6720s 0.5981 Ops/s 0.5948 Ops/s $\color{#35bf28}+0.55\%$
test_non_tensor_env_rollout_speed[1000-serial-no-buffers-False] 1.5362s 1.4536s 0.6879 Ops/s 0.6875 Ops/s $\color{#35bf28}+0.06\%$
test_non_tensor_env_rollout_speed[1000-serial-buffers-True] 2.0040s 1.9287s 0.5185 Ops/s 0.5199 Ops/s $\color{#d91a1a}-0.28\%$
test_non_tensor_env_rollout_speed[1000-serial-buffers-False] 1.7916s 1.7062s 0.5861 Ops/s 0.5867 Ops/s $\color{#d91a1a}-0.11\%$
test_non_tensor_env_rollout_speed[1000-parallel-no-buffers-True] 4.7971s 4.6338s 0.2158 Ops/s 0.2127 Ops/s $\color{#35bf28}+1.45\%$
test_non_tensor_env_rollout_speed[1000-parallel-no-buffers-False] 4.5571s 4.3985s 0.2274 Ops/s 0.2229 Ops/s $\color{#35bf28}+2.01\%$
test_non_tensor_env_rollout_speed[1000-parallel-buffers-True] 1.9989s 1.9251s 0.5194 Ops/s 0.5158 Ops/s $\color{#35bf28}+0.70\%$
test_non_tensor_env_rollout_speed[1000-parallel-buffers-False] 1.6975s 1.6190s 0.6176 Ops/s 0.6171 Ops/s $\color{#35bf28}+0.08\%$
test_values[generalized_advantage_estimate-True-True] 20.4550ms 19.9514ms 50.1219 Ops/s 49.2009 Ops/s $\color{#35bf28}+1.87\%$
test_values[vec_generalized_advantage_estimate-True-True] 0.1300s 3.5155ms 284.4516 Ops/s 287.8051 Ops/s $\color{#d91a1a}-1.17\%$
test_values[td0_return_estimate-False-False] 0.1077ms 81.6428μs 12.2485 KOps/s 12.2697 KOps/s $\color{#d91a1a}-0.17\%$
test_values[td1_return_estimate-False-False] 47.6179ms 47.2143ms 21.1800 Ops/s 20.7927 Ops/s $\color{#35bf28}+1.86\%$
test_values[vec_td1_return_estimate-False-False] 1.2874ms 1.0781ms 927.5725 Ops/s 922.9813 Ops/s $\color{#35bf28}+0.50\%$
test_values[td_lambda_return_estimate-True-False] 78.1378ms 77.4834ms 12.9060 Ops/s 12.8051 Ops/s $\color{#35bf28}+0.79\%$
test_values[vec_td_lambda_return_estimate-True-False] 1.2728ms 1.0757ms 929.6240 Ops/s 926.1803 Ops/s $\color{#35bf28}+0.37\%$
test_gae_speed[generalized_advantage_estimate-False-1-512] 20.5676ms 20.3099ms 49.2371 Ops/s 48.4035 Ops/s $\color{#35bf28}+1.72\%$
test_gae_speed[vec_generalized_advantage_estimate-True-1-512] 1.0098ms 0.7445ms 1.3432 KOps/s 1.3305 KOps/s $\color{#35bf28}+0.95\%$
test_gae_speed[vec_generalized_advantage_estimate-False-1-512] 0.7326ms 0.6751ms 1.4812 KOps/s 1.4854 KOps/s $\color{#d91a1a}-0.28\%$
test_gae_speed[vec_generalized_advantage_estimate-True-32-512] 1.5578ms 1.4858ms 673.0188 Ops/s 673.6103 Ops/s $\color{#d91a1a}-0.09\%$
test_gae_speed[vec_generalized_advantage_estimate-False-32-512] 0.7622ms 0.6883ms 1.4529 KOps/s 1.4503 KOps/s $\color{#35bf28}+0.18\%$
test_dqn_speed[False-None] 1.6304ms 1.5293ms 653.8955 Ops/s 659.2008 Ops/s $\color{#d91a1a}-0.80\%$
test_dqn_speed[False-backward] 2.2210ms 2.1622ms 462.4839 Ops/s 462.3999 Ops/s $\color{#35bf28}+0.02\%$
test_dqn_speed[True-None] 1.0852ms 0.5789ms 1.7273 KOps/s 1.6608 KOps/s $\color{#35bf28}+4.00\%$
test_dqn_speed[True-backward] 1.1696ms 1.1015ms 907.8635 Ops/s 812.1106 Ops/s $\textbf{\color{#35bf28}+11.79\%}$
test_dqn_speed[reduce-overhead-None] 0.6900ms 0.5928ms 1.6869 KOps/s 1.6470 KOps/s $\color{#35bf28}+2.42\%$
test_ddpg_speed[False-None] 3.2980ms 2.8965ms 345.2457 Ops/s 349.6749 Ops/s $\color{#d91a1a}-1.27\%$
test_ddpg_speed[False-backward] 4.6574ms 4.1074ms 243.4651 Ops/s 236.1135 Ops/s $\color{#35bf28}+3.11\%$
test_ddpg_speed[True-None] 1.4433ms 1.3389ms 746.9005 Ops/s 746.3871 Ops/s $\color{#35bf28}+0.07\%$
test_ddpg_speed[True-backward] 2.4678ms 2.3788ms 420.3876 Ops/s 391.0994 Ops/s $\textbf{\color{#35bf28}+7.49\%}$
test_ddpg_speed[reduce-overhead-None] 1.4573ms 1.3651ms 732.5401 Ops/s 731.2055 Ops/s $\color{#35bf28}+0.18\%$
test_sac_speed[False-None] 8.8845ms 8.2799ms 120.7744 Ops/s 121.9224 Ops/s $\color{#d91a1a}-0.94\%$
test_sac_speed[False-backward] 11.6098ms 11.1510ms 89.6783 Ops/s 88.4161 Ops/s $\color{#35bf28}+1.43\%$
test_sac_speed[True-None] 1.9618ms 1.8424ms 542.7711 Ops/s 538.5805 Ops/s $\color{#35bf28}+0.78\%$
test_sac_speed[True-backward] 3.5558ms 3.4541ms 289.5072 Ops/s 271.7500 Ops/s $\textbf{\color{#35bf28}+6.53\%}$
test_sac_speed[reduce-overhead-None] 20.1502ms 11.1029ms 90.0667 Ops/s 82.1603 Ops/s $\textbf{\color{#35bf28}+9.62\%}$
test_redq_deprec_speed[False-None] 9.8627ms 9.2510ms 108.0964 Ops/s 106.3392 Ops/s $\color{#35bf28}+1.65\%$
test_redq_deprec_speed[False-backward] 12.8369ms 12.2846ms 81.4029 Ops/s 79.0927 Ops/s $\color{#35bf28}+2.92\%$
test_redq_deprec_speed[True-None] 2.6513ms 2.5552ms 391.3636 Ops/s 389.9723 Ops/s $\color{#35bf28}+0.36\%$
test_redq_deprec_speed[True-backward] 4.2783ms 4.1167ms 242.9145 Ops/s 227.1056 Ops/s $\textbf{\color{#35bf28}+6.96\%}$
test_redq_deprec_speed[reduce-overhead-None] 16.3514ms 9.9248ms 100.7581 Ops/s 100.1937 Ops/s $\color{#35bf28}+0.56\%$
test_td3_speed[False-None] 8.3844ms 8.1586ms 122.5704 Ops/s 123.0100 Ops/s $\color{#d91a1a}-0.36\%$
test_td3_speed[False-backward] 10.9146ms 10.4473ms 95.7183 Ops/s 93.5386 Ops/s $\color{#35bf28}+2.33\%$
test_td3_speed[True-None] 1.7242ms 1.6818ms 594.5895 Ops/s 597.2951 Ops/s $\color{#d91a1a}-0.45\%$
test_td3_speed[True-backward] 3.3845ms 3.2637ms 306.3977 Ops/s 300.3268 Ops/s $\color{#35bf28}+2.02\%$
test_td3_speed[reduce-overhead-None] 83.5770ms 25.0231ms 39.9630 Ops/s 40.7721 Ops/s $\color{#d91a1a}-1.98\%$
test_cql_speed[False-None] 17.4671ms 17.1339ms 58.3640 Ops/s 58.2716 Ops/s $\color{#35bf28}+0.16\%$
test_cql_speed[False-backward] 23.1514ms 22.6211ms 44.2064 Ops/s 44.2589 Ops/s $\color{#d91a1a}-0.12\%$
test_cql_speed[True-None] 3.7919ms 3.3801ms 295.8473 Ops/s 299.7021 Ops/s $\color{#d91a1a}-1.29\%$
test_cql_speed[True-backward] 5.9554ms 5.4412ms 183.7821 Ops/s 176.4599 Ops/s $\color{#35bf28}+4.15\%$
test_cql_speed[reduce-overhead-None] 19.2258ms 12.0785ms 82.7917 Ops/s 83.8080 Ops/s $\color{#d91a1a}-1.21\%$
test_a2c_speed[False-None] 4.0214ms 3.2524ms 307.4679 Ops/s 310.9154 Ops/s $\color{#d91a1a}-1.11\%$
test_a2c_speed[False-backward] 6.1634ms 6.0237ms 166.0120 Ops/s 159.6938 Ops/s $\color{#35bf28}+3.96\%$
test_a2c_speed[True-None] 1.5209ms 1.3497ms 740.9267 Ops/s 733.6127 Ops/s $\color{#35bf28}+1.00\%$
test_a2c_speed[True-backward] 3.0542ms 2.9763ms 335.9894 Ops/s 317.9956 Ops/s $\textbf{\color{#35bf28}+5.66\%}$
test_a2c_speed[reduce-overhead-None] 1.1549ms 0.9913ms 1.0088 KOps/s 1.0176 KOps/s $\color{#d91a1a}-0.87\%$
test_ppo_speed[False-None] 3.9439ms 3.8090ms 262.5335 Ops/s 262.1957 Ops/s $\color{#35bf28}+0.13\%$
test_ppo_speed[False-backward] 7.2616ms 6.8280ms 146.4566 Ops/s 141.2602 Ops/s $\color{#35bf28}+3.68\%$
test_ppo_speed[True-None] 1.5503ms 1.4492ms 690.0414 Ops/s 694.9424 Ops/s $\color{#d91a1a}-0.71\%$
test_ppo_speed[True-backward] 3.2082ms 3.1038ms 322.1867 Ops/s 299.4491 Ops/s $\textbf{\color{#35bf28}+7.59\%}$
test_ppo_speed[reduce-overhead-None] 1.5147ms 1.0609ms 942.5975 Ops/s 922.5984 Ops/s $\color{#35bf28}+2.17\%$
test_reinforce_speed[False-None] 2.7099ms 2.2683ms 440.8671 Ops/s 429.0186 Ops/s $\color{#35bf28}+2.76\%$
test_reinforce_speed[False-backward] 3.7210ms 3.2581ms 306.9292 Ops/s 302.6965 Ops/s $\color{#35bf28}+1.40\%$
test_reinforce_speed[True-None] 1.7440ms 1.3071ms 765.0705 Ops/s 742.4942 Ops/s $\color{#35bf28}+3.04\%$
test_reinforce_speed[True-backward] 3.3126ms 2.9265ms 341.7008 Ops/s 336.2993 Ops/s $\color{#35bf28}+1.61\%$
test_reinforce_speed[reduce-overhead-None] 17.5102ms 9.5657ms 104.5404 Ops/s 104.3868 Ops/s $\color{#35bf28}+0.15\%$
test_iql_speed[False-None] 9.8437ms 9.3523ms 106.9252 Ops/s 106.5305 Ops/s $\color{#35bf28}+0.37\%$
test_iql_speed[False-backward] 13.3524ms 12.8868ms 77.5987 Ops/s 76.0391 Ops/s $\color{#35bf28}+2.05\%$
test_iql_speed[True-None] 2.4377ms 2.2238ms 449.6745 Ops/s 444.1160 Ops/s $\color{#35bf28}+1.25\%$
test_iql_speed[True-backward] 4.9113ms 4.7427ms 210.8482 Ops/s 200.2140 Ops/s $\textbf{\color{#35bf28}+5.31\%}$
test_iql_speed[reduce-overhead-None] 17.9525ms 10.6067ms 94.2801 Ops/s 94.6668 Ops/s $\color{#d91a1a}-0.41\%$
test_rb_sample[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] 6.1944ms 6.0280ms 165.8934 Ops/s 168.2051 Ops/s $\color{#d91a1a}-1.37\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 0.5862ms 0.2822ms 3.5442 KOps/s 3.5400 KOps/s $\color{#35bf28}+0.12\%$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 0.7066ms 0.2695ms 3.7112 KOps/s 3.7644 KOps/s $\color{#d91a1a}-1.41\%$
test_rb_sample[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 6.2616ms 5.9119ms 169.1490 Ops/s 176.1013 Ops/s $\color{#d91a1a}-3.95\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 0.5948ms 0.3234ms 3.0926 KOps/s 2.8034 KOps/s $\textbf{\color{#35bf28}+10.32\%}$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 0.6181ms 0.3316ms 3.0159 KOps/s 2.9586 KOps/s $\color{#35bf28}+1.93\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-sampler6-10000] 1.6593ms 1.4479ms 690.6750 Ops/s 719.0616 Ops/s $\color{#d91a1a}-3.95\%$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-sampler7-10000] 1.6148ms 1.3521ms 739.6115 Ops/s 785.8308 Ops/s $\textbf{\color{#d91a1a}-5.88\%}$
test_rb_sample[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] 6.1877ms 5.9445ms 168.2241 Ops/s 170.3845 Ops/s $\color{#d91a1a}-1.27\%$
test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 1.0439ms 0.4181ms 2.3916 KOps/s 1.9682 KOps/s $\textbf{\color{#35bf28}+21.51\%}$
test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 0.5968ms 0.4030ms 2.4815 KOps/s 2.4570 KOps/s $\color{#35bf28}+1.00\%$
test_rb_iterate[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] 6.0227ms 5.8622ms 170.5851 Ops/s 168.6049 Ops/s $\color{#35bf28}+1.17\%$
test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 1.0381ms 0.3052ms 3.2765 KOps/s 3.5040 KOps/s $\textbf{\color{#d91a1a}-6.49\%}$
test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 0.5211ms 0.2906ms 3.4410 KOps/s 3.7690 KOps/s $\textbf{\color{#d91a1a}-8.70\%}$
test_rb_iterate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 6.0874ms 5.8179ms 171.8838 Ops/s 170.4174 Ops/s $\color{#35bf28}+0.86\%$
test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 1.8128ms 0.2774ms 3.6049 KOps/s 2.9914 KOps/s $\textbf{\color{#35bf28}+20.51\%}$
test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 0.6867ms 0.2609ms 3.8335 KOps/s 3.4081 KOps/s $\textbf{\color{#35bf28}+12.48\%}$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] 6.1512ms 5.9898ms 166.9492 Ops/s 166.3304 Ops/s $\color{#35bf28}+0.37\%$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 0.5726s 1.2165ms 822.0350 Ops/s 2.3223 KOps/s $\textbf{\color{#d91a1a}-64.60\%}$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 0.6925ms 0.4814ms 2.0775 KOps/s 2.4597 KOps/s $\textbf{\color{#d91a1a}-15.54\%}$
test_rb_populate[TensorDictReplayBuffer-ListStorage-RandomSampler-400] 6.8055ms 5.1076ms 195.7876 Ops/s 199.1206 Ops/s $\color{#d91a1a}-1.67\%$
test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-400] 12.2433ms 2.1712ms 460.5651 Ops/s 435.3415 Ops/s $\textbf{\color{#35bf28}+5.79\%}$
test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400] 2.2270ms 1.1400ms 877.1868 Ops/s 1.0069 KOps/s $\textbf{\color{#d91a1a}-12.88\%}$
test_rb_populate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-400] 8.5582ms 5.0889ms 196.5074 Ops/s 50.3994 Ops/s $\textbf{\color{#35bf28}+289.90\%}$
test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-400] 4.0449ms 1.8637ms 536.5739 Ops/s 538.7943 Ops/s $\color{#d91a1a}-0.41\%$
test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-400] 12.4661ms 1.4023ms 713.1086 Ops/s 774.8509 Ops/s $\textbf{\color{#d91a1a}-7.97\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-ListStorage-None-400] 0.5487s 16.2858ms 61.4031 Ops/s 187.1406 Ops/s $\textbf{\color{#d91a1a}-67.19\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400] 4.2028ms 1.9755ms 506.2118 Ops/s 469.3880 Ops/s $\textbf{\color{#35bf28}+7.85\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400] 1.9547ms 1.1166ms 895.5824 Ops/s 928.9691 Ops/s $\color{#d91a1a}-3.59\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-True] 37.4807ms 35.4474ms 28.2108 Ops/s 28.0715 Ops/s $\color{#35bf28}+0.50\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-False] 19.5689ms 17.6958ms 56.5107 Ops/s 55.5459 Ops/s $\color{#35bf28}+1.74\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-100000-10000-100-True] 40.0885ms 37.0694ms 26.9764 Ops/s 26.9734 Ops/s $\color{#35bf28}+0.01\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-100000-10000-100-False] 19.9155ms 18.1371ms 55.1355 Ops/s 53.1076 Ops/s $\color{#35bf28}+3.82\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-1000000-10000-100-True] 41.4180ms 39.5800ms 25.2653 Ops/s 25.5521 Ops/s $\color{#d91a1a}-1.12\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-1000000-10000-100-False] 21.8332ms 20.3652ms 49.1035 Ops/s 50.1452 Ops/s $\color{#d91a1a}-2.08\%$
test_storage_write_lazystack[50-img_shape0-small] 0.8704ms 0.2198ms 4.5497 KOps/s 4.5459 KOps/s $\color{#35bf28}+0.08\%$
test_storage_write_lazystack[100-img_shape1-atari] 2.2890ms 1.4068ms 710.8204 Ops/s 693.5190 Ops/s $\color{#35bf28}+2.49\%$
test_storage_write_lazystack[100-img_shape2-large_img] 2.7076ms 2.3281ms 429.5309 Ops/s 428.0676 Ops/s $\color{#35bf28}+0.34\%$
test_storage_write_lazystack[200-img_shape3-large_batch] 3.0700ms 2.9342ms 340.8083 Ops/s 339.7671 Ops/s $\color{#35bf28}+0.31\%$
test_storage_write_contiguous[50-img_shape0-small] 0.2511ms 0.1638ms 6.1053 KOps/s 6.1439 KOps/s $\color{#d91a1a}-0.63\%$
test_storage_write_contiguous[100-img_shape1-atari] 0.3699ms 0.2296ms 4.3545 KOps/s 4.3892 KOps/s $\color{#d91a1a}-0.79\%$
test_storage_write_contiguous[100-img_shape2-large_img] 1.9907ms 1.8410ms 543.1863 Ops/s 543.6777 Ops/s $\color{#d91a1a}-0.09\%$
test_storage_write_contiguous[200-img_shape3-large_batch] 1.6066ms 1.2998ms 769.3454 Ops/s 705.9498 Ops/s $\textbf{\color{#35bf28}+8.98\%}$
test_collector_stack_then_write[50-img_shape0-small] 1.3389ms 1.1589ms 862.8930 Ops/s 869.8857 Ops/s $\color{#d91a1a}-0.80\%$
test_collector_stack_then_write[100-img_shape1-atari] 3.8711ms 3.6412ms 274.6355 Ops/s 268.2138 Ops/s $\color{#35bf28}+2.39\%$
test_collector_stack_then_write[100-img_shape2-large_img] 5.9968ms 5.8368ms 171.3278 Ops/s 173.3952 Ops/s $\color{#d91a1a}-1.19\%$
test_collector_stack_then_write[200-img_shape3-large_batch] 7.5920ms 7.3951ms 135.2245 Ops/s 132.5882 Ops/s $\color{#35bf28}+1.99\%$
test_collector_lazystack_then_write[50-img_shape0-small] 0.4169ms 0.2720ms 3.6770 KOps/s 3.6659 KOps/s $\color{#35bf28}+0.30\%$
test_collector_lazystack_then_write[100-img_shape1-atari] 1.6543ms 1.5094ms 662.5063 Ops/s 653.1470 Ops/s $\color{#35bf28}+1.43\%$
test_collector_lazystack_then_write[100-img_shape2-large_img] 2.5817ms 2.4128ms 414.4629 Ops/s 409.4435 Ops/s $\color{#35bf28}+1.23\%$
test_collector_lazystack_then_write[200-img_shape3-large_batch] 3.4268ms 3.1204ms 320.4718 Ops/s 318.0432 Ops/s $\color{#35bf28}+0.76\%$
test_collector_without_rb[100-img_shape0-atari] 34.0815ms 33.6226ms 29.7419 Ops/s 29.4402 Ops/s $\color{#35bf28}+1.02\%$
test_collector_without_rb[200-img_shape1-large_batch] 67.2309ms 65.9359ms 15.1662 Ops/s 15.0203 Ops/s $\color{#35bf28}+0.97\%$
test_collector_with_rb[100-img_shape0-atari] 38.8321ms 38.0020ms 26.3144 Ops/s 26.1608 Ops/s $\color{#35bf28}+0.59\%$
test_collector_with_rb[200-img_shape1-large_batch] 76.6722ms 75.3424ms 13.2727 Ops/s 13.0495 Ops/s $\color{#35bf28}+1.71\%$
test_collector_without_rb_cuda[100-img_shape0-atari] 59.3239ms 58.4965ms 17.0950 Ops/s 17.6099 Ops/s $\color{#d91a1a}-2.92\%$
test_collector_without_rb_cuda[200-img_shape1-large_batch] 0.8276s 0.1937s 5.1625 Ops/s 8.8412 Ops/s $\textbf{\color{#d91a1a}-41.61\%}$
test_collector_with_rb_cuda[100-img_shape0-atari] 61.6666ms 59.8250ms 16.7154 Ops/s 17.0329 Ops/s $\color{#d91a1a}-1.86\%$
test_collector_with_rb_cuda[200-img_shape1-large_batch] 0.1191s 0.1180s 8.4715 Ops/s 8.5335 Ops/s $\color{#d91a1a}-0.73\%$
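
Reading note on the last column: for every row checked, the percentage matches the third numeric column (throughput) divided by the fourth, minus one. The sketch below reproduces that arithmetic for two rows. It assumes the third column is the PR branch and the fourth is the base branch, which the table header (not shown here) would confirm. The helper names `to_ops` and `relative_change` are illustrative and are not part of the benchmark tooling.

```python
# Scale factors for the throughput units that appear in the table.
UNIT_SCALE = {"Ops/s": 1.0, "KOps/s": 1e3}


def to_ops(value: str, unit: str) -> float:
    """Convert a throughput cell such as ('2.3223', 'KOps/s') to plain Ops/s."""
    return float(value) * UNIT_SCALE[unit]


def relative_change(pr_ops: float, base_ops: float) -> float:
    """Percent change of the PR throughput relative to the base throughput."""
    return (pr_ops / base_ops - 1.0) * 100.0


# test_dqn_speed[True-backward]: 907.8635 Ops/s vs 812.1106 Ops/s
dqn = relative_change(to_ops("907.8635", "Ops/s"), to_ops("812.1106", "Ops/s"))

# test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000]:
# 822.0350 Ops/s vs 2.3223 KOps/s (units differ, so normalise first)
rb = relative_change(to_ops("822.0350", "Ops/s"), to_ops("2.3223", "KOps/s"))

print(f"{dqn:+.2f}%  {rb:+.2f}%")
```

Rows that mix `Ops/s` and `KOps/s` need the unit normalisation step before dividing, otherwise the ratio is off by a factor of 1000.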

vmoens added a commit that referenced this pull request Feb 7, 2026
Replace multiprocessing.Event (futex-based syscalls) with
multiprocessing.RawArray shared-memory byte flags for worker-to-parent
completion signaling on the hot path (step_and_maybe_reset).

- _start_workers: creates shm_done_flags RawArray, passes to workers
- _wait_for_workers: spin-polls done_flags instead of Event.wait()
- Worker: _signal_done() closure writes shm_done_flags[idx]=1
- _shutdown_workers: uses _wait_for_workers instead of Event.wait()

Measured impact:
- 10% FPS improvement (7,737 -> 8,509 fps) on H200 with 8 workers
- 28% reduction in penv.wait_for_workers overhead (2,622us -> 1,891us)
- ParallelEnv.close() fixed from 80s timeout to ~0.9s

Co-authored-by: Cursor <[email protected]>
ghstack-source-id: f29522a
Pull-Request: #3457
Co-authored-by: Cursor <[email protected]>
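
For readers unfamiliar with the pattern described in the commit message above, here is a minimal, self-contained sketch of completion signaling through a `multiprocessing.RawArray` of byte flags that the parent spin-polls. It is not the TorchRL implementation: `worker`, `wait_for_workers`, `main`, the pipe-based command channel and the timeout handling are all hypothetical stand-ins for `_signal_done`, `_wait_for_workers` and the real worker loop.

```python
import multiprocessing as mp
import time


def worker(idx, done_flags, pipe):
    """Hypothetical worker loop: receive a command, do the work, raise a flag."""
    while True:
        cmd = pipe.recv()
        # ... a real worker would step or reset its environment here ...
        done_flags[idx] = 1  # plain shared-memory byte write, no Event.set()
        if cmd == "close":
            break


def wait_for_workers(done_flags, timeout=10.0):
    """Spin-poll until every flag is set, then clear the flags for the next round."""
    deadline = time.monotonic() + timeout
    pending = set(range(len(done_flags)))
    while pending:
        pending = {i for i in pending if not done_flags[i]}
        if pending and time.monotonic() > deadline:
            raise TimeoutError(f"workers {sorted(pending)} did not signal completion")
    for i in range(len(done_flags)):
        done_flags[i] = 0


def main(num_workers=4):
    done_flags = mp.RawArray("B", num_workers)  # unsigned bytes, zero-initialised
    pipes, procs = [], []
    for idx in range(num_workers):
        parent_end, child_end = mp.Pipe()
        proc = mp.Process(target=worker, args=(idx, done_flags, child_end), daemon=True)
        proc.start()
        pipes.append(parent_end)
        procs.append(proc)
    for cmd in ("step", "step", "close"):
        for pipe in pipes:
            pipe.send(cmd)
        wait_for_workers(done_flags)  # also covers shutdown, as in the PR
    for proc in procs:
        proc.join()


if __name__ == "__main__":
    main()
```

The trade-off in this sketch is that the parent burns CPU while polling in exchange for avoiding the blocking wait of `Event.wait()`. Whether that is a net win presumably depends on how short each step is, so the numbers quoted in the commit message should be read as specific to that setup.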
@vmoens merged commit 1643a0b into gh/vmoens/218/base on Feb 7, 2026
114 of 116 checks passed
@vmoens deleted the gh/vmoens/218/head branch on February 7, 2026 at 15:39
Labels

CLA Signed: This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Performance: Performance issue or suggestion for improvement
