[Feature] AsyncBatchedCollector: coordinator loop and direct submission mode #3499

Merged
vmoens merged 7 commits into gh/vmoens/241/base from gh/vmoens/241/head on Feb 21, 2026

@vmoens
Collaborator

@vmoens vmoens commented Feb 12, 2026

Stack from ghstack (oldest at bottom):

Rewrite the AsyncBatchedCollector to use a coordinator thread that
pipelines env stepping and batched inference without a global sync
barrier. Add a `direct=True` mode where each env thread submits
directly to the InferenceServer, eliminating the coordinator thread
and its serialization overhead.
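The two ideas above (batching whatever is ready instead of waiting on a barrier, and letting env threads submit directly) can be illustrated with a small standard-library-only sketch. Everything here (`ToyInferenceServer`, `env_worker`, `run_direct`, the counter env and the constant policy) is hypothetical and is not TorchRL's `InferenceServer` or `AsyncBatchedCollector` API. The server thread drains whatever requests are already queued (up to `max_batch_size`) and runs one batched policy call, so no caller waits for all envs; each env thread submits its own observation and blocks only on its own future. A coordinator variant would put one extra thread between the env threads and the server to collect observations and route actions back, which is the hop that direct mode removes.

```python
import queue
import threading
from concurrent.futures import Future


class ToyInferenceServer:
    """Hypothetical batching server: many threads submit, one thread runs the policy."""

    def __init__(self, policy, max_batch_size=8, poll_interval=0.01):
        self.policy = policy  # callable: list of observations -> list of actions
        self.max_batch_size = max_batch_size
        self.poll_interval = poll_interval
        self._requests = queue.Queue()
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._serve, daemon=True)
        self._worker.start()

    def submit(self, obs):
        """Enqueue one observation; the returned Future resolves to its action."""
        fut = Future()
        self._requests.put((obs, fut))
        return fut

    def _serve(self):
        while not self._stop.is_set():
            try:
                # Wait briefly for a first request so shutdown stays responsive.
                batch = [self._requests.get(timeout=self.poll_interval)]
            except queue.Empty:
                continue
            # Greedily take whatever else is already waiting: no global barrier.
            while len(batch) < self.max_batch_size:
                try:
                    batch.append(self._requests.get_nowait())
                except queue.Empty:
                    break
            try:
                actions = self.policy([obs for obs, _ in batch])
            except Exception as exc:  # propagate policy errors to every caller
                for _, fut in batch:
                    fut.set_exception(exc)
                continue
            for (_, fut), action in zip(batch, actions):
                fut.set_result(action)

    def shutdown(self):
        self._stop.set()
        self._worker.join()


def env_worker(env_id, server, n_steps, results):
    """Direct mode: this env thread submits to the server itself."""
    obs = 0  # toy env: the observation is a running counter
    trajectory = []
    for _ in range(n_steps):
        action = server.submit(obs).result()  # block only on our own action
        trajectory.append((obs, action))
        obs = obs + action  # toy env step
    results[env_id] = trajectory


def run_direct(n_envs=8, n_steps=100, max_batch_size=8):
    batch_sizes = []  # appended only from the server thread

    def toy_policy(obs_batch):
        batch_sizes.append(len(obs_batch))
        return [1] * len(obs_batch)  # constant action for every env

    server = ToyInferenceServer(toy_policy, max_batch_size=max_batch_size)
    results = {}
    threads = [
        threading.Thread(target=env_worker, args=(i, server, n_steps, results))
        for i in range(n_envs)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    server.shutdown()
    return results, batch_sizes
```

For example, `run_direct(n_envs=8, n_steps=50, max_batch_size=4)` returns 8 trajectories of 50 steps each, and the recorded batch sizes never exceed 4 while summing to 400 requests; how large each batch actually gets depends on thread scheduling.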

Benchmark results (8 mock pixel envs, Nature-CNN, CPU):

- AsyncBatchedCollector direct: 3183 fps (+72% vs. the threading coordinator)
- AsyncBatchedCollector threading: 1850 fps (coordinator mode)
- AsyncBatchedCollector mp: 1042 fps (coordinator mode)
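The relative figures follow directly from the raw throughputs. The +72% is measured against the threading coordinator (1850 fps), and the same arithmetic puts direct mode at roughly 3x the mp coordinator:

$$
\frac{3183}{1850} - 1 \approx 0.72, \qquad \frac{3183}{1042} \approx 3.05, \qquad \frac{1042}{1850} \approx 0.56
$$

so in this setup the mp coordinator reaches about 56% of the threading coordinator's throughput.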

Co-authored-by: Cursor <cursoragent@cursor.com>

[ghstack-poisoned]
@pytorch-bot

pytorch-bot bot commented Feb 12, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/3499

Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 New Failures

As of commit 7cd1f9a with merge base 266e4aa:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

vmoens added a commit that referenced this pull request Feb 12, 2026
…on mode

Rewrite the AsyncBatchedCollector to use a coordinator thread that
pipelines env stepping and batched inference without a global sync
barrier. Add a `direct=True` mode where each env thread submits
directly to the InferenceServer, eliminating the coordinator thread
and its serialization overhead.

Benchmark results (8 mock pixel envs, Nature-CNN, CPU):
  AsyncBatchedCollector direct:    3183 fps (+72% vs coordinator)
  AsyncBatchedCollector threading: 1850 fps (coordinator mode)
  AsyncBatchedCollector mp:        1042 fps (coordinator mode)

Co-authored-by: Cursor <cursoragent@cursor.com>
ghstack-source-id: 225d2a4
Pull-Request: #3499
github-actions bot added the Feature (New feature) label Feb 12, 2026
github-actions bot added the Benchmarks (rl/benchmark changes) and Collectors labels and removed the Feature (New feature) label Feb 12, 2026
meta-cla bot added the CLA Signed label (This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.) Feb 12, 2026
github-actions bot added the Feature (New feature) label Feb 12, 2026
@github-actions
Contributor

github-actions bot commented Feb 12, 2026

⚠ Warning: Result of CPU Benchmark Tests

Total Benchmarks: 173. Improved: 9. Worsened: 11.

| Name | Max | Mean | Ops | Ops on Repo HEAD | Change |
|---|---|---|---|---|---|
| test_tensor_to_bytestream_speed[pickle] | 78.1155μs | 77.5567μs | 12.8938 KOps/s | 12.7413 KOps/s | +1.20% |
| test_tensor_to_bytestream_speed[torch.save] | 0.1344ms | 0.1340ms | 7.4625 KOps/s | 7.3419 KOps/s | +1.64% |
| test_tensor_to_bytestream_speed[untyped_storage] | 0.1023s | 0.1018s | 9.8186 Ops/s | 9.9132 Ops/s | -0.95% |
| test_tensor_to_bytestream_speed[numpy] | 2.5062μs | 2.4907μs | 401.4858 KOps/s | 406.9183 KOps/s | -1.34% |
| test_tensor_to_bytestream_speed[safetensors] | 38.6334μs | 36.9422μs | 27.0693 KOps/s | 25.6151 KOps/s | **+5.68%** |
| test_simple | 0.5374s | 0.5305s | 1.8851 Ops/s | 1.8093 Ops/s | +4.19% |
| test_transformed | 1.0549s | 1.0502s | 0.9522 Ops/s | 0.9316 Ops/s | +2.21% |
| test_serial | 1.6111s | 1.6069s | 0.6223 Ops/s | 0.6134 Ops/s | +1.46% |
| test_parallel | 0.9859s | 0.9824s | 1.0179 Ops/s | 0.9874 Ops/s | +3.10% |
| test_step_mdp_speed[True-True-True-True-True] | 0.2581ms | 40.8456μs | 24.4824 KOps/s | 23.8212 KOps/s | +2.78% |
| test_step_mdp_speed[True-True-True-True-False] | 0.4478ms | 23.2164μs | 43.0730 KOps/s | 43.5799 KOps/s | -1.16% |
| test_step_mdp_speed[True-True-True-False-True] | 0.4439ms | 22.8189μs | 43.8233 KOps/s | 43.0113 KOps/s | +1.89% |
| test_step_mdp_speed[True-True-True-False-False] | 41.5820μs | 12.5214μs | 79.8634 KOps/s | 78.5072 KOps/s | +1.73% |
| test_step_mdp_speed[True-True-False-True-True] | 0.4582ms | 43.7141μs | 22.8759 KOps/s | 22.8082 KOps/s | +0.30% |
| test_step_mdp_speed[True-True-False-True-False] | 0.4361ms | 25.2168μs | 39.6561 KOps/s | 39.1957 KOps/s | +1.17% |
| test_step_mdp_speed[True-True-False-False-True] | 0.4449ms | 25.5651μs | 39.1158 KOps/s | 38.9387 KOps/s | +0.46% |
| test_step_mdp_speed[True-True-False-False-False] | 44.1620μs | 15.1310μs | 66.0895 KOps/s | 65.0148 KOps/s | +1.65% |
| test_step_mdp_speed[True-False-True-True-True] | 0.4628ms | 47.3479μs | 21.1203 KOps/s | 21.0736 KOps/s | +0.22% |
| test_step_mdp_speed[True-False-True-True-False] | 0.4356ms | 27.7887μs | 35.9859 KOps/s | 35.7661 KOps/s | +0.61% |
| test_step_mdp_speed[True-False-True-False-True] | 59.1920μs | 26.0264μs | 38.4225 KOps/s | 38.4337 KOps/s | -0.03% |
| test_step_mdp_speed[True-False-True-False-False] | 0.4298ms | 15.1573μs | 65.9746 KOps/s | 64.9223 KOps/s | +1.62% |
| test_step_mdp_speed[True-False-False-True-True] | 0.4764ms | 49.5713μs | 20.1730 KOps/s | 20.3421 KOps/s | -0.83% |
| test_step_mdp_speed[True-False-False-True-False] | 0.4492ms | 30.7643μs | 32.5053 KOps/s | 33.0550 KOps/s | -1.66% |
| test_step_mdp_speed[True-False-False-False-True] | 64.1920μs | 28.4180μs | 35.1890 KOps/s | 35.1667 KOps/s | +0.06% |
| test_step_mdp_speed[True-False-False-False-False] | 0.4489ms | 17.8986μs | 55.8702 KOps/s | 56.0103 KOps/s | -0.25% |
| test_step_mdp_speed[False-True-True-True-True] | 0.4630ms | 46.8786μs | 21.3317 KOps/s | 21.5301 KOps/s | -0.92% |
| test_step_mdp_speed[False-True-True-True-False] | 0.4469ms | 28.2396μs | 35.4113 KOps/s | 35.6774 KOps/s | -0.75% |
| test_step_mdp_speed[False-True-True-False-True] | 2.5538ms | 29.2775μs | 34.1560 KOps/s | 33.5061 KOps/s | +1.94% |
| test_step_mdp_speed[False-True-True-False-False] | 0.4510ms | 16.9882μs | 58.8643 KOps/s | 58.8825 KOps/s | -0.03% |
| test_step_mdp_speed[False-True-False-True-True] | 0.4787ms | 49.3964μs | 20.2444 KOps/s | 20.5861 KOps/s | -1.66% |
| test_step_mdp_speed[False-True-False-True-False] | 67.3630μs | 30.6673μs | 32.6080 KOps/s | 33.1130 KOps/s | -1.52% |
| test_step_mdp_speed[False-True-False-False-True] | 0.4520ms | 31.8016μs | 31.4450 KOps/s | 31.4790 KOps/s | -0.11% |
| test_step_mdp_speed[False-True-False-False-False] | 0.4389ms | 19.4405μs | 51.4390 KOps/s | 51.5015 KOps/s | -0.12% |
| test_step_mdp_speed[False-False-True-True-True] | 0.4657ms | 51.2791μs | 19.5011 KOps/s | 19.6221 KOps/s | -0.62% |
| test_step_mdp_speed[False-False-True-True-False] | 0.4457ms | 33.1295μs | 30.1846 KOps/s | 30.5331 KOps/s | -1.14% |
| test_step_mdp_speed[False-False-True-False-True] | 67.8230μs | 31.9895μs | 31.2603 KOps/s | 31.2267 KOps/s | +0.11% |
| test_step_mdp_speed[False-False-True-False-False] | 0.4319ms | 19.3519μs | 51.6745 KOps/s | 51.8123 KOps/s | -0.27% |
| test_step_mdp_speed[False-False-False-True-True] | 0.4730ms | 53.9023μs | 18.5521 KOps/s | 18.8605 KOps/s | -1.64% |
| test_step_mdp_speed[False-False-False-True-False] | 0.4613ms | 35.5262μs | 28.1482 KOps/s | 28.4980 KOps/s | -1.23% |
| test_step_mdp_speed[False-False-False-False-True] | 76.4630μs | 34.1793μs | 29.2575 KOps/s | 29.4910 KOps/s | -0.79% |
| test_step_mdp_speed[False-False-False-False-False] | 0.4363ms | 21.9395μs | 45.5799 KOps/s | 45.9109 KOps/s | -0.72% |
| test_non_tensor_env_rollout_speed[1000-single-True] | 0.8177s | 0.7197s | 1.3895 Ops/s | 1.3758 Ops/s | +1.00% |
| test_non_tensor_env_rollout_speed[1000-single-False] | 0.6855s | 0.5874s | 1.7025 Ops/s | 1.6825 Ops/s | +1.19% |
| test_non_tensor_env_rollout_speed[1000-serial-no-buffers-True] | 1.6656s | 1.5772s | 0.6340 Ops/s | 0.6200 Ops/s | +2.26% |
| test_non_tensor_env_rollout_speed[1000-serial-no-buffers-False] | 1.4471s | 1.3655s | 0.7324 Ops/s | 0.7149 Ops/s | +2.44% |
| test_non_tensor_env_rollout_speed[1000-serial-buffers-True] | 1.9016s | 1.8170s | 0.5503 Ops/s | 0.5414 Ops/s | +1.66% |
| test_non_tensor_env_rollout_speed[1000-serial-buffers-False] | 1.7005s | 1.6156s | 0.6190 Ops/s | 0.6139 Ops/s | +0.82% |
| test_non_tensor_env_rollout_speed[1000-parallel-no-buffers-True] | 4.5703s | 4.4675s | 0.2238 Ops/s | 0.2205 Ops/s | +1.52% |
| test_non_tensor_env_rollout_speed[1000-parallel-no-buffers-False] | 4.4122s | 4.3684s | 0.2289 Ops/s | 0.2287 Ops/s | +0.09% |
| test_non_tensor_env_rollout_speed[1000-parallel-buffers-True] | 1.8755s | 1.7960s | 0.5568 Ops/s | 0.5425 Ops/s | +2.64% |
| test_non_tensor_env_rollout_speed[1000-parallel-buffers-False] | 1.6301s | 1.5405s | 0.6491 Ops/s | 0.6421 Ops/s | +1.09% |
| test_values[generalized_advantage_estimate-True-True] | 10.3016ms | 9.7274ms | 102.8021 Ops/s | 101.6885 Ops/s | +1.10% |
| test_values[vec_generalized_advantage_estimate-True-True] | 19.7089ms | 17.4989ms | 57.1463 Ops/s | 56.9791 Ops/s | +0.29% |
| test_values[td0_return_estimate-False-False] | 0.2342ms | 0.1296ms | 7.7161 KOps/s | 7.7397 KOps/s | -0.31% |
| test_values[td1_return_estimate-False-False] | 28.6400ms | 26.6397ms | 37.5380 Ops/s | 37.8918 Ops/s | -0.93% |
| test_values[vec_td1_return_estimate-False-False] | 17.8761ms | 17.5053ms | 57.1257 Ops/s | 56.8852 Ops/s | +0.42% |
| test_values[td_lambda_return_estimate-True-False] | 39.6623ms | 38.9814ms | 25.6532 Ops/s | 24.6362 Ops/s | +4.13% |
| test_values[vec_td_lambda_return_estimate-True-False] | 17.8578ms | 17.5059ms | 57.1237 Ops/s | 56.7381 Ops/s | +0.68% |
| test_gae_speed[generalized_advantage_estimate-False-1-512] | 8.6067ms | 8.5458ms | 117.0161 Ops/s | 111.8374 Ops/s | +4.63% |
| test_gae_speed[vec_generalized_advantage_estimate-True-1-512] | 1.7899ms | 1.4990ms | 667.1307 Ops/s | 703.6703 Ops/s | **-5.19%** |
| test_gae_speed[vec_generalized_advantage_estimate-False-1-512] | 0.4920ms | 0.4085ms | 2.4479 KOps/s | 2.3718 KOps/s | +3.21% |
| test_gae_speed[vec_generalized_advantage_estimate-True-32-512] | 34.5905ms | 34.3663ms | 29.0983 Ops/s | 28.9262 Ops/s | +0.59% |
| test_gae_speed[vec_generalized_advantage_estimate-False-32-512] | 2.0465ms | 1.7007ms | 587.9847 Ops/s | 587.1227 Ops/s | +0.15% |
| test_dqn_speed[False-None] | 1.6176ms | 1.3535ms | 738.8024 Ops/s | 738.1115 Ops/s | +0.09% |
| test_dqn_speed[False-backward] | 1.9316ms | 1.8591ms | 537.8870 Ops/s | 523.1212 Ops/s | +2.82% |
| test_dqn_speed[True-None] | 0.5648ms | 0.5214ms | 1.9178 KOps/s | 1.8817 KOps/s | +1.92% |
| test_dqn_speed[True-backward] | 1.0218ms | 0.9700ms | 1.0310 KOps/s | 861.1486 Ops/s | **+19.72%** |
| test_dqn_speed[reduce-overhead-None] | 0.8838ms | 0.5201ms | 1.9229 KOps/s | 1.8964 KOps/s | +1.40% |
| test_ddpg_speed[False-None] | 3.0930ms | 2.7618ms | 362.0800 Ops/s | 362.1365 Ops/s | -0.02% |
| test_ddpg_speed[False-backward] | 4.3556ms | 3.9489ms | 253.2357 Ops/s | 254.1964 Ops/s | -0.38% |
| test_ddpg_speed[True-None] | 1.5727ms | 1.3467ms | 742.5791 Ops/s | 735.4795 Ops/s | +0.97% |
| test_ddpg_speed[True-backward] | 2.3687ms | 2.3197ms | 431.0861 Ops/s | 357.3871 Ops/s | **+20.62%** |
| test_ddpg_speed[reduce-overhead-None] | 1.7244ms | 1.3420ms | 745.1511 Ops/s | 737.7784 Ops/s | +1.00% |
| test_sac_speed[False-None] | 8.3662ms | 7.7745ms | 128.6250 Ops/s | 129.9813 Ops/s | -1.04% |
| test_sac_speed[False-backward] | 11.1295ms | 10.9355ms | 91.4450 Ops/s | 91.7214 Ops/s | -0.30% |
| test_sac_speed[True-None] | 2.3803ms | 2.0712ms | 482.8116 Ops/s | 485.3676 Ops/s | -0.53% |
| test_sac_speed[True-backward] | 3.9844ms | 3.8835ms | 257.5013 Ops/s | 250.5610 Ops/s | +2.77% |
| test_sac_speed[reduce-overhead-None] | 2.3648ms | 2.0510ms | 487.5592 Ops/s | 480.2474 Ops/s | +1.52% |
| test_redq_speed[False-None] | 14.5650ms | 10.3376ms | 96.7340 Ops/s | 93.6453 Ops/s | +3.30% |
| test_redq_speed[False-backward] | 18.6387ms | 17.4539ms | 57.2938 Ops/s | 57.8800 Ops/s | -1.01% |
| test_redq_speed[True-None] | 4.6487ms | 4.1873ms | 238.8157 Ops/s | 243.6965 Ops/s | -2.00% |
| test_redq_speed[True-backward] | 10.0744ms | 9.4500ms | 105.8200 Ops/s | 105.5043 Ops/s | +0.30% |
| test_redq_speed[reduce-overhead-None] | 4.4124ms | 4.1404ms | 241.5225 Ops/s | 247.6528 Ops/s | -2.48% |
| test_redq_deprec_speed[False-None] | 11.2115ms | 10.7788ms | 92.7749 Ops/s | 95.0957 Ops/s | -2.44% |
| test_redq_deprec_speed[False-backward] | 16.0447ms | 15.4909ms | 64.5542 Ops/s | 65.9920 Ops/s | -2.18% |
| test_redq_deprec_speed[True-None] | 3.8240ms | 3.5503ms | 281.6648 Ops/s | 272.4641 Ops/s | +3.38% |
| test_redq_deprec_speed[True-backward] | 7.4958ms | 7.3366ms | 136.3023 Ops/s | 129.9791 Ops/s | +4.86% |
| test_redq_deprec_speed[reduce-overhead-None] | 3.6714ms | 3.4569ms | 289.2790 Ops/s | 285.7621 Ops/s | +1.23% |
| test_td3_speed[False-None] | 8.7321ms | 7.8842ms | 126.8362 Ops/s | 127.3417 Ops/s | -0.40% |
| test_td3_speed[False-backward] | 11.9294ms | 10.8136ms | 92.4761 Ops/s | 93.7784 Ops/s | -1.39% |
| test_td3_speed[True-None] | 1.7734ms | 1.7352ms | 576.2916 Ops/s | 572.0680 Ops/s | +0.74% |
| test_td3_speed[True-backward] | 3.7126ms | 3.5025ms | 285.5105 Ops/s | 264.3758 Ops/s | **+7.99%** |
| test_td3_speed[reduce-overhead-None] | 1.7536ms | 1.7114ms | 584.3129 Ops/s | 585.8395 Ops/s | -0.26% |
| test_cql_speed[False-None] | 30.1240ms | 25.9847ms | 38.4842 Ops/s | 38.3491 Ops/s | +0.35% |
| test_cql_speed[False-backward] | 39.2054ms | 34.9661ms | 28.5991 Ops/s | 28.5107 Ops/s | +0.31% |
| test_cql_speed[True-None] | 15.0734ms | 12.2529ms | 81.6133 Ops/s | 79.8910 Ops/s | +2.16% |
| test_cql_speed[True-backward] | 18.1692ms | 17.8690ms | 55.9629 Ops/s | 54.8742 Ops/s | +1.98% |
| test_cql_speed[reduce-overhead-None] | 14.7153ms | 12.3392ms | 81.0425 Ops/s | 82.9582 Ops/s | -2.31% |
| test_a2c_speed[False-None] | 5.5299ms | 5.2977ms | 188.7608 Ops/s | 188.6835 Ops/s | +0.04% |
| test_a2c_speed[False-backward] | 11.8083ms | 11.6282ms | 85.9978 Ops/s | 86.1883 Ops/s | -0.22% |
| test_a2c_speed[True-None] | 4.3226ms | 3.6640ms | 272.9229 Ops/s | 271.9795 Ops/s | +0.35% |
| test_a2c_speed[True-backward] | 8.6659ms | 8.3524ms | 119.7254 Ops/s | 119.4123 Ops/s | +0.26% |
| test_a2c_speed[reduce-overhead-None] | 3.9774ms | 3.6290ms | 275.5607 Ops/s | 274.8269 Ops/s | +0.27% |
| test_ppo_speed[False-None] | 6.0422ms | 5.7169ms | 174.9193 Ops/s | 171.4088 Ops/s | +2.05% |
| test_ppo_speed[False-backward] | 12.6286ms | 12.1875ms | 82.0513 Ops/s | 82.4182 Ops/s | -0.45% |
| test_ppo_speed[True-None] | 4.0122ms | 3.5410ms | 282.4038 Ops/s | 280.5500 Ops/s | +0.66% |
| test_ppo_speed[True-backward] | 8.5546ms | 8.1628ms | 122.5064 Ops/s | 120.8308 Ops/s | +1.39% |
| test_ppo_speed[reduce-overhead-None] | 3.8301ms | 3.5300ms | 283.2829 Ops/s | 282.4688 Ops/s | +0.29% |
| test_reinforce_speed[False-None] | 4.8782ms | 4.4604ms | 224.1959 Ops/s | 224.8047 Ops/s | -0.27% |
| test_reinforce_speed[False-backward] | 7.5448ms | 7.2533ms | 137.8689 Ops/s | 138.8455 Ops/s | -0.70% |
| test_reinforce_speed[True-None] | 3.1403ms | 2.7680ms | 361.2697 Ops/s | 357.2876 Ops/s | +1.11% |
| test_reinforce_speed[True-backward] | 7.6305ms | 7.4481ms | 134.2622 Ops/s | 131.8388 Ops/s | +1.84% |
| test_reinforce_speed[reduce-overhead-None] | 3.2291ms | 2.7457ms | 364.2052 Ops/s | 348.1170 Ops/s | +4.62% |
| test_iql_speed[False-None] | 25.0708ms | 19.7731ms | 50.5738 Ops/s | 50.0290 Ops/s | +1.09% |
| test_iql_speed[False-backward] | 30.4544ms | 29.6276ms | 33.7523 Ops/s | 33.2741 Ops/s | +1.44% |
| test_iql_speed[True-None] | 8.4752ms | 8.0769ms | 123.8102 Ops/s | 117.6196 Ops/s | **+5.26%** |
| test_iql_speed[True-backward] | 16.4753ms | 15.9724ms | 62.6082 Ops/s | 60.3867 Ops/s | +3.68% |
| test_iql_speed[reduce-overhead-None] | 8.4184ms | 8.1826ms | 122.2106 Ops/s | 119.7381 Ops/s | +2.06% |
| test_rb_sample[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] | 5.9802ms | 5.8340ms | 171.4077 Ops/s | 169.8527 Ops/s | +0.92% |
| test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] | 2.7111ms | 0.3247ms | 3.0801 KOps/s | 3.0343 KOps/s | +1.51% |
| test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] | 0.4951ms | 0.2545ms | 3.9288 KOps/s | 3.6813 KOps/s | **+6.72%** |
| test_rb_sample[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] | 5.7795ms | 5.5443ms | 180.3667 Ops/s | 178.5162 Ops/s | +1.04% |
| test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] | 1.4542ms | 0.3021ms | 3.3101 KOps/s | 3.3135 KOps/s | -0.10% |
| test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] | 0.6416ms | 0.2924ms | 3.4202 KOps/s | 3.9152 KOps/s | **-12.64%** |
| test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-sampler6-10000] | 1.6771ms | 1.3692ms | 730.3439 Ops/s | 821.0243 Ops/s | **-11.04%** |
| test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-sampler7-10000] | 1.6220ms | 1.2723ms | 785.9686 Ops/s | 867.6025 Ops/s | **-9.41%** |
| test_rb_sample[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] | 12.5545ms | 5.8837ms | 169.9609 Ops/s | 175.8822 Ops/s | -3.37% |
| test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] | 1.0005ms | 0.4435ms | 2.2550 KOps/s | 2.3854 KOps/s | **-5.47%** |
| test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] | 0.6465ms | 0.4015ms | 2.4908 KOps/s | 2.5172 KOps/s | -1.05% |
| test_rb_iterate[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] | 5.9452ms | 5.5903ms | 178.8799 Ops/s | 178.9095 Ops/s | -0.02% |
| test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] | 0.5949ms | 0.3166ms | 3.1586 KOps/s | 3.3289 KOps/s | **-5.12%** |
| test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] | 0.6366ms | 0.3583ms | 2.7913 KOps/s | 3.5798 KOps/s | **-22.03%** |
| test_rb_iterate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] | 5.7902ms | 5.5777ms | 179.2842 Ops/s | 177.5473 Ops/s | +0.98% |
| test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] | 0.7171ms | 0.3580ms | 2.7934 KOps/s | 3.1753 KOps/s | **-12.03%** |
| test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] | 9.3546ms | 0.3511ms | 2.8484 KOps/s | 3.1842 KOps/s | **-10.55%** |
| test_rb_iterate[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] | 5.8228ms | 5.7543ms | 173.7841 Ops/s | 172.8912 Ops/s | +0.52% |
| test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] | 1.9905ms | 0.5100ms | 1.9608 KOps/s | 2.1819 KOps/s | **-10.13%** |
| test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] | 0.7127ms | 0.4912ms | 2.0357 KOps/s | 2.1768 KOps/s | **-6.48%** |
| test_rb_populate[TensorDictReplayBuffer-ListStorage-RandomSampler-400] | 6.2816ms | 4.8652ms | 205.5414 Ops/s | 202.2927 Ops/s | +1.61% |
| test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-400] | 9.6153ms | 2.1702ms | 460.7893 Ops/s | 460.3537 Ops/s | +0.09% |
| test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400] | 3.2008ms | 0.8813ms | 1.1347 KOps/s | 1.1150 KOps/s | +1.77% |
| test_rb_populate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-400] | 0.5422s | 15.7844ms | 63.3536 Ops/s | 60.1070 Ops/s | **+5.40%** |
| test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-400] | 3.8147ms | 1.7388ms | 575.1182 Ops/s | 524.6732 Ops/s | **+9.61%** |
| test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-400] | 7.4080ms | 1.1674ms | 856.5797 Ops/s | 857.1443 Ops/s | -0.07% |
| test_rb_populate[TensorDictPrioritizedReplayBuffer-ListStorage-None-400] | 6.5545ms | 5.1185ms | 195.3712 Ops/s | 192.0085 Ops/s | +1.75% |
| test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400] | 12.7026ms | 2.0156ms | 496.1344 Ops/s | 493.5029 Ops/s | +0.53% |
| test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400] | 1.2912ms | 1.0188ms | 981.5234 Ops/s | 992.2286 Ops/s | -1.08% |
| test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-True] | 37.0078ms | 35.0213ms | 28.5540 Ops/s | 28.2650 Ops/s | +1.02% |
| test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-False] | 19.5622ms | 17.7677ms | 56.2819 Ops/s | 56.1805 Ops/s | +0.18% |
| test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-100000-10000-100-True] | 38.7093ms | 36.0754ms | 27.7197 Ops/s | 27.3741 Ops/s | +1.26% |
| test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-100000-10000-100-False] | 19.3189ms | 17.8626ms | 55.9828 Ops/s | 55.2290 Ops/s | +1.36% |
| test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-1000000-10000-100-True] | 39.4385ms | 37.9110ms | 26.3775 Ops/s | 25.5299 Ops/s | +3.32% |
| test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-1000000-10000-100-False] | 21.7363ms | 19.7309ms | 50.6819 Ops/s | 50.3148 Ops/s | +0.73% |
| test_storage_write_lazystack[50-img_shape0-small] | 0.8985ms | 0.2150ms | 4.6513 KOps/s | 4.5106 KOps/s | +3.12% |
| test_storage_write_lazystack[100-img_shape1-atari] | 1.5659ms | 1.3911ms | 718.8581 Ops/s | 714.5053 Ops/s | +0.61% |
| test_storage_write_lazystack[100-img_shape2-large_img] | 2.7089ms | 2.2875ms | 437.1520 Ops/s | 415.5390 Ops/s | **+5.20%** |
| test_storage_write_lazystack[200-img_shape3-large_batch] | 3.1361ms | 2.9079ms | 343.8850 Ops/s | 340.5614 Ops/s | +0.98% |
| test_storage_write_contiguous[50-img_shape0-small] | 0.2095ms | 0.1344ms | 7.4409 KOps/s | 7.4610 KOps/s | -0.27% |
| test_storage_write_contiguous[100-img_shape1-atari] | 0.3315ms | 0.1907ms | 5.2427 KOps/s | 5.1841 KOps/s | +1.13% |
| test_storage_write_contiguous[100-img_shape2-large_img] | 1.9269ms | 1.7710ms | 564.6612 Ops/s | 574.3546 Ops/s | -1.69% |
| test_storage_write_contiguous[200-img_shape3-large_batch] | 1.4331ms | 1.2756ms | 783.9314 Ops/s | 764.8303 Ops/s | +2.50% |
| test_collector_stack_then_write[50-img_shape0-small] | 1.2156ms | 1.0876ms | 919.4575 Ops/s | 919.9018 Ops/s | -0.05% |
| test_collector_stack_then_write[100-img_shape1-atari] | 7.2812ms | 3.4681ms | 288.3435 Ops/s | 284.4180 Ops/s | +1.38% |
| test_collector_stack_then_write[100-img_shape2-large_img] | 10.9369ms | 5.6652ms | 176.5151 Ops/s | 178.8494 Ops/s | -1.31% |
| test_collector_stack_then_write[200-img_shape3-large_batch] | 7.3144ms | 6.8304ms | 146.4034 Ops/s | 146.5443 Ops/s | -0.10% |
| test_collector_lazystack_then_write[50-img_shape0-small] | 0.4279ms | 0.2706ms | 3.6960 KOps/s | 3.7176 KOps/s | -0.58% |
| test_collector_lazystack_then_write[100-img_shape1-atari] | 1.7002ms | 1.5163ms | 659.5193 Ops/s | 664.5383 Ops/s | -0.76% |
| test_collector_lazystack_then_write[100-img_shape2-large_img] | 2.7915ms | 2.4099ms | 414.9593 Ops/s | 396.6624 Ops/s | +4.61% |
| test_collector_lazystack_then_write[200-img_shape3-large_batch] | 3.2289ms | 3.1000ms | 322.5857 Ops/s | 320.1489 Ops/s | +0.76% |
| test_collector_without_rb[100-img_shape0-atari] | 32.8773ms | 31.8503ms | 31.3968 Ops/s | 30.8932 Ops/s | +1.63% |
| test_collector_without_rb[200-img_shape1-large_batch] | 64.6618ms | 63.6129ms | 15.7201 Ops/s | 15.8866 Ops/s | -1.05% |
| test_collector_with_rb[100-img_shape0-atari] | 37.5098ms | 36.3318ms | 27.5241 Ops/s | 27.6394 Ops/s | -0.42% |
| test_collector_with_rb[200-img_shape1-large_batch] | 93.3708ms | 72.1759ms | 13.8550 Ops/s | 14.0710 Ops/s | -1.53% |
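The Change column appears to be the relative throughput difference, (Ops / Ops on Repo HEAD − 1) × 100, after normalizing KOps/s to Ops/s. For example, test_dqn_speed[True-backward] compares 1.0310 KOps/s against 861.1486 Ops/s, which gives +19.72%. The bolded entries also appear to be exactly those with |Change| above 5%, and counting them reproduces the "Improved: 9, Worsened: 11" summary. The helpers below (`to_ops_per_sec`, `relative_change`, `is_flagged`) are a hypothetical sketch of that arithmetic, not the benchmark bot's code, and the 5% threshold is an inference from the table.

```python
# Hypothetical helpers (not part of the benchmark bot) that recompute the "Change" column.
UNIT_SCALE = {"Ops/s": 1.0, "KOps/s": 1e3}


def to_ops_per_sec(value: str) -> float:
    """Parse a cell such as '12.8938 KOps/s' or '861.1486 Ops/s' into ops/s."""
    number, unit = value.split()
    return float(number) * UNIT_SCALE[unit]


def relative_change(ops: str, ops_on_head: str) -> float:
    """Percent change of this PR's throughput relative to repo HEAD."""
    return (to_ops_per_sec(ops) / to_ops_per_sec(ops_on_head) - 1.0) * 100.0


def is_flagged(change_pct: float, threshold: float = 5.0) -> bool:
    """Assumed rule for the bolded entries: |change| strictly above 5%."""
    return abs(change_pct) > threshold


rows = [
    ("test_dqn_speed[True-backward]", "1.0310 KOps/s", "861.1486 Ops/s"),
    ("test_gae_speed[generalized_advantage_estimate-False-1-512]", "117.0161 Ops/s", "111.8374 Ops/s"),
]
for name, ops, head in rows:
    change = relative_change(ops, head)
    print(f"{name}: {change:+.2f}% flagged={is_flagged(change)}")
```

Mixed units are the main pitfall here: comparing the raw numbers 1.0310 and 861.1486 without normalizing would give a meaningless result.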

@github-actions
Contributor

github-actions bot commented Feb 12, 2026

⚠ Warning: Result of GPU Benchmark Tests

Total Benchmarks: 172. Improved: 12. Worsened: 12.

| Name | Max | Mean | Ops | Ops on Repo HEAD | Change |
|---|---|---|---|---|---|
| test_tensor_to_bytestream_speed[pickle] | 83.7445μs | 82.1958μs | 12.1661 KOps/s | 12.4615 KOps/s | -2.37% |
| test_tensor_to_bytestream_speed[torch.save] | 0.1464ms | 0.1453ms | 6.8804 KOps/s | 7.1089 KOps/s | -3.21% |
| test_tensor_to_bytestream_speed[untyped_storage] | 0.1162s | 0.1160s | 8.6211 Ops/s | 8.9317 Ops/s | -3.48% |
| test_tensor_to_bytestream_speed[numpy] | 2.5754μs | 2.5612μs | 390.4389 KOps/s | 364.4169 KOps/s | **+7.14%** |
| test_tensor_to_bytestream_speed[safetensors] | 40.6848μs | 40.3940μs | 24.7562 KOps/s | 26.8071 KOps/s | **-7.65%** |
| test_simple | 0.8024s | 0.7900s | 1.2659 Ops/s | 1.2158 Ops/s | +4.12% |
| test_transformed | 1.4009s | 1.3850s | 0.7220 Ops/s | 0.7045 Ops/s | +2.49% |
| test_serial | 2.3123s | 2.3015s | 0.4345 Ops/s | 0.4256 Ops/s | +2.09% |
| test_parallel | 1.9125s | 1.8394s | 0.5437 Ops/s | 0.5515 Ops/s | -1.42% |
| test_step_mdp_speed[True-True-True-True-True] | 0.4738ms | 42.7527μs | 23.3903 KOps/s | 23.7642 KOps/s | -1.57% |
| test_step_mdp_speed[True-True-True-True-False] | 47.5810μs | 23.4606μs | 42.6246 KOps/s | 41.5905 KOps/s | +2.49% |
| test_step_mdp_speed[True-True-True-False-True] | 0.4466ms | 23.4248μs | 42.6898 KOps/s | 41.8668 KOps/s | +1.97% |
| test_step_mdp_speed[True-True-True-False-False] | 0.4474ms | 13.0056μs | 76.8901 KOps/s | 75.6332 KOps/s | +1.66% |
| test_step_mdp_speed[True-True-False-True-True] | 70.4810μs | 45.0852μs | 22.1802 KOps/s | 21.9029 KOps/s | +1.27% |
| test_step_mdp_speed[True-True-False-True-False] | 0.4505ms | 25.9696μs | 38.5065 KOps/s | 37.9471 KOps/s | +1.47% |
| test_step_mdp_speed[True-True-False-False-True] | 0.4735ms | 26.2203μs | 38.1384 KOps/s | 37.6935 KOps/s | +1.18% |
| test_step_mdp_speed[True-True-False-False-False] | 39.1110μs | 15.8850μs | 62.9526 KOps/s | 62.7788 KOps/s | +0.28% |
| test_step_mdp_speed[True-False-True-True-True] | 0.4732ms | 47.6611μs | 20.9815 KOps/s | 20.8594 KOps/s | +0.59% |
| test_step_mdp_speed[True-False-True-True-False] | 0.4486ms | 29.3931μs | 34.0216 KOps/s | 34.0316 KOps/s | -0.03% |
| test_step_mdp_speed[True-False-True-False-True] | 0.4452ms | 26.6950μs | 37.4602 KOps/s | 37.2311 KOps/s | +0.62% |
| test_step_mdp_speed[True-False-True-False-False] | 38.5110μs | 15.8386μs | 63.1371 KOps/s | 62.8156 KOps/s | +0.51% |
| test_step_mdp_speed[True-False-False-True-True] | 0.4782ms | 50.7500μs | 19.7044 KOps/s | 19.5954 KOps/s | +0.56% |
| test_step_mdp_speed[True-False-False-True-False] | 0.4504ms | 31.7666μs | 31.4796 KOps/s | 31.6489 KOps/s | -0.53% |
| test_step_mdp_speed[True-False-False-False-True] | 0.4517ms | 29.2495μs | 34.1886 KOps/s | 34.4846 KOps/s | -0.86% |
| test_step_mdp_speed[True-False-False-False-False] | 54.0910μs | 18.5194μs | 53.9975 KOps/s | 53.8423 KOps/s | +0.29% |
| test_step_mdp_speed[False-True-True-True-True] | 0.4620ms | 48.2062μs | 20.7442 KOps/s | 20.6581 KOps/s | +0.42% |
| test_step_mdp_speed[False-True-True-True-False] | 0.4465ms | 29.3697μs | 34.0487 KOps/s | 34.4107 KOps/s | -1.05% |
| test_step_mdp_speed[False-True-True-False-True] | 2.4987ms | 30.4695μs | 32.8197 KOps/s | 33.1710 KOps/s | -1.06% |
| test_step_mdp_speed[False-True-True-False-False] | 48.5610μs | 17.6964μs | 56.5086 KOps/s | 57.0479 KOps/s | -0.95% |
| test_step_mdp_speed[False-True-False-True-True] | 0.4725ms | 50.6177μs | 19.7559 KOps/s | 19.6057 KOps/s | +0.77% |
| test_step_mdp_speed[False-True-False-True-False] | 0.4497ms | 31.6960μs | 31.5497 KOps/s | 31.3628 KOps/s | +0.60% |
| test_step_mdp_speed[False-True-False-False-True] | 0.4503ms | 32.2575μs | 31.0005 KOps/s | 30.6966 KOps/s | +0.99% |
| test_step_mdp_speed[False-True-False-False-False] | 79.7810μs | 20.2208μs | 49.4540 KOps/s | 49.4621 KOps/s | -0.02% |
| test_step_mdp_speed[False-False-True-True-True] | 0.4720ms | 53.1469μs | 18.8158 KOps/s | 18.6794 KOps/s | +0.73% |
| test_step_mdp_speed[False-False-True-True-False] | 0.4486ms | 34.6921μs | 28.8250 KOps/s | 29.8381 KOps/s | -3.40% |
| test_step_mdp_speed[False-False-True-False-True] | 0.4503ms | 32.6992μs | 30.5818 KOps/s | 30.1242 KOps/s | +1.52% |
| test_step_mdp_speed[False-False-True-False-False] | 41.8810μs | 20.0713μs | 49.8224 KOps/s | 49.3617 KOps/s | +0.93% |
| test_step_mdp_speed[False-False-False-True-True] | 0.4755ms | 54.4680μs | 18.3594 KOps/s | 18.1921 KOps/s | +0.92% |
| test_step_mdp_speed[False-False-False-True-False] | 0.4477ms | 36.6029μs | 27.3203 KOps/s | 27.5305 KOps/s | -0.76% |
| test_step_mdp_speed[False-False-False-False-True] | 0.4515ms | 34.3341μs | 29.1255 KOps/s | 28.7724 KOps/s | +1.23% |
| test_step_mdp_speed[False-False-False-False-False] | 52.0410μs | 22.6347μs | 44.1800 KOps/s | 43.5621 KOps/s | +1.42% |
| test_non_tensor_env_rollout_speed[1000-single-True] | 0.7171s | 0.7136s | 1.4014 Ops/s | 1.3248 Ops/s | **+5.78%** |
| test_non_tensor_env_rollout_speed[1000-single-False] | 0.7023s | 0.6054s | 1.6518 Ops/s | 1.6131 Ops/s | +2.40% |
| test_non_tensor_env_rollout_speed[1000-serial-no-buffers-True] | 1.7134s | 1.6381s | 0.6105 Ops/s | 0.6033 Ops/s | +1.19% |
| test_non_tensor_env_rollout_speed[1000-serial-no-buffers-False] | 1.4892s | 1.4114s | 0.7085 Ops/s | 0.6981 Ops/s | +1.50% |
| test_non_tensor_env_rollout_speed[1000-serial-buffers-True] | 1.9556s | 1.8771s | 0.5327 Ops/s | 0.5240 Ops/s | +1.66% |
| test_non_tensor_env_rollout_speed[1000-serial-buffers-False] | 1.7373s | 1.6590s | 0.6028 Ops/s | 0.5944 Ops/s | +1.41% |
| test_non_tensor_env_rollout_speed[1000-parallel-no-buffers-True] | 4.7132s | 4.6468s | 0.2152 Ops/s | 0.2147 Ops/s | +0.25% |
| test_non_tensor_env_rollout_speed[1000-parallel-no-buffers-False] | 4.7018s | 4.4525s | 0.2246 Ops/s | 0.2234 Ops/s | +0.51% |
| test_non_tensor_env_rollout_speed[1000-parallel-buffers-True] | 1.9732s | 1.8933s | 0.5282 Ops/s | 0.5208 Ops/s | +1.41% |
| test_non_tensor_env_rollout_speed[1000-parallel-buffers-False] | 1.7078s | 1.6052s | 0.6230 Ops/s | 0.6214 Ops/s | +0.26% |
| test_values[generalized_advantage_estimate-True-True] | 22.4518ms | 21.9486ms | 45.5609 Ops/s | 46.9300 Ops/s | -2.92% |
| test_values[vec_generalized_advantage_estimate-True-True] | 0.1324s | 3.5685ms | 280.2307 Ops/s | 258.9641 Ops/s | **+8.21%** |
| test_values[td0_return_estimate-False-False] | 0.1155ms | 84.6289μs | 11.8163 KOps/s | 11.7560 KOps/s | +0.51% |
| test_values[td1_return_estimate-False-False] | 53.4028ms | 52.1658ms | 19.1696 Ops/s | 19.9124 Ops/s | -3.73% |
| test_values[vec_td1_return_estimate-False-False] | 1.3559ms | 1.0988ms | 910.0591 Ops/s | 903.3574 Ops/s | +0.74% |
| test_values[td_lambda_return_estimate-True-False] | 86.4269ms | 80.1401ms | 12.4781 Ops/s | 12.1033 Ops/s | +3.10% |
| test_values[vec_td_lambda_return_estimate-True-False] | 1.3093ms | 1.0780ms | 927.6393 Ops/s | 907.0933 Ops/s | +2.27% |
| test_gae_speed[generalized_advantage_estimate-False-1-512] | 20.7733ms | 20.4725ms | 48.8461 Ops/s | 46.6614 Ops/s | +4.68% |
| test_gae_speed[vec_generalized_advantage_estimate-True-1-512] | 1.0565ms | 0.7680ms | 1.3021 KOps/s | 1.2836 KOps/s | +1.44% |
| test_gae_speed[vec_generalized_advantage_estimate-False-1-512] | 0.7215ms | 0.6741ms | 1.4836 KOps/s | 1.4426 KOps/s | +2.84% |
| test_gae_speed[vec_generalized_advantage_estimate-True-32-512] | 1.5402ms | 1.4926ms | 669.9659 Ops/s | 661.7279 Ops/s | +1.24% |
| test_gae_speed[vec_generalized_advantage_estimate-False-32-512] | 0.7324ms | 0.6923ms | 1.4444 KOps/s | 1.3816 KOps/s | +4.54% |
| test_dqn_speed[False-None] | 1.6383ms | 1.5392ms | 649.7003 Ops/s | 640.3019 Ops/s | +1.47% |
| test_dqn_speed[False-backward] | 2.2396ms | 2.1738ms | 460.0342 Ops/s | 451.1104 Ops/s | +1.98% |
| test_dqn_speed[True-None] | 0.8625ms | 0.5534ms | 1.8072 KOps/s | 1.7639 KOps/s | +2.45% |
| test_dqn_speed[True-backward] | 1.2290ms | 1.2001ms | 833.2355 Ops/s | 817.1911 Ops/s | +1.96% |
| test_dqn_speed[reduce-overhead-None] | 0.6225ms | 0.5744ms | 1.7409 KOps/s | 1.6799 KOps/s | +3.63% |
| test_ddpg_speed[False-None] | 3.2747ms | 2.8850ms | 346.6260 Ops/s | 343.2619 Ops/s | +0.98% |
| test_ddpg_speed[False-backward] | 4.6998ms | 4.2831ms | 233.4764 Ops/s | 231.3746 Ops/s | +0.91% |
| test_ddpg_speed[True-None] | 1.3841ms | 1.3030ms | 767.4717 Ops/s | 756.5933 Ops/s | +1.44% |
| test_ddpg_speed[True-backward] | 2.5427ms | 2.4971ms | 400.4602 Ops/s | 389.8693 Ops/s | +2.72% |
| test_ddpg_speed[reduce-overhead-None] | 1.4215ms | 1.3342ms | 749.5398 Ops/s | 741.0665 Ops/s | +1.14% |
| test_sac_speed[False-None] | 9.0175ms | 8.3756ms | 119.3939 Ops/s | 117.6073 Ops/s | +1.52% |
| test_sac_speed[False-backward] | 12.7724ms | 11.6970ms | 85.4918 Ops/s | 84.6934 Ops/s | +0.94% |
| test_sac_speed[True-None] | 1.8849ms | 1.7863ms | 559.8267 Ops/s | 547.5621 Ops/s | +2.24% |
test_sac_speed[True-backward] 3.6020ms 3.5428ms 282.2662 Ops/s 275.1622 Ops/s $\color{#35bf28}+2.58\%$
test_sac_speed[reduce-overhead-None] 19.5920ms 11.0757ms 90.2874 Ops/s 82.8869 Ops/s $\textbf{\color{#35bf28}+8.93\%}$
test_redq_deprec_speed[False-None] 10.0113ms 9.3259ms 107.2287 Ops/s 105.6522 Ops/s $\color{#35bf28}+1.49\%$
test_redq_deprec_speed[False-backward] 13.1821ms 12.7946ms 78.1580 Ops/s 77.5644 Ops/s $\color{#35bf28}+0.77\%$
test_redq_deprec_speed[True-None] 2.7313ms 2.5007ms 399.8870 Ops/s 388.3130 Ops/s $\color{#35bf28}+2.98\%$
test_redq_deprec_speed[True-backward] 4.3515ms 4.2497ms 235.3115 Ops/s 228.1679 Ops/s $\color{#35bf28}+3.13\%$
test_redq_deprec_speed[reduce-overhead-None] 16.3049ms 9.9188ms 100.8184 Ops/s 101.0334 Ops/s $\color{#d91a1a}-0.21\%$
test_td3_speed[False-None] 8.3562ms 8.2088ms 121.8212 Ops/s 120.4402 Ops/s $\color{#35bf28}+1.15\%$
test_td3_speed[False-backward] 11.2911ms 10.8449ms 92.2094 Ops/s 91.2542 Ops/s $\color{#35bf28}+1.05\%$
test_td3_speed[True-None] 1.7134ms 1.6590ms 602.7742 Ops/s 615.0788 Ops/s $\color{#d91a1a}-2.00\%$
test_td3_speed[True-backward] 3.2528ms 3.1971ms 312.7852 Ops/s 302.8144 Ops/s $\color{#35bf28}+3.29\%$
test_td3_speed[reduce-overhead-None] 47.9191ms 24.5874ms 40.6713 Ops/s 40.1845 Ops/s $\color{#35bf28}+1.21\%$
test_cql_speed[False-None] 17.8126ms 17.3082ms 57.7762 Ops/s 56.9336 Ops/s $\color{#35bf28}+1.48\%$
test_cql_speed[False-backward] 23.4146ms 22.9417ms 43.5887 Ops/s 42.9751 Ops/s $\color{#35bf28}+1.43\%$
test_cql_speed[True-None] 3.3736ms 3.2022ms 312.2806 Ops/s 307.4202 Ops/s $\color{#35bf28}+1.58\%$
test_cql_speed[True-backward] 6.0816ms 5.4807ms 182.4597 Ops/s 179.3940 Ops/s $\color{#35bf28}+1.71\%$
test_cql_speed[reduce-overhead-None] 18.6150ms 11.5423ms 86.6381 Ops/s 83.5218 Ops/s $\color{#35bf28}+3.73\%$
test_a2c_speed[False-None] 3.4158ms 3.2463ms 308.0463 Ops/s 302.1774 Ops/s $\color{#35bf28}+1.94\%$
test_a2c_speed[False-backward] 6.8582ms 6.4498ms 155.0446 Ops/s 152.3943 Ops/s $\color{#35bf28}+1.74\%$
test_a2c_speed[True-None] 1.4204ms 1.3184ms 758.5070 Ops/s 755.4022 Ops/s $\color{#35bf28}+0.41\%$
test_a2c_speed[True-backward] 3.1791ms 3.0826ms 324.4028 Ops/s 336.5985 Ops/s $\color{#d91a1a}-3.62\%$
test_a2c_speed[reduce-overhead-None] 1.0317ms 0.9689ms 1.0321 KOps/s 1.0362 KOps/s $\color{#d91a1a}-0.39\%$
test_ppo_speed[False-None] 4.0440ms 3.8649ms 258.7370 Ops/s 254.0896 Ops/s $\color{#35bf28}+1.83\%$
test_ppo_speed[False-backward] 7.7965ms 7.2486ms 137.9574 Ops/s 138.3706 Ops/s $\color{#d91a1a}-0.30\%$
test_ppo_speed[True-None] 1.4672ms 1.4124ms 708.0253 Ops/s 707.7801 Ops/s $\color{#35bf28}+0.03\%$
test_ppo_speed[True-backward] 3.2691ms 3.2242ms 310.1587 Ops/s 302.5475 Ops/s $\color{#35bf28}+2.52\%$
test_ppo_speed[reduce-overhead-None] 1.1306ms 1.0452ms 956.7391 Ops/s 937.7425 Ops/s $\color{#35bf28}+2.03\%$
test_reinforce_speed[False-None] 2.3952ms 2.2936ms 435.9864 Ops/s 428.9260 Ops/s $\color{#35bf28}+1.65\%$
test_reinforce_speed[False-backward] 3.5554ms 3.4428ms 290.4577 Ops/s 285.3651 Ops/s $\color{#35bf28}+1.78\%$
test_reinforce_speed[True-None] 1.3218ms 1.2629ms 791.8421 Ops/s 789.6926 Ops/s $\color{#35bf28}+0.27\%$
test_reinforce_speed[True-backward] 3.0923ms 3.0298ms 330.0579 Ops/s 324.0054 Ops/s $\color{#35bf28}+1.87\%$
test_reinforce_speed[reduce-overhead-None] 17.4619ms 9.5924ms 104.2493 Ops/s 105.0339 Ops/s $\color{#d91a1a}-0.75\%$
test_iql_speed[False-None] 10.0297ms 9.4011ms 106.3700 Ops/s 104.0996 Ops/s $\color{#35bf28}+2.18\%$
test_iql_speed[False-backward] 13.8378ms 13.4444ms 74.3806 Ops/s 73.1800 Ops/s $\color{#35bf28}+1.64\%$
test_iql_speed[True-None] 2.2319ms 2.1366ms 468.0327 Ops/s 460.1777 Ops/s $\color{#35bf28}+1.71\%$
test_iql_speed[True-backward] 5.2928ms 4.8152ms 207.6756 Ops/s 205.4794 Ops/s $\color{#35bf28}+1.07\%$
test_iql_speed[reduce-overhead-None] 18.2853ms 10.5794ms 94.5237 Ops/s 94.9783 Ops/s $\color{#d91a1a}-0.48\%$
test_rb_sample[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] 6.3767ms 5.9856ms 167.0680 Ops/s 165.0444 Ops/s $\color{#35bf28}+1.23\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 0.9293ms 0.3454ms 2.8951 KOps/s 3.4881 KOps/s $\textbf{\color{#d91a1a}-17.00\%}$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 0.5842ms 0.3493ms 2.8629 KOps/s 2.6595 KOps/s $\textbf{\color{#35bf28}+7.65\%}$
test_rb_sample[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 5.9834ms 5.7711ms 173.2773 Ops/s 169.0471 Ops/s $\color{#35bf28}+2.50\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 2.6833ms 0.3636ms 2.7504 KOps/s 3.0647 KOps/s $\textbf{\color{#d91a1a}-10.26\%}$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 0.5559ms 0.3427ms 2.9183 KOps/s 3.5646 KOps/s $\textbf{\color{#d91a1a}-18.13\%}$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-sampler6-10000] 1.7128ms 1.4388ms 695.0211 Ops/s 710.4399 Ops/s $\color{#d91a1a}-2.17\%$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-sampler7-10000] 1.5499ms 1.3558ms 737.5958 Ops/s 741.4830 Ops/s $\color{#d91a1a}-0.52\%$
test_rb_sample[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] 6.1874ms 6.0634ms 164.9240 Ops/s 164.0442 Ops/s $\color{#35bf28}+0.54\%$
test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 0.7188ms 0.4674ms 2.1396 KOps/s 2.1865 KOps/s $\color{#d91a1a}-2.15\%$
test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 0.8005ms 0.4467ms 2.2388 KOps/s 1.9831 KOps/s $\textbf{\color{#35bf28}+12.89\%}$
test_rb_iterate[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] 5.9757ms 5.8867ms 169.8739 Ops/s 168.3239 Ops/s $\color{#35bf28}+0.92\%$
test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 1.8550ms 0.3234ms 3.0918 KOps/s 2.8141 KOps/s $\textbf{\color{#35bf28}+9.87\%}$
test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 0.7121ms 0.3168ms 3.1569 KOps/s 2.9328 KOps/s $\textbf{\color{#35bf28}+7.64\%}$
test_rb_iterate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 6.1685ms 5.8222ms 171.7550 Ops/s 169.3093 Ops/s $\color{#35bf28}+1.44\%$
test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 0.5924ms 0.2820ms 3.5463 KOps/s 3.1188 KOps/s $\textbf{\color{#35bf28}+13.71\%}$
test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 0.5415ms 0.3427ms 2.9182 KOps/s 3.4584 KOps/s $\textbf{\color{#d91a1a}-15.62\%}$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] 6.1823ms 6.0363ms 165.6653 Ops/s 164.8446 Ops/s $\color{#35bf28}+0.50\%$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 0.8179ms 0.4648ms 2.1514 KOps/s 1.9169 KOps/s $\textbf{\color{#35bf28}+12.23\%}$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 0.7939ms 0.4591ms 2.1779 KOps/s 2.1799 KOps/s $\color{#d91a1a}-0.09\%$
test_rb_populate[TensorDictReplayBuffer-ListStorage-RandomSampler-400] 0.5876s 16.7588ms 59.6703 Ops/s 51.6828 Ops/s $\textbf{\color{#35bf28}+15.45\%}$
test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-400] 11.5998ms 2.0883ms 478.8482 Ops/s 533.0775 Ops/s $\textbf{\color{#d91a1a}-10.17\%}$
test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400] 7.2839ms 1.2858ms 777.6997 Ops/s 1.0281 KOps/s $\textbf{\color{#d91a1a}-24.35\%}$
test_rb_populate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-400] 6.5718ms 5.0663ms 197.3824 Ops/s 194.0368 Ops/s $\color{#35bf28}+1.72\%$
test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-400] 13.2261ms 2.0781ms 481.2059 Ops/s 488.6475 Ops/s $\color{#d91a1a}-1.52\%$
test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-400] 2.9819ms 1.2227ms 817.8717 Ops/s 1.0066 KOps/s $\textbf{\color{#d91a1a}-18.75\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-ListStorage-None-400] 0.5334s 15.9182ms 62.8212 Ops/s 185.4838 Ops/s $\textbf{\color{#d91a1a}-66.13\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400] 4.3350ms 2.1301ms 469.4646 Ops/s 511.6881 Ops/s $\textbf{\color{#d91a1a}-8.25\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400] 12.0203ms 1.5718ms 636.1970 Ops/s 816.4758 Ops/s $\textbf{\color{#d91a1a}-22.08\%}$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-True] 38.0802ms 35.9684ms 27.8022 Ops/s 26.8908 Ops/s $\color{#35bf28}+3.39\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-False] 19.8673ms 18.4138ms 54.3072 Ops/s 52.6068 Ops/s $\color{#35bf28}+3.23\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-100000-10000-100-True] 40.3254ms 37.5190ms 26.6532 Ops/s 26.4449 Ops/s $\color{#35bf28}+0.79\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-100000-10000-100-False] 20.5473ms 18.6411ms 53.6449 Ops/s 52.7065 Ops/s $\color{#35bf28}+1.78\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-1000000-10000-100-True] 48.4334ms 39.7921ms 25.1306 Ops/s 24.8962 Ops/s $\color{#35bf28}+0.94\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-1000000-10000-100-False] 21.5745ms 20.1939ms 49.5199 Ops/s 48.3164 Ops/s $\color{#35bf28}+2.49\%$
test_storage_write_lazystack[50-img_shape0-small] 0.8834ms 0.2254ms 4.4361 KOps/s 4.3112 KOps/s $\color{#35bf28}+2.90\%$
test_storage_write_lazystack[100-img_shape1-atari] 1.6962ms 1.4343ms 697.2209 Ops/s 685.4818 Ops/s $\color{#35bf28}+1.71\%$
test_storage_write_lazystack[100-img_shape2-large_img] 2.5838ms 2.3706ms 421.8290 Ops/s 433.0922 Ops/s $\color{#d91a1a}-2.60\%$
test_storage_write_lazystack[200-img_shape3-large_batch] 3.2127ms 2.9811ms 335.4433 Ops/s 333.8206 Ops/s $\color{#35bf28}+0.49\%$
test_storage_write_contiguous[50-img_shape0-small] 0.2663ms 0.1667ms 5.9983 KOps/s 5.8212 KOps/s $\color{#35bf28}+3.04\%$
test_storage_write_contiguous[100-img_shape1-atari] 0.3576ms 0.2306ms 4.3374 KOps/s 4.0581 KOps/s $\textbf{\color{#35bf28}+6.88\%}$
test_storage_write_contiguous[100-img_shape2-large_img] 2.2652ms 1.8695ms 534.8937 Ops/s 561.4589 Ops/s $\color{#d91a1a}-4.73\%$
test_storage_write_contiguous[200-img_shape3-large_batch] 1.6364ms 1.4039ms 712.2965 Ops/s 708.9309 Ops/s $\color{#35bf28}+0.47\%$
test_collector_stack_then_write[50-img_shape0-small] 1.3249ms 1.1621ms 860.4784 Ops/s 852.1998 Ops/s $\color{#35bf28}+0.97\%$
test_collector_stack_then_write[100-img_shape1-atari] 3.9038ms 3.6513ms 273.8715 Ops/s 268.7875 Ops/s $\color{#35bf28}+1.89\%$
test_collector_stack_then_write[100-img_shape2-large_img] 10.7455ms 5.9690ms 167.5328 Ops/s 170.6738 Ops/s $\color{#d91a1a}-1.84\%$
test_collector_stack_then_write[200-img_shape3-large_batch] 7.6356ms 7.4494ms 134.2386 Ops/s 139.0672 Ops/s $\color{#d91a1a}-3.47\%$
test_collector_lazystack_then_write[50-img_shape0-small] 0.4338ms 0.2743ms 3.6458 KOps/s 3.5904 KOps/s $\color{#35bf28}+1.54\%$
test_collector_lazystack_then_write[100-img_shape1-atari] 1.7576ms 1.5450ms 647.2667 Ops/s 642.8983 Ops/s $\color{#35bf28}+0.68\%$
test_collector_lazystack_then_write[100-img_shape2-large_img] 2.8958ms 2.4741ms 404.1807 Ops/s 410.0503 Ops/s $\color{#d91a1a}-1.43\%$
test_collector_lazystack_then_write[200-img_shape3-large_batch] 3.4939ms 3.1753ms 314.9268 Ops/s 315.0125 Ops/s $\color{#d91a1a}-0.03\%$
test_collector_without_rb[100-img_shape0-atari] 33.7506ms 33.2481ms 30.0769 Ops/s 29.3722 Ops/s $\color{#35bf28}+2.40\%$
test_collector_without_rb[200-img_shape1-large_batch] 67.2205ms 65.3301ms 15.3069 Ops/s 14.9779 Ops/s $\color{#35bf28}+2.20\%$
test_collector_with_rb[100-img_shape0-atari] 38.2399ms 37.2759ms 26.8270 Ops/s 26.1363 Ops/s $\color{#35bf28}+2.64\%$
test_collector_with_rb[200-img_shape1-large_batch] 73.3180ms 72.8512ms 13.7266 Ops/s 13.2485 Ops/s $\color{#35bf28}+3.61\%$
test_collector_without_rb_cuda[100-img_shape0-atari] 56.1165ms 55.1131ms 18.1445 Ops/s 17.6635 Ops/s $\color{#35bf28}+2.72\%$
test_collector_without_rb_cuda[200-img_shape1-large_batch] 0.1099s 0.1095s 9.1296 Ops/s 8.7967 Ops/s $\color{#35bf28}+3.78\%$
test_collector_with_rb_cuda[100-img_shape0-atari] 57.3732ms 56.9226ms 17.5677 Ops/s 16.9302 Ops/s $\color{#35bf28}+3.77\%$
test_collector_with_rb_cuda[200-img_shape1-large_batch] 0.7793s 0.1877s 5.3280 Ops/s 8.5050 Ops/s $\textbf{\color{#d91a1a}-37.35\%}$
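The last column of each row appears to be the relative change between the two throughput columns. The sketch below recomputes that change from a flattened row. It assumes the column order is name, max time, mean time, mean throughput, reference throughput, change. It also assumes the formula `(mean / reference - 1) * 100`. Both assumptions are inferred from the rows, not stated on this page. `check_row` and `pct_change` are hypothetical helpers. Some rows mix `Ops/s` and `KOps/s`, so the code normalises units before dividing.

```python
import re

# Scale factors for the throughput units that occur in the rows.
UNIT_SCALE = {"Ops/s": 1.0, "KOps/s": 1e3}

# A throughput cell such as "1.6518 Ops/s" or "1.0281 KOps/s".
THROUGHPUT_RE = re.compile(r"([0-9.]+) (KOps/s|Ops/s)")
# The reported change inside the LaTeX cell, e.g. "+2.40\%" or "-66.13\%".
CHANGE_RE = re.compile(r"([+-][0-9.]+)\\%")


def pct_change(new_ops: float, ref_ops: float) -> float:
    """Relative throughput change in percent; positive means faster than reference."""
    return (new_ops / ref_ops - 1.0) * 100.0


def check_row(row: str) -> tuple[float, float]:
    """Return (recomputed_change, reported_change) for one flattened row.

    Assumes the first throughput cell is this run and the second is the reference.
    """
    (new_val, new_unit), (ref_val, ref_unit) = THROUGHPUT_RE.findall(row)
    new_ops = float(new_val) * UNIT_SCALE[new_unit]  # normalise to Ops/s
    ref_ops = float(ref_val) * UNIT_SCALE[ref_unit]
    reported = float(CHANGE_RE.search(row).group(1))
    return pct_change(new_ops, ref_ops), reported


# A row that mixes Ops/s and KOps/s.
row = r"test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400] 7.2839ms 1.2858ms 777.6997 Ops/s 1.0281 KOps/s $\textbf{\color{#d91a1a}-24.35\%}$"
recomputed, reported = check_row(row)
print(f"recomputed {recomputed:.2f}%, reported {reported:.2f}%")
```

The recomputed and reported values can differ in the last digit (here -24.36 against -24.35). This happens because the throughputs shown in the table are already rounded.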

[ghstack-poisoned]
vmoens added a commit that referenced this pull request Feb 12, 2026
…on mode

Rewrite the AsyncBatchedCollector to use a coordinator thread that
pipelines env stepping and batched inference without a global sync
barrier. Add a `direct=True` mode where each env thread submits
directly to the InferenceServer, eliminating the coordinator thread
and its serialization overhead.

Benchmark results (8 mock pixel envs, Nature-CNN, CPU):
  AsyncBatchedCollector direct:    3183 fps (+72% vs coordinator)
  AsyncBatchedCollector threading: 1850 fps (coordinator mode)
  AsyncBatchedCollector mp:        1042 fps (coordinator mode)

Co-authored-by: Cursor <cursoragent@cursor.com>
ghstack-source-id: c4d370a
Pull-Request: #3499
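The commit message describes a coordinator pattern. Below is a minimal sketch of that pattern that uses only plain Python threads and no framework. `toy_policy`, `run_coordinator` and the integer "observations" are hypothetical stand-ins. They are not TorchRL's `AsyncBatchedCollector` code or API. The sketch illustrates two points. First, the coordinator batches whichever envs are ready instead of waiting for all of them, so there is no global barrier. Second, envs that have received their actions go back to stepping while the next batch is assembled.

```python
import queue
import threading


def toy_policy(obs_batch):
    """Hypothetical batched policy: one call maps a list of observations to actions."""
    return [obs % 2 for obs in obs_batch]


def run_coordinator(num_envs=4, steps_per_env=5, max_batch=4):
    ready = queue.Queue()                                 # (env_id, obs) requests
    action_qs = [queue.Queue() for _ in range(num_envs)]  # one reply mailbox per env
    transitions, batch_sizes = [], []
    lock = threading.Lock()

    def env_worker(env_id):
        obs = env_id * 1000                   # toy reset()
        for t in range(steps_per_env):
            ready.put((env_id, obs))          # hand the observation to the coordinator
            action = action_qs[env_id].get()  # wait only for this env's action
            with lock:
                transitions.append((env_id, obs, action))
            obs = env_id * 1000 + t + 1 + action  # toy step()

    def coordinator():
        remaining = num_envs * steps_per_env
        while remaining:
            # Block for one request, then grab whatever else is already waiting.
            # No barrier: a batch never waits for slow envs to catch up.
            requests = [ready.get()]
            while len(requests) < max_batch:
                try:
                    requests.append(ready.get_nowait())
                except queue.Empty:
                    break
            actions = toy_policy([obs for _, obs in requests])
            batch_sizes.append(len(requests))
            for (env_id, _), action in zip(requests, actions):
                action_qs[env_id].put(action)  # that env resumes stepping right away
            remaining -= len(requests)

    workers = [threading.Thread(target=env_worker, args=(i,)) for i in range(num_envs)]
    coord = threading.Thread(target=coordinator)
    for th in [coord, *workers]:
        th.start()
    for th in [*workers, coord]:
        th.join()
    return transitions, batch_sizes


transitions, batch_sizes = run_coordinator()
```

With 4 envs and 5 steps each, the contents of `batch_sizes` depend on thread scheduling and vary from run to run. Only their sum (20) is fixed.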
[ghstack-poisoned]
vmoens added a commit that referenced this pull request Feb 13, 2026
…on mode

Rewrite the AsyncBatchedCollector to use a coordinator thread that
pipelines env stepping and batched inference without a global sync
barrier. Add a `direct=True` mode where each env thread submits
directly to the InferenceServer, eliminating the coordinator thread
and its serialization overhead.

Benchmark results (8 mock pixel envs, Nature-CNN, CPU):
  AsyncBatchedCollector direct:    3183 fps (+72% vs coordinator)
  AsyncBatchedCollector threading: 1850 fps (coordinator mode)
  AsyncBatchedCollector mp:        1042 fps (coordinator mode)

Co-authored-by: Cursor <cursoragent@cursor.com>
ghstack-source-id: 784abfc
Pull-Request: #3499
Co-authored-by: Cursor <cursoragent@cursor.com>
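This is a sketch of the `direct=True` idea from the commit message. Each env thread submits its observation to a batching server itself and then waits on the result, so no coordinator thread relays requests. `ToyInferenceServer`, `submit()` and `run_direct` are hypothetical names, and TorchRL's actual `InferenceServer` interface may differ. `concurrent.futures.Future` serves here only as a one-shot result slot.

```python
import queue
import threading
from concurrent.futures import Future


class ToyInferenceServer:
    """Hypothetical batching server; env threads call submit() directly."""

    _STOP = object()  # sentinel that ends the serving loop

    def __init__(self, policy, max_batch=8):
        self.policy = policy
        self.max_batch = max_batch
        self.requests = queue.Queue()
        self.batch_sizes = []
        self._thread = threading.Thread(target=self._serve, daemon=True)
        self._thread.start()

    def submit(self, obs):
        fut = Future()  # the server fills this with the action
        self.requests.put((obs, fut))
        return fut

    def _serve(self):
        while True:
            item = self.requests.get()
            if item is self._STOP:
                return
            batch = [item]
            # Add any other requests that are already queued, up to max_batch.
            while len(batch) < self.max_batch:
                try:
                    item = self.requests.get_nowait()
                except queue.Empty:
                    break
                if item is self._STOP:
                    self.requests.put(item)  # finish this batch, stop on the next pass
                    break
                batch.append(item)
            try:
                actions = self.policy([obs for obs, _ in batch])
            except Exception as exc:  # never leave a submitter blocked forever
                for _, fut in batch:
                    fut.set_exception(exc)
                continue
            self.batch_sizes.append(len(batch))
            for (_, fut), action in zip(batch, actions):
                fut.set_result(action)

    def close(self):
        self.requests.put(self._STOP)
        self._thread.join()


def run_direct(num_envs=4, steps_per_env=5):
    server = ToyInferenceServer(lambda batch: [obs % 2 for obs in batch], max_batch=num_envs)
    transitions = []
    lock = threading.Lock()

    def env_worker(env_id):
        obs = env_id * 1000                       # toy reset()
        for t in range(steps_per_env):
            action = server.submit(obs).result()  # straight to the server, no relay thread
            with lock:
                transitions.append((env_id, obs, action))
            obs = env_id * 1000 + t + 1 + action  # toy step()

    workers = [threading.Thread(target=env_worker, args=(i,)) for i in range(num_envs)]
    for th in workers:
        th.start()
    for th in workers:
        th.join()
    server.close()
    return transitions, server.batch_sizes


transitions, batch_sizes = run_direct()
```

The server still batches concurrent requests. What disappears is a single intermediate thread that every observation and action must pass through in turn. That is one plausible reading of the "serialization overhead" mentioned in the commit message, not a description of TorchRL's internals.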
[ghstack-poisoned]
vmoens added a commit that referenced this pull request Feb 14, 2026
…on mode

Rewrite the AsyncBatchedCollector to use a coordinator thread that
pipelines env stepping and batched inference without a global sync
barrier. Add a `direct=True` mode where each env thread submits
directly to the InferenceServer, eliminating the coordinator thread
and its serialization overhead.

Benchmark results (8 mock pixel envs, Nature-CNN, CPU):
  AsyncBatchedCollector direct:    3183 fps (+72% vs coordinator)
  AsyncBatchedCollector threading: 1850 fps (coordinator mode)
  AsyncBatchedCollector mp:        1042 fps (coordinator mode)

Co-authored-by: Cursor <cursoragent@cursor.com>
ghstack-source-id: 270d3e3
Pull-Request: #3499
Co-authored-by: Cursor <cursoragent@cursor.com>
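The "+72% vs coordinator" figure matches the ratio of the direct and threading (coordinator) throughputs quoted in the commit message. The check below uses only those quoted numbers. It assumes "+72%" means relative throughput gain, `(new / reference - 1) * 100`. By the same measure, direct mode reaches about 3.05 times the mp coordinator throughput.

```python
# Frames per second quoted in the commit message (8 mock pixel envs, Nature-CNN, CPU).
fps = {"direct": 3183, "threading": 1850, "mp": 1042}


def gain_pct(new: float, ref: float) -> float:
    """Throughput gain of `new` over `ref`, in percent."""
    return (new / ref - 1.0) * 100.0


direct_vs_coordinator = gain_pct(fps["direct"], fps["threading"])  # about 72.05
direct_vs_mp = gain_pct(fps["direct"], fps["mp"])                  # about 205.47
print(f"direct vs threading coordinator: +{direct_vs_coordinator:.0f}%")
# prints: direct vs threading coordinator: +72%
```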
[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
vmoens added a commit that referenced this pull request Feb 21, 2026
…on mode (#3499)

Rewrite the AsyncBatchedCollector to use a coordinator thread that
pipelines env stepping and batched inference without a global sync
barrier. Add a `direct=True` mode where each env thread submits
directly to the InferenceServer, eliminating the coordinator thread
and its serialization overhead.

Benchmark results (8 mock pixel envs, Nature-CNN, CPU):
  AsyncBatchedCollector direct:    3183 fps (+72% vs coordinator)
  AsyncBatchedCollector threading: 1850 fps (coordinator mode)
  AsyncBatchedCollector mp:        1042 fps (coordinator mode)

Co-authored-by: Cursor <cursoragent@cursor.com>
ghstack-source-id: 861a24d
Pull-Request: #3499
Co-authored-by: Cursor <cursoragent@cursor.com>
@vmoens vmoens merged commit 7cd1f9a into gh/vmoens/241/base Feb 21, 2026
111 of 116 checks passed
@vmoens vmoens deleted the gh/vmoens/241/head branch February 21, 2026 21:12

Labels

Benchmarks, CLA Signed, Collectors, Examples, Feature
