
[Diffusion] Add PCG support for diffusion models#19828

Closed
BBuf wants to merge 11 commits into main from try_to_add_pcg

Conversation

@BBuf
Collaborator

@BBuf BBuf commented Mar 4, 2026

Created by Codex (gpt5.3-codex-high, about $2 cost).

main with torch compile:

sglang generate --model-path=black-forest-labs/FLUX.1-dev  --prompt="A futuristic cyberpunk city at night, neon lights reflecting on wet streets, highly detailed, 8k" --width=1024 --height=1024 --num-inference-steps=50 --guidance-scale=4.0 --seed=42 --save-output  --warmup=True --enable-torch-compile

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:06<00:00,  7.25it/s]
[03-04 03:52:28] [DenoisingStage] average time per step: 0.1379 seconds
[03-04 03:52:28] [DenoisingStage] finished in 6.8999 seconds
[03-04 03:52:28] [DecodingStage] started...
[03-04 03:52:28] [DecodingStage] finished in 0.0307 seconds
[03-04 03:52:28] Peak GPU memory: 31.51 GB, Peak allocated: 27.30 GB, Memory pool overhead: 4.21 GB (13.4%), Remaining GPU memory at peak: 108.89 GB. Components that could stay resident (based on the last request workload): ['text_encoder', 'text_encoder_2', 'transformer']. Related offload server args to disable: --dit-cpu-offload, --text-encoder-cpu-offload
[03-04 03:52:29] Output saved to outputs/A_futuristic_cyberpunk_city_at_night_neon_lights_reflecting_on_wet_streets_highly_detailed_8k_20260304-035129_e4d47a2b.png
[03-04 03:52:29] Pixel data generated successfully in 59.57 seconds
[03-04 03:52:29] Completed batch processing. Generated 1 outputs in 59.57 seconds
[03-04 03:52:29] Warmed-up request processed in 7.14 seconds (with warmup excluded)
[03-04 03:52:29] Memory usage - Max peak: 32268.00 MB, Avg peak: 32268.00 MB

main without torch compile:

sglang generate --model-path=black-forest-labs/FLUX.1-dev  --prompt="A futuristic cyberpunk city at night, neon lights reflecting on wet streets, highly detailed, 8k" --width=1024 --height=1024 --num-inference-steps=50 --guidance-scale=4.0 --seed=42 --save-output  --warmup=True 

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:07<00:00,  6.69it/s]
[03-04 04:01:00] [DenoisingStage] average time per step: 0.1495 seconds
[03-04 04:01:00] [DenoisingStage] finished in 7.4808 seconds
[03-04 04:01:00] [DecodingStage] started...
[03-04 04:01:00] [DecodingStage] finished in 0.0339 seconds
[03-04 04:01:00] Peak GPU memory: 31.29 GB, Peak allocated: 27.30 GB, Memory pool overhead: 3.99 GB (12.8%), Remaining GPU memory at peak: 109.11 GB. Components that could stay resident (based on the last request workload): ['text_encoder', 'text_encoder_2', 'transformer']. Related offload server args to disable: --dit-cpu-offload, --text-encoder-cpu-offload
[03-04 04:01:00] Output saved to outputs/A_futuristic_cyberpunk_city_at_night_neon_lights_reflecting_on_wet_streets_highly_detailed_8k_20260304-040043_50a75896.png
[03-04 04:01:00] Pixel data generated successfully in 17.61 seconds
[03-04 04:01:00] Completed batch processing. Generated 1 outputs in 17.61 seconds
[03-04 04:01:00] Warmed-up request processed in 7.73 seconds (with warmup excluded)
[03-04 04:01:00] Memory usage - Max peak: 32044.00 MB, Avg peak: 32044.00 MB

pr with PCG:

sglang generate --model-path=black-forest-labs/FLUX.1-dev  --prompt="A futuristic cyberpunk city at night, neon lights reflecting on wet streets, highly detailed, 8k" --width=1024 --height=1024 --num-inference-steps=50 --guidance-scale=4.0 --seed=42 --save-output --enable-piecewise-cuda-graph

[03-04 04:03:51] [InputValidationStage] started...
[03-04 04:03:51] [InputValidationStage] finished in 0.0001 seconds
[03-04 04:03:51] [TextEncodingStage] started...
[03-04 04:03:51] [TextEncodingStage] finished in 0.4045 seconds
[03-04 04:03:52] [TimestepPreparationStage] started...
[03-04 04:03:52] [TimestepPreparationStage] finished in 0.0050 seconds
[03-04 04:03:52] [LatentPreparationStage] started...
[03-04 04:03:52] [LatentPreparationStage] finished in 0.0016 seconds
[03-04 04:03:52] [DenoisingStage] started...
[03-04 04:03:52] Pre-capturing diffusion PCG before denoising loop (target_models=1)
[03-04 04:03:59] Enable diffusion PCG for FluxTransformer2DModel with 58 capture buckets (max=8192)
[03-04 04:03:59] install_torch_compiled
[03-04 04:03:59] Diffusion PCG init for FluxTransformer2DModel (raw_seq=4096, static_seq=4096)
[03-04 04:04:05] Initializing SGLangBackend
[03-04 04:04:05] SGLangBackend __call__
[03-04 04:04:07] Compiling a graph for dynamic shape takes 0.00 s
[03-04 04:04:07] Computation graph saved to /root/.cache/sglang/torch_compile_cache/rank_0_0/backbone/computation_graph_1772597047.1930342.py
[03-04 04:04:10] Pre-capture finished for 1 model(s) before formal denoising
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:07<00:00,  6.66it/s]
[03-04 04:04:18] [DenoisingStage] average time per step: 0.1521 seconds
[03-04 04:04:18] [DenoisingStage] finished in 25.9847 seconds
[03-04 04:04:18] [DecodingStage] started...
[03-04 04:04:18] [DecodingStage] finished in 0.4051 seconds
[03-04 04:04:18] Peak GPU memory: 32.10 GB, Peak allocated: 27.33 GB, Memory pool overhead: 4.77 GB (14.9%), Remaining GPU memory at peak: 108.30 GB. Components that could stay resident (based on the last request workload): ['text_encoder', 'text_encoder_2', 'transformer']. Related offload server args to disable: --dit-cpu-offload, --text-encoder-cpu-offload
[03-04 04:04:18] Output saved to outputs/A_futuristic_cyberpunk_city_at_night_neon_lights_reflecting_on_wet_streets_highly_detailed_8k_20260304-040351_d677fd0f.png
[03-04 04:04:18] Pixel data generated successfully in 27.39 seconds
[03-04 04:04:18] Completed batch processing. Generated 1 outputs in 27.39 seconds
[03-04 04:04:18] Memory usage - Max peak: 32874.00 MB, Avg peak: 32874.00 MB
[03-04 04:04:18] Generator was garbage collected without being shut down. Attempting to shut down the local server and client.
[03-04 04:04:26] Worker 0: Shutdown complete.

Average step time: 0.1495 s (main, eager) vs 0.1521 s (this PR with PCG); main with torch compile reaches 0.1379 s.
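For context, a quick back-of-envelope comparison of the three runs (the numbers are copied from the logs above; this is just arithmetic over the reported timings, not a claim about the implementation):

```python
# Per-step times for the three runs above (seconds per step, 50 steps each).
compile_step = 6.8999 / 50   # main + --enable-torch-compile
eager_step   = 7.4808 / 50   # main, eager
pcg_step     = 0.1521        # this PR, steady-state step time with PCG

# PCG is currently slightly slower per step than eager, and behind torch.compile.
slowdown_vs_eager = (pcg_step - eager_step) / eager_step * 100

# The PCG DenoisingStage total (25.9847 s) also includes one-time
# capture/compile work before the 50-step loop; the steady-state loop
# itself accounts for only about 50 * 0.1521 ~= 7.6 s of that total.
capture_overhead = 25.9847 - 50 * pcg_step

print(f"compile: {compile_step:.4f} s/step")
print(f"eager:   {eager_step:.4f} s/step")
print(f"pcg:     {pcg_step:.4f} s/step ({slowdown_vs_eager:+.1f}% vs eager)")
print(f"one-time PCG capture overhead ~= {capture_overhead:.1f} s")
```

So the steady-state PCG step is within about 2% of eager, and the bulk of the wall-clock regression in this run is the one-time capture before the loop.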

PCG does not show a speedup yet; we need more profiling and development, so this PR is converted to a draft for now.
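The "58 capture buckets (max=8192)" log line suggests runtime sequence lengths are padded up to a fixed set of pre-captured sizes so a captured graph can be replayed instead of re-captured. A hypothetical sketch of that bucketing follows; the bucket schedule and function names here are made up for illustration and are not sglang's actual schedule:

```python
# Hypothetical sketch of CUDA-graph capture bucketing: pad a runtime
# sequence length up to the nearest pre-captured bucket. The schedule
# below (fine steps for small sizes, coarse for large) is an assumption
# for illustration only; the PR's real schedule has 58 buckets, max=8192.
import bisect


def make_buckets(max_seq: int) -> list[int]:
    """Build a sorted list of capture sizes up to max_seq."""
    return list(range(16, 256, 16)) + list(range(256, max_seq + 1, 256))


def pad_to_bucket(seq_len: int, buckets: list[int]) -> int:
    """Return the smallest captured bucket that fits seq_len."""
    i = bisect.bisect_left(buckets, seq_len)
    if i == len(buckets):
        raise ValueError(f"seq_len {seq_len} exceeds max bucket {buckets[-1]}")
    return buckets[i]


buckets = make_buckets(8192)
# The FLUX run above reports raw_seq=4096, which lands exactly on a bucket.
print(pad_to_bucket(4096, buckets))  # -> 4096
```

Padding to a bucket wastes some compute on the padded tokens but lets one captured graph serve a range of input sizes, which is the usual trade-off behind piecewise CUDA graph capture.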

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@BBuf BBuf marked this pull request as draft March 4, 2026 04:23
@github-actions github-actions bot added the diffusion SGLang Diffusion label Mar 4, 2026
@zhaochenyang20
Collaborator

Stop reviewing codex PR. Review mine please:

#18806
#19152
#19225

@BBuf BBuf closed this Mar 4, 2026
@BBuf BBuf deleted the try_to_add_pcg branch March 4, 2026 07:29