[perf] Cache version checks #12399
Conversation
I recently noticed that we are spending a non-negligible amount of time in `version.parse` when running pipelines (approximately 50ms per step for the QwenImageEdit pipeline on a ZeroGPU Space, for instance, which in this case represents almost 10% of the actual compute). The calls to those version checks originate from:
- https://github.com/huggingface/diffusers/blob/4588bbeb4229fd307119257e273a424b370573b1/src/diffusers/hooks/hooks.py#L277
Maybe the issue can instead be solved at the root (why do we need to unwrap the modules at each call?), or maybe my particular setup triggered this? (I patched the forward method at the block level, but I don't feel like it affects `_set_context`.)
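For reference, here is a minimal sketch of the caching idea the PR title describes (illustrative only, not the exact implementation merged here): wrap the parse-and-compare step in `functools.lru_cache` so that repeated checks with the same arguments become dictionary lookups instead of re-parsing version strings. The helper name `_cached_compare` is hypothetical.

```python
import operator
from functools import lru_cache

from packaging import version

import torch


@lru_cache(maxsize=None)
def _cached_compare(library_version: str, operation: str, requirement: str) -> bool:
    """Parse both version strings once per unique argument tuple, then compare them."""
    op = getattr(operator, operation)  # e.g. "ge" -> operator.ge
    return op(version.parse(library_version), version.parse(requirement))


# Repeated calls from a per-step hook now hit the cache instead of version.parse.
print(_cached_compare(torch.__version__, "ge", "2.0.0"))
print(_cached_compare(torch.__version__, "ge", "2.0.0"))  # served from the cache
```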
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Interesting PR. Do we know why something like this would show up?
Indeed, I forgot to mention that the model had LoRA weights loaded. I'll check without and push the results + a minimal code snippet to reproduce
Small update: after a quick test on LambdaLabs, it looks like even though a lot of the time is spent in version checking (18% on QwenImageEdit, LambdaLabs H100), it does not matter in the end (same end-to-end pipeline duration), probably because CUDA calls happen asynchronously with the Python code. On ZeroGPU there was a real difference though (17% faster when disabling the unwrap_module function). I'll investigate more in the coming days / weeks.
After some measurements, I came to the conclusion that performance gains are hard to predict because they depend on whether PyTorch overhead (on the Python side) is a limiting factor or not. On fast-GPU + poor-CPU environments (slow CPU, or a CPU stressed by other apps), speed-ups will be visible; otherwise there won't be any. That said, I feel like caching the version checks is still incrementally better and makes diffusers more CPU-friendly. @sayakpaul do you see any potential unwanted drawbacks?
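To illustrate the CPU-bound vs. GPU-bound point, here is a hedged sketch (not from this PR) of how one might check whether Python-side overhead is hidden by asynchronous CUDA execution; the linear layer is a stand-in for a real pipeline step.

```python
import time

import torch

device = "cuda"
model = torch.nn.Linear(4096, 4096).to(device)
x = torch.randn(64, 4096, device=device)

# Warm-up so kernel launches and allocator behaviour are steady.
for _ in range(10):
    model(x)
torch.cuda.synchronize()

# Host-side time: how long Python spends just launching the work.
t0 = time.perf_counter()
for _ in range(100):
    model(x)
host_s = time.perf_counter() - t0
torch.cuda.synchronize()

# End-to-end time: wait for the GPU to actually finish.
t0 = time.perf_counter()
for _ in range(100):
    model(x)
torch.cuda.synchronize()
total_s = time.perf_counter() - t0

# If host_s is much smaller than total_s, extra Python work (like version parsing)
# hides behind the GPU; if they are close, the Python overhead shows up end to end.
print(f"host launch time: {host_s:.3f}s, end-to-end: {total_s:.3f}s")
```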
I don't! Would still just need to see if it interferes with
All good! Test command + output:
(.venv) ubuntu@192-222-52-227:~/diffusers$ RUN_SLOW=true RUN_COMPILE=true python -m pytest -v tests/models/transformers/test_models_transformer_flux.py::FluxTransformerCompileTests
================================================================================ test session starts ================================================================================
platform linux -- Python 3.10.12, pytest-8.4.2, pluggy-1.6.0 -- /home/ubuntu/diffusers/.venv/bin/python
cachedir: .pytest_cache
rootdir: /home/ubuntu/diffusers
configfile: pyproject.toml
plugins: timeout-2.4.0, xdist-3.8.0, requests-mock-1.10.0, anyio-4.11.0
collected 5 items
tests/models/transformers/test_models_transformer_flux.py::FluxTransformerCompileTests::test_compile_on_different_shapes PASSED [ 20%]
tests/models/transformers/test_models_transformer_flux.py::FluxTransformerCompileTests::test_compile_with_group_offloading PASSED [ 40%]
tests/models/transformers/test_models_transformer_flux.py::FluxTransformerCompileTests::test_compile_works_with_aot PASSED [ 60%]
tests/models/transformers/test_models_transformer_flux.py::FluxTransformerCompileTests::test_torch_compile_recompilation_and_graph_break PASSED [ 80%]
tests/models/transformers/test_models_transformer_flux.py::FluxTransformerCompileTests::test_torch_compile_repeated_blocks PASSED [100%]
================================================================================= warnings summary ==================================================================================
tests/models/transformers/test_models_transformer_flux.py::FluxTransformerCompileTests::test_compile_on_different_shapes
tests/models/transformers/test_models_transformer_flux.py::FluxTransformerCompileTests::test_compile_with_group_offloading
tests/models/transformers/test_models_transformer_flux.py::FluxTransformerCompileTests::test_torch_compile_recompilation_and_graph_break
tests/models/transformers/test_models_transformer_flux.py::FluxTransformerCompileTests::test_torch_compile_repeated_blocks
/home/ubuntu/diffusers/.venv/lib/python3.10/site-packages/torch/_dynamo/variables/functions.py:1575: UserWarning: Dynamo detected a call to a `functools.lru_cache`-wrapped function. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a *potential* risk, not something we have observed. Enable TORCH_LOGS="+dynamo" for a DEBUG stack trace.
torch._dynamo.utils.warn_once(msg)
tests/models/transformers/test_models_transformer_flux.py::FluxTransformerCompileTests::test_compile_on_different_shapes
/home/ubuntu/diffusers/.venv/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:282: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
warnings.warn(
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================================================== 5 passed, 5 warnings in 38.44s ===========================================================================
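Regarding the `functools.lru_cache` warning in the output above, here is a minimal sketch of the behaviour it reports (assumptions: the cached check gets called inside a compiled region; `_is_new_enough` is a hypothetical helper, not diffusers code).

```python
from functools import lru_cache

import torch
from packaging import version


@lru_cache(maxsize=None)
def _is_new_enough(v: str) -> bool:
    # Hypothetical cached version check, for illustration only.
    return version.parse(v) >= version.parse("2.0.0")


def fn(x: torch.Tensor) -> torch.Tensor:
    # Calling the lru_cache-wrapped check inside a compiled function may trigger the
    # warning above: Dynamo ignores the cache wrapper and traces the wrapped function.
    if _is_new_enough(torch.__version__):
        return x * 2
    return x + 1


compiled = torch.compile(fn)
print(compiled(torch.ones(4)))
```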
would there be benefits to expand this to cover different