# Add RequestScopedPipeline for safe concurrent inference, tokenizer lock and non-mutating retrieve_timesteps #12328
## Conversation
Hello there, thanks for this PR. We're touching a bunch of core files in this PR without discussing their scope separately. Our pipelines aren't meant to be async. To help expedite merging this, I would recommend:

This will likely be easier to follow for future contributors and users looking to understand what minimal elements they would need to change or add to get an async implementation going for Diffusers. LMK if any of it is unclear and if you have any thoughts to share.
Hey, thanks for the feedback. I've already reverted all the changes to the diffusers core and I'm moving all the logic to `examples/server-async`.

Thanks! It seems there are some remnant changes in the core files, still. Would you mind reverting them as well? But anyway, give me a ping once you think this PR is ready for another review.

Sure, I'll review the files I haven't reverted yet, and I'll let you know when it's ready for review :D

Hey @sayakpaul, I have already completed the rollback of all the diffusers core changes and I have left all the logic in `examples/server-async`.
Thanks for the changes. Left two clarification comments.
**examples/server-async/README.md** (outdated):

```diff
@@ -0,0 +1,136 @@
+# Asynchronous server and parallel execution of models
+> Example/demo server that keeps a single model in memory while safely running parallel inference requests by creating per-request lightweight views and cloning only small, stateful components (schedulers, RNG state, small mutable attrs). Works with StableDiffusion3/Flux pipelines and a custom `diffusers` fork.
```
Do we still have to use a custom diffusers fork when running this example, even with all the changes included? If so, I think we would want to change that.
Ahhh no no no, I forgot to change it in the README, sorry, I'll update that now
**examples/server-async/README.md** (outdated):

```diff
+## ⚠️ IMPORTANT
+* The server and inference harness live in this repo: `https://github.com/F4k3r22/DiffusersServer`.
```
As mentioned earlier, let's try to stick to a single pipeline example so that it's easier for users to follow.
Sure, I'll cut down the example
Hey @sayakpaul, I've already simplified the server, leaving only the examples with the SD3-3.5 and Flux pipelines, and updated the README with all the changes. No custom fork is needed.

In the end I decided to discard the Flux pipeline and only leave the SD3-3.5 ones.
Thanks for being patient with the feedback!
@bot /style

Style bot fixed some files and pushed the changes.

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
## What does this PR do?
This PR introduces a request-scoped pipeline abstraction and several safety/compatibility improvements that enable running many inference requests in parallel while keeping a single copy of the heavy weights (UNet, VAE, text encoder) in memory.
Main changes:

- `RequestScopedPipeline` (example implementation and utilities; see the sketch after this list), which:
  - creates a lightweight per-request view of the pipeline (shallow copy via `copy.copy`), so heavy weights stay shared;
  - avoids re-assigning read-only properties (e.g. `components`) to avoid "can't set attribute" errors;
  - supports `model_cpu_offload_context()` to allow memory offload hooks during generation;
  - serializes tokenizer access behind a lock (preventing `Already borrowed` errors).
- `retrieve_timesteps(..., return_scheduler=True)` helper:
  - returns `(timesteps, num_inference_steps, scheduler)` without mutating the shared scheduler;
  - when `return_scheduler=True` is not passed, behavior is identical to the previous API;
  - uses `scheduler.clone_for_request()` when available; otherwise attempts a safe `deepcopy()` and falls back to logging and safe defaults when cloning fails.
- An example server (`examples/DiffusersServer/`) showing how to run a single model in memory and serve concurrent inference requests safely via `/api/diffusers/inference`.
- Tests for the non-mutating `retrieve_timesteps` behavior.
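To make the mechanism concrete, here is a minimal sketch of the request-scoped idea (illustrative only, not the PR's exact code; it assumes the `clone_for_request` API described below, and the actual example implementation also handles read-only properties and offload hooks):

```python
import copy
import threading


class RequestScopedPipeline:
    """Sketch: share heavy weights across requests, clone small mutable state."""

    def __init__(self, pipeline):
        self._pipeline = pipeline
        # Fast (Rust) tokenizers are not re-entrant; a lock around tokenizer
        # access prevents "Already borrowed" RuntimeErrors under concurrency.
        self._tokenizer_lock = threading.Lock()

    def generate(self, prompt, num_inference_steps=30, **kwargs):
        # Shallow copy: references to UNet/VAE/text encoder are shared, so
        # no weights are duplicated (unlike copy.deepcopy of the pipeline).
        local = copy.copy(self._pipeline)

        # Give this request its own scheduler so set_timesteps() never
        # mutates the shared instance.
        base = self._pipeline.scheduler
        if hasattr(base, "clone_for_request"):
            local.scheduler = base.clone_for_request(num_inference_steps)
        else:
            local.scheduler = copy.deepcopy(base)

        # The real implementation wraps only the tokenizer calls with
        # self._tokenizer_lock; shown here as a plain call for brevity.
        return local(prompt, num_inference_steps=num_inference_steps, **kwargs)
```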
## Motivation and context

Calling `pipe.__call__` concurrently can hit race conditions (e.g., `scheduler.set_timesteps` mutating the shared scheduler) or accidentally duplicate the full pipeline in memory (`deepcopy`), exploding GPU memory.
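A hypothetical two-thread repro of the scheduler race (not part of the PR; the scheduler choice is arbitrary):

```python
import threading

from diffusers import EulerDiscreteScheduler

# One scheduler shared by two "requests", as happens when the same
# pipeline object serves concurrent calls.
scheduler = EulerDiscreteScheduler()

def request(steps: int) -> None:
    # set_timesteps() rewrites scheduler.timesteps in place, so interleaved
    # requests can end up denoising with each other's schedules.
    scheduler.set_timesteps(steps)

threads = [threading.Thread(target=request, args=(n,)) for n in (20, 50)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Whichever thread wrote last wins; the other request silently ran (or will
# run) with the wrong number of timesteps.
print(len(scheduler.timesteps))
```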
## Files changed / added (high level)

- `src/diffusers/pipelines/pipeline_utils.py` — `RequestScopedPipeline` implementation and utilities
- `src/diffusers/pipelines/stable_diffusion_3/pipeline_stable_diffusion_3.py`, `src/diffusers/pipelines/flux/pipeline_flux.py`, `src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py`, `src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py`, `src/diffusers/pipelines/t2i_adapter/pipeline_stable_diffusion_adapter.py`, `src/diffusers/pipelines/t2i_adapter/pipeline_stable_diffusion_xl_adapter.py` — `retrieve_timesteps(..., return_scheduler=True)` helper (backwards compatible)
- `src/diffusers/schedulers/*` — implementation of the `clone_for_request(self, ...)` method to avoid race conditions (adapted for each scheduler; a sketch follows this list)
- `examples/DiffusersServer/` — demo server and helper scripts:
  - `serverasync.py` (FastAPI app factory / server example)
  - `Pipelines.py` (pipeline loader classes)
  - `uvicorn_diffu.py` (recommended uvicorn flags)
  - `create_server.py`
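For reference, a per-scheduler `clone_for_request` can be as small as the following (a sketch under the stated semantics; the PR notes the method had to be adapted for each scheduler):

```python
import copy


def clone_for_request(self, num_inference_steps, device=None, **kwargs):
    """Return a private copy of this scheduler, configured for one request.

    Sketch only: deepcopy covers the config and the small state tensors,
    and set_timesteps() then runs on the copy, never the shared instance.
    """
    local = copy.deepcopy(self)
    local.set_timesteps(num_inference_steps, device=device, **kwargs)
    return local
```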
## Backward compatibility

- `retrieve_timesteps` is fully retro-compatible: if users do not pass `return_scheduler=True`, the call behaves exactly as before (it will call `set_timesteps` on the shared scheduler).
- `RequestScopedPipeline` is additive and opt-in for server authors who want safe concurrency.
- `RequestScopedPipeline` creates a lightweight, per-request view of a pipeline via a shallow copy and clones only small, mutable components (scheduler, RNG state, callbacks, small lists/dicts). Large model weights (UNet, VAE, text encoder) remain shared and are not duplicated.
- `clone_for_request` semantics: `scheduler.clone_for_request(num_inference_steps, ...)` is used as the preferred mechanism to obtain a scheduler configured for a single request. This ensures that any mutations performed by `set_timesteps(...)` are applied only to the local scheduler copy and never to the shared scheduler.
- If a scheduler does not implement `clone_for_request`, `retrieve_timesteps(..., return_scheduler=True)` attempts safe fallbacks in this order (see the sketch after this list): (1) `deepcopy(scheduler)` and configure the copy, (2) `copy.copy(scheduler)` with a logged warning about potential shared-state risk. Only if all cloning strategies fail will the code fall back to mutating the original scheduler (and this is logged as a last-resort warning).
- Default calls (`return_scheduler=False`) preserve the original, pre-existing semantics.
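The fallback chain above, written out as a hedged sketch (the helper name `_scheduler_for_request` is hypothetical; the real logic lives inside `retrieve_timesteps`):

```python
import copy
import logging

logger = logging.getLogger(__name__)


def _scheduler_for_request(scheduler, num_inference_steps, device=None):
    # Preferred path: the scheduler knows how to clone itself per request.
    if hasattr(scheduler, "clone_for_request"):
        return scheduler.clone_for_request(num_inference_steps, device=device)
    # Fallback 1: deepcopy, then configure the private copy.
    try:
        local = copy.deepcopy(scheduler)
    except Exception:
        logger.warning("deepcopy failed; trying copy.copy (shared-state risk)")
        try:
            # Fallback 2: shallow copy; internal tensors may still be shared.
            local = copy.copy(scheduler)
        except Exception:
            # Last resort: mutate the shared scheduler, as before this PR.
            logger.warning("cloning failed; falling back to the shared scheduler")
            local = scheduler
    local.set_timesteps(num_inference_steps, device=device)
    return local
```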
## How to test / reproduce

```bash
pip install -e .
pip install -r examples/DiffusersServer/requirements.txt
```
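Once the example server is up, a small concurrency smoke test could look like this (the endpoint path comes from the PR description; the port and JSON field names are assumptions, so check the example's request schema):

```python
# Fire a few concurrent requests at the demo endpoint.
import concurrent.futures

import requests

URL = "http://localhost:8000/api/diffusers/inference"

def one_request(i):
    # NOTE: payload fields here are illustrative, not taken from the PR.
    resp = requests.post(
        URL,
        json={"prompt": f"a photo of a corgi ({i})", "num_inference_steps": 30},
        timeout=600,
    )
    return resp.status_code

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    print(list(pool.map(one_request, range(4))))  # expect all 200s
```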
Verify that:
- No `Already borrowed` tokenizer errors happen under load.
- `retrieve_timesteps(..., return_scheduler=True)` returns a scheduler instance for per-request use and does not mutate the shared scheduler (a quick check follows this list).
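The second point can be checked with an assertion along these lines (a sketch; the import path follows the original file list above and is an assumption, since the final PR moves the logic under the example directory):

```python
from diffusers import EulerDiscreteScheduler
# Assumed location of the patched helper, per the file list above.
from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion import (
    retrieve_timesteps,
)

shared = EulerDiscreteScheduler()
before = shared.timesteps  # the default schedule built in __init__

timesteps, num_steps, local = retrieve_timesteps(
    shared, num_inference_steps=30, device="cpu", return_scheduler=True
)

assert local is not shared          # we got a per-request scheduler
assert shared.timesteps is before   # the shared schedule was never rebound
assert len(timesteps) == num_steps == 30
```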
## Performance notes

## Security & maintenance notes
## Before submitting
- Documentation (`docs/usage/async_server.md`)
- Tests (`tests/test_request_scoped.py`)

## Who can review?
Core library / Schedulers / Pipelines: @yiyixuxu, @asomoza, @sayakpaul
General / integrations: @DN6, @stevhliu