
Conversation

@FredyRivera-dev
Contributor

@FredyRivera-dev FredyRivera-dev commented Sep 14, 2025

What does this PR do?

This PR introduces a request-scoped pipeline abstraction and several safety/compatibility improvements that enable running many inference requests in parallel while keeping a single copy of the heavy weights (UNet, VAE, text encoder) in memory.

Main changes:

  • Add RequestScopedPipeline (example implementation and utilities; a minimal sketch follows this list), which:
    • Creates a lightweight per-request view of a pipeline via a shallow-copy (copy.copy).
    • Clones only small, stateful components per-request (scheduler, RNG state, callbacks, small mutable attrs) while sharing large model weights.
    • Detects and skips read-only pipeline properties (e.g., components) to avoid "can't set attribute" errors.
    • Optionally enters a model_cpu_offload_context() to allow memory offload hooks during generation.
  • Add tokenizer concurrency safety:
    • The request-scoped wrapper manages an internal tokenizer lock to avoid Rust tokenizer race conditions (Already borrowed errors).
  • Add retrieve_timesteps(..., return_scheduler=True) helper:
    • Returns (timesteps, num_inference_steps, scheduler) without mutating the shared scheduler.
    • Fully backward compatible: if return_scheduler=True is not passed, behavior is identical to the previous API.
  • Add fallback heuristics:
    • Prefer scheduler.clone_for_request() when available; otherwise attempt a safe deepcopy() and fall back to logging and safe defaults when cloning fails.
  • Documentation and examples:
    • Add an example/demo server (under examples/DiffusersServer/) showing how to run a single model in memory and serve concurrent inference requests safely.
    • Document recommended flags, environment, and an example POST request for /api/diffusers/inference.
  • Tests & CI:
    • (See "How to test") unit tests and a simple concurrency test harness are included to validate the tokenizer lock and retrieve_timesteps behavior.

Motivation and context

  • A naive server that calls pipe.__call__ concurrently can hit race conditions (e.g., scheduler.set_timesteps mutating the shared scheduler) or accidentally duplicate the full pipeline in memory (via deepcopy), exploding GPU memory.
  • This PR provides a lightweight pattern that isolates per-request mutable state while keeping the heavy model parameters shared, addressing both correctness (race conditions) and memory usage.

Files changed / added (high level)

  • src/diffusers/pipelines/pipeline_utils.py - RequestScopedPipeline implementation and utilities
  • src/diffusers/pipelines/stable_diffusion_3/pipeline_stable_diffusion_3.py, src/diffusers/pipelines/flux/pipeline_flux.py, src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py, src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py, src/diffusers/pipelines/t2i_adapter/pipeline_stable_diffusion_adapter.py, src/diffusers/pipelines/t2i_adapter/pipeline_stable_diffusion_xl_adapter.py - retrieve_timesteps(..., return_scheduler=True) helper (backward compatible)
  • src/diffusers/schedulers/* - Implementation of the clone_for_request(self, ...) method to avoid race conditions (adapted for each scheduler)
  • examples/DiffusersServer/ — demo server and helper scripts:
    • serverasync.py (FastAPI app factory / server example)
    • Pipelines.py (pipeline loader classes)
    • uvicorn_diffu.py (recommended uvicorn flags)
    • create_server.py
  • Minor additions to project README describing the example server and the expected behavior

Backward compatibility

  • retrieve_timesteps is fully backward compatible: if users do not pass return_scheduler=True, the call behaves exactly as before (it will call set_timesteps on the shared scheduler).
  • Existing pipelines and public APIs are not modified in a breaking way. The new RequestScopedPipeline is additive and opt-in for server authors who want safe concurrency.
  • RequestScopedPipeline creates a lightweight, per-request view of a pipeline via a shallow copy and clones only small, mutable components (scheduler, RNG state, callbacks, small lists/dicts). Large model weights (UNet, VAE, text encoder) remain shared and are not duplicated.
  • Scheduler handling and clone_for_request semantics (a minimal sketch follows this list):
    • When available, scheduler.clone_for_request(num_inference_steps, ...) is used as the preferred mechanism to obtain a scheduler configured for a single request. This ensures that any mutations performed by set_timesteps(...) are applied only to the local scheduler copy and never to the shared scheduler.
    • If a scheduler does not implement clone_for_request, retrieve_timesteps(..., return_scheduler=True) attempts safe fallbacks in this order: (1) deepcopy(scheduler) and configure the copy, (2) copy.copy(scheduler) with a logged warning about potential shared-state risk. Only if all cloning strategies fail will the code fall back to mutating the original scheduler (and this is logged as a last-resort warning).
    • This behavior is opt-in: callers who do not request a scheduler (or pass return_scheduler=False) preserve the original, pre-existing semantics.
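
To make the fallback order concrete, here is a minimal sketch of a return_scheduler-aware retrieve_timesteps, assuming the module-level helper signature used in the Stable Diffusion pipeline files; the actual code in the PR may differ in details:

```python
import copy
import logging

logger = logging.getLogger(__name__)


def retrieve_timesteps(scheduler, num_inference_steps=None, device=None, return_scheduler=False, **kwargs):
    if not return_scheduler:
        # Pre-existing behavior: configure the shared scheduler in place.
        scheduler.set_timesteps(num_inference_steps, device=device, **kwargs)
        return scheduler.timesteps, num_inference_steps

    # Per-request path: never mutate the shared scheduler.
    if hasattr(scheduler, "clone_for_request"):
        # Preferred: the clone arrives already configured for this request.
        local = scheduler.clone_for_request(num_inference_steps=num_inference_steps, device=device)
    else:
        try:
            local = copy.deepcopy(scheduler)
        except Exception:
            logger.warning("deepcopy failed; falling back to copy.copy (shared-state risk)")
            local = copy.copy(scheduler)
        local.set_timesteps(num_inference_steps, device=device, **kwargs)
    return local.timesteps, num_inference_steps, local
```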

How to test / reproduce

  1. Install the package in editable mode along with the example requirements:
  pip install -e .
  pip install -r examples/DiffusersServer/requirements.txt
  2. Start the example server:
   python examples/DiffusersServer/serverasync.py

or

   python examples/DiffusersServer/uvicorn_diffu.py
  3. Run multiple concurrent requests (example):
   python -c "import requests, concurrent.futures, json
   def r(): return requests.post('http://localhost:8500/api/diffusers/inference', json={'prompt':'A futuristic cityscape','num_inference_steps':30,'num_images_per_prompt':1}).json()
   with concurrent.futures.ThreadPoolExecutor(max_workers=20) as ex: print([ex.submit(r).result() for _ in range(20)])"
  4. Verify that:

    • No "Already borrowed" tokenizer errors occur under load.
    • GPU memory usage does not grow linearly with requests (heavy weights remain shared).
    • retrieve_timesteps(..., return_scheduler=True) returns the scheduler instance for per-request use and does not mutate the shared scheduler.

Performance notes

  • Small per-request overhead for shallow copy and cloning of small mutable state.
  • Large tensors/weights are shared; this keeps memory usage low while enabling tens of parallel inferences (recommended ~10–50 inferences in parallel depending on hardware).

Security & maintenance notes

  • The example server is a demo/harness and is not hardened for production on its own; we recommend placing it behind proper authentication/an API gateway and rate limiting.
  • Add monitoring for memory and request queue lengths when deploying.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline?
  • Did you read our philosophy doc?
  • Was this discussed/approved via a GitHub issue or the [forum]? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? (see docs/usage/async_server.md)
  • Did you write any new necessary tests? (see tests/test_request_scoped.py)

Who can review?

Core library / Schedulers / Pipelines: @yiyixuxu, @asomoza, @sayakpaul
General / integrations: @DN6, @stevhliu

@sayakpaul
Member

Hello there,

Thanks for this PR. It touches a bunch of core files without their scope having been discussed separately. Our pipelines aren't meant to be async.

To help expedite merging this, I would recommend:

  • Working with a folder under examples like we're doing -- that's perfect.
  • Subclassing a few classes instead of so many pipelines in the beginning. Perhaps, we could just pick one from SDXL, Flux, or Qwen (or whichever you would like to choose).
    • We could add clone_for_request() in the subclassed scheduler implementation, for example -- class AsyncDPMMultistepScheduler.
    • Similarly, we could just maintain async_friendly_retrieve_timesteps() under a utils.py script in the example folder.
  • Showing a full working example with the subclassed implementation.
  • Putting all the details for execution in a README.

This will likely be easier to follow for the future contributors and users looking into understanding what minimal elements they would need to change / add to have an async implementation going for Diffusers.

LMK if any of it is unclear and if you have any thoughts to share.
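
For illustration, a minimal sketch of the subclassed-scheduler idea above, assuming diffusers' DPMSolverMultistepScheduler as the base class; AsyncDPMMultistepScheduler and clone_for_request are illustrative names from this thread, not part of the diffusers public API:

```python
import copy

from diffusers import DPMSolverMultistepScheduler


class AsyncDPMMultistepScheduler(DPMSolverMultistepScheduler):
    """Example-only subclass: per-request scheduler clones without touching diffusers core."""

    def clone_for_request(self, num_inference_steps, device=None):
        # Schedulers carry only small tensors/lists, so a deepcopy per request is cheap
        # and keeps set_timesteps() from mutating the shared instance.
        local = copy.deepcopy(self)
        local.set_timesteps(num_inference_steps, device=device)
        return local
```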

@FredyRivera-dev
Contributor Author

FredyRivera-dev commented Sep 15, 2025

Hey, thanks for the feedback. I've already reverted all the changes to the diffusers core and I'm moving all the logic to examples/server-async. Over the next few days I'll finish moving everything so the asynchronous operation can be replicated, and I'll update the README :D @sayakpaul

@sayakpaul
Member

sayakpaul commented Sep 15, 2025

Thanks!

It seems there are still some remnant changes in the core files. Would you mind reverting them as well? Anyway, give me a ping once you think this PR is ready for another review.

@FredyRivera-dev
Contributor Author

Sure, I'll review the files I haven't reverted yet, and I'll let you know when it's ready for review :D

@FredyRivera-dev
Contributor Author

Hey @sayakpaul, I've completed the rollback of all the diffusers core changes and left all the logic in examples/server-async, and I've updated the README at examples/server-async/README.md so you can replicate the async server execution. Check it out and tell me if everything is ok :D

Member

@sayakpaul sayakpaul left a comment


Thanks for the changes. Left two clarification comments.

@@ -0,0 +1,136 @@
# Asynchronous server and parallel execution of models

> Example/demo server that keeps a single model in memory while safely running parallel inference requests by creating per-request lightweight views and cloning only small, stateful components (schedulers, RNG state, small mutable attrs). Works with StableDiffusion3/Flux pipelines and a custom `diffusers` fork.
Member


Do we still have to use a custom diffusers when running this example even with all the changes included? If so, I think we would want to change that.

Contributor Author


Ahhh no no no, I forgot to change it in the README, sorry, I'll update that now


## ⚠️ IMPORTANT

* The server and inference harness live in this repo: `https://github.com/F4k3r22/DiffusersServer`.
Member


As mentioned earlier, let's try to just stick to a single pipeline example so that it's easier to follow for the users.

Contributor Author


Sure, I'll cut down the example

@FredyRivera-dev
Contributor Author

Hey @sayakpaul, I've simplified the server so it only keeps the SD3-3.5 and Flux pipeline examples, and updated the README with all the changes. No custom fork is needed.

@FredyRivera-dev
Contributor Author

In the end I decided to discard the Flux pipeline and keep only the SD3-3.5 ones.

Member

@sayakpaul sayakpaul left a comment


Thanks for being patient with the feedback!

@sayakpaul
Member

@bot /style

@github-actions
Contributor

github-actions bot commented Sep 18, 2025

Style bot fixed some files and pushed the changes.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@sayakpaul sayakpaul merged commit eda9ff8 into huggingface:main Sep 18, 2025
25 checks passed