Commit b0dd3b7

initial

1 parent 480fb35 commit b0dd3b7

File tree

1 file changed: +24 -0 lines changed


docs/source/en/using-diffusers/loading.md

Lines changed: 24 additions & 0 deletions
@@ -112,6 +112,30 @@ print(pipe.transformer.dtype, pipe.vae.dtype) # (torch.bfloat16, torch.float16)

If a component is not explicitly specified in the dictionary and no `default` is provided, it will be loaded with `torch.float32`.

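For example, a minimal sketch of this dictionary form (the checkpoint and per-component dtypes are illustrative):

```py
import torch
from diffusers import DiffusionPipeline

# Illustrative sketch: the transformer gets an explicit dtype, and every
# component not named in the dict falls back to the "default" entry.
pipe = DiffusionPipeline.from_pretrained(
    "Wan-AI/Wan2.2-I2V-A14B-Diffusers",
    torch_dtype={"transformer": torch.bfloat16, "default": torch.float16},
)
print(pipe.transformer.dtype, pipe.vae.dtype)  # (torch.bfloat16, torch.float16)
```
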
#### Parallel loading
Large models are often [sharded](../training/distributed_inference#model-sharding) into smaller files so that they are easier to load. Diffusers supports loading shards in parallel to speed up the loading process.
Set the environment variables below to enable parallel loading.
- Set `HF_ENABLE_PARALLEL_LOADING` to `"YES"` to enable parallel loading of shards.
- Set `HF_PARALLEL_LOADING_WORKERS` to configure the number of parallel threads used to load shards. More workers load a model faster but use more memory.
```py
import os
import torch
from diffusers import DiffusionPipeline

# Parallel loading must be enabled through the environment before the
# pipeline is loaded.
os.environ["HF_ENABLE_PARALLEL_LOADING"] = "YES"
os.environ["HF_PARALLEL_LOADING_WORKERS"] = "12"

pipeline = DiffusionPipeline.from_pretrained(
    "Wan-AI/Wan2.2-I2V-A14B-Diffusers",
    torch_dtype=torch.bfloat16,
    device_map="cuda"
)
```
### Local pipeline

To load a pipeline locally, use [git-lfs](https://git-lfs.github.com/) to manually download a checkpoint to your local disk.
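
For instance, a minimal sketch assuming the checkpoint has already been cloned with git-lfs into a local folder (the path below is hypothetical):

```py
from diffusers import DiffusionPipeline

# Load from a local folder instead of the Hub. The path is hypothetical and
# assumes the repository was cloned beforehand with git-lfs.
pipeline = DiffusionPipeline.from_pretrained("./Wan2.2-I2V-A14B-Diffusers")
```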
