[Feat] support for multi-block layerwise offloading #1486

Open
RuixiangMa wants to merge 4 commits into vllm-project:main from RuixiangMa:multiblockoffload
@RuixiangMa (Contributor) commented Feb 25, 2026

Purpose

Some diffusion models (e.g., Flux, LongCat, Ovis) have two types of transformer blocks (e.g., transformer_blocks and single_transformer_blocks). The previous implementation supported only a single block type, which limited the effectiveness of layerwise offloading for these models.

  • Implement the _layerwise_offload_blocks_attrs attribute to support models with multiple block types
  • Remain compatible with existing single-block models that use _layerwise_offload_blocks_attr
  • Add support for Flux, Flux2-Klein, and Z-Image (single-block) models
  • Bug fix: top-level parameters/buffers no longer stay on CPU during offloading
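
As a rough illustration of the first two bullets, the sketch below shows how a model class might declare its block containers and how an offload hook could gather them. Only the two attribute names come from this PR; `FluxLike`, `SingleBlockModel`, `collect_offload_blocks`, the plain lists standing in for `nn.ModuleList`, and the choice to check the legacy attribute first are all assumptions made for illustration.

```python
# Hypothetical sketch: plain lists stand in for nn.ModuleList containers.
class FluxLike:
    # Multi-block models list every block container by attribute name.
    _layerwise_offload_blocks_attrs = ["transformer_blocks", "single_transformer_blocks"]

    def __init__(self):
        self.transformer_blocks = ["double_0", "double_1"]
        self.single_transformer_blocks = ["single_0", "single_1", "single_2"]


class SingleBlockModel:
    # Legacy declaration used by existing single-block models.
    _layerwise_offload_blocks_attr = "blocks"

    def __init__(self):
        self.blocks = ["block_0", "block_1"]


def collect_offload_blocks(model):
    """Return (attribute names, flat list of blocks) for a model instance."""
    cls = type(model)
    # Assumption: the legacy single attribute is consulted first.
    legacy = getattr(cls, "_layerwise_offload_blocks_attr", None)
    if legacy is not None:
        names = [legacy]
    else:
        names = list(getattr(cls, "_layerwise_offload_blocks_attrs", None) or [])
    blocks = []
    for name in names:
        # Blocks from every container are offloaded as one ordered sequence.
        blocks.extend(getattr(model, name))
    return names, blocks
```

With this shape, a Flux-style model contributes both containers to the offload schedule, while a single-block model behaves as before.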

Test Plan

Test Result

NVIDIA RTX 4090 (24 GB)

vllm serve --model /data/models/black-forest-labs/FLUX* --omni --enable_layerwise_offload --port 8004

curl -X POST http://localhost:8004/v1/images/generations   -H "Content-Type: application/json"   -d '{
    "prompt": "a majestic dragon perched on the mountain ridge of Vermont, misty morning atmosphere, photorealistic style",
    "size": "1024x1024",
    "num_inference_steps": 50,
    "cfg_scale": 4.0,
    "guidance_scale": 4.0,
    "seed": 42
  }' | jq -r '.data[0].b64_json' | base64 -d > dragon.png
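
| Model | FLUX.1-dev | FLUX.2-klein-4B | FLUX.2-klein-9B | Qwen-Image-2512 |
| --- | --- | --- | --- | --- |
| Image | (image not preserved) | (image not preserved) | (image not preserved) | (image not preserved) |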

Note: the FLUX series uses multiple block types, while Qwen-Image-2512 uses a single block type.

Offload vs. no offload

Since FLUX.1-dev and FLUX.2-klein-9B, among others, run out of memory (OOM) without layerwise offloading, we use FLUX.2-klein-4B and Z-Image as representative examples to illustrate memory usage:

| Model | No Offload (VRAM) | With Offload (VRAM) |
| --- | --- | --- |
| FLUX.2-klein-4B | 19.7 GB | 13.8 GB |
| Z-Image | 22.7 GB | 15.5 GB |
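
The savings implied by the figures above can be worked out directly. The snippet below only does arithmetic on the reported numbers; the helper name `vram_savings` is made up for this example.

```python
# VRAM figures in GB as reported above: (no offload, with offload).
reported = {
    "FLUX.2-klein-4B": (19.7, 13.8),
    "Z-Image": (22.7, 15.5),
}


def vram_savings(no_offload_gb, offload_gb):
    """Return (GB saved, percent saved), each rounded to one decimal place."""
    saved = no_offload_gb - offload_gb
    return round(saved, 1), round(100 * saved / no_offload_gb, 1)


for model, (before, after) in reported.items():
    saved_gb, saved_pct = vram_savings(before, after)
    print(f"{model}: {saved_gb} GB saved ({saved_pct}%)")
```

That works out to 5.9 GB (29.9%) for FLUX.2-klein-4B and 7.2 GB (31.7%) for Z-Image, so roughly a 30% reduction in both cases.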

Signed-off-by: Lancer <maruixiang6688@gmail.com>

# Handle multiple block types (_layerwise_offload_blocks_attrs)
if blocks_attr_name is None:
    blocks_attrs_names = getattr(model.__class__, "_layerwise_offload_blocks_attrs", None)

Contributor

IMO having both _layerwise_offload_blocks_attrs and _layerwise_offload_blocks_attr is a little confusing. I think it would be cleaner to just have one attr that can also be a list, because the behavior is not well-defined if a module sets both attributes by mistake.
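
One way to read this suggestion, sketched under assumptions: keep a single class attribute that accepts either a string or a list of strings, and normalize it in one place. The helper `normalize_blocks_attr` and the `BlocksAttr` alias are hypothetical and not part of the PR.

```python
from typing import List, Sequence, Union

# A single attribute value may be one name, several names, or unset.
BlocksAttr = Union[str, Sequence[str], None]


def normalize_blocks_attr(value: BlocksAttr) -> List[str]:
    """Turn a str, a list/tuple of str, or None into a list of attribute names."""
    if value is None:
        return []
    if isinstance(value, str):
        # A bare string is treated as a one-element list.
        return [value]
    names = list(value)
    if not all(isinstance(n, str) for n in names):
        raise TypeError("block attribute must be a str or a sequence of str")
    return names
```

Because there is only one attribute, the question of which one wins when both are set does not arise.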

Contributor Author

Yeah, accounted for that, only kept the legacy path for compatibility, but can refactor if needed.

Signed-off-by: Lancer <maruixiang6688@gmail.com>

def __init__(self):
    self.blocks = nn.ModuleList([...])  # Transformer blocks
```

Collaborator

This PR adds multi-block layerwise offloading but provides no test coverage. Add tests to verify: (1) multi-block offloading works correctly with different block types, (2) memory usage is reduced as expected, (3) output quality is maintained, and (4) edge cases like empty or invalid block attributes are handled.


- if not blocks_attr_name or not blocks:
+ if not blocks:
      logger.warning(

Collaborator

No validation for blocks_attr_names. What happens if an attribute name doesn't exist on the model? Add error handling to check that each attribute in _layerwise_offload_blocks_attrs exists and contains valid blocks, with clear error messages for misconfiguration.
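
A minimal sketch of the kind of check being requested, assuming the names have already been read from the class attribute. `validate_blocks_attrs` is a hypothetical helper, and the exception types and messages are illustrative choices rather than what the PR implements.

```python
def validate_blocks_attrs(model, attr_names):
    """Raise a descriptive error if a declared block attribute is missing or empty."""
    model_name = type(model).__name__
    for name in attr_names:
        if not hasattr(model, name):
            raise AttributeError(
                f"{model_name}: block attribute {name!r} listed in "
                "_layerwise_offload_blocks_attrs does not exist on the model"
            )
        blocks = getattr(model, name)
        # Accept any sized container (a list here, nn.ModuleList in practice).
        if not hasattr(blocks, "__len__") or len(blocks) == 0:
            raise ValueError(
                f"{model_name}: block attribute {name!r} must be a "
                "non-empty sequence of blocks"
            )
```

Failing fast here would turn a silent no-op into a message that names the misconfigured attribute.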

m.to(self.device)

# Move top-level params/buffers to GPU (dit_module's own, not sub-modules)
for param in dit_module._parameters.values():

Collaborator

This changes the offloading behavior from single 'blocks' attribute to multiple block attributes. Verify backward compatibility - existing models with only 'blocks' should still work. Consider adding a deprecation warning if the old single-attribute pattern is detected.
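
The deprecation idea could look roughly like the following, using Python's standard `warnings` module. The function name `resolve_blocks_attrs`, the warning text, and the decision to check the legacy attribute first are assumptions for illustration only.

```python
import warnings


def resolve_blocks_attrs(model_cls):
    """Return block attribute names, warning when the legacy attribute is used."""
    legacy = getattr(model_cls, "_layerwise_offload_blocks_attr", None)
    if legacy is not None:
        warnings.warn(
            f"{model_cls.__name__} uses _layerwise_offload_blocks_attr; "
            "consider _layerwise_offload_blocks_attrs instead",
            DeprecationWarning,
            stacklevel=2,
        )
        # Existing single-block models keep working unchanged.
        return [legacy]
    names = getattr(model_cls, "_layerwise_offload_blocks_attrs", None)
    return list(names) if names else []
```

Existing single-block models keep their behavior and only gain a warning, which addresses the backward-compatibility concern without forcing an immediate migration.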

Contributor Author

The single-block model case has been verified. I'll add the results.

Contributor

@lishunyang12 left a comment

Left a couple comments on the backend changes. The multi-block approach looks right for Flux-style models.

- m.to(self.device)
- logger.debug(f"Moved {name} to device {self.device}")
+ if blocks_attr_names and name not in blocks_attr_names:
+     m.to(self.device)

Contributor

The old code had logger.debug calls here for skipped/moved modules. Dropping them makes offloading issues harder to debug — can you keep the logging?
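
A minimal sketch of what keeping the logging could look like, assuming a loop over the module's named children. `move_non_block_children` and the `FakeModule` stand-in are hypothetical; only the two log messages mirror the ones that appear in the diff.

```python
import logging

logger = logging.getLogger("layerwise_offload_sketch")


class FakeModule:
    """Stand-in for nn.Module that only records the device it was moved to."""

    def __init__(self):
        self.device = "cpu"

    def to(self, device):
        self.device = device
        return self


def move_non_block_children(named_children, blocks_attr_names, device):
    """Move every child except the block containers, logging each decision."""
    moved = []
    for name, m in named_children:
        if name in blocks_attr_names:
            # Block containers stay on CPU and are handled by the offload hook.
            logger.debug("Skipped blocks module %s", name)
            continue
        m.to(device)
        logger.debug("Moved %s to device %s", name, device)
        moved.append(name)
    return moved
```

With debug logging enabled, each child shows up as either skipped or moved, which makes a misplaced module easy to spot.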

for param in dit_module._parameters.values():
    if param is not None:
        param.data = param.data.to(self.device, non_blocking=True)

Contributor

Moving top-level params/buffers looks like a separate bug fix (previously they would stay on CPU). Worth calling out in the PR description so it does not get overlooked during review.

        logger.debug(f"Skipped blocks module {name}")
        continue
    m.to(self.device)
    logger.debug(f"Moved {name} to device {self.device}")

Contributor

Nit: the `blocks_attr_names and` guard is redundant. We already continue above when not blocks, and blocks being non-empty implies blocks_attr_names is non-empty.

Contributor Author

Thanks, fixed it.

Signed-off-by: Lancer <maruixiang6688@gmail.com>
Signed-off-by: Lancer <maruixiang6688@gmail.com>
@RuixiangMa (Contributor Author) commented Feb 28, 2026

Z-Image is also supported in this PR to validate the memory savings.

4 participants