Just an idea.
It's not a problem or anything...
I've been using a custom offload for my potato GPU. Maybe there is another way to do it...

In short, I've been using sequential offloading for a long time. When I enable it, it uses a minimal amount of VRAM, but I know it could use more VRAM to do less I/O. So I created a mixin for partial CPU offload, where the model keeps several layers on the GPU and offloads only the rest.
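For context, the baseline I'm comparing against is the stock sequential offload in diffusers, which keeps everything on CPU and copies each submodule to the GPU only for its own forward pass (minimal sketch; the model id is just a placeholder):

```python
import torch
from diffusers import DiffusionPipeline

# Baseline: diffusers' built-in sequential offload. VRAM use is minimal,
# but every denoising step pays the full CPU<->GPU transfer cost for
# every submodule.
pipe = DiffusionPipeline.from_pretrained(
    "some/model-id",                  # placeholder model id
    torch_dtype=torch.float16,
)
pipe.enable_sequential_cpu_offload()  # stock diffusers API
image = pipe("a photo of a potato").images[0]
```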
See code here: https://gist.github.com/rodjjo/20e2e842fea9ed58114adb560a4566b6
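The gist has the full mixin; the core idea is roughly this (a minimal sketch using plain PyTorch forward hooks, not the gist's actual code; `PartialOffloadMixin`, `keep_on_gpu`, and the hook names are all placeholders):

```python
import torch
from torch import nn

class PartialOffloadMixin:
    """Sketch of partial sequential offload: the first `keep_on_gpu`
    blocks stay resident on the GPU; the rest live on CPU and are
    streamed in and out around their own forward pass."""

    def enable_partial_offload(self, blocks: nn.ModuleList,
                               keep_on_gpu: int, device: str = "cuda"):
        gpu = torch.device(device)
        for i, block in enumerate(blocks):
            if i < keep_on_gpu:
                block.to(gpu)    # resident layers: no per-step I/O
            else:
                block.to("cpu")  # offloaded layers: streamed on demand
                self._attach_streaming_hooks(block, gpu)

    @staticmethod
    def _attach_streaming_hooks(block: nn.Module, gpu: torch.device):
        def load(module, args):
            module.to(gpu)       # copy weights in just before the block runs

        def evict(module, args, output):
            module.to("cpu")     # free VRAM right after the block runs

        block.register_forward_pre_hook(load)
        block.register_forward_hook(evict)
```

The trade-off is just `keep_on_gpu`: the more blocks stay resident, the fewer transfers per step, at the cost of more VRAM.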
It saves me 12 to 13 seconds of inference in zimage turbo (my custom pipeline with this partial-layer offloading):
Before (normal sequential offloading): (timing screenshot)

After (partial sequential offloading): (timing screenshot)