
[feature] Add TeaCache Support to Glm Image#1458

Open
akshatvishu wants to merge 15 commits into vllm-project:main from akshatvishu:teacache/glm_image

Conversation

@akshatvishu

Part of #1217

Purpose

Add TeaCache Support to Glm Image

Test Plan

Initial coefficient estimation and testing done at this colab notebook

Test Result



- Implement GLM-Image CacheContext extractor
- Add GLMImageAdapter for coefficient estimation

Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
- Remove collected_data tensor storage
- Compute and store scalar L1 diffs immediately
- Prevent RAM growth for large calibration runs
- No change to coefficient math
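The streaming change described above can be sketched as follows (a NumPy illustration under assumed tensor shapes, not the project's actual code):

```python
import numpy as np

def rel_l1(prev: np.ndarray, curr: np.ndarray) -> float:
    """Relative L1 distance between consecutive modulated inputs."""
    return float(np.abs(curr - prev).mean() / (np.abs(prev).mean() + 1e-8))

class StreamingL1Collector:
    """Stores only scalar diffs, so RAM stays flat for long calibration runs."""

    def __init__(self):
        self.prev = None
        self.diffs: list[float] = []  # scalars, never full tensors

    def update(self, modulated_input: np.ndarray) -> None:
        if self.prev is not None:
            self.diffs.append(rel_l1(self.prev, modulated_input))
        self.prev = modulated_input  # keep only the most recent tensor

# Example: 10 fake denoising steps
collector = StreamingL1Collector()
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64))
for _ in range(10):
    x = x + 0.1 * rng.standard_normal(x.shape)
    collector.update(x)
print(len(collector.diffs))  # 9 scalar diffs for 10 steps
```

Because only one tensor (the previous step's input) is ever retained, memory no longer grows with the number of calibration steps, and the scalar diffs feed the same coefficient math as before.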

Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
Added `self.do_true_cfg = True` to GlmImageTransformer2DModel initialization.

Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
@akshatvishu
Author

akshatvishu commented Feb 24, 2026

Question 1: Coefficient Validity Without do_true_cfg = True

During coefficient estimation, GlmImageTransformer2DModel did not have self.do_true_cfg = True set.
The estimation pipeline performed full forward passes and collected modulated input/output tensors directly via the extractor. Since the extractor operates on the transformer forward pass itself (independent of CFG branch state handling), I would expect the collected tensors to remain valid.

However, I’m unsure whether the missing do_true_cfg flag could have caused:

  • unintended mixing of CFG branches during forward execution, or

  • altered internal behavior that would invalidate the polynomial fit training data.
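Assuming the collected tensors are valid, the downstream fit itself is straightforward: TeaCache-style estimation fits a polynomial mapping the input-side relative L1 distance to the output-side distance. A rough NumPy sketch with synthetic data (not the actual estimation code; the coefficients and data here are made up):

```python
import numpy as np

# Hypothetical calibration data: per-step relative L1 distances of the
# modulated inputs (x) and of the transformer outputs (y). Real data
# would come from the extractor's collected forward-pass tensors.
rng = np.random.default_rng(42)
x = np.sort(rng.uniform(0.01, 0.5, 200))
y = 0.3 * x + 2.0 * x**2 + 0.02 * np.abs(rng.standard_normal(200))

# Fit a 4th-order polynomial that rescales the input-side distance
# into a predicted output-side distance.
coefficients = np.polyfit(x, y, deg=4)
print(coefficients.shape)  # (5,): highest-order term first
```

If the missing flag only altered branch bookkeeping and not the forward computation itself, the (x, y) pairs and therefore the fit should be unaffected; if it changed what the forward pass computes, the fit would be trained on the wrong distribution.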


Question 2: After adding self.do_true_cfg = True to GlmImageTransformer2DModel, TeaCache still shows no speedup:

Baseline: ~31s

TeaCache: ~32s

Additionally, the log line:

"TeaCache applied with..."

never appears in subprocess logs, so I cannot confirm whether the hook is firing during inference. I may be missing an env flag in Colab that enables these logs: when I tested the Z Image TeaCache integration, the "TeaCache applied with..." line likewise did not appear, yet generation time was significantly lower than baseline, so the cache was clearly active. All of this testing can be seen in this colab notebook

So far, I’ve verified:

  • TeaCacheConfig initializes correctly with GLM coefficients
  • apply_teacache_hook() attaches correctly in the main process
  • get_cache_backend("tea_cache") returns a valid backend
  • pipeline.transformer exists and is the expected class
  • _WrappedForward correctly wraps forward() in isolation
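For context, the per-step gating decision the hook is expected to run looks roughly like this (a generic TeaCache sketch with made-up coefficients and distances, not this repo's implementation):

```python
import numpy as np

def should_skip(accumulated: float, rel_l1_dist: float,
                coefficients: np.ndarray, threshold: float) -> tuple[bool, float]:
    """Rescale the input-side distance with the fitted polynomial and
    accumulate it; skip the transformer while the total stays under threshold."""
    accumulated += float(np.polyval(coefficients, rel_l1_dist))
    if accumulated < threshold:
        return True, accumulated      # reuse the cached residual
    return False, 0.0                 # run a full forward, reset accumulator

# Tiny demo: identity rescale (coefficients are placeholders)
coeffs = np.array([0.0, 0.0, 0.0, 1.0, 0.0])
acc, skipped = 0.0, 0
for dist in [0.02, 0.03, 0.02, 0.08, 0.01]:
    skip, acc = should_skip(acc, dist, coeffs, threshold=0.1)
    skipped += skip
print(skipped)  # 4 of 5 steps skipped
```

If the wrapped forward never reaches this decision (e.g. the hook is bypassed), every step runs the full transformer and the timing matches baseline, which is consistent with the ~31s vs. ~32s numbers above.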

@lishunyang12 (Contributor) left a comment


Left a few comments. The biggest issue is do_true_cfg = True hardcoded in the transformer __init__ — pretty sure that's why the hook never fires correctly (see inline comment).

Also, estimate_teacache_coefficients() is now dead code after inlining its logic into estimate(). Either remove it or keep calling it — having both is confusing.

Same print() statements exist in the standalone function too (lines 132-135 on main), so those should get cleaned up as part of this.

Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
… print with logger

Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
…ew one in forward()

Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
@akshatvishu
Author

akshatvishu commented Feb 28, 2026

Thanks again for the quick review!

Left a few comments. The biggest issue is do_true_cfg = True hardcoded in the transformer __init__ — pretty sure that's why the hook never fires correctly (see inline comment).

Yes, this was indeed the main issue; I am now seeing a 1.3x speedup vs. baseline when testing in the colab, under the heading "Testing Teacache". I suspect the AR stage is a fixed cost (since TeaCache only optimizes the DiT part), which makes the total speedup look smaller. Once my Colab credits are back, I want to profile the AR and DiT stages separately to see the exact impact on the denoising loop.
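The fixed-AR-cost intuition can be checked with a quick Amdahl-style calculation. The stage timings below are made-up placeholders pending actual profiling, not measured numbers:

```python
def end_to_end_speedup(t_ar: float, t_dit: float, dit_speedup: float) -> float:
    """Overall speedup when only the DiT stage is accelerated
    and the AR prior stage remains a fixed cost."""
    baseline = t_ar + t_dit
    accelerated = t_ar + t_dit / dit_speedup
    return baseline / accelerated

# Hypothetical split of the ~31s baseline: 10s AR prior + 21s DiT denoising.
# Even a 2x DiT speedup then yields only ~1.5x end to end.
print(round(end_to_end_speedup(10.0, 21.0, 2.0), 2))  # 1.51
```

So a 1.3x end-to-end number is entirely plausible even if TeaCache is delivering a much larger speedup inside the denoising loop itself.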

Also, estimate_teacache_coefficients() is now dead code after inlining its logic into estimate(). Either remove it or keep calling it — having both is confusing.

I've removed the dead code!

Same print() statements exist in the standalone function too (lines 132-135 on main), so those should get cleaned up as part of this.

Cleaned this up too!

Signed-off-by: akshatvishu <33392262+akshatvishu@users.noreply.github.com>
@akshatvishu akshatvishu marked this pull request as ready for review February 28, 2026 12:47

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6d39eb7a4d


od_config = OmniDiffusionConfig.from_kwargs(model=model_path, dtype=dtype)
od_config.model_class_name = "GlmImagePipeline"

tf_config_path = hf_hub_download(model_path, "transformer/config.json")


P1: Handle local checkpoints when reading GLM transformer config

GlmImageAdapter.load_pipeline unconditionally calls hf_hub_download(model_path, "transformer/config.json"), which treats model_path as a Hub repo id; when callers pass a local model directory (a supported pattern elsewhere in this repo), this path resolution fails before the pipeline is built. That makes TeaCache coefficient estimation for GLM unusable in local/offline setups and breaks parity with other loaders that accept filesystem paths.
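One possible fix, sketched below: branch on whether model_path is a local directory before falling back to the Hub. The helper name is hypothetical; only hf_hub_download comes from the snippet above.

```python
import os
import tempfile

def resolve_transformer_config(model_path: str) -> str:
    """Resolve transformer/config.json for both local checkpoint
    directories and Hub repo ids (hypothetical helper)."""
    if os.path.isdir(model_path):
        local = os.path.join(model_path, "transformer", "config.json")
        if not os.path.exists(local):
            raise FileNotFoundError(local)
        return local
    # Only hit the Hub when given a repo id.
    from huggingface_hub import hf_hub_download
    return hf_hub_download(model_path, "transformer/config.json")

# Demo with a throwaway local checkpoint directory
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "transformer"))
with open(os.path.join(root, "transformer", "config.json"), "w") as f:
    f.write("{}")
print(resolve_transformer_config(root).endswith("config.json"))  # True
```

This keeps offline coefficient estimation working while leaving the existing Hub path untouched for repo ids.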


Comment on lines +194 to 197
sampling_params = OmniDiffusionSamplingParams(
num_inference_steps=generate_kwargs.get("num_inference_steps", 20),
seed=generate_kwargs.get("seed", 42),
)


P1: Pass GLM prior tokens into coefficient collection requests

collect_from_prompt only populates num_inference_steps and seed, then sends a plain prompt; for GlmImagePipeline, non-warmup requests require prior_token_ids (and optionally prior_token_image_ids) in request extras, otherwise pipeline_glm_image.forward raises a ValueError. As written, the new GLM estimator path cannot collect data from normal prompts, so the advertised support is functionally broken unless users bypass this API.
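A minimal sketch of the extras a non-warmup GLM request would need to carry, per this review comment. Only the field names prior_token_ids and prior_token_image_ids come from the comment; the dict shape and helper are assumptions:

```python
def build_glm_request_extras(prior_token_ids, prior_token_image_ids=None):
    """Hypothetical helper: assemble the request extras that
    pipeline_glm_image.forward requires on non-warmup calls."""
    extras = {"prior_token_ids": prior_token_ids}
    if prior_token_image_ids is not None:
        extras["prior_token_image_ids"] = prior_token_image_ids
    return extras

extras = build_glm_request_extras([101, 102, 103])
print(sorted(extras))  # ['prior_token_ids']
```

collect_from_prompt would need to attach something like this alongside num_inference_steps and seed for the GLM estimator path to work on normal prompts.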


@akshatvishu
Author

It seems the new GlmImagePipeline requires prior_token_ids in the request extras for any non-warmup call, which in turn breaks the old coefficient estimation logic. I think we can load the AR model separately inside GlmImageAdapter (on CPU to avoid the memory issue) and generate prior tokens before each forward() call. Is this alright, or do you want this handled in a specific way? @lishunyang12
