
[Feature]: Support cfg kv-cache transfer in multi-stage#1422

Merged
hsliuustc0106 merged 32 commits into vllm-project:main from princepride:feature/cfg-multi-stage on Mar 2, 2026

Conversation

@princepride
Collaborator

@princepride princepride commented Feb 21, 2026

Purpose

Related: #1419

User Request req_0
    │
    ├─── prompt_expand_func ──→ companion req_0__cfg_text
    │
    ▼ (Stage-0: AR/LLM)
    ├── req_0 inference complete ──→ pending in _pending_parent_results
    └── req_0__cfg_text inference complete ──→ cfg_companion_done flag
                                                   │
                                                   ▼ all companions done?
                                                   │
    _forward_parent_with_cfg ◄─────────────────────┘
    │
    │  sp_next.cfg_kv_request_ids = {"cfg_text": "req_0__cfg_text"}
    │
    ▼ (Connector: SharedMemory)
    │
    ▼ (Stage-1: Diffusion/DiT)
    │
    receive_multi_kv_cache:
    │  1. receive_kv_cache(req_0)         → gen KV
    │  2. collect_cfg_kv_caches(cfg_ids)  → cfg_text KV
    │
    ▼ pipeline_bagel.forward:
       gen_context["past_key_values"]      = gen KV
       cfg_text_context["past_key_values"] = cfg_text KV
       cfg_img_context["past_key_values"]  = gen KV (text2img reuse)
       │
       ▼ 3-branch CFG DiT denoising
       │
       ▼ Output Image
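The expansion step at the top of the diagram can be sketched as a small hook. This is an illustrative sketch only: the names `expand_cfg_prompts` and `CompanionPrompt` are hypothetical, while the `__cfg_text` suffix and the default negative prompt come from the PR discussion.

```python
from dataclasses import dataclass

# Suffix and default negative prompt as described in this PR; the rest
# of the names here are hypothetical, not the actual vllm-omni API.
CFG_TEXT_SUFFIX = "__cfg_text"
DEFAULT_NEG_PROMPT = "<|im_start|><|im_end|>"

@dataclass
class CompanionPrompt:
    request_id: str  # e.g. "req_0__cfg_text"
    role: str        # e.g. "cfg_text"
    prompt: str

def expand_cfg_prompts(request_id: str, prompt: dict) -> list[CompanionPrompt]:
    """Expand one user request into its CFG companion requests."""
    if "image" not in prompt.get("modalities", []):
        return []  # text-only tasks need no CFG branches
    negative = prompt.get("negative_prompt") or DEFAULT_NEG_PROMPT
    return [CompanionPrompt(request_id + CFG_TEXT_SUFFIX, "cfg_text", negative)]
```

For `req_0` with an image modality this yields a single `cfg_text` companion, matching the diagram; a text-only request expands to nothing.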

Test Plan

Multi-stage inference now generates outputs with the same high quality as running the DiT stage alone.

Multi-Stage Test:

python3 examples/offline_inference/bagel/end2end.py --prompts "A cute cat" --modality text2img

Diffusion Test:

from vllm_omni.entrypoints.omni_diffusion import OmniDiffusion
from vllm_omni.inputs.data import OmniDiffusionSamplingParams

def main():
    pipeline = OmniDiffusion(
        model="../models/BAGEL-7B-MoT",
    )
    prompts = {
        "prompt": "A cute cat",
    }

    result = pipeline.generate(
        prompts,
        OmniDiffusionSamplingParams(seed=52),
    )
    result[0].images[0].save("bagel_i2i_output.png")

if __name__ == "__main__":
    main()

Test Result

Before:
(image)

After:
(image)

Signed-off-by: princepride <wangzhipeng628@gmail.com>
@princepride
Collaborator Author

@natureofnature @hsliuustc0106 @ZJY0516 Can you take a look after the holiday ends? 😊

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6cde323408

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@hsliuustc hsliuustc left a comment

PR Review Summary

Overview

This PR adds CFG (Classifier-Free Guidance) KV-cache transfer for multi-stage inference in vllm-omni. It enables high-quality 3-branch CFG in Bagel's AR → Diffusion pipeline without degrading image quality.

Stats: 12 files changed, +487 / -18 lines


Critical Issues (Must Fix)

1. Race condition in _forward_parent_with_cfg (P1) 🚨

Location: vllm_omni/entrypoints/omni.py:~1046

Problem: When multiple CFG-enabled requests are in flight, _forward_parent_with_cfg recomputes next_inputs from shared mutable state (self.stage_list[0].engine_outputs) which can be overwritten by a different request's Stage-0 output. This causes the diffusion stage to receive token IDs from the wrong parent request.

Recommended Fix: Use the saved parent_result["engine_outputs"] instead of recomputing from shared state.
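The fix can be illustrated with a toy model of the orchestrator state; `Stage`, `complete_stage0`, and `forward_parent_with_cfg` below are hypothetical stand-ins, not the actual vllm-omni code:

```python
class Stage:
    """Toy stand-in for a pipeline stage with shared mutable output state."""
    def __init__(self):
        self.engine_outputs = None  # overwritten by every completing request

def complete_stage0(stage, pending, req_id, outputs):
    # Snapshot this parent's outputs at completion time.
    stage.engine_outputs = outputs
    pending[req_id] = {"engine_outputs": outputs}

def forward_parent_with_cfg(pending, req_id):
    # Read the saved snapshot, not stage.engine_outputs, which by now
    # may belong to a different request that completed later.
    return pending.pop(req_id)["engine_outputs"]
```

If `req_1` finishes Stage-0 after `req_0` but before `req_0`'s companions complete, the shared `engine_outputs` holds `req_1`'s tokens, while the snapshot in `pending` still returns `req_0`'s.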


2. Missing error handling for companion request failures 🚨

Location: vllm_omni/entrypoints/omni.py ~806-825

Problem: If a CFG companion request fails at Stage-0, the parent request will wait indefinitely → deadlock.

Recommended Fix: Add timeout and error propagation.


3. Memory leak potential with _pending_parent_results 🚨

Location: vllm_omni/entrypoints/omni.py ~935-940

Problem: Failed companion requests are never cleaned up.

Recommended Fix: Implement cleanup on error/timeout paths.
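Issues 2 and 3 can be addressed together by a periodic reaper over the pending-parent table; this is an illustrative sketch with assumed names (`pending`, `failed`, `pending_since`), not the merged implementation:

```python
import time

def reap_expired_parents(pending, failed, timeout_s=120.0, now=None):
    """Drop parents whose CFG companions never finished.

    Removing the entry fixes the leak in _pending_parent_results-style
    state; recording the failure lets the scheduling loop count the
    request as done instead of deadlocking. Names are illustrative.
    """
    now = time.monotonic() if now is None else now
    expired = [pid for pid, rec in pending.items()
               if now - rec["pending_since"] > timeout_s]
    for pid in expired:
        pending.pop(pid)
        failed.add(pid)  # surfaced to the caller as a failed request
    return expired
```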


Important Issues (Should Fix)

4. Missing validation in collect_cfg_kv_caches

vllm_omni/model_executor/stage_input_processors/bagel.py lines 107-139

5. Batch size limitation without documentation

vllm_omni/model_executor/stage_configs/bagel.yaml line 7

6. No handling for img2img case

vllm_omni/model_executor/stage_input_processors/bagel.py line 76


Positive Aspects ✅

  • Clean separation of concerns
  • Extensible hook-based design
  • Backward compatible
  • CI checks passing

Overall Assessment

The core design is sound, but the race condition is a must-fix before merge. Once the critical issues are addressed, this will be a valuable addition.

Action items: Fix race condition, add error handling, implement cleanup for memory leaks.

Contributor

@lishunyang12 lishunyang12 left a comment

Nice approach to CFG KV-cache transfer. I have a few concerns around error handling and concurrency.

…add companion timeout

Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Contributor

@lishunyang12 lishunyang12 left a comment

\

@hsliuustc0106
Collaborator

@vllm-omni-reviewer

@github-actions

🤖 VLLM-Omni PR Review

Code Review: Support CFG KV-Cache Transfer in Multi-Stage

1. Overview

This PR implements CFG (Classifier-Free Guidance) KV-cache transfer for multi-stage inference in VLLM-Omni. The key insight is that BAGEL's 3-branch CFG requires multiple prompts through the AR stage (gen, cfg_text, cfg_img), and their KV caches need to be transferred to the diffusion stage.

Key Changes:

  • Added prompt expansion mechanism to generate CFG companion requests
  • Implemented multi-KV cache collection and transfer
  • Updated BAGEL pipeline to use multiple KV caches for CFG
  • Added model-specific processor functions for BAGEL

Overall Assessment: Positive - The implementation is well-structured and addresses a real limitation. The before/after images demonstrate clear quality improvement.

2. Code Quality

Strengths

  • Good separation of concerns with model-specific functions in bagel.py
  • Comprehensive logging for debugging
  • Proper error handling with try/except blocks
  • Clear docstrings in the new processor module

Issues

vllm_omni/entrypoints/omni.py:906 - Shallow copy concern:

sp_next = copy.copy(sampling_params_list[next_stage_id])

This is a shallow copy. If OmniDiffusionSamplingParams contains nested mutable objects, modifications could affect the original. Consider using copy.deepcopy() or ensuring the class implements __copy__ properly.
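The pitfall is easy to demonstrate with a minimal stand-in for the sampling-params class (the `Params` class and field names below are illustrative, not the real `OmniDiffusionSamplingParams`):

```python
import copy
from dataclasses import dataclass, field

@dataclass
class Params:  # minimal stand-in for OmniDiffusionSamplingParams
    seed: int = 0
    cfg_kv_request_ids: dict = field(default_factory=dict)

base = Params()
shallow = copy.copy(base)               # copies field references only
shallow.cfg_kv_request_ids["k"] = "v"   # base.cfg_kv_request_ids sees this too

original = Params()
deep = copy.deepcopy(original)          # nested dict is duplicated
deep.cfg_kv_request_ids["k"] = "v"      # original stays untouched
```

Rebinding a field (e.g. `sp_next.cfg_kv_request_ids = {...}`) is safe either way; only in-place mutation of a nested object leaks back through a shallow copy.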

vllm_omni/diffusion/models/bagel/pipeline_bagel.py:339-358 - Repeated getattr calls:

cfg_text_kv = getattr(req.sampling_params, "cfg_text_past_key_values", None)
# ... later ...
cfg_text_metadata = getattr(req.sampling_params, "cfg_text_kv_metadata", None)

Consider extracting sampling_params to a local variable and using getattr once per attribute.

vllm_omni/model_executor/stage_input_processors/bagel.py:152 - Magic string for default negative prompt:

return "<|im_start|><|im_end|>"

Consider making this a constant or configurable value.

3. Architecture & Design

Strengths

  • Callback injection pattern: Loading prompt_expand_func and cfg_kv_collect_func from config is elegant and extensible
  • Race condition mitigation: The source_outputs_override parameter addresses the race condition where deferred requests read stale outputs
  • Configurable timeout: VLLM_CFG_PENDING_TIMEOUT_S environment variable for safety timeout

Concerns

vllm_omni/entrypoints/omni_stage.py:236-242 - Dynamic function loading security:

def _load_func_from_config(stage_config: Any, attr_name: str):
    func_path = getattr(stage_config, attr_name, None)
    if not func_path:
        return None
    module_path, func_name = func_path.rsplit(".", 1)
    module = importlib.import_module(module_path)
    return getattr(module, func_name)

This dynamically imports and calls functions based on config. If config files can be user-controlled, this is a potential security vector. Consider:

  1. Adding a whitelist of allowed modules/functions
  2. Documenting that config files should be trusted

vllm_omni/entrypoints/omni.py - The orchestrator logic has grown significantly complex. Consider extracting CFG-related logic into a separate helper class:

class CFGCompanionManager:
    """Manages CFG companion request lifecycle."""
    
    def __init__(self, prompt_expand_func, timeout_s: float = 120.0):
        self.companion_map: dict[str, dict[str, str]] = {}
        self.companion_ids: set[str] = set()
        self.companion_done: dict[str, set[str]] = {}
        self.pending_parents: dict[str, Any] = {}
        self.failed_parents: set[str] = set()
    
    def expand_prompts(self, request_id_to_prompt, sampling_params): ...
    def on_companion_complete(self, req_id): ...
    def check_pending_timeouts(self): ...

4. Security & Safety

Resource Management

  • vllm_omni/entrypoints/omni_stage.py:654-664: Good use of try/finally to restore engine_outputs after temporary modification

Input Validation

  • vllm_omni/distributed/omni_connectors/kv_transfer_manager.py:502-504: The exception handling silently continues after logging. Consider whether this should propagate or return a partial result indicator:
except Exception:
    logger.exception("Failed to collect CFG KV caches for %s", request_id)

Potential Issues

  • Memory: Multiple KV caches per request could increase memory usage significantly. Consider adding memory monitoring or limits.

5. Testing & Documentation

Test Coverage

  • The test plan demonstrates functional correctness with before/after images
  • Missing: unit tests for the new processor functions
  • Missing: edge case tests (companion failure, timeout, empty prompts)

Documentation

  • Good docstrings in bagel.py
  • Missing: documentation for the new config options (prompt_expand_func, cfg_kv_collect_func)
  • Missing: explanation of the CFG companion request flow in architecture docs

Suggested Test Cases

# Test prompt expansion
def test_expand_cfg_prompts_text2img():
    prompt = {"prompt": "A cat", "modalities": ["image"]}
    result = expand_cfg_prompts(prompt, mock_sampling_params)
    assert len(result) == 1
    assert result[0].role == "cfg_text"

# Test timeout handling
def test_cfg_companion_timeout():
    # Verify parent request is properly cleaned up after timeout
    pass

# Test companion failure propagation
def test_cfg_companion_failure_propagates():
    # Verify parent fails when companion fails
    pass

6. Specific Suggestions

vllm_omni/entrypoints/omni.py:772

remaining_by_stage: list[int] = [len(request_prompts) + len(cfg_companion_ids)] + [0] * (num_stages - 1)

Consider adding a comment explaining why companion IDs are counted in stage-0 but not in total_requests.

vllm_omni/entrypoints/omni.py:799-821

The companion failure handling logic is duplicated in two places (error result and timeout). Extract to a helper:

def _handle_companion_failure(self, parent_id: str, reason: str):
    _cfg_failed_parents.add(parent_id)
    logger.error(f"[{self._name}] {reason}")
    if parent_id in _pending_parent_results:
        _pending_parent_results.pop(parent_id)
        # ... rest of cleanup

vllm_omni/diffusion/data.py:382

cfg_kv_collect_func: Any | None = None

Consider using Callable | None with proper signature:

from typing import Callable
CfgKvCollectFunc = Callable[[str, dict[str, str], Any, torch.device | None], dict[str, Any]]
cfg_kv_collect_func: CfgKvCollectFunc | None = None

vllm_omni/model_executor/stage_input_processors/bagel.py:20

The constant CFG_TEXT_SUFFIX is good. Consider adding other suffixes as constants for future img2img support:

CFG_TEXT_SUFFIX = "__cfg_text"
CFG_IMG_SUFFIX = "__cfg_img"  # Reserved for img2img

vllm_omni/inputs/data.py:232-237

Consider grouping CFG-related fields with a comment:

# [Omni] Multi-KV for CFG: populated by model-specific cfg_kv_collect_func
# These fields store companion KV caches for 3-branch CFG
cfg_text_past_key_values: Any | None = None
cfg_img_past_key_values: Any | None = None
cfg_text_kv_metadata: dict[str, Any] | None = None
cfg_img_kv_metadata: dict[str, Any] | None = None
cfg_kv_request_ids: dict[str, str] | None = None  # role -> request_id mapping

7. Approval Status

LGTM with suggestions

The PR is well-designed and achieves its stated goal of enabling CFG in multi-stage inference. The before/after results clearly demonstrate the quality improvement. The architecture is extensible and the error handling is reasonable.

Minor suggestions to address before merge:

  1. Consider using copy.deepcopy or verifying shallow copy is safe
  2. Add documentation for new config options
  3. Consider extracting CFG management logic into a helper class for maintainability
  4. Add unit tests for the processor functions

These are not blocking issues - the core functionality is solid and the code is production-ready. The suggestions are for long-term maintainability.


This review was generated automatically by the VLLM-Omni PR Reviewer Bot
using glm-5.

Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment

Overall the design is solid — the hook-based approach keeps model-specific CFG logic out of the orchestrator core, and the companion request paradigm is a reasonable way to handle multi-branch CFG across disaggregated stages.

Main concerns:

  1. Verify request accounting (completed_requests) is correct in all paths (normal, timeout, companion failure) — bugs there would cause hangs.
  2. O(n) parent lookup for companion requests could be improved with a reverse index.
  3. source_outputs_override temporary mutation of shared state is not thread-safe.
  4. Unused utility functions (is_cfg_companion_request, get_parent_request_id) should be removed or deferred.

See inline comments for details.
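The reverse-index fix for point 2 amounts to inverting the existing companion map once during expansion; the function name below is illustrative (the PR calls the resulting dict cfg_companion_to_parent):

```python
def build_companion_index(companion_map):
    """Invert parent_id -> {role: companion_id} into
    companion_id -> parent_id, replacing the O(parents x roles)
    scan with an O(1) dict lookup."""
    return {companion_id: parent_id
            for parent_id, roles in companion_map.items()
            for companion_id in roles.values()}
```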

princepride and others added 8 commits February 24, 2026 14:15
The parameter is passed as a plain int (value-copied), so any mutation
inside the method would never propagate back to the caller's loop.
The method never uses it either.  Remove to avoid misleading future
maintainers into thinking they can increment it here.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Replace two O(parents × roles) linear scans in the error-handling and
companion-completion paths with a pre-built cfg_companion_to_parent
dict populated during the expansion phase.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
The temporary mutation of stage_list[].engine_outputs is safe today
because the orchestrator loop is single-threaded, but would race
under concurrency. Add a NOTE comment to flag this for future work.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Neither function is called anywhere in the codebase. Removing them
keeps the diff focused; they can be re-added when actually needed.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
The resolved function object is never used in the orchestrator process;
only the raw string travels to the worker for re-resolution.  Add a
comment explaining this is intentional early-validation of the import.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
An empty string negative_prompt is not meaningful for Bagel CFG
guidance. Change `if neg is not None` to `if neg` so that both
None and "" fall through to the default token pair.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Stage-0 max_batch_size=2 assumes single-prompt inference (1 user + 1
CFG companion). For multi-prompt batches it should scale accordingly.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
@princepride princepride force-pushed the feature/cfg-multi-stage branch from c4ef389 to 6cca43d on February 24, 2026 06:51
Signed-off-by: princepride <wangzhipeng628@gmail.com>
@hsliuustc0106
Collaborator

@vllm-omni-reviewer

@github-actions

🤖 VLLM-Omni PR Review

Code Review: Support CFG KV-Cache Transfer in Multi-Stage

1. Overview

This PR implements Classifier-Free Guidance (CFG) support across disaggregated multi-stage inference pipelines through a "companion request" paradigm. The implementation:

  1. Expands prompts in Stage 0 (AR/LLM) to generate companion requests (e.g., negative prompts)
  2. Transfers KV caches for both primary and companion requests across stage boundaries
  3. Collects and injects companion KV caches in downstream diffusion stages for CFG computation

The approach is well-designed, allowing CFG to work efficiently without redundant text encoder evaluations on diffusion workers.

Overall Assessment: Positive - The implementation is solid, well-documented, and follows the existing architecture patterns. A few suggestions for robustness and maintainability are noted below.


2. Code Quality

Strengths

  • Clean separation of model-specific logic into bagel.py
  • Good error handling for companion failures with timeout mechanism
  • Comprehensive logging for debugging
  • Backward compatible - existing configs work without changes

Concerns

Magic Strings: Several string literals are used throughout that should be constants:

# In omni.py and bagel.py
"cfg_text", "cfg_img", "__cfg_text"

Type Annotations: Some functions in bagel.py use Any excessively, which reduces type safety benefits.

Complex State Management: The orchestrator in omni.py now manages multiple dictionaries for CFG state tracking. While functional, this adds complexity.


3. Architecture & Design

Strengths

  • Configuration-driven design: New hooks (prompt_expand_func, cfg_kv_collect_func) are specified in YAML configs
  • Extensibility: Easy to add support for other models by creating new processor modules
  • Consistent patterns: Follows existing stage configuration and worker initialization patterns

Suggestions

Thread Safety Acknowledgment: The comment in omni_stage.py:658-662 correctly identifies a potential issue:

# NOTE: This relies on the orchestrator being single-threaded.
# If concurrency is introduced, replace with a per-call context
# or a thread-local to avoid racing on shared mutable state.

Consider adding a docstring or assertion to document this assumption explicitly.


4. Security & Safety

Dynamic Function Loading

The _load_func_from_config function dynamically imports functions based on config values. This is consistent with existing patterns but could be a concern if untrusted configs are loaded:

vllm_omni/entrypoints/omni_stage.py:236-243

def _load_func_from_config(stage_config: Any, attr_name: str):
    """Dynamically import a function referenced by a dotted path in stage config."""
    func_path = getattr(stage_config, attr_name, None)
    if not func_path:
        return None
    module_path, func_name = func_path.rsplit(".", 1)
    module = importlib.import_module(module_path)
    return getattr(module, func_name)

Suggestion: Consider validating that the function path starts with an allowed prefix (e.g., vllm_omni.) to prevent arbitrary code execution from malicious configs.

Resource Management

The timeout mechanism for pending parents is good:

vllm_omni/entrypoints/omni.py:1028-1042

However, the default timeout of 120 seconds may be too long for some use cases. Consider documenting this environment variable more prominently.
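A sketch of how the variable is presumably read (the helper name is hypothetical; the 120 s default and the VLLM_CFG_PENDING_TIMEOUT_S name come from this PR):

```python
import os

def cfg_pending_timeout_s(default: float = 120.0) -> float:
    """Read VLLM_CFG_PENDING_TIMEOUT_S (seconds), falling back to the
    default on absence or malformed values. Illustrative sketch."""
    raw = os.environ.get("VLLM_CFG_PENDING_TIMEOUT_S")
    if raw is None:
        return default
    try:
        return float(raw)
    except ValueError:
        return default
```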


5. Testing & Documentation

Documentation

Documentation is comprehensive and well-written. The architecture overview clearly explains the CFG companion flow.

Testing

The test changes update reference pixels to match new CFG-enabled output, but there are no unit tests for:

  • expand_cfg_prompts function
  • collect_cfg_kv_caches function
  • CFG companion failure scenarios
  • Timeout handling

Suggestion: Add unit tests for the new processor functions in bagel.py.


6. Specific Suggestions

vllm_omni/model_executor/stage_input_processors/bagel.py

Line 21: Consider defining constants for role names:

# Suggestion
ROLE_CFG_TEXT = "cfg_text"
ROLE_CFG_IMG = "cfg_img"
CFG_TEXT_SUFFIX = "__cfg_text"

Line 45-47: The logic for determining when to expand could be clearer:

# Current
if "image" not in modalities and "img2img" not in modalities:
    return []

# Suggestion - add comment explaining why
# Only expand for image generation tasks (text2img, img2img)
# Text-only tasks don't need CFG expansion
if "image" not in modalities and "img2img" not in modalities:
    return []

Line 138-142: The fallback default for negative prompt could be documented:

# Suggestion - add comment
# Bagel's default unconditional prompt is the empty chat template
# This produces the text-unconditional branch for CFG
return "<|im_start|><|im_end|>"

vllm_omni/entrypoints/omni.py

Line 680-684: Consider extracting CFG state into a dataclass for clarity:

# Suggestion
@dataclass
class CFGState:
    companion_map: dict[str, dict[str, str]]  # parent_id -> {role: companion_id}
    companion_ids: set[str]
    companion_done: dict[str, set[str]]  # parent_id -> set of completed companion_ids
    companion_to_parent: dict[str, str]  # reverse index

Line 826-852: The companion handling logic is nested deeply. Consider extracting to a helper method:

# Suggestion - extract to method
def _handle_completed_companion(self, req_id: str, companion_parent_id: str | None, ...):
    """Handle a completed CFG companion request at Stage-0."""
    ...

Line 1028-1042: The timeout loop iterates over all pending parents each cycle. For efficiency with many pending requests:

# Current - O(n) check every iteration
for parent_id in list(_pending_parent_results.keys()):
    pending_since = _pending_parent_results[parent_id].get("pending_since", _now)

# Suggestion - use a heap or sorted structure for O(log n) expiry checks
# Or check only periodically rather than every iteration
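The heap-based alternative could look like this sketch, assuming a `(deadline, parent_id)` entry is pushed when a parent starts waiting (names and structure are illustrative):

```python
import heapq

def pop_expired(expiry_heap, pending, now):
    """O(log n) expiry checks: pop (deadline, parent_id) entries whose
    deadline has passed. Entries for parents that already completed and
    left `pending` are skipped lazily, so no per-iteration O(n) sweep
    is needed. Illustrative sketch."""
    expired = []
    while expiry_heap and expiry_heap[0][0] <= now:
        _, parent_id = heapq.heappop(expiry_heap)
        if parent_id in pending:      # still waiting -> genuinely timed out
            pending.pop(parent_id)
            expired.append(parent_id)
    return expired
```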

vllm_omni/diffusion/models/bagel/pipeline_bagel.py

Line 337-367: The KV cache injection logic handles multiple cases. Consider adding a docstring explaining the CFG KV structure:

# Suggestion - add docstring or comment
# CFG requires 3 KV caches for Bagel:
# 1. gen (conditional): user prompt KV
# 2. cfg_text (text-unconditional): negative/empty prompt KV
# 3. cfg_img (image-unconditional): for text2img, same as gen

vllm_omni/entrypoints/omni_stage.py

Line 658-671: The temporary swap of engine_outputs is clever but fragile. Consider a safer approach:

# Suggestion - pass override explicitly through the call chain
# rather than mutating shared state
return self.custom_process_input_func(
    stage_list, engine_input_source, prompt, self.requires_multimodal_data,
    source_outputs_override=source_outputs_override
)

This would require updating the custom processor signature, but would be safer.

vllm_omni/distributed/omni_connectors/kv_transfer_manager.py

Line 501-503: Good error handling, but consider logging the specific exception:

# Current
except Exception:
    logger.exception("Failed to collect CFG KV caches for %s", request_id)

# Already good - logger.exception includes the traceback

7. Approval Status

LGTM with suggestions

The PR is well-designed and implements an important feature for multi-modal CFG. The architecture is sound, documentation is thorough, and the implementation follows existing patterns. The suggestions above are primarily for:

  1. Maintainability: Constants for magic strings, extracting complex logic
  2. Robustness: Unit tests for new functions, safer state management
  3. Future-proofing: Thread safety documentation

None of the suggestions block merging - they can be addressed in follow-up PRs if preferred. The core functionality is solid and the test results demonstrate the feature works correctly.


This review was generated automatically by the VLLM-Omni PR Reviewer Bot
using glm-5.

Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Contributor

@lishunyang12 lishunyang12 left a comment

Final Review Pass

All my original concerns have been addressed:

  1. Deadlock risk -- Resolved. Companion failure propagation (error path + timeout expiry at VLLM_CFG_PENDING_TIMEOUT_S) covers all three failure scenarios.

  2. Linear scan -- Resolved. cfg_companion_to_parent reverse index gives O(1) lookup.

  3. Race condition on shared engine_outputs -- Resolved. _forward_parent_with_cfg passes saved parent_result["engine_outputs"] via source_outputs_override. Single-threaded assumption NOTE is appreciated.

  4. receive_kv_cache_for_request definition -- Confirmed existing in the codebase.

  5. img2img CFG skip -- Understood; img2img bypasses CFG expansion by design.

  6. Hard failure for connector -- Agreed; KV transfer is required for CFG, RuntimeError is correct.

  7. Batch size comment -- Added and clear.

Also noted that feedback from @hsliuustc0106 was addressed: unused completed_requests param removed, empty-string negative prompt handled, unused utility functions removed, clarifying comments added.

Minor suggestions for follow-up (non-blocking):

  • Unit tests for expand_cfg_prompts and collect_cfg_kv_caches would improve confidence in edge cases
  • Role strings ("cfg_text", "cfg_img") could be extracted to constants alongside CFG_TEXT_SUFFIX
  • cfg_kv_collect_func typing in diffusion/data.py could use Callable instead of Any

LGTM -- approving.

Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
@princepride
Collaborator Author

@hsliuustc0106 @tzhouam I encapsulated the CFG prompt expansion and companion tracking logic in the Omni orchestrator, PTAL

Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
Contributor

@lishunyang12 lishunyang12 left a comment

The refactoring into CfgCompanionTracker is a big improvement. One issue: if forward_parent_with_cfg hits an exception in process_engine_inputs, the parent was already popped from _pending_parents via pop_pending_parent. That means the request is orphaned -- it won't be caught by timeout, and completed_requests never increments, so the scheduling loop hangs.

Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
@princepride
Collaborator Author

> If process_engine_inputs raises here, the parent was already popped from _pending_parents at the call site but never counted as completed. This silently orphans the request and hangs the loop. Either re-insert the parent into _pending_parents on failure, or return a success/failure indicator so the caller can handle it.

Excellent suggestion! I looked at the original code, and it seems that the exception handling in the connector section doesn't update completed_requests either. I will update both the original code and the CFG code.

try:
    # Derive inputs for the next stage, record preprocess time
    with metrics.stage_postprocess_timer(stage_id, req_id):
        next_inputs = next_stage.process_engine_inputs(
            self.stage_list, [request_id_to_prompt[req_id]]
        )
except Exception as e:
    logger.exception(
        f"[{self._name}] Process engine inputs error for req {req_id}"
        f" at stage {next_stage_id}: {e}",
    )
    continue

@hsliuustc0106 @tzhouam I think @lishunyang12's suggestion is reasonable. Our original code did not properly handle data that failed to be sent to the next stage.

Contributor

@lishunyang12 lishunyang12 left a comment

All three concerns addressed. The completed_requests accounting fix in the original code path is a good catch too. LGTM.

@hsliuustc0106
Collaborator

Review Summary

PR Type: Feature - CFG KV-cache transfer for multi-stage inference

Key Observations:

  • No MRO issues (uses composition, not mixins with nn.Module)
  • Clean separation of concerns with CfgCompanionTracker class
  • Visual quality improvement demonstrated in PR description

Suggestions:

  • Add unit tests for CFG logic (prompt expansion, companion tracking)
  • Use specific exception handling instead of bare except Exception:

@hsliuustc0106 hsliuustc0106 added the "ready" label (to trigger Buildkite CI) on Mar 2, 2026
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment

lgtm

@hsliuustc0106 hsliuustc0106 enabled auto-merge (squash) March 2, 2026 03:04
Signed-off-by: princepride <wangzhipeng628@gmail.com>
@hsliuustc0106 hsliuustc0106 merged commit e37a89f into vllm-project:main Mar 2, 2026
6 of 7 checks passed

Labels: ready (label to trigger Buildkite CI)

5 participants