
Commit 68e0f9e

ahengljh and claude committed
[Feature] Add PD (Prefill-Decode) disaggregation for thinker stage
Split the thinker stage into separate prefill and decode instances that communicate via vLLM's native KV transfer (MooncakeConnector). The prefill engine processes prompts and saves the KV cache; the decode engine loads the cache and generates tokens.

Key changes:
- PD detection, validation, and routing in OmniBase and AsyncOmni
- Prefill sampling params: max_tokens=1, neutralize stop conditions
- Patched MooncakeConnector with remote_request_id for cross-engine KV lookup
- Monkey-patch infrastructure with vLLM version compatibility check
- Embedding merge (prefill + decode) in thinker2talker stage processor
- Zero-padding safety with threshold warning in talker model
- Defense-in-depth cleanup of KV params after generation
- Unit tests for PD detection, validation, routing, stop neutralization, failure modes, memory leak prevention, and TP validation
- E2E tests for both text and audio modalities (offline + online)
- PD CI stage config with load_format: dummy

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent e37a89f commit 68e0f9e
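The commit message notes that prefill-side requests get overridden sampling params (max_tokens=1, neutralized stop conditions) so the prefill engine performs exactly one generation step and can never stop early before the KV cache is saved. The actual implementation is not part of this excerpt; the following is a minimal sketch under assumed names (`SamplingParams` here is a local stand-in for vLLM's class, and `neutralize_for_prefill` is a hypothetical helper):

```python
from dataclasses import dataclass, field


@dataclass
class SamplingParams:
    # Minimal local stand-in for vLLM's SamplingParams (illustrative only).
    max_tokens: int = 128
    min_tokens: int = 0
    stop: list = field(default_factory=list)
    stop_token_ids: list = field(default_factory=list)


def neutralize_for_prefill(params: SamplingParams) -> SamplingParams:
    """Prefill engines only need to populate the KV cache, so force a single
    decode step and disable every early-stop condition (hypothetical helper)."""
    params.max_tokens = 1       # one step is enough to flush the prefill KV
    params.min_tokens = 1       # never terminate before that step
    params.stop = []            # no string stop conditions
    params.stop_token_ids = []  # no token-id stop conditions
    return params


p = neutralize_for_prefill(SamplingParams(max_tokens=512, stop=["</s>"]))
print(p.max_tokens, p.stop)  # 1 []
```

The decode engine then receives the original, un-neutralized params and generates the full response from the transferred cache.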

File tree

36 files changed: +4868 −829 lines


docs/configuration/stage_configs.md

Lines changed: 0 additions & 8 deletions

```diff
@@ -135,14 +135,6 @@ Each stage in the `stage_args` list contains the following configuration options
 
 A unique identifier for each stage in the multi-stage pipeline. Stages are numbered sequentially starting from 0, and this ID is used to reference stages in inter-stage dependencies (e.g., `engine_input_source`).
 
-### `prompt_expand_func` (Optional)
-
-A custom Python function hook for the LLM stage (Stage 0) that expands a single incoming prompt object into multiple prompts. This is primarily used for multi-modal Classifier-Free Guidance (CFG), where it generates the necessary companion requests (like a negative text prompt) and tags them with internal roles (e.g., `cfg_text`). This ensures the upstream LLM generates the needed contextual hidden states for both the conditional and unconditional generations simultaneously.
-
-### `cfg_kv_collect_func` (Optional)
-
-A custom Python function hook for downstream diffusion stages (Stage 1+) to collect, map, and process the KV caches transferred from the companion requests fired by `prompt_expand_func`. It aggregates the hidden condition states cleanly (e.g., binding them as `cfg_text_past_key_values` and `cfg_text_kv_metadata`), allowing the diffusion runtime to perform CFG smoothly without redundantly evaluating text paths on the DiT workers.
-
 ### `runtime`
 
 Configuration for disaggregated execution of the stage, controlling how the stage is deployed and executed.
```
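For context on the `prompt_expand_func` hook whose documentation this commit deletes: it maps one incoming prompt to a list of prompts, with CFG companions tagged by an internal role. A hypothetical sketch (the dict keys and default behavior here are assumptions, not the removed implementation):

```python
def expand_cfg_prompts(prompt: dict) -> list[dict]:
    """Hypothetical sketch of a prompt_expand_func hook: pair an incoming
    prompt with a negative CFG companion tagged with an internal role."""
    companion = {
        "prompt": prompt.get("negative_prompt", ""),  # empty negative prompt by default
        "modalities": prompt.get("modalities", ["image"]),
        "role": "cfg_text",  # internal tag consumed by the downstream collector
    }
    return [prompt, companion]


expanded = expand_cfg_prompts({"prompt": "a cat", "modalities": ["image"]})
print([p.get("role", "primary") for p in expanded])  # ['primary', 'cfg_text']
```

The matching `cfg_kv_collect_func` would then gather the companion's KV cache on the diffusion side under keys like `cfg_text_past_key_values`.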

docs/design/architecture_overview.md

Lines changed: 0 additions & 7 deletions

```diff
@@ -92,13 +92,6 @@ The framework achieves high performance through several optimization techniques:
 * **Quantization:** Supports various quantization implementations including FP8 and AWQ.
 * **FusedOps:** Allows for custom and third-party integration.
 
-### Classifier-Free Guidance (CFG) Companion Flow
-
-vLLM-Omni natively models Classifier-Free Guidance (CFG) across disaggregated multi-stage setups via a "companion request" paradigm, eliminating redundant textual/multimodal context computation boundaries:
-1. **Prompt Expansion:** In the initial autoregressive (AR) stage, a customized `prompt_expand_func` hook intercepts incoming generation prompts and pairs them directly with negative companion prompts (e.g., a default negative prompt) on the fly, tagging the secondary prompt with a specific internal role (`cfg_text`).
-2. **Synchronized KV Cache Transfer:** The AR stage evaluates both the primary and companion sequence batches concurrently. The `OmniConnector` captures these specific structural dependencies and reliably passes the positive and negative outcome KV caches seamlessly across stage boundaries via shared memory or network protocols.
-3. **KV Cache Collection & Injection:** Upon reaching the downstream Diffusion (DiT) Engine, an assigned `cfg_kv_collect_func` automatically intercepts the mapped companion caches (`cfg_text_past_key_values`). These auxiliary dependencies are natively gathered and seamlessly bound to the primary generation sequence variables, enabling the DiT Engine to cleanly implement cross-attention CFG guidance over accurate conditioning and unconditioning structures in parallel.
-
 ### Flexibility and Usability
 
 vLLM-Omni is designed to be flexible and straightforward for users:
```

examples/offline_inference/bagel/end2end.py

Lines changed: 2 additions & 7 deletions

```diff
@@ -49,7 +49,7 @@ def parse_args():
     parser.add_argument("--cfg-text-scale", type=float, default=4.0, help="Text CFG scale (default: 4.0)")
     parser.add_argument("--cfg-img-scale", type=float, default=1.5, help="Image CFG scale (default: 1.5)")
     parser.add_argument(
-        "--negative-prompt", type=str, default=None, help="Negative prompt for CFG (default: empty prompt)"
+        "--negative-prompt", type=str, default=None, help="Negative prompt (not yet supported, reserved for future)"
     )
 
     args = parser.parse_args()
@@ -162,8 +162,6 @@ def main():
         # text2img
         final_prompt_text = f"<|im_start|>{p}<|im_end|>"
         prompt_dict = {"prompt": final_prompt_text, "modalities": ["image"]}
-        if args.negative_prompt is not None:
-            prompt_dict["negative_prompt"] = args.negative_prompt
        formatted_prompts.append(prompt_dict)
 
     params_list = omni.default_sampling_params_list
@@ -172,13 +170,10 @@ def main():
     if len(params_list) > 1:
         diffusion_params = params_list[1]
         diffusion_params.num_inference_steps = args.steps  # type: ignore
-        extra = {
+        diffusion_params.extra_args = {  # type: ignore
             "cfg_text_scale": args.cfg_text_scale,
             "cfg_img_scale": args.cfg_img_scale,
         }
-        if args.negative_prompt is not None:
-            extra["negative_prompt"] = args.negative_prompt
-        diffusion_params.extra_args = extra  # type: ignore
 
     omni_outputs = list(omni.generate(prompts=formatted_prompts, sampling_params_list=params_list))
```

tests/diffusion/test_diffusion_model_runner.py

Lines changed: 1 addition & 4 deletions

```diff
@@ -56,10 +56,7 @@ def _make_runner(cache_backend, cache_backend_name: str, enable_cache_dit_summar
         enable_cache_dit_summary=enable_cache_dit_summary,
         parallel_config=SimpleNamespace(use_hsdp=False),
     )
-    runner.kv_transfer_manager = SimpleNamespace(
-        receive_kv_cache=lambda req, target_device=None: None,
-        receive_multi_kv_cache=lambda req, cfg_kv_collect_func=None, target_device=None: None,
-    )
+    runner.kv_transfer_manager = SimpleNamespace(receive_kv_cache=lambda req, target_device: None)
     return runner
 
 
```
tests/e2e/offline_inference/stage_configs/bagel_mooncake_ci.yaml

Lines changed: 0 additions & 2 deletions

```diff
@@ -4,7 +4,6 @@
 stage_args:
   - stage_id: 0
     stage_type: llm
-    prompt_expand_func: vllm_omni.model_executor.stage_input_processors.bagel.expand_cfg_prompts
     runtime:
       devices: "0"
       max_batch_size: 1
@@ -40,7 +39,6 @@ stage_args:
         to_stage_1: mooncake_connector
   - stage_id: 1
     stage_type: diffusion
-    cfg_kv_collect_func: vllm_omni.model_executor.stage_input_processors.bagel.collect_cfg_kv_caches
     runtime:
       devices: "0"
       max_batch_size: 1
```

tests/e2e/offline_inference/stage_configs/bagel_sharedmemory_ci.yaml

Lines changed: 0 additions & 2 deletions

```diff
@@ -4,7 +4,6 @@
 stage_args:
   - stage_id: 0
     stage_type: llm
-    prompt_expand_func: vllm_omni.model_executor.stage_input_processors.bagel.expand_cfg_prompts
     runtime:
       devices: "0"
       max_batch_size: 1
@@ -39,7 +38,6 @@ stage_args:
 
   - stage_id: 1
     stage_type: diffusion
-    cfg_kv_collect_func: vllm_omni.model_executor.stage_input_processors.bagel.collect_cfg_kv_caches
     runtime:
       devices: "0"
       max_batch_size: 1
```

tests/e2e/offline_inference/test_bagel_text2img.py

Lines changed: 10 additions & 14 deletions

```diff
@@ -37,16 +37,16 @@
 # "Generated with seed=52, num_inference_steps=15,
 # prompt='A futuristic city skyline at twilight, cyberpunk style'"
 REFERENCE_PIXELS = [
-    {"position": (100, 100), "rgb": (49, 96, 134)},
-    {"position": (400, 50), "rgb": (63, 127, 167)},
-    {"position": (700, 100), "rgb": (70, 101, 141)},
-    {"position": (150, 400), "rgb": (115, 90, 150)},
-    {"position": (512, 512), "rgb": (98, 86, 119)},
-    {"position": (700, 400), "rgb": (29, 42, 91)},
-    {"position": (100, 700), "rgb": (47, 50, 88)},
-    {"position": (400, 700), "rgb": (36, 52, 91)},
-    {"position": (700, 700), "rgb": (45, 58, 99)},
-    {"position": (256, 256), "rgb": (62, 94, 135)},
+    {"position": (100, 100), "rgb": (68, 107, 134)},
+    {"position": (400, 50), "rgb": (95, 139, 166)},
+    {"position": (700, 100), "rgb": (99, 122, 151)},
+    {"position": (150, 400), "rgb": (111, 125, 153)},
+    {"position": (512, 512), "rgb": (97, 107, 131)},
+    {"position": (700, 400), "rgb": (48, 64, 98)},
+    {"position": (100, 700), "rgb": (79, 63, 84)},
+    {"position": (400, 700), "rgb": (40, 58, 79)},
+    {"position": (700, 700), "rgb": (60, 75, 103)},
+    {"position": (256, 256), "rgb": (97, 128, 156)},
 ]
 
 # Maximum allowed difference per color channel
@@ -80,10 +80,6 @@ def _configure_sampling_params(omni: Omni, max_tokens: int = 1, num_inference_st
     params_list[0].max_tokens = max_tokens  # type: ignore
     if len(params_list) > 1:
         params_list[1].num_inference_steps = num_inference_steps  # type: ignore
-        params_list[1].extra_args = {  # type: ignore
-            "cfg_text_scale": 4.0,
-            "cfg_img_scale": 1.5,
-        }
     return params_list
 
 
```
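The `REFERENCE_PIXELS` table above is compared against the generated image with a per-channel tolerance ("Maximum allowed difference per color channel"); the constant's value is truncated out of this hunk. A sketch of that comparison, with the tolerance value assumed:

```python
# Assumed tolerance value; the actual constant is truncated out of this hunk.
MAX_CHANNEL_DIFF = 32


def pixels_match(actual_rgb, reference_rgb, tolerance=MAX_CHANNEL_DIFF):
    """Per-channel absolute-difference check, sketching how reference pixels
    would be validated against a freshly generated image."""
    return all(abs(a - r) <= tolerance for a, r in zip(actual_rgb, reference_rgb))


print(pixels_match((68, 107, 134), (70, 105, 130)))  # True
print(pixels_match((68, 107, 134), (0, 0, 0)))       # False
```

Loose per-pixel tolerances like this keep the E2E test stable across minor kernel or seed-handling changes while still catching gross regressions, which is why the reference values could simply be re-baselined in this commit.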

Lines changed: 66 additions & 0 deletions (new file)

```python
"""
E2E offline tests for Qwen3-Omni-MoE with PD (Prefill-Decode) disaggregation.

Tests both text-only and audio output modalities through the 4-stage
PD pipeline: Prefill -> Decode -> Talker -> Code2Wav.
"""

import os

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
os.environ["VLLM_TEST_CLEAN_GPU_MEMORY"] = "0"

from pathlib import Path

import pytest

from tests.conftest import (
    generate_synthetic_video,
)
from tests.utils import hardware_test

models = ["Qwen/Qwen3-Omni-30B-A3B-Instruct"]

# PD disaggregation CI stage config (requires 3x GPUs)
stage_configs = [str(Path(__file__).parent.parent / "stage_configs" / "qwen3_omni_pd_ci.yaml")]

# Create parameter combinations for model and stage config
test_params = [(model, stage_config) for model in models for stage_config in stage_configs]


def get_question(prompt_type="video"):
    prompts = {
        "video": "Describe the video briefly.",
        "text": "What is the capital of China? Answer in 20 words.",
    }
    return prompts.get(prompt_type, prompts["video"])


@pytest.mark.core_model
@pytest.mark.omni
@hardware_test(res={"cuda": "H100"}, num_cards=3)
@pytest.mark.parametrize("omni_runner", test_params, indirect=True)
def test_pd_text_only(omni_runner, omni_runner_handler) -> None:
    """Test PD disaggregation with text-only output (no talker/code2wav)."""
    request_config = {
        "prompts": get_question("text"),
        "modalities": ["text"],
    }
    omni_runner_handler.send_request(request_config)


@pytest.mark.core_model
@pytest.mark.omni
@hardware_test(res={"cuda": "H100"}, num_cards=3)
@pytest.mark.parametrize("omni_runner", test_params, indirect=True)
def test_pd_video_to_audio(omni_runner, omni_runner_handler) -> None:
    """Test PD disaggregation with video input and audio output
    through the full 4-stage pipeline."""
    video = generate_synthetic_video(224, 224, 300)["np_array"]

    request_config = {
        "prompts": get_question("video"),
        "videos": video,
        "modalities": ["audio"],
    }
    omni_runner_handler.send_request(request_config)
```
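These tests run against a stage config that splits the thinker into prefill and decode stages, which the engine must detect and validate (the commit message's "PD detection, validation, and routing in OmniBase and AsyncOmni"). That code is not in this excerpt; the following is a hypothetical sketch, where the helper name and the `llm_prefill`/`llm_decode` stage-type values are assumptions:

```python
def detect_pd_stages(stage_args):
    """Hypothetical sketch of PD detection: return the (prefill, decode)
    stage-id pair if the config defines exactly one of each, else None."""
    prefill = [s["stage_id"] for s in stage_args if s.get("stage_type") == "llm_prefill"]
    decode = [s["stage_id"] for s in stage_args if s.get("stage_type") == "llm_decode"]
    if len(prefill) == 1 and len(decode) == 1:
        return prefill[0], decode[0]
    return None  # not a PD deployment; route as a normal pipeline


stages = [
    {"stage_id": 0, "stage_type": "llm_prefill"},
    {"stage_id": 1, "stage_type": "llm_decode"},
    {"stage_id": 2, "stage_type": "talker"},
]
print(detect_pd_stages(stages))  # (0, 1)
```

When a pair is detected, requests would first be routed to the prefill stage with neutralized sampling params, then replayed on the decode stage against the transferred KV cache.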
Lines changed: 122 additions & 0 deletions (new file)

```python
"""
E2E online serving tests for Qwen3-Omni-MoE with PD (Prefill-Decode) disaggregation.

Tests both text-only and audio output modalities via the OpenAI-compatible API
through the 4-stage PD pipeline: Prefill -> Decode -> Talker -> Code2Wav.
"""

import os
from pathlib import Path

import pytest

from tests.conftest import (
    dummy_messages_from_mix_data,
    generate_synthetic_audio,
    generate_synthetic_image,
    generate_synthetic_video,
)
from tests.utils import hardware_test

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
os.environ["VLLM_TEST_CLEAN_GPU_MEMORY"] = "0"

models = ["Qwen/Qwen3-Omni-30B-A3B-Instruct"]

# PD disaggregation CI stage config (requires 3x GPUs)
stage_configs = [str(Path(__file__).parent.parent / "stage_configs" / "qwen3_omni_pd_ci.yaml")]

# Create parameter combinations for model and stage config
test_params = [(model, stage_config) for model in models for stage_config in stage_configs]


def get_system_prompt():
    return {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": (
                    "You are Qwen, a virtual human developed by the Qwen Team, "
                    "Alibaba Group, capable of perceiving auditory and visual inputs, "
                    "as well as generating text and speech."
                ),
            }
        ],
    }


def get_prompt(prompt_type="text_only"):
    prompts = {
        "text_only": "What is the capital of China? Answer in 20 words.",
        "mix": "What is recited in the audio? What is in this image? Describe the video briefly.",
    }
    return prompts.get(prompt_type, prompts["text_only"])


@pytest.mark.advanced_model
@pytest.mark.core_model
@pytest.mark.omni
@hardware_test(res={"cuda": "H100"}, num_cards=3)
@pytest.mark.parametrize("omni_server", test_params, indirect=True)
def test_pd_text_to_text(omni_server, openai_client) -> None:
    """
    Test PD disaggregation with text-only output via OpenAI API.
    Deploy Setting: PD separation yaml
    Input Modal: text
    Output Modal: text
    Input Setting: stream=False
    Datasets: single request
    """
    messages = dummy_messages_from_mix_data(
        system_prompt=get_system_prompt(),
        content_text=get_prompt("text_only"),
    )

    request_config = {
        "model": omni_server.model,
        "messages": messages,
        "stream": False,
        "modalities": ["text"],
        "key_words": {"text": ["beijing"]},
    }

    openai_client.send_request(request_config)


@pytest.mark.advanced_model
@pytest.mark.core_model
@pytest.mark.omni
@hardware_test(res={"cuda": "H100"}, num_cards=3)
@pytest.mark.parametrize("omni_server", test_params, indirect=True)
def test_pd_mix_to_text_audio(omni_server, openai_client) -> None:
    """
    Test PD disaggregation with multi-modal input and text+audio output via OpenAI API.
    Deploy Setting: PD separation yaml
    Input Modal: text + audio + video + image
    Output Modal: text + audio
    Input Setting: stream=True
    Datasets: single request
    """
    video_data_url = f"data:video/mp4;base64,{generate_synthetic_video(224, 224, 300)['base64']}"
    image_data_url = f"data:image/jpeg;base64,{generate_synthetic_image(224, 224)['base64']}"
    audio_data_url = f"data:audio/wav;base64,{generate_synthetic_audio(5, 1)['base64']}"
    messages = dummy_messages_from_mix_data(
        system_prompt=get_system_prompt(),
        video_data_url=video_data_url,
        image_data_url=image_data_url,
        audio_data_url=audio_data_url,
        content_text=get_prompt("mix"),
    )

    request_config = {
        "model": omni_server.model,
        "messages": messages,
        "stream": True,
        "key_words": {
            "audio": ["water", "chirping", "crackling", "rain"],
            "image": ["square", "quadrate"],
        },
    }

    openai_client.send_request(request_config)
```
