[CI] Modify some CI test cases to run on L4 environment to reduce H100 resource usage. #1543
Changes from all commits:
```diff
@@ -3,36 +3,18 @@
 import pytest

 from tests.conftest import OmniServer
 from tests.utils import hardware_test

-models = ["Qwen/Qwen3-Omni-30B-A3B-Instruct"]
-stage_configs = [str(Path(__file__).parent.parent / "e2e" / "stage_configs" / "qwen3_omni_ci.yaml")]
+models = ["Qwen/Qwen2.5-Omni-7B"]
+stage_configs = [str(Path(__file__).parent.parent / "e2e" / "stage_configs" / "qwen2_5_omni_ci.yaml")]
```

Contributor: Switching from Qwen3-30B to Qwen2.5-7B means benchmark numbers are no longer comparable across runs. If this test is meant to track perf regressions over time, consider keeping a Qwen3 benchmark on H100 (even if less frequent) alongside this L4 one.
```diff
 # Create parameter combinations for model and stage config
 test_params = [(model, stage_config) for model in models for stage_config in stage_configs]


 @pytest.fixture(scope="module")
 def omni_server(request):
     """Start vLLM-Omni server as a subprocess with actual model weights.
     Uses session scope so the server starts only once for the entire test session.
     Multi-stage initialization can take 10-20+ minutes.
     """
     model, stage_config_path = request.param

     print(f"Starting OmniServer with model: {model}")
     print("This may take 10-20+ minutes for initialization...")

     with OmniServer(model, ["--stage-configs-path", stage_config_path, "--stage-init-timeout", "120"]) as server:
         print("OmniServer started successfully")
         yield server
         print("OmniServer stopped")


 @pytest.mark.core_model
 @pytest.mark.benchmark
-@hardware_test(res={"cuda": "H100"}, num_cards=2)
+@hardware_test(res={"cuda": "L4"}, num_cards=3)
 @pytest.mark.parametrize("omni_server", test_params, indirect=True)
 def test_bench_serve_chat(omni_server):
     command = [
```
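A minimal sketch of the reviewer's suggestion: keep the Qwen3/H100 benchmark as a separate, lower-frequency job next to the per-PR L4 one. The `nightly` marker, the test name, and the hard-coded config path below are illustrative assumptions, not part of this PR.

```python
# Hypothetical companion benchmark, assuming the old qwen3_omni_ci.yaml is
# kept and a scheduled "nightly" pytest marker exists in CI.
import pytest

from tests.utils import hardware_test


@pytest.mark.core_model
@pytest.mark.benchmark
@pytest.mark.nightly  # assumed marker: run on a schedule, not on every PR
@hardware_test(res={"cuda": "H100"}, num_cards=2)
@pytest.mark.parametrize(
    "omni_server",
    [("Qwen/Qwen3-Omni-30B-A3B-Instruct", "tests/e2e/stage_configs/qwen3_omni_ci.yaml")],
    indirect=True,
)
def test_bench_serve_chat_h100(omni_server):
    ...  # same benchmark body as test_bench_serve_chat
```

That would preserve a comparable Qwen3 baseline over time while the cheaper L4 run gates every PR.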
New file `qwen2_5_omni_ci.yaml` (the stage config referenced by the test above):

```yaml
# @@ -0,0 +1,31 @@
stage_args:
  - stage_id: 0
    runtime:
      process: true # Run this stage in a separate process
      devices: "0" # Visible devices for this stage (CUDA_VISIBLE_DEVICES/torch.cuda.set_device)
      max_batch_size: 1
    engine_args:
      model_stage: thinker
      model_arch: Qwen2_5OmniForConditionalGeneration
      worker_type: ar
      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
      max_model_len: 16384
      max_num_batched_tokens: 16384
      max_num_seqs: 1
      gpu_memory_utilization: 0.9
      skip_mm_profiling: true
      enforce_eager: true # Now we only support eager mode
      trust_remote_code: true
      engine_output_type: latent
      enable_prefix_caching: false
      is_comprehension: true
      final_output: true
      final_output_type: text
    default_sampling_params:
      temperature: 0.0
      top_p: 1.0
      top_k: -1
      max_tokens: 128
      seed: 42
      detokenize: True
      repetition_penalty: 1.1
```
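Since multi-stage startup takes tens of minutes, a cheap pre-flight check on the config lets a bad edit fail fast. A minimal sketch using only PyYAML; the path and the specific assertions are illustrative assumptions, not part of this PR:

```python
# Sanity-check the CI stage config before paying for server startup.
import yaml

with open("tests/e2e/stage_configs/qwen2_5_omni_ci.yaml") as f:
    cfg = yaml.safe_load(f)

stage = cfg["stage_args"][0]
assert stage["engine_args"]["model_arch"] == "Qwen2_5OmniForConditionalGeneration"
assert stage["engine_args"]["enforce_eager"] is True  # only eager mode is supported
assert stage["default_sampling_params"]["seed"] == 42  # deterministic CI output
```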
This file was deleted. (Presumably the old `qwen3_omni_ci.yaml` stage config, which the test above no longer references.)
```diff
@@ -236,7 +236,6 @@ def test_modality_control_003(omni_server) -> None:
     # TODO: Verify the E2E latency after confirmation baseline.


-@pytest.mark.skip(reason="There is a known issue with stream error.")
 @pytest.mark.advanced_model
```
Contributor: Which fix resolved the stream error? Worth adding a comment or linking the PR in the commit message so this does not get re-skipped later.
```diff
 @pytest.mark.omni
 @hardware_test(res={"cuda": "L4", "rocm": "MI325"}, num_cards={"cuda": 4, "rocm": 2})
```
```diff
@@ -19,6 +19,7 @@ stage_args:
       engine_output_type: latent
       enable_prefix_caching: false
       max_num_batched_tokens: 32768
+      mm_processor_cache_gb: 0
```
Contributor (Author): Please see #1534 for the reason of the change.

Contributor: I saw #1534; it makes sense for the CI config. But this same change is also added to the production stage configs (qwen2_5_omni.yaml and qwen2_5_omni_multiconnector.yaml), which disables the mm processor cache for all users, not just CI. Was that intentional? If it is only needed to work around an L4 memory constraint, keep it in the CI configs only.

Collaborator: We make accuracy the higher priority.
||
| is_comprehension: true | ||
| final_output: true | ||
| final_output_type: text | ||
|
|
||
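If the cache only had to be off on L4, one alternative to editing the production YAMLs would be a CI-side override applied when the config is loaded. A hedged sketch under that assumption; `load_stage_config` and its `ci` flag are hypothetical helpers, not part of vLLM-Omni:

```python
# Hypothetical helper: apply the CI-only mm-processor-cache override at load
# time instead of editing the shipped stage configs.
import yaml


def load_stage_config(path: str, ci: bool = False) -> dict:
    with open(path) as f:
        cfg = yaml.safe_load(f)
    if ci:
        for stage in cfg["stage_args"]:
            # Disable the multimodal processor cache only for CI runs.
            stage.setdefault("engine_args", {})["mm_processor_cache_gb"] = 0
    return cfg
```

The thread resolves it differently: the change stays in the production configs because the team treats accuracy as the higher priority.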
On the Buildkite pipeline change: the old config had `timeout_in_minutes: 15` at the Buildkite level. The inner `timeout 15m` only kills the bash process; if the Docker pull or container startup hangs, Buildkite will wait forever. Add `timeout_in_minutes` back.
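For reference, a minimal sketch of what that comment asks for; the step label, agent queue, and script path are placeholders, since the pipeline file itself is not shown in this diff:

```yaml
steps:
  - label: "e2e-tests-l4"        # placeholder label
    timeout_in_minutes: 15       # Buildkite-level limit: also covers docker pull / container startup
    agents:
      queue: "l4"                # placeholder queue
    command: "bash run_tests.sh" # placeholder script; an inner `timeout 15m` alone is not enough
```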