Conversation

@yossiovadia
Collaborator

This commit introduces a mock vLLM server infrastructure to enable e2e testing without requiring GPU resources. The mock infrastructure simulates intelligent routing behavior while maintaining compatibility with the existing semantic router.

Key changes:

  • Add mock-vllm-server.py: Simulates vLLM OpenAI-compatible API with intelligent content-based routing (math queries → TinyLlama, general → Qwen)
  • Add start-mock-servers.sh: Launch mock servers in foreground mode
  • Add start/stop-vllm-servers.sh: Manage real vLLM server instances
  • Update config.yaml: Add minimal vLLM endpoint configuration for Qwen (port 8000) and TinyLlama (port 8001) with smart routing preference
  • Update 00-client-request-test.py: Fix import path and use configured model
  • Update e2e-tests/README.md: Document mock infrastructure usage
  • Update build-run-test.mk: Add mock server management targets

The mock infrastructure enables:

  • Fast e2e testing without GPU dependencies
  • Content-aware model selection simulation (see the sketch below)
  • vLLM API compatibility testing
  • Smart routing behavior validation
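
To make the simulated routing concrete, here is a minimal, hypothetical sketch of the content-based selection described above; the keyword list, function name, and model identifiers are illustrative assumptions, not the actual mock-vllm-server.py code.

```python
# Hypothetical sketch only; mock-vllm-server.py may use different keywords,
# names, and model identifiers.
MATH_KEYWORDS = {"integral", "derivative", "solve", "equation", "calculate"}

def pick_model(messages: list) -> str:
    """Route math-looking prompts to TinyLlama, everything else to Qwen."""
    text = " ".join(m.get("content", "") for m in messages).lower()
    if any(keyword in text for keyword in MATH_KEYWORDS):
        return "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed TinyLlama model id
    return "Qwen/Qwen2-0.5B-Instruct"  # model name used later in this PR's e2e config
```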

Release Notes: No

@netlify

netlify bot commented Sep 25, 2025

Deploy Preview for vllm-semantic-router ready!

🔨 Latest commit: 41c6186
🔍 Latest deploy log: https://app.netlify.com/projects/vllm-semantic-router/deploys/68d6c898889080000864b7b3
😎 Deploy Preview: https://deploy-preview-228--vllm-semantic-router.netlify.app

@Xunzhuo
Member

Xunzhuo commented Sep 26, 2025

great! can you make it run in CI?

@yossiovadia
Collaborator Author

yossiovadia commented Sep 26, 2025 via email

@rootfs
Collaborator

rootfs commented Sep 26, 2025

> making some changes for it to use a real (tiny) model and not a mock: https://pypi.org/project/llm-katan/ (will submit probably tomorrow after some more testing)

That's so cool!

@rootfs
Collaborator

rootfs commented Sep 26, 2025

@yossiovadia can you remove the whl file, and the dist and egg-info folders? They are generated by the Python build. Or can you add them to .gitignore?

@yossiovadia yossiovadia force-pushed the mock-vllm-infrastructure branch from e2bb7d3 to 52a2b54 on September 26, 2025 02:14
- "phi4" # Same model can be served by multiple endpoints for redundancy
- "mistral-small3.1"
weight: 2 # Higher weight for more powerful endpoint
- name: "qwen-endpoint"

can you create an e2e config.yaml (just like the mock vllm config.testing.yaml)
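
For reference, a hedged sketch of what an endpoint entry in such a config.e2e.yaml could look like, reusing the fields visible in the snippet above; the vllm_endpoints key, the address/port fields, and the exact TinyLlama model id are assumptions, while the model names and ports follow the PR description.

```yaml
# Hypothetical config.e2e.yaml fragment, not the file added in this PR.
vllm_endpoints:                      # top-level key name assumed
  - name: "qwen-endpoint"
    address: "127.0.0.1"             # field assumed
    port: 8000                       # port from the PR description
    models:
      - "Qwen/Qwen2-0.5B-Instruct"
    weight: 1
  - name: "tinyllama-endpoint"
    address: "127.0.0.1"
    port: 8001
    models:
      - "TinyLlama/TinyLlama-1.1B-Chat-v1.0"   # exact model id assumed
    weight: 1
```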

models:
- "Qwen/Qwen2-0.5B-Instruct"
weight: 1
health_check_path: "/health"

it is removed in #201

maintainers = [
{name = "Yossi Ovadia", email = "[email protected]"}
]
license = {text = "MIT"}

can you use Apache license?
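
For reference, the Apache-2.0 version of the metadata quoted above would look roughly like this; the classifier string is the standard PyPI trove classifier that later commits in this PR adopt.

```toml
license = {text = "Apache-2.0"}
classifiers = [
    "License :: OSI Approved :: Apache Software License",
]
```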

@rootfs rootfs added this to the v0.1 milestone Sep 26, 2025
@yossiovadia yossiovadia force-pushed the mock-vllm-infrastructure branch 2 times, most recently from 184ed74 to 7683d8c on September 26, 2025 16:06
yossiovadia and others added 8 commits September 26, 2025 09:09

feat: add mock vLLM infrastructure for lightweight e2e testing

This commit introduces a mock vLLM server infrastructure to enable e2e
testing without requiring GPU resources. The mock infrastructure simulates
intelligent routing behavior while maintaining compatibility with the
existing semantic router.

Key changes:
- Add mock-vllm-server.py: Simulates vLLM OpenAI-compatible API with
  intelligent content-based routing (math queries → TinyLlama, general → Qwen)
- Add start-mock-servers.sh: Launch mock servers in foreground mode
- Update config.yaml: Add minimal vLLM endpoint configuration for
  Qwen (port 8000) and TinyLlama (port 8001) with smart routing preference
- Update 00-client-request-test.py: Fix import path and use configured model
- Update e2e-tests/README.md: Document mock infrastructure usage
- Update build-run-test.mk: Add mock server management targets

The mock infrastructure enables:
- Fast e2e testing without GPU dependencies
- Content-aware model selection simulation
- vLLM API compatibility testing
- Smart routing behavior validation

Signed-off-by: Yossi Ovadia <[email protected]>

feat: replace mock vLLM infrastructure with LLM Katan package

Replace the mock vLLM server with a real FastAPI-based implementation using HuggingFace transformers and tiny models. The new LLM Katan package provides actual inference while maintaining lightweight testing benefits.

Key changes:
- Add complete LLM Katan PyPI package (v0.1.4) under e2e-tests/
- FastAPI server with OpenAI-compatible endpoints (/v1/chat/completions, /v1/models, /health, /metrics)
- Real Qwen/Qwen3-0.6B model with name aliasing for multi-model testing
- Enhanced logging and Prometheus metrics endpoint
- CLI tool with comprehensive configuration options
- Replace start-mock-servers.sh with start-llm-katan.sh
- Update e2e-tests README with new LLM Katan usage instructions
- Remove obsolete mock-vllm-server.py and start-mock-servers.sh

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Yossi Ovadia <[email protected]>
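
As a hedged illustration of the OpenAI-compatible endpoints listed in this commit, a minimal smoke test might look like the following; the host, port, served model name, and response shape are assumptions based on the e2e config discussed elsewhere in this PR.

```python
# Hypothetical smoke test against a local LLM Katan server; adjust host, port,
# and model name to your setup. Response shape assumed OpenAI-compatible.
import json
import urllib.request

payload = {
    "model": "Qwen/Qwen2-0.5B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])
```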

docs: add HuggingFace token setup instructions to LLM Katan README

Add comprehensive setup section covering HuggingFace token requirements with three authentication methods:
- Environment variable (HUGGINGFACE_HUB_TOKEN)
- CLI login (huggingface-cli login)
- Token file in home directory

Explains why token is needed (private models, rate limits, reliable downloads) and provides direct link to HuggingFace token settings.

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Yossi Ovadia <[email protected]>
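
The first two authentication methods above map to commands like these; the token value is a placeholder, and the token-file option is omitted because its exact path depends on the huggingface_hub version.

```bash
# Option 1: environment variable (name taken from the README section described above)
export HUGGINGFACE_HUB_TOKEN=hf_xxxxxxxxxxxxxxxx  # placeholder token

# Option 2: interactive CLI login
huggingface-cli login
```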

fix: add Python build artifacts to .gitignore

- Add dist/, build/, *.egg-info/, *.whl to ignore Python build outputs
- Prevents accidentally committing generated files

Signed-off-by: Yossi Ovadia <[email protected]>
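
The resulting .gitignore additions, taken directly from the list above, would look like:

```gitignore
# Python build outputs
dist/
build/
*.egg-info/
*.whl
```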

refactor: separate e2e and production configs

- Create config.e2e.yaml with LLM Katan endpoints for e2e tests
- Restore config.yaml to original production endpoints (matches origin/main)
- Add run-router-e2e target to use e2e config (config/config.e2e.yaml)
- Add start-llm-katan and test-e2e-vllm targets for LLM Katan testing
- Update Makefile help with new e2e test targets
- Remove egg-info directory from git tracking (now in .gitignore)
- Keep pyproject.toml at stable version 0.1.4, always install latest via pip

This separation allows:
- Production config stays clean with real vLLM endpoints
- E2E tests use lightweight LLM Katan servers
- Clear distinction between test and production environments
- Always use latest LLM Katan features via unpinned pip installation

Signed-off-by: Yossi Ovadia <[email protected]>
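
Typical usage of the new targets named above might look like the following sketch; the exact recipes live in tools/make/build-run-test.mk, so the comments here describe assumed intent rather than documented behavior.

```bash
make start-llm-katan   # launch the lightweight LLM Katan servers
make run-router-e2e    # run the router with config/config.e2e.yaml
make test-e2e-vllm     # run the e2e tests against the LLM Katan servers
```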

fix: update e2e test to use model from config.e2e.yaml

- Change test model from 'gemma3:27b' to 'Qwen/Qwen2-0.5B-Instruct'
- Ensures Envoy health check uses model available in e2e config
- Fixes 503 errors when checking if Envoy proxy is running

Signed-off-by: Yossi Ovadia <[email protected]>

Update llm-katan package metadata

- Bump version to 0.1.6 for PyPI publishing
- Change license from MIT to Apache-2.0

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Yossi Ovadia <[email protected]>

Fix Apache license classifier in pyproject.toml

- Update license classifier from MIT to Apache Software License
- Bump version to 0.1.7 for corrected license display on PyPI

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Yossi Ovadia <[email protected]>
@github-actions

github-actions bot commented Sep 26, 2025

👥 vLLM Semantic Team Notification

The following members have been identified for the changed files in this PR and have been automatically assigned:

📁 config

Owners: @rootfs
Files changed:

  • config/config.e2e.yaml

📁 e2e-tests

Owners: @yossiovadia
Files changed:

  • e2e-tests/llm-katan/README.md
  • e2e-tests/llm-katan/llm_katan/__init__.py
  • e2e-tests/llm-katan/llm_katan/cli.py
  • e2e-tests/llm-katan/llm_katan/config.py
  • e2e-tests/llm-katan/llm_katan/model.py
  • e2e-tests/llm-katan/llm_katan/server.py
  • e2e-tests/llm-katan/pyproject.toml
  • e2e-tests/llm-katan/requirements.txt
  • e2e-tests/start-llm-katan.sh
  • e2e-tests/00-client-request-test.py
  • e2e-tests/README.md
  • e2e-tests/run_all_tests.py

📁 Root Directory

Owners: @rootfs, @Xunzhuo
Files changed:

  • .gitignore
  • .pre-commit-config.yaml

📁 tools

Owners: @yuluo-yx, @rootfs, @Xunzhuo
Files changed:

  • tools/make/build-run-test.mk
  • tools/make/common.mk

🎉 Thanks for your contributions!

This comment was automatically generated based on the OWNER files in the repository.

fix: resolve pre-commit hook failures

- Fix markdown linting issues (MD032, MD031, MD047) in README files
- Remove binary distribution files from git tracking
- Add Python build artifacts to .gitignore
- Auto-format Python files with black and isort
- Add CLAUDE.md exclusion to prevent future commits

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Yossi Ovadia <[email protected]>
@yossiovadia yossiovadia force-pushed the mock-vllm-infrastructure branch from bd65f99 to 745d6d5 on September 26, 2025 16:37
rootfs previously approved these changes Sep 26, 2025

fix: update llm-katan project URLs to vllm-project repository

Update repository URLs in pyproject.toml to point to the correct vllm-project
organization instead of personal fork.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Yossi Ovadia <[email protected]>
yossiovadia and others added 2 commits September 26, 2025 10:01

fix: revert config.yaml to original main branch version

Revert production config.yaml to original state from main branch.
The config modifications were not intended for this PR and should
remain unchanged to preserve production configuration.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Yossi Ovadia <[email protected]>

fix: restore config.yaml to match upstream main exactly

Copy config.yaml from upstream main to ensure it matches exactly
and includes the health_check_path and other missing fields.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Yossi Ovadia <[email protected]>
@rootfs
Collaborator

rootfs commented Sep 26, 2025

@yossiovadia this is great! can you follow up with a PR to add the instructions to the doc too (along https://github.com/vllm-project/semantic-router/blob/main/website/docs/installation/deploy-quickstart.md#choosing-a-path)?
@JaredforReal would it be possible to consolidate the mock and this lightweight llm server?

@rootfs rootfs merged commit 858dd50 into vllm-project:main Sep 26, 2025
9 checks passed
Aias00 pushed a commit to Aias00/semantic-router that referenced this pull request Oct 4, 2025
yossiovadia added a commit to yossiovadia/semantic-router that referenced this pull request Oct 8, 2025