Conversation
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Summary of Changes

Hello @viraatc, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a new benchmarking example designed to measure the client-side overhead of different inference benchmarking tools, replacing the existing echo server implementation with a high-performance Granian-based server.
Pull request overview
This PR refactors the echo server implementation to use Granian (Rust-based ASGI server) instead of aiohttp, running in a subprocess for improved performance. The title indicates "do-not-merge" and contains a spelling error ("graning" instead of "Granian"). Key changes include:
- Replaced aiohttp-based threading implementation with Granian subprocess approach
- Introduced new ASGI application module for handling requests
- Added configurable worker count (defaulting to 8 workers)
- Implemented improved process management with signal handling
- Added comprehensive benchmarking example comparing inference-endpoint and vLLM tools
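For readers skimming the review, the subprocess approach described above can be sketched roughly as follows. This is a minimal sketch, not the PR's actual code: the `echo_asgi:app` module path, the Granian CLI flags, and the readiness-polling strategy are assumptions (the PR itself signals port readiness through an event):

```python
import socket
import subprocess
import time


def wait_for_port(port: int, host: str = "127.0.0.1", timeout: float = 10.0) -> bool:
    # Poll until host:port accepts TCP connections, or the timeout expires.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=0.2):
                return True
        except OSError:
            time.sleep(0.05)
    return False


def start_echo_server(port: int = 8000, workers: int = 8) -> subprocess.Popen:
    # Launch Granian in its own session so the whole worker pool can later
    # be signalled as one process group on shutdown.
    proc = subprocess.Popen(
        ["granian", "--interface", "asgi",
         "--workers", str(workers), "--port", str(port), "echo_asgi:app"],
        start_new_session=True,
    )
    if not wait_for_port(port):
        proc.kill()
        raise RuntimeError(f"echo server did not open port {port}")
    return proc
```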
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Summary per file:
| File | Description |
|---|---|
| src/inference_endpoint/testing/echo_server.py | Replaced aiohttp threading implementation with Granian subprocess-based server |
| src/inference_endpoint/testing/_echo_asgi.py | New ASGI application module for Granian server with OpenAI-compatible endpoints |
| pyproject.toml | Added granian>=1.6.0 dependency |
| examples/04_EchoServerBenchmark/benchmark_echo_server.py | New benchmarking script comparing inference-endpoint and vLLM performance |
| examples/04_EchoServerBenchmark/README.md | Documentation for echo server benchmark example |
| """HTTP Echo Server for testing inference endpoint clients. | ||
|
|
||
| High-performance implementation using Granian (Rust-based ASGI server) running | ||
| in a separate process for maximum throughput. | ||
| """ |
The PR title contains 'graning' which should be 'Granian' to match the module docstring spelling.
```python
self._loop = None
self._shutdown_event = threading.Event()
self._port_ready_event = threading.Event()  # Signal when port is ready
# Default to 4 workers - sweet spot for echo server
```
The comment states the default is 4 workers, but the code sets it to 8. Update the comment to match the actual default value.
```diff
-# Default to 4 workers - sweet spot for echo server
+# Default to 8 workers - sweet spot for echo server
```
```python
def set_max_osl(self, max_osl: int):
    """Set maximum output sequence length."""
    self.max_osl = max_osl
    # Note: Changes don't affect running server - restart required
```
This comment should clarify that changes to max_osl after the server starts require a restart, as this is important information for API users.
```diff
-    # Note: Changes don't affect running server - restart required
+    # Note: Changing max_osl after the server has started does not affect the
+    # currently running server; restart is required for changes
+    # to take effect.
```
```shell
python benchmark_echo_server.py --num-prompts 30000 --workers 4
python benchmark_echo_server.py --num-prompts 30000 --workers 4 --verbose
```
The usage examples show a '--workers' argument, but the script's argument parser only accepts '--port', '--no-stream', and '--verbose'. Either add the '--workers' argument to the parser or remove it from the examples.
```diff
-python benchmark_echo_server.py --num-prompts 30000 --workers 4
-python benchmark_echo_server.py --num-prompts 30000 --workers 4 --verbose
+python benchmark_echo_server.py --num-prompts 30000
+python benchmark_echo_server.py --num-prompts 30000 --verbose
```
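As context for this comment, a minimal parser accepting the flags the reviewer says the script actually supports (`--port`, `--no-stream`, `--verbose`, plus `--num-prompts` from the usage examples) might look like the sketch below; the defaults and help strings are assumptions, not taken from the script itself:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Flags taken from the review comment and the usage examples above;
    # defaults and help strings are illustrative assumptions.
    parser = argparse.ArgumentParser(description="Echo server benchmark")
    parser.add_argument("--num-prompts", type=int, default=30000,
                        help="Number of prompts to send")
    parser.add_argument("--port", type=int, default=8000,
                        help="Echo server port")
    parser.add_argument("--no-stream", action="store_true",
                        help="Disable streaming responses")
    parser.add_argument("--verbose", action="store_true",
                        help="Verbose per-request output")
    return parser
```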
Code Review
This pull request introduces a significant performance improvement by replacing the aiohttp-based echo server with a high-performance implementation using Granian. The new architecture, which runs Granian in a separate subprocess, is well-structured and robustly handles server startup and shutdown. The changes also include a new benchmark example to compare the performance of inference-endpoint and vLLM.
My review has identified a couple of areas for improvement:
- The new benchmark script and its documentation have a discrepancy regarding the `--workers` command-line argument, which is documented but not implemented.
- The ASGI application for the echo server could be improved by returning a `500 Internal Server Error` for unexpected exceptions, which is more semantically correct than the current `400 Bad Request`.
Also, there's a small typo in the pull request title ("graning" instead of "Granian").
Overall, this is a great enhancement. Addressing these points will improve the usability and correctness of the new additions.
```python
    endpoint_url,
    ie_dataset,
    args.num_prompts,
    IE_WORKERS,
```
The number of workers for the inference-endpoint benchmark is hardcoded to IE_WORKERS. However, the script's docstring (lines 24-25) and the README.md file indicate that a --workers command-line argument should be supported. Please add this argument to argparse to allow configuring the number of workers, and use it here.
For example, you can add the following to your argument parser:
```python
parser.add_argument(
    "--workers",
    type=int,
    default=IE_WORKERS,
    help=f"Number of workers for inference-endpoint benchmark (default: {IE_WORKERS})",
)
```

```diff
-    IE_WORKERS,
+    args.workers,
```
```python
    await _send_json_response(send, {"error": f"Invalid JSON: {e}"}, status=400)
except Exception as e:
    await _send_json_response(
        send, {"error": f"error encountered: {e}"}, status=400
```
For unexpected server-side errors caught by this generic except Exception block, it's more appropriate to return a 500 Internal Server Error status code instead of 400 Bad Request. This helps differentiate between client-side errors (like malformed JSON, which correctly returns 400) and actual server-side problems.
| send, {"error": f"error encountered: {e}"}, status=400 | |
| send, {"error": f"error encountered: {e}"}, status=500 |
```python
except Exception as e:
    await _send_json_response(
        send, {"error": f"error encountered: {e}"}, status=400
```
Similar to the other handler, unexpected server-side errors caught by this generic except Exception block should return a 500 Internal Server Error status code instead of 400 Bad Request to correctly signal a server-side issue.
| send, {"error": f"error encountered: {e}"}, status=400 | |
| send, {"error": f"error encountered: {e}"}, status=500 |
| elif message["type"] == "lifespan.shutdown": | ||
| await send({"type": "lifespan.shutdown.complete"}) | ||
| return | ||
| return |
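The excerpt above follows the standard ASGI lifespan protocol; a self-contained version of that loop looks roughly like this (a sketch, not the PR's exact code):

```python
async def lifespan(receive, send) -> None:
    # Minimal ASGI lifespan loop: acknowledge startup, then keep receiving
    # until the server announces shutdown and acknowledge that too.
    while True:
        message = await receive()
        if message["type"] == "lifespan.startup":
            await send({"type": "lifespan.startup.complete"})
        elif message["type"] == "lifespan.shutdown":
            await send({"type": "lifespan.shutdown.complete"})
            return
```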
```python
# Read configuration from environment variables (set by parent process)
_MAX_OSL: int | None = None
_RESPONSE_MODULE: str | None = None
_RESPONSE_DATA: Any = None
```
```python
max_osl_str = os.environ.get("_ECHO_SERVER_MAX_OSL", "")
_MAX_OSL = int(max_osl_str) if max_osl_str else None

_RESPONSE_MODULE = os.environ.get("_ECHO_SERVER_RESPONSE_MODULE")
response_data_str = os.environ.get("_ECHO_SERVER_RESPONSE_DATA")
if response_data_str:
    _RESPONSE_DATA = orjson.loads(response_data_str)
```
```python
try:
    # Send SIGTERM to the process group
    os.killpg(os.getpgid(self._process.pid), signal.SIGTERM)
except (ProcessLookupError, OSError):
```

```python
except subprocess.TimeoutExpired:
    try:
        os.killpg(os.getpgid(self._process.pid), signal.SIGKILL)
    except (ProcessLookupError, OSError):
```
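Taken together, the two excerpts above implement a TERM-then-KILL escalation against the child's process group. A self-contained version of the pattern follows; the function name is hypothetical, and it assumes the child was launched with `start_new_session=True` so that it leads its own process group:

```python
import os
import signal
import subprocess


def stop_process_group(proc: subprocess.Popen, timeout: float = 5.0) -> None:
    # Politely SIGTERM the whole group first; escalate to SIGKILL if the
    # workers have not exited within `timeout` seconds.
    try:
        os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
    except (ProcessLookupError, OSError):
        return  # process already gone
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        try:
            os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
        except (ProcessLookupError, OSError):
            pass
        proc.wait()  # reap the child after SIGKILL
```

Signalling the group rather than the single PID matters here because Granian forks worker processes; terminating only the parent could leave workers orphaned and the port held.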