Fix MCP tool-call latency for non-queued events#12961
freddyaboulton merged 10 commits into gradio-app:main
…pback When `queue=False`, MCP `call_tool()` now calls `blocks.process_api()` directly instead of going through `gradio_client.Client.submit()`, which was making an HTTP POST back to the same server. This eliminates thread dispatches, TCP round-trips, SSE overhead, and queue serialization. For queued events (`queue=True`), the existing HTTP loopback path is preserved to maintain streaming updates, progress notifications, and queue-based features.
🪼 branch checks and previews
Install Gradio from this PR: `pip install https://gradio-pypi-previews.s3.amazonaws.com/6334868b4637fd24f2e2d32aed6b0dd7b69f5ac9/gradio-6.8.0-py3-none-any.whl`
Install Gradio Python Client from this PR: `pip install "gradio-client @ git+https://github.com/gradio-app/gradio@6334868b4637fd24f2e2d32aed6b0dd7b69f5ac9#subdirectory=client/python"`
Install Gradio JS Client from this PR: `npm install https://gradio-npm-previews.s3.amazonaws.com/6334868b4637fd24f2e2d32aed6b0dd7b69f5ac9/gradio-client-2.1.0.tgz`
🦄 change detected
This Pull Request includes changes to the following packages.
✅ Changeset approved by @freddyaboulton
    # This eliminates thread dispatches, TCP round-trips, and SSE
    # overhead — reducing MCP tool-call latency significantly.
    session_state = SessionState(self.blocks)
    raw_output = await self.blocks.process_api(
We should pass the request here
Done! Added request=self.mcp_server.request_context.request to the process_api() call in a96be5a.
Thanks @Mandark-droid! This is great. Just one comment, we should pass the request to
Address review feedback from @freddyaboulton: forward the request object to blocks.process_api() so that downstream handlers have access to the original HTTP request context.
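To illustrate why forwarding the request matters, here is a minimal sketch in plain Python. It does not use Gradio: `FakeRequest`, this simplified `process_api`, and `whoami` are hypothetical stand-ins, and the name-based injection shown here is an assumption for illustration only, not how Gradio resolves request parameters.

```python
import asyncio
import inspect
from typing import Any, Callable, Optional


class FakeRequest:
    """Hypothetical stand-in for the incoming HTTP request object."""

    def __init__(self, headers: dict) -> None:
        self.headers = headers


async def process_api(fn: Callable[..., Any], inputs: list,
                      request: Optional[FakeRequest] = None) -> Any:
    # Simplified stand-in for an in-process dispatch function.
    # Only handlers that declare a `request` parameter receive it.
    kwargs = {}
    if "request" in inspect.signature(fn).parameters:
        kwargs["request"] = request
    return fn(*inputs, **kwargs)


def whoami(greeting: str, request: Optional[FakeRequest] = None) -> str:
    # A handler that reads a header from the original request, if present.
    user = request.headers.get("x-user", "anonymous") if request else "anonymous"
    return f"{greeting}, {user}"


req = FakeRequest({"x-user": "ada"})
with_request = asyncio.run(process_api(whoami, ["hello"], request=req))
without_request = asyncio.run(process_api(whoami, ["hello"]))
print(with_request)
print(without_request)
```

When the request is not forwarded, the handler still runs but loses access to headers and any other per-request context.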
abidlabs left a comment
Awesome @Mandark-droid! Tested and works great.
I would just add a section in the MCP docs mentioning that for performance, set queue=False. We already have a brief description here: https://www.gradio.app/guides/building-mcp-server-with-gradio#sending-progress-updates, but we can expand it a bit to reference the improved performance optimization
Work on FastMCP. Nice fix btw, 16x improvement is legit. FYI the FastMCP numbers in the table don't match what we see on our end (ask Claude to check out the benchmark branch and try to reproduce). Looks like the baseline was misconfigured. The Gradio improvement is the real story here anyway, congrats team!
Description
When an MCP `call_tool()` is invoked for a non-queued event (`queue=False`), the current implementation still routes the call through `gradio_client.Client.submit()`, which performs a full HTTP loopback. This adds ~4 seconds of overhead per call for functions that take ~13ms to execute.
This PR bypasses the HTTP loopback for `queue=False` events by calling `blocks.process_api()` directly, the same internal function the HTTP route eventually reaches. For queued events (`queue=True`), the existing path is preserved to maintain streaming updates, progress notifications, and queue-based features.
Relates to: #11961 (PR #12296 partially addressed this by skipping progress updates, but the HTTP loopback remained)
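The routing decision described above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in `gradio/mcp.py`: `ToolFn`, `process_direct`, and `process_via_loopback` are hypothetical stand-ins, and the sleep only simulates loopback overhead rather than reproducing a measured value.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToolFn:
    """Hypothetical stand-in for a registered event handler."""
    name: str
    fn: Callable[..., Any]
    queue: bool


async def process_direct(tool: ToolFn, args: list) -> Any:
    # Stand-in for the in-process path (blocks.process_api() in the PR).
    return tool.fn(*args)


async def process_via_loopback(tool: ToolFn, args: list) -> Any:
    # Stand-in for the HTTP loopback path. The sleep simulates network
    # and queue overhead; it is not a measured value.
    await asyncio.sleep(0.05)
    return tool.fn(*args)


async def call_tool(tool: ToolFn, args: list) -> tuple:
    """Run non-queued tools in process; keep the loopback for queued ones."""
    if not tool.queue:
        return "direct", await process_direct(tool, args)
    return "loopback", await process_via_loopback(tool, args)


def add(a: int, b: int) -> int:
    return a + b


fast = asyncio.run(call_tool(ToolFn("add", add, queue=False), [2, 3]))
slow = asyncio.run(call_tool(ToolFn("add", add, queue=True), [2, 3]))
print(fast, slow)
```

Both paths return the same result; only the route differs, which is why the queued path can keep streaming and progress features while the non-queued path avoids the extra hop.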
AI Disclosure
Benchmark Results
Benchmarked using mcp-server-bench — an open-source MCP benchmarking tool comparing Gradio vs FastMCP across identical tool implementations.
Three-Way Comparison: Gradio MCP Streamable Protocol
Format: Throughput (RPS) / p50 latency. VU = virtual users (concurrent connections).
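For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them from raw per-request latencies. The `summarize` helper and the sample values are hypothetical and purely illustrative; they are not taken from the benchmark, and the benchmarking tool may compute these figures differently.

```python
import statistics


def summarize(latencies_ms: list, wall_clock_s: float) -> tuple:
    """Return (throughput in requests per second, p50 latency in ms)."""
    rps = len(latencies_ms) / wall_clock_s  # completed requests per second
    p50 = statistics.median(latencies_ms)   # 50th percentile (median) latency
    return rps, p50


# Illustrative latencies only, not benchmark data.
samples = [10.0, 12.0, 14.0, 16.0, 100.0]
rps, p50 = summarize(samples, wall_clock_s=0.5)
print(rps, p50)
```

The median is less sensitive than the mean to a single slow request, such as the 100 ms outlier in the sample above.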
Key Results
Benchmark Datasets (reproducible)
Testing and Formatting
Validated with the benchmark suite above. The change is scoped to `gradio/mcp.py` only; no frontend changes.
Changes
`gradio/mcp.py`:
`from gradio.state_holder import SessionState`
`call_tool()`: when `block_fn.queue` is `False`, call `blocks.process_api()` directly; otherwise use existing HTTP loopback path