
Fix MCP tool-call latency for non-queued events#12961

Merged
freddyaboulton merged 10 commits into gradio-app:main from Mandark-droid:fix/mcp-direct-call-unqueued-events
Mar 6, 2026

Conversation


@Mandark-droid (Contributor) commented Mar 4, 2026

Description

When an MCP call_tool() is invoked for a non-queued event (queue=False), the current implementation still routes the call through gradio_client.Client.submit(), which performs a full HTTP loopback:

MCP request
  → run_sync(Client._get_or_create_client)   ← thread dispatch
  → client.submit(api_name=endpoint)          ← HTTP POST to own server
  → Gradio queue processing
  → run_sync(job.result)                      ← thread blocking for HTTP response

This adds ~4 seconds of overhead per call for functions that take ~13ms to execute.

This PR bypasses the HTTP loopback for queue=False events by calling blocks.process_api() directly — the same internal function the HTTP route eventually reaches. For queued events (queue=True), the existing path is preserved to maintain streaming updates, progress notifications, and queue-based features.
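The dispatch decision can be sketched in isolation. This is a simplified stand-in, not Gradio's actual internals: `FakeBlocks`, `submit_via_http`, and the `fn_queue_enabled` flag are hypothetical names that mimic the shape of the PR's conditional (direct `process_api()` call when the event is non-queued, loopback otherwise):

```python
import asyncio

# Stand-in for gradio.Blocks; the real process_api() takes many more arguments.
class FakeBlocks:
    async def process_api(self, fn_index, inputs):
        # Direct in-process call: no HTTP round-trip, no queue.
        return {"data": [f"echo:{inputs[0]}"]}

async def submit_via_http(endpoint, inputs):
    # Placeholder for the gradio_client.Client.submit() loopback path.
    await asyncio.sleep(0)  # the network round-trip would happen here
    return {"data": [f"echo:{inputs[0]}"]}

async def call_tool(blocks, fn_queue_enabled, endpoint, inputs):
    """Route non-queued events directly; keep the loopback for queued ones."""
    if not fn_queue_enabled:
        return await blocks.process_api(fn_index=0, inputs=inputs)
    return await submit_via_http(endpoint, inputs)

result = asyncio.run(call_tool(FakeBlocks(), False, "/echo", ["hi"]))
print(result["data"][0])  # → echo:hi
```

Both branches return the same payload shape, which is why the fast path can be swapped in without changing MCP response handling.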

Relates to: #11961 (PR #12296 partially addressed this by skipping progress updates, but the HTTP loopback remained)

AI Disclosure

  • I used AI to assist with benchmarking analysis and drafting the PR description

Benchmark Results

Benchmarked using mcp-server-bench — an open-source MCP benchmarking tool comparing Gradio vs FastMCP across identical tool implementations.

Three-Way Comparison: Gradio MCP Streamable Protocol

| Scenario | Before (loopback) | PR (queue=OFF) | PR (queue=ON) | FastMCP ref |
|---|---|---|---|---|
| echo VU=1 | 0.4 RPS / 4,133ms p50 | 54.3 RPS / 16ms p50 | 0.0 RPS (startup issue) | 74.6 RPS / 13ms p50 |
| echo VU=10 | 3.6 RPS / 4,133ms p50 | 151.4 RPS / 63ms p50 | 2.8 RPS / 4,123ms p50 | 43.6 RPS / 203ms p50 |
| async_sleep VU=1 | 0.4 RPS / 4,141ms p50 | 16.6 RPS / 63ms p50 | 0.3 RPS / 4,149ms p50 | 13.9 RPS / 77ms p50 |
| async_sleep VU=10 | 3.6 RPS / 4,084ms p50 | 126.6 RPS / 79ms p50 | 2.8 RPS / 4,129ms p50 | 52.9 RPS / 168ms p50 |

Format: Throughput (RPS) / p50 latency. VU = virtual users (concurrent connections).

Key Results

  • queue=OFF (this PR): p50 latency drops from ~4,130ms to ~16-79ms (50-250x improvement)
  • queue=ON (unchanged): Behavior identical to before — streaming/progress preserved
  • At VU=10 with queue=False, Gradio beats FastMCP on both throughput and latency (151 vs 44 RPS for echo, 127 vs 53 RPS for async_sleep)
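The 50-250x range follows directly from the p50 columns of the table above; a quick check:

```python
# p50 latencies (ms) from the benchmark table: before (loopback) vs this PR (queue=OFF)
before = {
    "echo VU=1": 4133,
    "echo VU=10": 4133,
    "async_sleep VU=1": 4141,
    "async_sleep VU=10": 4084,
}
after = {
    "echo VU=1": 16,
    "echo VU=10": 63,
    "async_sleep VU=1": 63,
    "async_sleep VU=10": 79,
}

speedups = {k: before[k] / after[k] for k in before}
for k, v in speedups.items():
    print(f"{k}: {v:.0f}x")

# The smallest and largest factors bracket the stated 50-250x range.
assert 50 <= min(speedups.values()) <= max(speedups.values()) <= 260
```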

Benchmark Datasets (reproducible)

| Dataset | Description |
|---|---|
| mcp-server-bench | Before fix — 360 scenarios, all HTTP loopback |
| mcp-server-bench-gradio-optimized | After fix (unconditional direct call) — 48 scenarios |
| mcp-server-bench-gradio-optimized-full-bench | After fix (unconditional direct call) — 337 scenarios, full benchmark |
| mcp-server-bench-gradio | This PR (conditional, queue=False only) — 12 scenarios |

Testing and Formatting

Validated with the benchmark suite above. The change is scoped to gradio/mcp.py only — no frontend changes.

Changes

  • 1 file changed: gradio/mcp.py
  • Added from gradio.state_holder import SessionState
  • call_tool(): when block_fn.queue is False, call blocks.process_api() directly; otherwise use existing HTTP loopback path

Mandark-droid and others added 3 commits March 4, 2026 10:40
…pback

When `queue=False`, MCP `call_tool()` now calls `blocks.process_api()` directly
instead of going through `gradio_client.Client.submit()`, which was making an
HTTP POST back to the same server. This eliminates thread dispatches, TCP
round-trips, SSE overhead, and queue serialization.

For queued events (`queue=True`), the existing HTTP loopback path is preserved
to maintain streaming updates, progress notifications, and queue-based features.
@gradio-pr-bot (Collaborator) commented Mar 4, 2026

🪼 branch checks and previews

| Name | Status | URL |
|---|---|---|
| Spaces | ready! | Spaces preview |
| Website | ready! | Website preview |
| 🦄 Changes | detected! | Details |

Install Gradio from this PR

```shell
pip install https://gradio-pypi-previews.s3.amazonaws.com/6334868b4637fd24f2e2d32aed6b0dd7b69f5ac9/gradio-6.8.0-py3-none-any.whl
```

Install Gradio Python Client from this PR

```shell
pip install "gradio-client @ git+https://github.com/gradio-app/gradio@6334868b4637fd24f2e2d32aed6b0dd7b69f5ac9#subdirectory=client/python"
```

Install Gradio JS Client from this PR

```shell
npm install https://gradio-npm-previews.s3.amazonaws.com/6334868b4637fd24f2e2d32aed6b0dd7b69f5ac9/gradio-client-2.1.0.tgz
```

@gradio-pr-bot (Collaborator) commented Mar 4, 2026

🦄 change detected

This Pull Request includes changes to the following packages.

| Package | Version |
|---|---|
| gradio | patch |

  • bypass HTTP loopback for non-queued MCP tool calls, calling blocks.process_api() directly to reduce latency

✅ Changeset approved by @freddyaboulton

  • Maintainers can remove approval by unchecking this checkbox.

Something isn't right?

  • Maintainers can change the version label to modify the version bump.
  • If the bot has failed to detect any changes, or if this pull request needs to update multiple packages to different versions or requires a more comprehensive changelog entry, maintainers can update the changelog file directly.

```python
# This eliminates thread dispatches, TCP round-trips, and SSE
# overhead — reducing MCP tool-call latency significantly.
session_state = SessionState(self.blocks)
raw_output = await self.blocks.process_api(
```
@freddyaboulton (Collaborator) commented:

We should pass the request here

@Mandark-droid (Author) commented:

Done! Added `request=self.mcp_server.request_context.request` to the `process_api()` call in a96be5a.

@freddyaboulton (Collaborator) commented:

Thank you!

@freddyaboulton (Collaborator) commented:

Thanks @Mandark-droid! This is great. Just one comment: we should pass the request to call_process_api.

freddyaboulton and others added 2 commits March 4, 2026 09:44
Address review feedback from @freddyaboulton: forward the request
object to blocks.process_api() so that downstream handlers have
access to the original HTTP request context.
@abidlabs (Member) left a comment:

Awesome @Mandark-droid! Tested and works great.

I would just add a section in the MCP docs mentioning that, for performance, users can set queue=False. We already have a brief description here: https://www.gradio.app/guides/building-mcp-server-with-gradio#sending-progress-updates, but we can expand it a bit to reference this performance improvement.

@freddyaboulton freddyaboulton enabled auto-merge (squash) March 6, 2026 16:13
@aaazzam commented Mar 6, 2026

I work on FastMCP. Nice fix btw — the 16x improvement is legit.

FYI the FastMCP numbers in the table don't match what we see on our end (ask Claude to check out the benchmark branch and try to reproduce). Looks like the baseline was misconfigured. The Gradio improvement is the real story here anyway, congrats team!

@freddyaboulton freddyaboulton merged commit 0595d1b into gradio-app:main Mar 6, 2026
21 of 22 checks passed