StreamableClientTransport: Connection poisoned by transient errors #683

@Agustin-Jerusalinsky

Problem

The StreamableClientTransport permanently breaks after any transient error, making the connection unusable for all subsequent requests. This affects both stateful and stateless server modes.

Transient errors include:

  • Network timeouts (context deadline exceeded)
  • HTTP errors (503 Service Unavailable, 500 Internal Server Error, 502 Bad Gateway)
  • Network interruptions (connection refused, connection reset)

After a single transient error occurs, all subsequent requests fail with "client is closing", requiring a full reconnection. This affects production systems where temporary issues (server restarts, network glitches, load spikes) should be recoverable.


1. What did you do?

Created an MCP client using StreamableClientTransport and made three sequential tool calls, where the second call encounters a transient error (timeout, 503, network interruption, etc.):

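httpClient := &http.Client{Timeout: /* some timeout */}
client := mcp.NewClient(&mcp.Implementation{Name: "test-client", Version: "1.0.0"}, nil)
transport := &mcp.StreamableClientTransport{
    Endpoint:   serverURL,
    HTTPClient: httpClient,
}

// Error handling omitted for brevity
session, _ := client.Connect(ctx, transport, nil)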

// Call 1: Fast tool - should succeed
result1, err1 := session.CallTool(ctx, &mcp.CallToolParams{Name: "delay_tool"})

// Call 2: Encounters transient error (timeout, 503, etc.)
result2, err2 := session.CallTool(ctx, &mcp.CallToolParams{Name: "delay_tool"})

// Call 3: Fast tool - should succeed
result3, err3 := session.CallTool(ctx, &mcp.CallToolParams{Name: "delay_tool"})

2. What did you see?

Call #1: SUCCESS
Call #2: FAILED - calling "tools/call": sending "tools/call": Post "http://...": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Call #3: FAILED - connection closed: calling "tools/call": client is closing: EOF
Call #4+: FAILED - connection closed: calling "tools/call": client is closing: EOF

After Call #2 encounters a transient error, all subsequent calls fail permanently with "client is closing: EOF" errors. The connection becomes unusable and must be recreated.

3. What did you expect to see?

Call #1: SUCCESS ✅
Call #2: FAILED - transient error (expected)
Call #3: SUCCESS ✅ (connection should survive the transient error)
Call #4+: SUCCESS ✅

The connection should remain healthy after transient errors. Only fatal errors (authentication failures, protocol errors, session termination) should require reconnection.

4. What version of the Go MCP SDK are you using?

Bug reproduced in:

  • main branch (commit 272e0cd)
  • v1.1.0 (latest release)
  • v1.0.0 (initial release)

The bug exists in all released versions of the SDK.

5. What version of Go are you using (go version)?

go version go1.23.3 darwin/arm64

Root Cause

The streamableClientConn.Write() method returns transient errors without wrapping them in jsonrpc2.ErrRejected. This causes the underlying jsonrpc2.Connection to set its writeErr flag permanently (in internal/jsonrpc2/conn.go:794-810), which is never cleared and blocks all future operations.

The jsonrpc2 package already has a mechanism to handle recoverable errors via ErrRejected (see conn.go:788-792), but the streamable transport doesn't use it.

After any transient error, the connection becomes unusable for all MCP operations: CallTool(), ListTools(), ListResources(), etc. The entire session must be recreated, even though the error was temporary and should have been recoverable.

Reproduction Test Code

Below is a minimal reproduction test:

package gosdk

import (
    "context"
    "fmt"
    "net/http"
    "net/http/httptest"
    "sync/atomic"
    "testing"
    "time"

    "github.com/modelcontextprotocol/go-sdk/mcp"
)

// testDelays defines the sleep duration for each call
var testDelays = []time.Duration{
    500 * time.Millisecond, // Call 1: fast, should succeed
    3 * time.Second,        // Call 2: slow, exceeds the 2s client timeout
    500 * time.Millisecond, // Call 3: fast, should succeed but fails due to bug
}

// createDelayTool creates an MCP server and registers a tool that sleeps for the configured duration on each call
func createDelayTool(callCount *atomic.Int32) (*mcp.Server, *mcp.Tool) {
    server := mcp.NewServer(&mcp.Implementation{
        Name:    "test-server",
        Version: "1.0.0",
    }, nil)

    tool := &mcp.Tool{
        Name:        "delay_tool",
        Description: "Tool with configurable delays for testing",
    }

    handler := func(ctx context.Context, req *mcp.CallToolRequest, args any) (*mcp.CallToolResult, any, error) {
        callNum := int(callCount.Add(1))
        delay := testDelays[0] // default
        if callNum <= len(testDelays) {
            delay = testDelays[callNum-1]
        }

        time.Sleep(delay)

        return &mcp.CallToolResult{
            Content: []mcp.Content{
                &mcp.TextContent{
                    Text: fmt.Sprintf("Call #%d completed", callNum),
                },
            },
        }, nil, nil
    }

    mcp.AddTool(server, tool, handler)
    return server, tool
}

// setupTestServer creates and starts an HTTP test server with MCP handler
func setupTestServer(t *testing.T, server *mcp.Server, stateless bool) *httptest.Server {
    var opts *mcp.StreamableHTTPOptions
    if stateless {
        opts = &mcp.StreamableHTTPOptions{Stateless: true}
    }

    handler := mcp.NewStreamableHTTPHandler(func(req *http.Request) *mcp.Server {
        return server
    }, opts)

    httpServer := httptest.NewServer(handler)
    t.Cleanup(httpServer.Close)
    return httpServer
}

// createMCPSession creates and connects an MCP client session
func createMCPSession(t *testing.T, serverURL string, clientTimeout time.Duration) *mcp.ClientSession {
    ctx := context.Background()
    httpClient := &http.Client{Timeout: clientTimeout}

    client := mcp.NewClient(&mcp.Implementation{
        Name:    "test-client",
        Version: "1.0.0",
    }, nil)

    transport := &mcp.StreamableClientTransport{
        Endpoint:   serverURL,
        HTTPClient: httpClient,
    }

    session, err := client.Connect(ctx, transport, nil)
    if err != nil {
        t.Fatalf("Failed to connect: %v", err)
    }
    t.Cleanup(func() { session.Close() })

    return session
}

// callResult represents the result of a CallTool invocation
type callResult struct {
    num    int
    err    error
    result *mcp.CallToolResult
}

// performCallSequence executes three sequential tool calls and returns results
func performCallSequence(session *mcp.ClientSession) []callResult {
    results := make([]callResult, 3)

    for i := range 3 {
        ctx := context.Background() // Fresh context for each call
        result, err := session.CallTool(ctx, &mcp.CallToolParams{
            Name:      "delay_tool",
            Arguments: map[string]any{},
        })
        results[i] = callResult{num: i + 1, err: err, result: result}
    }

    return results
}

// reportResults logs the test results
func reportResults(t *testing.T, results []callResult) {
    t.Helper()

    for _, r := range results {
        if r.err != nil {
            t.Logf("Call #%d: FAILED - %v", r.num, r.err)
        } else {
            t.Logf("Call #%d: SUCCESS", r.num)
        }
    }

    // Expected behavior: Call 3 should succeed even after Call 2 timeout
    if results[2].err != nil {
        t.Errorf("Call #3 failed after transient error in Call #2")
    }
}

// TestTimeoutBugReproduction tests connection behavior after timeout in stateful mode.
// Expected: Call #3 should succeed even after Call #2 timeout.
func TestTimeoutBugReproduction(t *testing.T) {
    var callCount atomic.Int32
    server, _ := createDelayTool(&callCount)
    httpServer := setupTestServer(t, server, false) // stateful mode
    session := createMCPSession(t, httpServer.URL, 2*time.Second)

    results := performCallSequence(session)
    reportResults(t, results)
}

// TestTimeoutStateless tests connection behavior after timeout in stateless mode.
// Expected: Call #3 should succeed even after Call #2 timeout.
func TestTimeoutStateless(t *testing.T) {
    var callCount atomic.Int32
    server, _ := createDelayTool(&callCount)
    httpServer := setupTestServer(t, server, true) // stateless mode
    session := createMCPSession(t, httpServer.URL, 2*time.Second)

    results := performCallSequence(session)
    reportResults(t, results)
}

Labels

bug (Something isn't working)