[RFC] Standardizing MCP Error Handling for Autonomous Agent Resilience #1094

@Protocol-zero-0

As autonomous agents scale (e.g., high-frequency tool usage), distinguishing transient failures such as network blips from permanent logic errors becomes critical for reliability.

Currently, many MCP implementations return generic errors, forcing consumer agents to guess whether a retry is safe or worthwhile.

Proposal:
Adopt a standardized error schema (perhaps inspired by RFC 7807 Problem Details) for MCP responses that explicitly signals:

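  1. Retry-ability: Should the agent try again? (e.g., yes for rate limits and temporary downtime, no for invalid arguments)
  2. Backoff Hint: A retry_after_ms field stating how long the agent should wait before retrying.
  3. State Validity: Did the tool partially execute or fail atomically? (i.e., could a retry duplicate side effects?)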
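
To make the three signals concrete, here is a minimal sketch of what such a payload might look like on the consumer side. Only retry_after_ms comes from this proposal; the names ToolProblem, parse_problem, safe_to_retry, retryable, and partial_execution are hypothetical placeholders, and type/title are borrowed from RFC 7807. This is an illustration under those assumptions, not a proposed final schema.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class ToolProblem:
    """Hypothetical error payload; only retry_after_ms is named in the proposal."""
    type: str                      # RFC 7807-style problem type URI
    title: str                     # short human-readable summary
    retryable: bool                # 1. retry-ability
    retry_after_ms: Optional[int]  # 2. backoff hint, None when the server gives none
    partial_execution: bool        # 3. state validity: True if side effects may exist


def parse_problem(data: Mapping[str, Any]) -> ToolProblem:
    """Build a ToolProblem from a decoded JSON object.

    Missing fields fall back to the most conservative reading:
    not retryable, and possibly partially executed.
    """
    return ToolProblem(
        type=str(data.get("type", "about:blank")),  # RFC 7807 default type
        title=str(data.get("title", "Unknown error")),
        retryable=bool(data.get("retryable", False)),
        retry_after_ms=data.get("retry_after_ms"),
        partial_execution=bool(data.get("partial_execution", True)),
    )


def safe_to_retry(problem: ToolProblem) -> bool:
    """Retry only when the server says so and nothing was partially applied."""
    return problem.retryable and not problem.partial_execution
```

Defaulting to "not retryable, possibly partial" means a provider that has not adopted the schema is treated the same way as today: the agent does not retry blindly.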

This would allow platforms like Klavis to implement robust, self-healing execution loops without custom handling for every tool provider.
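
As a rough sketch of the kind of provider-agnostic execution loop this would enable: the code below assumes the client surfaces the three signals on an exception. ToolCallError, call_with_retry, and their parameters are hypothetical names invented for illustration; they are not part of MCP, Klavis, or any existing SDK.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class ToolCallError(Exception):
    """Hypothetical exception carrying the three proposed signals."""

    def __init__(self, message: str, retryable: bool,
                 retry_after_ms: Optional[int] = None,
                 partial_execution: bool = False) -> None:
        super().__init__(message)
        self.retryable = retryable
        self.retry_after_ms = retry_after_ms
        self.partial_execution = partial_execution


def call_with_retry(call: Callable[[], T], max_attempts: int = 3,
                    default_backoff_ms: int = 200,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `call`, retrying only errors flagged retryable and atomic."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except ToolCallError as err:
            out_of_attempts = attempt == max_attempts
            # Permanent errors and partial executions are never retried.
            if not err.retryable or err.partial_execution or out_of_attempts:
                raise
            if err.retry_after_ms is not None:
                delay_ms = err.retry_after_ms  # honor the server's hint
            else:
                # No hint given: fall back to exponential backoff.
                delay_ms = default_backoff_ms * 2 ** (attempt - 1)
            sleep(delay_ms / 1000)
    raise ValueError("max_attempts must be at least 1")
```

The loop needs no knowledge of the individual tool: the decision to retry, the wait time, and the refusal to repeat a partially executed call all come from the error itself.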

Would love to see this discussed as part of the core resilience patterns.
