[RFC] Standardizing MCP Error Handling for Autonomous Agent Resilience

As autonomous agents scale (e.g., high-frequency tool usage), distinguishing between transient network blips and permanent logic errors is critical for reliability.

Currently, many MCP implementations return generic errors, forcing consumer agents to guess whether to retry.

**Proposal:**
Adopt a standardized error schema (perhaps inspired by RFC 7807 Problem Details) for MCP responses that explicitly signals:
1. **Retry-ability**: Should the agent try again? (e.g., Rate Limits, Temporary Downtime)
2. **Backoff Hint**: `retry_after_ms` field.
3. **State Validity**: Did the tool partially execute or fail atomically?

This would allow platforms like Klavis to implement robust, self-healing execution loops without custom handling for every tool provider.

Would love to see this discussed as part of the core resilience patterns.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Standardizing MCP Error Handling for Autonomous Agent Resilience #1094

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[RFC] Standardizing MCP Error Handling for Autonomous Agent Resilience #1094

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions