As autonomous agents scale (e.g., high-frequency tool usage), distinguishing between transient network blips and permanent logic errors is critical for reliability.
Currently, many MCP implementations return generic errors, forcing consumer agents to guess whether to retry.
Proposal:
Adopt a standardized error schema (perhaps inspired by RFC 7807 Problem Details) for MCP responses that explicitly signals:
- Retry-ability: Should the agent try again? (e.g., Rate Limits, Temporary Downtime)
- Backoff Hint:
retry_after_ms field.
- State Validity: Did the tool partially execute or fail atomically?
This would allow platforms like Klavis to implement robust, self-healing execution loops without custom handling for every tool provider.
Would love to see this discussed as part of the core resilience patterns.