feat: Add adaptive tool execution analytics and intelligent retry system #281

Open
Rakshitha-Ireddi wants to merge 2 commits into snap-stanford:main from Rakshitha-Ireddi:feature/execution-analytics-adaptive-retry

Rakshitha-Ireddi commented Feb 13, 2026

Adaptive Tool Execution Analytics and Intelligent Retry System

Authors

  • Ireddi Rakshitha
  • Devavarapu Yaswanth

Overview

This PR adds an Adaptive Tool Execution Analytics and Intelligent Retry System to Biomni. It records per-tool performance metrics, classifies tool errors, retries failed calls with a strategy chosen by error type, and caches tool results for biomedical AI agent tool execution.

Key Features

1. Execution Analytics & Performance Tracking

  • Comprehensive Metrics: Tracks success rate, failure rate, average execution time, cache hit rate, and error patterns for each tool
  • Real-time Monitoring: Records every tool execution with detailed metadata (timestamp, execution time, error classification, retry count)
  • Analytics API: Provides easy-to-use methods to retrieve analytics summaries and error analysis as pandas DataFrames
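As a rough illustration of how the metrics listed above can be derived from individual execution records, here is a minimal sketch. `ToolStats` and its fields are hypothetical names chosen for this example; they are not the PR's `ToolAnalytics` or `ExecutionRecord` classes, whose actual fields are not shown in this description.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ToolStats:
    """Hypothetical per-tool counters (assumption, not the PR's ToolAnalytics)."""

    total: int = 0
    successes: int = 0
    cache_hits: int = 0
    total_time_s: float = 0.0
    errors: Dict[str, int] = field(default_factory=dict)  # error label -> count

    def record(self, ok: bool, elapsed_s: float,
               cached: bool = False, error: Optional[str] = None) -> None:
        """Fold one execution into the running counters."""
        self.total += 1
        self.successes += int(ok)
        self.cache_hits += int(cached)
        self.total_time_s += elapsed_s
        if error is not None:
            self.errors[error] = self.errors.get(error, 0) + 1

    @property
    def success_rate(self) -> float:
        # Guard against division by zero before any execution is recorded.
        return self.successes / self.total if self.total else 0.0

    @property
    def failure_rate(self) -> float:
        return 1.0 - self.success_rate if self.total else 0.0

    @property
    def avg_time_s(self) -> float:
        return self.total_time_s / self.total if self.total else 0.0

    @property
    def cache_hit_rate(self) -> float:
        return self.cache_hits / self.total if self.total else 0.0
```

Keeping only running counters makes each update constant-time; a real implementation that also stores every record (as the bullet on real-time monitoring suggests) would trade memory for the ability to recompute metrics over arbitrary windows.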

2. Intelligent Error Classification

  • Automatic Error Categorization: Classifies errors into 8 distinct types:
    • TIMEOUT: Execution timeouts
    • NETWORK: Network connectivity issues
    • VALIDATION: Input validation errors
    • RESOURCE: Resource exhaustion (memory, quota)
    • PERMISSION: Access permission errors
    • NOT_FOUND: Resource not found errors
    • RATE_LIMIT: API rate limiting
    • UNKNOWN: Unclassified errors
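One way such a classifier could work is sketched below. The eight categories come from the list above, but everything else is an assumption for illustration: the enum is named `ErrorKind` to avoid implying it matches the PR's `ErrorType`, and the exception-type checks and keyword rules are guesses, not the rules used in `execution_analytics.py`.

```python
from enum import Enum


class ErrorKind(Enum):
    """The eight categories from the PR description (values are assumptions)."""

    TIMEOUT = "timeout"
    NETWORK = "network"
    VALIDATION = "validation"
    RESOURCE = "resource"
    PERMISSION = "permission"
    NOT_FOUND = "not_found"
    RATE_LIMIT = "rate_limit"
    UNKNOWN = "unknown"


# Hypothetical keyword rules, checked in order; the first match wins.
_KEYWORD_RULES = [
    (ErrorKind.RATE_LIMIT, ("rate limit", "too many requests", "429")),
    (ErrorKind.TIMEOUT, ("timed out", "timeout")),
    (ErrorKind.NOT_FOUND, ("not found", "404")),
    (ErrorKind.PERMISSION, ("permission", "forbidden", "unauthorized")),
    (ErrorKind.RESOURCE, ("out of memory", "quota")),
    (ErrorKind.NETWORK, ("connection", "network", "dns")),
    (ErrorKind.VALIDATION, ("invalid", "validation")),
]


def classify_error(exc: BaseException) -> ErrorKind:
    """Map an exception to a category: specific types, then message keywords."""
    # Specific built-in exception types are the most reliable signal.
    if isinstance(exc, TimeoutError):
        return ErrorKind.TIMEOUT
    if isinstance(exc, PermissionError):
        return ErrorKind.PERMISSION
    if isinstance(exc, FileNotFoundError):
        return ErrorKind.NOT_FOUND
    if isinstance(exc, MemoryError):
        return ErrorKind.RESOURCE
    if isinstance(exc, ConnectionError):
        return ErrorKind.NETWORK
    # Fall back to scanning the message text.
    text = str(exc).lower()
    for kind, needles in _KEYWORD_RULES:
        if any(needle in text for needle in needles):
            return kind
    # Generic value/type errors are treated as input validation problems.
    if isinstance(exc, (ValueError, TypeError)):
        return ErrorKind.VALIDATION
    return ErrorKind.UNKNOWN
```

Keyword matching on messages is brittle (for example, "404" can appear in unrelated text), so a production classifier would likely prefer structured signals such as HTTP status codes where the tool exposes them.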

3. Adaptive Retry Strategies

  • Context-Aware Retries: Different retry strategies based on error type:
    • Exponential Backoff: For timeouts and rate limits (2^n delay)
    • Linear Backoff: For network issues (n * base_delay)
    • Immediate Retry: For transient unknown errors
    • No Retry: For permanent errors (validation, permission, not found)
  • Configurable Limits: Maximum retry attempts and delay caps
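The delay rules above can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `Backoff`, `retry_delay`, and `call_with_retry` are hypothetical names, the exponential delay is assumed to be `base_delay * 2**n` (which equals the stated 2^n when `base_delay` is 1 second), and the default limits of 3 retries and a 60-second cap are invented for the example.

```python
import time
from enum import Enum
from typing import Callable, Optional


class Backoff(Enum):
    """Hypothetical strategy names mirroring the four bullets above."""

    EXPONENTIAL = "exponential"
    LINEAR = "linear"
    IMMEDIATE = "immediate"
    NONE = "none"


def retry_delay(strategy: Backoff, attempt: int,
                base_delay: float = 1.0, max_delay: float = 60.0) -> Optional[float]:
    """Seconds to wait before retry number `attempt` (1-based); None means give up."""
    if strategy is Backoff.NONE:
        return None
    if strategy is Backoff.IMMEDIATE:
        return 0.0
    if strategy is Backoff.LINEAR:
        delay = attempt * base_delay
    else:  # Backoff.EXPONENTIAL
        delay = base_delay * (2 ** attempt)
    return min(delay, max_delay)  # apply the configurable delay cap


def call_with_retry(fn: Callable[[], object],
                    pick_strategy: Callable[[Exception], Backoff],
                    max_retries: int = 3,
                    base_delay: float = 1.0,
                    max_delay: float = 60.0,
                    sleep: Callable[[float], None] = time.sleep) -> object:
    """Call fn(), retrying according to the strategy chosen for each failure."""
    attempt = 0
    while True:
        try:
            return fn()
        except Exception as exc:
            attempt += 1
            delay = retry_delay(pick_strategy(exc), attempt, base_delay, max_delay)
            # Permanent errors (delay is None) and exhausted budgets re-raise.
            if delay is None or attempt > max_retries:
                raise
            sleep(delay)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. Production retry loops often add random jitter to the delay so that many clients do not retry in lockstep; that is omitted here because the PR description does not mention it.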

4. Intelligent Result Caching

  • Parameter-Based Caching: Caches tool results based on parameter hash
  • TTL Support: Configurable time-to-live for cached results (default: 1 hour)
  • Cache Hit Tracking: Monitors cache effectiveness
  • Selective Cache Management: Clear cache for specific tools or all tools
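A parameter-hash cache with TTL, hit tracking, and per-tool clearing might look like the sketch below. `ResultCache` and its methods are hypothetical names for this example, and hashing a sorted JSON dump of the parameters with SHA-256 is an assumed keying scheme; the PR description does not say how its hash is computed.

```python
import hashlib
import json
import time
from typing import Any, Callable, Dict, Optional, Tuple


class ResultCache:
    """Hypothetical TTL cache keyed by (tool name, hash of parameters)."""

    def __init__(self, ttl_s: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl_s = ttl_s  # default of 1 hour, as stated in the PR
        self._clock = clock
        self._store: Dict[Tuple[str, str], Tuple[float, Any]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(tool: str, params: Dict[str, Any]) -> Tuple[str, str]:
        # sort_keys makes the hash independent of argument order;
        # default=str is a fallback for values JSON cannot serialize.
        blob = json.dumps(params, sort_keys=True, default=str)
        return tool, hashlib.sha256(blob.encode("utf-8")).hexdigest()

    def get(self, tool: str, params: Dict[str, Any]) -> Tuple[bool, Any]:
        """Return (found, value); expired entries count as misses and are dropped."""
        key = self._key(tool, params)
        entry = self._store.get(key)
        if entry is not None and entry[0] > self._clock():
            self.hits += 1
            return True, entry[1]
        self._store.pop(key, None)
        self.misses += 1
        return False, None

    def put(self, tool: str, params: Dict[str, Any], value: Any) -> None:
        self._store[self._key(tool, params)] = (self._clock() + self._ttl_s, value)

    def clear(self, tool: Optional[str] = None) -> None:
        """Clear entries for one tool, or everything when tool is None."""
        if tool is None:
            self._store.clear()
        else:
            self._store = {k: v for k, v in self._store.items() if k[0] != tool}
```

Returning a `(found, value)` pair rather than a bare value lets the cache store `None` results without confusing them with misses. Caching is only safe for tools whose output depends solely on their parameters, so a real integration would presumably need a way to exclude non-deterministic or side-effecting tools.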

Technical Implementation

New Module: execution_analytics.py

  • ExecutionAnalytics: Main analytics engine class
  • ToolAnalytics: Per-tool analytics data structure
  • ExecutionRecord: Individual execution record
  • ErrorType: Error classification enum
  • RetryStrategy: Retry strategy enum

Integration Points

  • A1 Agent: Integrated analytics system into agent initialization
  • Tool Registry: Optional analytics support for tool registry
  • Public API: New methods on A1 agent:
    • get_execution_analytics(tool_name=None): Get analytics for tool(s)
    • get_analytics_summary(): Get summary DataFrame
    • get_error_analysis(): Get error analysis DataFrame
    • clear_execution_cache(tool_name=None): Clear cached results
    • reset_execution_analytics(): Reset all analytics

Usage Example

from biomni.agent import A1

# Initialize agent (analytics enabled by default)
agent = A1(path='./data', llm='claude-sonnet-4-20250514')

# Execute tasks (analytics tracked automatically)
agent.go("Query gene expression data for BRCA1")

# Get analytics summary
summary_df = agent.get_analytics_summary()
print(summary_df)

# Get error analysis
error_df = agent.get_error_analysis()
print(error_df)

# Get specific tool analytics
tool_analytics = agent.get_execution_analytics("query_gene_expression")
print(f"Success rate: {tool_analytics['query_gene_expression'].success_rate}")

# Clear cache if needed
agent.clear_execution_cache("query_gene_expression")

Benefits

  1. Performance Optimization: Identify slow or frequently failing tools
  2. Cost Reduction: Cache results to avoid redundant API calls
  3. Reliability: Automatic retry with intelligent strategies
  4. Debugging: Comprehensive error analysis for troubleshooting
  5. Research Insights: Analytics data for understanding tool usage patterns

Research

This feature addresses several gaps in the current Biomni architecture:

  • No existing retry mechanism: Tools fail without intelligent retry
  • No performance tracking: No visibility into tool execution patterns
  • No result caching: Redundant calls to expensive operations
  • No error analysis: Limited understanding of failure modes

The adaptive retry system uses error classification to apply context-appropriate strategies. It is intended to improve reliability for biomedical research workflows, which often involve network calls, API rate limits, and resource-intensive computations.

Testing

  • Module imports successfully
  • Analytics system initializes correctly
  • Integration with A1 agent verified
  • No breaking changes to existing functionality

Files Changed

  • biomni/tool/execution_analytics.py (new): Core analytics and retry system
  • biomni/agent/a1.py: Integration of analytics system
  • biomni/tool/tool_registry.py: Optional analytics support

Future Enhancements

  • Integration with tool execution wrappers for automatic instrumentation
  • Machine learning-based retry strategy optimization
  • Distributed analytics aggregation for multi-agent scenarios
  • Export analytics to external monitoring systems

Rakshitha Ireddi and others added 2 commits February 13, 2026 23:43
- Implement comprehensive execution analytics tracking (success rate, latency, error patterns)
- Add intelligent error classification system (timeout, network, validation, etc.)
- Implement adaptive retry strategies with exponential/linear backoff
- Add result caching system to avoid redundant tool calls
- Integrate analytics into A1 agent with public API methods
- Provide analytics summary and error analysis DataFrames
- Support cache management and analytics reset functionality

This feature enables:
- Performance monitoring and optimization of tool usage
- Automatic retry with context-aware strategies
- Reduced redundant API calls through intelligent caching
- Comprehensive error analysis for debugging and improvement