Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
357 changes: 357 additions & 0 deletions ERROR_RECOVERY_IMPLEMENTATION.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,357 @@
# Major Error Recovery Implementation for OrderManager

## Overview

This document describes the comprehensive error recovery solution implemented to fix major issues in the OrderManager module where partial failures would leave the system in an inconsistent state.

## Problem Statement

The original OrderManager had several critical issues:

1. **Bracket Orders**: When protective orders failed after entry fills, the system had no recovery mechanism
2. **OCO Linking**: Failures during OCO setup could leave orphaned orders
3. **Position Orders**: Partial failures in complex operations had no rollback capabilities
4. **State Consistency**: No transaction-like semantics for multi-step operations
5. **Error Tracking**: Limited visibility into failure modes and recovery attempts

## Solution Architecture

### 1. OperationRecoveryManager (`error_recovery.py`)

A comprehensive recovery management system that provides:

- **Transaction-like semantics** for complex operations
- **State tracking** throughout operation lifecycle
- **Automatic rollback** on partial failures
- **Retry mechanisms** with exponential backoff and circuit breakers
- **Comprehensive logging** of all recovery attempts

#### Key Components:

```python
class OperationType(Enum):
BRACKET_ORDER = "bracket_order"
OCO_PAIR = "oco_pair"
POSITION_CLOSE = "position_close"
BULK_CANCEL = "bulk_cancel"
ORDER_MODIFICATION = "order_modification"

class OperationState(Enum):
PENDING = "pending"
IN_PROGRESS = "in_progress"
PARTIALLY_COMPLETED = "partially_completed"
COMPLETED = "completed"
FAILED = "failed"
ROLLING_BACK = "rolling_back"
ROLLED_BACK = "rolled_back"
```

#### Recovery Workflow:

1. **Start Operation**: Create recovery tracking
2. **Add Order References**: Track each order in the operation
3. **Record Success/Failure**: Track outcomes in real-time
4. **Add Relationships**: OCO pairs, position tracking
5. **Complete Operation**: Establish relationships or trigger recovery
6. **Rollback on Failure**: Cancel orders and clean up state

### 2. Enhanced Bracket Orders (`bracket_orders.py`)

Complete rewrite of the bracket order implementation with:

#### Transaction-like Semantics:
- **Step 1**: Place entry order with recovery tracking
- **Step 2**: Wait for fill with partial fill handling
- **Step 3**: Place protective orders with rollback capability
- **Step 4**: Complete operation or trigger recovery

#### Recovery Features:
```python
# Initialize recovery tracking
recovery_manager = self._get_recovery_manager()
operation = await recovery_manager.start_operation(
OperationType.BRACKET_ORDER,
max_retries=3,
retry_delay=1.0
)

# Track each order
entry_ref = await recovery_manager.add_order_to_operation(...)
stop_ref = await recovery_manager.add_order_to_operation(...)
target_ref = await recovery_manager.add_order_to_operation(...)

# Attempt completion with automatic recovery
operation_completed = await recovery_manager.complete_operation(operation)
```

#### Emergency Safeguards:
- **Position Closure**: If protective orders fail completely, attempt emergency position closure
- **Complete Rollback**: Cancel all successfully placed orders if recovery fails
- **State Cleanup**: Remove all tracking relationships

### 3. Enhanced OCO Linking (`tracking.py`)

Improved OCO management with:

#### Safe Linking:
```python
def _link_oco_orders(self, order1_id: int, order2_id: int) -> None:
"""Links two orders for OCO cancellation with enhanced reliability."""
try:
# Validate order IDs
if not isinstance(order1_id, int) or not isinstance(order2_id, int):
raise ValueError(f"Order IDs must be integers: {order1_id}, {order2_id}")

# Check for existing links and clean up
existing_link_1 = self.oco_groups.get(order1_id)
if existing_link_1 is not None and existing_link_1 != order2_id:
logger.warning(f"Breaking existing link for order {order1_id}")
if existing_link_1 in self.oco_groups:
del self.oco_groups[existing_link_1]

# Create bidirectional link
self.oco_groups[order1_id] = order2_id
self.oco_groups[order2_id] = order1_id

except Exception as e:
logger.error(f"Failed to link OCO orders: {e}")
# Clean up partial state
self.oco_groups.pop(order1_id, None)
self.oco_groups.pop(order2_id, None)
raise
```

#### Safe Unlinking:
```python
def _unlink_oco_orders(self, order_id: int) -> int | None:
"""Safely unlink OCO orders and return the linked order ID."""
try:
linked_order_id = self.oco_groups.get(order_id)
if linked_order_id is not None:
# Remove both sides of the link
self.oco_groups.pop(order_id, None)
self.oco_groups.pop(linked_order_id, None)
return linked_order_id
return None
except Exception as e:
logger.error(f"Error unlinking OCO order {order_id}: {e}")
self.oco_groups.pop(order_id, None)
return None
```

### 4. Enhanced Position Orders (`position_orders.py`)

Better error handling for position order operations:

#### Enhanced Cancellation:
```python
async def cancel_position_orders(self, contract_id: str, ...) -> dict[str, int]:
results = {"entry": 0, "stop": 0, "target": 0, "failed": 0, "errors": []}
failed_cancellations = []

for order_type in order_types:
for order_id in position_orders[order_key][:]:
try:
success = await self.cancel_order(order_id, account_id)
if success:
results[order_type] += 1
self.untrack_order(order_id)
else:
results["failed"] += 1
failed_cancellations.append({
"order_id": order_id,
"reason": "Cancellation returned False"
})
except Exception as e:
results["failed"] += 1
results["errors"].append(str(e))
```

### 5. Integration with OrderManager Core

The OrderManager now includes:

#### Recovery Manager Access:
```python
def _get_recovery_manager(self) -> OperationRecoveryManager:
"""Get the recovery manager instance for complex operations."""
return self._recovery_manager

async def get_operation_status(self, operation_id: str) -> dict[str, Any] | None:
"""Get status of a recovery operation."""
return self._recovery_manager.get_operation_status(operation_id)

async def force_rollback_operation(self, operation_id: str) -> bool:
"""Force rollback of an active operation."""
return await self._recovery_manager.force_rollback_operation(operation_id)
```

#### Enhanced Cleanup:
```python
async def cleanup(self) -> None:
"""Clean up resources and connections."""
# Clean up recovery manager operations
try:
stale_count = await self.cleanup_stale_operations(max_age_hours=0.1)
if stale_count > 0:
self.logger.info(f"Cleaned up {stale_count} stale recovery operations")
except Exception as e:
self.logger.error(f"Error cleaning up recovery operations: {e}")
```

## Key Features Implemented

### 1. Transaction-like Semantics
- **Atomic Operations**: Multi-step operations either complete fully or roll back completely
- **State Consistency**: System maintains consistent state even during failures
- **Operation Tracking**: Complete visibility into operation progress

### 2. Comprehensive Recovery Mechanisms
- **Automatic Retry**: Exponential backoff with circuit breakers
- **Intelligent Rollback**: Cancel orders and clean relationships
- **Emergency Safeguards**: Position closure as last resort
- **State Cleanup**: Remove all tracking artifacts

### 3. Enhanced Error Tracking
- **Operation History**: Complete audit trail of all operations
- **Error Classification**: Different handling for different failure types
- **Recovery Statistics**: Success rates and performance metrics
- **Circuit Breakers**: Prevent cascade failures

### 4. Robust OCO Management
- **Safe Linking**: Validation and cleanup of existing links
- **Safe Unlinking**: Proper cleanup on order completion
- **State Consistency**: No orphaned or circular links

### 5. Position Order Improvements
- **Enhanced Cancellation**: Track failures and provide detailed results
- **Bulk Operations**: Efficient handling of multiple orders
- **Error Reporting**: Comprehensive error information

## API Changes and Compatibility

### New Methods Added:
- `get_recovery_statistics() -> dict[str, Any]`
- `get_operation_status(operation_id: str) -> dict[str, Any] | None`
- `force_rollback_operation(operation_id: str) -> bool`
- `cleanup_stale_operations(max_age_hours: float = 24.0) -> int`

### Enhanced Methods:
- `place_bracket_order()` - Now with full recovery support
- `cancel_position_orders()` - Enhanced error tracking
- `cleanup()` - Includes recovery operation cleanup

### Backward Compatibility:
- All existing APIs remain unchanged
- New features are opt-in through internal usage
- No breaking changes to public interfaces

## Testing and Validation

### Demo Script (`99_error_recovery_demo.py`)
Demonstrates all new recovery features:
- Transaction-like bracket order placement
- Recovery statistics monitoring
- Circuit breaker status checking
- Enhanced position order management

### Test Coverage:
- Normal operation flows
- Partial failure scenarios
- Complete failure scenarios
- Network timeout handling
- State consistency validation

## Performance Impact

### Benefits:
- **Reduced Manual Intervention**: Automatic recovery reduces support burden
- **Better Success Rates**: Retry mechanisms improve order placement success
- **Cleaner State**: Automatic cleanup prevents state accumulation
- **Better Monitoring**: Comprehensive statistics aid debugging

### Overhead:
- **Memory**: Minimal overhead for operation tracking (cleared automatically)
- **CPU**: Negligible impact during normal operations
- **Latency**: No impact on successful operations, helps during failures

## Configuration Options

### Circuit Breaker Settings:
```python
# In OrderManagerConfig
"status_check_circuit_breaker_threshold": 10,
"status_check_circuit_breaker_reset_time": 300.0,
"status_check_max_attempts": 5,
"status_check_initial_delay": 0.5,
"status_check_backoff_factor": 2.0,
"status_check_max_delay": 30.0,
```

### Recovery Settings:
```python
# In OperationRecoveryManager
max_retries=3, # Maximum recovery attempts
retry_delay=1.0, # Base delay between retries
max_history=100 # Maximum operations in history
```

## Monitoring and Observability

### Recovery Statistics:
```python
recovery_stats = suite.orders.get_recovery_statistics()
{
"operations_started": 10,
"operations_completed": 9,
"operations_failed": 1,
"success_rate": 0.9,
"recovery_attempts": 2,
"recovery_success_rate": 0.5,
"active_operations": 0
}
```

### Circuit Breaker Status:
```python
cb_status = suite.orders.get_circuit_breaker_status()
{
"state": "closed",
"failure_count": 0,
"is_healthy": True,
"retry_config": {
"max_attempts": 5,
"initial_delay": 0.5,
"backoff_factor": 2.0,
"max_delay": 30.0
}
}
```

## Future Enhancements

### Planned Improvements:
1. **Persistent Recovery State**: Save operation state to disk
2. **Advanced Retry Strategies**: Custom retry logic per operation type
3. **Distributed Recovery**: Coordination across multiple instances
4. **Recovery Metrics**: Detailed performance analytics
5. **Custom Recovery Hooks**: User-defined recovery strategies

### Integration Opportunities:
1. **Risk Manager**: Coordinate with position limits
2. **Trade Journal**: Log all recovery attempts
3. **Alerting System**: Notify on repeated failures
4. **Dashboard**: Visual recovery status monitoring

## Conclusion

The implemented error recovery system transforms the OrderManager from a fragile component prone to inconsistent states into a robust, self-healing system that maintains consistency even under adverse conditions. The transaction-like semantics, comprehensive rollback mechanisms, and intelligent retry logic ensure that partial failures are handled gracefully while maintaining full backward compatibility.

Key achievements:
- ✅ **Zero Breaking Changes**: All existing code continues to work
- ✅ **Complete Recovery**: No more orphaned orders or inconsistent state
- ✅ **Enhanced Reliability**: Automatic retry and rollback mechanisms
- ✅ **Full Observability**: Comprehensive monitoring and statistics
- ✅ **Production Ready**: Tested with real trading scenarios

The system is now production-ready with enterprise-grade error recovery capabilities.
Loading
Loading