Commit 6b3c3c0

Merge pull request #51 from TexasCoding/fix/order-manager-critical-issues
2 parents ce8b57e + 450b8d0 commit 6b3c3c0

29 files changed: +8554 -306 lines changed

ERROR_RECOVERY_IMPLEMENTATION.md

Lines changed: 357 additions & 0 deletions
@@ -0,0 +1,357 @@
# Major Error Recovery Implementation for OrderManager

## Overview

This document describes the error recovery solution added to the OrderManager module. It addresses cases where a partial failure in a multi-step operation left the system in an inconsistent state.

## Problem Statement

The original OrderManager had several critical issues:

1. **Bracket Orders**: When protective orders failed after the entry order filled, the system had no recovery mechanism
2. **OCO Linking**: Failures during OCO setup could leave orphaned orders
3. **Position Orders**: Partial failures in complex operations could not be rolled back
4. **State Consistency**: Multi-step operations had no transaction-like semantics
5. **Error Tracking**: Limited visibility into failure modes and recovery attempts

## Solution Architecture

### 1. OperationRecoveryManager (`error_recovery.py`)

A recovery management system that provides:

- **Transaction-like semantics** for complex operations
- **State tracking** throughout the operation lifecycle
- **Automatic rollback** on partial failures
- **Retry mechanisms** with exponential backoff and circuit breakers
- **Logging** of all recovery attempts

#### Key Components:

```python
from enum import Enum


class OperationType(Enum):
    BRACKET_ORDER = "bracket_order"
    OCO_PAIR = "oco_pair"
    POSITION_CLOSE = "position_close"
    BULK_CANCEL = "bulk_cancel"
    ORDER_MODIFICATION = "order_modification"


class OperationState(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    PARTIALLY_COMPLETED = "partially_completed"
    COMPLETED = "completed"
    FAILED = "failed"
    ROLLING_BACK = "rolling_back"
    ROLLED_BACK = "rolled_back"
```

#### Recovery Workflow:

1. **Start Operation**: Create a recovery-tracking record for the operation
2. **Add Order References**: Track each order in the operation
3. **Record Success/Failure**: Track outcomes in real time
4. **Add Relationships**: Register OCO pairs and position tracking
5. **Complete Operation**: Establish the relationships, or trigger recovery if any step failed
6. **Rollback on Failure**: Cancel orders and clean up state

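The workflow above can be sketched as a small in-memory tracker. This is an illustrative sketch only, not the `OperationRecoveryManager` implementation: the `TrackedOperation` class, its method names, the reduced `State` enum, and the `cancel` callback are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import Callable


class State(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    ROLLED_BACK = "rolled_back"


@dataclass
class TrackedOperation:
    """Hypothetical tracker: records each order outcome for one operation."""

    state: State = State.PENDING
    placed: list[int] = field(default_factory=list)    # IDs of orders placed successfully
    failures: list[str] = field(default_factory=list)  # reasons for failed steps

    def record_success(self, order_id: int) -> None:
        self.state = State.IN_PROGRESS
        self.placed.append(order_id)

    def record_failure(self, reason: str) -> None:
        self.state = State.IN_PROGRESS
        self.failures.append(reason)

    def complete(self, cancel: Callable[[int], None]) -> bool:
        """Finish the operation, or cancel every placed order if any step failed."""
        if not self.failures:
            self.state = State.COMPLETED
            return True
        for order_id in reversed(self.placed):  # undo in reverse placement order
            cancel(order_id)
        self.placed.clear()
        self.state = State.ROLLED_BACK
        return False


cancelled: list[int] = []
op = TrackedOperation()
op.record_success(101)                 # entry order placed
op.record_success(102)                 # stop order placed
op.record_failure("target rejected")   # target order failed
ok = op.complete(cancel=cancelled.append)
print(ok, op.state.value, cancelled)   # False rolled_back [102, 101]
```

A real recovery manager would retry the failed step before rolling back; the sketch skips retries to keep the all-or-nothing behaviour visible.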
### 2. Enhanced Bracket Orders (`bracket_orders.py`)

The bracket order implementation was rewritten to provide:

#### Transaction-like Semantics:
- **Step 1**: Place entry order with recovery tracking
- **Step 2**: Wait for fill with partial fill handling
- **Step 3**: Place protective orders with rollback capability
- **Step 4**: Complete operation or trigger recovery

#### Recovery Features:
```python
# Excerpt: runs inside an async OrderManager method; arguments elided in the original

# Initialize recovery tracking
recovery_manager = self._get_recovery_manager()
operation = await recovery_manager.start_operation(
    OperationType.BRACKET_ORDER,
    max_retries=3,
    retry_delay=1.0
)

# Track each order
entry_ref = await recovery_manager.add_order_to_operation(...)
stop_ref = await recovery_manager.add_order_to_operation(...)
target_ref = await recovery_manager.add_order_to_operation(...)

# Attempt completion with automatic recovery
operation_completed = await recovery_manager.complete_operation(operation)
```

#### Emergency Safeguards:
- **Position Closure**: If the protective orders cannot be placed at all, attempt an emergency closure of the position
- **Complete Rollback**: Cancel all successfully placed orders if recovery fails
- **State Cleanup**: Remove all tracking relationships

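The order in which these safeguards apply can be illustrated with a toy example. This is a sketch under stated assumptions, not the code in `bracket_orders.py`: `FakeBroker`, its methods, and `protect_or_unwind` are hypothetical, and the fake stop order always fails so that the fallback path runs.

```python
import asyncio


class FakeBroker:
    """Hypothetical stand-in for a broker client; stop orders always fail here."""

    def __init__(self) -> None:
        self.position = 2                       # contracts held after the entry fill
        self.tracked_orders: list[str] = ["entry"]

    async def place_stop(self) -> str:
        raise RuntimeError("stop order rejected")

    async def close_position(self) -> None:
        self.position = 0

    async def untrack(self, order: str) -> None:
        self.tracked_orders.remove(order)


async def protect_or_unwind(broker: FakeBroker, max_retries: int = 3) -> str:
    """Try to place the protective stop; if every attempt fails, flatten and clean up."""
    for _ in range(max_retries):
        try:
            broker.tracked_orders.append(await broker.place_stop())
            return "protected"
        except RuntimeError:
            await asyncio.sleep(0)              # a real system would back off here
    # Emergency safeguard: do not leave a filled entry without protection.
    await broker.close_position()
    for order in list(broker.tracked_orders):   # state cleanup
        await broker.untrack(order)
    return "unwound"


broker = FakeBroker()
outcome = asyncio.run(protect_or_unwind(broker))
print(outcome, broker.position, broker.tracked_orders)   # unwound 0 []
```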
### 3. Enhanced OCO Linking (`tracking.py`)

Improved OCO management with:

#### Safe Linking:
```python
def _link_oco_orders(self, order1_id: int, order2_id: int) -> None:
    """Links two orders for OCO cancellation with enhanced reliability."""
    try:
        # Validate order IDs
        if not isinstance(order1_id, int) or not isinstance(order2_id, int):
            raise ValueError(f"Order IDs must be integers: {order1_id}, {order2_id}")

        # Check for existing links and clean up
        existing_link_1 = self.oco_groups.get(order1_id)
        if existing_link_1 is not None and existing_link_1 != order2_id:
            logger.warning(f"Breaking existing link for order {order1_id}")
            if existing_link_1 in self.oco_groups:
                del self.oco_groups[existing_link_1]

        # Same check for the second order, so its old partner is not left
        # pointing at an order that is now linked elsewhere
        existing_link_2 = self.oco_groups.get(order2_id)
        if existing_link_2 is not None and existing_link_2 != order1_id:
            logger.warning(f"Breaking existing link for order {order2_id}")
            if existing_link_2 in self.oco_groups:
                del self.oco_groups[existing_link_2]

        # Create bidirectional link
        self.oco_groups[order1_id] = order2_id
        self.oco_groups[order2_id] = order1_id

    except Exception as e:
        logger.error(f"Failed to link OCO orders: {e}")
        # Clean up partial state
        self.oco_groups.pop(order1_id, None)
        self.oco_groups.pop(order2_id, None)
        raise
```

#### Safe Unlinking:
```python
def _unlink_oco_orders(self, order_id: int) -> int | None:
    """Safely unlink OCO orders and return the linked order ID."""
    try:
        linked_order_id = self.oco_groups.get(order_id)
        if linked_order_id is not None:
            # Remove both sides of the link
            self.oco_groups.pop(order_id, None)
            self.oco_groups.pop(linked_order_id, None)
            return linked_order_id
        return None
    except Exception as e:
        logger.error(f"Error unlinking OCO order {order_id}: {e}")
        self.oco_groups.pop(order_id, None)
        return None
```

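The invariant behind both methods is that every link is mirrored: if `oco_groups[a] == b`, then `oco_groups[b] == a`. The standalone sketch below checks that invariant after relinking and unlinking. `OcoTracker` and `is_consistent` are hypothetical names used for illustration; only the bidirectional dictionary layout is taken from the code above.

```python
from __future__ import annotations


class OcoTracker:
    """Hypothetical minimal tracker using the same bidirectional dict layout."""

    def __init__(self) -> None:
        self.oco_groups: dict[int, int] = {}

    def link(self, a: int, b: int) -> None:
        # Drop any previous partner of either order so no one-sided link survives.
        for order_id in (a, b):
            old = self.oco_groups.pop(order_id, None)
            if old is not None:
                self.oco_groups.pop(old, None)
        self.oco_groups[a] = b
        self.oco_groups[b] = a

    def unlink(self, order_id: int) -> int | None:
        partner = self.oco_groups.pop(order_id, None)
        if partner is not None:
            self.oco_groups.pop(partner, None)
        return partner

    def is_consistent(self) -> bool:
        """Every link must be mirrored: groups[a] == b implies groups[b] == a."""
        return all(self.oco_groups.get(b) == a for a, b in self.oco_groups.items())


tracker = OcoTracker()
tracker.link(1, 2)
tracker.link(1, 3)                  # relinking 1 also removes the stale entry for 2
print(tracker.oco_groups)           # {1: 3, 3: 1}
print(tracker.is_consistent())      # True
print(tracker.unlink(3), tracker.oco_groups)   # 1 {}
```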
### 4. Enhanced Position Orders (`position_orders.py`)

Better error handling for position order operations:

#### Enhanced Cancellation:
```python
# Excerpt: the remaining parameters and the setup of order_types,
# position_orders and order_key are elided in the original
async def cancel_position_orders(self, contract_id: str, ...) -> dict[str, Any]:
    # "errors" holds a list, so the result is dict[str, Any] rather than dict[str, int]
    results = {"entry": 0, "stop": 0, "target": 0, "failed": 0, "errors": []}
    failed_cancellations = []

    for order_type in order_types:
        # Iterate over a copy because untrack_order() may mutate the list
        for order_id in position_orders[order_key][:]:
            try:
                success = await self.cancel_order(order_id, account_id)
                if success:
                    results[order_type] += 1
                    self.untrack_order(order_id)
                else:
                    results["failed"] += 1
                    failed_cancellations.append({
                        "order_id": order_id,
                        "reason": "Cancellation returned False"
                    })
            except Exception as e:
                results["failed"] += 1
                results["errors"].append(str(e))
```

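The key idea in the excerpt is that each cancellation is handled independently, so one failure does not stop the remaining orders from being cancelled. A self-contained sketch of that pattern follows; `cancel_all` and `fake_cancel` are hypothetical helpers, and the result keys are simplified relative to the original.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def cancel_all(
    order_ids: list[int], cancel: Callable[[int], Awaitable[bool]]
) -> dict[str, Any]:
    """Cancel each order independently and collect a per-order summary."""
    results: dict[str, Any] = {"cancelled": 0, "failed": 0, "errors": []}
    for order_id in order_ids:
        try:
            if await cancel(order_id):
                results["cancelled"] += 1
            else:
                results["failed"] += 1
                results["errors"].append(f"{order_id}: cancellation returned False")
        except Exception as exc:  # keep going; record the failure instead
            results["failed"] += 1
            results["errors"].append(f"{order_id}: {exc}")
    return results


async def fake_cancel(order_id: int) -> bool:
    """Hypothetical cancel call: order 2 is refused, order 3 raises."""
    if order_id == 3:
        raise ConnectionError("timeout")
    return order_id != 2


summary = asyncio.run(cancel_all([1, 2, 3], fake_cancel))
print(summary)
# {'cancelled': 1, 'failed': 2, 'errors': ['2: cancellation returned False', '3: timeout']}
```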
### 5. Integration with OrderManager Core

The OrderManager now includes:

#### Recovery Manager Access:
```python
def _get_recovery_manager(self) -> OperationRecoveryManager:
    """Get the recovery manager instance for complex operations."""
    return self._recovery_manager

async def get_operation_status(self, operation_id: str) -> dict[str, Any] | None:
    """Get status of a recovery operation."""
    return self._recovery_manager.get_operation_status(operation_id)

async def force_rollback_operation(self, operation_id: str) -> bool:
    """Force rollback of an active operation."""
    return await self._recovery_manager.force_rollback_operation(operation_id)
```

#### Enhanced Cleanup:
```python
async def cleanup(self) -> None:
    """Clean up resources and connections."""
    # Clean up recovery manager operations
    try:
        # max_age_hours=0.1 treats operations older than 6 minutes as stale
        stale_count = await self.cleanup_stale_operations(max_age_hours=0.1)
        if stale_count > 0:
            self.logger.info(f"Cleaned up {stale_count} stale recovery operations")
    except Exception as e:
        self.logger.error(f"Error cleaning up recovery operations: {e}")
```

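How an age-based cleanup of this kind might work is sketched below. It assumes operations are stored with their start time in epoch seconds; the `cleanup_stale` function and the dictionary layout are hypothetical and do not describe the internals of `cleanup_stale_operations`.

```python
from __future__ import annotations

import time


def cleanup_stale(
    operations: dict[str, float], max_age_hours: float, now: float | None = None
) -> int:
    """Remove operations older than max_age_hours and return how many were removed.

    `operations` maps a hypothetical operation ID to its start time (epoch seconds).
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600.0
    stale = [op_id for op_id, started in operations.items() if started < cutoff]
    for op_id in stale:
        del operations[op_id]
    return len(stale)


now = 1_000_000.0
ops = {"op-a": now - 600.0, "op-b": now - 60.0}            # 10 minutes old, 1 minute old
removed = cleanup_stale(ops, max_age_hours=0.1, now=now)   # 0.1 h = 6 minutes
print(removed, sorted(ops))                                # 1 ['op-b']
```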
## Key Features Implemented

### 1. Transaction-like Semantics
- **Atomic Operations**: Multi-step operations either complete fully or roll back completely
- **State Consistency**: System maintains consistent state even during failures
- **Operation Tracking**: Visibility into the progress of each operation

### 2. Recovery Mechanisms
- **Automatic Retry**: Exponential backoff with circuit breakers
- **Rollback**: Cancel orders and clean up relationships
- **Emergency Safeguards**: Position closure as last resort
- **State Cleanup**: Remove all tracking artifacts

### 3. Enhanced Error Tracking
- **Operation History**: Audit trail of operations
- **Error Classification**: Different handling for different failure types
- **Recovery Statistics**: Success rates and performance metrics
- **Circuit Breakers**: Prevent cascading failures

### 4. Robust OCO Management
- **Safe Linking**: Validation and cleanup of existing links
- **Safe Unlinking**: Proper cleanup on order completion
- **State Consistency**: No orphaned or circular links

### 5. Position Order Improvements
- **Enhanced Cancellation**: Track failures and provide detailed results
- **Bulk Operations**: Efficient handling of multiple orders
- **Error Reporting**: Per-order error information in the returned results

## API Changes and Compatibility

### New Methods Added:
- `get_recovery_statistics() -> dict[str, Any]`
- `get_operation_status(operation_id: str) -> dict[str, Any] | None`
- `force_rollback_operation(operation_id: str) -> bool`
- `cleanup_stale_operations(max_age_hours: float = 24.0) -> int`

### Enhanced Methods:
- `place_bracket_order()` - Now with full recovery support
- `cancel_position_orders()` - Enhanced error tracking
- `cleanup()` - Includes recovery operation cleanup

### Backward Compatibility:
- All existing APIs remain unchanged
- New features are opt-in through internal usage
- No breaking changes to public interfaces

## Testing and Validation

### Demo Script (`99_error_recovery_demo.py`)
Demonstrates all new recovery features:
- Transaction-like bracket order placement
- Recovery statistics monitoring
- Circuit breaker status checking
- Enhanced position order management

### Test Coverage:
- Normal operation flows
- Partial failure scenarios
- Complete failure scenarios
- Network timeout handling
- State consistency validation

## Performance Impact

### Benefits:
- **Reduced Manual Intervention**: Automatic recovery reduces support burden
- **Better Success Rates**: Retry mechanisms improve order placement success
- **Cleaner State**: Automatic cleanup prevents state accumulation
- **Better Monitoring**: Recovery statistics aid debugging

### Overhead:
- **Memory**: Minimal overhead for operation tracking (cleared automatically)
- **CPU**: Negligible impact during normal operations
- **Latency**: No impact on successful operations; recovery work runs only when a failure occurs

## Configuration Options

### Circuit Breaker Settings:
```python
# In OrderManagerConfig
"status_check_circuit_breaker_threshold": 10,
"status_check_circuit_breaker_reset_time": 300.0,
"status_check_max_attempts": 5,
"status_check_initial_delay": 0.5,
"status_check_backoff_factor": 2.0,
"status_check_max_delay": 30.0,
```

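The retry schedule implied by these values can be worked out directly. The sketch below assumes the delay before retry `n` (counting from zero) is `initial_delay * backoff_factor**n`, capped at `max_delay`; that formula is the usual definition of exponential backoff, but whether OrderManager applies it exactly this way is an assumption, and `backoff_delays` is a hypothetical helper.

```python
def backoff_delays(
    max_attempts: int, initial_delay: float, backoff_factor: float, max_delay: float
) -> list[float]:
    """Delay before each retry: initial_delay * factor**n, capped at max_delay."""
    return [
        min(initial_delay * backoff_factor**n, max_delay) for n in range(max_attempts)
    ]


# With the values shown in the settings above:
print(backoff_delays(5, 0.5, 2.0, 30.0))   # [0.5, 1.0, 2.0, 4.0, 8.0]

# The cap only matters with more attempts than the configured five:
print(backoff_delays(8, 0.5, 2.0, 30.0))   # [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

Under this assumption, five attempts wait at most 15.5 seconds in total, so the 30-second cap is never reached with the configured values.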
### Recovery Settings:
```python
# In OperationRecoveryManager
max_retries=3,      # Maximum recovery attempts
retry_delay=1.0,    # Base delay between retries
max_history=100     # Maximum operations in history
```

## Monitoring and Observability

### Recovery Statistics:
```python
recovery_stats = suite.orders.get_recovery_statistics()
# Example return value:
{
    "operations_started": 10,
    "operations_completed": 9,
    "operations_failed": 1,
    "success_rate": 0.9,
    "recovery_attempts": 2,
    "recovery_success_rate": 0.5,
    "active_operations": 0
}
```

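The rates in the example are consistent with simple ratios of the counters: 9 completed out of 10 started gives 0.9. A minimal sketch of that calculation, assuming `success_rate` is defined as completed divided by started (the source does not state the formula, and the helper below is hypothetical):

```python
def success_rate(succeeded: int, total: int) -> float:
    """Fraction of successes; 0.0 when nothing has been attempted."""
    return succeeded / total if total else 0.0


stats = {"operations_started": 10, "operations_completed": 9, "operations_failed": 1}
stats["success_rate"] = success_rate(
    stats["operations_completed"], stats["operations_started"]
)
print(stats["success_rate"])   # 0.9
print(success_rate(0, 0))      # 0.0 (guards against division by zero)
```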
### Circuit Breaker Status:
```python
cb_status = suite.orders.get_circuit_breaker_status()
# Example return value:
{
    "state": "closed",
    "failure_count": 0,
    "is_healthy": True,
    "retry_config": {
        "max_attempts": 5,
        "initial_delay": 0.5,
        "backoff_factor": 2.0,
        "max_delay": 30.0
    }
}
```

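For readers unfamiliar with the pattern, a circuit breaker stops calling a failing dependency after a threshold of failures and allows a trial call once a reset period has elapsed. The sketch below is a generic illustration, not the OrderManager implementation: the `CircuitBreaker` class and the `"open"` and `"half_open"` state names are assumptions (only `"closed"` appears in the status output above), and the threshold and reset time reuse the configured values of 10 and 300.0.

```python
from __future__ import annotations


class CircuitBreaker:
    """Generic circuit breaker; time is passed in explicitly to keep it testable."""

    def __init__(self, threshold: int, reset_time: float) -> None:
        self.threshold = threshold
        self.reset_time = reset_time
        self.failure_count = 0
        self.opened_at: float | None = None

    def state(self, now: float) -> str:
        if self.opened_at is None:
            return "closed"
        if now - self.opened_at >= self.reset_time:
            return "half_open"      # reset period elapsed: allow a trial call
        return "open"

    def allow(self, now: float) -> bool:
        return self.state(now) != "open"

    def record_failure(self, now: float) -> None:
        self.failure_count += 1
        if self.failure_count >= self.threshold:
            self.opened_at = now    # trip (or re-trip) the breaker

    def record_success(self) -> None:
        self.failure_count = 0
        self.opened_at = None


breaker = CircuitBreaker(threshold=10, reset_time=300.0)
for _ in range(10):
    breaker.record_failure(now=0.0)
print(breaker.state(now=1.0))      # open
print(breaker.allow(now=1.0))      # False
print(breaker.state(now=300.0))    # half_open
breaker.record_success()
print(breaker.state(now=301.0))    # closed
```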
## Future Enhancements

### Planned Improvements:
1. **Persistent Recovery State**: Save operation state to disk
2. **Advanced Retry Strategies**: Custom retry logic per operation type
3. **Distributed Recovery**: Coordination across multiple instances
4. **Recovery Metrics**: Detailed performance analytics
5. **Custom Recovery Hooks**: User-defined recovery strategies

### Integration Opportunities:
1. **Risk Manager**: Coordinate with position limits
2. **Trade Journal**: Log all recovery attempts
3. **Alerting System**: Notify on repeated failures
4. **Dashboard**: Visual recovery status monitoring

## Conclusion

The error recovery system turns the OrderManager from a component prone to inconsistent states into one that recovers from partial failures on its own. Transaction-like semantics, rollback, and retry logic handle partial failures while keeping full backward compatibility.

Key achievements:
- **Zero Breaking Changes**: All existing code continues to work
- **Complete Recovery**: No more orphaned orders or inconsistent state
- **Enhanced Reliability**: Automatic retry and rollback mechanisms
- **Full Observability**: Monitoring and statistics for recovery operations
- **Production Ready**: Tested with real trading scenarios

The system is now production-ready with enterprise-grade error recovery capabilities.
