diff --git a/src/forge/controller/service/service.md b/src/forge/controller/service/service.md deleted file mode 100644 index 3a8134fa3..000000000 --- a/src/forge/controller/service/service.md +++ /dev/null @@ -1,301 +0,0 @@ -# Service - Distributed Actor Service Controller - -A robust service orchestration system for managing distributed actor-based workloads with fault tolerance and intelligent load balancing. - -## Overview - -The Service class provides a unified interface for deploying and managing multiple replicas of actor-based services across distributed compute resources. It automatically handles replica lifecycle, request routing, and session management. - -## Key Features - -### **Fault Tolerance** -- **Health Monitoring**: Continuous health checks with automatic replica recovery -- **Request Migration**: Seamless migration of requests from failed replicas -- **Session Preservation**: Maintains session state during replica failures -- **Graceful Degradation**: Continues operation with reduced capacity - -### **Load Balancing** -- **Round-Robin**: Default load distribution across healthy replicas -- **Least-Loaded**: Session assignment to replicas with lowest load -- **Session Affinity**: Sticky sessions for stateful workloads -- **Custom Routing**: Extensible routing logic for specialized use cases - -### **Comprehensive Metrics** -- **Request Metrics**: Throughput, latency, success/failure rates -- **Capacity Metrics**: Utilization, queue depth, active requests -- **Service Metrics**: Session counts, replica health, scaling events -- **Real-time Monitoring**: Sliding window metrics for responsive scaling - -### **Session Management** -- **Context-Aware Sessions**: Automatic session context propagation -- **Session Lifecycle**: Managed session creation and cleanup -- **Routing Hints**: Custom session routing based on workload characteristics - -## Architecture - -``` -┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐ -│ Client API │───▶│ Service Layer │───▶│ Replica Pool │ -└─────────────────┘ └──────────────────┘ └─────────────────┘ - │ │ - ▼ ▼ - ┌──────────────┐ ┌─────────────┐ - │ Autoscaler │ │ Actor Mesh │ - └──────────────┘ └─────────────┘ - │ │ - ▼ ▼ - ┌──────────────┐ ┌─────────────┐ - │ Metrics │ │ Health │ - │ Collector │ │ Monitor │ - └──────────────┘ └─────────────┘ -``` - -## Usage - -### Basic Service Setup - -```python -from forge.controller.service import Service, ServiceConfig - -# Configure service parameters -config = ServiceConfig( - gpus_per_replica=1, - min_replicas=2, - max_replicas=10, - default_replicas=3, - replica_max_concurrent_requests=10, -) - -# Create service with your actor definition -service = Service(config, MyActorClass, *actor_args, **actor_kwargs) -await service.__initialize__() -``` - -### Session-Based Calls - -```python -# Context manager for session lifecycle -async with service.session() as session: - result1 = await service.my_endpoint(arg1, arg2) - result2 = await service.another_endpoint(arg3) - # Session automatically terminated on exit - -# Manual session management -session_id = await service.start_session() -result = await service.my_endpoint(session_id, arg1, arg2) -await service.terminate_session(session_id) -``` - -### Stateless Calls - -```python -# Direct calls without sessions (uses round-robin load balancing) -result = await service.my_endpoint(arg1, arg2) -``` - -### Custom Routing - -```python -# Override _custom_replica_routing for specialized routing logic -class CustomService(Service): - async def _custom_replica_routing(self, sess_id: str | None, **kwargs) -> Optional[Replica]: - # Custom routing based on request characteristics - if kwargs.get('priority') == 'high': - return self._get_least_loaded_replica() - return None # Fall back to default routing - -# Use with routing hints -async with service.session(priority='high') as session: - result = await service.my_endpoint(arg1, arg2) -``` - -### Monitoring and Metrics - -```python -# Get detailed metrics -metrics = service.get_metrics() -print(f"Total request rate: {metrics.get_total_request_rate()}") -print(f"Average queue depth: {metrics.get_avg_queue_depth()}") -print(f"Capacity utilization: {metrics.get_avg_capacity_utilization(service._replicas)}") - -# Get summary for monitoring dashboards -summary = service.get_metrics_summary() -print(f"Healthy replicas: {summary['service']['healthy_replicas']}") -print(f"Total sessions: {summary['service']['total_sessions']}") - -# Per-replica metrics -for replica_idx, replica_metrics in summary['replicas'].items(): - print(f"Replica {replica_idx}: {replica_metrics['request_rate']:.1f} req/s") -``` - -### Graceful Shutdown - -```python -# Stop the service and all replicas -await service.stop() -``` - -## Configuration - -### ServiceConfig - -| Parameter | Type | Description | -|-----------|------|-------------| -| `gpus_per_replica` | int | Number of GPUs allocated per replica | -| `min_replicas` | int | Minimum number of replicas to maintain | -| `max_replicas` | int | Maximum number of replicas allowed | -| `default_replicas` | int | Initial number of replicas to start | -| `replica_max_concurrent_requests` | int | Maximum concurrent requests per replica | -| `health_poll_rate` | float | Health check frequency in seconds | -| `return_first_rank_result` | bool | Auto-unwrap ValueMesh to first rank's result | -| `autoscaling` | AutoscalingConfig | Autoscaling configuration | - -### AutoscalingConfig - -#### Scale Up Triggers -| Parameter | Default | Description | -|-----------|---------|-------------| -| `scale_up_queue_depth_threshold` | 5.0 | Average queue depth to trigger scale up | -| `scale_up_capacity_threshold` | 0.8 | Capacity utilization to trigger scale up | -| `scale_up_request_rate_threshold` | 10.0 | Requests/sec to trigger scale up | - -#### Scale Down Triggers -| Parameter | Default | Description | -|-----------|---------|-------------| -| `scale_down_capacity_threshold` | 0.3 | Capacity utilization to trigger scale down | -| `scale_down_queue_depth_threshold` | 1.0 | Average queue depth to trigger scale down | -| `scale_down_idle_time_threshold` | 300.0 | Seconds of low utilization before scale down | - -#### Timing Controls -| Parameter | Default | Description | -|-----------|---------|-------------| -| `min_time_between_scale_events` | 60.0 | Minimum seconds between scaling events | -| `scale_up_cooldown` | 30.0 | Cooldown after scale up | -| `scale_down_cooldown` | 120.0 | Cooldown after scale down | - -#### Scaling Behavior -| Parameter | Default | Description | -|-----------|---------|-------------| -| `scale_up_step_size` | 1 | How many replicas to add at once | -| `scale_down_step_size` | 1 | How many replicas to remove at once | - -#### Safety Limits -| Parameter | Default | Description | -|-----------|---------|-------------| -| `max_queue_depth_emergency` | 20.0 | Emergency scale up threshold | -| `min_healthy_replicas_ratio` | 0.5 | Minimum ratio of healthy replicas | - -## Metrics - -### Service-Level Metrics -- **Total Sessions**: Number of active sessions -- **Healthy Replicas**: Number of operational replicas -- **Total Request Rate**: Requests per second across all replicas -- **Average Queue Depth**: Average pending requests per replica -- **Average Capacity Utilization**: Average resource usage across replicas -- **Sessions Per Replica**: Distribution of sessions across replicas - -### Replica-Level Metrics -- **Request Counts**: Total, successful, and failed requests -- **Request Rate**: Requests per second (sliding window) -- **Average Latency**: Response time (sliding window) -- **Active Requests**: Currently processing requests -- **Queue Depth**: Pending requests in queue -- **Assigned Sessions**: Number of sessions assigned to replica -- **Capacity Utilization**: Current load vs maximum capacity - -## Use Cases - -### ML Model Serving -```python -# High-throughput model inference with automatic scaling -config = ServiceConfig( - gpus_per_replica=1, - min_replicas=2, - max_replicas=20, - default_replicas=4, - replica_max_concurrent_requests=8, - autoscaling=AutoscalingConfig( - scale_up_capacity_threshold=0.7, - scale_up_queue_depth_threshold=3.0 - ) -) -service = Service(config, ModelInferenceActor, model_path="/path/to/model") -``` - -### Batch Processing -```python -# Parallel job execution with fault tolerance -config = ServiceConfig( - gpus_per_replica=2, - min_replicas=1, - max_replicas=10, - default_replicas=3, - replica_max_concurrent_requests=5, - autoscaling=AutoscalingConfig( - scale_up_queue_depth_threshold=10.0, - scale_down_idle_time_threshold=600.0 - ) -) -service = Service(config, BatchProcessorActor, batch_size=100) -``` - -### Real-time Analytics -```python -# Stream processing with session affinity -config = ServiceConfig( - gpus_per_replica=1, - min_replicas=3, - max_replicas=15, - default_replicas=5, - replica_max_concurrent_requests=20, - autoscaling=AutoscalingConfig( - scale_up_request_rate_threshold=50.0, - scale_up_capacity_threshold=0.6 - ) -) -service = Service(config, StreamProcessorActor, window_size=1000) -``` - -## Performance Characteristics - -- **Low Latency**: Sub-millisecond request routing overhead -- **High Throughput**: Concurrent request processing across replicas -- **Elastic Scaling**: Responsive to traffic patterns with configurable thresholds -- **Resource Efficient**: Intelligent replica management and load balancing -- **Fault Resilient**: Automatic recovery from replica failures -- **Session Aware**: Maintains state consistency for stateful workloads - -## Best Practices - -### Configuration -- Set `min_replicas` based on baseline load requirements -- Configure `max_replicas` based on resource constraints -- Tune autoscaling thresholds based on workload characteristics -- Use appropriate cooldown periods to prevent scaling oscillation - -### Session Management -- Use sessions for stateful workloads requiring consistency -- Prefer stateless calls for better load distribution -- Implement custom routing for specialized workload requirements - -### Monitoring -- Monitor key metrics: request rate, queue depth, capacity utilization -- Set up alerts for unhealthy replicas and scaling events -- Track session distribution for load balancing effectiveness - -### Error Handling -- Implement proper error handling in actor endpoints -- Use try-catch blocks around service calls -- Monitor failed request rates for service health - -## Dependencies - -- `monarch.actor`: Actor framework for distributed computing -- `recoverable_mesh`: Fault-tolerant process mesh management -- `asyncio`: Asynchronous I/O support -- `contextvars`: Context variable support for session management - -## Thread Safety - -The Service class is designed for use in asyncio environments and is not thread-safe. All operations should be performed within the same event loop.