TL;DR
Replace SkyRL's fragmented inference abstractions (InferenceEngineClient, RayWrappedInferenceEngine, RemoteInferenceEngine) with a single InferenceHandle interface that standardizes on HTTP endpoints. This enables seamless integration with frameworks like Ray Serve LLM, making it straightforward to deploy wide expert parallelism (WideEP) and prefill-decode (PD) disaggregation.
Motivation
Current Problems
| Problem | Impact |
|---|---|
| Dual code paths | run_engines_locally creates two completely different flows (Ray actors vs HTTP), doubling maintenance burden |
| Scattered weight sync | Trainer manually orchestrates NCCL groups, rank offsets, IPC handles—complexity that should be server-side |
| Backend-specific logic | if engine_backend == "vllm" / "sglang" checks scattered across 5+ files |
| No clear API boundary | Generator and Trainer both reach into InferenceEngineClient internals |
Why Now?
Ray Serve LLM already provides multi-replica orchestration, fault tolerance, and OpenAI-compatible endpoints. Extending it with weight sync and lifecycle APIs lets SkyRL inherit these capabilities without reimplementing them. It is also a simple path to WideEP and PD disaggregation during rollouts, which are essential for high-throughput trajectory generation.
Proposal
Single Abstraction
```python
from abc import ABC
from typing import Any, Dict

class InferenceHandle(ABC):
    """Unified interface for inference interactions."""
    # Data Plane (Generator)
    async def chat_completion(self, request_payload: Dict[str, Any]) -> Dict[str, Any]: ...
    async def completion(self, request_payload: Dict[str, Any]) -> Dict[str, Any]: ...
    async def generate(self, input_batch: "InferenceEngineInput") -> "InferenceEngineOutput": ...

    # Control Plane (Trainer)
    async def init_weight_transfer(self, master_addr: str, master_port: int,
                                   rank_offset: int, world_size: int) -> Dict[str, Any]: ...
    async def collective_rpc(self, model: str, method: str, args: list, kwargs: Dict[str, Any]) -> Dict[str, Any]: ...
    async def pause(self) -> None: ...
    async def resume(self) -> None: ...
    async def sleep(self) -> None: ...
    async def wakeup(self) -> None: ...
    async def server_info(self) -> Dict[str, Any]: ...
```
Client Implementation
All server patterns use the same client: RemoteInferenceHandle (HTTP). For in-process Ray Serve deployments, an optional RayServeInferenceHandle can bypass the HTTP overhead.
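For illustration, a minimal sketch of what RemoteInferenceHandle could look like, assuming httpx as the HTTP client. The endpoint paths mirror the standard endpoints listed below; the request body fields for weight sync are illustrative, not a confirmed schema.

```python
import httpx
from typing import Any, Dict

class RemoteInferenceHandle(InferenceHandle):
    """Talks to any server that exposes the standard HTTP endpoints."""

    def __init__(self, base_url: str):
        # base_url points at the router or Ray Serve ingress.
        self.client = httpx.AsyncClient(base_url=base_url.rstrip("/"), timeout=None)

    async def chat_completion(self, request_payload: Dict[str, Any]) -> Dict[str, Any]:
        resp = await self.client.post("/v1/chat/completions", json=request_payload)
        resp.raise_for_status()
        return resp.json()

    async def pause(self) -> None:
        (await self.client.post("/pause")).raise_for_status()

    async def init_weight_transfer(self, master_addr: str, master_port: int,
                                   rank_offset: int, world_size: int) -> Dict[str, Any]:
        # Field names are illustrative; the actual schema is defined by the server.
        resp = await self.client.post("/init_weight_update_communicator", json={
            "master_addr": master_addr, "master_port": master_port,
            "rank_offset": rank_offset, "world_size": world_size,
        })
        resp.raise_for_status()
        return resp.json()
```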
Two Server Patterns
| Pattern | Server | Features | Best For |
|---|---|---|---|
| Ray Serve LLM | SkyRLIngress via serve.run() | WideEP, PD, fault tolerance, sleep/wake | Production training |
| Lightweight Router | N × vllm serve + FastAPI router | Round-robin, control-plane fan-out, weight sync | Simple cases, tests, etc. |
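To make the lightweight-router pattern concrete, here is a hedged sketch using FastAPI and httpx. The replica URLs are placeholders, and it assumes each replica also exposes the control-plane endpoints described below (e.g. /pause).

```python
import itertools
import httpx
from fastapi import FastAPI, Request

app = FastAPI()
# Placeholder replica addresses: N independent `vllm serve` processes.
REPLICAS = ["http://localhost:8001", "http://localhost:8002"]
_next_replica = itertools.cycle(REPLICAS)

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    # Data plane: round-robin each request to a single replica.
    target = next(_next_replica)
    async with httpx.AsyncClient(timeout=None) as client:
        resp = await client.post(f"{target}/v1/chat/completions", json=await request.json())
        return resp.json()

@app.post("/pause")
async def pause():
    # Control plane: fan out to every replica so all engines pause together.
    async with httpx.AsyncClient(timeout=None) as client:
        for replica in REPLICAS:
            await client.post(f"{replica}/pause")
    return {"status": "ok"}
```

Weight-sync and other control-plane endpoints would fan out to all replicas in the same way.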
Standard Endpoints
All servers expose:
- Data Plane: /v1/chat/completions, /v1/completions, /tokenize
- Control Plane: /pause, /resume, /sleep, /wakeup, /reset_prefix_cache
- Weight Sync: /init_weight_update_communicator, /collective_rpc, /server_info
Integration
Driver Hook
```python
class BasePPOExp:
    def setup_inference_handle(self) -> InferenceHandle:
        # Default: connect to external server (works with any of the 3 patterns)
        return RemoteInferenceHandle(self.cfg.generator.inference_endpoint_url)
```
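For context, a hypothetical sketch of how Generator and Trainer code would consume the hook; the function name, model name, and addresses below are illustrative, not actual SkyRL APIs.

```python
async def run_step(exp: BasePPOExp) -> None:
    # Hypothetical wiring; names and values are illustrative.
    handle = exp.setup_inference_handle()

    # Generator side: data plane.
    reply = await handle.chat_completion(
        {"model": "policy", "messages": [{"role": "user", "content": "hello"}]}
    )

    # Trainer side: control plane around a weight update.
    await handle.pause()
    await handle.init_weight_transfer(
        master_addr="10.0.0.1", master_port=29500, rank_offset=0, world_size=8
    )
    await handle.resume()
```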
Migration Path
| Phase | Tasks | Invasiveness |
|---|---|---|
| 1. Foundation | Define InferenceHandle ABC, implement RemoteInferenceHandle, create SkyRLIngress, create Lightweight Router | Low (additive) |
| 2. Integration | Add setup_inference_handle() hook, update Generator/Trainer | Medium |
| 3. Validation | E2E tests, deprecate old layers | Low |
Feature flag: setting _use_inference_handle = True in the driver class enables the new path. The old code remains functional until it is deprecated.
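A sketch of how the flag could gate the two paths; the build_inference and legacy-builder names below are hypothetical.

```python
class BasePPOExp:
    _use_inference_handle = True  # feature flag; default shown here is illustrative

    def build_inference(self):
        if self._use_inference_handle:
            return self.setup_inference_handle()  # new InferenceHandle path
        # Hypothetical name for the existing InferenceEngineClient construction.
        return self._build_legacy_inference_engine_client()
```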
Benefits
- Unified: One interface for Generator and Trainer
- HTTP-first: Consistent API regardless of deployment
- Server-managed: Weight sync, lifecycle, caching complexity moves server-side
- Backend-agnostic: Same code works with vLLM, SGLang, or Ray Serve LLM; if SGLang support is needed, it can be added behind Ray Serve LLM.
- Future-proof: Ready for WideEP, PD, and other advanced patterns
Non-Goals
- Changes to Ray Serve LLM core to enable colocation (separate RFC)
- New weight sync mechanisms (uses existing collective_rpc)
- Breaking changes to existing configs (feature-flagged)