Currently, all requests go through a remote prefill pod before being routed to a decode pod, even for very short prompts (e.g., "Hello", "Summarize this: ..."). This may add unnecessary latency and network overhead.
We can optimize this by allowing short prompts to be handled entirely by a decode pod.
Benefits
- Lower latency for short prompts
- Reduced load on prefill pods
Use Case
This aligns with strategies used in other systems like Dynamo, where decode instances handle short prefill locally and only delegate long contexts.
Would love feedback on:
- Suggested default threshold (e.g., 256 or 512?) 🤔
Proposed Solution
Proposed Change
In pdRouter.Route(), add a token-length check early in the routing path:
```go
tokens, err := r.tokenizer.TokenizeInputText(routingCtx.Message)
if err != nil {
    return "", err
}

if len(tokens) <= r.config.ShortPromptTokenThreshold {
    // Bypass prefill: route directly to a decode-only pod.
    decodePod := r.selectDecodePodForDirectInference(routingCtx, readyPodList.All())
    if decodePod == nil {
        return "", fmt.Errorf("no suitable decode pod available for direct inference")
    }
    routingCtx.SetTargetPod(decodePod)
    return routingCtx.TargetAddress(), nil
}

// Existing prefill → decode flow
...
```
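The selectDecodePodForDirectInference helper would be new. Below is a minimal sketch of one way it could work, assuming decode pods are identifiable by a role label and that the router tracks in-flight requests per pod. The label key model.aibrix.ai/role, the inflightRequests map, and the types.RoutingContext / k8s.io/api/core/v1 Pod signatures are assumptions for illustration, not existing AIBrix APIs:

```go
// Hypothetical sketch: pick the least-loaded decode pod for a
// prefill-bypassing request. The role label key and the
// r.inflightRequests counter map are assumptions.
func (r *pdRouter) selectDecodePodForDirectInference(routingCtx *types.RoutingContext, pods []*v1.Pod) *v1.Pod {
    var best *v1.Pod
    bestLoad := math.MaxInt
    for _, pod := range pods {
        // Only consider decode-role pods (label name is an assumption).
        if pod.Labels["model.aibrix.ai/role"] != "decode" {
            continue
        }
        // Prefer the pod with the fewest in-flight requests.
        if load := r.inflightRequests[pod.Name]; load < bestLoad {
            best, bestLoad = pod, load
        }
    }
    return best // nil if no decode pod qualifies
}
```

Reusing whatever load-aware selection the router already applies in its normal decode step would likely be preferable to a new heuristic; the sketch above just makes the fallback-to-nil contract explicit.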
Configuration
- New env var: AIBRIX_SHORT_PROMPT_THRESHOLD. When set to N > 0, prompts with ≤ N tokens skip remote prefill.
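A minimal sketch of how the variable could be wired into the router config; the function name and the disable-when-unset-or-invalid behavior are assumptions:

```go
import (
    "os"
    "strconv"
)

// loadShortPromptThreshold reads AIBRIX_SHORT_PROMPT_THRESHOLD and
// returns 0 (feature disabled) when the variable is unset, non-numeric,
// or negative. Hypothetical helper, shown for illustration only.
func loadShortPromptThreshold() int {
    raw := os.Getenv("AIBRIX_SHORT_PROMPT_THRESHOLD")
    if raw == "" {
        return 0
    }
    n, err := strconv.Atoi(raw)
    if err != nil || n < 0 {
        return 0
    }
    return n
}
```

With this shape, the existing prefill → decode path stays the default, and the bypass only activates when the operator opts in.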