RFC: Add decode-first routing for short prompts to bypass remote prefill #1798

@googs1025

Description

Currently, all requests go through a remote prefill pod before being routed to a decode pod, even for very short prompts (e.g., "Hello", "Summarize this: ..."). This adds unnecessary latency and network overhead.

We can optimize this by allowing short prompts to be handled entirely by a decode pod.

Benefits

  • Lower latency for short prompts
  • Reduced load on prefill pods

Use Case

This aligns with strategies used in other systems like Dynamo, where decode instances handle short prefills locally and only delegate long contexts to dedicated prefill workers.

Would love feedback on:

  • Suggested default threshold (e.g., 256 or 512?) 🤔

Proposed Solution

In pdRouter.Route(), add a token-length check early in the routing path:

tokens, err := r.tokenizer.TokenizeInputText(routingCtx.Message)
if err != nil {
    return "", err
}

if len(tokens) <= r.config.ShortPromptTokenThreshold {
    // Bypass prefill: route directly to a decode-only pod.
    decodePod := r.selectDecodePodForDirectInference(routingCtx, readyPodList.All())
    if decodePod == nil {
        return "", fmt.Errorf("no suitable decode pod available for direct inference")
    }
    routingCtx.SetTargetPod(decodePod)
    return routingCtx.TargetAddress(), nil
}

// Existing prefill → decode flow
...
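
For completeness, here is a rough sketch of what selectDecodePodForDirectInference could look like. Everything in it is an assumption for illustration (the role label, the inflightRequests helper, and the use of corev1.Pod from k8s.io/api/core/v1), not an existing AIBrix API:

// Sketch only — label key, inflightRequests, and pod typing are assumptions.
// Picks the decode pod with the lowest approximate in-flight request count.
func (r *pdRouter) selectDecodePodForDirectInference(routingCtx *RoutingContext, pods []*corev1.Pod) *corev1.Pod {
    var best *corev1.Pod
    bestLoad := -1
    for _, pod := range pods {
        if pod.Labels["role-name"] != "decode" { // assumed decode-pod label convention
            continue
        }
        load := r.inflightRequests(pod) // assumed helper: in-flight requests on this pod
        if best == nil || load < bestLoad {
            best, bestLoad = pod, load
        }
    }
    return best // nil when no decode pod qualifies
}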

Configuration

  • New env var: AIBRIX_SHORT_PROMPT_THRESHOLD
  • When set to N > 0, prompts with ≤ N tokens skip remote prefill.
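
As a sketch of how the variable could be consumed (the helper name and the fallback-to-zero behaviour are assumptions, not decided here), assuming the value is read once at router start-up:

import (
    "os"
    "strconv"
)

// Sketch: parse AIBRIX_SHORT_PROMPT_THRESHOLD into the router config.
// Zero (unset or invalid) keeps the existing prefill → decode flow for every prompt.
func loadShortPromptThreshold() int {
    raw := os.Getenv("AIBRIX_SHORT_PROMPT_THRESHOLD")
    if raw == "" {
        return 0
    }
    n, err := strconv.Atoi(raw)
    if err != nil || n < 0 {
        return 0 // invalid values fall back to the default behaviour
    }
    return n
}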
