Currently, all requests go through a remote prefill pod before being routed to a decode pod, even for very short prompts (e.g., "Hello", "Summarize this: ..."). This may add unnecessary latency and network overhead.
We can optimize this by allowing short prompts to be handled entirely by a decode pod.
Benefits
- Lower latency for short prompts
- Reduced load on prefill pods
Use Case
This aligns with strategies used in other systems like Dynamo, where decode instances handle short prefill locally and only delegate long contexts.
Would love feedback on:
- Suggested default threshold (e.g., 256 or 512?) 🤔
Proposed Solution
Proposed Change
In pdRouter.Route(), add a token-length check early in the routing path:
```go
tokens, err := r.tokenizer.TokenizeInputText(routingCtx.Message)
if err != nil {
    return "", err
}

if len(tokens) <= r.config.ShortPromptTokenThreshold {
    // Bypass prefill: route directly to a decode-only pod.
    decodePod := r.selectDecodePodForDirectInference(routingCtx, readyPodList.All())
    if decodePod == nil {
        return "", fmt.Errorf("no suitable decode pod available for direct inference")
    }
    routingCtx.SetTargetPod(decodePod)
    return routingCtx.TargetAddress(), nil
}

// Existing prefill → decode flow
...
```
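The selectDecodePodForDirectInference helper would be new. Below is a minimal sketch of one way it could work, assuming decode pods are identifiable by a role label and that the router tracks in-flight requests per pod. The label key model.aibrix.ai/role, the inflightRequests map, and the types.RoutingContext / k8s.io/api/core/v1 Pod signatures are assumptions for illustration, not existing AIBrix APIs:

```go
// Hypothetical sketch: pick the least-loaded decode pod for a
// prefill-bypassing request. The role label key and the
// r.inflightRequests counter map are assumptions.
func (r *pdRouter) selectDecodePodForDirectInference(routingCtx *types.RoutingContext, pods []*v1.Pod) *v1.Pod {
    var best *v1.Pod
    bestLoad := math.MaxInt
    for _, pod := range pods {
        // Only consider decode-role pods (label name is an assumption).
        if pod.Labels["model.aibrix.ai/role"] != "decode" {
            continue
        }
        // Prefer the pod with the fewest in-flight requests.
        if load := r.inflightRequests[pod.Name]; load < bestLoad {
            best, bestLoad = pod, load
        }
    }
    return best // nil if no decode pod qualifies
}
```

Reusing whatever load-aware selection the router already applies in its normal decode step would likely be preferable to a new heuristic; the sketch above just makes the fallback-to-nil contract explicit.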
Configuration
- New env var: AIBRIX_SHORT_PROMPT_THRESHOLD. When set to N > 0, prompts with ≤ N tokens skip remote prefill.
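A minimal sketch of how the variable could be wired into the router config; the function name and the disable-when-unset-or-invalid behavior are assumptions:

```go
import (
    "os"
    "strconv"
)

// loadShortPromptThreshold reads AIBRIX_SHORT_PROMPT_THRESHOLD and
// returns 0 (feature disabled) when the variable is unset, non-numeric,
// or negative. Hypothetical helper, shown for illustration only.
func loadShortPromptThreshold() int {
    raw := os.Getenv("AIBRIX_SHORT_PROMPT_THRESHOLD")
    if raw == "" {
        return 0
    }
    n, err := strconv.Atoi(raw)
    if err != nil || n < 0 {
        return 0
    }
    return n
}
```

With this shape, the existing prefill → decode path stays the default, and the bypass only activates when the operator opts in.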