feat: Support Variable Batch Sizes Across Pipeline Stages for Better Throughput #644
Describe the feature
Currently, Mosec requires consistent batch sizes across pipeline stages, which can be limiting for certain NLP tasks like reranking where input sizes vary significantly between requests.
For example, in a reranking service:
- One request might have 10 texts to rerank
- Another might have 100 texts

The optimal batch size differs per stage:
- Preprocessing (flattening, tokenization) can handle large batches
- Model inference needs smaller batches
- Postprocessing needs to regroup results by original request
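The flatten-then-regroup flow above can be sketched in plain Python. The helper names (`flatten_requests`, `regroup_by_task`) are hypothetical, not part of Mosec's API; the sketch only shows how task IDs let a later stage restore per-request grouping:

```python
def flatten_requests(requests):
    """Flatten variable-length requests into (task_id, text) pairs."""
    return [(task_id, text)
            for task_id, texts in enumerate(requests)
            for text in texts]

def regroup_by_task(flat_results, num_tasks):
    """Regroup per-item results into per-request lists using task IDs."""
    grouped = [[] for _ in range(num_tasks)]
    for task_id, result in flat_results:
        grouped[task_id].append(result)
    return grouped

# One request with 10 texts, another with 100.
reqs = [["t%d" % i for i in range(10)], ["u%d" % i for i in range(100)]]
flat = flatten_requests(reqs)            # 110 tagged items, batchable freely
back = regroup_by_task(flat, len(reqs))  # per-request lengths restored
```

Because every item carries its task ID, intermediate stages can batch the 110 items however they like without losing track of which request each result belongs to.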
Why do you need this feature?
1. Real-World NLP Tasks Often Have Variable-Length Inputs
In tasks like reranking, retrieval-augmented generation (RAG), or batch inference:
- Each query may match a different number of candidate texts (e.g., 10 vs. 100).
- Forcing fixed batch sizes either:
  - wastes compute (padding small batches → inefficient GPU use), or
  - increases latency (processing tiny batches sequentially → underutilization).
2. Different Pipeline Stages Have Different Optimal Batch Sizes
- Tokenization (CPU-bound) → benefits from large batches (e.g., 128+ items)
- Model inference (GPU-bound) → needs smaller batches (e.g., 8–32) to avoid OOM
- Postprocessing → needs to regroup results by original request
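The split between a large tokenization batch and a small inference batch can be expressed as a simple re-chunking step. This is an illustrative sketch (`split_into_batches` is a hypothetical helper, not a Mosec API), using the example sizes above:

```python
def split_into_batches(items, max_batch_size):
    """Split one large flattened batch into smaller inference-sized chunks."""
    if max_batch_size <= 0:
        raise ValueError("max_batch_size must be positive")
    return [items[i:i + max_batch_size]
            for i in range(0, len(items), max_batch_size)]

# A 110-item tokenized batch becomes 32/32/32/14-item chunks for the GPU stage.
chunks = split_into_batches(list(range(110)), 32)
```

In a variable-batch pipeline, each stage would apply this kind of re-chunking with its own limit, so the tokenizer's 128-item batch never forces the model stage past its OOM-safe size.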
Additional context
Proposed Solution:
Enable each pipeline stage to:
- Process different batch sizes independently
- Maintain request context (e.g., task IDs) to properly regroup results
- Dynamically split/merge batches based on its optimal size
```
Request A (n=10 texts) ──┐
Request B (n=100 texts)──┼──> Flatten Stage (batch=128) ──> Tokenize (batch=128) ──┐
                         │                                                         │
                         └──────────────────── Task ID Tracking ───────────────────┘
```
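The whole flow in the diagram can be simulated end to end in plain Python. Everything here is illustrative: the stage batch sizes, the request shapes, and the placeholder "model" (which just scores a text by its length) are assumptions, not Mosec behavior:

```python
# Two requests of very different sizes flow through
# flatten -> batched "inference" -> regroup, tracked by task IDs.
INFER_BATCH = 32  # model stage kept small to avoid OOM

requests = {
    "A": [f"a{i}" for i in range(10)],    # request A: 10 texts
    "B": [f"b{i}" for i in range(100)],   # request B: 100 texts
}

# Flatten stage: tag every item with its originating task ID.
flat = [(tid, text) for tid, texts in requests.items() for text in texts]

# Inference stage: process the flattened stream in INFER_BATCH-sized chunks.
scored = []
for start in range(0, len(flat), INFER_BATCH):
    chunk = flat[start:start + INFER_BATCH]
    # placeholder "model": score each text by its character length
    scored.extend((tid, len(text)) for tid, text in chunk)

# Postprocess stage: regroup scores by original request.
results = {}
for tid, score in scored:
    results.setdefault(tid, []).append(score)
```

Request A gets its 10 scores back and request B its 100, even though the middle stage batched them together and split them at its own size limit.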