feat: Support Variable Batch Sizes Across Pipeline Stages for Better Throughput #652
Replies: 3 comments
-
So far, different stages can have different batch sizes. For the variable-length input, I guess you mean something like #383?
-
Core Requirement Clarification: Per-Stage Input/Output Batch Size Flexibility
While supporting varying batch sizes across stages is valuable, we need dynamic batch size transformation within a single stage to enable output flattening. Specifically:
Use Case Constraints in Current Design
The reranking workflow exemplifies the current limitations. Input batch (N=2 requests):
[
{"query": "Q1", "texts": ["A", "B", "C"]}, # K=3
{"query": "Q2", "texts": ["D", "E"]} # K=2
]
Required flattened output batch:
["Q1:A", "Q1:B", "Q1:C", "Q2:D", "Q2:E"] # M=5 (N*K_avg = 2*2.5)
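The flatten/regroup transformation above can be sketched in plain Python (these are illustrative helpers, not Mosec's actual API); the implicit request index serves as the ID used to gather results back:

```python
def flatten(batch):
    """Flatten [{query, texts}] requests into (pairs, owners).

    `owners[i]` records which request produced `pairs[i]`, so the
    per-pair scores can be regrouped afterwards.
    """
    pairs, owners = [], []
    for rid, req in enumerate(batch):
        for text in req["texts"]:
            pairs.append(f'{req["query"]}:{text}')
            owners.append(rid)
    return pairs, owners

def regroup(scores, owners, n_requests):
    """Gather flat per-pair scores back into one list per request."""
    grouped = [[] for _ in range(n_requests)]
    for score, rid in zip(scores, owners):
        grouped[rid].append(score)
    return grouped

batch = [
    {"query": "Q1", "texts": ["A", "B", "C"]},  # K=3
    {"query": "Q2", "texts": ["D", "E"]},       # K=2
]
pairs, owners = flatten(batch)
# pairs == ["Q1:A", "Q1:B", "Q1:C", "Q2:D", "Q2:E"], i.e. M=5
```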
-
Got it. This looks like a DAG scheduler that supports "flatten" and "gather by req ID" operations. For your rerank use case, I assume you mean the cross-encoder models. So actually, you could send query-doc pairs instead of query-docs, which means the "flatten" and "gather" are done on the user side.
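The user-side workaround could look like this minimal sketch, where `rerank_pair` is a hypothetical stand-in for one query-doc call to the cross-encoder service:

```python
def rerank_pair(query, doc):
    # Stand-in for a single query-doc request to the cross-encoder
    # service; in practice this would be an HTTP call returning a
    # relevance score. Here: a dummy character-overlap score.
    return len(set(query) & set(doc))

def rerank(query, docs):
    # "Flatten" on the user side: one query-doc pair per call.
    scored = [(rerank_pair(query, doc), doc) for doc in docs]
    # "Gather" on the user side: sort this query's docs by score.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]
```

With this shape, the server only ever sees uniform query-doc pairs, so its batch size stays regular even though each client query has a different number of candidate documents.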
-
Describe the feature
Currently, Mosec requires consistent batch sizes across pipeline stages, which can be limiting for certain NLP tasks like reranking where input sizes vary significantly between requests.
For example, in a reranking service:
- One request might have 10 texts to rerank
- Another might have 100 texts

The optimal batch size also differs per stage:
- Preprocessing (flattening, tokenization) can handle large batches
- Model inference needs smaller batches
- Postprocessing needs to regroup by original request
Why do you need this feature?
1. Real-World NLP Tasks Often Have Variable-Length Inputs
In tasks like reranking, retrieval-augmented generation (RAG), or batch inference:
Each query may match a different number of candidate texts (e.g., 10 vs. 100).
Forcing fixed batch sizes either wastes compute padding small requests or requires splitting large requests across multiple calls.
2. Different Pipeline Stages Have Different Optimal Batch Sizes
Tokenization (CPU-bound) → Benefits from large batches (e.g., 128+ items)
Model Inference (GPU-bound) → Needs smaller batches (e.g., 8–32) to avoid OOM
Postprocessing → Needs to regroup by original request
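As a minimal sketch of the split step (illustrative numbers and names, not Mosec API), a large preprocessed batch can be re-chunked to the GPU-friendly size before inference:

```python
def run_in_chunks(flat_inputs, infer_fn, gpu_batch_size=32):
    # Split a large flattened batch (e.g. 128 tokenized items) into
    # GPU-sized chunks, run inference on each chunk, and concatenate
    # the outputs in the original order.
    outputs = []
    for start in range(0, len(flat_inputs), gpu_batch_size):
        outputs.extend(infer_fn(flat_inputs[start : start + gpu_batch_size]))
    return outputs
```

This is what "different optimal batch sizes per stage" means in practice: the tokenization stage hands over 128 items at once, while the inference stage internally never sees more than 32.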
Additional context
Proposed Solution:
Enable each pipeline stage to:
- process different batch sizes independently,
- maintain request context (e.g., task IDs) to properly regroup results, and
- dynamically split/merge batches based on its optimal size.
Request A (n=10 texts) ──┐
Request B (n=100 texts)──┼──> Flatten Stage (batch=128) ──> Tokenize (batch=128) ──┐
│ │
└─────────────────── Task ID Tracking ────────────────────┘
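The diagram above could be sketched end to end like this (plain Python; `pipeline` and the `stage_fn` callback are illustrative names, not a proposed API):

```python
def pipeline(requests, stage_fn, batch_size=128):
    # Flatten all requests into one stream, tagging each item with its
    # task ID so results can be regrouped afterwards.
    tagged = [(tid, item) for tid, items in requests.items() for item in items]
    results = {tid: [] for tid in requests}
    # Process the flat stream in fixed-size batches...
    for start in range(0, len(tagged), batch_size):
        chunk = tagged[start : start + batch_size]
        outs = stage_fn([item for _, item in chunk])
        # ...and route each output back to its originating request.
        for (tid, _), out in zip(chunk, outs):
            results[tid].append(out)
    return results

# Request A has 10 texts, request B has 100, yet every stage_fn call
# sees at most `batch_size` items.
reqs = {"A": [f"a{i}" for i in range(10)], "B": [f"b{i}" for i in range(100)]}
out = pipeline(reqs, lambda xs: [x.upper() for x in xs])
```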