feat: Support Variable Batch Sizes Across Pipeline Stages for Better Throughput #652
Replies: 3 comments
-
So far, different stages can have different batch sizes. For the variable-length input, I guess you mean something like #383?
-
Core Requirement Clarification: Per-Stage Input/Output Batch Size Flexibility
While supporting varying batch sizes across stages is valuable, we need dynamic batch size transformation within a single stage to enable output flattening. Specifically:
Use Case Constraints in Current Design
The reranking workflow exemplifies the current limitations. Input batch (N=2 requests):
[
{"query": "Q1", "texts": ["A", "B", "C"]}, # K=3
{"query": "Q2", "texts": ["D", "E"]} # K=2
]
Required flattened output batch:
["Q1:A", "Q1:B", "Q1:C", "Q2:D", "Q2:E"] # M=5 (N*K_avg = 2*2.5)
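The flatten/regroup transformation above can be sketched in plain Python (these are illustrative helpers, not Mosec's actual API); the implicit request index serves as the ID used to gather results back:

```python
def flatten(batch):
    """Flatten [{query, texts}] requests into (pairs, owners).

    `owners[i]` records which request produced `pairs[i]`, so the
    per-pair scores can be regrouped afterwards.
    """
    pairs, owners = [], []
    for rid, req in enumerate(batch):
        for text in req["texts"]:
            pairs.append(f'{req["query"]}:{text}')
            owners.append(rid)
    return pairs, owners

def regroup(scores, owners, n_requests):
    """Gather flat per-pair scores back into one list per request."""
    grouped = [[] for _ in range(n_requests)]
    for score, rid in zip(scores, owners):
        grouped[rid].append(score)
    return grouped

batch = [
    {"query": "Q1", "texts": ["A", "B", "C"]},  # K=3
    {"query": "Q2", "texts": ["D", "E"]},       # K=2
]
pairs, owners = flatten(batch)
# pairs == ["Q1:A", "Q1:B", "Q1:C", "Q2:D", "Q2:E"], i.e. M=5
```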
-
Got it. This looks like a DAG scheduler that supports "flatten" and "gather by req ID" operations. For your rerank use case, I assume you mean the cross-encoder models. So actually, you could send query-doc pairs instead of query-docs, which means the "flatten" and "gather" are done on the user side.
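The user-side workaround could look like this minimal sketch, where `rerank_pair` is a hypothetical stand-in for one query-doc call to the cross-encoder service:

```python
def rerank_pair(query, doc):
    # Stand-in for a single query-doc request to the cross-encoder
    # service; in practice this would be an HTTP call returning a
    # relevance score. Here: a dummy character-overlap score.
    return len(set(query) & set(doc))

def rerank(query, docs):
    # "Flatten" on the user side: one query-doc pair per call.
    scored = [(rerank_pair(query, doc), doc) for doc in docs]
    # "Gather" on the user side: sort this query's docs by score.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]
```

With this shape, the server only ever sees uniform query-doc pairs, so its batch size stays regular even though each client query has a different number of candidate documents.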
-
Describe the feature
Currently, Mosec requires consistent batch sizes across pipeline stages, which can be limiting for certain NLP tasks like reranking where input sizes vary significantly between requests.
For example, in a reranking service:
- One request might have 10 texts to rerank
- Another might have 100 texts

The optimal batch size also differs per stage:
- Preprocessing (flattening, tokenization) can handle large batches
- Model inference needs smaller batches
- Postprocessing needs to regroup by original request
Why do you need this feature?
1. Real-World NLP Tasks Often Have Variable-Length Inputs
In tasks like reranking, retrieval-augmented generation (RAG), or batch inference:
Each query may match a different number of candidate texts (e.g., 10 vs. 100).
Forcing fixed batch sizes either wastes compute padding small requests or requires splitting large requests across multiple calls.
2. Different Pipeline Stages Have Different Optimal Batch Sizes
Tokenization (CPU-bound) → Benefits from large batches (e.g., 128+ items)
Model Inference (GPU-bound) → Needs smaller batches (e.g., 8–32) to avoid OOM
Postprocessing → Needs to regroup by original request
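As a minimal sketch of the split step (illustrative numbers and names, not Mosec API), a large preprocessed batch can be re-chunked to the GPU-friendly size before inference:

```python
def run_in_chunks(flat_inputs, infer_fn, gpu_batch_size=32):
    # Split a large flattened batch (e.g. 128 tokenized items) into
    # GPU-sized chunks, run inference on each chunk, and concatenate
    # the outputs in the original order.
    outputs = []
    for start in range(0, len(flat_inputs), gpu_batch_size):
        outputs.extend(infer_fn(flat_inputs[start : start + gpu_batch_size]))
    return outputs
```

This is what "different optimal batch sizes per stage" means in practice: the tokenization stage hands over 128 items at once, while the inference stage internally never sees more than 32.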
Additional context
Proposed Solution:
Enable each pipeline stage to:
- process different batch sizes independently,
- maintain request context (e.g., task IDs) to properly regroup results, and
- dynamically split/merge batches based on its optimal size.
Request A (n=10 texts) ──┐
Request B (n=100 texts)──┼──> Flatten Stage (batch=128) ──> Tokenize (batch=128) ──┐
│ │
└─────────────────── Task ID Tracking ────────────────────┘
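The diagram above could be sketched end to end like this (plain Python; `pipeline` and the `stage_fn` callback are illustrative names, not a proposed API):

```python
def pipeline(requests, stage_fn, batch_size=128):
    # Flatten all requests into one stream, tagging each item with its
    # task ID so results can be regrouped afterwards.
    tagged = [(tid, item) for tid, items in requests.items() for item in items]
    results = {tid: [] for tid in requests}
    # Process the flat stream in fixed-size batches...
    for start in range(0, len(tagged), batch_size):
        chunk = tagged[start : start + batch_size]
        outs = stage_fn([item for _, item in chunk])
        # ...and route each output back to its originating request.
        for (tid, _), out in zip(chunk, outs):
            results[tid].append(out)
    return results

# Request A has 10 texts, request B has 100, yet every stage_fn call
# sees at most `batch_size` items.
reqs = {"A": [f"a{i}" for i in range(10)], "B": [f"b{i}" for i in range(100)]}
out = pipeline(reqs, lambda xs: [x.upper() for x in xs])
```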