Feature request
Currently, the batch-processing channel in core/src/infer.rs has a hardcoded capacity of 1, which prevents pipeline parallelism between batch formation and inference. This creates a bottleneck:
- Sequential Processing: Only one batch can be in-flight at a time, forcing the batching task to wait for inference to complete before forming the next batch
- Reduced Throughput: Under high concurrency, requests queue up waiting for batches to be processed sequentially
- No Tuning: Users cannot optimize the latency/throughput trade-off for their specific workload
Motivation
With capacity=1, the batching task must wait for the backend to finish processing the current batch before it can send the next one, preventing pipeline parallelism (see the sketch below).
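As a rough illustration only, here is a minimal sketch of the intended behaviour, assuming the batching task and the backend communicate over a bounded tokio mpsc channel (as is common in async Rust services). The `Batch` struct and the `TEI_BATCH_CHANNEL_CAPACITY` environment variable are hypothetical stand-ins rather than the crate's actual types or configuration; the point is that with capacity > 1 the producer can form the next batch while the consumer is still running inference.

```rust
// Minimal sketch (not the crate's actual code): a bounded tokio mpsc channel
// whose capacity is read from configuration instead of being hardcoded to 1.
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::sleep;

#[derive(Debug)]
struct Batch {
    id: usize,
}

#[tokio::main]
async fn main() {
    // Hypothetical knob; defaulting to 1 preserves the current behaviour.
    let capacity: usize = std::env::var("TEI_BATCH_CHANNEL_CAPACITY")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(1);

    let (batch_tx, mut batch_rx) = mpsc::channel::<Batch>(capacity);

    // Batching task: forms batches and sends them downstream.
    // With capacity > 1 it can form the next batch while inference
    // is still busy with the previous one (pipeline parallelism).
    let producer = tokio::spawn(async move {
        for id in 0..4 {
            sleep(Duration::from_millis(50)).await; // simulate batch formation
            // `send` only awaits when the channel is full.
            batch_tx.send(Batch { id }).await.unwrap();
            println!("queued batch {id}");
        }
    });

    // Inference task: pulls batches and runs the (simulated) backend.
    let consumer = tokio::spawn(async move {
        while let Some(batch) = batch_rx.recv().await {
            sleep(Duration::from_millis(150)).await; // simulate inference
            println!("finished batch {}", batch.id);
        }
    });

    let _ = tokio::join!(producer, consumer);
}
```

Keeping the default at 1 would preserve today's behaviour, while a larger value lets users trade a little extra queuing for higher throughput under load.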
Key Metrics to Measure
- Queue Time: Time requests spend waiting in the batching queue (see the measurement sketch after this list)
- Batch Size: Number of requests processed together
- Throughput: Requests per second
- Latency: End-to-end request time
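For the benchmarking side, here is a minimal sketch of how the first two metrics could be collected, assuming each queued request is stamped with the time it entered the queue. `QueuedRequest` and `BatchMetrics` are hypothetical helpers, not existing types in the repository; throughput and end-to-end latency could be measured from the benchmarking client itself.

```rust
// Minimal sketch (hypothetical types) of recording queue time and batch size
// around the batching channel.
use std::time::{Duration, Instant};

/// A queued request stamped with the time it entered the queue.
struct QueuedRequest {
    enqueued_at: Instant,
    // ... payload fields elided
}

/// Simple accumulator for the metrics listed above.
#[derive(Default)]
struct BatchMetrics {
    queue_times: Vec<Duration>,
    batch_sizes: Vec<usize>,
}

impl BatchMetrics {
    /// Record one batch at the moment it is handed to the backend.
    fn observe_batch(&mut self, batch: &[QueuedRequest]) {
        let now = Instant::now();
        self.batch_sizes.push(batch.len());
        self.queue_times
            .extend(batch.iter().map(|r| now - r.enqueued_at));
    }

    /// Mean queue time across all observed requests.
    fn mean_queue_time(&self) -> Option<Duration> {
        let n = self.queue_times.len() as u32;
        (n > 0).then(|| self.queue_times.iter().sum::<Duration>() / n)
    }
}

fn main() {
    let mut metrics = BatchMetrics::default();
    let batch = vec![QueuedRequest { enqueued_at: Instant::now() }];
    metrics.observe_batch(&batch);
    println!(
        "batches: {}, mean queue time: {:?}",
        metrics.batch_sizes.len(),
        metrics.mean_queue_time()
    );
}
```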
Your contribution
I will submit a PR with the proposed change and benchmarking numbers.