Feature request
Currently, the batch-processing channel in core/src/infer.rs has a hardcoded capacity of 1, which prevents pipeline parallelism between batch formation and inference. This creates a bottleneck:
- Sequential Processing: Only one batch can be in-flight at a time, forcing the batching task to wait for inference to complete before forming the next batch
- Reduced Throughput: Under high concurrency, requests queue up waiting for batches to be processed sequentially
- No Tuning: Users cannot optimize the latency/throughput trade-off for their specific workload
Motivation
With capacity=1, the batching task must wait for the backend to finish processing the current batch before it can send the next one, preventing pipeline parallelism (see the sketch below).
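As a rough illustration only, here is a minimal sketch of the intended behaviour, assuming the batching task and the backend communicate over a bounded tokio mpsc channel (as is common in async Rust services). The `Batch` struct and the `TEI_BATCH_CHANNEL_CAPACITY` environment variable are hypothetical stand-ins rather than the crate's actual types or configuration; the point is that with capacity > 1 the producer can form the next batch while the consumer is still running inference.

```rust
// Minimal sketch (not the crate's actual code): a bounded tokio mpsc channel
// whose capacity is read from configuration instead of being hardcoded to 1.
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::sleep;

#[derive(Debug)]
struct Batch {
    id: usize,
}

#[tokio::main]
async fn main() {
    // Hypothetical knob; defaulting to 1 preserves the current behaviour.
    let capacity: usize = std::env::var("TEI_BATCH_CHANNEL_CAPACITY")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(1);

    let (batch_tx, mut batch_rx) = mpsc::channel::<Batch>(capacity);

    // Batching task: forms batches and sends them downstream.
    // With capacity > 1 it can form the next batch while inference
    // is still busy with the previous one (pipeline parallelism).
    let producer = tokio::spawn(async move {
        for id in 0..4 {
            sleep(Duration::from_millis(50)).await; // simulate batch formation
            // `send` only awaits when the channel is full.
            batch_tx.send(Batch { id }).await.unwrap();
            println!("queued batch {id}");
        }
    });

    // Inference task: pulls batches and runs the (simulated) backend.
    let consumer = tokio::spawn(async move {
        while let Some(batch) = batch_rx.recv().await {
            sleep(Duration::from_millis(150)).await; // simulate inference
            println!("finished batch {}", batch.id);
        }
    });

    let _ = tokio::join!(producer, consumer);
}
```

Keeping the default at 1 would preserve today's behaviour, while a larger value lets users trade a little extra queuing for higher throughput under load.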
Key Metrics to Measure
- Queue Time: Time requests spend waiting in the batching queue (see the measurement sketch after this list)
- Batch Size: Number of requests processed together
- Throughput: Requests per second
- Latency: End-to-end request time
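For the benchmarking side, here is a minimal sketch of how the first two metrics could be collected, assuming each queued request is stamped with the time it entered the queue. `QueuedRequest` and `BatchMetrics` are hypothetical helpers, not existing types in the repository; throughput and end-to-end latency could be measured from the benchmarking client itself.

```rust
// Minimal sketch (hypothetical types) of recording queue time and batch size
// around the batching channel.
use std::time::{Duration, Instant};

/// A queued request stamped with the time it entered the queue.
struct QueuedRequest {
    enqueued_at: Instant,
    // ... payload fields elided
}

/// Simple accumulator for the metrics listed above.
#[derive(Default)]
struct BatchMetrics {
    queue_times: Vec<Duration>,
    batch_sizes: Vec<usize>,
}

impl BatchMetrics {
    /// Record one batch at the moment it is handed to the backend.
    fn observe_batch(&mut self, batch: &[QueuedRequest]) {
        let now = Instant::now();
        self.batch_sizes.push(batch.len());
        self.queue_times
            .extend(batch.iter().map(|r| now - r.enqueued_at));
    }

    /// Mean queue time across all observed requests.
    fn mean_queue_time(&self) -> Option<Duration> {
        let n = self.queue_times.len() as u32;
        (n > 0).then(|| self.queue_times.iter().sum::<Duration>() / n)
    }
}

fn main() {
    let mut metrics = BatchMetrics::default();
    let batch = vec![QueuedRequest { enqueued_at: Instant::now() }];
    metrics.observe_batch(&batch);
    println!(
        "batches: {}, mean queue time: {:?}",
        metrics.batch_sizes.len(),
        metrics.mean_queue_time()
    );
}
```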
Your contribution
I will submit a PR with the proposed change and benchmarking numbers.