Configurable Batch Channel Capacity for Pipeline Parallelism #798

@jovsa

Description

Feature request

Currently, the batch processing channel in core/src/infer.rs has a hardcoded capacity, which limits pipeline parallelism between batch formation and inference (a minimal sketch of a configurable capacity follows the list below). This creates a bottleneck:

  1. Sequential Processing: Only one batch can be in-flight at a time, forcing the batching task to wait for inference to complete before forming the next batch
  2. Reduced Throughput: Under high concurrency, requests queue up waiting for batches to be processed sequentially
  3. No Tuning: Users cannot optimize the latency/throughput trade-off for their specific workload
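
For illustration, here is a minimal sketch of what a configurable capacity could look like, assuming the batch channel is a tokio::sync::mpsc bounded channel and a hypothetical batch_channel_capacity setting; the actual channel type and plumbing in core/src/infer.rs may differ:

```rust
use tokio::sync::mpsc;

// Hypothetical stand-in for the real batch type in core/src/infer.rs.
struct NextBatch;

// Sketch: take the capacity from a setting (CLI flag / env var) instead of a
// hardcoded value, so users can trade memory for pipeline depth.
fn create_batch_channel(
    batch_channel_capacity: usize,
) -> (mpsc::Sender<NextBatch>, mpsc::Receiver<NextBatch>) {
    // A capacity > 1 allows the batching task to queue batches ahead of inference.
    mpsc::channel(batch_channel_capacity)
}
```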

Motivation

With capacity=1, the batching task must wait for the backend to finish processing before it can send the next batch, preventing pipeline parallelism.
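
As a toy illustration of the pipelining effect (not code from this repo), the sketch below uses a tokio bounded channel with simulated batch-formation and inference times; with a larger capacity, batch formation can run ahead of inference instead of stalling behind a single in-flight batch:

```rust
use std::time::{Duration, Instant};
use tokio::{sync::mpsc, time::sleep};

#[tokio::main]
async fn main() {
    // Toy numbers purely for illustration; compare capacity = 1 vs. 4.
    let capacity = 4;
    let (tx, mut rx) = mpsc::channel::<u32>(capacity);

    // "Batch formation" task: ~10 ms per batch.
    let batcher = tokio::spawn(async move {
        for batch_id in 0..8u32 {
            sleep(Duration::from_millis(10)).await; // form the batch
            // send() only waits when the buffer is full, so a larger capacity
            // lets formation run further ahead of inference.
            tx.send(batch_id).await.unwrap();
        }
    });

    // "Inference" task: ~30 ms per batch.
    let start = Instant::now();
    while let Some(batch_id) = rx.recv().await {
        sleep(Duration::from_millis(30)).await; // run inference
        println!("batch {batch_id} done at {:?}", start.elapsed());
    }
    batcher.await.unwrap();
}
```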

Key Metrics to Measure

  • Queue Time: Time requests spend waiting in the batching queue (a measurement sketch follows this list)
  • Batch Size: Number of requests processed together
  • Throughput: Requests per second
  • Latency: End-to-end request time
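
As an example of how the first metric could be captured (a hypothetical sketch, not existing instrumentation), each request could be stamped when it enters the batching queue, with the elapsed time read when it is pulled into a batch:

```rust
use std::time::{Duration, Instant};

// Hypothetical wrapper around a queued request; the real request type and
// metrics plumbing in core/src/infer.rs may differ.
struct QueuedRequest {
    input: String,
    enqueued_at: Instant,
}

impl QueuedRequest {
    fn new(input: String) -> Self {
        Self { input, enqueued_at: Instant::now() }
    }

    // Queue time = time spent waiting between enqueue and batch formation.
    fn queue_time(&self) -> Duration {
        self.enqueued_at.elapsed()
    }
}
```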

Your contribution

I will submit a PR with the proposed change and benchmarking numbers.
