Replies: 3 comments
Batch request optimization is crucial for ML serving throughput. At RevolutionAI (https://revolutionai.io), we have tuned BentoML batching for various models. Key optimizations:
```python
import bentoml
import numpy as np

@bentoml.service(traffic={"timeout": 60})
class Predictor:
    @bentoml.api(batchable=True, max_batch_size=32, max_latency_ms=100)
    def predict(self, inputs: list[np.ndarray]) -> list[np.ndarray]:
        # The dispatcher accumulates individual requests; process them as one batch
        return self.model.predict_batch(np.stack(inputs))
```
What model type are you optimizing? Batch characteristics vary by architecture.
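For intuition, here is a toy dispatcher sketching the behaviour this config enables: requests queue up and are flushed either when `max_batch_size` is reached or when `max_latency_ms` expires. This is a simplified illustration, not BentoML's actual implementation.

```python
import queue
import threading
import time

class ToyBatcher:
    """Toy request batcher: flush on max_batch_size or max_latency_ms."""

    def __init__(self, predict_batch, max_batch_size=32, max_latency_ms=100):
        self.predict_batch = predict_batch
        self.max_batch_size = max_batch_size
        self.max_latency = max_latency_ms / 1000.0
        self.q = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, x):
        """Enqueue one input; the caller waits on the returned item's event."""
        item = {"x": x, "done": threading.Event()}
        self.q.put(item)
        return item

    def _loop(self):
        while True:
            batch = [self.q.get()]  # block until the first request arrives
            deadline = time.monotonic() + self.max_latency
            while len(batch) < self.max_batch_size:
                timeout = deadline - time.monotonic()
                if timeout <= 0:
                    break
                try:
                    batch.append(self.q.get(timeout=timeout))
                except queue.Empty:
                    break  # latency budget exhausted; flush what we have
            outputs = self.predict_batch([b["x"] for b in batch])
            for b, y in zip(batch, outputs):
                b["result"] = y
                b["done"].set()

# Toy "model" that doubles each input, applied to the whole batch at once
batcher = ToyBatcher(lambda xs: [x * 2 for x in xs],
                     max_batch_size=4, max_latency_ms=50)
items = [batcher.submit(i) for i in range(4)]
for it in items:
    it["done"].wait(timeout=2)
```

Each submitted request gets its own result back even though the model only ever sees whole batches, which is exactly the contract a `batchable=True` endpoint relies on.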
These are adaptive batching heuristics. Quick breakdown:

- Sample collection (…)
- Refresh interval (…)
- Initial parameters:
  ```python
  o_a = min(2, max_latency * 2.0 / 30)  # overhead per request
  o_b = min(1, max_latency * 1.0 / 30)  # overhead per batch
  ```
- TokenBucket (…)
TL;DR: These are empirical defaults that work well across common ML inference patterns; the system self-tunes from there. Tuning tips: we tune batch inference at Revolution AI, and in our experience the defaults work for 90% of cases.
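The least-squares step that refines `o_a` and `o_b` can be sketched as follows. The helper and the sample data below are made up for illustration; they are not BentoML's code or real measurements.

```python
# Fit the linear cost model latency ≈ o_a * n + o_b from recorded
# (batch_size, latency_ms) samples via ordinary least squares.

def fit_linear(samples: list[tuple[int, float]]) -> tuple[float, float]:
    """Return (o_a, o_b) minimizing squared error of latency = o_a*n + o_b."""
    k = len(samples)
    sx = sum(n for n, _ in samples)
    sy = sum(t for _, t in samples)
    sxx = sum(n * n for n, _ in samples)
    sxy = sum(n * t for n, t in samples)
    o_a = (k * sxy - sx * sy) / (k * sxx - sx * sx)
    o_b = (sy - o_a * sx) / k
    return o_a, o_b

# Hypothetical samples roughly following latency = 2*n + 10 (ms)
samples = [(1, 12.1), (4, 18.2), (8, 26.0), (16, 42.1), (32, 74.3)]
o_a, o_b = fit_linear(samples)
# o_a ≈ marginal cost per request, o_b ≈ fixed per-batch overhead
```

Once the dispatcher has a decent estimate of `o_a` and `o_b`, it can predict how long any candidate batch would take and pick the largest batch that still fits the latency budget.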
Batch optimization is key for throughput! At RevolutionAI (https://revolutionai.io) we tuned this extensively. A config that works:

```python
import bentoml
import numpy as np

@bentoml.service(
    traffic={"timeout": 60},
    workers=4,
)
class BatchService:
    @bentoml.api(
        batchable=True,
        batch_dim=0,
        max_batch_size=32,
        max_latency_ms=100,
    )
    def predict(self, inputs: np.ndarray) -> np.ndarray:
        # `model` here stands for the loaded model object (loading omitted)
        return model(inputs)
```

Tuning tips:
3-5x throughput gains are typical!
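A back-of-envelope check of that range, assuming the linear latency model from the earlier reply (the `o_a`/`o_b` values here are illustrative, not measured):

```python
# A batch of n requests costs o_a * n + o_b, so throughput relative to
# serving one request at a time is n * (o_a + o_b) / (o_a * n + o_b).

def speedup(n: int, o_a: float = 2.0, o_b: float = 10.0) -> float:
    """Throughput of batch size n relative to batch size 1."""
    return (n * (o_a + o_b)) / (o_a * n + o_b)

# As n grows the speedup approaches (o_a + o_b) / o_a = 6x here;
# batch sizes of 8-32 land in the 3-5x range.
```

The ceiling is set by the ratio of fixed per-batch overhead to per-request cost, which is why the gain saturates rather than growing without bound.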
I'm currently working with BentoML's batchable API and have some questions regarding the optimization parameters used during batch request processing.
I'm using the following decorator for batchable API:
```python
@bentoml.api(batchable=True, max_batch_size=32, max_latency_ms=1000)
```

I noticed the following parameters involved in the implementation:
Additionally, in the linear regression (least squares) optimization process, I saw that initial values for parameters are set as follows:
Lastly, the TokenBucket algorithm is used to control refresh intervals:
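My rough mental model of such a token bucket is below; this is a hypothetical sketch for discussion, not BentoML's actual code, and the capacity/refill values are placeholders.

```python
import time

class TokenBucket:
    """Toy token bucket: an action is allowed only when a token is available."""

    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_sec = refill_per_sec
        self.last = time.monotonic()

    def consume(self, n: int = 1) -> bool:
        """Try to take n tokens; refill based on elapsed time first."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

# e.g. allow at most a burst of 2 refreshes, refilling one token per 5 s
bucket = TokenBucket(capacity=2, refill_per_sec=0.2)
```

Under this reading, the bucket caps how often the parameter refresh (the regression re-fit) may run, so a burst of traffic cannot trigger constant re-fitting.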
I would like to understand how the values for these parameters (such as 50, 2, 5, 2, 2.0, 30, etc.) are derived and how they relate to the batch processing mechanism. What is their role in optimizing performance, and why were these specific values chosen? Any insights would be greatly appreciated. Thanks!