Replies: 3 comments
Batch request optimization is crucial for ML serving throughput. At RevolutionAI (https://revolutionai.io), we have tuned BentoML batching for various models. Key optimizations:
```python
import bentoml
import numpy as np

@bentoml.service(traffic={"timeout": 60})
class Predictor:
    @bentoml.api(batchable=True, max_batch_size=32, max_latency_ms=100)
    def predict(self, inputs: list[np.ndarray]) -> list[np.ndarray]:
        # The dispatcher accumulates individual requests; process them as one batch
        return self.model.predict_batch(np.stack(inputs))
```
What model type are you optimizing? Batch characteristics vary by architecture.
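For intuition, here is a toy dispatcher sketching the behaviour this config enables: requests queue up and are flushed either when `max_batch_size` is reached or when `max_latency_ms` expires. This is a simplified illustration, not BentoML's actual implementation.

```python
import queue
import threading
import time

class ToyBatcher:
    """Toy request batcher: flush on max_batch_size or max_latency_ms."""

    def __init__(self, predict_batch, max_batch_size=32, max_latency_ms=100):
        self.predict_batch = predict_batch
        self.max_batch_size = max_batch_size
        self.max_latency = max_latency_ms / 1000.0
        self.q = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, x):
        """Enqueue one input; the caller waits on the returned item's event."""
        item = {"x": x, "done": threading.Event()}
        self.q.put(item)
        return item

    def _loop(self):
        while True:
            batch = [self.q.get()]  # block until the first request arrives
            deadline = time.monotonic() + self.max_latency
            while len(batch) < self.max_batch_size:
                timeout = deadline - time.monotonic()
                if timeout <= 0:
                    break
                try:
                    batch.append(self.q.get(timeout=timeout))
                except queue.Empty:
                    break  # latency budget exhausted; flush what we have
            outputs = self.predict_batch([b["x"] for b in batch])
            for b, y in zip(batch, outputs):
                b["result"] = y
                b["done"].set()

# Toy "model" that doubles each input, applied to the whole batch at once
batcher = ToyBatcher(lambda xs: [x * 2 for x in xs],
                     max_batch_size=4, max_latency_ms=50)
items = [batcher.submit(i) for i in range(4)]
for it in items:
    it["done"].wait(timeout=2)
```

Each submitted request gets its own result back even though the model only ever sees whole batches, which is exactly the contract a `batchable=True` endpoint relies on.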
These are adaptive batching heuristics. Quick breakdown:

- Sample collection (…)
- Refresh interval (…)
- Initial parameters:
  ```python
  o_a = min(2, max_latency * 2.0 / 30)  # overhead per request
  o_b = min(1, max_latency * 1.0 / 30)  # overhead per batch
  ```
- TokenBucket (…)
TL;DR: These are empirical defaults that work well across common ML inference patterns; the system self-tunes from there. Tuning tips: we tune batch inference at Revolution AI, and in our experience the defaults work for 90% of cases.
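The least-squares step that refines `o_a` and `o_b` can be sketched as follows. The helper and the sample data below are made up for illustration; they are not BentoML's code or real measurements.

```python
# Fit the linear cost model latency ≈ o_a * n + o_b from recorded
# (batch_size, latency_ms) samples via ordinary least squares.

def fit_linear(samples: list[tuple[int, float]]) -> tuple[float, float]:
    """Return (o_a, o_b) minimizing squared error of latency = o_a*n + o_b."""
    k = len(samples)
    sx = sum(n for n, _ in samples)
    sy = sum(t for _, t in samples)
    sxx = sum(n * n for n, _ in samples)
    sxy = sum(n * t for n, t in samples)
    o_a = (k * sxy - sx * sy) / (k * sxx - sx * sx)
    o_b = (sy - o_a * sx) / k
    return o_a, o_b

# Hypothetical samples roughly following latency = 2*n + 10 (ms)
samples = [(1, 12.1), (4, 18.2), (8, 26.0), (16, 42.1), (32, 74.3)]
o_a, o_b = fit_linear(samples)
# o_a ≈ marginal cost per request, o_b ≈ fixed per-batch overhead
```

Once the dispatcher has a decent estimate of `o_a` and `o_b`, it can predict how long any candidate batch would take and pick the largest batch that still fits the latency budget.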
Batch optimization is key for throughput! At RevolutionAI (https://revolutionai.io) we tuned this extensively. A config that works:

```python
import bentoml
import numpy as np

@bentoml.service(
    traffic={"timeout": 60},
    workers=4,
)
class BatchService:
    @bentoml.api(
        batchable=True,
        batch_dim=0,
        max_batch_size=32,
        max_latency_ms=100,
    )
    def predict(self, inputs: np.ndarray) -> np.ndarray:
        # `model` here stands for the loaded model object (loading omitted)
        return model(inputs)
```

Tuning tips:
3-5x throughput gains are typical!
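A back-of-envelope check of that range, assuming the linear latency model from the earlier reply (the `o_a`/`o_b` values here are illustrative, not measured):

```python
# A batch of n requests costs o_a * n + o_b, so throughput relative to
# serving one request at a time is n * (o_a + o_b) / (o_a * n + o_b).

def speedup(n: int, o_a: float = 2.0, o_b: float = 10.0) -> float:
    """Throughput of batch size n relative to batch size 1."""
    return (n * (o_a + o_b)) / (o_a * n + o_b)

# As n grows the speedup approaches (o_a + o_b) / o_a = 6x here;
# batch sizes of 8-32 land in the 3-5x range.
```

The ceiling is set by the ratio of fixed per-batch overhead to per-request cost, which is why the gain saturates rather than growing without bound.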
I'm currently working with BentoML's batchable API and have some questions regarding the optimization parameters used during batch request processing.
I'm using the following decorator for batchable API:
```python
@bentoml.api(batchable=True, max_batch_size=32, max_latency_ms=1000)
```

I noticed the following parameters involved in the implementation:
Additionally, in the linear regression (least squares) optimization process, I saw that initial values for parameters are set as follows:
Lastly, the TokenBucket algorithm is used to control refresh intervals:
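My rough mental model of such a token bucket is below; this is a hypothetical sketch for discussion, not BentoML's actual code, and the capacity/refill values are placeholders.

```python
import time

class TokenBucket:
    """Toy token bucket: an action is allowed only when a token is available."""

    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_sec = refill_per_sec
        self.last = time.monotonic()

    def consume(self, n: int = 1) -> bool:
        """Try to take n tokens; refill based on elapsed time first."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

# e.g. allow at most a burst of 2 refreshes, refilling one token per 5 s
bucket = TokenBucket(capacity=2, refill_per_sec=0.2)
```

Under this reading, the bucket caps how often the parameter refresh (the regression re-fit) may run, so a burst of traffic cannot trigger constant re-fitting.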
I would like to understand how the values for these parameters (such as 50, 2, 5, 2, 2.0, 30, etc.) are derived and how they relate to the batch processing mechanism. What is their role in optimizing performance, and why were these specific values chosen? Any insights would be greatly appreciated. Thanks!