Inference Optimization

Inference optimization is a complex subject, and the right approach depends on your model and use case. This page collects advice on several aspects of it.

Custom Python Code Optimization

When using the Seldon Python wrapper, there are several optimization areas to consider.

Seldon Protocol Payload Types with REST and gRPC

Note: Seldon has adopted the industry-standard Open Inference Protocol (OIP) and is no longer maintaining the Seldon and TensorFlow protocols. This transition allows for greater interoperability among various model serving runtimes, such as MLServer. To learn more about implementing OIP for model serving in Seldon Core 1, see MLServer.

We strongly encourage you to adopt the OIP, which provides seamless integration across diverse model serving runtimes, supports the development of versatile client and benchmarking tools, and ensures a high-performance, consistent, and unified inference experience.

When you send tensor data, the request format carries a serialization/deserialization cost in the Python wrapper. That cost depends on whether you use REST or gRPC and on the payload type you choose. This is investigated in a python serialization notebook.

The conclusions are:

gRPC is faster than REST.
tftensor is the best choice for large batch sizes.
ndarray with gRPC performs poorly for large batch sizes.
The simpler tensor and ndarray types are better for small batch sizes.
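
To make the payload types above concrete, the following is a minimal sketch of how the same batch looks as an ndarray payload and as a tensor payload in the Seldon protocol's JSON form. The helper names `to_ndarray_payload` and `to_tensor_payload` are hypothetical and are not part of the Seldon Python wrapper. The sketch uses only the standard library and assumes a rectangular 2-D batch.

```python
import json


def to_ndarray_payload(rows):
    """Build an 'ndarray' payload: nested lists, shape implied by the nesting."""
    return {"data": {"ndarray": rows}}


def to_tensor_payload(rows):
    """Build a 'tensor' payload: explicit shape plus a flat list of values."""
    # Assumes every row has the same length (a rectangular 2-D batch).
    flat = [value for row in rows for value in row]
    return {"data": {"tensor": {"shape": [len(rows), len(rows[0])], "values": flat}}}


# A small illustrative batch: 3 rows of 4 features each.
batch = [[float(i + j) for j in range(4)] for i in range(3)]

ndarray_body = json.dumps(to_ndarray_payload(batch))
tensor_body = json.dumps(to_tensor_payload(batch))
```

Both bodies describe the same data. They differ in how much structure the wrapper has to walk when converting the request, which is one reason the cost varies with batch size.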

KMP_AFFINITY

If you are running inference on Intel CPUs with compatible libraries, setting the KMP and OMP environment variables correctly can be useful. Most advice on these settings discusses a single inference request and how to optimize it for low latency. Be careful with KMP_AFFINITY when you expect to handle parallel inference requests, as they may block in unexpected ways if CPU affinity is being used. We provide an example benchmarking notebook.
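
As an illustration of the trade-off described above, here is a minimal sketch that derives OpenMP settings from the number of cores and the number of requests you expect to serve in parallel. The helper `openmp_env` and the choice to divide cores evenly among workers are assumptions made for this example, not recommendations from the benchmarking notebook. The `KMP_AFFINITY` value shown is a commonly cited Intel OpenMP setting and is applied only when pinning is explicitly requested.

```python
import os


def openmp_env(total_cores, parallel_workers, pin_threads=False):
    """Return OpenMP-related environment settings for one worker process.

    Hypothetical helper: splits the cores evenly between workers and only
    pins threads when asked, because pinning can make parallel requests
    contend for the same cores.
    """
    threads = max(1, total_cores // max(1, parallel_workers))
    env = {"OMP_NUM_THREADS": str(threads)}
    if pin_threads:
        # Assumed value; tune it for your own hardware and libraries.
        env["KMP_AFFINITY"] = "granularity=fine,compact,1,0"
    return env


# Apply the settings before importing the numerical library that reads them.
settings = openmp_env(total_cores=os.cpu_count() or 1, parallel_workers=2)
for name, value in settings.items():
    os.environ.setdefault(name, value)
```

Using `setdefault` leaves any value already set in the container environment untouched, so settings made in the deployment still take precedence.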

There are many resources for looking deeper into your model's case. Some we have found are:

gRPC multi-processing

From the 1.10.0 release of Seldon Core, the Python wrapper's gRPC server also respects GUNICORN_NUM_WORKERS and can handle parallel gRPC requests.
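
As a sketch of how this variable might be supplied, the snippet below builds the environment entry for a model container in the Kubernetes `name`/`value` form. The helper `worker_env_entry`, the container name `classifier`, and the worker count of 4 are all hypothetical and chosen only for illustration.

```python
import json


def worker_env_entry(num_workers):
    """Build a Kubernetes-style env entry for GUNICORN_NUM_WORKERS."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    # Kubernetes requires env values to be strings, not integers.
    return {"name": "GUNICORN_NUM_WORKERS", "value": str(num_workers)}


# Hypothetical container spec fragment for a model named "classifier".
container = {"name": "classifier", "env": [worker_env_entry(4)]}
print(json.dumps(container, indent=2))
```

The same fragment can be written directly in the container section of your deployment manifest. The point to note is that the value must be quoted as a string.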

Benchmarks

We provide links to various benchmarking notebooks.
