Inference optimization is a complex subject, and the right approach depends on your model and use case. This page collects advice on several areas.
When using the Seldon Python wrapper, there are several optimization areas to consider.
Note: Seldon has adopted the industry-standard Open Inference Protocol (OIP) and is no longer maintaining the Seldon and TensorFlow protocols. This transition allows for greater interoperability among various model serving runtimes, such as MLServer. To learn more about implementing OIP for model serving in Seldon Core 1, see MLServer.
We strongly encourage you to adopt the OIP, which provides seamless integration across diverse model serving runtimes, supports the development of versatile client and benchmarking tools, and ensures a high-performance, consistent, and unified inference experience.
The request format carries a serialization and deserialization cost in the Python wrapper, and that cost depends on whether you use REST or gRPC and on how you encode tensor data. This is investigated in a Python serialization notebook.
The conclusions are:
- gRPC is faster than REST
- tftensor is best for large batch size
- ndarray with gRPC is bad for large batch size
- simpler tensor/ndarray is better for small batch size
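To make the payload shapes behind these conclusions concrete, the following is a minimal sketch of how the same batch might be encoded as `ndarray` and as `tensor` for a REST request. The helper names `ndarray_payload` and `tensor_payload` are hypothetical, and the layout assumes the `data.ndarray` and `data.tensor` fields of the Seldon protocol; measure with your own data before choosing a format.

```python
import json


def ndarray_payload(rows):
    """Hypothetical helper: nested-list encoding, one inner list per row."""
    return {"data": {"ndarray": rows}}


def tensor_payload(rows):
    """Hypothetical helper: flat values plus an explicit shape."""
    n_cols = len(rows[0]) if rows else 0
    values = [value for row in rows for value in row]
    return {"data": {"tensor": {"shape": [len(rows), n_cols], "values": values}}}


batch = [[1.0, 2.0], [3.0, 4.0]]
nd = ndarray_payload(batch)
tensor = tensor_payload(batch)

# Compare the size of the JSON bodies that would be sent over REST.
print(len(json.dumps(nd)), len(json.dumps(tensor)))
```

Payload size is only one part of the cost. The time spent converting to and from the model's native array type also matters, which is why benchmarking with realistic batch sizes is worthwhile.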
If you are running inference on Intel CPUs with compatible libraries, setting the KMP and OMP environment variables correctly can help. Most advice on this subject discusses a single inference request and how to optimize it for low latency. Be careful with KMP_AFFINITY when you expect to handle parallel inference requests, as they may block in unexpected ways if CPU affinity is in use. We provide an example benchmarking notebook.
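As a rough illustration of the point above, the sketch below builds a set of OpenMP settings and applies them to the process environment, leaving KMP_AFFINITY unset unless pinning is explicitly requested. The function name `intel_thread_env` and the specific values are assumptions chosen for illustration, not recommendations from Seldon; these variables are normally read when the numerical library starts, so they would need to be set before it is imported.

```python
import os


def intel_thread_env(num_threads: int, pin_threads: bool = False) -> dict:
    """Hypothetical helper returning illustrative OpenMP/KMP settings."""
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    env = {
        "OMP_NUM_THREADS": str(num_threads),
        # Illustrative value: how long (ms) a thread spins before sleeping.
        "KMP_BLOCKTIME": "1",
    }
    if pin_threads:
        # Pinning may cause parallel requests to block; enable with care.
        env["KMP_AFFINITY"] = "granularity=fine,compact,1,0"
    return env


settings = intel_thread_env(num_threads=4)
# Apply before importing the framework that uses OpenMP.
os.environ.update(settings)
```

Keeping pinning opt-in reflects the caution above: a setting tuned for one low-latency request can behave differently once several requests are served in parallel.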
There are many resources for looking deeper into your particular model. Some we have found are:
- Maximize TensorFlow Performance on CPU: Considerations and Recommendations for Inference Workloads
- Tensorflow Issue on KMP_AFFINITY
- Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon® Processor Based HPC Infrastructures
- Optimizing BERT model for Intel CPU Cores using ONNX runtime default execution provider
- Using Intel OpenMP Thread Affinity for Pinning
- Consider adjusting OMP_NUM_THREADS environment variable for containerized deployments
- AWS Deep Learning Containers
- General Best Practices for Intel® Optimization for TensorFlow
From the 1.10.0 release of Seldon Core, the Python wrapper gRPC server also respects GUNICORN_NUM_WORKERS and can handle parallel gRPC requests.
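When several workers serve requests in parallel, each worker's OpenMP thread pool competes for the same cores. The sketch below shows one possible way to derive a per-worker thread count from GUNICORN_NUM_WORKERS and a container CPU limit. The function `plan_threads`, the even split, and the default of one worker when the variable is unset are all assumptions for illustration, not documented Seldon behaviour.

```python
def plan_threads(env: dict, cpu_limit: int) -> dict:
    """Hypothetical helper: split a CPU limit evenly across workers."""
    # Assumption: treat an unset GUNICORN_NUM_WORKERS as a single worker.
    workers = int(env.get("GUNICORN_NUM_WORKERS", "1"))
    if workers < 1 or cpu_limit < 1:
        raise ValueError("workers and cpu_limit must be positive")
    # Give every worker at least one thread, even if oversubscribed.
    threads = max(1, cpu_limit // workers)
    return {
        "GUNICORN_NUM_WORKERS": str(workers),
        "OMP_NUM_THREADS": str(threads),
    }


plan = plan_threads({"GUNICORN_NUM_WORKERS": "4"}, cpu_limit=8)
print(plan)
```

An even split is only a starting point. The benchmarking notebooks are a better guide to the balance between worker count and threads per worker for a given model.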
We provide links to various benchmarking notebooks.