Commit 0156929

Redirecting Pytorch Mobile Performance Recipes to ExecuTorch.

1 parent 1f4dae1 commit 0156929

File tree

1 file changed (+4, -353 lines)

recipes_source/mobile_perf.rst

Lines changed: 4 additions & 353 deletions
@@ -1,359 +1,10 @@
Pytorch Mobile Performance Recipes
==================================

.. warning::
   PyTorch Mobile is no longer actively supported. Please check out `ExecuTorch <https://pytorch.org/executorch-overview>`_, PyTorch’s all-new on-device inference library. You can also learn more about `quantization <https://pytorch.org/executorch/stable/quantization-overview.html>`_, `Hardware acceleration (op fusion using hw) <https://pytorch.org/executorch/stable/examples-end-to-end-to-lower-model-to-delegate.html>`_, and `benchmarking <https://pytorch.org/executorch/stable/sdk-profiling.html>`_ on ExecuTorch’s documentation pages.

PyTorch Mobile is no longer actively supported. Please check out ExecuTorch.

Redirecting in 3 seconds...

Introduction
----------------

Performance (aka latency) is crucial to most, if not all, applications and use cases of ML model inference on mobile devices.

Today, PyTorch executes models on the CPU backend, pending the availability of other hardware backends such as GPU, DSP, and NPU.

In this recipe, you will learn:

- How to optimize your model to help decrease execution time (higher performance, lower latency) on the mobile device.
- How to benchmark (to check if optimizations helped your use case).

Model preparation
-----------------

We will start by preparing to optimize your model to help decrease execution time (higher performance, lower latency) on the mobile device.

Setup
^^^^^^^

First, we need to install PyTorch 1.5.0 or later using conda or pip.

::

    conda install pytorch torchvision -c pytorch

or

::

    pip install torch torchvision

Code your model:

::

    import torch
    from torch.utils.mobile_optimizer import optimize_for_mobile

    class AnnotatedConvBnReLUModel(torch.nn.Module):
        def __init__(self):
            super(AnnotatedConvBnReLUModel, self).__init__()
            self.conv = torch.nn.Conv2d(3, 5, 3, bias=False).to(dtype=torch.float)
            self.bn = torch.nn.BatchNorm2d(5).to(dtype=torch.float)
            self.relu = torch.nn.ReLU(inplace=True)
            self.quant = torch.quantization.QuantStub()
            self.dequant = torch.quantization.DeQuantStub()

        def forward(self, x):
            x = x.contiguous(memory_format=torch.channels_last)
            x = self.quant(x)
            x = self.conv(x)
            x = self.bn(x)
            x = self.relu(x)
            x = self.dequant(x)
            return x

    model = AnnotatedConvBnReLUModel()

``torch.quantization.QuantStub`` and ``torch.quantization.DeQuantStub`` are no-op stubs that will be used in the quantization step.
1. Fuse operators using ``torch.quantization.fuse_modules``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Do not be confused by the fact that ``fuse_modules`` is in the quantization package: it works for any ``torch.nn.Module``.

``torch.quantization.fuse_modules`` fuses a list of modules into a single module. It fuses only the following sequences of modules:

- Convolution, Batch normalization
- Convolution, Batch normalization, ReLU
- Convolution, ReLU
- Linear, ReLU

This script fuses Convolution, Batch Normalization and ReLU in the previously declared model:

::

    torch.quantization.fuse_modules(model, [['conv', 'bn', 'relu']], inplace=True)
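
If you want to confirm that the fusion took effect, you can inspect the module types afterwards; a minimal sketch (the exact fused class name differs across PyTorch versions and between train and eval mode):

::

    # After fusion, `conv` holds the fused Conv+BN+ReLU block, while `bn` and
    # `relu` are replaced by no-op Identity modules.
    print(type(model.conv))   # e.g. a fused ConvBnReLU2d / ConvReLU2d variant
    print(type(model.bn))     # torch.nn.Identity
    print(type(model.relu))   # torch.nn.Identity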

2. Quantize your model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can find more about PyTorch quantization in `the dedicated tutorial <https://pytorch.org/blog/introduction-to-quantization-on-pytorch/>`_.

Quantization of the model not only moves computation to int8, but also reduces the size of your model on disk. That size reduction helps to reduce disk reads during the first load of the model and decreases the amount of RAM needed. Both of those resources can be crucial for the performance of mobile applications. This code performs quantization, using a stub for the model calibration function; you can find more about calibration `here <https://pytorch.org/tutorials/advanced/static_quantization_tutorial.html#post-training-static-quantization>`__.

::

    model.qconfig = torch.quantization.get_default_qconfig('qnnpack')
    torch.quantization.prepare(model, inplace=True)

    # Calibrate your model
    def calibrate(model, calibration_data):
        # Your calibration code here
        return

    calibrate(model, [])
    torch.quantization.convert(model, inplace=True)
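
The ``calibrate`` function above is deliberately a stub. In practice it is where you run representative inputs through the prepared model so that the observers inserted by ``prepare`` can record activation ranges; a minimal sketch, assuming ``calibration_data`` is an iterable of tensors shaped like your real inputs:

::

    def calibrate(model, calibration_data):
        # Run representative inputs through the prepared model so the
        # observers can record activation statistics.
        model.eval()
        with torch.no_grad():
            for sample in calibration_data:
                model(sample)

    # Replace the random tensors below with real, representative samples.
    calibrate(model, [torch.rand(1, 3, 224, 224) for _ in range(10)])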
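
To see the on-disk size reduction for your own model, a small helper along the lines of the one below can be used. This is an illustrative sketch; ``float_model`` is assumed to be a copy of the model kept from before ``convert`` was run:

::

    import os

    def print_size_of_model(m, label=""):
        # Serialize the weights and report the file size on disk.
        torch.save(m.state_dict(), "/tmp/size_check.p")
        print(label, os.path.getsize("/tmp/size_check.p") / 1e6, "MB")
        os.remove("/tmp/size_check.p")

    print_size_of_model(float_model, "float model:")
    print_size_of_model(model, "quantized model:")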

3. Use torch.utils.mobile_optimizer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``torch.utils.mobile_optimizer`` package performs several optimizations on the scripted model that help conv2d and linear operations. It pre-packs model weights in an optimized format and fuses the ops above with relu if relu is the next operation.

First we script the resulting model from the previous step:

::

    torchscript_model = torch.jit.script(model)

Next we call ``optimize_for_mobile`` and save the model to disk:

::

    torchscript_model_optimized = optimize_for_mobile(torchscript_model)
    torch.jit.save(torchscript_model_optimized, "model.pt")

4. Prefer Using Channels Last Tensor memory format
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The Channels Last (NHWC) memory format was introduced in PyTorch 1.4.0. It is supported only for four-dimensional tensors. This memory format gives better memory locality for most operators, especially convolution. Our measurements showed a 3x speedup of the MobileNetV2 model compared with the default Channels First (NCHW) format.

At the time of writing this recipe, the PyTorch Android Java API does not support inputs in Channels Last memory format. But it can be used at the TorchScript model level, by adding the conversion for the model inputs:

.. code-block:: python

    def forward(self, x):
        x = x.contiguous(memory_format=torch.channels_last)
        ...

This conversion is zero cost if your input is already in Channels Last memory format. After it, all operators will preserve the Channels Last memory format.
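
For illustration, a minimal sketch of the zero-cost case: if the input is created directly in Channels Last layout, the ``contiguous`` call inside ``forward`` has nothing to do:

::

    # Create the input directly in Channels Last layout; the
    # x.contiguous(memory_format=torch.channels_last) call inside forward()
    # is then a no-op, so no extra copy is paid.
    x = torch.rand(1, 3, 224, 224).to(memory_format=torch.channels_last)
    print(x.is_contiguous(memory_format=torch.channels_last))  # True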

5. Android - Reusing tensors for forward
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This part of the recipe is Android only.

Memory is a critical resource for Android performance, especially on old devices. Tensors can need a significant amount of memory. For example, a standard computer vision tensor contains 1*3*224*224 elements; assuming the data type is float (4 bytes per element), it needs 1*3*224*224*4 = 602,112 bytes, about 588 KB of memory.

::

    FloatBuffer buffer = Tensor.allocateFloatBuffer(1*3*224*224);
    Tensor tensor = Tensor.fromBlob(buffer, new long[]{1, 3, 224, 224});

Here we allocate native memory as a ``java.nio.FloatBuffer`` and create an ``org.pytorch.Tensor`` whose storage points to the memory of the allocated buffer.

For most use cases, we do not run model forward only once; we repeat it with some frequency or as fast as possible.

Allocating new memory for every module forward would be suboptimal. Instead, we can reuse the same memory that we allocated in the previous step, fill it with new data, and run module forward again on the same tensor object.

You can check how it looks in code in the `pytorch android application example <https://github.com/pytorch/android-demo-app/blob/master/PyTorchDemoApp/app/src/main/java/org/pytorch/demo/vision/ImageClassificationActivity.java#L174>`_.

::

    protected AnalysisResult analyzeImage(ImageProxy image, int rotationDegrees) {
      if (mModule == null) {
        mModule = Module.load(moduleFileAbsoluteFilePath);
        mInputTensorBuffer =
            Tensor.allocateFloatBuffer(3 * 224 * 224);
        mInputTensor = Tensor.fromBlob(mInputTensorBuffer, new long[]{1, 3, 224, 224});
      }

      TensorImageUtils.imageYUV420CenterCropToFloatBuffer(
          image.getImage(), rotationDegrees,
          224, 224,
          TensorImageUtils.TORCHVISION_NORM_MEAN_RGB,
          TensorImageUtils.TORCHVISION_NORM_STD_RGB,
          mInputTensorBuffer, 0);

      Tensor outputTensor = mModule.forward(IValue.from(mInputTensor)).toTensor();
    }

Member fields ``mModule``, ``mInputTensorBuffer`` and ``mInputTensor`` are initialized only once, and the buffer is refilled using ``org.pytorch.torchvision.TensorImageUtils.imageYUV420CenterCropToFloatBuffer``.

6. Load time optimization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Available since PyTorch 1.13**

PyTorch Mobile also supports a FlatBuffer-based file format that is faster to load. Both FlatBuffer and pickle-based model files can be loaded with the same ``_load_for_lite_interpreter`` (Python) or ``_load_for_mobile`` (C++) API.

To use the FlatBuffer format, instead of creating the model file with ``model._save_for_lite_interpreter('path/to/file.ptl')``, you can save it with:

::

    model._save_for_lite_interpreter('path/to/file.ptl', _use_flatbuffer=True)

The extra argument ``_use_flatbuffer`` produces a FlatBuffer file instead of a zip file. The created file will be faster to load.

For example, using a DeepLabV3 model with a ResNet-50 backbone and running the following script:

::

    import torch
    from torch.jit import mobile

    model = torch.hub.load('pytorch/vision:v0.10.0', 'deeplabv3_resnet50', pretrained=True)
    model.eval()
    jit_model = torch.jit.script(model)

    jit_model._save_for_lite_interpreter('/tmp/jit_model.ptl')
    jit_model._save_for_lite_interpreter('/tmp/jit_model.ff', _use_flatbuffer=True)

    import timeit
    print('Load ptl file:')
    print(timeit.timeit('from torch.jit import mobile; mobile._load_for_lite_interpreter("/tmp/jit_model.ptl")',
                        number=20))
    print('Load flatbuffer file:')
    print(timeit.timeit('from torch.jit import mobile; mobile._load_for_lite_interpreter("/tmp/jit_model.ff")',
                        number=20))

you would get the following result:

::

    Load ptl file:
    0.5387594579999999
    Load flatbuffer file:
    0.038842832999999466

While speedups on actual mobile devices will be smaller, you can still expect a 3x - 6x load time reduction.

**Reasons to avoid using a FlatBuffer-based mobile model**

However, the FlatBuffer format also has some limitations that you might want to consider:

* It is only available in PyTorch 1.13 or later. Therefore, client devices compiled with earlier PyTorch versions might not be able to load it.
* The FlatBuffers library imposes a 4GB limit on file sizes, so it is not suitable for large models.

Benchmarking
------------

The best way to benchmark (that is, to check if your optimizations helped your use case) is to measure the particular use case that you want to optimize, as performance behavior can vary in different environments.

The PyTorch distribution provides a way to benchmark a bare binary that runs the model forward; this approach can give more stable measurements than testing inside the application.
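
Before moving to a device, a rough desktop-side sanity check of the saved TorchScript file can be done with a simple ``timeit`` loop. This is a minimal sketch, assuming your desktop build of PyTorch ships the ``qnnpack`` quantized engine used above; absolute numbers on a phone will differ, so rely on the on-device tools described next:

::

    import timeit
    import torch

    # Match the qconfig used earlier in this recipe (illustrative assumption).
    torch.backends.quantized.engine = 'qnnpack'

    model = torch.jit.load("model.pt")
    model.eval()
    x = torch.rand(1, 3, 224, 224)

    with torch.no_grad():
        model(x)  # warm-up run
        print(timeit.timeit(lambda: model(x), number=50) / 50, "seconds per iteration")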

Android - Benchmarking Setup
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This part of the recipe is Android only.

For this you first need to build the benchmark binary:

::

    <from-your-root-pytorch-dir>
    rm -rf build_android
    BUILD_PYTORCH_MOBILE=1 ANDROID_ABI=arm64-v8a ./scripts/build_android.sh -DBUILD_BINARY=ON

You should now have an arm64 binary at ``build_android/bin/speed_benchmark_torch``. This binary takes ``--model=<path-to-model>``, ``--input_dims="1,3,224,224"`` as dimension information for the input, and ``--input_type="float"`` as the type of the input as arguments.

Once you have your Android device connected, push the speed_benchmark_torch binary and your model to the phone:

::

    adb push <speedbenchmark-torch> /data/local/tmp
    adb push <path-to-scripted-model> /data/local/tmp

Now we are ready to benchmark your model:

::

    adb shell "/data/local/tmp/speed_benchmark_torch --model=/data/local/tmp/model.pt --input_dims=1,3,224,224 --input_type=float"
    ----- output -----
    Starting benchmark.
    Running warmup runs.
    Main runs.
    Main run finished. Microseconds per iter: 121318. Iters per second: 8.24281

iOS - Benchmarking Setup
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For iOS, we'll be using our `TestApp <https://github.com/pytorch/pytorch/tree/master/ios/TestApp>`_ as the benchmarking tool.

To begin with, let's apply the ``optimize_for_mobile`` method to our Python script located at `TestApp/benchmark/trace_model.py <https://github.com/pytorch/pytorch/blob/master/ios/TestApp/benchmark/trace_model.py>`_. Simply modify the code as below:

::

    import torch
    import torchvision
    from torch.utils.mobile_optimizer import optimize_for_mobile

    model = torchvision.models.mobilenet_v2(pretrained=True)
    model.eval()
    example = torch.rand(1, 3, 224, 224)
    traced_script_module = torch.jit.trace(model, example)
    torchscript_model_optimized = optimize_for_mobile(traced_script_module)
    torch.jit.save(torchscript_model_optimized, "model.pt")

Now let's run ``python trace_model.py``. If everything works well, we should be able to generate our optimized model in the benchmark directory.

Next, we're going to build the PyTorch libraries from source.

::

    BUILD_PYTORCH_MOBILE=1 IOS_ARCH=arm64 ./scripts/build_ios.sh

Now that we have the optimized model and PyTorch ready, it's time to generate our Xcode project and do the benchmarking. To do that, we'll be using a Ruby script, ``setup.rb``, which does the heavy lifting of setting up the Xcode project.

::

    ruby setup.rb

Now open ``TestApp.xcodeproj``, plug in your iPhone, and you're ready to go. Below is an example result from an iPhone X:

::

    TestApp[2121:722447] Main runs
    TestApp[2121:722447] Main run finished. Milliseconds per iter: 28.767
    TestApp[2121:722447] Iters per second: : 34.762
    TestApp[2121:722447] Done.

.. raw:: html

    <meta http-equiv="Refresh" content="3; url='https://pytorch.org/executorch/stable/index.html'" />
