This repository was archived by the owner on Apr 28, 2023. It is now read-only.

Commit 253f7d9

Merge pull request #80 from facebookresearch/tutorial
Adding a tutorial using TensorDot operation
2 parents a5d9493 + 19b6be8 commit 253f7d9

File tree

5 files changed: +165 -0 lines changed
docs/source/_static/img/autotuning-py.jpg (binary image, 265 KB)

docs/source/framework/pytorch_integration/autotuning_layers.rst

Lines changed: 2 additions & 0 deletions
@@ -45,6 +45,8 @@ my_layer.autotune
 .. autoclass:: TcUnit
     :members: autotune

+.. _autotune_parameters:
+
 Autotuning Parameters
 ---------------------

docs/source/index.rst

Lines changed: 6 additions & 0 deletions
@@ -72,3 +72,9 @@ Machine Learning.
    :caption: Support

    contacts
+
+.. toctree::
+   :maxdepth: 1
+   :caption: Tutorials Reference
+
+   tutorials/index

docs/source/tutorials/index.rst

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
Tensor Comprehensions Tutorials
===============================

**Author**: `Priya Goyal <https://github.com/prigoyal>`_

Tensor Comprehensions (TC) is a framework-agnostic library to **automatically**
synthesize high-performance Machine Learning kernels. TC relies on the
`Halide <https://github.com/halide/Halide>`_ IR to express computation and on analysis
tools to reason about it. TC uses :code:`polyhedral` compilation techniques to
(semi-)automatically decide how to perform this computation efficiently and to produce
fast code. We also provide TC integration with PyTorch and Caffe2.

To read more about Tensor Comprehensions, see our documentation at
https://facebookresearch.github.io/TensorComprehensions/; the C++ API documentation is
available at https://facebookresearch.github.io/TensorComprehensions/api.

We provide many **Python examples** for expressing and running various ML layers
with TC. The examples can be found `here <https://github.com/facebookresearch/TensorComprehensions/tree/master/test_python/layers>`_.

To read more about framework integrations, check out our documentation on `PyTorch <https://facebookresearch.github.io/TensorComprehensions/framework/pytorch_integration/getting_started.html>`_ integration
and `Caffe2 <https://facebookresearch.github.io/TensorComprehensions/framework/caffe2_integration/integration_with_example.html>`_
integration.

If you want to **integrate your framework** with TC, it's easy: instructions are
available at https://facebookresearch.github.io/TensorComprehensions/integrating_any_ml_framework.html.

.. toctree::
   :maxdepth: 1
   :caption: Tutorials

   tutorial_tensordot_with_tc
docs/source/tutorials/tutorial_tensordot_with_tc.rst

Lines changed: 125 additions & 0 deletions
@@ -0,0 +1,125 @@
Using TC to get fast CUDA code for Tensor Contraction
=====================================================

In this tutorial, we will see how we can start from an arbitrary math operation,
express it in the TC language, and easily get fast CUDA code for it. We will also
see how to tune that code for better performance. All of this is possible with
only 3-4 lines of code. Let's get started.

For this tutorial, you will need to install the Tensor Comprehensions binary. You can
get binary builds of Tensor Comprehensions with ``conda install -y -c pytorch -c prigoyal tensor_comprehensions``.
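Once installed, here is a minimal sanity check (a sketch that assumes a machine with an
NVIDIA GPU, since TC generates CUDA kernels):

.. code-block:: python

    # verify the install: both packages should import and CUDA should be available
    import torch
    import tensor_comprehensions as tc

    print(torch.cuda.is_available())  # expect True on a supported setup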

About the TensorDot operation
-----------------------------

First, we find an operation that we want to generate fast CUDA code for. A lot of
operations like convolution and pooling are standard and have CUDA code readily
available, so instead we are going to pick a new and different operation. How do we
find a new operation?

**Sources**: Maybe you have a research paper idea like KRU, or there is a
NumPy operation that is interesting to you and is needed in a Machine Learning model.
Going by the NumPy docs on linear algebra,
`TensorDot <https://docs.scipy.org/doc/numpy/reference/generated/numpy.tensordot.html#numpy.tensordot>`_
seems like an interesting operation.

**The TensorDot operation**

Assume that we have two tensors, one with dimensions :code:`(N, C1, C2, H, W)` and one with dimensions
:code:`(N, C2, C3, H, W)`, and we want to do a gemm-type computation on the :code:`C`
dimensions to get an output of shape :code:`(N, C1, C3, H, W)`. Basically, for each
:code:`(N, H, W)` combination, we want to do a reduction :code:`(C1, C2) * (C2, C3) = (C1, C3)`.

So this operation can be represented as :code:`N x H x W` independent gemms, and one
could try to write a batched gemm kernel for it. But does that guarantee good
performance? What if the tensor sizes look like this: :code:`N=32, C1=512, C2=8, C3=2, H=28, W=28`,
i.e. the value of :code:`C1` is large compared to :code:`C2` / :code:`C3`?
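To make the batched-gemm view concrete, here is a small plain-PyTorch reference sketch
(not TC code; the sizes below are made-up values for a quick shape check):

.. code-block:: python

    import torch

    # reference semantics: for each (n, h, w), contract (C1, C2) @ (C2, C3)
    N, C1, C2, C3, H, W = 2, 4, 3, 5, 8, 8  # small illustrative sizes
    I0 = torch.randn(N, C1, C2, H, W)
    I1 = torch.randn(N, C2, C3, H, W)

    # move the (N, H, W) batch dims to the front, then use batched matmul
    ref = torch.matmul(
        I0.permute(0, 3, 4, 1, 2),   # (N, H, W, C1, C2)
        I1.permute(0, 3, 4, 1, 2),   # (N, H, W, C2, C3)
    ).permute(0, 3, 4, 1, 2)         # back to (N, C1, C3, H, W)
    print(ref.shape)                 # torch.Size([2, 4, 5, 8, 8])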
Let's see how we can get a CUDA kernel for such an operation with TC and then tune it.

Step 1: Write TC for the TensorDot operation
--------------------------------------------

The first step is to express the TensorDot operation in the TC language. For more information on how to do
so, you can refer to our `documentation <https://facebookresearch.github.io/TensorComprehensions/index.html>`_
and also find various TC examples `here <https://facebookresearch.github.io/TensorComprehensions/framework/pytorch_integration/layers_database.html>`_.

.. code-block:: python

    # import both tc and torch
    import tensor_comprehensions as tc
    import torch

    # define the operation in the TC language
    lang = """
    def tensordot(float(N, C1, C2, H, W) I0, float(N, C2, C3, H, W) I1) -> (O) {
        O(n, c1, c3, h, w) +=! I0(n, c1, c2, h, w) * I1(n, c2, c3, h, w)
    }
    """
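.. note::

    In the TC above, :code:`+=!` denotes a sum reduction whose accumulator is
    initialized before accumulating. The reduction index :code:`c2` appears only on
    the right-hand side, so it is reduced over, and the loop extents are inferred
    from the input tensor sizes.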
Step 2: Register the operation with TC
--------------------------------------

Now, we will take the TC string and register it with the TC backend by calling :code:`tc.define`.

.. code-block:: python

    # register the lang with the TC backend
    tensordot = tc.define(lang, name="tensordot")

.. note::

    The :code:`name` argument should match the name of the def in :code:`lang`.

Step 3: Create input tensors and run TC
---------------------------------------

Now that the TC is registered, we will create the input tensors and run it.

.. code-block:: python

    # create input CUDA tensors
    N, C1, C2, C3, H, W = 32, 512, 8, 2, 28, 28
    I0, I1 = torch.randn(N, C1, C2, H, W).cuda(), torch.randn(N, C2, C3, H, W).cuda()
    # choose the options that most resemble the operation and run
    out = tensordot(I0, I1, options=tc.Options("conv"))

.. note::

    The :code:`options` can be obtained by autotuning the kernel with the autotuner
    (next step), or you can choose one of the provided defaults. We strongly recommend
    running the autotuner rather than using manual options for better performance. See
    :ref:`must_pass_options` for more information about options.

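As a quick sanity check, you can compare the TC output against a plain batched-matmul
reference (a sketch continuing from the snippet above; the comparison itself is not part
of the TC API):

.. code-block:: python

    # sanity check: compare the TC output with a plain PyTorch batched matmul
    ref = torch.matmul(
        I0.permute(0, 3, 4, 1, 2),   # (N, H, W, C1, C2)
        I1.permute(0, 3, 4, 1, 2),   # (N, H, W, C2, C3)
    ).permute(0, 3, 4, 1, 2)         # (N, C1, C3, H, W)
    print((out - ref).abs().max())   # expect a value near 0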
Step 4: Autotune and get a better-performing kernel
---------------------------------------------------

So, it was very quick and easy to define the TensorDot operation with TC and get it running.
But how about a better-performing kernel?

TC provides a genetic-algorithm-based autotuner to tune kernel performance. Let's
autotune the kernel to get a better-performing one. We will also cache the best
kernel options found by setting :code:`cache={filepath}` so that we can reuse these
options later.

.. code-block:: python

    # autotune the kernel
    best_options = tensordot.autotune(I0, I1, cache="tensordot_32_512_8_2_28.tc")
    # run the kernel with the autotuned options
    out = tensordot(I0, I1, options=best_options)

You can control the amount of autotuning by changing the autotuner parameters. See
:ref:`autotune_parameters` for how to change the settings.

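One plausible way to apply such settings is to pass them as keyword arguments to
:code:`autotune`; treat the snippet below as a sketch, and see :ref:`autotune_parameters`
for the authoritative parameter names and mechanism:

.. code-block:: python

    # assumed usage: unpack autotuner settings as keyword arguments
    # (parameter names taken from the settings discussed below)
    settings = {"generations": 25, "pop_size": 100, "number_elites": 10}
    best_options = tensordot.autotune(I0, I1,
                                      cache="tensordot_32_512_8_2_28.tc",
                                      **settings)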
For the settings ``settings={"generations": 25, "pop_size": 100, "number_elites": 10}``, we
get decent kernel performance, as shown in the screenshot below:

.. figure:: ../_static/img/autotuning-py.jpg
    :alt: python-autotuning-tensordot
    :align: center

Early stopping
^^^^^^^^^^^^^^

If the kernel performance is good enough while the autotuning is still running, you
can stop the autotuning by pressing :code:`Ctrl+C`; the autotuning cache will be saved
before the autotuner exits.
