Using TC to get fast CUDA code for Tensor Contraction
======================================================

In this tutorial, we will see how to start from an arbitrary math operation,
express it in the TC language, and easily get fast CUDA code for it. We will also
see how to tune that CUDA code for better performance. All of this takes
only 3-4 lines of code. Let's get started.

For this tutorial, you will need to install the Tensor Comprehensions binary. You can
get binary builds of Tensor Comprehensions with ``conda install -y -c pytorch -c prigoyal tensor_comprehensions``.
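
Once installed, a quick sanity check (a minimal sketch; TC generates CUDA
kernels, so a working CUDA setup is assumed):

.. code-block:: python

    import torch
    import tensor_comprehensions as tc

    # TC compiles kernels for the GPU, so CUDA must be available
    assert torch.cuda.is_available()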

About the TensorDot operation
-----------------------------

First, we need to find an operation that we want to generate fast CUDA code for. Many
operations like convolution and pooling are standard and already have fast CUDA
implementations available, so instead we will pick a newer, less common operation.
How do we find a new operation?

**Sources**: Maybe you have a research paper idea such as KRU, or there is a
numpy operation that interests you and is needed in a Machine Learning model.
Going by the Numpy docs on linear algebra, tensordot seems like an interesting operation:
`TensorDot <https://docs.scipy.org/doc/numpy/reference/generated/numpy.tensordot.html#numpy.tensordot>`_.

**The TensorDot operation**

Assume that we have two tensors, one with dimensions :code:`(N, C1, C2, H, W)` and one
with dimensions :code:`(N, C2, C3, H, W)`, and we want to do a gemm-type computation on
the :code:`C` dimensions to get an output of shape :code:`(N, C1, C3, H, W)`. That is, for each
:code:`(N, H, W)` combination, we want to do the reduction :code:`(C1, C2) * (C2, C3) = (C1, C3)`.
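
To make the semantics concrete, here is a minimal PyTorch reference for this
reduction (an illustrative sketch using :code:`torch.einsum`, not part of the
original tutorial):

.. code-block:: python

    import torch

    # O(n, c1, c3, h, w) = sum over c2 of I0(n, c1, c2, h, w) * I1(n, c2, c3, h, w)
    def tensordot_ref(I0, I1):
        return torch.einsum("nabhw,nbchw->nachw", I0, I1)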

So this operation can be represented as :code:`N x H x W` independent gemms, and one
could try to write a batched gemm kernel for it. But does that guarantee good
performance? What if the tensor sizes are
:code:`N=32, C1=512, C2=8, C3=2, H=28, W=28`, i.e. the value of :code:`C1` is
much larger than :code:`C2` and :code:`C3`?
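
The batched-gemm view looks like this in PyTorch (a sketch for illustration,
assuming the tensors and sizes defined in Step 3 below; the permute/reshape
copies alone hint at why this layout can be inefficient):

.. code-block:: python

    # flatten (N, H, W) into one batch dimension of N*H*W small gemms
    A = I0.permute(0, 3, 4, 1, 2).reshape(-1, C1, C2)  # (N*H*W, C1, C2)
    B = I1.permute(0, 3, 4, 1, 2).reshape(-1, C2, C3)  # (N*H*W, C2, C3)
    O = torch.bmm(A, B).reshape(N, H, W, C1, C3).permute(0, 3, 4, 1, 2)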

Let's see how we can get the CUDA kernel for such an operation and then tune the kernel.

Step 1: Write TC for TensorDot Operation
----------------------------------------

The first step is to express the TensorDot operation in the TC language. For more information on how to do
so, you can refer to our `Documentation <https://facebookresearch.github.io/TensorComprehensions/index.html>`_
and also find various TC examples `here <https://facebookresearch.github.io/TensorComprehensions/framework/pytorch_integration/layers_database.html>`_.

.. code-block:: python

    # import both tensor_comprehensions and torch
    import tensor_comprehensions as tc
    import torch
    # define the operation in the TC language; c2 appears only on the
    # right-hand side, so it is the reduction dimension, and "+=!" means
    # the accumulator O is zero-initialized before the reduction
    lang = """
    def tensordot(float(N, C1, C2, H, W) I0, float(N, C2, C3, H, W) I1) -> (O) {
        O(n, c1, c3, h, w) +=! I0(n, c1, c2, h, w) * I1(n, c2, c3, h, w)
    }
    """

Step 2: Register operation with TC
----------------------------------

Now, we will take the TC string and register it with the TC backend by calling :code:`tc.define`.

.. code-block:: python

    # register the lang with the TC backend
    tensordot = tc.define(lang, name="tensordot")

.. note::

   The :code:`name` argument must match the name of the def in :code:`lang`.
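
For instance, a single :code:`lang` string may hold several defs, with
:code:`name` selecting which one to build (a hypothetical sketch; the def
bodies here are illustrative only):

.. code-block:: python

    multi_lang = """
    def relu(float(N) A) -> (B) { B(n) = fmax(A(n), 0) }
    def scale(float(N) A) -> (B) { B(n) = 2.0 * A(n) }
    """
    relu_op = tc.define(multi_lang, name="relu")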

Step 3: Create input tensors and run TC
---------------------------------------

Now that the TC is registered, we will create the input tensors and run it.

.. code-block:: python

    # create input CUDA tensors
    N, C1, C2, C3, H, W = 32, 512, 8, 2, 28, 28
    I0, I1 = torch.randn(N, C1, C2, H, W).cuda(), torch.randn(N, C2, C3, H, W).cuda()
    # choose the options preset that most resembles the operation and run
    out = tensordot(I0, I1, options=tc.Options("conv"))
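
As a quick sanity check, the result can be compared against the einsum
reference from earlier (an illustrative check, not part of the original
tutorial):

.. code-block:: python

    ref = torch.einsum("nabhw,nbchw->nachw", I0, I1)
    # should be tiny, up to float32 accumulation-order differences
    print((out - ref).abs().max())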

.. note::

   The :code:`options` can be obtained by autotuning the kernel with the
   autotuner (next step), or you can choose one of the provided defaults. We
   strongly recommend running the autotuner rather than relying on default
   options for better performance. See :ref:`must_pass_options` for more
   information about options.

Step 4: Autotune and get better performing kernel
-------------------------------------------------

So, it was very quick and easy to define the TensorDot operation with TC and get it running.

But how about a better performing kernel?

TC provides a genetic-algorithm based autotuner to tune kernel performance. Let's
autotune the kernel to get a better performing one. We will also cache the best
kernel options found by setting :code:`cache={filepath}` so that we can reuse these
options later.

.. code-block:: python

    # autotune the kernel
    best_options = tensordot.autotune(I0, I1, cache="tensordot_32_512_8_2_28.tc")
    # run the kernel with the autotuned options
    out = tensordot(I0, I1, options=best_options)

You can control the amount of autotuning by changing the autotuner parameters. See
:ref:`autotune_parameters` for how to change the settings.

For the settings ``settings={"generations": 25, "pop_size": 100, "number_elites": 10}``, we
get decent kernel performance, as shown in the screenshot below:

.. figure:: ../_static/img/autotuning-py.jpg
   :alt: python-autotuning-tensordot
   :align: center
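
If you want to set these tuner parameters programmatically, one plausible way
(a hedged sketch; it assumes :code:`autotune` forwards keyword arguments to the
tuner, so check :ref:`autotune_parameters` for the exact interface) is:

.. code-block:: python

    settings = {"generations": 25, "pop_size": 100, "number_elites": 10}
    best_options = tensordot.autotune(
        I0, I1, cache="tensordot_32_512_8_2_28.tc", **settings
    )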

Early stopping
^^^^^^^^^^^^^^

If the kernel performance is already good enough while the autotuning is still
running, you can press :code:`Ctrl+C`: the autotuning cache will be saved and
then the autotuning will stop.
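
Later, the cached options can be reused instead of tuning from scratch (a
hypothetical sketch; see the TC docs for the exact cache-loading API, as here we
simply point :code:`autotune` at the existing cache file, which also seeds any
further tuning):

.. code-block:: python

    # in a new session: re-register the TC, then reuse the saved cache
    tensordot = tc.define(lang, name="tensordot")
    best_options = tensordot.autotune(I0, I1, cache="tensordot_32_512_8_2_28.tc")
    out = tensordot(I0, I1, options=best_options)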