---
title: 'PyTorch TensorIterator Internals'
published: April 13, 2020
author: sameer-deshmukh
description: 'This post is a deep dive into how TensorIterator works, and is an essential part of learning to contribute to the PyTorch codebase since iterations over tensors in the C++ codebase are extremely commonplace. This post is aimed at someone who wants to contribute to PyTorch'
category: [Machine Learning]
featuredImage:
  src: /posts/pytorch-tensoriterator-internals/feature.png
  alt: 'Code snippet on creating a TensorIterator using a default constructor.'
hero:
  imageSrc: /posts/pytorch-tensoriterator-internals/blog_hero_var1.svg
  imageAlt: 'An illustration of a dark brown hand holding up a microphone, with some graphical elements highlighting the top of the microphone.'
---

> The history section of this post is still relevant, but `TensorIterator`'s
> interface has changed significantly. For an update on the new API, please check
> out [this new blog
> post](https://labs.quansight.org/blog/2021/04/pytorch-tensoriterator-internals-update).

PyTorch is one of the leading frameworks for deep learning. Its core data
structure is `Tensor`, a multi-dimensional array implementation with many
advanced features like auto-differentiation. PyTorch is a massive
codebase (approx. [a million lines](https://www.openhub.net/p/pytorch) of
C++, Python and CUDA code), and having a method for iterating over tensors in a
very efficient manner that is independent of data type, dimension, striding and
hardware is a critical feature that leads to a massive simplification
of the codebase and makes distributed development much faster and smoother. The
[`TensorIterator`](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/TensorIterator.cpp)
C++ class within PyTorch is complex yet useful: it is used for
iterating over the elements of a tensor over any dimension and implicitly
parallelizing various operations in a device independent manner.

It does this through a C++ API that is independent of the type and device of the
tensor, freeing the programmer from having to worry about the data type or device
when writing iteration logic for PyTorch tensors. For those coming from the
NumPy universe, `NpyIter` is a close cousin of `TensorIterator`.

This post is a deep dive into how `TensorIterator` works, and is an essential
part of learning to contribute to the PyTorch codebase since iterations over
tensors in the C++ codebase are extremely commonplace. This post is aimed at
someone who wants to contribute to PyTorch, and you should at least be familiar
with some of the basic terminology of the PyTorch codebase, which can be found
in Edward Yang's excellent [blog post](http://blog.ezyang.com/2019/05/pytorch-internals)
on PyTorch internals. Although `TensorIterator` can be used for both CPUs and
accelerators, this post has been written with CPU usage in mind. While
there are some dissimilarities between the two, the overall
concepts are the same.

# History of TensorIterator

## TH iterators

TensorIterator was devised to simplify the implementation of PyTorch's tensor
operations relative to the older `TH` implementation. `TH` uses preprocessor macros,
rather than C++ templates, to write type-independent loops over tensors. For example,
consider this simple `TH` loop for computing the product of all the numbers in
a particular dimension (find the code
[here](https://github.com/pytorch/pytorch/blob/master/aten/src/TH/generic/THTensorMoreMath.cpp#L350)):

``` C
TH_TENSOR_DIM_APPLY2(scalar_t, t, scalar_t, r_, dimension,
                     accreal prod = 1;
                     int64_t i;
                     for(i = 0; i < t_size; i++)
                       prod *= t_data[i*t_stride];
                     *r__data = (scalar_t)prod;
);
```

The above loop works by following a particular convention for the naming of the
types and variables. You specify the input type and output type of your tensors in the first
and third arguments. `scalar_t` is a type that can generically be used for denoting a PyTorch
scalar type such as `float`, `double`, `long`, etc. Internally, PyTorch compiles the file
multiple times with different definitions of `scalar_t` (i.e. once for each
data type like `float`, `int`, etc.). The input and output tensors are
specified in the second and fourth arguments (in this case `t` and `r_`), and the dimension that
we want to iterate over is specified as the fifth argument (`dimension`).

We then follow these arguments with the main body of the iterator (which is accepted as the sixth
argument into the macro), and denote the data, stride and size of the particular tensor dimension
using variables that are formed by appending `_data`, `_stride` and `_size` respectively to the
variable name that represents the tensor inside the iterator body. For example, the size of the
input tensor is denoted as `t_size` in the above example and the pointer to the data of the output
tensor is denoted as `r__data`. The `accreal` in the second line is a custom type that specifies
a real number used as an accumulator (in this case for accumulating the product).

Internally, the `TH_TENSOR_DIM_APPLY2` macro is expanded to generate various dispatch calls
depending on the type of the tensor that needs to be iterated over. The implementation of
`TH_TENSOR_DIM_APPLY2` can be found [here](https://github.com/pytorch/pytorch/blob/master/aten/src/TH/THTensorDimApply.h#L138).

## Limitations of TH iterators

Apart from the obvious complication that arises from maintaining a codebase that is so dependent
on such insanely complex macro expansions, TH iterators have some fundamental shortcomings. For
one thing, they cannot be used for writing iterators in a device independent manner - you
need separate iterators for CPU and CUDA. Also, parallelization does not happen implicitly
inside the iterator; you need to write the parallel looping logic yourself. Moreover, at a deeper
level, `TH` iterators do not collapse the dimensions of the tensor (as we'll see later in this
post), leading to loops that might not be as cache-optimized as possible.

These limitations led to the creation of `TensorIterator`, which is used by the
`ATen` tensor implementation for overcoming some of the shortcomings of the previous `TH`
iterators.

# Basics of TensorIterator

A `TensorIterator` can be created using the default constructor. You must then add the tensors
that you want as inputs or outputs. A good example is the `TensorIterator::binary_op()`
[method](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/TensorIterator.cpp#L652), which
allows you to create `TensorIterator` objects for performing point-wise binary operations
between two tensors. The important parts look like so:

``` cpp
auto iter = TensorIterator();

iter.add_output(out);
iter.add_input(a);
iter.add_input(b);

iter.build();
```
As you can see, you add a tensor called `out` as the output tensor and `a` and `b` as the
input tensors. Calling `build()` is then mandatory for creating the object and letting
the class perform other optimizations like collapsing dimensions.

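In practice you rarely need to spell these calls out yourself for common cases. As a
small sketch (assuming the signature of the static helper from around that time), the
`binary_op()` method bundles exactly the sequence above:

``` cpp
// Minimal sketch: TensorIterator::binary_op wraps the add_output/add_input/build
// sequence shown above into a single call.
auto iter = TensorIterator::binary_op(out, a, b);
```
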
# Performing iterations

Broadly, iterations using `TensorIterator` can be classified as point-wise iterations
or reduction iterations. This plays a fundamental role in how iterations using `TensorIterator`
are parallelized - point-wise iterations can be freely parallelized along any dimension
and grain size, while reduction operations have to be parallelized either along dimensions
that you're not iterating over or by performing bisect and reduce operations along the
dimension being iterated. Parallelization can also happen using vectorized operations.

## Iteration details

The simplest iteration operation can be performed using the
[`for_each`](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/TensorIterator.cpp#L525)
function. This function has two overloads: one takes a function object which iterates over a
single dimension (`loop_t`), and the other takes a function object which iterates over two
dimensions simultaneously (`loop2d_t`). Find their definitions [here](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/TensorIterator.h#L166). The simplest
way of using `for_each` is to pass it a lambda of type `loop_t` (or `loop2d_t`).
A code snippet using it this way would look like so:

``` cpp
auto iter = TensorIterator();
iter.add_output(out);
iter.add_input(a);
iter.dont_resize_outputs(); // call if out is allocated.
iter.dont_compute_common_dtype(); // call if inputs/outputs have different dtypes.
iter.build();

auto loop = [&](char **data, const int64_t* strides, int64_t n) {
  auto * out_data_bytes = data[0];
  auto * in_data_bytes = data[1];

  // assume float data type for this example.
  for (int64_t i = 0; i < n; i++) {
    *reinterpret_cast<float*>(out_data_bytes) +=
      *reinterpret_cast<float*>(in_data_bytes);

    out_data_bytes += strides[0];
    in_data_bytes += strides[1];
  }
};

iter.for_each(loop);
```
In the above example, the `char **data` argument gives you a pointer to the data of each
tensor, in the same order in which you added them when building the iterator. Note
that in order to make the implementation agnostic of any particular data type, you
will always receive the pointers typecast to `char` (think of them as a bunch of bytes).

The second argument is `int64_t* strides`, which is an array containing the stride of
each tensor in the dimension that you're iterating over. We can add this stride to the
pointer received in order to reach the next element in the tensor. The last argument is
`int64_t n`, which is the size of the dimension being iterated over.

`for_each` implicitly parallelizes the operation by executing `loop` in parallel
if the number of iterations is more than the value of `internal::GRAIN_SIZE`, which is a value
that represents the 'right amount' of data to iterate over in order to gain a significant
speedup using multi-threaded execution. If you want to explicitly specify that your
operation _must_ run in serial, use the `serial_for_each` loop instead.

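As a minimal sketch (assuming the `serial_for_each` overload that takes an `at::Range`,
and reusing `iter` and `loop` from the example above), the serial path would look
something like this:

``` cpp
// Minimal sketch: run the same loop body serially over every element,
// regardless of GRAIN_SIZE. `iter` and `loop` are from the previous example.
iter.serial_for_each(loop, at::Range(0, iter.numel()));
```
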
### Using kernels for iterations

Frequently we want to create a kernel that applies a simple point-wise function to entire
tensors. `TensorIterator` provides various such generic kernels that can be used for
iterating over the elements of a tensor without having to worry about the stride, the
data type of the operands, or the details of the parallelism.

For example, say we want to build a function that performs the point-wise addition
of two tensors and stores the result in a third tensor. We can use the `cpu_kernel`
function for this. Note that in this example we assume tensors of `float`, but you can
use the `AT_DISPATCH_ALL_TYPES_AND2` macro to generalize over data types (a sketch of
this follows below).
``` cpp
TensorIterator iter;
iter.add_output(c_tensor); // outputs are added before inputs.
iter.add_input(a_tensor);
iter.add_input(b_tensor);
iter.build();
cpu_kernel(iter, [] (float a, float b) -> float {
  return a + b;
});
```
Writing the kernel in this way ensures that the value returned by the lambda passed to
`cpu_kernel` will populate the corresponding place in the target output tensor.

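To avoid hard-coding `float`, the kernel call is typically wrapped in a dtype dispatch. As a
rough sketch (the specific scalar types and the `"add_cpu"` label here are assumptions for
illustration, not code from this post), this might look like:

``` cpp
// Sketch: compile the point-wise lambda once per dtype.
// AT_DISPATCH_ALL_TYPES_AND2 binds scalar_t to the concrete C++ type
// for each instantiation.
AT_DISPATCH_ALL_TYPES_AND2(at::ScalarType::Half, at::ScalarType::BFloat16,
                           iter.dtype(), "add_cpu", [&]() {
  cpu_kernel(iter, [] (scalar_t a, scalar_t b) -> scalar_t {
    return a + b;
  });
});
```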
|
### Setting tensor iteration dimensions

The values of the sizes and strides determine which dimension of the tensor you will iterate over.
`TensorIterator` performs optimizations to make sure that at least
most of the iterations happen over contiguous data to take advantage of hierarchical cache-based
memory architectures (think dimension coalescing and reordering for maximum data locality).

Now, a multi-dimensional tensor will have multiple stride values depending on the dimension
you want to iterate over, so `TensorIterator` computes the strides that
get passed into the loop by itself within the `build()` function. How exactly it computes
the dimension to iterate over is something that should be properly understood in order to
use `TensorIterator` effectively.

If you're performing a reduction operation (see the sum code in [ReduceOps.cpp](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/ReduceOps.cpp#L384)),
`TensorIterator` will figure out the dimensions that will be reduced depending
on the shape of the input and output tensors, which determines how the input will be broadcast
over the output. If you're
performing a simple pointwise operation between two tensors (like `addcmul` from
[PointwiseOps.cpp](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/PointwiseOps.cpp#L31)),
the iteration will happen over the entire tensor, without providing a choice of the dimension.
This allows `TensorIterator` to freely parallelize the computation, without guarantees on
the order of execution (since it does not matter anyway).

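For a concrete picture of how the shapes alone drive a reduction, here is a minimal sketch
(the shapes, the pre-allocated output, and the use of the `TensorIterator::reduce_op` helper
are my assumptions for illustration, not code from this post):

``` cpp
// Sketch: the relative shapes of input and output decide what gets reduced.
// Reducing a {4, 3} input into a {4, 1} output means dimension 1 is the
// reduced dimension; dimension 0 can still be parallelized over.
auto self   = at::randn({4, 3});
auto result = at::empty({4, 1});
auto iter   = TensorIterator::reduce_op(result, self);
```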
|
For something like a cumulative sum operation, where you want to be able to choose the dimension
to reduce but iterate over multiple non-reduced dimensions (possibly in parallel), you
must first re-stride the tensors, and then use these tensors
for creating a `TensorIterator`. In order to understand how this bit works, let's go over
the code for the [kernel](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/ReduceOpsKernel.cpp#L21) that executes the [cumsum](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/ReduceOpsKernel.cpp#L71) function.

The important bits of this function are like so:

``` cpp
auto self_sizes = ensure_nonempty_vec(self.sizes().vec());
self_sizes[dim] = 1;

auto result_restrided = restride_dim(result, dim, self_sizes);
auto self_restrided = restride_dim(self, dim, self_sizes);

auto iter = TensorIterator();
iter.dont_compute_common_dtype();
iter.dont_resize_outputs();
iter.add_output(result_restrided);
iter.add_input(self_restrided);
iter.build();
```
You can see that we first change the size of the tensors to `1` on the
reduction dimension so that the dimension collapsing logic inside
`TensorIterator::build()` knows which dimension to skip.
We then restride the tensors using `restride_dim` and
use the restrided tensors for building the `TensorIterator`. You can
set any size for the inputs/outputs, and `TensorIterator` will check whether it
can come up with a common broadcasted size.

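To make the restriding step less magical, here is a rough sketch of what `restride_dim`
amounts to for the tensors above (this is my reading of the helper, stated as an assumption
rather than code quoted from PyTorch): the reduced dimension ends up with size 1 and stride 0,
so every step along it lands on the same element while the other dimensions keep their strides.

``` cpp
// Sketch: the effect of restride_dim on the reduced dimension.
// `self` and `dim` are assumed to be defined as in the snippet above.
auto sizes   = self.sizes().vec();
auto strides = self.strides().vec();
sizes[dim]   = 1;  // collapse the reduced dimension...
strides[dim] = 0;  // ...so that stepping along it revisits the same memory.
auto self_restrided = self.as_strided(sizes, strides);
```
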
# Conclusion

This post was a very short introduction to what `TensorIterator` is actually
capable of. If you want to learn more about how it works and what goes into
things like collapsing the tensor size for optimizing memory access, a good
place to start would be the `build()` function in
[TensorIterator.cpp](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/TensorIterator.cpp#L1030).
Also have a look at [this wiki page](https://github.com/pytorch/pytorch/wiki/How-to-use-TensorIterator)
from the PyTorch team on using `TensorIterator`.