---
title: 'PyTorch TensorIterator Internals'
published: April 13, 2020
author: sameer-deshmukh
description: 'This post is a deep dive into how TensorIterator works, and is an essential part of learning to contribute to the PyTorch codebase since iterations over tensors in the C++ codebase are extremely commonplace. This post is aimed at someone who wants to contribute to PyTorch'
category: [Machine Learning]
featuredImage:
  src: /posts/pytorch-tensoriterator-internals/feature.png
  alt: 'Code snippet on creating a TensorIterator using a default constructor.'
hero:
  imageSrc: /posts/pytorch-tensoriterator-internals/blog_hero_var1.svg
  imageAlt: 'An illustration of a dark brown hand holding up a microphone, with some graphical elements highlighting the top of the microphone.'
---

> The history section of this post is still relevant, but `TensorIterator`'s
> interface has changed significantly. For an update on the new API, please check
> out [this new blog post](https://labs.quansight.org/blog/2021/04/pytorch-tensoriterator-internals-update).

PyTorch is one of the leading frameworks for deep learning. Its core data
structure is `Tensor`, a multi-dimensional array implementation with many
advanced features like auto-differentiation. PyTorch is a massive
codebase (approx. [a million lines](https://www.openhub.net/p/pytorch) of
C++, Python and CUDA code), so having a way of iterating over tensors that is
efficient and independent of data type, dimension, striding and hardware is a
critical feature: it greatly simplifies the codebase and makes distributed
development faster and smoother. The
[`TensorIterator`](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/TensorIterator.cpp)
C++ class within PyTorch is a complex yet useful class that is used for
iterating over the elements of a tensor over any dimension and implicitly
parallelizing various operations in a device-independent manner.

It does this through a C++ API that is independent of the type and device of the
tensor, freeing the programmer from having to worry about the datatype or device
when writing iteration logic for PyTorch tensors. For those coming from the
NumPy universe, `NpyIter` is a close cousin of `TensorIterator`.

This post is a deep dive into how `TensorIterator` works, and is an essential
part of learning to contribute to the PyTorch codebase, since iterations over
tensors in the C++ codebase are extremely commonplace. This post is aimed at
someone who wants to contribute to PyTorch, so you should at least be familiar
with some of the basic terminology of the PyTorch codebase, which can be found
in Edward Yang's excellent [blog post](http://blog.ezyang.com/2019/05/pytorch-internals)
on PyTorch internals. Although `TensorIterator` can be used for both CPUs and
accelerators, this post has been written with CPU usage in mind. Although there
can be some dissimilarities between the two, the overall concepts are the same.

# History of TensorIterator

## TH iterators

`TensorIterator` was devised to simplify how PyTorch's tensor operations are
implemented, relative to the older `TH` implementation. `TH` uses preprocessor
macros to write type-independent loops over tensors, instead of C++ templates.
For example, consider this simple `TH` loop for computing the product of all
the numbers in a particular dimension (find the code
[here](https://github.com/pytorch/pytorch/blob/master/aten/src/TH/generic/THTensorMoreMath.cpp#L350)):

``` C
TH_TENSOR_DIM_APPLY2(scalar_t, t, scalar_t, r_, dimension,
                     accreal prod = 1;
                     int64_t i;
                     for(i = 0; i < t_size; i++)
                       prod *= t_data[i*t_stride];
                     *r__data = (scalar_t)prod;
);
```

The above loop works by following a particular convention for the naming of the
types and variables. You specify the input and output types of your tensors in the first
and third arguments. `scalar_t` is a type that can generically be used for denoting a PyTorch
scalar type such as `float`, `double`, `long`, etc. Internally, PyTorch uses `scalar_t`
to compile the file multiple times for different definitions of `scalar_t` (that is, for different
data types like `float`, `int`, etc.). The input and output tensors are
specified in the second and fourth arguments (in this case `t` and `r_`), and the dimension that
we want to iterate over is specified as the fifth argument (`dimension`).

We then follow these arguments with the main body of the iterator (which is accepted as the sixth
argument into the macro), and denote the data, stride and size of the particular tensor dimension
by using variables that are suffixed by `_data`, `_stride` and `_size` respectively after the
variable name that represents the tensor inside the iterator body. For example, the size of the
input tensor is denoted as `t_size` in the above example, and the pointer to the data of the output
tensor is denoted as `r__data`. The `accreal` in the second line is a custom type that denotes
a real number used as an accumulator (in this case for accumulating the product).

Internally, the `TH_TENSOR_DIM_APPLY2` macro is expanded to generate various dispatch calls
depending on the type of the tensor that needs to be iterated over. The implementation of
`TH_TENSOR_DIM_APPLY2` can be found [here](https://github.com/pytorch/pytorch/blob/master/aten/src/TH/THTensorDimApply.h#L138).

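To make this naming convention concrete, here is a small, self-contained toy analogue
(assumed code, not taken from PyTorch): the macro name `TOY_DIM_APPLY2` and the flat
arrays standing in for tensor slices are made up purely for illustration, and the real
`TH_TENSOR_DIM_APPLY2` additionally walks every position of the dimensions that are not
being iterated over.

``` cpp
// A self-contained toy analogue (NOT the real TH macro), showing only the
// naming convention: the body refers to <name>_data, <name>_stride and
// <name>_size, which the macro pastes together from the tensor arguments.
#include <cstdint>
#include <iostream>

using scalar_t = float;
using accreal = double;

#define TOY_DIM_APPLY2(TYPE1, T1, TYPE2, T2, SIZE, STRIDE1, STRIDE2, CODE) \
  do {                                                                     \
    TYPE1* T1##_data = T1;                                                 \
    TYPE2* T2##_data = T2;                                                 \
    int64_t T1##_size = (SIZE);                                            \
    int64_t T1##_stride = (STRIDE1);                                       \
    int64_t T2##_stride = (STRIDE2);                                       \
    (void)T2##_stride;                                                     \
    CODE                                                                   \
  } while (0)

int main() {
  scalar_t t[4] = {1.f, 2.f, 3.f, 4.f};  // stand-in for one slice of the input tensor
  scalar_t r_[1] = {0.f};                // stand-in for one slice of the output tensor

  // Same body as the TH example above: product along one dimension.
  TOY_DIM_APPLY2(scalar_t, t, scalar_t, r_, 4, 1, 1,
                 accreal prod = 1;
                 int64_t i;
                 for (i = 0; i < t_size; i++)
                   prod *= t_data[i * t_stride];
                 *r__data = (scalar_t)prod;);

  std::cout << r_[0] << std::endl;  // prints 24
  return 0;
}
```

The only point of this sketch is that the body can freely refer to `t_data`, `t_stride`,
`t_size` and `r__data` because the macro pastes those identifiers together from its
tensor arguments.
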
## Limitations of TH iterators

Apart from the obvious complication of maintaining a codebase that is so dependent
on such insanely complex macro expansions, `TH` iterators have some fundamental shortcomings. For
one thing, they cannot be used for writing iterators in a device-independent manner: you
need separate iterators for CPU and CUDA. Also, parallelization does not happen implicitly
inside the iterator; you need to write the parallel looping logic yourself. Moreover, at a deeper
level, `TH` iterators do not collapse the dimensions of the tensor (as we'll see later in this
post), leading to looping that might not be as cache-optimized as possible.

These limitations led to the creation of `TensorIterator`, which is used by the
`ATen` tensor implementation to overcome some of the shortcomings of the previous `TH`
iterators.

# Basics of TensorIterator

A `TensorIterator` can be created using the default constructor. You must then add the tensors
that you want as inputs or outputs. A good example can be found in the `TensorIterator::binary_op()`
[method](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/TensorIterator.cpp#L652), which
allows you to create `TensorIterator` objects for performing point-wise binary operations
between two tensors. The important parts look like so:

``` cpp
auto iter = TensorIterator();

iter.add_output(out);
iter.add_input(a);
iter.add_input(b);

iter.build();
```
As you can see, you add a tensor called `out` as the output tensor and `a` and `b` as the
input tensors. Calling `build` is then mandatory for creating the object and letting
the class perform other optimizations like collapsing dimensions.

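To give a feel for how such an iterator is used once it has been built, here is a hedged
usage sketch (the tensor names and shapes are made up, and `cpu_kernel` is a helper that
is described later in this post). It assumes the code lives in an ATen CPU kernel file
where `TensorIterator` and `cpu_kernel` are already available:

``` cpp
// Build the iterator through the binary_op() convenience method and hand it
// to a kernel. The shapes and names here are illustrative only.
at::Tensor a = at::randn({3, 4});
at::Tensor b = at::randn({3, 4});
at::Tensor out = at::empty({3, 4});

auto iter = at::TensorIterator::binary_op(out, a, b);
cpu_kernel(iter, [](float x, float y) -> float {
  return x + y;  // point-wise addition of a and b into out
});
```
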
# Performing iterations

Broadly, iterations using `TensorIterator` can be classified as point-wise iterations
or reduction iterations. This distinction plays a fundamental role in how iterations using `TensorIterator`
are parallelized: point-wise iterations can be freely parallelized along any dimension
and grain size, while reduction operations have to be parallelized either along dimensions
that you're not reducing over or by performing bisect-and-reduce operations along the
dimension being reduced. Parallelization can also happen using vectorized operations.

## Iteration details

The simplest iteration operation can be performed using the
[`for_each`](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/TensorIterator.cpp#L525)
function. This function has two overloads: one takes a function object which iterates over a
single dimension (`loop_t`), and the other takes a function object which iterates over two
dimensions simultaneously (`loop2d_t`). Find their definitions [here](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/TensorIterator.h#L166). The simplest
way of using `for_each` is to pass it a lambda of type `loop_t` (or `loop2d_t`).
A code snippet using it this way would look like so:

``` cpp
auto iter = TensorIterator();
iter.add_output(out);
iter.add_input(a);
iter.dont_resize_outputs(); // call if out is already allocated.
iter.dont_compute_common_dtype(); // call if inputs/outputs are of different types.
iter.build();

auto loop = [&](char** data, const int64_t* strides, int64_t n) {
  auto* out_data_bytes = data[0];
  auto* in_data_bytes = data[1];

  // assume float data type for this example.
  for (int64_t i = 0; i < n; i++) {
    *reinterpret_cast<float*>(out_data_bytes) +=
        *reinterpret_cast<float*>(in_data_bytes);

    out_data_bytes += strides[0];
    in_data_bytes += strides[1];
  }
};

iter.for_each(loop);
```
In the above example, the `char** data` argument holds a pointer to the data of each
tensor, in the same order that you added them when building the iterator. Note
that in order to make the implementation agnostic of any particular data type, you
will always receive the pointers typecast to `char` (think of it as a bunch of bytes).

The second argument, `int64_t* strides`, is an array containing the stride of
each tensor in the dimension that you're iterating over, expressed in bytes (for
a contiguous `float` tensor this would be `4`). We can add this stride to the
pointer received in order to reach the next element in the tensor. The last argument,
`int64_t n`, is the size of the dimension being iterated over.

`for_each` implicitly parallelizes the operation by executing `loop` in parallel
if the number of iterations is more than the value of `internal::GRAIN_SIZE`, which is a value
that is determined as the 'right amount' of data to iterate over in order to gain a significant
speedup using multi-threaded execution. If you want to explicitly specify that your
operation _must_ run in serial, use the `serial_for_each` loop instead.

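As a rough illustration of what that implicit parallelization amounts to, here is a hedged
sketch (not the actual implementation of `for_each`, and `run_in_parallel` is a made-up
helper name): it splits the iterator's flat element range into chunks and runs the same
`loop_t` callback on each chunk through `serial_for_each`.

``` cpp
// Hedged sketch: chunk the flat element range and run each chunk serially
// on a worker thread via at::parallel_for.
void run_in_parallel(at::TensorIterator& iter, at::TensorIterator::loop_t loop) {
  at::parallel_for(0, iter.numel(), at::internal::GRAIN_SIZE,
                   [&](int64_t begin, int64_t end) {
                     // Each worker handles the sub-range [begin, end).
                     iter.serial_for_each(loop, {begin, end});
                   });
}
```
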
### Using kernels for iterations

Frequently we want to create a kernel that applies a simple point-wise function to entire
tensors. `TensorIterator` provides various such generic kernels that can be used for
iterating over the elements of a tensor without having to worry about the strides, the
data types of the operands, or the details of the parallelism.

For example, if we want to build a function that performs the point-wise addition
of two tensors and stores the result in a third tensor, we can use the `cpu_kernel`
function. Note that in this example we assume tensors of `float`, but you can
use the `AT_DISPATCH_ALL_TYPES_AND2` macro to handle multiple data types (a sketch
of this follows the example below).
``` cpp
TensorIterator iter;
iter.add_input(a_tensor);
iter.add_input(b_tensor);
iter.add_output(c_tensor);
iter.build();
cpu_kernel(iter, [] (float a, float b) -> float {
  return a + b;
});
```
Writing the kernel in this way ensures that the value returned by the lambda passed to
`cpu_kernel` will populate the corresponding place in the target output tensor.

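For reference, here is a hedged sketch of what dispatching over data types with the
`AT_DISPATCH_ALL_TYPES_AND2` macro mentioned above might look like; the kernel name
string and the choice of extra types are arbitrary:

``` cpp
// The macro instantiates the lambda once per dtype and binds scalar_t to the
// concrete element type, so the kernel is no longer hard-coded to float.
AT_DISPATCH_ALL_TYPES_AND2(at::ScalarType::Half, at::ScalarType::BFloat16,
                           iter.dtype(), "my_add_cpu", [&]() {
  cpu_kernel(iter, [](scalar_t a, scalar_t b) -> scalar_t {
    return a + b;
  });
});
```
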
### Setting tensor iteration dimensions

The values of the sizes and strides determine which dimension of the tensor you will iterate over.
`TensorIterator` performs optimizations to make sure that at least
most of the iterations happen on contiguous data, to take advantage of hierarchical cache-based
memory architectures (think dimension coalescing and reordering for maximum data locality).

Now, a multi-dimensional tensor will have multiple stride values depending on the dimension
you want to iterate over, so `TensorIterator` computes the strides that
get passed into the loop by itself, within the `build()` function. How exactly it computes the dimension
to iterate over is something that should be properly understood in order to use `TensorIterator`
effectively.

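For intuition, here is a hedged sketch of what this coalescing can look like from the
outside (the shapes are made up, and `binary_op` is the convenience constructor shown
earlier):

``` cpp
// For two contiguous (3, 4) tensors, build() can coalesce both dimensions
// into a single flat dimension, so the loop sees one dimension of 12 elements.
at::Tensor a = at::randn({3, 4});
at::Tensor out = at::empty({3, 4});

auto iter = at::TensorIterator::binary_op(out, a, a);
int dims = iter.ndim();        // typically 1 here: the dims were coalesced
int64_t total = iter.numel();  // 12 elements in that single flat dimension
```
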
If you're performing a reduction operation (see the sum code in [ReduceOps.cpp](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/ReduceOps.cpp#L384)),
`TensorIterator` will figure out the dimensions that will be reduced depending
on the shape of the input and output tensors, which determines how the input will be broadcast
over the output. If you're
performing a simple pointwise operation between two tensors (like `addcmul` from
[PointwiseOps.cpp](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/PointwiseOps.cpp#L31)),
the iteration will happen over the entire tensor, without providing a choice of the dimension.
This allows `TensorIterator` to freely parallelize the computation, without guarantees of
the order of execution (since the order does not matter anyway).

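As a rough illustration of how the output shape drives this, here is a hedged sketch
(the shapes are made up, and `reduce_op` is a convenience constructor analogous to the
`binary_op` shown earlier): summing a `(4, 5)` tensor over dimension 1 can be set up
with an output of shape `(4, 1)`, and `TensorIterator` infers from the two shapes which
dimension is being reduced.

``` cpp
// Hedged sketch: the output keeps the reduced dimension with size 1, so
// TensorIterator can tell (by broadcasting the output against the input)
// that dimension 1 of `self` is the reduction dimension.
at::Tensor self = at::randn({4, 5});
at::Tensor result = at::zeros({4, 1});

auto iter = at::TensorIterator::reduce_op(result, self);
```
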
For something like a cumulative sum operation, where you want to be able to choose the dimension
to reduce over but iterate over multiple non-reduced dimensions (possibly in parallel), you
must first re-stride the tensors, and then use these re-strided tensors
for creating a `TensorIterator`. In order to understand how this works, let's go over
the code for the [kernel](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/ReduceOpsKernel.cpp#L21) that executes the [cumsum](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/ReduceOpsKernel.cpp#L71) function.

The important bits of this function look like so:

``` cpp
auto self_sizes = ensure_nonempty_vec(self.sizes().vec());
self_sizes[dim] = 1;

auto result_restrided = restride_dim(result, dim, self_sizes);
auto self_restrided = restride_dim(self, dim, self_sizes);

auto iter = TensorIterator();
iter.dont_compute_common_dtype();
iter.dont_resize_outputs();
iter.add_output(result_restrided);
iter.add_input(self_restrided);
iter.build();
```
You can see that we first change the size of the tensors to `1` on the
reduction dimension so that the dimension collapsing logic inside
`TensorIterator::build()` will know which dimension to skip.
Setting the size in this way is akin to telling `TensorIterator`
to skip that dimension. We then restride the tensors using `restride_dim` and
use the restrided tensors for building the `TensorIterator`. You can
set any size for the inputs/outputs; `TensorIterator` will then check whether it
can come up with a common broadcast size.

# Conclusion

This post was a very short introduction to what `TensorIterator` is actually
capable of. If you want to learn more about how it works and what goes into
things like collapsing the tensor size for optimizing memory access, a good
place to start would be the `build()` function in
[TensorIterator.cpp](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/TensorIterator.cpp#L1030).
Also have a look at [this wiki page](https://github.com/pytorch/pytorch/wiki/How-to-use-TensorIterator)
from the PyTorch team on using `TensorIterator`.
