 # -*- coding: utf-8 -*-
 """
-A Gentle Introduction to ``torch.autograd``
-===========================================
-
-``torch.autograd`` is PyTorch’s automatic differentiation engine that powers
-neural network training. In this section, you will get a conceptual
-understanding of how autograd helps a neural network train.
-
-Background
-~~~~~~~~~~
-Neural networks (NNs) are a collection of nested functions that are
-executed on some input data. These functions are defined by *parameters*
-(consisting of weights and biases), which in PyTorch are stored in
-tensors.
-
-Training a NN happens in two steps:
-
-**Forward Propagation**: In forward prop, the NN makes its best guess
-about the correct output. It runs the input data through each of its
-functions to make this guess.
-
-**Backward Propagation**: In backprop, the NN adjusts its parameters
-proportionate to the error in its guess. It does this by traversing
-backwards from the output, collecting the derivatives of the error with
-respect to the parameters of the functions (*gradients*), and optimizing
-the parameters using gradient descent. For a more detailed walkthrough
-of backprop, check out this `video from
-3Blue1Brown <https://www.youtube.com/watch?v=tIeHLnjs5U8>`__.
-
+:orphan:
 
+A Gentle Introduction to ``torch.autograd``
+==============================================
 
+This tutorial has been deprecated because there is an identical basics tutorial.
 
-Usage in PyTorch
-~~~~~~~~~~~~~~~~
-Let's take a look at a single training step.
-For this example, we load a pretrained resnet18 model from ``torchvision``.
-We create a random data tensor to represent a single image with 3 channels and a height & width of 64,
-and a corresponding ``label`` tensor initialized to random values; labels for this pretrained model have
-shape (1, 1000).
+Redirecting in 3 seconds...
 
-.. note::
-   This tutorial works only on the CPU and will not work on GPU devices (even if tensors are moved to CUDA).
+.. raw:: html
 
+   <meta http-equiv="Refresh" content="3; url='https://pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html'" />
 """
-import torch
-from torchvision.models import resnet18, ResNet18_Weights
-model = resnet18(weights=ResNet18_Weights.DEFAULT)
-data = torch.rand(1, 3, 64, 64)
-labels = torch.rand(1, 1000)
-
-############################################################
-# Next, we run the input data through each of the model's layers to make a prediction.
-# This is the **forward pass**.
-#
-
-prediction = model(data)  # forward pass
-
-############################################################
-# We use the model's prediction and the corresponding label to calculate the error (``loss``).
-# The next step is to backpropagate this error through the network.
-# Backward propagation is kicked off when we call ``.backward()`` on the error tensor.
-# Autograd then calculates and stores the gradients for each model parameter in the parameter's ``.grad`` attribute.
-#
-
-loss = (prediction - labels).sum()
-loss.backward()  # backward pass
-
-############################################################
-# Next, we load an optimizer, in this case SGD with a learning rate of 0.01 and `momentum <https://towardsdatascience.com/stochastic-gradient-descent-with-momentum-a84097641a5d>`__ of 0.9.
-# We register all the parameters of the model in the optimizer.
-#
-
-optim = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
-
-######################################################################
-# Finally, we call ``.step()`` to initiate gradient descent. The optimizer adjusts each parameter by its gradient stored in ``.grad``.
-#
-
-optim.step()  # gradient descent
-
-######################################################################
-# At this point, you have everything you need to train your neural network.
-# The sections below detail the workings of autograd - feel free to skip them.
-#
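
######################################################################
# As an illustrative aside, here is a minimal sketch of how the steps above
# combine into a training loop. It simply reuses the ``model``, ``data``,
# ``labels`` and ``optim`` objects defined earlier with the same toy loss;
# the iteration count is arbitrary.
#

for _ in range(3):
    optim.zero_grad()                    # clear gradients accumulated in .grad
    prediction = model(data)             # forward pass
    loss = (prediction - labels).sum()   # toy loss, as above
    loss.backward()                      # backward pass populates .grad
    optim.step()                         # gradient descent step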
-
-
-######################################################################
-# --------------
-#
-
-
-######################################################################
-# Differentiation in Autograd
-# ~~~~~~~~~~~~~~~~~~~~~~~~~~~
-# Let's take a look at how ``autograd`` collects gradients. We create two tensors ``a`` and ``b`` with
-# ``requires_grad=True``. This signals to ``autograd`` that every operation on them should be tracked.
-#
-
-import torch
-
-a = torch.tensor([2., 3.], requires_grad=True)
-b = torch.tensor([6., 4.], requires_grad=True)
-
-######################################################################
-# We create another tensor ``Q`` from ``a`` and ``b``.
-#
-# .. math::
-#    Q = 3a^3 - b^2
-
-Q = 3*a**3 - b**2
-
-
-######################################################################
-# Let's assume ``a`` and ``b`` to be parameters of an NN, and ``Q``
-# to be the error. In NN training, we want gradients of the error
-# w.r.t. parameters, i.e.
-#
-# .. math::
-#    \frac{\partial Q}{\partial a} = 9a^2
-#
-# .. math::
-#    \frac{\partial Q}{\partial b} = -2b
-#
-#
-# When we call ``.backward()`` on ``Q``, autograd calculates these gradients
-# and stores them in the respective tensors' ``.grad`` attribute.
-#
-# We need to explicitly pass a ``gradient`` argument in ``Q.backward()`` because it is a vector.
-# ``gradient`` is a tensor of the same shape as ``Q``, and it represents the
-# gradient of Q w.r.t. itself, i.e.
-#
-# .. math::
-#    \frac{dQ}{dQ} = 1
-#
-# Equivalently, we can also aggregate Q into a scalar and call backward implicitly, like ``Q.sum().backward()``.
-#
-external_grad = torch.tensor([1., 1.])
-Q.backward(gradient=external_grad)
-
-
-#######################################################################
-# Gradients are now deposited in ``a.grad`` and ``b.grad``
-
-# check if collected gradients are correct
-print(9*a**2 == a.grad)
-print(-2*b == b.grad)
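
######################################################################
# As a quick illustration of the equivalence mentioned above, aggregating the
# output into a scalar lets us call ``.backward()`` with no ``gradient``
# argument. This sketch uses fresh copies ``a2`` and ``b2`` (names chosen here
# for the example) so that gradients do not accumulate on ``a`` and ``b``.
#

a2 = torch.tensor([2., 3.], requires_grad=True)
b2 = torch.tensor([6., 4.], requires_grad=True)
Q2 = 3*a2**3 - b2**2

Q2.sum().backward()        # scalar output, so the implicit gradient is 1
print(9*a2**2 == a2.grad)  # tensor([True, True]), same gradients as before
print(-2*b2 == b2.grad)    # tensor([True, True])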
-
-
-######################################################################
-# Optional Reading - Vector Calculus using ``autograd``
-# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-#
-# Mathematically, if you have a vector valued function
-# :math:`\vec{y}=f(\vec{x})`, then the gradient of :math:`\vec{y}` with
-# respect to :math:`\vec{x}` is a Jacobian matrix :math:`J`:
-#
-# .. math::
-#
-#
-#    J
-#    =
-#    \left(\begin{array}{cc}
-#    \frac{\partial \bf{y}}{\partial x_{1}} &
-#    ... &
-#    \frac{\partial \bf{y}}{\partial x_{n}}
-#    \end{array}\right)
-#    =
-#    \left(\begin{array}{ccc}
-#    \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{1}}{\partial x_{n}}\\
-#    \vdots & \ddots & \vdots\\
-#    \frac{\partial y_{m}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
-#    \end{array}\right)
-#
-# Generally speaking, ``torch.autograd`` is an engine for computing
-# vector-Jacobian products. That is, given any vector :math:`\vec{v}`, it computes the product
-# :math:`J^{T}\cdot \vec{v}`.
-#
-# If :math:`\vec{v}` happens to be the gradient of a scalar function :math:`l=g\left(\vec{y}\right)`:
-#
-# .. math::
-#
-#
-#    \vec{v}
-#    =
-#    \left(\begin{array}{ccc}\frac{\partial l}{\partial y_{1}} & \cdots & \frac{\partial l}{\partial y_{m}}\end{array}\right)^{T}
-#
-# then by the chain rule, the vector-Jacobian product would be the
-# gradient of :math:`l` with respect to :math:`\vec{x}`:
-#
-# .. math::
-#
-#
-#    J^{T}\cdot \vec{v} = \left(\begin{array}{ccc}
-#    \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{1}}\\
-#    \vdots & \ddots & \vdots\\
-#    \frac{\partial y_{1}}{\partial x_{n}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
-#    \end{array}\right)\left(\begin{array}{c}
-#    \frac{\partial l}{\partial y_{1}}\\
-#    \vdots\\
-#    \frac{\partial l}{\partial y_{m}}
-#    \end{array}\right) = \left(\begin{array}{c}
-#    \frac{\partial l}{\partial x_{1}}\\
-#    \vdots\\
-#    \frac{\partial l}{\partial x_{n}}
-#    \end{array}\right)
-#
-# This characteristic of the vector-Jacobian product is what we use in the above example;
-# ``external_grad`` represents :math:`\vec{v}`.
-#
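
######################################################################
# As an illustrative check of the identity above, we can build :math:`J`
# explicitly with ``torch.autograd.functional.jacobian`` for a small function
# and compare :math:`J^{T}\cdot \vec{v}` with what ``backward()`` accumulates.
# The function ``f`` and the names ``x_in`` and ``v`` are made up for this sketch.
#

from torch.autograd.functional import jacobian

def f(x):
    return 3 * x**3                  # a simple vector-valued function

x_in = torch.tensor([2., 3.], requires_grad=True)
v = torch.tensor([1., 1.])           # plays the role of ``external_grad``

J = jacobian(f, x_in)                # explicit 2x2 Jacobian of f at x_in
f(x_in).backward(gradient=v)         # reverse mode accumulates J^T @ v in x_in.grad

print(torch.allclose(J.T @ v, x_in.grad))  # True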
-
-
-
-######################################################################
-# Computational Graph
-# ~~~~~~~~~~~~~~~~~~~
-#
-# Conceptually, autograd keeps a record of data (tensors) & all executed
-# operations (along with the resulting new tensors) in a directed acyclic
-# graph (DAG) consisting of
-# `Function <https://pytorch.org/docs/stable/autograd.html#torch.autograd.Function>`__
-# objects. In this DAG, leaves are the input tensors, roots are the output
-# tensors. By tracing this graph from roots to leaves, you can
-# automatically compute the gradients using the chain rule.
-#
-# In a forward pass, autograd does two things simultaneously:
-#
-# - run the requested operation to compute a resulting tensor, and
-# - maintain the operation’s *gradient function* in the DAG.
-#
-# The backward pass kicks off when ``.backward()`` is called on the DAG
-# root. ``autograd`` then:
-#
-# - computes the gradients from each ``.grad_fn``,
-# - accumulates them in the respective tensor’s ``.grad`` attribute, and
-# - using the chain rule, propagates all the way to the leaf tensors.
-#
-# Below is a visual representation of the DAG in our example. In the graph,
-# the arrows are in the direction of the forward pass. The nodes represent the backward functions
-# of each operation in the forward pass. The leaf nodes in blue represent our leaf tensors ``a`` and ``b``.
-#
-# .. figure:: /_static/img/dag_autograd.png
-#
-# .. note::
-#   **DAGs are dynamic in PyTorch**
-#   An important thing to note is that the graph is recreated from scratch; after each
-#   ``.backward()`` call, autograd starts populating a new graph. This is
-#   exactly what allows you to use control flow statements in your model;
-#   you can change the shape, size and operations at every iteration if
-#   needed.
-#
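
######################################################################
# As a small illustrative sketch of the DAG described above: leaf tensors have
# no ``grad_fn``, while every tensor produced by a tracked operation points at
# the backward function that created it. The names ``t``, ``u`` and ``s`` are
# made up for this example.
#

t = torch.tensor([1., 2.], requires_grad=True)
u = t * 2
s = u.sum()

print(t.is_leaf, t.grad_fn)   # True None  (leaf created by the user)
print(u.is_leaf, u.grad_fn)   # False <MulBackward0 object at ...>
print(s.grad_fn)              # <SumBackward0 object at ...>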
-# Exclusion from the DAG
-# ^^^^^^^^^^^^^^^^^^^^^^
-#
-# ``torch.autograd`` tracks operations on all tensors which have their
-# ``requires_grad`` flag set to ``True``. For tensors that don’t require
-# gradients, setting this attribute to ``False`` excludes them from the
-# gradient computation DAG.
-#
-# The output tensor of an operation will require gradients even if only a
-# single input tensor has ``requires_grad=True``.
-#
-
-x = torch.rand(5, 5)
-y = torch.rand(5, 5)
-z = torch.rand((5, 5), requires_grad=True)
-
-a = x + y
-print(f"Does `a` require gradients?: {a.requires_grad}")
-b = x + z
-print(f"Does `b` require gradients?: {b.requires_grad}")
-
-
-######################################################################
-# In a NN, parameters that don't compute gradients are usually called **frozen parameters**.
-# It is useful to "freeze" part of your model if you know in advance that you won't need the gradients of those parameters
-# (this offers some performance benefits by reducing autograd computations).
-#
-# In finetuning, we freeze most of the model and typically only modify the classifier layers to make predictions on new labels.
-# Let's walk through a small example to demonstrate this. As before, we load a pretrained resnet18 model, and freeze all the parameters.

-from torch import nn, optim
-
-model = resnet18(weights=ResNet18_Weights.DEFAULT)
-
-# Freeze all the parameters in the network
-for param in model.parameters():
-    param.requires_grad = False
-
-######################################################################
-# Let's say we want to finetune the model on a new dataset with 10 labels.
-# In resnet, the classifier is the last linear layer ``model.fc``.
-# We can simply replace it with a new linear layer (unfrozen by default)
-# that acts as our classifier.
-
-model.fc = nn.Linear(512, 10)
-
-######################################################################
-# Now all parameters in the model, except the parameters of ``model.fc``, are frozen.
-# The only parameters that compute gradients are the weights and bias of ``model.fc``.
-
-# Optimize only the classifier
-optimizer = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
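
##########################################################################
# As a quick illustrative check (the variable name ``trainable`` is made up for
# this sketch), we can list which parameters still require gradients after
# freezing the backbone and swapping in a fresh ``model.fc``:
#

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)   # expected: ['fc.weight', 'fc.bias']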
-
-##########################################################################
-# Notice that although we register all the parameters in the optimizer,
-# the only parameters that are computing gradients (and hence updated in gradient descent)
-# are the weights and bias of the classifier.
-#
-# The same exclusionary functionality is available as a context manager in
-# `torch.no_grad() <https://pytorch.org/docs/stable/generated/torch.no_grad.html>`__
-#
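
##########################################################################
# As a minimal sketch of the context-manager form: operations performed inside
# ``torch.no_grad()`` are not recorded in the DAG, so their outputs do not
# require gradients. The tensor ``w`` below is made up for this example.
#

w = torch.rand(5, 5, requires_grad=True)

with torch.no_grad():
    out = w * 2
print(out.requires_grad)   # False - not tracked inside the block

out = w * 2
print(out.requires_grad)   # True - tracking resumes outside the block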
-
-######################################################################
-# --------------
-#
-
-######################################################################
-# Further readings:
-# ~~~~~~~~~~~~~~~~~~~
-#
-# - `In-place operations & Multithreaded Autograd <https://pytorch.org/docs/stable/notes/autograd.html>`__
-# - `Example implementation of reverse-mode autodiff <https://colab.research.google.com/drive/1VpeE6UvEPRz9HmsHh1KS0XxXjYu533EC>`__
-# - `Video: PyTorch Autograd Explained - In-depth Tutorial <https://www.youtube.com/watch?v=MswxJw-8PvE>`__