| Session | Date | Room | File |
|---|---|---|---|
| S1 | 15/09/2025 | BARB02 | lab_forward.jl |
| S2 | 22/09/2025 | BARB02 | lab_reverse.jl |
Before coming to the practice session, install the necessary software:
- Install Julia: recommended way
- Install VSCode with the Julia extension: see here
During the lab, run the file lab_forward.jl (S1) or lab_reverse.jl (S2).
See here for how to run a file.
In summary,
- open the Julia REPL with `Alt + J` then `Alt + O`,
- activate the environment of the folder `lab-handout`, either:
  - enter the Pkg mode with `]` (the prompt should now be blue), then enter `activate lab-handout` (the prompt should now be `(lab-handout) pkg>`), and leave the Pkg prompt with a backspace,
  - or write `import Pkg; Pkg.activate("lab-handout")`.
- Place your cursor on the first line of the lab file (`lab_forward.jl` or `lab_reverse.jl`) and use `Shift + Enter` to execute the file line by line, interactively like a notebook.
Warning
If you run the line include("filename.jl") to load the definition
of methods, these won't get updated automatically when
you modify the file. To update them, you need to run the
include line again (or use Revise.includet as detailed here).
Automatic Differentiation (AD) is a technique to compute the derivatives of a function with machine-precision accuracy, relying on its implementation. Theoretically, the only thing we do is decompose the function into elementary operations, and apply the Chain Rule (just as we would by hand).
"Using the Chain Rule is like peeling an onion: you have to deal with each layer at a time, and if it's too big you'll start crying." (Anonymous professor)
In this lab, we aim to build an onion-peeling machine. As we will see, it is relatively simple. Moreover, once built, it lets us peel many varieties of onions, and we will realize that peeled onions are widely used today.
A classic example of AD application is the training of Neural Networks (NNs). NNs are parametrized functions, and training is the process of adjusting the parameters (the weights) so as to improve their average performance on some training data. More precisely, we minimize a loss function representing the cost of the model predictions being far from the actual targets.
During training, the weights (stored in matrices and vectors) are updated iteratively: at each step, the gradient of the loss with respect to the weights is computed, and the weights are moved a small step in the opposite direction (Gradient Descent).
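As an illustration, a bare-bones gradient-descent loop might look like the following (a hedged sketch; the name `gradient_descent` is hypothetical, and the lab's train.jl has its own version):

```julia
# Minimal gradient-descent sketch (illustrative; train.jl differs in its details).
function gradient_descent(loss_grad, w, lr, steps)
    for _ in 1:steps
        w = w .- lr .* loss_grad(w)  # move against the gradient
    end
    return w
end

# Example: minimize the quadratic loss L(w) = sum(w .^ 2), whose gradient is 2w.
w = gradient_descent(w -> 2 .* w, [1.0, -2.0], 0.1, 100)
# w is now close to the minimizer [0, 0]
```

The only ingredient AD provides is the `loss_grad` function: everything else in training is plain iteration.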
As was already said above, AD is nothing but automatically applying the Chain Rule to the elementary operations composing the function. If our function is a composition $f = f_L \circ \dots \circ f_2 \circ f_1$, we know by the Chain Rule that $f'(x) = f_L'(s_{L-1}) \cdots f_2'(s_1) \, f_1'(x)$, where $s_i$ denotes the intermediate result after $f_i$. Therefore, to get the derivative of the function $f$, it suffices to multiply the local Jacobians of the elementary operations, evaluated at the intermediate results. The function is conveniently represented as an expression graph, a Directed Acyclic Graph (DAG).
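As a concrete (hypothetical) example, take $f(x) = \sin(x^2)$: the Chain Rule gives $f'(x) = \cos(x^2) \, 2x$, the product of the two local derivatives. We can sanity-check this against a finite-difference approximation:

```julia
# Chain Rule by hand for f(x) = sin(x^2): f'(x) = cos(x^2) * 2x.
f(x) = sin(x^2)
df(x) = cos(x^2) * 2x   # product of the two local derivatives

# Finite-difference check (an approximation, unlike AD).
h = 1e-6
x = 1.3
fd = (f(x + h) - f(x - h)) / (2h)
# fd ≈ df(x) up to truncation error
```

Unlike finite differences, AD applies this decomposition exactly, which is why it reaches machine-precision accuracy.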
Each node of the expression graph corresponds to a basic operation in the composition. An example of such a representation of a function as a DAG is given below (Source: [2]).
Both implementations are explained below, and should be completed in the forward.jl and reverse.jl code files respectively.
A. Forward mode: The Forward mode is the most intuitive. It follows the natural order of the elementary operations: starting from the leaves of the DAG (the variables) and their local Jacobians, and propagating directional derivatives from left to right in the graph. Note that we said directional derivatives. Indeed, each forward pass through the expression graph starts with a direction $v$ in the input space, and computes the derivative of the output along this direction.
For some applications, only a directional derivative (sometimes called a Jacobian Vector Product (JVP)) is needed. When the entire gradient of a function of $n$ input variables is needed, $n$ forward passes are required (one per basis direction).
The implementation consists in:
- Defining Dual numbers, holding both the value of a function and the value of the derivative.
- Redefining all elementary operations on Dual numbers (so that each time an operation is called, both the result and the associated derivative are computed). The definitions come from the simple rules of calculus (derivative of a sum, product rule, derivative of a power, etc.).
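The two steps above might be sketched as follows (hypothetical names; the lab's forward.jl may define things differently):

```julia
# A minimal Dual number: value of the function and value of the derivative.
struct Dual
    val::Float64   # value of the function
    der::Float64   # derivative along the chosen direction
end

# Redefine elementary operations using the simple rules of calculus.
Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)                 # sum rule
Base.:*(a::Dual, b::Dual) = Dual(a.val * b.val, a.der * b.val + a.val * b.der) # product rule
Base.sin(a::Dual) = Dual(sin(a.val), cos(a.val) * a.der)                       # chain rule

# Differentiate f(x) = x * sin(x) at x = 2 by seeding the derivative with 1.
f(x) = x * sin(x)
d = f(Dual(2.0, 1.0))
# d.val == 2 * sin(2) and d.der == sin(2) + 2 * cos(2)
```

Note that nothing in `f` refers to Dual numbers: the derivative computation is triggered purely by the redefined operations, which is the whole point of operator-overloading-based forward mode.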
B. Reverse mode: The Reverse mode is less intuitive, but worth it. It consists of:
- A forward pass (from left to right in the expression graph), during which the value of the function is computed, as well as all the intermediate results of the elementary operations. Let's denote these intermediate results $s_i$ (result after function $f_i$).
- A backward pass (from right to left), during which the derivative of the function with respect to the intermediate results is propagated backward. This backpropagation consists of multiplying the previous result at each stage by the local Jacobian of the elementary operation in question. When we reach the left end of the expression graph, we have computed the desired derivative with respect to the variables $x$.

When starting a backward pass from the right end of the graph, one specific output function must be chosen. Said differently, if the function $f$ has $m$ outputs, $m$ backward passes are needed in Reverse mode to get the entire Jacobian of $f$.
Note that the forward pass is always necessary (even if only the derivative values are requested, not the function values). Indeed, the local Jacobians used in the backward pass (for some operations) depend on the function values, which are computed during the forward pass.
Finally, we would like to quickly draw your attention to the order in which the nodes of the graph are visited. Since the expression graph is a DAG (and not necessarily a tree), it is important to define an order of visit, so that no node is visited before all of its parents during the backward pass. This is ensured by the topo_sort function in the code.
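To make the two passes and the ordering concrete, here is a stripped-down sketch (hypothetical structure; the `Node` and `topo_sort` in reverse.jl differ in their details):

```julia
# Minimal reverse-mode sketch: each Node records its parents and the local
# derivatives with respect to them, plus a slot for the accumulated gradient.
mutable struct Node
    val::Float64
    parents::Vector{Node}
    locals::Vector{Float64}  # local derivatives w.r.t. each parent
    grad::Float64
end
Node(val, parents=Node[], locals=Float64[]) = Node(val, parents, locals, 0.0)

# Forward pass: each operation stores its value and local derivatives.
mul(a::Node, b::Node) = Node(a.val * b.val, [a, b], [b.val, a.val])
add(a::Node, b::Node) = Node(a.val + b.val, [a, b], [1.0, 1.0])

# Topological order ensures no node is visited before its consumers
# have been processed in the backward pass.
function topo_sort(root::Node)
    order, seen = Node[], Set{Node}()
    function visit(n)
        n in seen && return
        push!(seen, n)
        foreach(visit, n.parents)
        push!(order, n)
    end
    visit(root)
    reverse(order)
end

# Backward pass: propagate the gradient from the output back to the leaves.
function backward!(root::Node)
    root.grad = 1.0
    for n in topo_sort(root)
        for (p, d) in zip(n.parents, n.locals)
            p.grad += n.grad * d  # multiply by the local Jacobian
        end
    end
end

x = Node(3.0)
y = mul(x, x)   # y = x^2
backward!(y)
# x.grad == 6.0, the derivative of x^2 at x = 3
```

Note the `+=` in the backward pass: because `x` appears twice in `x * x`, its gradient receives two contributions, which is exactly why the DAG (rather than tree) structure matters.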
Note
- In `reverse.jl`, when creating the expression graph, we construct a Node for each of the variables and constants of the expression. Are the derivatives with respect to ALL these Nodes useful? If not, modify the code so as to avoid the unnecessary computations.
- In `reverse.jl`, the local Jacobians are computed and stored during the backward pass. This means that we have to check the symbol of the operation of each Node during the backward pass, implying a costly if-else. Note however that the local Jacobians could be computed during the forward pass (during which we have to check the symbol of the operation of each Node anyway), and stored in their respective Nodes. The backward pass would then only consist of multiplying the local Jacobians together, with no need to know the symbols, and would therefore be faster. Modify the code so as to implement this 'Jacobian-storing' version of Reverse mode. Is there any downside to this version?
- `forward.jl`: implements the forward mode AD.
- `reverse.jl`: implements the reverse mode AD.
- `lab_forward.jl`: tests your forward AD on example loss functions.
- `lab_reverse.jl`: tests your reverse AD on example loss functions.
- `models.jl`: defines the models and activation functions of the loss functions.
- `train.jl`: implements a simple Gradient Descent for training.
Note
The AD interface implemented by reverse and forward is chosen so as to match DifferentiationInterface. That way, you can also easily try out other backends. Note that their implementations may not be as readable as the lab solutions we will provide, because these backends favor efficiency and broad coverage of the Julia language, while we favor simplicity for didactic purposes.
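For instance, trying out the ForwardDiff backend through DifferentiationInterface might look like this (a sketch assuming both packages are installed in the environment):

```julia
# Hypothetical usage of DifferentiationInterface with the ForwardDiff backend.
using DifferentiationInterface
import ForwardDiff

f(x) = sum(abs2, x)                   # a loss-like function, f(x) = sum of squares
backend = AutoForwardDiff()           # backend selector (re-exported from ADTypes)
g = gradient(f, backend, [1.0, 2.0])  # gradient of f is 2x
```

Swapping `AutoForwardDiff()` for another backend selector is all it takes to change the AD engine behind the same call.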
[1] Evaluating Derivatives, A. Griewank and A. Walther
[2] The Elements of Differentiable Programming, M. Blondel, V. Roulet
[3] 3Blue1Brown, https://www.3blue1brown.com/?v=chain-rule-and-product-rule, https://www.3blue1brown.com/?v=neural-networks, https://www.3blue1brown.com/lessons/backpropagation#title
