
Commit f78a3d0

debug
1 parent 484d6ad commit f78a3d0

File tree

1 file changed: +0 -225 lines changed

chapter_compiler_frontend/Automatic_Differentiation.md

Lines changed: 0 additions & 225 deletions
@@ -197,228 +197,3 @@ implications.

![Illustration of forward-mode automatic differentiation](../img/ch04/AD-forward_example.png)
:label:`ch04/ch04-forward-mode-compute-function`

Figure :numref:`ch04/ch04-forward-mode-compute-function` illustrates the computation process of the forward mode. The sequence of elementary operations, derived from the source program, is displayed on the left. Following the chain rule and using established derivative evaluation rules, we sequentially compute each intermediate variable ${\dot{v}_i}=\frac{\partial v_i}{\partial x_1}$ from top to bottom, as depicted on the right. This leads to the computation of the final variable ${\dot{v}_5}=\frac{\partial y}{\partial x_1}$.

In the process of evaluating the derivatives of a function, we obtain the partial derivatives of every output with respect to every input of that function. For a function $f:{\mathbf{R}^n}\to \mathbf{R}^m$, where $n$ is the number of independent input variables $x_i$ and $m$ is the number of independent output variables $y_j$, the derivative results correspond to the following Jacobian matrix:

$$\mathbf{J}_{f}= \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix}$$

Each forward pass of function $f$ yields the partial derivatives of all outputs with respect to a single input $x_i$, represented by the vector below. This corresponds to one column of the Jacobian matrix, so executing $n$ forward passes gives us the full Jacobian matrix.

$$\begin{bmatrix} \frac{\partial y_1}{\partial x_i} \\ \vdots \\ \frac{\partial y_m}{\partial x_i} \end{bmatrix}$$

More generally, the forward mode allows us to compute Jacobian-vector products by initializing $\dot{\mathbf{x}}=\mathbf{r}$; choosing $\mathbf{r}$ as the $i$-th unit vector generates the results for the $i$-th column. As the derivative evaluation rules for elementary operations are pre-determined, we know the Jacobian matrix of every elementary operation. Consequently, by applying the chain rule to propagate the derivatives of $f$ from inputs to outputs, we obtain one column of the Jacobian matrix of the entire network.

$$\mathbf{J}_{f}\mathbf{r}= \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix} \begin{bmatrix} r_1 \\ \vdots \\ r_n \end{bmatrix}$$

### Reverse Mode

Figure :numref:`ch04/ch04-backward-mode-compute` illustrates the automatic differentiation process in the reverse mode. The sequence of elementary operations, derived from the source program, is displayed on the left. Beginning from $\bar{v}_5=\bar{y}=\frac{\partial y}{\partial y}=1$, we sequentially compute each intermediate variable ${\bar{v}_i}=\frac{\partial y_j}{\partial v_i}$ from bottom to top, leveraging the chain rule and established derivative evaluation rules (as depicted on the right). Thus, we can compute the final variables ${\bar{x}_1}=\frac{\partial y}{\partial x_1}$ and ${\bar{x}_2}=\frac{\partial y}{\partial x_2}$.

![Illustration of reverse-mode automatic differentiation](../img/ch04/AD-backward_example.png)
:label:`ch04/ch04-backward-mode-compute`

Every reverse pass of function $f$ produces the partial derivatives of a single output with respect to all inputs, represented by the vector below. This corresponds to a single row of the Jacobian matrix, so executing $m$ reverse passes gives us the full Jacobian matrix.

$$\begin{bmatrix} \frac{\partial y_j}{\partial x_1} & \cdots & \frac{\partial y_j}{\partial x_n} \end{bmatrix}$$

Similarly, the reverse mode computes vector-Jacobian products: initializing $\bar{\mathbf{y}}=\mathbf{r}$ with the $j$-th unit vector generates the results for the $j$-th row.

$$\mathbf{r}^{T}\mathbf{J}_{f}= \begin{bmatrix} r_1 & \cdots & r_m \end{bmatrix} \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix}$$

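To make the two products concrete, here is a small sketch using JAX's `jvp` and `vjp` functions (JAX is used only as an illustration and is not discussed in this chapter; any library exposing both modes would serve). Seeding the forward mode with a unit vector extracts one column of $\mathbf{J}_{f}$, and seeding the reverse mode with a unit vector extracts one row:

```
import jax
import jax.numpy as jnp

def f(x):
    # A toy function f: R^3 -> R^2 used purely for illustration.
    return jnp.stack([x[0] * x[1], x[1] + jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])

# Forward mode: one Jacobian-vector product yields one column of J_f.
r_in = jnp.array([1.0, 0.0, 0.0])          # seed x_dot = r (first unit vector)
_, col = jax.jvp(f, (x,), (r_in,))         # partial derivatives with respect to x_1

# Reverse mode: one vector-Jacobian product yields one row of J_f.
_, vjp_fn = jax.vjp(f, x)
(row,) = vjp_fn(jnp.array([1.0, 0.0]))     # partial derivatives of y_1
```

Each additional column requires another forward pass, and each additional row another reverse pass, which is exactly the trade-off discussed next.
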
The number of rows and columns in the Jacobian matrix directly determines how many forward or reverse passes are needed to compute it for a given function $f$. This characteristic is particularly significant when choosing the more efficient mode of automatic differentiation.

When the function has significantly fewer inputs than outputs ($f:{\mathbf{R}^n}\to \mathbf{R}^m$ with $n \ll m$), the forward mode proves to be more efficient. Conversely, when the function has considerably more inputs than outputs ($f:{\mathbf{R}^n}\to \mathbf{R}^m$ with $n \gg m$), the reverse mode becomes advantageous.

In the extreme case where the function maps $n$ inputs to a single output, $f:{\mathbf{R}^n}\to \mathbf{R}$, we can evaluate all the derivatives of the output with respect to the inputs, $(\frac{\partial y}{\partial x_1},\cdots,\frac{\partial y}{\partial x_n})$, using a single reverse pass, whereas the forward mode would require $n$ passes. This situation is akin to derivative evaluation for a multi-input, single-output network, a structure frequently encountered in machine learning.

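As an illustrative sketch of this trade-off (again using JAX only as an example tool, not one mentioned in the original text), assembling the gradient of a many-input, single-output function conceptually costs one pass in reverse mode but one pass per input in forward mode:

```
import jax
import jax.numpy as jnp

def g(x):
    # g: R^n -> R, many inputs, a single scalar output.
    return jnp.sum(x ** 2)

x = jnp.ones(1000)

# Conceptually one Jacobian-vector product per input (n = 1000 columns).
grad_fwd = jax.jacfwd(g)(x)
# Conceptually one vector-Jacobian product per output (m = 1 row).
grad_rev = jax.jacrev(g)(x)
```

Both calls return the same gradient; only the number of underlying passes differs.
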
Due to this property, reverse-mode automatic differentiation forms the basis of the backpropagation algorithm, a key technique for training neural networks. By enabling efficient computation of gradients, especially in scenarios with high-dimensional input data and a scalar output (common in many machine learning applications), reverse-mode automatic differentiation has become indispensable in the field.

However, the reverse mode does come with certain limitations. Once a source program is decomposed into a sequence of elementary operations, the forward mode can evaluate derivatives in lockstep with the execution of these operations, because the order of derivative evaluation matches the order of execution. In contrast, in the reverse mode, derivatives are evaluated in the inverse of the execution order of the source program, leading to a two-phase computation process. The first phase executes the source program and stores the intermediate results in memory, while the second phase retrieves these intermediate results to evaluate the derivatives. Because the intermediate results must be kept until the backward phase consumes them, the reverse mode requires more memory.

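The two phases can be sketched as follows (a minimal illustration using the hypothetical function $y=(x_1+x_2)\cdot x_2$, not an example taken from the figures): the forward phase stores every intermediate value, and the backward phase reads them back in reverse order to accumulate the adjoints.

```
def forward(x1, x2):
    # Phase 1: execute the source program and store the intermediate results.
    v1 = x1 + x2
    v2 = v1 * x2                          # y = (x1 + x2) * x2
    tape = {"x1": x1, "x2": x2, "v1": v1, "v2": v2}
    return v2, tape

def backward(tape):
    # Phase 2: retrieve the stored intermediates in reverse order and
    # propagate the adjoints v_bar = dy/dv with the chain rule.
    v2_bar = 1.0                          # dy/dy = 1
    v1_bar = v2_bar * tape["x2"]          # d(v1 * x2)/dv1 = x2
    x2_bar = v2_bar * tape["v1"] + v1_bar # x2 feeds both v2 and v1
    x1_bar = v1_bar                       # d(x1 + x2)/dx1 = 1
    return x1_bar, x2_bar

y, tape = forward(3.0, 4.0)
print(backward(tape))                     # (4.0, 11.0)
```

The dictionary `tape` here is exactly the memory cost discussed above: every intermediate value of the forward phase stays alive until the backward phase has used it.
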
## Implementing Automatic Differentiation

This section explores typical design patterns for implementing automatic differentiation in machine learning frameworks. These design patterns can be broadly classified into three categories: elemental libraries, operator overloading, and source transformation.

### Elemental Libraries

Elemental libraries encapsulate elementary expressions and their differential expressions as library functions. When coding, users must manually decompose a program into a set of elementary expressions and replace them with the corresponding library functions. Take the program $a=(x+y)/z$ as an example; it needs to be manually decomposed as follows:

    t = x + y
    a = t / z

Subsequently, users replace the decomposed elementary expressions with calls to the corresponding library functions:

    # The arguments are the variables x, y and their derivatives dx, dy; the results are t and its derivative dt.
    t, dt = ADAdd(x, dx, y, dy)
    # The arguments are the variables t, z and their derivatives dt, dz; the results are a and its derivative da.
    a, da = ADDiv(t, dt, z, dz)

The library functions ADAdd and ADDiv use the chain rule to define the differential expressions of Add and Div, respectively. This is illustrated in Code `lst:diff`.

**lst:diff**
```
def ADAdd(x, dx, y, dy):
    # Sum rule: d(x + y) = dx + dy
    z = x + y
    dz = dx + dy
    return z, dz

def ADDiv(x, dx, y, dy):
    # Quotient rule: d(x / y) = dx / y - (x / y^2) * dy
    z = x / y
    dz = dx / y - (x / (y * y)) * dy
    return z, dz
```

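For example (a hypothetical usage following the decomposition above), seeding the derivative of $x$ with 1 and the others with 0 yields the forward-mode derivative $\partial a/\partial x$:

```
x, y, z = 1.0, 2.0, 4.0
# Seed the input we differentiate with respect to: dx = 1, dy = dz = 0.
t, dt = ADAdd(x, 1.0, y, 0.0)
a, da = ADDiv(t, dt, z, 0.0)
print(a, da)   # a = (x + y) / z = 0.75, da/dx = 1 / z = 0.25
```
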
Elemental libraries constitute a simple and straightforward way of implementing automatic differentiation for a programming language. However, this approach requires users to manually decompose a program into elementary expressions before calling the library functions, and it prevents them from writing programs with the native expressions of the programming language.

### Operator Overloading

Leveraging the polymorphism inherent in modern programming languages, the operator overloading design pattern redefines the semantics of elementary operations so as to encapsulate their differentiation rules. During execution, it records the type, inputs, and outputs of every elementary operation in a data structure known as a tape. The tape thus traces the program, providing a pathway for applying the chain rule: the recorded elementary operations can be traversed in either the forward or the backward direction to carry out differentiation. As depicted in Code `lst:OO`, we take the AutoDiff library as an example of overloading the basic arithmetic operators of a programming language.

**lst:OO**
```
namespace AutoDiff
{
    public abstract class Term
    {
        // To overload and call operators (`+`, `*`, and `/`),
        // TermBuilder records the types, inputs, and outputs of operations in tapes.
        public static Term operator+(Term left, Term right)
        {
            return TermBuilder.Sum(left, right);
        }
        public static Term operator*(Term left, Term right)
        {
            return TermBuilder.Product(left, right);
        }
        public static Term operator/(Term numerator, Term denominator)
        {
            return TermBuilder.Product(numerator, TermBuilder.Power(denominator, -1));
        }
    }

    // Tape data structures include the following basic elements:
    // 1) Arithmetic results of operations
    // 2) Derivative evaluation results corresponding to arithmetic results of operations
    // 3) Inputs of operations
    // In addition, functions Eval and Diff are used to define the computation and differentiation rules of the arithmetic operations.
    internal abstract class TapeElement
    {
        public double Value;
        public double Adjoint;
        public InputEdges Inputs;

        public abstract void Eval();
        public abstract void Diff();
    }
}
```

Operator overloading has the advantage of tracing the program through function calls and control flow, which keeps the implementation simple and straightforward. However, the need to trace the program at runtime introduces certain challenges. Operator overloading must execute reverse-mode differentiation along the recorded trace, which can degrade performance, particularly for elementary operations that execute quickly (the tracing overhead then dominates). Furthermore, because it works at runtime, operator overloading cannot perform compile-time graph optimization before execution, and control flow can only be unfolded with the information available at runtime. Despite these challenges, operator overloading is extensively employed, for example in the PyTorch framework, for automatic differentiation due to its simplicity and adaptability.

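As a concrete illustration of this pattern (a minimal usage sketch, not a description of the framework's internals), PyTorch's overloaded tensor operators record the operations applied to tensors created with `requires_grad=True`, and `backward()` then replays the recorded trace in reverse:

```
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
z = torch.tensor(4.0, requires_grad=True)

# The overloaded + and / operators record the operations as they execute.
a = (x + y) / z
# Reverse-mode differentiation runs along the recorded trace.
a.backward()
print(x.grad, y.grad, z.grad)   # 1/z, 1/z, -(x + y)/z**2
```
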
### Source Transformation

Source transformation is a design pattern that extends a programming language by analyzing a program's source code or its abstract syntax tree (AST) to automatically decompose the program into a set of differentiable elementary operations, each with predefined differentiation rules. The chain rule is then employed to combine the differential expressions of the elementary operations into a new program expression that computes the derivatives. Source transformation is integral to machine learning frameworks such as TensorFlow and MindSpore.

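As a sketch of the idea (hypothetical generated code, not the output of any particular framework), a source transformation tool could turn the program $a=(x+y)/z$ into a new forward-mode derivative program:

```
def f(x, y, z):
    t = x + y
    a = t / z
    return a

# A hypothetical program produced by source transformation: each statement of f
# is followed by the differential statement derived from its differentiation rule.
def df(x, dx, y, dy, z, dz):
    t = x + y
    dt = dx + dy
    a = t / z
    da = dt / z - (t / (z * z)) * dz
    return a, da
```
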
Unlike operator overloading, which works within the programming language, source transformation requires parsers and tools that manipulate IRs. It also requires transformation rules for function calls and control flow statements, such as loops and conditionals. The principal advantage of source transformation is that the automatic differentiation transformation is performed only once per program, eliminating the runtime transformation overhead. Additionally, the complete differentiation program is available at compile time, enabling ahead-of-time optimization by compilers.

However, source transformation has a higher implementation complexity than the other approaches. It must support a wider array of data types and operations, and it requires preprocessors, compilers, or interpreters for the extended language, along with a more robust type-checking system. Even though source transformation does not perform the automatic differentiation transformation at runtime, it must still ensure that certain intermediate variables from the forward pass are accessible to the adjoint computation in reverse mode. Two modes are available to facilitate this:

- **Tape-based mode**: This mode uses a global tape to ensure that intermediate variables remain accessible. The primitive function is augmented so that intermediate variables are written to the tape during the forward pass, and the adjoint program reads them back from the tape during the backward pass. The tape used in source transformation primarily stores the intermediate variables, whereas the tape used in operator overloading additionally stores the types of the executed operations. Because the tape is a data structure built at runtime, custom compiler optimizations are required. Moreover, tape read and write operations must themselves be differentiable in order to support higher-order differentiation, which involves applying the reverse mode multiple times. As most tape-based tools do not differentiate tape reads and writes, they do not support reverse-over-reverse automatic differentiation.

- **Closure-based mode**: This mode was proposed to mitigate some of the limitations of the tape-based mode. In functional programming, a closure can capture the execution environment of a statement and identify the non-local use of intermediate variables, as sketched below.
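
A minimal sketch of the closure-based idea (hypothetical Python, not the mechanism of any particular framework): the forward computation returns, alongside its value, a backward closure that captures exactly the intermediate variables it needs, so no global tape is required.

```
def f_with_vjp(x, y, z):
    # Forward computation for a = (x + y) / z.
    t = x + y
    a = t / z

    def backward(da):
        # The closure captures t and z from the enclosing environment.
        dt = da / z
        dz = -da * t / (z * z)
        dx = dt
        dy = dt
        return dx, dy, dz

    return a, backward

a, backward = f_with_vjp(1.0, 2.0, 4.0)
print(backward(1.0))   # (0.25, 0.25, -0.1875)
```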
