docs/src/theory.md (9 additions, 9 deletions)
@@ -14,10 +14,10 @@ i.e., the function value and the directional derivative up to order $p$.
This notation might be unfamiliar to Julia users who have experience with other AD packages, but $\partial f(x)$ is simply the Jacobian $J$, and $\partial f(x)\times v$ is simply the Jacobian-vector product (JVP).
In other words, this is a simple generalization of the Jacobian-vector product to the Hessian-vector-vector product, and to even higher orders.
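Spelled out as a worked expansion (using a bracket notation $\partial^k f(x)[v,\ldots,v]$ for the higher directional derivatives, which is not used elsewhere in this document), the order-$p$ pushforward packages all of these objects at once:

```math
f(x+tv)=f(x)+\partial f(x)v\,t+\frac{1}{2}\partial^2 f(x)[v,v]\,t^2+\cdots+\frac{1}{p!}\partial^p f(x)[v,\ldots,v]\,t^p+O(t^{p+1})
```

The coefficient of $t$ is the JVP, the coefficient of $t^2$ is half the Hessian-vector-vector product, and so on.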
- The main advantage of doing this instead of doing $p$ first-order Jacobian-vector products is that nesting first-order AD results in exponential scaling w.r.t $p$, while this method, also known as Taylor mode, should be (almost) linear scaling w.r.t $p$.
+ The main advantage of doing this instead of doing $p$ first-order Jacobian-vector products is that nesting first-order AD results in exponential scaling w.r.t $p$, while this method, also known as Taylor mode, should scale (almost) linearly w.r.t $p$.
We will see the reason for this claim later.
- In order to achieve this, assuming that $f$ is a nested function $f_k\circ\cdots\circ f_2\circ f_1$, where each $f_i$ is a basic and simple function, or called "primitives".
+ In order to achieve this, we assume that $f$ is a nested function $f_k\circ\cdots\circ f_2\circ f_1$, where each $f_i$ is a basic and simple function, also called a "primitive".
We need to figure out how to propagate the derivatives through each step.
In first-order AD, this is achieved by the "dual" pair $x_0+x_1\varepsilon$, where $\varepsilon^2=0$, and for each primitive we make a method overload
```math
@@ -118,20 +118,20 @@ Note that this is an elegant and straightforward corollary from the definition o
## Generic pushforward rule
- For a generic $f(x)$, if we don't bother deriving the specific recurrence rule for it, we can still automatically generate pushforward rule in the following manner.
+ For a generic $f(x)$, if we don't bother deriving the specific recurrence rule for it, we can still automatically generate a pushforward rule in the following manner.
Let's denote the derivative of $f$ w.r.t. $x$ by $d(x)$; then for $f(t)=f(x(t))$ we have
```math
f'(t)=d(x(t))x'(t);\quad f(0)=f(x_0)
```
When we substitute the expansions of $f$ and $x$ up to order $p$ into this equation, we notice that only order $p-1$ of $d(x(t))$ is needed.
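To spell this out (a worked coefficient match; $f_k$, $x_k$ and $d_k$ denote the Taylor coefficients of $f(t)$, $x(t)$ and $d(x(t))$ respectively), equating the coefficients of $t^k$ on both sides gives

```math
(k+1)\,f_{k+1}=\sum_{j=0}^{k} d_j\,(k+1-j)\,x_{k+1-j},\qquad k=0,\ldots,p-1,
```

so even the highest coefficient $f_p$ only touches $d_0,\ldots,d_{p-1}$.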
- In other words, we turn a problem of finding $p$-th order pushforward for $f$, to a problem of finding $p-1$-th order pushforward for $d$, and we can recurse down to the first order.
- The first-order derivative expressions are captured from ChainRules.jl, which made this process fully automatic.
+ In other words, we turn a problem of finding the $p$-th order pushforward for $f$ into a problem of finding the $(p-1)$-th order pushforward for $d$, and we can recurse down to the first order.
+ The first-order derivative expressions are captured from ChainRules.jl, which makes this process fully automatic.
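As a standalone sketch of this mechanical step (illustrative code, not TaylorDiff.jl's internals; the name `raise` and the plain-vector representation are assumptions of this sketch), the coefficient match above turns the order-$(p-1)$ coefficients of $d(x(t))$ into the order-$p$ coefficients of $f(x(t))$:

```julia
# x holds the coefficients x_0 … x_p of x(t); d holds d_0 … d_{p-1} of d(x(t)).
# Returns the coefficients f_0 … f_p of f(x(t)); f_0 must be set by the caller to f(x_0).
function raise(x::Vector{<:Real}, d::Vector{<:Real})
    p = length(x) - 1
    f = zeros(p + 1)
    for k in 0:p-1
        acc = 0.0
        for j in 0:k
            # coefficient of t^k in d(x(t)) * x'(t)
            acc += d[j + 1] * (k + 1 - j) * x[k + 2 - j]
        end
        f[k + 2] = acc / (k + 1)     # f_{k+1}
    end
    return f
end
```

The double loop is a length-$p$ convolution, which is also where the quadratic cost mentioned below comes from.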
- This strategy is in principle equivalent to nesting first-order differentiation, which could potentially leads to exponential scaling; however, in practice there is a huge difference.
- This generation of pushforward rule happens at **compile time**, which gives the compiler a chance to check redundant expressions and optimize it down to quadratic time.
- Compiler has stack limits but this should work for at least up to order 100.
+ This strategy is in principle equivalent to nesting first-order differentiation, which could potentially lead to exponential scaling; however, in practice there is a huge difference.
+ This generation of pushforward rules happens at **compile time**, which gives the compiler a chance to eliminate redundant expressions and optimize the rule down to quadratic time.
+ The compiler has stack limits, but this should work at least up to order 100.
- In the current implementation of TaylorDiff.jl, all $\log$-like functions' pushforward rules are generated by this strategy, since their derivatives are simple algebraic expressions; some $\exp$-like functions, like sinh, is also generated; the most-often-used several $\exp$-like functions are hand-written with hand-derived recurrence relations.
+ In the current implementation of TaylorDiff.jl, all $\log$-like functions' pushforward rules are generated by this strategy, since their derivatives are simple algebraic expressions; some $\exp$-like functions, such as $\sinh$, are also generated this way; several of the most frequently used $\exp$-like functions are hand-written with hand-derived recurrence relations.
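For illustration, here is a hand-derived recurrence of the kind referred to here, written as a standalone sketch rather than TaylorDiff.jl's actual overload: if $e(t)=\exp(x(t))$, then $e'(t)=e(t)\,x'(t)$, and matching coefficients gives $k\,e_k=\sum_{j=1}^{k} j\,x_j\,e_{k-j}$.

```julia
# x = [x_0, x_1, …, x_p]: Taylor coefficients of x(t); returns those of exp(x(t)).
function exp_coefficients(x::Vector{<:Real})
    p = length(x) - 1
    e = zeros(p + 1)
    e[1] = exp(x[1])                  # e_0 = exp(x_0)
    for k in 1:p
        acc = 0.0
        for j in 1:k
            acc += j * x[j + 1] * e[k - j + 1]
        end
        e[k + 1] = acc / k            # e_k
    end
    return e
end

# Sanity check with x(t) = t, i.e. exp(t) itself: coefficients should be 1/k!.
exp_coefficients([0.0, 1.0, 0.0, 0.0])   # ≈ [1.0, 1.0, 0.5, 0.1667]
```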
If you find that the code generated by this strategy is slow, please file an issue and we will look into it.