
Commit 4e53612

mcabbott and darsnack authored
Document the need for explicit gradients (#80)
* document use of explicit gradients
* wording
* Update docs/src/index.md
  Co-authored-by: Kyle Daruwalla <[email protected]>
* wording
* add a Usage with Lux.jl section too
* further comment on model state
* better notation
* use the same resnet example for Lux
* pipe to gpu
* tweak resnet lines

Co-authored-by: Kyle Daruwalla <[email protected]>
1 parent 7f26f7f · commit 4e53612

2 files changed: +62 -14 lines changed


README.md

Lines changed: 6 additions & 2 deletions
@@ -21,7 +21,7 @@ Optimisers.jl defines many standard gradient-based optimisation rules, and tools
 
 This is the future of training for [Flux.jl](https://github.com/FluxML/Flux.jl) neural networks,
 and the present for [Lux.jl](https://github.com/avik-pal/Lux.jl).
-But it can be used separately on anything understood by [Functors.jl](https://github.com/FluxML/Functors.jl).
+But it can be used separately on any array, or anything else understood by [Functors.jl](https://github.com/FluxML/Functors.jl).
 
 ## Installation
 
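For instance (a minimal sketch, not shown in this diff), a rule can be applied directly to a plain array, with a hand-written vector standing in for a gradient from an AD package:

```julia
using Optimisers

x = [1.0, 2.0, 3.0]                      # any plain array, no Functors-aware model needed
state = Optimisers.setup(Optimisers.Descent(0.1), x)

grad = [1.0, 1.0, 1.0]                   # stand-in for the gradient of some loss at x
state, x = Optimisers.update(state, x, grad)   # x is now [0.9, 1.9, 2.9]
```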

@@ -38,11 +38,15 @@ state, and the model with its trainable parameters adjusted:
 ```julia
 state = Optimisers.setup(Optimisers.Adam(), model) # just once
 
+grad = Zygote.gradient(m -> loss(m(x), y), model)[1]
+
 state, model = Optimisers.update(state, model, grad) # at every step
 ```
 
 For models with deeply nested layers containing the parameters (like [Flux.jl](https://github.com/FluxML/Flux.jl) models),
-this state is a similarly nested tree.
+this state is a similarly nested tree. As is the gradient: if using Zygote, you must use the "explicit" style as shown,
+not the "implicit" one with `Params`.
+
 The function `destructure` collects all the trainable parameters into one vector,
 and returns this along with a function to re-build a similar model:
 
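As an illustration (a minimal sketch, not part of this diff; the small `Chain` below is hypothetical), `destructure` and the re-building function it returns are used roughly like this:

```julia
using Flux, Optimisers

model = Chain(Dense(2 => 3, tanh), Dense(3 => 1))   # small illustrative model

flat, re = Optimisers.destructure(model)   # flat is a Vector of all trainable parameters
model2 = re(2 .* flat)                     # re-builds a model of the same shape from any such vector
```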

docs/src/index.md

Lines changed: 56 additions & 12 deletions
@@ -1,6 +1,6 @@
 # Optimisers.jl
 
-## Define an Optimiser
+## Defining an Optimiser
 
 A new optimiser must overload two functions, `apply!` and `init`:
 
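As a reminder of that interface (a minimal sketch, not part of this diff; the rule `MyMomentum` is made up for illustration, and on recent versions of Optimisers.jl rules subtype `Optimisers.AbstractRule`), `init` creates the per-array state and `apply!` returns the change to subtract:

```julia
using Optimisers

struct MyMomentum <: Optimisers.AbstractRule
  eta::Float64
  rho::Float64
end

# init returns whatever state the rule keeps for one array, here a velocity buffer:
Optimisers.init(o::MyMomentum, x::AbstractArray) = zero(x)

# apply! returns (new state, change which update will subtract from x);
# it is free to mutate the state it was handed:
function Optimisers.apply!(o::MyMomentum, state, x, dx)
  @. state = o.rho * state + dx
  return state, o.eta .* state
end
```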

@@ -30,7 +30,7 @@ is a key design principle and allows users to manage their own state explicitly.
 
 It of course also makes it easier to store the state.
 
-## Usage
+## Usage with [Flux.jl](https://github.com/FluxML/Flux.jl)
 
 To apply such an optimiser to a whole model, `setup` builds a tree containing any initial
 state for every trainable array. Then at each step, `update` uses this and the gradient
@@ -40,29 +40,73 @@ to adjust the model:
 
 using Flux, Metalhead, Optimisers
 
-model = Metalhead.ResNet18() # define a model to train on
-image = rand(Float32, 224, 224, 3, 1); # dummy data
-@show sum(model(image)); # dummy loss function
+model = Metalhead.ResNet(18) |> gpu # define a model to train
+image = rand(Float32, 224, 224, 3, 1) |> gpu; # dummy data
+@show sum(model(image)); # dummy loss function
 
-o = Optimisers.ADAM() # define an ADAM optimiser with default settings
-st = Optimisers.setup(o, model); # initialize the optimiser before using it
+rule = Optimisers.Adam() # use the Adam optimiser with its default settings
+state = Optimisers.setup(rule, model); # initialise this optimiser's momentum etc.
 
-m̄, _ = gradient(model, image) do m, x # calculate the gradients
+∇model, _ = gradient(model, image) do m, x # calculate the gradients
   sum(m(x))
 end;
 
-st, model = Optimisers.update(st, model, m̄);
+state, model = Optimisers.update(state, model, ∇model);
 @show sum(model(image));
 
 ```
 
 Notice that a completely new instance of the model is returned. Internally, this
 is handled by [Functors.jl](https://fluxml.ai/Functors.jl), where we do a walk over the
-tree formed by the model and update the parameters using the gradients. Optimisers can
-work with different forms of gradients, but most likely use case are the gradients as
-returned by [Zygote.jl](https://fluxml.ai/Zygote.jl).
+tree formed by the model and update the parameters using the gradients.
+
+Optimisers.jl does not depend on any one automatic differentiation package,
+but for now the most likely source of gradients is [Zygote.jl](https://fluxml.ai/Zygote.jl).
+Note that `update` always wants the gradient from Zygote's "explicit" mode, as shown above.
+This `∇model` is another tree structure, rather than the dictionary-like object from
+Zygote's "implicit" mode `gradient(() -> loss(...), Flux.params(model))` -- see
+[Zygote's documentation](https://fluxml.ai/Zygote.jl/dev/#Explicit-and-Implicit-Parameters-1) for more about this difference.
 
 There is also `Optimisers.update!` which similarly returns a new model and new state,
 but is free to mutate arrays within the old one for efficiency.
 The method of `apply!` you write is likewise free to mutate arrays within its state;
 they are defensively copied when this rule is used with `update`.
+
+## Usage with [Lux.jl](https://github.com/avik-pal/Lux.jl)
+
+The main design difference of Lux is that the tree of parameters is separate from
+the layer structure. It is these parameters which `setup` and `update` need to know about.
+
+Lux describes this separation of parameter storage from model description as "explicit" parameters.
+Beware that it has nothing to do with Zygote's notion of "explicit" gradients.
+(If the same model is written in Flux and Lux, `∇model` above and `∇params` below will often be
+identical trees of nested `NamedTuple`s.)
+
+```julia
+
+using Lux, Boltz, Zygote, Optimisers
+
+lux_model, params, lux_state = Boltz.resnet(:resnet18) |> gpu; # define and initialise model
+images = rand(Float32, 224, 224, 3, 4) |> gpu; # batch of dummy data
+y, _ = Lux.apply(lux_model, images, params, lux_state); # run the model
+@show sum(y) # initial dummy loss
+
+rule = Optimisers.Adam()
+opt_state = Optimisers.setup(rule, params); # optimiser state based on model parameters
+
+∇params, _ = gradient(params, images) do p, x # gradient with respect to parameter tree
+  y, _ = Lux.apply(lux_model, x, p, lux_state)
+  sum(y)
+end;
+
+opt_state, params = Optimisers.update!(opt_state, params, ∇params);
+
+y, _ = Lux.apply(lux_model, images, params, lux_state);
+@show sum(y)
+
+```
+
+Besides the parameters stored in `params` and gradually optimised, any other model state
+is stored in `lux_state`. For simplicity this example does not show how to propagate the
+updated `lux_state` to the next iteration, see Lux's documentation.
+
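For completeness, here is one way (a sketch, not part of this commit; the step count is arbitrary and the dummy `images` above stand in for real data) to carry the updated Lux state from one step to the next:

```julia
using Lux, Zygote, Optimisers

# Continues the example above. Zygote.pullback returns both outputs of the closure,
# so the loss can be differentiated while the updated Lux state is kept.
function train!(lux_model, params, lux_state, opt_state, images; steps = 3)
    for _ in 1:steps
        (l, new_state), back = Zygote.pullback(params) do p
            y, st = Lux.apply(lux_model, images, p, lux_state)
            sum(y), st
        end
        ∇params = back((one(l), nothing))[1]   # gradient w.r.t. the parameter tree only
        opt_state, params = Optimisers.update!(opt_state, params, ∇params)
        lux_state = new_state                  # thread the non-trainable state forward
    end
    return params, lux_state, opt_state
end

params, lux_state, opt_state = train!(lux_model, params, lux_state, opt_state, images);
```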
