
Commit 13bb061

move schedulers (#2560)
1 parent 7a3eafd commit 13bb061

File tree

2 files changed: +34 -31 lines changed


docs/src/guide/training/training.md

Lines changed: 33 additions & 0 deletions
@@ -337,6 +337,39 @@ opt_state = Flux.setup(Adam(0.02), bimodel)
Flux.adjust!(opt_state.layers.enc, 0.03)
```

## Scheduling Optimisers

In practice, it is fairly common to schedule the learning rate of an optimiser to obtain faster convergence. There are a variety of popular scheduling policies, and you can find implementations of them in [ParameterSchedulers.jl](http://fluxml.ai/ParameterSchedulers.jl/stable). The documentation for ParameterSchedulers.jl provides a more detailed overview of the different scheduling policies, and how to use them with Flux optimisers. Below, we provide a brief snippet illustrating a [cosine annealing](https://arxiv.org/pdf/1608.03983.pdf) schedule with a momentum optimiser.

First, we import ParameterSchedulers.jl and initialize a cosine annealing schedule to vary the learning rate between `1e-4` and `1e-2` every 10 epochs. We also create a new [`Momentum`](@ref Optimisers.Momentum) optimiser.
```julia
using ParameterSchedulers

opt_state = Flux.setup(Momentum(), model)
schedule = Cos(λ0 = 1e-4, λ1 = 1e-2, period = 10)
for (eta, epoch) in zip(schedule, 1:100)
    Flux.adjust!(opt_state, eta)
    # your training code here
end
```
`schedule` can also be indexed (e.g. `schedule(100)`) or iterated like any iterator in Julia.
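
For example, a quick sketch reusing the `schedule` defined above (the variable names here are only illustrative):
```julia
eta_100 = schedule(100)                        # learning rate at iteration 100
first_5 = collect(Iterators.take(schedule, 5)) # first five values, via iteration
```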

ParameterSchedulers.jl schedules are stateless (they don't store their iteration state). If you want a _stateful_ schedule, you can use `ParameterSchedulers.Stateful`:
```julia
using ParameterSchedulers: Stateful, next!

schedule = Stateful(Cos(λ0 = 1e-4, λ1 = 1e-2, period = 10))
for epoch in 1:100
    Flux.adjust!(opt_state, next!(schedule))
    # your training code here
end
```

Finally, a scheduling function can be incorporated into the optimiser's state, advanced at each gradient update step, and possibly passed to the `train!` function. See [this section](https://fluxml.ai/ParameterSchedulers.jl/stable/tutorials/optimizers/#Working-with-Flux-optimizers) of the ParameterSchedulers.jl documentation for more details.
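
Below is a minimal sketch of that per-step pattern, built only from the `Flux.adjust!` and `Stateful` API shown above; the names `model`, `loss`, and `train_loader` are assumed to be defined elsewhere, and with the schedule advanced once per batch, `period = 10` now means 10 gradient steps rather than 10 epochs. The linked tutorial shows a tighter integration with the optimiser state.
```julia
using ParameterSchedulers: Stateful, next!

schedule = Stateful(Cos(λ0 = 1e-4, λ1 = 1e-2, period = 10))
opt_state = Flux.setup(Momentum(), model)       # `model` is assumed to exist

for epoch in 1:100
    for (x, y) in train_loader                  # `train_loader` is assumed to exist
        # advance the schedule one step and update the optimiser's learning rate
        Flux.adjust!(opt_state, next!(schedule))
        grads = Flux.gradient(m -> loss(m(x), y), model)
        Flux.update!(opt_state, model, grads[1])
    end
end
```
Because the schedule is stateful, it keeps its own step counter across epochs, so no separate counter variable is needed.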

ParameterSchedulers.jl allows for many more scheduling policies including arbitrary functions, looping any function with a given period, or sequences of many schedules. See the [ParameterSchedulers.jl documentation](https://fluxml.ai/ParameterSchedulers.jl/stable) for more info.

## Freezing layer parameters

To completely disable training of some part of the model, use [`freeze!`](@ref Flux.freeze!).

docs/src/reference/training/optimisers.md

Lines changed: 1 addition & 31 deletions
@@ -67,36 +67,6 @@ It is possible to compose optimisers for some added flexibility.
Optimisers.OptimiserChain
```

## Scheduling Optimisers

In practice, it is fairly common to schedule the learning rate of an optimiser to obtain faster convergence. There are a variety of popular scheduling policies, and you can find implementations of them in [ParameterSchedulers.jl](http://fluxml.ai/ParameterSchedulers.jl/stable). The documentation for ParameterSchedulers.jl provides a more detailed overview of the different scheduling policies, and how to use them with Flux optimisers. Below, we provide a brief snippet illustrating a [cosine annealing](https://arxiv.org/pdf/1608.03983.pdf) schedule with a momentum optimiser.

First, we import ParameterSchedulers.jl and initialize a cosine annealing schedule to vary the learning rate between `1e-4` and `1e-2` every 10 steps. We also create a new [`Momentum`](@ref Optimisers.Momentum) optimiser.
```julia
using ParameterSchedulers

opt = Momentum()
schedule = Cos(λ0 = 1e-4, λ1 = 1e-2, period = 10)
for (eta, epoch) in zip(schedule, 1:100)
    opt.eta = eta
    # your training code here
end
```
`schedule` can also be indexed (e.g. `schedule(100)`) or iterated like any iterator in Julia.

ParameterSchedulers.jl schedules are stateless (they don't store their iteration state). If you want a _stateful_ schedule, you can use `ParameterSchedulers.Stateful`:
```julia
using ParameterSchedulers: Stateful, next!

schedule = Stateful(Cos(λ0 = 1e-4, λ1 = 1e-2, period = 10))
for epoch in 1:100
    opt.eta = next!(schedule)
    # your training code here
end
```

ParameterSchedulers.jl allows for many more scheduling policies including arbitrary functions, looping any function with a given period, or sequences of many schedules. See the ParameterSchedulers.jl documentation for more info.

## Decays

Similar to optimisers, Flux also defines some simple decays that can be used in conjunction with other optimisers, or standalone.
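
For instance, a weight decay rule can be chained in front of another optimiser. The snippet below is only an illustrative sketch (the `Dense(10 => 2)` model and the `1e-4` decay strength are made-up values):
```julia
using Flux

model = Dense(10 => 2)   # hypothetical model, just for illustration
opt_state = Flux.setup(OptimiserChain(WeightDecay(1e-4), Adam(1e-3)), model)
```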
@@ -111,7 +81,7 @@ Optimisers.WeightDecay
Gradient clipping is useful for training recurrent neural networks, which have a tendency to suffer from the exploding gradient problem. An example usage is

```julia
-opt = OptimiserChain(ClipValue(1e-3), Adam(1e-3))
+opt = OptimiserChain(ClipGrad(1e-3), Adam(1e-3))
```

```@docs
