
Commit eb15eae

Use consistent spelling for optimise (#2203)
* Use consistent spelling for optimise
* Update NEWS.md
* Update NEWS.md

Co-authored-by: Michael Abbott <[email protected]>
1 parent: 0d83f60

File tree: 12 files changed (+33, -32 lines)


NEWS.md

Lines changed: 3 additions & 2 deletions

@@ -1,5 +1,6 @@
 # Flux Release Notes

+
 ## v0.13.14
 * Fixed various deprecation warnings, from `Zygone.@nograd` and `Vararg`.

@@ -45,7 +46,7 @@ been removed in favour of MLDatasets.jl.
 * Fixed [AlphaDropout](https://github.com/FluxML/Flux.jl/pull/1781)

 ## v0.12.8
-* Optimized inference and gradient calculation of OneHotMatrix[pr](https://github.com/FluxML/Flux.jl/pull/1756)
+* Optimised inference and gradient calculation of OneHotMatrix[pr](https://github.com/FluxML/Flux.jl/pull/1756)

 ## v0.12.7
 * Added support for [`GRUv3`](https://github.com/FluxML/Flux.jl/pull/1675)
@@ -99,7 +100,7 @@ been removed in favour of MLDatasets.jl.
 * Change to `DataLoader`'s [constructor](https://github.com/FluxML/Flux.jl/pull/1152)
 * Uniform loss [interface](https://github.com/FluxML/Flux.jl/pull/1150)
 * Loss functions now live in the `Flux.Losses` [module](https://github.com/FluxML/Flux.jl/pull/1264)
-* Optimistic ADAM (OADAM) optimizer for [adversarial training](https://github.com/FluxML/Flux.jl/pull/1246).
+* Optimistic ADAM (OADAM) optimiser for [adversarial training](https://github.com/FluxML/Flux.jl/pull/1246).
 * Add option for [same padding](https://github.com/FluxML/Flux.jl/pull/901) to conv and pooling layers by setting `pad=SamePad()`.
 * Added option to set `bias` to [Flux.Zeros](https://github.com/FluxML/Flux.jl/pull/873) to eliminating `bias` from being trained.
 * Added `GlobalMaxPool` and `GlobalMeanPool` [layers](https://github.com/FluxML/Flux.jl/pull/950) for performing global pooling operations.

docs/src/ecosystem.md

Lines changed: 1 addition & 1 deletion

@@ -99,7 +99,7 @@ Packages based on differentiable programming but not necessarily related to Mach

 Some useful and random packages!

-- [AdversarialPrediction.jl](https://github.com/rizalzaf/AdversarialPrediction.jl) provides a way to easily optimize generic performance metrics in supervised learning settings using the [Adversarial Prediction](https://arxiv.org/abs/1812.07526) framework.
+- [AdversarialPrediction.jl](https://github.com/rizalzaf/AdversarialPrediction.jl) provides a way to easily optimise generic performance metrics in supervised learning settings using the [Adversarial Prediction](https://arxiv.org/abs/1812.07526) framework.
 - [Mill.jl](https://github.com/CTUAvastLab/Mill.jl) helps to prototype flexible multi-instance learning models.
 - [MLMetrics.jl](https://github.com/JuliaML/MLMetrics.jl) is a utility for scoring models in data science and machine learning.
 - [Torch.jl](https://github.com/FluxML/Torch.jl) exposes torch in Julia.

docs/src/gpu.md

Lines changed: 3 additions & 3 deletions

@@ -138,12 +138,12 @@ In order to train the model using the GPU both model and the training data have
 1. Iterating over the batches in a [DataLoader](@ref) object transferring each one of the training batches at a time to the GPU.
 ```julia
 train_loader = Flux.DataLoader((xtrain, ytrain), batchsize = 64, shuffle = true)
-# ... model, optimizer and loss definitions
+# ... model, optimiser and loss definitions
 for epoch in 1:nepochs
 for (xtrain_batch, ytrain_batch) in train_loader
 x, y = gpu(xtrain_batch), gpu(ytrain_batch)
 gradients = gradient(() -> loss(x, y), parameters)
-Flux.Optimise.update!(optimizer, parameters, gradients)
+Flux.Optimise.update!(optimiser, parameters, gradients)
 end
 end
 ```
@@ -166,7 +166,7 @@ In order to train the model using the GPU both model and the training data have
 ```julia
 using CUDA: CuIterator
 train_loader = Flux.DataLoader((xtrain, ytrain), batchsize = 64, shuffle = true)
-# ... model, optimizer and loss definitions
+# ... model, optimiser and loss definitions
 for epoch in 1:nepochs
 for (xtrain_batch, ytrain_batch) in CuIterator(train_loader)
 # ...
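
The hunks above show only fragments of the gpu.md listing, so for context here is a minimal, self-contained sketch of the loop they touch, written against the Flux 0.13-era implicit-parameters API; the toy model, data, and hyperparameters are placeholders rather than values from the docs.

```julia
using Flux

# Placeholder data and model; `gpu` falls back to a no-op when no GPU is available.
xtrain, ytrain = rand(Float32, 10, 256), rand(Float32, 1, 256)
model = gpu(Dense(10 => 1))
parameters = Flux.params(model)
optimiser = Flux.Optimise.Descent(0.01)
loss(x, y) = Flux.Losses.mse(model(x), y)

train_loader = Flux.DataLoader((xtrain, ytrain), batchsize = 64, shuffle = true)
nepochs = 2
for epoch in 1:nepochs
    for (xtrain_batch, ytrain_batch) in train_loader
        x, y = gpu(xtrain_batch), gpu(ytrain_batch)   # move one batch at a time
        gradients = gradient(() -> loss(x, y), parameters)
        Flux.Optimise.update!(optimiser, parameters, gradients)
    end
end
```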

docs/src/models/overview.md

Lines changed: 2 additions & 2 deletions

@@ -116,7 +116,7 @@ julia> predict.bias

 The dimensions of these model parameters depend on the number of inputs and outputs.

-Flux will adjust predictions by iteratively changing these parameters according to the optimizer.
+Flux will adjust predictions by iteratively changing these parameters according to the optimiser.

 This optimiser implements the classic gradient descent strategy. Now improve the parameters of the model with a call to [`Flux.train!`](@ref) like this:

@@ -178,7 +178,7 @@ First, we gathered real-world data into the variables `x_train`, `y_train`, `x_t

 Then, we built a single input, single output predictive model, `predict = Dense(1 => 1)`. The initial predictions weren't accurate, because we had not trained the model yet.

-After building the model, we trained it with `train!(loss, predict, data, opt)`. The loss function is first, followed by the model itself, the training data, and the `Descent` optimizer provided by Flux. We ran the training step once, and observed that the parameters changed and the loss went down. Then, we ran the `train!` many times to finish the training process.
+After building the model, we trained it with `train!(loss, predict, data, opt)`. The loss function is first, followed by the model itself, the training data, and the `Descent` optimiser provided by Flux. We ran the training step once, and observed that the parameters changed and the loss went down. Then, we ran the `train!` many times to finish the training process.

 After we trained the model, we verified it with the test data to verify the results.

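
For reference, a minimal sketch of the `train!(loss, predict, data, opt)` call pattern the paragraph above describes. It assumes the explicit-model `train!` method available around Flux v0.13.14 and uses made-up toy data; the loss definition is illustrative, not quoted from overview.md.

```julia
using Flux, Statistics

x_train = Float32[0 1 2 3 4 5]         # 1×6 toy inputs (placeholder data)
y_train = Float32[2 6 10 14 18 22]     # 1×6 toy targets

predict = Dense(1 => 1)                # single input, single output model
loss(model, x, y) = mean(abs2.(model(x) .- y))

opt = Descent()                        # classic gradient descent
data = [(x_train, y_train)]

Flux.train!(loss, predict, data, opt)  # one training pass; repeat to keep improving
@show loss(predict, x_train, y_train)
```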

docs/src/saving.md

Lines changed: 2 additions & 2 deletions

@@ -129,10 +129,10 @@ revert to an older copy of the model if it starts to overfit.
 @save "model-$(now()).bson" model loss = testloss()
 ```

-Note that to resume a model's training, you might need to restore other stateful parts of your training loop. Possible examples are stateful optimizers (which usually utilize an `IdDict` to store their state), and the randomness used to partition the original data into the training and validation sets.
+Note that to resume a model's training, you might need to restore other stateful parts of your training loop. Possible examples are stateful optimisers (which usually utilize an `IdDict` to store their state), and the randomness used to partition the original data into the training and validation sets.

 You can store the optimiser state alongside the model, to resume training
-exactly where you left off. BSON is smart enough to [cache values](https://github.com/JuliaIO/BSON.jl/blob/v0.3.4/src/write.jl#L71) and insert links when saving, but only if it knows everything to be saved up front. Thus models and optimizers must be saved together to have the latter work after restoring.
+exactly where you left off. BSON is smart enough to [cache values](https://github.com/JuliaIO/BSON.jl/blob/v0.3.4/src/write.jl#L71) and insert links when saving, but only if it knows everything to be saved up front. Thus models and optimisers must be saved together to have the latter work after restoring.

 ```julia
 opt = Adam()
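
As a concrete illustration of saving the model and optimiser together, a minimal sketch using BSON.jl's `@save`/`@load` macros; the checkpoint file name and the small `Chain` are placeholders, and it assumes the old-style `Flux.Optimise` optimisers whose state lives in an `IdDict`.

```julia
using Flux
using BSON: @save, @load

model = Chain(Dense(10 => 5, relu), Dense(5 => 2))
opt = Adam()

# ... train for a while so `opt` accumulates state for `model`'s parameters ...

# Saving both in one call lets BSON cache shared values and insert links,
# so the optimiser state still refers to the restored model's arrays.
@save "model-checkpoint.bson" model opt

# Later, in a fresh session:
@load "model-checkpoint.bson" model opt
```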

docs/src/training/optimisers.md

Lines changed: 1 addition & 1 deletion

@@ -76,7 +76,7 @@ Flux.Optimise.Optimiser

 ## Scheduling Optimisers

-In practice, it is fairly common to schedule the learning rate of an optimiser to obtain faster convergence. There are a variety of popular scheduling policies, and you can find implementations of them in [ParameterSchedulers.jl](https://darsnack.github.io/ParameterSchedulers.jl/dev/README.html). The documentation for ParameterSchedulers.jl provides a more detailed overview of the different scheduling policies, and how to use them with Flux optimizers. Below, we provide a brief snippet illustrating a [cosine annealing](https://arxiv.org/pdf/1608.03983.pdf) schedule with a momentum optimiser.
+In practice, it is fairly common to schedule the learning rate of an optimiser to obtain faster convergence. There are a variety of popular scheduling policies, and you can find implementations of them in [ParameterSchedulers.jl](https://darsnack.github.io/ParameterSchedulers.jl/dev/README.html). The documentation for ParameterSchedulers.jl provides a more detailed overview of the different scheduling policies, and how to use them with Flux optimisers. Below, we provide a brief snippet illustrating a [cosine annealing](https://arxiv.org/pdf/1608.03983.pdf) schedule with a momentum optimiser.

 First, we import ParameterSchedulers.jl and initialize a cosine annealing schedule to vary the learning rate between `1e-4` and `1e-2` every 10 steps. We also create a new [`Momentum`](@ref) optimiser.
 ```julia
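
The snippet the paragraph refers to lies outside this hunk, so here is a rough sketch of how such a cosine annealing schedule is typically wired up with ParameterSchedulers.jl; the `Cosine` keyword arguments and the loop body are assumptions based on that package's documented interface, not lines from optimisers.md.

```julia
using Flux
using ParameterSchedulers

opt = Momentum()                                      # optimiser with a mutable `eta` field
schedule = Cosine(λ0 = 1e-4, λ1 = 1e-2, period = 10)  # anneal between 1e-4 and 1e-2 every 10 steps

for (eta, epoch) in zip(schedule, 1:100)
    opt.eta = eta          # update the learning rate before each epoch
    # ... run one epoch of training with `opt` here ...
end
```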

docs/src/tutorials/2021-02-07-convnet.md

Lines changed: 1 addition & 1 deletion

@@ -145,7 +145,7 @@ function train(; kws...)
 return logitcrossentropy(ŷ, y)
 end

-# Train our model with the given training set using the Adam optimizer and
+# Train our model with the given training set using the Adam optimiser and
 # printing out performance against the test set as we go.
 opt = Adam(args.lr)

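
To place the hunk in context, a small sketch of the pattern it sits in: a loss built on `logitcrossentropy` plus an `Adam` optimiser constructed from a learning-rate hyperparameter. The `Args` container and the toy model are hypothetical stand-ins for the tutorial's definitions.

```julia
using Flux
using Flux.Losses: logitcrossentropy

Base.@kwdef struct Args      # hypothetical hyperparameter container
    lr::Float64 = 3e-3
end
args = Args()

model = Chain(Dense(784 => 32, relu), Dense(32 => 10))

# Loss on raw logits, mirroring the tutorial's inner loss function
loss(x, y) = logitcrossentropy(model(x), y)

# Train our model with the given training set using the Adam optimiser
opt = Adam(args.lr)
```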

docs/src/tutorials/2021-10-08-dcgan-mnist.md

Lines changed: 3 additions & 3 deletions

@@ -206,7 +206,7 @@ The generator's loss quantifies how well it was able to trick the discriminator.
 generator_loss(fake_output) = logitbinarycrossentropy(fake_output, 1)
 ```

-We also need optimizers for our network. Why you may ask? Read more [here](https://towardsdatascience.com/overview-of-various-optimizers-in-neural-networks-17c1be2df6d5). For both the generator and discriminator, we will use the [ADAM optimizer](https://fluxml.ai/Flux.jl/stable/training/optimisers/#Flux.Optimise.ADAM).
+We also need optimisers for our network. Why you may ask? Read more [here](https://towardsdatascience.com/overview-of-various-optimisers-in-neural-networks-17c1be2df6d5). For both the generator and discriminator, we will use the [ADAM optimiser](https://fluxml.ai/Flux.jl/stable/training/optimisers/#Flux.Optimise.ADAM).

 ## Utility functions

@@ -253,7 +253,7 @@ function train_generator!(gen, disc, fake_img, opt, ps, hparams)
 end
 ```

-Now that we have defined every function we need, we integrate everything into a single `train` function where we first set up all the models and optimizers and then train the GAN for a specified number of epochs.
+Now that we have defined every function we need, we integrate everything into a single `train` function where we first set up all the models and optimisers and then train the GAN for a specified number of epochs.

 ```julia
 function train(hparams)
@@ -278,7 +278,7 @@ function train(hparams)
 disc_ps = params(disc)
 gen_ps = params(gen)

-# Initialize the ADAM optimizers for both the sub-models
+# Initialize the ADAM optimisers for both the sub-models
 # with respective learning rates
 disc_opt = ADAM(hparams.disc_lr)
 gen_opt = ADAM(hparams.gen_lr)
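
Pulling the quoted pieces together, a stripped-down sketch of the GAN setup: logit binary cross-entropy losses and one ADAM optimiser per sub-model. The `hparams` values and the tiny `gen`/`disc` networks are placeholders, not the tutorial's DCGAN architectures, and `discriminator_loss` is the usual counterpart rather than a line from this diff.

```julia
using Flux
using Flux.Losses: logitbinarycrossentropy

# Hypothetical hyperparameters and toy sub-models
hparams = (latent_dim = 100, disc_lr = 0.0002, gen_lr = 0.0002)
gen  = Chain(Dense(hparams.latent_dim => 784, tanh))
disc = Chain(Dense(784 => 1))

# The generator succeeds when the discriminator labels its fakes as real (1)
generator_loss(fake_output) = logitbinarycrossentropy(fake_output, 1)
# The discriminator wants 1 for real images and 0 for fakes
discriminator_loss(real_output, fake_output) =
    logitbinarycrossentropy(real_output, 1) + logitbinarycrossentropy(fake_output, 0)

# Implicit-style parameters and one ADAM optimiser per sub-model
disc_ps  = Flux.params(disc)
gen_ps   = Flux.params(gen)
disc_opt = ADAM(hparams.disc_lr)
gen_opt  = ADAM(hparams.gen_lr)
```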

docs/src/tutorials/2021-10-14-vanilla-gan.md

Lines changed: 3 additions & 3 deletions

@@ -38,7 +38,7 @@ at plots in a separate window, use fantastic for debugging.


 Next, let us define values for learning rate, batch size, epochs, and other
-hyper-parameters. While we are at it, we also define optimizers for the generator
+hyper-parameters. While we are at it, we also define optimisers for the generator
 and discriminator network. More on what these are later.

 ```julia
@@ -49,8 +49,8 @@ and discriminator network. More on what these are later.
 output_period = 100 # Period length for plots of generator samples
 n_features = 28 * 28# Number of pixels in each sample of the MNIST dataset
 latent_dim = 100 # Dimension of latent space
-opt_dscr = ADAM(lr_d)# Optimizer for the discriminator
-opt_gen = ADAM(lr_g) # Optimizer for the generator
+opt_dscr = ADAM(lr_d)# Optimiser for the discriminator
+opt_gen = ADAM(lr_g) # Optimiser for the generator
 ```


src/optimise/optimisers.jl

Lines changed: 10 additions & 10 deletions

@@ -45,7 +45,7 @@ end
 """
 Momentum(η = 0.01, ρ = 0.9)

-Gradient descent optimizer with learning rate `η` and momentum `ρ`.
+Gradient descent optimiser with learning rate `η` and momentum `ρ`.

 # Parameters
 - Learning rate (`η`): Amount by which gradients are discounted before updating
@@ -78,7 +78,7 @@ end
 """
 Nesterov(η = 0.001, ρ = 0.9)

-Gradient descent optimizer with learning rate `η` and Nesterov momentum `ρ`.
+Gradient descent optimiser with learning rate `η` and Nesterov momentum `ρ`.

 # Parameters
 - Learning rate (`η`): Amount by which gradients are discounted before updating
@@ -191,7 +191,7 @@ end
 """
 RAdam(η = 0.001, β::Tuple = (0.9, 0.999), ϵ = $EPS)

-[Rectified Adam](https://arxiv.org/abs/1908.03265) optimizer.
+[Rectified Adam](https://arxiv.org/abs/1908.03265) optimiser.

 # Parameters
 - Learning rate (`η`): Amount by which gradients are discounted before updating
@@ -328,7 +328,7 @@ end
 """
 AdaGrad(η = 0.1, ϵ = $EPS)

-[AdaGrad](http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf) optimizer. It has
+[AdaGrad](http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf) optimiser. It has
 parameter specific learning rates based on how frequently it is updated.
 Parameters don't need tuning.

@@ -540,7 +540,7 @@ function apply!(o::AdaBelief, x, Δ)
 #= st is a variance and can go to zero. This is in contrast to Adam, which uses the
 second moment which is usually far enough from zero. This is problematic, since st
 can be slightly negative due to numerical error, and the square root below will fail.
-Also, if we want to differentiate through the optimizer, √0 is not differentiable.
+Also, if we want to differentiate through the optimiser, √0 is not differentiable.
 To protect against this, we add a small number, st -> st + eps2.
 The original implementation (https://github.com/juntang-zhuang/Adabelief-Optimizer)
 uses the square of Adam's epsilon, which we do here.
@@ -556,7 +556,7 @@ function apply!(o::AdaBelief, x, Δ)
 end


-# Compose optimizers
+# Compose optimisers

 """
 Optimiser(a, b, c...)
@@ -598,7 +598,7 @@ for more general scheduling techniques.

 # Examples

-`InvDecay` is typically composed with other optimizers
+`InvDecay` is typically composed with other optimisers
 as the last transformation of the gradient:

 ```julia
@@ -643,13 +643,13 @@ for more general scheduling techniques.

 # Examples

-`ExpDecay` is typically composed with other optimizers
+`ExpDecay` is typically composed with other optimisers
 as the last transformation of the gradient:
 ```julia
 opt = Optimiser(Adam(), ExpDecay(1.0))
 ```
 Note: you may want to start with `η=1` in `ExpDecay` when combined with other
-optimizers (`Adam` in this case) that have their own learning rate.
+optimisers (`Adam` in this case) that have their own learning rate.
 """
 mutable struct ExpDecay <: AbstractOptimiser
 eta::Float64
@@ -677,7 +677,7 @@ end
 WeightDecay(λ = 0)

 Decay weights by ``λ``.
-Typically composed with other optimizers as the first transformation to the gradient,
+Typically composed with other optimisers as the first transformation to the gradient,
 making it equivalent to adding ``L_2`` regularization
 with coefficient ``λ`` to the loss.
