Commit 5c2a226

Merge pull request #998 from ChrisRackauckas-Claude/add-sophia-docstring
Add comprehensive docstring for Sophia optimizer
2 parents 2a90ec0 + 1a2f3a6 commit 5c2a226

2 files changed: +49 -18 lines changed

docs/src/optimization_packages/optimization.md

Lines changed: 3 additions & 18 deletions
@@ -8,24 +8,9 @@ There are some solvers that are available in the Optimization.jl package directly
 
     This can also handle arbitrary non-linear constraints through a Augmented Lagrangian method with bounds constraints described in 17.4 of Numerical Optimization by Nocedal and Wright. Thus serving as a general-purpose nonlinear optimization solver available directly in Optimization.jl.
 
-  - `Sophia`: Based on the recent paper https://arxiv.org/abs/2305.14342. It incorporates second order information in the form of the diagonal of the Hessian matrix hence avoiding the need to compute the complete hessian. It has been shown to converge faster than other first order methods such as Adam and SGD.
-
-    + `solve(problem, Sophia(; η, βs, ϵ, λ, k, ρ))`
-
-    + `η` is the learning rate
-    + `βs` are the decay of momentums
-    + `ϵ` is the epsilon value
-    + `λ` is the weight decay parameter
-    + `k` is the number of iterations to re-compute the diagonal of the Hessian matrix
-    + `ρ` is the momentum
-    + Defaults:
-
-      * `η = 0.001`
-      * `βs = (0.9, 0.999)`
-      * `ϵ = 1e-8`
-      * `λ = 0.1`
-      * `k = 10`
-      * `ρ = 0.04`
+```@docs
+Sophia
+```
 
 ## Examples
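
As a quick, hedged illustration of the interface that the removed prose documented (and that the `@docs` block now renders from the docstring), the call below spells out every keyword with its documented default. It is a sketch only: it assumes `Sophia` is exported by Optimization.jl and that Zygote is loaded for `AutoZygote`, mirroring the docstring example rather than quoting package code.

```julia
using Optimization, Zygote

# Rosenbrock test problem, as in the docstring example further below
rosenbrock(x, p) = (1 - x[1])^2 + 100 * (x[2] - x[1]^2)^2
optf = OptimizationFunction(rosenbrock, Optimization.AutoZygote())
prob = OptimizationProblem(optf, zeros(2))

# Every keyword written out with its documented default value
sol = solve(prob,
    Sophia(; η = 1e-3, βs = (0.9, 0.999), ϵ = 1e-8, λ = 1e-1, k = 10, ρ = 0.04))
```

Per the docstring, leaving the keywords off entirely (`Sophia()`) gives the same configuration.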

src/sophia.jl

Lines changed: 46 additions & 0 deletions
@@ -1,3 +1,49 @@
+"""
+    Sophia(; η = 1e-3, βs = (0.9, 0.999), ϵ = 1e-8, λ = 1e-1, k = 10, ρ = 0.04)
+
+A second-order optimizer that incorporates diagonal Hessian information for faster convergence.
+
+Based on the paper "Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training"
+(https://arxiv.org/abs/2305.14342). Sophia uses an efficient estimate of the diagonal of the Hessian
+matrix to adaptively adjust the learning rate for each parameter, achieving faster convergence than
+first-order methods like Adam and SGD while avoiding the computational cost of full second-order methods.
+
+## Arguments
+
+  - `η::Float64 = 1e-3`: Learning rate (step size)
+  - `βs::Tuple{Float64, Float64} = (0.9, 0.999)`: Exponential decay rates for the first moment (β₁)
+    and diagonal Hessian (β₂) estimates
+  - `ϵ::Float64 = 1e-8`: Small constant for numerical stability
+  - `λ::Float64 = 1e-1`: Weight decay coefficient for L2 regularization
+  - `k::Integer = 10`: Frequency of Hessian diagonal estimation (every k iterations)
+  - `ρ::Float64 = 0.04`: Clipping threshold for the update to maintain stability
+
+## Example
+
+```julia
+using Optimization, OptimizationOptimisers
+
+# Define optimization problem
+rosenbrock(x, p) = (1 - x[1])^2 + 100 * (x[2] - x[1]^2)^2
+x0 = zeros(2)
+optf = OptimizationFunction(rosenbrock, Optimization.AutoZygote())
+prob = OptimizationProblem(optf, x0)
+
+# Solve with Sophia
+sol = solve(prob, Sophia(η=0.01, k=5))
+```
+
+## Notes
+
+Sophia is particularly effective for:
+  - Large-scale optimization problems
+  - Neural network training
+  - Problems where second-order information can significantly improve convergence
+
+The algorithm maintains computational efficiency by only estimating the diagonal of the Hessian
+matrix using a Hutchinson trace estimator with random vectors, making it more scalable than
+full second-order methods while still leveraging curvature information.
+"""
 struct Sophia
     η::Float64
     βs::Tuple{Float64, Float64}
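
To connect the parameters in the docstring to the update they control, here is a rough, self-contained sketch of a Sophia-style step. This is not the code in src/sophia.jl: the helper names `grad` and `hvp` (Hessian-vector product), the Gaussian probe vector, and the exact placement of the clipping are illustrative assumptions drawn from the docstring and the referenced paper.

```julia
# Sketch of a Sophia-style loop for θ::Vector{Float64}; `grad(θ)` returns the
# gradient and `hvp(θ, u)` the Hessian-vector product — both supplied by the caller.
function sophia_step_sketch(grad, hvp, θ; η = 1e-3, βs = (0.9, 0.999), ϵ = 1e-8,
        λ = 1e-1, k = 10, ρ = 0.04, maxiters = 1000)
    β₁, β₂ = βs
    m = zero(θ)              # first-moment (momentum) estimate
    h = zero(θ)              # running diagonal-Hessian estimate
    for t in 1:maxiters
        g = grad(θ)
        m = β₁ .* m .+ (1 - β₁) .* g
        if (t - 1) % k == 0
            # Hutchinson probe: E[u .* (H * u)] equals diag(H) for a random u
            u = randn(length(θ))
            h = β₂ .* h .+ (1 - β₂) .* (u .* hvp(θ, u))
        end
        θ = θ .- η * λ .* θ                               # decoupled weight decay
        θ = θ .- η .* clamp.(m ./ max.(h, ϵ), -ρ, ρ)      # clipped, preconditioned step
    end
    return θ
end
```

The curvature branch is what keeps the method cheap: `u .* hvp(θ, u)` is a Hutchinson-style unbiased estimate of `diag(H)`, refreshed only every `k` iterations, while the element-wise `clamp` bounds each coordinate's step at `η * ρ`. As a toy check, `sophia_step_sketch(x -> 2x, (x, u) -> 2u, [1.0, 2.0])` (gradient and Hessian of the quadratic `sum(abs2, x)`) walks the iterate toward zero.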
