"""
    Sophia(; η = 1e-3, βs = (0.9, 0.999), ϵ = 1e-8, λ = 1e-1, k = 10, ρ = 0.04)

A second-order optimizer that incorporates diagonal Hessian information for faster convergence.

Based on the paper "Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training"
(https://arxiv.org/abs/2305.14342). Sophia uses an efficient estimate of the diagonal of the Hessian
matrix to adaptively adjust the learning rate for each parameter, achieving faster convergence than
first-order methods such as Adam and SGD while avoiding the computational cost of full second-order methods.

## Arguments

  - `η::Float64 = 1e-3`: Learning rate (step size)
  - `βs::Tuple{Float64, Float64} = (0.9, 0.999)`: Exponential decay rates for the first-moment (β₁)
    and diagonal-Hessian (β₂) estimates
  - `ϵ::Float64 = 1e-8`: Small constant for numerical stability
  - `λ::Float64 = 1e-1`: Weight decay coefficient for L2 regularization
  - `k::Integer = 10`: Frequency of Hessian diagonal estimation (every `k` iterations)
  - `ρ::Float64 = 0.04`: Clipping threshold for the update to maintain stability (see the sketch below)

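To make the roles of these hyperparameters concrete, here is a rough, hypothetical sketch of one
Sophia-style step based on the parameter descriptions above; it is not this package's internal
implementation, and `sophia_step!`, `g`, `m`, and `h` are illustrative names:

```julia
# Hypothetical sketch of one Sophia-style step (not the actual solver internals).
# `g` is the current gradient, `m` the first-moment buffer, and `h` the diagonal
# Hessian estimate, which is refreshed every `k` iterations with decay rate β₂.
function sophia_step!(θ, g, m, h; η = 1e-3, β1 = 0.9, ϵ = 1e-8, λ = 1e-1, ρ = 0.04)
    @. m = β1 * m + (1 - β1) * g                    # EMA of the gradient (first moment)
    @. θ = θ - η * λ * θ                            # decoupled weight decay
    @. θ = θ - η * clamp(m / max(h, ϵ), -ρ, ρ)      # preconditioned, elementwise-clipped step
    return θ
end
```

The elementwise clipping at `±ρ` is what keeps the step bounded when the Hessian estimate `h`
is small or noisy.
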
## Example

```julia
using Optimization, OptimizationOptimisers, Zygote

# Define the optimization problem
rosenbrock(x, p) = (1 - x[1])^2 + 100 * (x[2] - x[1]^2)^2
x0 = zeros(2)
optf = OptimizationFunction(rosenbrock, Optimization.AutoZygote())
prob = OptimizationProblem(optf, x0)

# Solve with Sophia
sol = solve(prob, Sophia(η = 0.01, k = 5); maxiters = 1000)
```


## Notes

Sophia is particularly effective for:

  - Large-scale optimization problems
  - Neural network training
  - Problems where second-order information can significantly improve convergence

The algorithm maintains computational efficiency by estimating only the diagonal of the Hessian
matrix, using a Hutchinson trace estimator with random probe vectors, making it more scalable than
full second-order methods while still leveraging curvature information.
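
As an illustration of the estimator idea (not this package's AD machinery), a Hutchinson-style
diagonal estimate can be sketched with nested ForwardDiff for the Hessian-vector products;
`hutchinson_diag` and `nsamples` are hypothetical names introduced here:

```julia
# Hypothetical sketch of a Hutchinson-style diagonal-Hessian estimate;
# the actual solver relies on the problem's AD backend instead.
using ForwardDiff, Random

function hutchinson_diag(f, x; nsamples = 10)
    est = zero(x)
    for _ in 1:nsamples
        u = rand((-1.0, 1.0), length(x))  # Rademacher probe vector
        # Exact Hessian-vector product H*u via nested forward-mode AD
        Hu = ForwardDiff.derivative(t -> ForwardDiff.gradient(f, x .+ t .* u), 0.0)
        est .+= u .* Hu                   # E[u .* (H*u)] = diag(H)
    end
    return est ./ nsamples
end

hutchinson_diag(x -> sum(abs2, x), randn(3))  # ≈ [2.0, 2.0, 2.0] since H = 2I
```

Each sample costs only a gradient-sized computation, which is why the diagonal estimate
scales to large parameter counts where forming the full Hessian would be infeasible.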
"""
struct Sophia
    η::Float64
    βs::Tuple{Float64, Float64}
    ϵ::Float64
    λ::Float64
    k::Integer
    ρ::Float64
end