- 
          
- 
                Notifications
    You must be signed in to change notification settings 
- Fork 25
Add Apollo optimizer (https://arxiv.org/pdf/2412.05270) #196
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from 10 commits
bb94d68
              acbe8e3
              d358026
              8a05289
              b46c0ef
              e39add7
              43d30c6
              b282c35
              ca2ae0a
              6aa32c1
              d9637c6
              c75142f
              b95fd3c
              97b0332
              File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | @@ -599,6 +599,136 @@ function apply!(o::AdaBelief, state, x::AbstractArray{T}, dx) where T | |||||||||||||
| return (mt, st, βt .* β), dx′ | ||||||||||||||
| end | ||||||||||||||
|  | ||||||||||||||
|  | ||||||||||||||
| """ | ||||||||||||||
| GradNormGrowthLimiter(γ = 1.1; m = 1e-3, ϵ = 1e-8, throw = true, paramscale_min = true) | ||||||||||||||
|  | ||||||||||||||
| Gradient norm growth limiter from Chen et al. (https://arxiv.org/pdf/2410.01623) and used with Apollo in Zhu et al. (https://arxiv.org/pdf/2412.05270). | ||||||||||||||
|          | ||||||||||||||
| Gradient norm growth limiter from Chen et al. (https://arxiv.org/pdf/2410.01623) and used with Apollo in Zhu et al. (https://arxiv.org/pdf/2412.05270). | |
| Gradient norm growth limiter from [Chen et al.](https://arxiv.org/abs/2410.01623) and used with Apollo in [Zhu et al.](https://arxiv.org/abs/2412.05270). | 
        
          
              
                Outdated
          
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
explain the role of gamma?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in the new version.
        
          
              
                Outdated
          
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't allow unicode field names, suggest:
| γ::Float64 | |
| m::Float64 #Min grad norm, to stop a tensor getting stuck near zero | |
| ϵ::Float64 | |
| gamma::Float64 | |
| mu::Float64 # Min grad norm, to stop a tensor getting stuck near zero | |
| epsilon::Float64 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, but changed the variable names to avoid eg. gamma.
        
          
              
                Outdated
          
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have greek-letter keyword options, nor field names -- the API should never ask the user to type these. They are used only in documentation / as local variables. Probably the first 3 should be positional.
Bikeshedding names bit, to avoid overly long things, the constructor could be:
| GradNormGrowthLimiter(γ = 1.1; m = 1e-3, ϵ = 1e-8, throw = true, paramscale_min = true) = GradNormGrowthLimiter(γ, m, ϵ, throw, paramscale_min) | |
| NormGrowLimit(γ = 1.1, m = 1e-3, ε = 1e-8; throw = true, scale = true) = NormGrowLimit(γ, m, ε, throw, scale) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went with NormGrowthCap here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need to materialize in matrix case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For everything except the whatever comes in during the "gradient type" test you don't need materialize. I wasn't 100% sure exactly what is coming in during those tests, so wasn't sure how to separate them from regular matrix/tensors. What do you suggest here?
        
          
              
                Outdated
          
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These lines allocate a lot.
Rhat isn't used?
For the rest maybe it can be something like
sum1R2 = sum(abs2, R; dims=1)  # it's already the right shape, no need for [:] & reshape(s, 1, :)?
s = @. sqrt(sum1R2) / sqrt(Rhat + ϵ)
dx′′ = @lazy η * (dx * s) + λ * x  # one fused broadcast
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got something like this working, but the @lazy breaks things, so omitted for now.
        
          
              
                Outdated
          
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| dx′′ = dx′′' | |
| dx′′ = transpose(dx′′) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, this sort of branching introduces type instability. IDK if we care but perhaps worth some thought. Maybe there's a nicer way to just store everything transposed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe an optimization we can figure out later if it becomes an issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should the default value for
mcorrespond to the original paper (i.e. m=0 i suppose)?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
m=0makes sense when this is applied to the entire model, but could be fatal when applied tensor-wise. I think it is better to have non-footgun defaults, and make it clearer that this isn't a faithful reproduction?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've kept a non-zero default, but I've tweaked the docs to clarify that this method isn't quite the same as in those papers. (I also switched the "scaling m by the number of parameters" to using
sqrt).