- 
          
- 
                Notifications
    You must be signed in to change notification settings 
- Fork 332
Explicit convmixer cifar10 #374
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
| The original README does have some notes about achievable accuracy, but I don't believe anyone verified those numbers. One thing that could help is using the built-in ConvMixer model in Metalhead, because that has undergone basic convergence testing. | 
1333b75    to
    7f13463      
    Compare
  
    | function create_loss_function(dataloader, device) | ||
|  | ||
| function loss(model) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Functions which create functions seem like an anti-pattern.
| if use_cuda | ||
| device = gpu | ||
| @info "Training on GPU" | ||
| else | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This @info will lie to you, unless you edit the file to change use_cuda... simpler just to call gpu and it will do nothing if you don't have one.
| BatchNorm(dim), | ||
| [ | ||
| Chain( | ||
| SkipConnection(Chain(Conv((kernel_size,kernel_size), dim=>dim, gelu; pad=SamePad(), groups=dim), BatchNorm(dim)), +), | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No functional change to the model, just indents etc.
| opt = OptimiserChain( | ||
| WeightDecay(1f-3), | ||
| ClipNorm(1.0), | ||
| ADAM(η) | ||
| ClipNorm(1f0), | ||
| Adam(η), | ||
| ) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This exposes a real error: composing ClipNorm after another optimiser means it tries to take norm(::Broadcasted) which fails:
julia> train(epochs=10, images=128)
ERROR: Scalar indexing is disallowed.
...
Stacktrace:
...
 [12] norm(itr::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{4}, NTuple{4, Base.OneTo{Int64}}, typeof(+), Tuple{CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{4}, Nothing, typeof(*), Tuple{Float32, CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}}}}}, p::Float32)
    @ LinearAlgebra ~/julia-9ded051e9f/share/julia/stdlib/v1.10/LinearAlgebra/src/generic.jl:596
 [13] apply!(o::Optimisers.ClipNorm{Float32}, state::Nothing, x::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, dx::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{4}, NTuple{4, Base.OneTo{Int64}}, typeof(+), Tuple{CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{4}, Nothing, typeof(*), Tuple{Float32, CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}}}}})
    @ Optimisers ~/.julia/packages/Optimisers/kPdJV/src/rules.jl:584
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be fixed, but not yet tested here.
7f13463    to
    3a2580b      
    Compare
  
    |  | ||
| model = model |> cpu | ||
| @save "model.bson" model | ||
| @save "model.bson" cpu(model) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should really be updated to Flux.state too
3a2580b    to
    ca9b934      
    Compare
  
    | opt = OptimiserChain( | ||
| WeightDecay(1f-3), # L2 regularisation | ||
| ClipNorm(1f0), | ||
| Adam(3f-4), # learning rate | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Current practice would be to use AdamW and set L2 regularization there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but maybe it's good to show how to chain rules?
| for (x1, y1) in train_loader | ||
| # move one batch at a time to GPU; gpu(train_loader) would be another way | ||
| x, y = gpu(x1), gpu(y1) | ||
| grads = gradient(m -> Flux.logitcrossentropy(m(x), y; agg=sum), model) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why agg=sum?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just left this, as it probably doesn't hurt... no idea why it's there, maybe it was like that in what someone followed? Showing what the keyword is doesn't seem awful
| cpu_model = cpu(model) | ||
| BSON.@save "model.bson" cpu_model | ||
| BSON.@save "losses.bson" train_save test_save | ||
| # it's generally more robust to save just the state, like this, but | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but what? I would remove entirely the model saving with bson in favor of state saving with JLD2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reverted this because maybe it's not a terrible idea to show some variety of approaches, not repeat exactly the same thing in every model?
The way this model is set up, you'd have to copy-paste the code with the sizes etc. to re-build the model before you can load the state back in. So perhaps leaving this is OK?
| train_save[epoch, :] = [train_loss, train_acc] | ||
| test_save[epoch, :] = [test_loss, test_acc] | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove this history saving entirely? seems useless
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not a crazy way to collect some numbers, maybe this pattern makes sense to someone instead of others?
This previously didn't have a manifest.
Needs a correction to the translation rule for Optimisers.ClipNorm in Flux.fixed in FluxML/Flux.jl#2145Also, a simple test showed increasing loss, so perhaps something else is broken?