an example of a word-level language model using FluxML #357
base: master
    grad_x = clamp!(gradient[x], -args.clip, args.clip)
    # backprop
    x .-= lr .* grad_x
In the spirit of promoting best practices, we have https://fluxml.ai/Flux.jl/stable/training/optimisers/#Flux.Optimise.ClipValue and the rest of https://fluxml.ai/Flux.jl/stable/training/optimisers/ for this. I imagine you'd want to use something more sophisticated than plain SGD anyhow.
@ToucheSir I opted out of the library methods because this explicit calculation provided better performance. That said, I agree with promoting best practices and can switch to the library methods.
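For reference, the library-based approach suggested above might look roughly like the following. This is a minimal sketch rather than the PR's code: it uses Optimisers.jl, assuming its `ClipGrad` rule plays the same role as the `Flux.Optimise.ClipValue` linked in the comment, and the values of `clip`, `lr`, `x`, and `grad_x` are made up for illustration.

```julia
using Optimisers

# Hypothetical values standing in for args.clip, lr, a parameter array and its gradient.
clip = 0.25
lr = 0.1
x = [1.0, -2.0, 3.0]
grad_x = [10.0, -0.1, -7.0]

# Clamp every gradient component to [-clip, clip], then take a plain SGD step.
rule = OptimiserChain(ClipGrad(clip), Descent(lr))
state = Optimisers.setup(rule, x)

# update! returns the new optimiser state and the updated parameters.
state, x = Optimisers.update!(state, x, grad_x)

# Each component of x has moved by at most lr * clip = 0.025.
println(x)
```

Swapping `Descent(lr)` for another rule such as `Adam()` keeps the clipping step unchanged, which is the main appeal of chaining rules instead of clamping by hand.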
    # logit cross entropy loss function
    function loss(x, y)
    Flux.reset!(model)
Would highly recommend moving this outside of the loss function (i.e. into the training loop).
@ToucheSir Can you elaborate on this? My understanding is that the loss function is called once per batch within a single training loop, so for a sequential language model such as this one we want to reset the hidden state after each batch.
I presume the suggestion is to keep it inside the `for batch in data_loader` loop, but outside the `gradient` call.
    hold_out = zip(x_train[end-5:end], y_train[end-5:end])

    # used for updating hyperparameters
    best_val_loss = nothing
Suggested change:

    - best_val_loss = nothing
    + local best_val_loss
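To illustrate what the suggested `local` declaration does, here is a small standalone sketch in plain Julia. The function name `best_loss` and the loss values are hypothetical and not taken from the PR; the point is only that `local best_val_loss` introduces the variable without giving it a placeholder value such as `nothing`, so it only ever holds a number once assigned.

```julia
# Hypothetical helper: track the smallest validation loss seen so far.
function best_loss(val_losses)
    local best_val_loss              # declared in function scope, not yet assigned
    for l in val_losses
        # @isdefined is false until the first assignment happens.
        if !@isdefined(best_val_loss) || l < best_val_loss
            best_val_loss = l        # assigns to the enclosing local
        end
    end
    return best_val_loss             # throws UndefVarError if val_losses was empty
end

println(best_loss([2.5, 1.75, 2.0]))
```

The trade-off is that reading the variable before any assignment is an error, whereas the `nothing` version needs an explicit `isnothing` check wherever the value is compared.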
Co-authored-by: Brian Chen <[email protected]>