Add documentation for Stochastic Gradient Samplers #629
base: main
Conversation
AoifeHughes commented Aug 4, 2025
- Docs on SGHMC / SGLD? (Turing.jl#2270): this PR adds docs to support it
Preview the changes: https://turinglang.org/docs/pr-previews/629
I can comment on the clarity of explanations, but I can't comment on some of the content, most importantly the Summary section, because I know nothing about these samplers. E.g. the recommendations for hyperparameters, I have no idea about them. @yebai, who would be a good reviewer for that?
# Define a simple Gaussian model
@model function gaussian_model(x)
    μ ~ Normal(0, 10)
    σ ~ truncated(Normal(0, 5), 0, Inf)
Suggested change:
`σ ~ truncated(Normal(0, 5), 0, Inf)` becomes
`σ ~ truncated(Normal(0, 5); lower=0)`
The `Inf` version causes trouble with AD, see JuliaStats/Distributions.jl#1910. We are trying to guide users towards the kwargs `lower` and `upper`.
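For context, a minimal sketch of the whole model with the keyword form applied (the likelihood loop is my assumption, since the diff only shows the priors):

```julia
using Turing, Distributions

@model function gaussian_model(x)
    μ ~ Normal(0, 10)
    # keyword bounds avoid an explicit Inf endpoint, which the linked issue reports as an AD problem
    σ ~ truncated(Normal(0, 5); lower=0)
    for i in eachindex(x)
        x[i] ~ Normal(μ, σ)
    end
end
```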
```{julia}
#| output: false
setprogress!(false)
```
This needs to be moved up, or replaced with `progress=false` in the `sample` call. Currently the cell above still produces loads of lines of progress output that don't render nicely: https://turinglang.org/docs/pr-previews/629/usage/stochastic-gradient-samplers/
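If moving the cell up is awkward, a sketch of the per-call alternative (the sampler settings here are illustrative defaults, not the page's values):

```julia
# progress is a standard keyword of sample, so the global setprogress! call isn't needed
chain_sgld = sample(model, SGLD(), 10_000; progress=false)
```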
```{julia}
plot(chain_sgld)
```
The results on https://turinglang.org/docs/pr-previews/629/usage/stochastic-gradient-samplers/ don't look convincing to me; it looks like sampling hasn't converged. Can we increase the sample counts without it taking too long? Or it could be a problem with some hyperparameters, I wouldn't know.
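A sketch of the kind of check I mean, in case it helps (the sample count is only a guess at something that still renders quickly, not a recommendation):

```julia
# draw more samples and look at ess / rhat rather than only the trace plots
chain_sgld = sample(model, SGLD(), 50_000; progress=false)
summarystats(chain_sgld)
```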
```{julia}
plot(chain_sghmc)
```
Same thing for these results.
summarystats(chain_hmc)
Compare the trace plots:
Could we comment on the conclusions from this, what do we learn from this comparison? Also, the first trace plot looks weird.
### When to Use Stochastic Gradient Samplers

- **Large datasets**: When full gradient computation is prohibitively expensive
Isn't this in contradiction with the statement below that with Turing full gradients are computed anyway, and noise is added?
Pkg.instantiate();
Turing.jl provides stochastic gradient-based MCMC samplers that are designed for large-scale datasets where computing full gradients is computationally expensive. The two main stochastic gradient samplers are **Stochastic Gradient Langevin Dynamics (SGLD)** and **Stochastic Gradient Hamiltonian Monte Carlo (SGHMC)**. |
The first sentence seems to be immediately undermined by the next paragraph, which says that you can't actually use them for this purpose. Maybe better to lead with what they are currently useful for and then comment on possible future uses if we ever get to implementing these better, rather than the other way around.
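For readers of this thread, a minimal sketch of how the two samplers mentioned above are constructed, as I understand the current Turing API (the numeric values are illustrative only):

```julia
using Turing

# SGLD with a polynomially decaying stepsize of the form a / (b + t)^γ
sgld = SGLD(; stepsize=Turing.PolynomialStepsize(0.01))

# SGHMC with a small learning rate and momentum decay
sghmc = SGHMC(; learning_rate=0.0001, momentum_decay=0.1)
```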
@@ -0,0 +1,219 @@
--- |
This is a general comment, not related to the line it's attached to: the navigation bar on the left needs a new link to this page; I think currently there's no way to navigate to it without knowing the URL.
model = gaussian_model(data)
SGLD requires very small step sizes to ensure stability. We use a `PolynomialStepsize` that decreases over time: |
Do we have other options for `stepsize` in Turing, other than `PolynomialStepsize`?
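To make the schedule concrete, a sketch of how I read it: `PolynomialStepsize(a, b, γ)` should give step sizes of roughly `a / (b + t)^γ` at step `t` (the constants below are illustrative, not tuned values):

```julia
# the step size shrinks over time, so later iterations take smaller steps
stepsize = Turing.PolynomialStepsize(0.0001, 10_000, 0.55)
chain_sgld = sample(model, SGLD(; stepsize=stepsize), 10_000; progress=false)
```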
## Automatic Differentiation Backends

Both samplers support different AD backends:
This could link to the AD page in our docs for more information.
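A sketch of what a backend-selection example might look like, assuming the samplers accept the usual `adtype` keyword and that the ADTypes selectors are available after `using Turing` (worth double-checking against the AD page):

```julia
using Turing

# forward-mode AD for SGLD, reverse-mode AD for SGHMC; both are just examples
sgld_forward = SGLD(; adtype=AutoForwardDiff())
sghmc_reverse = SGHMC(; learning_rate=0.0001, momentum_decay=0.1, adtype=AutoReverseDiff())
```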