1094 | 1094 | "href": "usage/stochastic-gradient-samplers/index.html#current-capabilities",
1095 | 1095 | "title": "Stochastic Gradient Samplers",
1096 | 1096 | "section": "Current Capabilities",
1097 |
| - "text": "Current Capabilities\nThe current implementation in Turing.jl is primarily useful for: - Research purposes: Studying stochastic gradient MCMC methods - Streaming data: When data arrives continuously - Experimental applications: Testing stochastic sampling approaches\nImportant: The current implementation computes full gradients with added stochastic noise rather than true mini-batch stochastic gradients. This means these samplers don’t currently provide the computational benefits typically associated with stochastic gradient methods for large datasets. They require very careful hyperparameter tuning and often perform slower than standard samplers like HMC or NUTS for most practical applications.", |
| 1097 | + "text": "Current Capabilities\nThe current implementation in Turing.jl is primarily useful for: - Research purposes: Studying stochastic gradient MCMC methods - Educational purposes: Understanding stochastic gradient MCMC algorithms - Streaming data: When data arrives continuously (with careful tuning) - Experimental applications: Testing stochastic sampling approaches\nImportant: The current implementation computes full gradients with added stochastic noise rather than true mini-batch stochastic gradients. This means these samplers don’t currently provide the computational benefits typically associated with stochastic gradient methods for large datasets. They require very careful hyperparameter tuning and often perform slower than standard samplers like HMC or NUTS for most practical applications.\nFuture Development: These stochastic gradient samplers are being migrated to AdvancedHMC.jl for better maintenance and development. Once migration is complete, Turing.jl will support AbstractMCMC-compatible algorithms, and users requiring research-grade stochastic gradient algorithms will be directed to AdvancedHMC.", |
1098 | 1098 | "crumbs": [
1099 | 1099 | "Get Started",
1100 | 1100 | "User Guide",
1154 | 1154 | "href": "usage/stochastic-gradient-samplers/index.html#bayesian-linear-regression-example",
1155 | 1155 | "title": "Stochastic Gradient Samplers",
1156 | 1156 | "section": "Bayesian Linear Regression Example",
1157 |
| - "text": "Bayesian Linear Regression Example\nHere’s a more complex example using Bayesian linear regression:\n\n# Generate regression data\nn_features = 3\nn_samples = 100\nX = randn(n_samples, n_features)\ntrue_β = [0.5, -1.2, 2.1]\ntrue_σ_noise = 0.3\ny = X * true_β + true_σ_noise * randn(n_samples)\n\n@model function linear_regression(X, y)\n n_features = size(X, 2)\n \n # Priors\n β ~ MvNormal(zeros(n_features), 3 * I)\n σ ~ truncated(Normal(0, 1); lower=0)\n \n # Likelihood\n y ~ MvNormal(X * β, σ^2 * I)\nend\n\nlr_model = linear_regression(X, y)\n\nDynamicPPL.Model{typeof(linear_regression), (:X, :y), (), (), Tuple{Matrix{Float64}, Vector{Float64}}, Tuple{}, DynamicPPL.DefaultContext}(linear_regression, (X = [-0.08993884887496832 1.2694180094557772 -0.45068406344161077; -0.23528025045836815 -1.0348870573833149 -1.2512585407119565; … ; -0.5815563239702138 -0.19790550383157401 -0.7201291845682822; 0.29678442882680006 0.6426754256642815 -0.8729317283503407], y = [-2.5001125493734633, -1.3582233483639436, -3.8825717018806856, -0.2345200635330288, 1.4937176261849854, 2.8659122069995644, -0.5833355856450775, 4.642283548210101, 0.14909888834210028, 1.3335900592696839 … 5.9741160301704, -1.5777125963436005, 3.9896734979440236, -1.0204264890982526, -1.6606828145645047, 1.76720805427176, -0.20620159329470383, -1.9121131995245513, -0.9431065705584871, -2.3648743995748114]), NamedTuple(), DynamicPPL.DefaultContext())\n\n\nSample using the stochastic gradient methods:\n\n# Very conservative parameters for stability\nsgld_lr_stepsize = Turing.PolynomialStepsize(0.00005, 10000, 0.55)\nchain_lr_sgld = sample(lr_model, SGLD(stepsize=sgld_lr_stepsize), 5000)\n\nchain_lr_sghmc = sample(lr_model, SGHMC(learning_rate=0.00005, momentum_decay=0.1), 5000)\n\nchain_lr_hmc = sample(lr_model, HMC(0.01, 10), 1000)\n\nChains MCMC chain (1000×14×1 Array{Float64, 3}):\n\nIterations = 1:1:1000\nNumber of chains = 1\nSamples per chain = 1000\nWall duration = 1.52 seconds\nCompute duration = 1.52 seconds\nparameters = β[1], β[2], β[3], σ\ninternals = lp, n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, numerical_error, step_size, nom_step_size\n\nUse `describe(chains)` for summary statistics and quantiles.\n\n\nCompare the results to evaluate the performance of stochastic gradient samplers on a more complex model:\n\nprintln(\"True β values: \", true_β)\nprintln(\"True σ value: \", true_σ_noise)\nprintln()\n\nprintln(\"SGLD estimates:\")\nsummarystats(chain_lr_sgld)\n\nTrue β values: [0.5, -1.2, 2.1]\nTrue σ value: 0.3\n\nSGLD estimates:\n\n\n\nSummary Statistics\n parameters mean std mcse ess_bulk ess_tail rhat e ⋯\n Symbol Float64 Float64 Float64 Float64 Float64 Float64 ⋯\n\n β[1] 1.4193 0.0092 0.0026 13.7923 33.1792 1.1167 ⋯\n β[2] -0.1851 0.0195 0.0060 11.5311 25.2985 1.4079 ⋯\n β[3] -0.1172 0.0220 0.0068 10.8609 20.7479 1.9659 ⋯\n σ 1.4488 0.0931 0.0291 10.6692 19.4335 2.0535 ⋯\n 1 column omitted\n\n\n\n\nThe linear regression example demonstrates that stochastic gradient samplers can recover the true parameters, but: - They require significantly longer chains (5000 vs 1000 for HMC) - The estimates may have higher variance - Convergence diagnostics should be carefully examined before trusting the results", |
| 1157 | + "text": "Bayesian Linear Regression Example\nHere’s a more complex example using Bayesian linear regression:\n\n# Generate regression data\nn_features = 3\nn_samples = 100\nX = randn(n_samples, n_features)\ntrue_β = [0.5, -1.2, 2.1]\ntrue_σ_noise = 0.3\ny = X * true_β + true_σ_noise * randn(n_samples)\n\n@model function linear_regression(X, y)\n n_features = size(X, 2)\n \n # Priors\n β ~ MvNormal(zeros(n_features), 3 * I)\n σ ~ truncated(Normal(0, 1); lower=0)\n \n # Likelihood\n y ~ MvNormal(X * β, σ^2 * I)\nend\n\nlr_model = linear_regression(X, y)\n\nDynamicPPL.Model{typeof(linear_regression), (:X, :y), (), (), Tuple{Matrix{Float64}, Vector{Float64}}, Tuple{}, DynamicPPL.DefaultContext}(linear_regression, (X = [-0.08993884887496832 1.2694180094557772 -0.45068406344161077; -0.23528025045836815 -1.0348870573833149 -1.2512585407119565; … ; -0.5815563239702138 -0.19790550383157401 -0.7201291845682822; 0.29678442882680006 0.6426754256642815 -0.8729317283503407], y = [-2.5001125493734633, -1.3582233483639436, -3.8825717018806856, -0.2345200635330288, 1.4937176261849854, 2.8659122069995644, -0.5833355856450775, 4.642283548210101, 0.14909888834210028, 1.3335900592696839 … 5.9741160301704, -1.5777125963436005, 3.9896734979440236, -1.0204264890982526, -1.6606828145645047, 1.76720805427176, -0.20620159329470383, -1.9121131995245513, -0.9431065705584871, -2.3648743995748114]), NamedTuple(), DynamicPPL.DefaultContext())\n\n\nSample using the stochastic gradient methods:\n\n# Very conservative parameters for stability\nsgld_lr_stepsize = Turing.PolynomialStepsize(0.00005, 10000, 0.55)\nchain_lr_sgld = sample(lr_model, SGLD(stepsize=sgld_lr_stepsize), 5000)\n\nchain_lr_sghmc = sample(lr_model, SGHMC(learning_rate=0.00005, momentum_decay=0.1), 5000)\n\nchain_lr_hmc = sample(lr_model, HMC(0.01, 10), 1000)\n\nChains MCMC chain (1000×14×1 Array{Float64, 3}):\n\nIterations = 1:1:1000\nNumber of chains = 1\nSamples per chain = 1000\nWall duration = 1.54 seconds\nCompute duration = 1.54 seconds\nparameters = β[1], β[2], β[3], σ\ninternals = lp, n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, numerical_error, step_size, nom_step_size\n\nUse `describe(chains)` for summary statistics and quantiles.\n\n\nCompare the results to evaluate the performance of stochastic gradient samplers on a more complex model:\n\nprintln(\"True β values: \", true_β)\nprintln(\"True σ value: \", true_σ_noise)\nprintln()\n\nprintln(\"SGLD estimates:\")\nsummarystats(chain_lr_sgld)\n\nTrue β values: [0.5, -1.2, 2.1]\nTrue σ value: 0.3\n\nSGLD estimates:\n\n\n\nSummary Statistics\n parameters mean std mcse ess_bulk ess_tail rhat e ⋯\n Symbol Float64 Float64 Float64 Float64 Float64 Float64 ⋯\n\n β[1] 1.4193 0.0092 0.0026 13.7923 33.1792 1.1167 ⋯\n β[2] -0.1851 0.0195 0.0060 11.5311 25.2985 1.4079 ⋯\n β[3] -0.1172 0.0220 0.0068 10.8609 20.7479 1.9659 ⋯\n σ 1.4488 0.0931 0.0291 10.6692 19.4335 2.0535 ⋯\n 1 column omitted\n\n\n\n\nThe linear regression example demonstrates that stochastic gradient samplers can recover the true parameters, but: - They require significantly longer chains (5000 vs 1000 for HMC) - The estimates may have higher variance - Convergence diagnostics should be carefully examined before trusting the results", |
1158 | 1158 | "crumbs": [
1159 | 1159 | "Get Started",
1160 | 1160 | "User Guide",
1178 | 1178 | "href": "usage/stochastic-gradient-samplers/index.html#best-practices-and-recommendations",
1179 | 1179 | "title": "Stochastic Gradient Samplers",
1180 | 1180 | "section": "Best Practices and Recommendations",
1181 |
| - "text": "Best Practices and Recommendations\n\nWhen to Consider Stochastic Gradient Samplers\n\nStreaming data: When data arrives continuously and you need online inference\nResearch: For studying stochastic gradient MCMC methods\nEducational purposes: For understanding stochastic gradient MCMC algorithms\n\n\n\nCritical Hyperparameters\nFor SGLD: - Use PolynomialStepsize with very small initial values (≤ 0.0001) - Larger b values in PolynomialStepsize(a, b, γ) provide more stability - The stepsize decreases as a / (b + t)^γ\nFor SGHMC: - Use extremely small learning rates (≤ 0.00001) - Momentum decay (friction) typically between 0.1-0.5 - Higher momentum decay improves stability but slows convergence\n\n\nCurrent Limitations\n\nNo mini-batching: Full gradients are computed despite “stochastic” name\nHyperparameter sensitivity: Requires extensive tuning\nComputational overhead: Often slower than HMC/NUTS for small-medium datasets\nConvergence: Typically requires longer chains\n\n\n\nGeneral Recommendations\n\nStart conservatively: Use very small step sizes initially\nMonitor convergence: Check trace plots and diagnostics carefully\n\nCompare with HMC/NUTS: Validate results when possible\nConsider alternatives: For most applications, HMC or NUTS will be more efficient", |
| 1181 | + "text": "Best Practices and Recommendations\n\nWhen to Consider Stochastic Gradient Samplers\n\nStreaming data: When data arrives continuously and you need online inference\nResearch: For studying stochastic gradient MCMC methods\nEducational purposes: For understanding stochastic gradient MCMC algorithms\n\n\n\nCritical Hyperparameters\nFor SGLD: - Use PolynomialStepsize with very small initial values (≤ 0.0001) - Larger b values in PolynomialStepsize(a, b, γ) provide more stability - The stepsize decreases as a / (b + t)^γ - Recommended starting point: PolynomialStepsize(0.0001, 10000, 0.55) - For unstable models: Reduce a to 0.00001 or increase b to 50000\nFor SGHMC: - Use extremely small learning rates (≤ 0.00001) - Momentum decay (friction) typically between 0.1-0.5 - Higher momentum decay improves stability but slows convergence - Recommended starting point: learning_rate=0.00001, momentum_decay=0.1 - For high-dimensional problems: Increase momentum_decay to 0.3-0.5\nTuning Strategy: 1. Start with recommended values and run a short chain (500-1000 samples) 2. If chains diverge or parameters explode, reduce step size by factor of 10 3. If mixing is too slow, carefully increase step size by factor of 2 4. Always validate against HMC/NUTS results when possible\n\n\nCurrent Limitations\n\nNo mini-batching: Full gradients are computed despite “stochastic” name\nHyperparameter sensitivity: Requires extensive tuning\nComputational overhead: Often slower than HMC/NUTS for small-medium datasets\nConvergence: Typically requires longer chains\n\n\n\nConvergence Diagnostics\nDue to the high variance and slow convergence of stochastic gradient samplers, careful diagnostics are essential:\n\nVisual inspection: Always check trace plots for all parameters\nEffective sample size (ESS): Expect lower ESS than HMC/NUTS\nR-hat values: Should be < 1.01 for all parameters\nLong chains: Often need 5,000-10,000+ samples for convergence\nMultiple chains: Run multiple chains with different initializations to verify convergence\n\n\n\nGeneral Recommendations\n\nStart conservatively: Use very small step sizes initially\nMonitor convergence: Check trace plots and diagnostics carefully\n\nIncrease samples if needed: Don’t hesitate to use 10,000+ samples if convergence is poor\nCompare with HMC/NUTS: Validate results when possible\nConsider alternatives: For most applications, HMC or NUTS will be more efficient", |
1182 | 1182 | "crumbs": [
1183 | 1183 | "Get Started",
1184 | 1184 | "User Guide",