diff --git a/.github/workflows/ContinuousIntegration.yml b/.github/workflows/ContinuousIntegration.yml index 399440f70..f33e76bf2 100644 --- a/.github/workflows/ContinuousIntegration.yml +++ b/.github/workflows/ContinuousIntegration.yml @@ -25,7 +25,7 @@ jobs: strategy: fail-fast: false matrix: - julia-version: ['1', '1.6', 'nightly'] + julia-version: ['1', '1.10', '1.6', 'nightly'] julia-arch: [x64] os: [ubuntu-latest] steps: diff --git a/.github/workflows/Documentation.yml b/.github/workflows/Documentation.yml index ef66d34a5..fa490c892 100644 --- a/.github/workflows/Documentation.yml +++ b/.github/workflows/Documentation.yml @@ -19,7 +19,7 @@ jobs: - uses: actions/checkout@v2 - uses: julia-actions/setup-julia@v1 with: - version: '1.6' + version: '1.10' - name: Install dependencies run: julia --project=docs/ -e 'using Pkg; Pkg.develop(PackageSpec(path=pwd())); Pkg.instantiate()' - name: Build and deploy documentation diff --git a/docs/Project.toml b/docs/Project.toml index 6dfe81a83..4b5a5c8e3 100644 --- a/docs/Project.toml +++ b/docs/Project.toml @@ -2,6 +2,7 @@ Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4" Gen = "ea4f424c-a589-11e8-07c0-fd5c91b9da4a" Plots = "91a5bcdd-55d7-5caf-9e0b-520d859cae80" +StatsBase = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91" [compat] Documenter = "1" diff --git a/docs/pages.jl b/docs/pages.jl index 3a0962031..32bd58078 100644 --- a/docs/pages.jl +++ b/docs/pages.jl @@ -3,6 +3,8 @@ pages = [ "Tutorials" => [ "Getting Started" => "tutorials/getting_started.md", "Introduction to Modeling in Gen" => "tutorials/modeling_in_gen.md", + "Basics of MCMC and MAP Inference" => "tutorials/mcmc_map.md", + "Debugging Models with Enumeration" => "tutorials/enumerative.md", "Object Tracking with SMC" => "tutorials/smc.md", "Variational Inference in Gen" => "tutorials/vi.md", "Learning Generative Functions" => "tutorials/learning_gen_fns.md", diff --git a/docs/src/assets/example-inference.png b/docs/src/assets/example-inference.png new file mode 100644 index 000000000..e00b3e580 Binary files /dev/null and b/docs/src/assets/example-inference.png differ diff --git a/docs/src/index.md b/docs/src/index.md index 6d5e17494..c07304950 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -46,7 +46,10 @@ Learn how to use Gen by following these tutorials: Pages = [ "tutorials/getting_started.md", "tutorials/modeling_in_gen.md", + "tutorials/mcmc_map.md", + "tutorials/enumerative.md", "tutorials/smc.md", + "tutorials/vi.md", "tutorials/scaling_with_sml.md", "tutorials/learning_gen_fns.md" ] diff --git a/docs/src/tutorials/enumerative.md b/docs/src/tutorials/enumerative.md new file mode 100644 index 000000000..79dc6b103 --- /dev/null +++ b/docs/src/tutorials/enumerative.md @@ -0,0 +1,892 @@ +# [Debugging Models with Enumerative Inference](@id enumerative_tutorial) + +When working with probabilistic models, we often rely on Monte Carlo methods like importance sampling (as introduced in [Introduction to Modeling](@ref modeling_tutorial)) and MCMC (as introduced in [Basics of MCMC](@ref mcmc_map_tutorial)) to approximate posterior distributions. But how can we tell if these approximations are actually working correctly? Sometimes poor inference results stem from bugs in our inference algorithms, while other times they reveal fundamental issues with our model specification. + +This tutorial introduces *enumerative inference* as a debugging tool. 
Unlike the sampling-based methods we've seen in previous tutorials, which draw samples that are approximately distributed according to the posterior distribution over latent values, enumerative inference systematically evaluates the posterior probability of every value in the latent space (or a discretized version of this space).

When the latent space isn't too large (e.g. not too many dimensions), this approach can compute a "gold standard" posterior approximation that other methods can be compared against, helping us distinguish between inference failures and model misspecification. Enumerative inference is often slower than a well-tuned Monte Carlo algorithm (since it may enumerate over regions with very low probability), but having a gold-standard posterior allows us to check that faster and more efficient algorithms are working correctly.

```@setup enumerative_tutorial
using Gen, Plots, StatsBase, Printf
import Random
Random.seed!(0)
```

## Enumeration for Discrete Models

Let's start with a simple example where enumeration can be used to perform *exact inference*, computing the exact posterior probability for every possible combination of discrete latent variables. We'll build a robust Bayesian linear regression model, but unlike the continuous model from the MCMC tutorial, we'll use discrete priors for all latent variables.

```@example enumerative_tutorial
@gen function discrete_regression(xs::Vector{<:Real})
    # Discrete priors for slope and intercept
    slope ~ uniform_discrete(-2, 2)     # Slopes: -2, -1, 0, 1, 2
    intercept ~ uniform_discrete(-2, 2) # Intercepts: -2, -1, 0, 1, 2

    # Sample outlier classification and y value for each x value
    n = length(xs)
    ys = Float64[]
    for i = 1:n
        # Prior on outlier probability
        is_outlier = {:data => i => :is_outlier} ~ bernoulli(0.1)

        if is_outlier
            # Outliers have large noise
            y = {:data => i => :y} ~ normal(0., 5.)
        else
            # Inliers follow the linear relationship, with low noise
            y_mean = slope * xs[i] + intercept
            y = {:data => i => :y} ~ normal(y_mean, 1.)
        end
        push!(ys, y)
    end

    return ys
end
nothing # hide
```

Let's generate some synthetic data with a true slope of 1 and intercept of 0:

```@example enumerative_tutorial
# Generate synthetic data
true_slope = 1
true_intercept = 0
xs = [-2., -1., 0., 1., 2.]
ys = true_slope .* xs .+ true_intercept .+ 1.0 * randn(5)

# Make one point an outlier
ys[3] = 4.0

# Visualize the data
point_colors = [:blue, :blue, :red, :blue, :blue]
scatter(xs, ys, label="Observations", markersize=6, xlabel="x", ylabel="y",
        color=point_colors)
plot!(xs, true_slope .* xs .+ true_intercept,
      label="True line", linestyle=:dash, linewidth=2, color=:black)
```

Now we can use enumerative inference to compute the exact posterior. We'll enumerate over all possible combinations of slope, intercept, and outlier classifications:

```@example enumerative_tutorial
# Create observations choicemap
observations = choicemap()
for (i, y) in enumerate(ys)
    observations[:data => i => :y] = y
end

# Set up the enumeration grid
# We enumerate over discrete slope, intercept, and outlier classifications
grid_specs = Tuple[
    (:slope, -2:2),      # 5 possible slopes
    (:intercept, -2:2),  # 5 possible intercepts
]
for i in 1:length(xs)
    push!(grid_specs, (:data => i => :is_outlier, [false, true]))
end

# Create the enumeration grid
grid = choice_vol_grid(grid_specs...)
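# Note: with 5 slopes, 5 intercepts, and 2^5 outlier patterns, this grid
# contains 5 * 5 * 2^5 = 800 cells, one per joint setting of the latent variables.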
nothing # hide
```

Here, we used [`choice_vol_grid`](@ref) to enumerate over all possible combinations of slope, intercept, and outlier classifications. The resulting `grid` object is a multi-dimensional iterator, where each element consists of a [`ChoiceMap`](@ref) that specifies the values of all latent variables, and the log-volume of latent space covered by that element of the grid. Since all latent variables are discrete, the volume of latent space covered by each element is equal to 1 (and hence the log-volume is 0). We can inspect the first element of this grid using the `first` function:

```@example enumerative_tutorial
choices, log_vol = first(grid)
println("Log volume: ", log_vol)
println("Choices: ")
show(stdout, "text/plain", choices)
```

Having constructed the enumeration grid, we now pass this to the [`enumerative_inference`](@ref) function, along with the generative model (`discrete_regression`), model arguments (in this case, `xs`), and the `observations`:

```@example enumerative_tutorial
# Run enumerative inference
traces, log_norm_weights, lml_est =
    enumerative_inference(discrete_regression, (xs,), observations, grid)

println("Grid size: ", size(grid))
println("Log marginal likelihood: ", lml_est)
```

The [`enumerative_inference`](@ref) function returns an array of `traces` and an array of normalized log posterior probabilities (`log_norm_weights`) with the same shape as the input `grid`. It also returns an estimate of the log marginal likelihood (`lml_est`) of the observations. The estimate is *exact* in this case, since we enumerated over all possible combinations of latent variables.

Each trace corresponds to a full combination of the latent variables that were enumerated over. As such, the `log_norm_weights` array represents the *joint* posterior distribution over all latent variables. By summing over all traces which have the same value for a specific latent variable (or equivalently, by summing over a dimension of the `log_norm_weights` array), we can compute the *marginal* posterior distribution for that variable. We'll do this below for the `slope` and `intercept` variables:

```@example enumerative_tutorial
# Compute 2D marginal posterior over slope and intercept
sum_dims = Tuple(3:ndims(log_norm_weights)) # Sum over all other variables
posterior_grid = sum(exp.(log_norm_weights), dims=sum_dims)
posterior_grid = dropdims(posterior_grid; dims=sum_dims)
```

Let's visualize the marginal posteriors over these variables, as well as the joint posterior for both variables together. Below is some code to plot a 2D posterior heatmap with 1D marginals as histograms.

```@raw html
Code to plot 2D grid of posterior values +``` + +```@example enumerative_tutorial +using Plots + +function plot_posterior_grid( + x_range, y_range, xy_probs; + is_discrete = true, x_true = missing, y_true = missing, + xlabel = "", ylabel = "" +) + # Create the main heatmap + p_main = heatmap(x_range, y_range, xy_probs, colorbar=false, widen=false, + color=:grays, xlabel=xlabel, ylabel=ylabel) + if is_discrete + # Add true parameters + scatter!(p_main, [x_true], [y_true], legend=true, + markersize=36, markershape=:rect, color=:white, + markerstrokecolor=:red, label="True Parameters") + # Annotate each cell with its posterior probability + for idx in CartesianIndices(xy_probs) + i, j = Tuple(idx) + prob_str = @sprintf("%.3f", xy_probs[j, i]) + prob_color = xy_probs[j, i] > 0.2 ? :black : :white + annotate!(x_range[i], y_range[j], + text(prob_str, color = prob_color, pointsize=12)) + end + else + # Add true parameters + if !ismissing(x_true) && !ismissing(y_true) + scatter!(p_main, [x_true], [y_true], legend=true, + markersize=6, color=:red, markershape=:cross, + label="True Parameters") + end + if !ismissing(x_true) + vline!([x_true], linestyle=:dash, linewidth=1, color=:red, + label="", alpha=0.5) + end + if !ismissing(y_true) + hline!([y_true], linestyle=:dash, linewidth=1, color=:red, + label="", alpha=0.5) + end + end + + # Create 1D marginal histograms + x_probs = vec(sum(xy_probs, dims=1)) + y_probs = vec(sum(xy_probs, dims=2)) + p_top = bar(x_range, x_probs, orientation=:v, ylims=(0, maximum(x_probs)), + bar_width=diff(x_range)[1], linewidth=0, color=:black, + showaxis=true, ticks=false, legend=false, widen=false) + p_right = bar(y_range, y_probs, orientation=:h, xlims=(0, maximum(y_probs)), + bar_width=diff(y_range)[1], linewidth=0, color=:black, + showaxis=true, ticks=false, legend=false, widen=false) + if !is_discrete + xlims!(p_top, xlims(p_main)) + ylims!(p_right, ylims(p_main)) + if !ismissing(x_true) + vline!(p_top, [x_true], linestyle=:dash, + linewidth=1, color=:red, legend=false) + end + if !ismissing(y_true) + hline!(p_right, [y_true], linestyle=:dash, + linewidth=1, color=:red, legend=false) + end + end + + # Create empty plot for top-right corner + p_empty = plot(legend=false, grid=false, showaxis=false, ticks=false) + + # Combine plots using layout + plot(p_top, p_empty, p_main, p_right, + layout=@layout([a{0.9w,0.1h} b{0.1w,0.1h}; c{0.9w,0.9h} d{0.1w,0.9h}]), + size=(750, 750)) +end +nothing # hide +``` + +```@raw html +
+``` + +```@example enumerative_tutorial +# Extract parameter ranges +slope_range = [trs[1][:slope] for trs in eachslice(traces, dims=1)] +intercept_range = [trs[1][:intercept] for trs in eachslice(traces, dims=2)] + +# Plot 2D posterior heatmap with 1D marginals as histograms +plot_posterior_grid(intercept_range, slope_range, posterior_grid, + x_true = true_intercept, y_true = true_slope, + xlabel = "Intercept", ylabel = "Slope", + is_discrete = true) +``` + +As can be seen, the posterior concentrates around the true values of the slope and intercept, though there is some uncertainty about both. + +We can also examine which points are most likely to be outliers: + +```@example enumerative_tutorial +# Compute posterior probability of each point being an outlier +outlier_probs = zeros(length(xs)) +for (j, trace) in enumerate(traces) + for i in 1:length(xs) + if trace[:data => i => :is_outlier] + outlier_probs[i] += exp(log_norm_weights[j]) + end + end +end + +bar(1:length(xs), outlier_probs, + xlabel="x", ylabel="P(outlier | data)", + color=:black, ylim=(0, 1), legend=false) +``` + +Notice that enumerative inference correctly identifies that point 3 (which we made an outlier) has a high probability of being an outlier, while maintaining uncertainty about the exact classifications. + +## Enumeration for Continuous Models + +Many generative models of interest have continuous latent variables. While we can't enumerate over continuous spaces exactly, we can create a discrete approximation of a continuous target distribution by defining a grid. Let's extend our model to use continuous priors: + +```@example enumerative_tutorial +@gen function continuous_regression(xs::Vector{<:Real}) + # Continuous slope and intercept priors + slope ~ normal(0, 1) + intercept ~ normal(0, 2) + + # Sample outlier classification and y value for each x value + n = length(xs) + ys = Float64[] + for i = 1:n + # Prior on outlier probability + is_outlier = {:data => i => :is_outlier} ~ bernoulli(0.1) + + if is_outlier + # Outliers have large noise + y = {:data => i => :y} ~ normal(0., 5.) + else + # Inliers follow the linear relationship, with low noise + y_mean = slope * xs[i] + intercept + y = {:data => i => :y} ~ normal(y_mean, 1.) + end + push!(ys, y) + end + + return ys +end +nothing # hide +``` + +We now construct a grid over the latent space using [`choice_vol_grid`](@ref). For continuous variables, we need to provide a range of grid points (including start and end points), and specify that the variable is `:continuous`: + +```@example enumerative_tutorial +grid = choice_vol_grid( + (:slope, -3:0.25:3, :continuous), # 24 grid intervals + (:intercept, -4:0.5:4, :continuous), # 16 grid intervals + # Still enumerate exactly over outlier classifications + (:data => 1 => :is_outlier, [false, true]), + (:data => 2 => :is_outlier, [false, true]), + (:data => 3 => :is_outlier, [false, true]), + (:data => 4 => :is_outlier, [false, true]), + (:data => 5 => :is_outlier, [false, true]); + anchor = :midpoint # Anchor evaluation point at midpoint of each interval +) + +println("Grid size for continuous model: ", size(grid)) +println("Number of grid elements: ", length(grid)) +``` + +When some variables are specified as `:continuous`, the [`choice_vol_grid`](@ref) function automatically computes the log-volume of each grid cell. 
Inspecting the first element of the grid, we see that the log-volume is equal to `log(0.25 * 0.5) ≈ -2.0794`, since that grid cell covers a volume of 0.25 * 0.5 = 0.125 in the slope-intercept latent space. We also see that the `slope` and `intercept` variables lie at the midpoint of this grid cell, since the `anchor` keyword argument was set to `:midpoint`:

```@example enumerative_tutorial
choices, log_vol = first(grid)
println("Log volume: ", log_vol)
println("Choices: ")
choices
```

Now let's generate some synthetic data to do inference on. We'll use ground-truth continuous parameters that don't lie exactly on the grid, in order to show that enumerative inference can still produce a reasonable approximation when the posterior is sufficiently smooth.

```@example enumerative_tutorial
# Generate synthetic data
true_slope = -1.21
true_intercept = 2.56
xs = [-2., -1., 0., 1., 2.]
ys = true_slope .* xs .+ true_intercept .+ 1.0 * randn(5)

# Make one point an outlier
ys[2] = 0.

# Create observations choicemap
observations = choicemap()
for (i, y) in enumerate(ys)
    observations[:data => i => :y] = y
end

# Visualize the data
point_colors = [:blue, :red, :blue, :blue, :blue]
scatter(xs, ys, label="Observations", markersize=6, xlabel="x", ylabel="y",
        color=point_colors)
plot!(xs, true_slope .* xs .+ true_intercept,
      label="True line", linestyle=:dash, linewidth=2, color=:black)
```

As in the discrete case, we can use [`enumerative_inference`](@ref) to perform inference on the continuous model:

```@example enumerative_tutorial
# Run inference on the continuous model
traces, log_norm_weights, lml_est =
    enumerative_inference(continuous_regression, (xs,), observations, grid)

println("Log marginal likelihood: ", lml_est)
```

Again, we can visualize the joint posterior over the `slope` and `intercept` variables with the help of some plotting code.

```@example enumerative_tutorial
# Compute marginal posterior over slope and intercept
sum_dims = Tuple(3:ndims(log_norm_weights)) # Sum over all other variables
posterior_grid = sum(exp.(log_norm_weights), dims=sum_dims)
posterior_grid = dropdims(posterior_grid; dims=sum_dims)

# Extract parameter ranges
slope_range = [trs[1][:slope] for trs in eachslice(traces, dims=1)]
intercept_range = [trs[1][:intercept] for trs in eachslice(traces, dims=2)]

# Plot 2D posterior heatmap with 1D marginals as histograms
p = plot_posterior_grid(intercept_range, slope_range, posterior_grid,
    x_true = true_intercept, y_true = true_slope,
    xlabel = "Intercept", ylabel = "Slope", is_discrete = false)
```

We can see that the true parameters lie in a cell with reasonably high posterior probability, though there is a fair amount of uncertainty due to the bimodal nature of the posterior distribution. This manifests in the posterior over outlier classifications as well:

```@example enumerative_tutorial
# Compute posterior probability of each point being an outlier
outlier_probs = zeros(length(xs))
for i = 1:length(xs)
    for (j, trace) in enumerate(traces)
        if trace[:data => i => :is_outlier]
            outlier_probs[i] += exp(log_norm_weights[j])
        end
    end
end

# Plot posterior probability of each point being an outlier
bar(1:length(xs), outlier_probs,
    xlabel="x", ylabel="P(outlier | data)",
    color=:black, ylim=(0, 1), legend=false)
```

The first and second data points (at `x = -2` and `x = -1`) are inferred to be possible outliers, each corresponding to one of the two modes of the full posterior distribution.
By extracting the slice of the `log_norm_weights` array that corresponds to the second data point being an outlier (i.e., when `data => 2 => :is_outlier` is `true`), we can visualize the posterior distribution over the `slope` and `intercept` variables conditional on that point being an outlier. As shown below, this conditional posterior is no longer bimodal, and concentrates more closely around the true parameters.

```@example enumerative_tutorial
# Extract slice of weights corresponding to data point 2 being an outlier
cond_log_norm_weights = log_norm_weights[:,:,:,end:end,:,:,:]

# Compute marginal posterior over slope & intercept given that point 2 is an outlier
sum_dims = Tuple(3:ndims(cond_log_norm_weights)) # Sum over all other variables
posterior_grid = sum(exp.(cond_log_norm_weights), dims=sum_dims)
posterior_grid = dropdims(posterior_grid; dims=sum_dims)

# Extract parameter ranges
slope_range = [trs[1][:slope] for trs in eachslice(traces, dims=1)]
intercept_range = [trs[1][:intercept] for trs in eachslice(traces, dims=2)]

# Plot 2D posterior heatmap with 1D marginals as histograms
plot_posterior_grid(intercept_range, slope_range, posterior_grid,
    x_true = true_intercept, y_true = true_slope,
    xlabel = "Intercept", ylabel = "Slope", is_discrete = false)
```

Instead of extracting a slice of the full weight array, we could also have used [`choice_vol_grid`](@ref) to construct an enumeration grid with `data => 2 => :is_outlier` constrained to `true`, and then called [`enumerative_inference`](@ref) with this conditional grid. This ability to compute conditional posteriors is another useful aspect of enumerative inference: Even when the latent space becomes too high-dimensional for enumeration over the full joint posterior, we can still inspect the conditional posteriors over some variables conditioned on the values of other variables, and check whether they make sense.

## Diagnosing Model Misspecification

As we have seen above, enumerative inference allows us to approximate a posterior distribution with a high degree of fidelity (at the expense of additional computation). This allows us to distinguish between two ways that inference in a Bayesian model can go wrong:

- **Inference Failure:** The inference algorithm fails to approximate the true posterior distribution well (e.g. due to a bad importance sampling proposal, a poorly-designed MCMC kernel, or insufficient computation).

- **Model Misspecification:** The Bayesian model itself is misspecified, such that the true posterior distribution does not correspond with our intuitions about what the posterior should look like.

Both of these issues can occur at the same time: an algorithm might fail to converge to the true posterior, and the model might be misspecified. Regardless, since enumerative inference can approximate the true posterior distribution arbitrarily well (by making the grid arbitrarily large and fine), we can use it to check whether some other algorithm converges to the true posterior, and also whether the true posterior itself accords with our intuitions.
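
For example, one simple check is to compare a single marginal probability computed by enumeration against the same quantity estimated by a sampling algorithm. The snippet below is a sketch of such a check, not part of the tutorial's main pipeline: it reuses the `traces` and `log_norm_weights` computed above for the continuous model, and compares the enumerated probability that the second data point is an outlier against an estimate from repeated importance resampling (the sample counts here are arbitrary).

```julia
# Enumerated (grid-based) marginal probability that data point 2 is an outlier
enum_prob = sum(exp(log_norm_weights[idx]) for idx in eachindex(traces)
                if traces[idx][:data => 2 => :is_outlier])

# The same marginal, estimated with importance resampling
approx_traces = [first(importance_resampling(continuous_regression, (xs,), observations, 1000))
                 for _ in 1:100]
approx_prob = count(tr -> tr[:data => 2 => :is_outlier], approx_traces) / length(approx_traces)

println("Enumerated estimate:          ", enum_prob)
println("Importance sampling estimate: ", approx_prob)
```

If the two estimates disagree badly even as the Monte Carlo sample size grows, the sampling algorithm is the likely culprit; if they agree but the shared answer looks unreasonable, the model itself deserves scrutiny. The rest of this section applies this kind of comparison to a model that turns out to be misspecified.
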
+ +As a demonstration, let us write a version of the continuous regression model with narrow slope and intercept priors, and a high probability of outliers: + +```@example enumerative_tutorial +@gen function misspecified_regression(xs::Vector{<:Real}) + # Narrow slope and intercept priors + slope ~ normal(0, sqrt(0.5)) + intercept ~ normal(0, sqrt(0.5)) + + # Sample outlier classification and y value for each x value + n = length(xs) + ys = Float64[] + for i = 1:n + # High (25% chance) prior probability of being an outlier + is_outlier = {:data => i => :is_outlier} ~ bernoulli(0.25) + + if is_outlier + # Outliers have large noise + y = {:data => i => :y} ~ normal(0., 5.) + else + # Inliers follow the linear relationship, with low noise + y_mean = slope * xs[i] + intercept + y = {:data => i => :y} ~ normal(y_mean, 1.) + end + push!(ys, y) + end + + return ys +end +nothing # hide +``` + +To create a case where the model is misspecified, we generate data with a steep slope and a large intercept, but no outliers: + +```@example enumerative_tutorial +# Generate synthetic data +true_slope = 2.8 +true_intercept = -2.4 +xs = [-2., -1., 0., 1., 2.] +ys = true_slope .* xs .+ true_intercept .+ 1.0 * randn(5) + +# Create observations choicemap +observations = choicemap() +for (i, y) in enumerate(ys) + observations[:data => i => :y] = y +end + +# Visualize the data +point_colors = [:blue, :blue, :blue, :blue, :blue] +scatter(xs, ys, label="Observations", markersize=6, xlabel="x", ylabel="y", + color=point_colors) +plot!(xs, true_slope .* xs .+ true_intercept, + label="True line", linestyle=:dash, linewidth=2, color=:black) +``` + +Now let us try using [`importance_resampling`](@ref) to approximate the posterior distribution under the misspecified model: + +```@example enumerative_tutorial +# Try importance resampling with 2000 inner samples and 100 outer samples +println("Running importance sampling...") +traces = [importance_resampling(misspecified_regression, (xs,), observations, 2000)[1] for i in 1:100] + +# Compute the mean slope and intercept +mean_slope = sum(trace[:slope] for trace in traces) / length(traces) +mean_intercept = sum(trace[:intercept] for trace in traces) / length(traces) + +println("Mean slope: ", mean_slope) +println("Mean intercept: ", mean_intercept) +``` + +Instead of recovering anything close to the true parameters, importance sampling infers a much smaller mean for the slope and intercept. We can also visualize the joint posterior over the slope and intercept by plotting a 2D histogram from the samples: + +```@raw html +
Code to plot posterior samples +``` + +```@example enumerative_tutorial +function plot_posterior_samples( + x_range, y_range, x_values, y_values; + x_true = missing, y_true = missing, + xlabel = "", ylabel = "" +) + # Create the main heatmap + p_main = histogram2d(x_values, y_values, bins=(x_range, y_range), + show_empty_bins=true, normalize=:probability, + color=:grays, colorbar=false, legend=false, + xlabel=xlabel, ylabel=ylabel) + xlims!(p_main, minimum(x_range), maximum(x_range)) + ylims!(p_main, minimum(y_range), maximum(y_range)) + # Add true parameters + if !ismissing(x_true) && !ismissing(y_true) + scatter!(p_main, [true_intercept], [true_slope], + markersize=6, color=:red, markershape=:cross, + label="True Parameters", legend=true) + end + if !ismissing(x_true) + vline!([x_true], linestyle=:dash, linewidth=1, color=:red, + label="", alpha=0.5) + end + if !ismissing(y_true) + hline!([y_true], linestyle=:dash, linewidth=1, color=:red, + label="", alpha=0.5) + end + + # Create 1D marginal histograms + p_top = histogram(x_values, bins=x_range, orientation=:v, legend=false, + normalize=:probability, linewidth=0, color=:black, + showaxis=true, ticks=false) + x_probs_max = maximum(p_top.series_list[2].plotattributes[:y]) + ylims!(p_top, 0, x_probs_max) + xlims!(p_top, minimum(x_range), maximum(x_range)) + p_right = histogram(y_values, bins=y_range, orientation=:h, legend=false, + normalize=:probability, linewidth=0, color=:black, + showaxis=true, ticks=false) + y_probs_max = maximum(p_right.series_list[2].plotattributes[:y]) + xlims!(p_right, 0, y_probs_max) + ylims!(p_right, minimum(y_range), maximum(y_range)) + # Add true parameters + if !ismissing(x_true) + vline!(p_top, [x_true], linestyle=:dash, + linewidth=1, color=:red, legend=false) + end + if !ismissing(y_true) + hline!(p_right, [y_true], linestyle=:dash, + linewidth=1, color=:red, legend=false) + end + + # Create empty plot for top-right corner + p_empty = plot(legend=false, grid=false, showaxis=false, ticks=false) + + # Combine plots using layout + plot(p_top, p_empty, p_main, p_right, + layout=@layout([a{0.9w,0.1h} b{0.1w,0.1h}; c{0.9w,0.9h} d{0.1w,0.9h}]), + size=(750, 750)) +end +nothing # hide +``` + +```@raw html +
+``` + +```@example enumerative_tutorial +# Plot a 2D histogram for the slope and intercept variables +slopes = [trace[:slope] for trace in traces] +intercepts = [trace[:intercept] for trace in traces] +plot_posterior_samples(-4:0.25:4, -4:0.25:4, intercepts, slopes, + x_true=true_intercept, y_true=true_slope, + xlabel="Intercept", ylabel="Slope") +``` + +The distribution of samples produced by importance sampling lies far from the true slope and intercept, and concentrates around values that do not intuitively make sense given the data. The distribution over outlier classifications sheds some light on the problem: + +```@example enumerative_tutorial +# Estimate posterior probability of each point being an outlier +outlier_probs = zeros(length(xs)) +for i = 1:length(xs) + for (j, trace) in enumerate(traces) + if trace[:data => i => :is_outlier] + outlier_probs[i] += 1/length(traces) + end + end +end + +# Plot posterior probability of each point being an outlier +bar(1:length(xs), outlier_probs, + xlabel="x", ylabel="P(outlier | data)", + color=:black, ylim=(0, 1), legend=false) +``` + +Importance sampling infers that many of the points are likely to be outliers. That is, instead of inferring a steep slope and a negative intercept, importance sampling prefers to explain the data as a flatter line with *many* outliers. + +These inferences are indicative of model misspecification. Still, we can't be confident that this isn't just an inference failure. After all, we used importance sampling with the prior as our proposal distribution. Since the prior over slopes and intercepts is very narrow, it is very likely that *none* of the 2000 inner samples used by [`importance_resampling`](@ref) came close to the true slope and intercept. So it is possible that the issues above arise because importance sampling fails to produce a good approximation of the true posterior. + +Before using enumerative inference to resolve this ambiguity, let us try using an MCMC inference algorithm, which might avoid the inference failures of importance sampling by exploring a broader region of the latent space. Similar to the [tutorial on MCMC](@ref mcmc_map_tutorial), we'll use an MCMC kernel that performs Gaussian drift on the continuous parameters, followed by block resimulation on the outlier classifications: + +```@example enumerative_tutorial +@gen function line_proposal(trace) + slope ~ normal(trace[:slope], 0.5) + intercept ~ normal(trace[:intercept], 0.5) +end + +function mcmc_kernel(trace) + # Gaussian drift on line parameters + (trace, _) = mh(trace, line_proposal, ()) + + # Block resimulation: Update the outlier classifications + (xs,) = get_args(trace) + n = length(xs) + for i=1:n + (trace, _) = mh(trace, select(:data => i => :is_outlier)) + end + return trace +end + +function mcmc_sampler(kernel, trace, n_iters::Int, n_burnin::Int = 0) + traces = Vector{typeof(trace)}() + for i in 1:(n_iters + n_burnin) + trace = kernel(trace) + if i > n_burnin + push!(traces, trace) + end + end + return traces +end +nothing # hide +``` + +In addition, we will intiialize MCMC at the true slope and intercept. This way, we can rule out the possibility that MCMC never explores the region of latent space near the true parameters. 

```@example enumerative_tutorial
# Generate initial trace at true slope and intercept
constraints = choicemap(
    :slope => true_slope,
    :intercept => true_intercept
)
constraints = merge(constraints, observations)
(trace, _) = Gen.generate(misspecified_regression, (xs,), constraints)

# Run MCMC for 10,000 iterations with a burn-in of 500
traces = mcmc_sampler(mcmc_kernel, trace, 10000, 500)

# Compute the mean slope and intercept
mean_slope = sum(trace[:slope] for trace in traces) / length(traces)
mean_intercept = sum(trace[:intercept] for trace in traces) / length(traces)

println("Mean slope: ", mean_slope)
println("Mean intercept: ", mean_intercept)
```

Like importance sampling, MCMC infers a slope and intercept that are much closer to zero than the true parameters. Let us visualize the joint posterior.

```@example enumerative_tutorial
# Plot posterior samples
slopes = [trace[:slope] for trace in traces]
intercepts = [trace[:intercept] for trace in traces]
plot_posterior_samples(-4:0.25:4, -4:0.25:4, intercepts, slopes,
    x_true=true_intercept, y_true=true_slope,
    xlabel="Intercept", ylabel="Slope")
```

Let us also plot the inferred outlier probabilities:

```@example enumerative_tutorial
# Estimate posterior probability of each point being an outlier
outlier_probs = zeros(length(xs))
for i = 1:length(xs)
    for (j, trace) in enumerate(traces)
        if trace[:data => i => :is_outlier]
            outlier_probs[i] += 1/length(traces)
        end
    end
end

# Plot posterior probability of each point being an outlier
bar(1:length(xs), outlier_probs,
    xlabel="x", ylabel="P(outlier | data)",
    color=:black, ylim=(0, 1), legend=false)
```

Both MCMC and importance sampling produce similar inferences, inferring a flat slope with many outliers rather than a steep slope with few outliers. This is despite the fact that MCMC was initialized at the true parameters, strongly indicating that model misspecification is at play here.

In general, however, we don't have access to the true parameters, nor do we always know if MCMC will converge to the posterior given a finite sample budget. To decisively diagnose model misspecification, we now use enumerative inference with a sufficiently fine grid, ensuring systematic coverage of the latent space.
+ +```@example enumerative_tutorial +# Construct enumeration grid +grid = choice_vol_grid( + (:slope, -4:0.25:4, :continuous), # 32 grid intervals + (:intercept, -4:0.25:4, :continuous), # 32 grid intervals + # Enumerate exactly over outlier classifications + (:data => 1 => :is_outlier, [false, true]), + (:data => 2 => :is_outlier, [false, true]), + (:data => 3 => :is_outlier, [false, true]), + (:data => 4 => :is_outlier, [false, true]), + (:data => 5 => :is_outlier, [false, true]); + anchor = :midpoint # Anchor evaluation point at midpoint of each interval +) + +# Run enumerative inference +traces, log_norm_weights, lml_est = + enumerative_inference(misspecified_regression, (xs,), observations, grid) + +# Compute marginal posterior over slope and intercept +sum_dims = Tuple(3:ndims(log_norm_weights)) # Sum over all other variables +posterior_grid = sum(exp.(log_norm_weights), dims=sum_dims) +posterior_grid = dropdims(posterior_grid; dims=sum_dims) + +# Extract parameter ranges +slope_range = [trs[1][:slope] for trs in eachslice(traces, dims=1)] +intercept_range = [trs[1][:intercept] for trs in eachslice(traces, dims=2)] + +# Plot 2D posterior heatmap with 1D marginals as histograms +plot_posterior_grid(intercept_range, slope_range, posterior_grid, + x_true = true_intercept, y_true = true_slope, + xlabel = "Intercept", ylabel = "Slope", is_discrete = false) +``` + +While enumerative inference produces a posterior approximation that is smoother than both importance sampling and MCMC, it still assigns a very low posterior density to the true slope and intercept. Inspecting the outlier classifications, we see that many points are inferred as likely outliers: + +```@example enumerative_tutorial +# Compute posterior probability of each point being an outlier +outlier_probs = zeros(length(xs)) +for (j, trace) in enumerate(traces) + for i in 1:length(xs) + if trace[:data => i => :is_outlier] + outlier_probs[i] += exp(log_norm_weights[j]) + end + end +end + +bar(1:length(xs), outlier_probs, + xlabel="x", ylabel="P(outlier | data)", + color=:black, ylim=(0, 1), legend=false) +``` + +This confirms that *model misspecification is the underlying issue*: The generative model we wrote doesn't capture our intuitions about what posterior inference from the data should give us. + +## Addressing Model Misspecification + +Now that we know our model is misspecified, how do we fix it? In the specific example we considered, the priors over the slope and intercept are too narrow, whereas the outlier probability is too high. A straightforward fix would thus be to widen the slope and intercept priors, while lowering the outlier probability. + +However, this change might not generalize to other sets of observations. If some data really is well-characterized by a shallow slope with many outliers, we would like to infer this as well. A more robust solution then, is to introduce *hyper-priors*: Priors on the parameters of the slope and intercept priors and the outlier probability. 
Adding hyper-priors results in a hierarchical Bayesian model: + +```@example enumerative_tutorial +@gen function h_bayes_regression(xs::Vector{<:Real}) + # Hyper-prior on slope and intercept prior variances + slope_var ~ inv_gamma(1, 1) + intercept_var ~ inv_gamma(1, 1) + # Slope and intercept priors + slope ~ normal(0, sqrt(slope_var)) + intercept ~ normal(0, sqrt(intercept_var)) + # Prior on outlier probability + prob_outlier ~ beta(1, 1) + + # Sample outlier classification and y value for each x value + n = length(xs) + ys = Float64[] + for i = 1:n + # Sample outlier classification + is_outlier = {:data => i => :is_outlier} ~ bernoulli(prob_outlier) + + if is_outlier + # Outliers have large noise + y = {:data => i => :y} ~ normal(0., 5.) + else + # Inliers follow the linear relationship, with low noise + y_mean = slope * xs[i] + intercept + y = {:data => i => :y} ~ normal(y_mean, 1.) + end + push!(ys, y) + end + + return ys +end +nothing # hide +``` + +Let's run enumerative inference on this expanded model, using a coarser grid to compensate for the increased dimensionality of the latent space: + +```@example enumerative_tutorial +# Construct enumeration grid +grid = choice_vol_grid( + (:slope, -4:1:4, :continuous), # 8 grid intervals + (:intercept, -4:1:4, :continuous), # 8 grid intervals + (:slope_var, 0:1:5, :continuous), # 5 grid intervals + (:intercept_var, 0:1:5, :continuous), # 5 grid intervals + (:prob_outlier, 0.0:0.2:1.0, :continuous), # 5 grid intervals + # Enumerate exactly over outlier classifications + (:data => 1 => :is_outlier, [false, true]), + (:data => 2 => :is_outlier, [false, true]), + (:data => 3 => :is_outlier, [false, true]), + (:data => 4 => :is_outlier, [false, true]), + (:data => 5 => :is_outlier, [false, true]); + anchor = :midpoint # Anchor evaluation point at midpoint of each interval +) + +# Run enumerative inference (this may take a while) +traces, log_norm_weights, lml_est = + enumerative_inference(h_bayes_regression, (xs,), observations, grid) + +# Compute marginal posterior over slope and intercept +sum_dims = Tuple(3:ndims(log_norm_weights)) # Sum over all other variables +posterior_grid = sum(exp.(log_norm_weights), dims=sum_dims) +posterior_grid = dropdims(posterior_grid; dims=sum_dims) + +# Extract parameter ranges +slope_range = [trs[1][:slope] for trs in eachslice(traces, dims=1)] +intercept_range = [trs[1][:intercept] for trs in eachslice(traces, dims=2)] + +# Plot 2D posterior heatmap with 1D marginals as histograms +plot_posterior_grid(intercept_range, slope_range, posterior_grid, + x_true = true_intercept, y_true = true_slope, + xlabel = "Intercept", ylabel = "Slope", is_discrete = false) +``` + +We see that the mode of the posterior distribution is now close to the true parameters (though there is also a secondary mode corresponding to the interpretation that the data has a shallow slope with outliers). To get a sense of why inference is now reasonable under our new model, let us visualize the conditional posteriors over `slope_var`, `intercept_var` and `prob_outlier` when `slope` and `intercept` are fixed at their true values. 
+ +```@example enumerative_tutorial +# Construct enumeration grid conditional on true slope and intercept +cond_grid = choice_vol_grid( + (:slope_var, 0.0:0.5:5.0, :continuous), # 10 grid intervals + (:intercept_var, 0.0:0.5:5.0, :continuous), # 10 grid intervals + (:prob_outlier, 0.0:0.1:1.0, :continuous), # 10 grid intervals + # Enumerate exactly over outlier classifications + (:data => 1 => :is_outlier, [false, true]), + (:data => 2 => :is_outlier, [false, true]), + (:data => 3 => :is_outlier, [false, true]), + (:data => 4 => :is_outlier, [false, true]), + (:data => 5 => :is_outlier, [false, true]); + anchor = :midpoint # Anchor evaluation point at the right of each interval +) + +# Run enumerative inference over conditional posterior +constraints = choicemap(:slope => true_slope, :intercept => true_intercept) +constraints = merge(constraints, observations) +traces, log_norm_weights, lml_est = + enumerative_inference(h_bayes_regression, (xs,), constraints, cond_grid) + +# Compute marginal posterior over slope_var and intercept_var +sum_dims = Tuple(3:ndims(log_norm_weights)) # Sum over all other variables +posterior_grid = sum(exp.(log_norm_weights), dims=sum_dims) +posterior_grid = dropdims(posterior_grid; dims=sum_dims) + +# Extract parameter ranges +slope_var_range = [trs[1][:slope_var] for trs in eachslice(traces, dims=1)] +intercept_var_range = [trs[1][:intercept_var] for trs in eachslice(traces, dims=2)] + +# Plot 2D posterior heatmap with 1D marginals as histograms +plot_posterior_grid(intercept_var_range, slope_var_range, posterior_grid, + xlabel = "Intercept Variance", ylabel = "Slope Variance", + is_discrete = false) +``` + +```@example enumerative_tutorial +# Compute marginal posterior over prob_outlier +sum_dims = (1, 2, 4:ndims(log_norm_weights)...) # Sum over all other variables +prob_outlier_grid = sum(exp.(log_norm_weights), dims=sum_dims) +prob_outlier_grid = dropdims(prob_outlier_grid; dims=sum_dims) +prob_outlier_range = [trs[1][:prob_outlier] for trs in eachslice(traces, dims=3)] + +# Plot marginal posterior distribution over prob_outlier +bar(prob_outlier_range, prob_outlier_grid, + legend=false, bar_width=diff(prob_outlier_range)[1], + linewidth=0, color=:black, widen=false, xlims=(0, 1), + xlabel = "Outlier Probability (prob_outlier)", + ylabel = "Conditional Posterior Probability") +``` + +Conditional on the observed data and the true parameters (`slope = 2.8` and `intercept = -2.4`), the distribution over `slope_var` and `intercept_var` skews towards large values, while the distribution over `prob_outlier` skews towards low values. This avoids the failure mode that arose when the slope and intercept priors were forced to be narrow. Instead, `slope_var`, `intercept_var` and `prob_outlier` can adjust upwards or downwards to adapt to the observed data. + +Having gained confidence that our new model is well-specified by performing enumerative inference at a coarse-grained level, we can now use MCMC to approximate the posterior more efficiently, and with a higher degree of spatial resolution. 
+ +```@example enumerative_tutorial +function h_bayes_mcmc_kernel(trace) + # Gaussian drift on line parameters + (trace, _) = mh(trace, line_proposal, ()) + + # Block resimulation: Update the outlier classifications + (xs,) = get_args(trace) + n = length(xs) + for i=1:n + (trace, _) = mh(trace, select(:data => i => :is_outlier)) + end + + # Block resimulation: Update the prior parameters + (trace, _) = mh(trace, select(:slope_var)) + (trace, _) = mh(trace, select(:intercept_var)) + (trace, _) = mh(trace, select(:prob_outlier)) + return trace +end + +# Generate initial trace from prior +trace, _ = Gen.generate(h_bayes_regression, (xs,), observations) + +# Run MCMC for 20,000 iterations with a burn-in of 500 +traces = mcmc_sampler(h_bayes_mcmc_kernel, trace, 20000, 500) + +# Plot posterior samples +slopes = [trace[:slope] for trace in traces] +intercepts = [trace[:intercept] for trace in traces] +plot_posterior_samples(-4:0.25:4, -4:0.25:4, intercepts, slopes, + x_true=true_intercept, y_true=true_slope, + xlabel="Intercept", ylabel="Slope") +``` + +MCMC produces samples that concentrate around the true parameters, while still exhibiting some of the bimodality we saw when using coarse-grained enumerative inference. diff --git a/docs/src/tutorials/mcmc_map.md b/docs/src/tutorials/mcmc_map.md new file mode 100644 index 000000000..c4821dee0 --- /dev/null +++ b/docs/src/tutorials/mcmc_map.md @@ -0,0 +1,1292 @@ +# [Basics of MCMC and MAP Inference](@id mcmc_map_tutorial) + +This tutorial introduces the basics of writing *Markov chain Monte Carlo (MCMC)* +inference programs in Gen. We also briefly cover *maximum a posteriori (MAP)* +inference, as another kind of iterative inference algorithm. + +## Linear Regression with Outliers + +As an example task, we consider a simple extension of the linear regression problem introduced in previous tutorials: *linear regression with outliers*. Suppose we have a dataset of points in the $x,y$ plane that is *mostly* explained by a linear relationship, but which also has several outliers. Our goal will be to automatically identify the outliers, and to find a linear relationship (a slope and intercept, as well as an inherent noise level) that explains rest of the points: + +```@raw html +
+ See https://dspace.mit.edu/bitstream/handle/1721.1/119255/MIT-CSAIL-TR-2018-020.pdf, Figure 2(a) +
+``` + +This is a simple inference problem. But it has two features that make it ideal for introducing concepts in modeling and inference. + +1. First, we want not only to estimate the slope and intercept of the line that best fits the data, but also to classify each point as an inlier or outlier; that is, there are a large number of latent variables of interest, enough to make importance sampling an unreliable method (absent a more involved custom proposal that does the heavy lifting). +2. Second, several of the parameters we're estimating (the slope and intercept) are continuous and amenable to gradient-based search techniques, which will allow us to explore Gen's optimization capabilities. + +Let's get started! + +```@setup mcmc_map_tutorial +import Random, Logging +using Gen, Plots + +# Disable logging, because @animate is verbose otherwise +Logging.disable_logging(Logging.Info) +nothing # hide +``` + +## Writing the Model + +We begin, as usual, by writing a model: a generative function responsible +(conceptually) for simulating a synthetic dataset. + +Our model will take as input a vector of `x` coordinates, and produce as +output corresponding `y` coordinates. + +We will also use this opportunity to introduce some syntactic sugar. +As described in [Introduction to Modeling in Gen](@ref modeling_tutorial), +random choices in Gen are given _addresses_ using the syntax +`{addr} ~ distribution(...)`. But this can be a bit verbose, and often leads to +code that looks like the following: + +```julia +x = {:x} ~ normal(0, 1) +slope = {:slope} ~ normal(0, 1) +``` + +In these examples, the variable name is duplicated as the address of the +random choice. Because this is a common pattern, Gen provides syntactic +sugar that makes it nicer to use: + +```julia +# Desugars to "x = {:x} ~ normal(0, 1)" +x ~ normal(0, 1) +# Desugars to "slope = {:slope} ~ normal(0, 1)" +slope ~ normal(0, 1) +``` + +Note that sometimes, it is still necessary to use the `{...}` form, for example +in loops: + +```julia +# INVALID: +for i=1:10 + y ~ normal(0, 1) # The name :y will be used more than once!! + println(y) +end + +# VALID: +for i=1:10 + y = {(:y, i)} ~ normal(0, 1) # OK: the address is different each time. + println(y) +end +``` + +We'll use this new syntax for writing our model of linear regression with +outliers. As we've seen before, the model generates parameters from a prior, +and then simulates data based on those parameters: + +```@example mcmc_map_tutorial +@gen function regression_with_outliers(xs::Vector{<:Real}) + # First, generate some parameters of the model. We make these + # random choices, because later, we will want to infer them + # from data. The distributions we use here express our assumptions + # about the parameters: we think the slope and intercept won't be + # too far from 0; that the noise is relatively small; and that + # the proportion of the dataset that don't fit a linear relationship + # (outliers) could be anything between 0 and 1. + slope ~ normal(0, 2) + intercept ~ normal(0, 2) + noise ~ gamma(1, 1) + prob_outlier ~ uniform(0, 1) + + # Next, we generate the actual y coordinates. + n = length(xs) + ys = Float64[] + + for i = 1:n + # Decide whether this point is an outlier, and set + # mean and standard deviation accordingly + if ({:data => i => :is_outlier} ~ bernoulli(prob_outlier)) + (mu, std) = (0., 10.) 
+ else + (mu, std) = (xs[i] * slope + intercept, noise) + end + # Sample a y value for this point + push!(ys, {:data => i => :y} ~ normal(mu, std)) + end + ys +end +nothing # hide +``` + +## Visualizing the Model + +Let's visualize what our model is doing by drawing some samples from the +prior. We'll use the following helper functions. + +```@raw html +
Helper functions for visualization +``` + +```@example mcmc_map_tutorial +function serialize_trace(trace) + (xs,) = Gen.get_args(trace) + Dict(:slope => trace[:slope], + :intercept => trace[:intercept], + :inlier_std => trace[:noise], + :points => zip(xs, [trace[:data => i => :y] for i in 1:length(xs)]), + :outliers => [trace[:data => i => :is_outlier] for i in 1:length(xs)]) +end + + +function visualize_trace(trace::Trace; title="") + trace = serialize_trace(trace) + + outliers = [pt for (pt, outlier) in zip(trace[:points], trace[:outliers]) if outlier] + inliers = [pt for (pt, outlier) in zip(trace[:points], trace[:outliers]) if !outlier] + Plots.scatter(map(first, inliers), map(last, inliers), markercolor="blue", label=nothing, xlims=[-5, 5], ylims=[-20, 20], title=title) + Plots.scatter!(map(first, outliers), map(last, outliers), markercolor="red", label=nothing) + + inferred_line(x) = trace[:slope] * x + trace[:intercept] + left_x = -5 + left_y = inferred_line(left_x) + right_x = 5 + right_y = inferred_line(right_x) + Plots.plot!([left_x, right_x], [left_y, right_y], color="black", label=nothing) + + # Inlier noise + inlier_std = trace[:inlier_std] + noise_points = [(left_x, left_y + inlier_std), + (right_x, right_y + inlier_std), + (right_x, right_y - inlier_std), + (left_x, left_y - inlier_std)] + Plots.plot!(Shape(map(first, noise_points), map(last, noise_points)), color="black", alpha=0.2, label=nothing) + Plots.plot!(Shape([-5, 5, 5, -5], [10, 10, -10, -10]), color="black", label=nothing, alpha=0.08) +end +nothing # hide +``` + +```@raw html +

+``` + +We'll use these functions to visualize samples from our model. + +```@example mcmc_map_tutorial +# Generate nine traces and visualize them +xs = collect(range(-5, stop=5, length=20)) +traces = [Gen.simulate(regression_with_outliers, (xs,)) for i in 1:9] +Plots.plot([visualize_trace(t) for t in traces]...) +``` + +##### Legend: + +* red points: outliers; +* blue points: inliers (i.e. regular data); +* dark grey shading: noise associated with inliers; and +* light grey shading: noise associated with outliers. + +Note that an outlier can occur anywhere — including close to the line — and +that our model is capable of generating datasets in which the vast majority +of points are outliers. + +## The Limits of Importance Sampling + +To motivate the need for more complex inference algorithms, let's begin by +using the simple importance sampling method from the [Introduction to Modeling](@ref modeling_tutorial) tutorial, and +thinking about where it fails. + +First, let us create a synthetic dataset to do inference _about_. + +```@example mcmc_map_tutorial +function make_synthetic_dataset(n) + Random.seed!(1) + prob_outlier = 0.2 + true_inlier_noise = 0.5 + true_outlier_noise = 5.0 + true_slope = -1 + true_intercept = 2 + xs = collect(range(-5, stop=5, length=n)) + ys = Float64[] + for (i, x) in enumerate(xs) + if rand() < prob_outlier + y = randn() * true_outlier_noise + else + y = true_slope * x + true_intercept + randn() * true_inlier_noise + end + push!(ys, y) + end + (xs, ys) +end + +(xs, ys) = make_synthetic_dataset(20) +Plots.scatter(xs, ys, color="black", xlabel="X", ylabel="Y", + label=nothing, title="Observations - regular data and outliers") +``` + +We will to express our _observations_ as a [`ChoiceMap`](@ref) that constrains the +values of certain random choices to equal their observed values. Here, we +want to constrain the values of the choices with address `:data => i => :y` +(that is, the sampled $y$ coordinates) to equal the observed $y$ values. +Let's write a helper function that takes in a vector of $y$ values and +creates a [`ChoiceMap`](@ref) that we can use to constrain our model: + +```@example mcmc_map_tutorial +function make_constraints(ys::Vector{Float64}) + constraints = Gen.choicemap() + for i=1:length(ys) + constraints[:data => i => :y] = ys[i] + end + constraints +end +nothing # hide +``` + +We can apply it to our dataset's vector of `ys` to make a set of constraints +for doing inference: + +```@example mcmc_map_tutorial +observations = make_constraints(ys) +nothing # hide +``` + +Now, we use the library function `importance_resampling` to draw approximate +posterior samples given those observations: + +```@example mcmc_map_tutorial +function logmeanexp(scores) + logsumexp(scores) - log(length(scores)) +end +nothing # hide +``` + +```@example mcmc_map_tutorial +traces = [first(Gen.importance_resampling(regression_with_outliers, (xs,), observations, 2000)) for i in 1:9] +log_probs = [get_score(t) for t in traces] +println("Average log probability: $(logmeanexp(log_probs))") +Plots.plot([visualize_trace(t) for t in traces]...) +``` + +We see here that importance resampling hasn't completely failed: it generally +finds a reasonable position for the line. But the details are off: there is +little logic to the outlier classification, and the inferred noise around the +line is too wide. The problem is that there are just too many variables to +get right, and so sampling everything in one go is highly unlikely to produce +a perfect hit. 

In the remainder of this tutorial, we'll explore techniques for finding the
right solution _iteratively_, beginning with an initial guess and making many
small changes, until we achieve a reasonable posterior sample.

## MCMC Part 1: Block Resimulation

### What is MCMC?

_Markov Chain Monte Carlo_ ("MCMC") methods are a powerful family of
algorithms for iteratively producing approximate samples from a distribution
(when applied to Bayesian inference problems, the posterior distribution of
unknown (hidden) model variables given data).

There is a rich theory behind MCMC methods (see [this paper](https://doi.org/10.1023/A:1020281327116)
for an introduction), but we focus on applying MCMC in Gen, introducing
theoretical ideas only when necessary for understanding. As we will see, Gen
provides abstractions that hide and automate much of the math necessary for
implementing MCMC algorithms correctly.

The general shape of an MCMC algorithm is as follows. We begin by sampling an
initial setting of all unobserved variables; in Gen, we produce an initial
_trace_ consistent with (but not necessarily _probable_ given) our
observations. Then, in a long-running loop, we make small, stochastic changes
to the trace; in order for the algorithm to be asymptotically correct, these
stochastic updates must satisfy certain probabilistic properties.

One common way of ensuring that the updates do satisfy those properties is to
compute a _Metropolis-Hastings acceptance ratio_. Essentially, after
proposing a change to a trace, we add an "accept or reject" step that
stochastically decides whether to commit the update or to revert it. This is
an over-simplification, but generally speaking, this step ensures we are more
likely to accept changes that make our trace fit the observed data better,
and to reject ones that make our current trace worse. The algorithm also
tries not to go down dead ends: it is more likely to take an exploratory step
into a low-probability region if it knows it can easily get back to where it
came from.

Gen's [`metropolis_hastings`](@ref) function _automatically_ adds this
"accept/reject" check (including the correct computation of the probability
of acceptance or rejection), so that inference programmers need only
think about what sorts of updates might be useful to propose. Starting in
this section, we'll look at several design patterns for MCMC updates, and how
to apply them in Gen.

### Block Resimulation

One of the simplest strategies we can use is called Resimulation MH, and it
works as follows.

We begin, as in most iterative inference algorithms, by sampling an initial
trace from our model using the [`generate`](@ref) API function, fixing the
observed choices to their observed values.

```julia
# Gen's `generate` function accepts a model, a tuple of arguments to the model,
# and a `ChoiceMap` representing observations (or constraints to satisfy). It returns
# a complete trace consistent with the observations, and an importance weight.
# In this call, we ignore the weight returned.
(tr, _) = generate(regression_with_outliers, (xs,), observations)
```

Then, in each iteration of our program, we propose changes to all our model's
variables in "blocks," by erasing a set of variables from our current trace
and _resimulating_ them from the model. After resimulating each block of
choices, we perform an accept/reject step, deciding whether the proposed
changes are worth making.
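
To make the accept/reject decision a bit more concrete: if $t$ is the current
trace, $t'$ is the proposed trace, and $q$ gives the probability of proposing
one trace from the other, the standard Metropolis-Hastings rule accepts the
proposal with probability

```math
\alpha = \min\left(1, \frac{p(t')\, q(t \mid t')}{p(t)\, q(t' \mid t)}\right),
```

where $p$ denotes the joint probability of a trace together with the
observations (the posterior's normalizing constant cancels in the ratio).
Gen computes this quantity for us whenever we call
[`metropolis_hastings`](@ref), so we never need to write it out by hand; it is
shown here only to make clear what the accept/reject step is based on.
Schematically, the overall update loop looks like this:
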
+ +```julia +# Pseudocode +for iter=1:500 + tr = maybe_update_block_1(tr) + tr = maybe_update_block_2(tr) + ... + tr = maybe_update_block_n(tr) +end +``` + +The main design choice in designing a Block Resimulation MH algorithm is how +to block the choices together for resimulation. At one extreme, we could put +each random choice the model makes in its own block. At the other, we could +put all variables into a single block (a strategy sometimes called +"independent" MH, and which bears a strong similarity to importance +resampling, as it involves repeatedly generating completely new traces and +deciding whether to keep them or not). Usually, the right thing to do is +somewhere in between. + +For the regression problem, here is one possible blocking of choices: + +**Block 1: `slope`, `intercept`, and `noise`.** These parameters determine +the linear relationship; resimulating them is like picking a new line. We +know from our importance sampling experiment above that before too long, +we're bound to sample something close to the right line. + +**Blocks 2 through N+1: Each `is_outlier`, in its own block.** One problem we +saw with importance sampling in this problem was that it tried to sample +_every_ outlier classification at once, when in reality the chances of a +single sample that correctly classifies all the points are very low. Here, we +can choose to resimulate each `is_outlier` choice separately, and for each +one, decide whether to use the resimulated value or not. + +**Block N+2: `prob_outlier`.** Finally, we can propose a new `prob_outlier` +value; in general, we can expect to accept the proposal when it is in line +with the current hypothesized proportion of `is_outlier` choices that are +set to `true`. + +Resimulating a block of variables is the simplest form of update that Gen's +[`metropolis_hastings`](@ref) operator (or [`mh`](@ref) for short) supports. +When supplied with a _current trace_ and a _selection_ of trace addresses to +resimulate, [`mh`](@ref) performs the resimulation and the appropriate +accept/reject check, then returns a possibly updated trace, along with a Boolean +indicating whether the update was accepted or not. A selection is created using +the [`select`](@ref) method. So a single update of the scheme we proposed above +would look like this: + +```@example mcmc_map_tutorial +# Perform a single block resimulation update of a trace. 
+function block_resimulation_update(tr) + # Block 1: Update the line's parameters + line_params = select(:noise, :slope, :intercept) + (tr, _) = mh(tr, line_params) + + # Blocks 2-N+1: Update the outlier classifications + (xs,) = get_args(tr) + n = length(xs) + for i=1:n + (tr, _) = mh(tr, select(:data => i => :is_outlier)) + end + + # Block N+2: Update the prob_outlier parameter + (tr, _) = mh(tr, select(:prob_outlier)) + + # Return the updated trace + tr +end +nothing # hide +``` + +All that's left is to (a) obtain an initial trace, and then (b) run that +update in a loop for as long as we'd like: + +```@example mcmc_map_tutorial +function block_resimulation_inference(xs, ys, observations) + observations = make_constraints(ys) + (tr, _) = generate(regression_with_outliers, (xs,), observations) + for iter=1:500 + tr = block_resimulation_update(tr) + end + tr +end +nothing # hide +``` + +Let's test it out: + +```@example mcmc_map_tutorial +scores = Vector{Float64}(undef, 10) +for i=1:10 + @time tr = block_resimulation_inference(xs, ys, observations) + scores[i] = get_score(tr) +end +println("Log probability: ", logmeanexp(scores)) +``` + +We note that this is significantly better than importance sampling, even if +we run importance sampling for about the same amount of (wall-clock) time per +sample: + +```@example mcmc_map_tutorial +scores = Vector{Float64}(undef, 10) +for i=1:10 + @time (tr, _) = importance_resampling(regression_with_outliers, (xs,), observations, 17000) + scores[i] = get_score(tr) +end +println("Log probability: ", logmeanexp(scores)) +``` + +It's one thing to see a log probability increase; it's better to understand +what the inference algorithm is actually doing, and to see _why_ it's doing +better. + +A great tool for debugging and improving MCMC algorithms is visualization. We +can use `Plots.@animate` to produce an +animated visualization: + +```@example mcmc_map_tutorial +t, = generate(regression_with_outliers, (xs,), observations) + +viz = Plots.@animate for i in 1:500 + global t + t = block_resimulation_update(t) + visualize_trace(t; title="Iteration $i/500") +end +gif(viz) +``` + +We can see that although the algorithm keeps changing the inferences of which +points are inliers and outliers, it has a harder time refining the continuous +parameters. We address this challenge next. + +## MCMC Part 2: Gaussian Drift MH + +So far, we've seen one form of incremental trace update: + +```julia +(tr, did_accept) = mh(tr, select(:address1, :address2, ...)) +``` + +This update is incremental in that it only proposes changes to part of a +trace (the selected addresses). But when computing _what_ changes to propose, +it ignores the current state completely and resimulates all-new values from +the model. + +That wholesale resimulation of values is often not the best way to search for +improvements. To that end, Gen also offers a more general flavor of MH: + +```julia +(tr, did_accept) = mh(tr, custom_proposal, custom_proposal_args) +``` + +A "custom proposal" is just what it sounds like: whereas before, we were +using the _default resimulation proposal_ to come up with new values for the +selected addresses, we can now pass in a generative function that samples +proposed values however it wants. 
+ +For example, here is a custom proposal that takes in a current trace, and +proposes a new slope and intercept by randomly perturbing the existing +values: + +```@example mcmc_map_tutorial +@gen function line_proposal(current_trace) + slope ~ normal(current_trace[:slope], 0.5) + intercept ~ normal(current_trace[:intercept], 0.5) +end +nothing # hide +``` + +This is often called a "Gaussian drift" proposal, because it essentially amounts +to proposing steps of a random walk. (What makes it different from a random walk +is that we will still use an MH accept/reject step to make sure we don't wander +into areas of very low probability.) + +To use the proposal, we write: + +```julia +(tr, did_accept) = mh(tr, line_proposal, ()) +``` + +Two things to note: + +1. We no longer need to pass a selection of addresses. Instead, Gen assumes + that whichever addresses are sampled by the proposal (in this case, + `:slope` and `:intercept`) are being proposed to. + +2. The argument list to the proposal is an empty tuple, `()`. The + `line_proposal` generative function does expect an argument, the previous + trace, but this is supplied automatically to all MH custom proposals + (a proposal generative function for use with [`mh`](@ref) must take as its + first argument the current trace of the model). + +Let's swap it into our update: + +```@example mcmc_map_tutorial +function gaussian_drift_update(tr) + # Gaussian drift on line params + (tr, _) = mh(tr, line_proposal, ()) + + # Block resimulation: Update the outlier classifications + (xs,) = get_args(tr) + n = length(xs) + for i=1:n + (tr, _) = mh(tr, select(:data => i => :is_outlier)) + end + + # Block resimulation: Update the prob_outlier parameter + (tr, w) = mh(tr, select(:prob_outlier)) + (tr, w) = mh(tr, select(:noise)) + tr +end +nothing # hide +``` + +If we compare the Gaussian Drift proposal visually with our old algorithm, we +can see the new behavior: + +```@example mcmc_map_tutorial +tr1, = generate(regression_with_outliers, (xs,), observations) +tr2 = tr1 + +viz = Plots.@animate for i in 1:300 + global tr1, tr2 + tr1 = gaussian_drift_update(tr1) + tr2 = block_resimulation_update(tr2) + Plots.plot(visualize_trace(tr1; title="Drift Kernel (Iter $i)"), + visualize_trace(tr2; title="Resim Kernel (Iter $i)")) +end +gif(viz) +``` + +------------------------------------------------------------------- + +### Exercise: Analyzing the algorithms + +Run the cell above several times. Compare the two +algorithms with respect to the following: + +- How fast do they find a relatively good line? + +- Does one of them tend to get stuck more than the other? Under what + conditions? Why? + +------------------------------------------------------------------- + +A more quantitative comparison demonstrates that our change has +improved our inference quality: + +```@example mcmc_map_tutorial +function gaussian_drift_inference(xs, observations) + (tr, _) = generate(regression_with_outliers, (xs,), observations) + for iter=1:500 + tr = gaussian_drift_update(tr) + end + tr +end + +scores = Vector{Float64}(undef, 10) +for i=1:10 + @time tr = gaussian_drift_inference(xs, observations) + scores[i] = get_score(tr) +end +println("Log probability: ", logmeanexp(scores)) +``` + +## MCMC Part 3: Heuristic Guidance + +In this section, we'll look at another strategy for improving MCMC inference: +using arbitrary heuristics to make smarter proposals. 
In particular, we'll +use a method called "Random Sample Consensus" (or RANSAC) to quickly find +promising settings of the slope and intercept parameters. + +RANSAC works as follows: +1. We repeatedly choose a small random subset of the points, say, of size 3. +2. We do least-squares linear regression to find a line of best fit for those + points. +3. We count how many points (from the entire set) are near the line we found. +4. After a suitable number of iterations (say, 10), we return the line that + had the highest score. + +Here's our implementation of the algorithm in Julia: + +```@example mcmc_map_tutorial +import StatsBase + +struct RANSACParams + """the number of random subsets to try""" + iters::Int + + """the number of points to use to construct a hypothesis""" + subset_size::Int + + """the error threshold below which a datum is considered an inlier""" + eps::Float64 + + function RANSACParams(iters, subset_size, eps) + if iters < 1 + error("iters < 1") + end + new(iters, subset_size, eps) + end +end + +function ransac(xs::Vector{Float64}, ys::Vector{Float64}, params::RANSACParams) + best_num_inliers::Int = -1 + best_slope::Float64 = NaN + best_intercept::Float64 = NaN + for i=1:params.iters + # select a random subset of points + rand_ind = StatsBase.sample(1:length(xs), params.subset_size, replace=false) + subset_xs = xs[rand_ind] + subset_ys = ys[rand_ind] + + # estimate slope and intercept using least squares + A = hcat(subset_xs, ones(length(subset_xs))) + slope, intercept = A \ subset_ys # use backslash operator for least sq soln + + ypred = intercept .+ slope * xs + + # count the number of inliers for this (slope, intercept) hypothesis + inliers = abs.(ys - ypred) .< params.eps + num_inliers = sum(inliers) + + if num_inliers > best_num_inliers + best_slope, best_intercept = slope, intercept + best_num_inliers = num_inliers + end + end + + # return the hypothesis that resulted in the most inliers + (best_slope, best_intercept) +end +nothing # hide +``` + +We can now wrap it in a Gen proposal that calls out to RANSAC, then samples a +slope and intercept near the one it proposed. + +```@example mcmc_map_tutorial +@gen function ransac_proposal(prev_trace, xs, ys) + (slope_guess, intercept_guess) = ransac(xs, ys, RANSACParams(10, 3, 1.)) + slope ~ normal(slope_guess, 0.1) + intercept ~ normal(intercept_guess, 1.0) +end +nothing # hide +``` + +(Notice that although `ransac` makes random choices, they are not addressed +(and they happen outside of a Gen generative function), so Gen cannot reason +about them. This is OK (see [1]). Writing proposals that have traced internal +randomness (i.e., that make traced random choices that are not directly used +in the proposal) can lead to better inference, but requires the use of a more +complex version of Gen's [`mh`](@ref) operator, which is beyond the scope of +this tutorial.) + +[1] [Using probabilistic programs as proposals](https://arxiv.org/abs/1801.03612), Marco F. Cusumano-Towner, Vikash K. Mansinghka, 2018. + +One iteration of our update algorithm will now look like this: + +```@example mcmc_map_tutorial +function ransac_update(tr) + # Use RANSAC to (potentially) jump to a better line + # from wherever we are + (tr, _) = mh(tr, ransac_proposal, (xs, ys)) + + # Spend a while refining the parameters, using Gaussian drift + # to tune the slope and intercept, and resimulation for the noise + # and outliers. 
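    # (Using 20 refinement sweeps per RANSAC jump is a tuning choice rather
    # than a requirement of MH; it simply gives the drift and resimulation
    # moves time to settle around each newly proposed line.)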
+ for j=1:20 + (tr, _) = mh(tr, select(:prob_outlier)) + (tr, _) = mh(tr, select(:noise)) + (tr, _) = mh(tr, line_proposal, ()) + # Reclassify outliers + for i=1:length(get_args(tr)[1]) + (tr, _) = mh(tr, select(:data => i => :is_outlier)) + end + end + tr +end +``` + +We can now run our main loop for just 5 iterations, and achieve pretty good +results. (Of course, since we do 20 inner loop iterations in `ransac_update`, +this is really closer to 100 iterations.) The running time is significantly +less than before, without a real dip in quality: + +```@example mcmc_map_tutorial +function ransac_inference(xs, ys, observations) + (slope, intercept) = ransac(xs, ys, RANSACParams(10, 3, 1.)) + slope_intercept_init = choicemap() + slope_intercept_init[:slope] = slope + slope_intercept_init[:intercept] = intercept + (tr, _) = generate(regression_with_outliers, (xs,), merge(observations, slope_intercept_init)) + for iter=1:5 + tr = ransac_update(tr) + end + tr +end + +scores = Vector{Float64}(undef, 10) +for i=1:10 + @time tr = ransac_inference(xs, ys, observations) + scores[i] = get_score(tr) +end +println("Log probability: ", logmeanexp(scores)) +``` + +Let's visualize the algorithm: + +```@example mcmc_map_tutorial +(slope, intercept) = ransac(xs, ys, RANSACParams(10, 3, 1.)) +slope_intercept_init = choicemap() +slope_intercept_init[:slope] = slope +slope_intercept_init[:intercept] = intercept +(tr, _) = generate(regression_with_outliers, (xs,), merge(observations, slope_intercept_init)) + +viz = Plots.@animate for i in 1:100 + global tr + + if i % 20 == 0 + (tr, _) = mh(tr, ransac_proposal, (xs, ys)) + end + + # Spend a while refining the parameters, using Gaussian drift + # to tune the slope and intercept, and resimulation for the noise + # and outliers. + (tr, _) = mh(tr, select(:prob_outlier)) + (tr, _) = mh(tr, select(:noise)) + (tr, _) = mh(tr, line_proposal, ()) + + # Reclassify outliers + for i=1:length(get_args(tr)[1]) + (tr, _) = mh(tr, select(:data => i => :is_outlier)) + end + + visualize_trace(tr; title="Iteration $i") +end +gif(viz) +``` + +------------------------------------------------------------------- +### Exercise +#### Improving the heuristic +Currently, the RANSAC heuristic does not use the current trace's information +at all. Try changing it to use the current state as follows: Instead of a +constant `eps` parameter that controls whether a point is considered an inlier, +make this decision based on the currently hypothesized noise level. +Specifically, set `eps` to be equal to the `noise` parameter of the trace. + +Examine whether this improves inference. + +```@example mcmc_map_tutorial +# Modify the function below (which currently is just a copy of `ransac_proposal`) +# as described above so that implements a RANSAC proposal with inlier +# status decided by the noise parameter of the previous trace +# (do not modify the return value, which is unneccessary for a proposal, +# but used for testing) + +@gen function ransac_proposal_noise_based(prev_trace, xs, ys) + params = RANSACParams(10, 3, 1.) + (slope_guess, intercept_guess) = ransac(xs, ys, params) + slope ~ normal(slope_guess, 0.1) + intercept ~ normal(intercept_guess, 1.0) + return params, slope, intercept # (return values just for testing) +end +nothing # hide +``` + +```@raw html +
Solution +``` + +```julia +@gen function ransac_proposal_noise_based(prev_trace, xs, ys) + eps = prev_trace[:noise] + params = RANSACParams(10, 3, eps) + (slope_guess, intercept_guess) = ransac(xs, ys, params) + slope ~ normal(slope_guess, 0.1) + intercept ~ normal(intercept_guess, 1.0) + return params, slope, intercept # (return values just for testing) +end +``` + +```@raw html +
+``` + +------------------------------------------------------------------- + +The code below runs the RANSAC inference as above, but using `ransac_proposal_noise_based`. + +```@example mcmc_map_tutorial +function ransac_update_noise_based(tr) + # Use RANSAC to (potentially) jump to a better line + (tr, _) = mh(tr, ransac_proposal_noise_based, (xs, ys)) + # Refining the parameters + for j=1:20 + (tr, _) = mh(tr, select(:prob_outlier)) + (tr, _) = mh(tr, select(:noise)) + (tr, _) = mh(tr, line_proposal, ()) + # Reclassify outliers + for i=1:length(get_args(tr)[1]) + (tr, _) = mh(tr, select(:data => i => :is_outlier)) + end + end + tr +end +function ransac_inference_noise_based(xs, ys, observations) + # Use an initial epsilon value of 1. + (slope, intercept) = ransac(xs, ys, RANSACParams(10, 3, 1.)) + slope_intercept_init = choicemap() + slope_intercept_init[:slope] = slope + slope_intercept_init[:intercept] = intercept + (tr, _) = generate(regression_with_outliers, (xs,), merge(observations, slope_intercept_init)) + for iter=1:5 + tr = ransac_update_noise_based(tr) + end + tr +end + +scores = Vector{Float64}(undef, 10) +for i=1:10 + @time tr = ransac_inference_noise_based(xs, ys, observations) + scores[i] = get_score(tr) +end +println("Log probability: ", logmeanexp(scores)) +``` + +------------------------------------------------------------------- + +### Exercise +Implement a heuristic-based proposal that selects the points that are +currently classified as *inliers*, finds the line of best fit for this +subset of points, and adds some noise. + +_Hint_: you can get the result for linear regression using least squares approximation by +solving a linear system using Julia's [backslash operator, `\`](https://docs.julialang.org/en/v1/base/math/#Base.:\\-Tuple{Any,Any}) (as is done in the `ransac` function, above). + +We provide some starter code. You can test your solution by modifying the plotting code above. + +```@example mcmc_map_tutorial +@gen function inlier_heuristic_proposal(prev_trace, xs, ys) + # Put your code below, ensure that you compute values for + # inlier_slope, inlier_intercept and delete the two placeholders + # below. + + inlier_slope = 10000. # + inlier_intercept = 10000. # + + + # Make a noisy proposal. + slope ~ normal(inlier_slope, 0.5) + intercept ~ normal(inlier_intercept, 0.5) + # We return values here for testing; normally, proposals don't have to return values. + return inlier_slope, inlier_intercept +end + +function inlier_heuristic_update(tr) + # Use inlier heuristics to (potentially) jump to a better line + # from wherever we are. + (tr, _) = mh(tr, inlier_heuristic_proposal, (xs, ys)) + # Spend a while refining the parameters, using Gaussian drift + # to tune the slope and intercept, and resimulation for the noise + # and outliers. + for j=1:20 + (tr, _) = mh(tr, select(:prob_outlier)) + (tr, _) = mh(tr, select(:noise)) + (tr, _) = mh(tr, line_proposal, ()) + # Reclassify outliers + for i=1:length(get_args(tr)[1]) + (tr, _) = mh(tr, select(:data => i => :is_outlier)) + end + end + tr +end +nothing # hide +``` + +```@example mcmc_map_tutorial +tr, = Gen.generate(regression_with_outliers, (xs,), observations) +viz = @animate for i in 1:50 + global tr + tr = inlier_heuristic_update(tr) + visualize_trace(tr; title="Iteration $i") +end +gif(viz) +``` + +```@raw html +
Solution +``` + +```julia +@gen function inlier_heuristic_proposal(prev_trace, xs, ys) + # get_indeces for inliers. + inlier_indeces = filter( + i -> !prev_trace[:data => i => :is_outlier], + 1:length(xs) + ) + xs_inlier = xs[inlier_indeces] + ys_inlier = ys[inlier_indeces] + # estimate slope and intercept using least squares. + A = hcat(xs_inlier, ones(length(xs_inlier))) + inlier_slope, inlier_intercept = A \ ys_inlier + + # Make a noisy proposal. + slope ~ normal(inlier_slope, 0.5) + intercept ~ normal(inlier_intercept, 0.5) + # We return values here for testing; normally, proposals don't have to return values. + return inlier_slope, inlier_intercept +end; + +function inlier_heuristic_update(tr) + # Use inlier heuristics to (potentially) jump to a better line + # from wherever we are. + (tr, _) = mh(tr, inlier_heuristic_proposal, (xs, ys)) + # Spend a while refining the parameters, using Gaussian drift + # to tune the slope and intercept, and resimulation for the noise + # and outliers. + for j=1:20 + (tr, _) = mh(tr, select(:prob_outlier)) + (tr, _) = mh(tr, select(:noise)) + (tr, _) = mh(tr, line_proposal, ()) + # Reclassify outliers + for i=1:length(get_args(tr)[1]) + (tr, _) = mh(tr, select(:data => i => :is_outlier)) + end + end + tr +end +``` + +```@raw html +
+``` + +------------------------------------------------------------------- + +### Exercise: Initialization + +In our inference program above, when generating an initial trace on which to +iterate, we initialize the slope and intercept to values proposed by RANSAC. +If we don't do this, the performance decreases sharply, despite the fact that +we still propose new slope/intercept pairs from RANSAC once the loop starts. +Why is this? + +------------------------------------------------------------------- + +## MAP Optimization + +Everything we've done so far has been within the MCMC framework. But +sometimes you're not interested in getting posterior samples—sometimes you +just want a single likely explanation for your data. Gen also provides tools +for _maximum a posteriori_ estimation ("MAP estimation"), the problem of +finding a trace that maximizes the posterior probability under the model +given observations. + +For example, let's say we wanted to take a trace and assign each point's +`is_outlier` score to the most likely possibility. We can do this by +iterating over both possible traces, scoring them, and choosing the one with +the higher score. We can do this using Gen's [`update`](@ref) function, +which allows us to manually update a trace to satisfy some constraints: + +```@example mcmc_map_tutorial +function is_outlier_map_update(tr) + (xs,) = get_args(tr) + for i=1:length(xs) + constraints = choicemap(:prob_outlier => 0.1) + constraints[:data => i => :is_outlier] = false + (trace1,) = update(tr, (xs,), (NoChange(),), constraints) + constraints[:data => i => :is_outlier] = true + (trace2,) = update(tr, (xs,), (NoChange(),), constraints) + tr = (get_score(trace1) > get_score(trace2)) ? trace1 : trace2 + end + tr +end +nothing # hide +``` + +For continuous parameters, we can use Gen's [`map_optimize`](@ref) function, +which uses _automatic differentiation_ to shift the selected parameters in the +direction that causes the probability of the trace to increase most sharply: + +```julia +tr = map_optimize(tr, select(:slope, :intercept), max_step_size=1., min_step_size=1e-5) +``` + +Putting these updates together, we can write an inference program that uses our +RANSAC algorithm from above to get an initial trace, then tunes it using +optimization: + +```@example mcmc_map_tutorial +using StatsBase: mean + +(slope, intercept) = ransac(xs, ys, RANSACParams(10, 3, 1.)) +slope_intercept_init = choicemap() +slope_intercept_init[:slope] = slope +slope_intercept_init[:intercept] = intercept +(tr,) = generate(regression_with_outliers, (xs,), merge(observations, slope_intercept_init)) + + +ransac_score, final_score = 0, 0 +viz = Plots.@animate for i in 1:35 + global tr, ransac_score + if i < 6 + tr = ransac_update(tr) + else + tr = map_optimize(tr, select(:slope, :intercept), max_step_size=1., min_step_size=1e-5) + tr = map_optimize(tr, select(:noise), max_step_size=1e-2, min_step_size=1e-5) + tr = is_outlier_map_update(tr) + optimal_prob_outlier = mean([tr[:data => i => :is_outlier] for i in 1:length(xs)]) + optimal_prob_outlier = min(0.5, max(0.05, optimal_prob_outlier)) + tr, = update(tr, (xs,), (NoChange(),), choicemap(:prob_outlier => optimal_prob_outlier)) + end + + if i == 5 + ransac_score = get_score(tr) + end + + visualize_trace(tr; title="Iteration $i $(i < 6 ? "(RANSAC init)" : "(MAP optimization)")") +end +final_score = get_score(tr) + +println("Score after ransac: $(ransac_score). 
Final score: $(final_score).")
gif(viz)
```

Below, we evaluate the algorithm and see that it achieves our best scores
yet, which is exactly what it is designed to do:

```@example mcmc_map_tutorial
function map_inference(xs, ys, observations)
    (slope, intercept) = ransac(xs, ys, RANSACParams(10, 3, 1.))
    slope_intercept_init = choicemap()
    slope_intercept_init[:slope] = slope
    slope_intercept_init[:intercept] = intercept
    (tr, _) = generate(regression_with_outliers, (xs,), merge(observations, slope_intercept_init))
    for iter=1:5
        tr = ransac_update(tr)
    end

    for iter = 1:20
        # Take a single gradient step on the line parameters.
        tr = map_optimize(tr, select(:slope, :intercept), max_step_size=1., min_step_size=1e-5)
        tr = map_optimize(tr, select(:noise), max_step_size=1e-2, min_step_size=1e-5)

        # Choose the most likely classification of outliers.
        tr = is_outlier_map_update(tr)

        # Update the prob outlier
        choices = get_choices(tr)
        optimal_prob_outlier = count(i -> choices[:data => i => :is_outlier], 1:length(xs)) / length(xs)
        optimal_prob_outlier = min(0.5, max(0.05, optimal_prob_outlier))
        (tr, _) = update(tr, (xs,), (NoChange(),), choicemap(:prob_outlier => optimal_prob_outlier))
    end
    tr
end

scores = Vector{Float64}(undef, 10)
for i=1:10
    @time tr = map_inference(xs, ys, observations)
    scores[i] = get_score(tr)
end
println("Log probability: ", logmeanexp(scores))
```

This doesn't necessarily mean that it's "better," though. It finds the most
probable explanation of the data, which is a different problem from the one
we tackled with MCMC inference. There, the goal was to sample from the
posterior, which allows us to better characterize our uncertainty. Using
MCMC, there might be a borderline point that is sometimes classified as an
outlier and sometimes not, reflecting our uncertainty; with MAP optimization,
we will always be shown the most probable answer.

---------

Below we generate a dataset for which there are two distinct possible explanations
(the blue and green lines) under our model `regression_with_outliers`.

```@example mcmc_map_tutorial
function make_bimodal_dataset(n)
    Random.seed!(4)
    prob_outlier = 0.2
    true_inlier_noise = 0.5
    true_outlier_noise = 5.0
    true_slope1 = 1
    true_intercept1 = 0
    true_slope2 = -2/3
    true_intercept2 = 0
    xs = collect(range(-5, stop=5, length=n))
    ys = Float64[]
    for (i, x) in enumerate(xs)
        if rand() < prob_outlier
            y = randn() * true_outlier_noise
        else
            if rand((true,false))
                y = true_slope1 * x + true_intercept1 + randn() * true_inlier_noise
            else
                y = true_slope2 * x + true_intercept2 + randn() * true_inlier_noise
            end
        end
        push!(ys, y)
    end
    xs,ys,true_slope1,true_slope2,true_intercept1,true_intercept2
end

(xs, ys_bimodal, m1,m2,b1,b2) = make_bimodal_dataset(20)
observations_bimodal = make_constraints(ys_bimodal)

Plots.scatter(xs, ys_bimodal, color="black", xlabel="X", ylabel="Y", label=nothing, title="Bimodal data")
Plots.plot!(xs,m1.*xs.+b1, color="blue", label=nothing)
Plots.plot!(xs,m2.*xs.+b2, color="green", label=nothing)
```

### Exercise

For this dataset, the code below will run (i) Metropolis-Hastings with a
Gaussian drift proposal and (ii) MAP optimization, using implementations from
above. Make sure you understand what it is doing. Do both algorithms explore
both modes (i.e. both possible explanations)? Play with running the
algorithms multiple times.

If one or both algorithms fails to explore both modes, explain in a few
sentences why you think this is.
+

```@example mcmc_map_tutorial
(slope, intercept) = ransac(xs, ys_bimodal, RANSACParams(10, 3, 1.))
slope_intercept_init = choicemap()
slope_intercept_init[:slope] = slope
slope_intercept_init[:intercept] = intercept
(tr,) = generate(
    regression_with_outliers, (xs,),
    merge(observations_bimodal, slope_intercept_init))

tr_drift = tr
tr_map = tr

viz = Plots.@animate for i in 1:305
    global tr_map, tr_drift
    if i < 6
        tr_drift = ransac_update(tr)
        tr_map = tr_drift
    else
        # Take a single gradient step on the line parameters.
        tr_map = map_optimize(tr_map, select(:slope, :intercept), max_step_size=1., min_step_size=1e-5)
        tr_map = map_optimize(tr_map, select(:noise), max_step_size=1e-2, min_step_size=1e-5)
        # Choose the most likely classification of outliers.
        tr_map = is_outlier_map_update(tr_map)
        # Update the prob outlier
        optimal_prob_outlier = mean([tr_map[:data => i => :is_outlier] for i in 1:length(xs)])
        optimal_prob_outlier = min(0.5, max(0.05, optimal_prob_outlier))
        tr_map, = update(tr_map, (xs,), (NoChange(),), choicemap(:prob_outlier => optimal_prob_outlier))

        # Gaussian drift update:
        tr_drift = gaussian_drift_update(tr_drift)
    end

    Plots.plot(visualize_trace(tr_drift; title="Drift (Iter $i)"), visualize_trace(tr_map; title="MAP (Iter $i)"))
end

drift_final_score = get_score(tr_drift)
map_final_score = get_score(tr_map)
println("i. MH Gaussian drift final score: $(drift_final_score)")
println("ii. MAP final score: $(map_final_score)")

gif(viz)
```

The above was good for an overall qualitative examination, but let's also
examine a little more quantitatively how often the two proposals explore
the two modes, by running multiple times and keeping track of how often the
slope is positive/negative for each, for a few different initializations.

**Warning: the following cell may take a few minutes to run.**

```@example mcmc_map_tutorial
total_runs = 25

for (index, value) in enumerate([(1, 0), (-1, 0), ransac(xs, ys_bimodal, RANSACParams(10, 3, 1.))])
    n_pos_drift = n_neg_drift = n_pos_map = n_neg_map = 0

    for i=1:total_runs
        pos_drift = neg_drift = pos_map = neg_map = false

        #### RANSAC for initializing
        (slope, intercept) = value # ransac(xs, ys_bimodal, RANSACParams(10, 3, 1.))
        slope_intercept_init = choicemap()
        slope_intercept_init[:slope] = slope
        slope_intercept_init[:intercept] = intercept
        (tr,) = generate(
            regression_with_outliers, (xs,),
            merge(observations_bimodal, slope_intercept_init))
        for iter=1:5
            tr = ransac_update(tr)
        end
        ransac_score = get_score(tr)
        tr_drift = tr # version of the trace for the Gaussian drift algorithm
        tr_map = tr # version of the trace for the MAP optimization

        #### Refine the parameters according to each of the algorithms
        for iter = 1:300
            # MAP optimization:
            # Take a single gradient step on the line parameters.
            tr_map = map_optimize(tr_map, select(:slope, :intercept), max_step_size=1., min_step_size=1e-5)
            tr_map = map_optimize(tr_map, select(:noise), max_step_size=1e-2, min_step_size=1e-5)
            # Choose the most likely classification of outliers.
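            # Note: unlike the Gaussian drift update below, every step in this
            # MAP refinement is deterministic given the current trace, which is
            # worth keeping in mind for the exercise above.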
+ tr_map = is_outlier_map_update(tr_map) + # Update the prob outlier + optimal_prob_outlier = count(i -> tr_map[:data => i => :is_outlier], 1:length(xs)) / length(xs) + optimal_prob_outlier = min(0.5, max(0.05, optimal_prob_outlier)) + (tr_map, _) = update(tr_map, (xs,), (NoChange(),), choicemap(:prob_outlier => optimal_prob_outlier)) + + # Gaussian drift update: + tr_drift = gaussian_drift_update(tr_drift) + + if tr_drift[:slope] > 0 + pos_drift = true + elseif tr_drift[:slope] < 0 + neg_drift = true + end + if tr_map[:slope] > 0 + pos_map = true + elseif tr_map[:slope] < 0 + neg_map = true + end + end + + if pos_drift + n_pos_drift += 1 + end + if neg_drift + n_neg_drift += 1 + end + if pos_map + n_pos_map += 1 + end + if neg_map + n_neg_map += 1 + end + end + (slope, intercept) = value + println("\n\nWITH INITIAL SLOPE $(slope) AND INTERCEPT $(intercept)") + println("TOTAL RUNS EACH: $(total_runs)") + println("\n times neg. slope times pos. slope") + println("\ndrift: $(n_neg_drift) $(n_pos_drift)") + println("\nMAP: $(n_neg_map) $(n_pos_map)") +end +``` + +Although this experiment is imperfect, we can broadly see that the drift kernel +often explores both modes within a single run, whereas this is rarer for the MAP +kernel (in 25 runs, the MAP kernel visits on average 1.08 of the 2 modes, +whereas the drift kernel visits 1.6). \ No newline at end of file diff --git a/docs/src/tutorials/modeling_in_gen.md b/docs/src/tutorials/modeling_in_gen.md index 1a0c7b724..0893e4da8 100644 --- a/docs/src/tutorials/modeling_in_gen.md +++ b/docs/src/tutorials/modeling_in_gen.md @@ -1,4 +1,4 @@ -# [Introduction to Modeling in Gen](@id introduction_to_modeling_in_gen) +# [Introduction to Modeling in Gen](@id modeling_tutorial) Welcome! In this tutorial, you'll get your feet wet with Gen, a multi-paradigm platform for probabilistic modeling and inference. By @@ -444,7 +444,7 @@ observed data. That is, the inference program will try to find a trace that well explains the dataset we created above. We can inspect that trace to find estimates of the slope and intercept of a line that fits the data. -Functions like `importance_resampling` expect us to provide a _model_ and +Functions like [`importance_resampling`](@ref) expect us to provide a _model_ and also an _choice map_ representing our data set and relating it to the model. A choice map maps random choice addresses from the model to values from our data set. Here, we want to tie model addresses like `(:y, 4)` to data set @@ -476,7 +476,7 @@ trace = do_inference(line_model, xs, ys, 100) render_trace(trace) ``` -We see that `importance_resampling` found a reasonable slope and intercept to explain the data. We can also visualize many samples in a grid: +We see that [`importance_resampling`](@ref) found a reasonable slope and intercept to explain the data. We can also visualize many samples in a grid: ```@example modeling_tutorial traces = [do_inference(line_model, xs, ys, 100) for _=1:10]; @@ -532,7 +532,7 @@ Write an inference program that generates traces of `sine_model` that explain th ------------------ -## Predicting new data +## Predicting New Data What if we'd want to predict `ys` given `xs`? @@ -786,7 +786,7 @@ Plots.plot(fixed_noise_plot, inferred_noise_plot) ------------------------- -## Calling other generative functions +## Calling Other Generative Functions In addition to making random choices, generative functions can invoke other generative functions. 
To illustrate this, we will write a probabilistic model @@ -929,7 +929,7 @@ Hint: To estimate the posterior probability that the data was generated by the s ------- -## Modeling with an unbounded number of parameters +## Modeling with an Unbounded Number of Parameters Gen's built-in modeling language can be used to express models that use an unbounded number of parameters. This section walks you through development of diff --git a/docs/src/tutorials/scaling_with_sml.md b/docs/src/tutorials/scaling_with_sml.md index 3b039c4df..a962d9d57 100644 --- a/docs/src/tutorials/scaling_with_sml.md +++ b/docs/src/tutorials/scaling_with_sml.md @@ -334,7 +334,7 @@ plot!(ns, static_with_map_times, label="with map and static outer fn") plot!(ns, fully_static_with_map_times, label="with map and static outer and inner fns") ``` -# Checking the Inference Programs +## Checking the Inference Programs Before wrapping up, let's confirm that all of our models are giving good results: diff --git a/docs/src/tutorials/vi.md b/docs/src/tutorials/vi.md index 985377b8a..7dc49814a 100644 --- a/docs/src/tutorials/vi.md +++ b/docs/src/tutorials/vi.md @@ -36,7 +36,7 @@ end nothing # hide ``` -Since `x_mu` and `x_log_std`are not fixed to particular values, this generative function defines a *family* of distributions, not just one. Note that we intentionally chose to parameterize the distribution by the log standard deviation `x_log_std`, so that every parameter has full support over the real line, and we can perform unconstrained optimization of the parameters. +Since `x_mu` and `x_log_std` are not fixed to particular values, this generative function defines a *family* of distributions, not just one. Note that we intentionally chose to parameterize the distribution by the log standard deviation `x_log_std`, so that every parameter has full support over the real line, and we can perform unconstrained optimization of the parameters. To perform variational inference, we need to initialize the variational parameters to their starting values: