diff --git a/.gitignore b/.gitignore index 3c097ca05..eef389952 100644 --- a/.gitignore +++ b/.gitignore @@ -28,4 +28,5 @@ venv site_libs .DS_Store index_files -digest.txt \ No newline at end of file +digest.txt +*.bak \ No newline at end of file diff --git a/README.md b/README.md index ffab706cf..374ce8d16 100644 --- a/README.md +++ b/README.md @@ -1,106 +1,105 @@ # Turing.jl Documentation and Tutorials -**https://turinglang.org/docs/** +**https://turinglang.org/docs/** -## Contributing +## Contributing -The easiest way to contribute to the documentation is to simply open a pull request. -A preview version of the documentation is built for PRs, so you can see how your changes look without having to build the entire site locally. -(Note that if you are editing a tutorial that takes a long time to run, this feedback may take a while.) +The easiest way to contribute to the documentation is to simply open a pull request. +A preview version of the documentation is built for pull requests, so you can see how your changes look without having to build the entire site locally. +(Note that if you are editing a tutorial that takes a long time to run, this feedback may take a while.) -The `main` branch contains the Quarto source code. -The HTML documentation is automatically built using GitHub Actions, and deployed to the `gh-pages` branch, so you do not have to build and commit the HTML files yourself. +The `main` branch contains the Quarto source code. +The HTML documentation is automatically built using GitHub Actions, and deployed to the `gh-pages` branch, so you do not have to build and commit the HTML files yourself. -## Local development +## Local development -If you wish to render the docs website locally, you'll need to have [Quarto](https://quarto.org/docs/download/) installed (at least version 1.6.31) on your computer. -Then: +If you wish to render the docs website locally, you'll need to have [Quarto](https://quarto.org/docs/download/) installed (at least version 1.6.31) on your computer. +Then: -1. Clone this repository: +1. Clone this repository: ```bash git clone https://github.com/TuringLang/docs - ``` + ``` -2. Navigate into the cloned directory: +2. Navigate into the cloned directory: ```bash cd docs - ``` + ``` -3. Instantiate the project environment: +3. Instantiate the project environment: ```bash julia --project=. -e 'using Pkg; Pkg.instantiate()' - ``` + ``` -4. Preview the website using Quarto. +4. Preview the website using Quarto. -> [!WARNING] -> This will take a _very_ long time, as it will build every tutorial from scratch. See [below](#faster-rendering) for ways to speed this up. +> [!WARNING] +> This will take a _very_ long time, as it will build every tutorial from scratch. See [below](#faster-rendering) for ways to speed this up. ```bash quarto preview - ``` + ``` - This will launch a local server at http://localhost:4200/, which you can view in your web browser by navigating to the link shown in your terminal. + This will launch a local server at http://localhost:4200/, which you can view in your web browser by navigating to the link shown in your terminal. -5. Render the website locally: +5. Render the website locally: ```bash quarto render - ``` + ``` - This will build the entire documentation and place the output in the `_site` folder. - You can then view the rendered website by launching a HTTP server from that directory, e.g. using Python: + This will build the entire documentation and place the output in the `_site` folder. 
+ You can then view the rendered website by launching an HTTP server from that directory, e.g. using Python: ```bash cd _site python -m http.server 8000 - ``` + ``` - Then, navigate to http://localhost:8000/ in your web browser. + Then, navigate to http://localhost:8000/ in your web browser. -## Faster rendering +## Faster rendering -Note that rendering the entire documentation site can take a long time (usually multiple hours). -If you wish to speed up local rendering, there are two options available: +Note that rendering the entire documentation site can take a long time (usually multiple hours). +If you wish to speed up local rendering, there are two options available: -1. Render a single tutorial or `qmd` file without compiling the entire site. - To do this, pass the `qmd` file as an argument to `quarto render`: +1. Render a single tutorial or `qmd` file without compiling the entire site. + To do this, pass the `qmd` file as an argument to `quarto render`: ``` quarto render path/to/index.qmd - ``` - - (Note that `quarto preview` does not support this single-file rendering.) + ``` + + (Note that `quarto preview` does not support this single-file rendering.) -2. Download the most recent `_freeze` folder from the [GitHub releases of this repo](https://github.com/turinglang/docs/releases), and place it in the root of the project. - The `_freeze` folder stores the cached outputs from a previous build of the documentation. - If it is present, Quarto will reuse the outputs of previous computations for any files for which the source is unchanged. +2. Download the most recent `_freeze` folder from the [GitHub releases of this repo](https://github.com/turinglang/docs/releases), and place it in the root of the project. + The `_freeze` folder stores the cached outputs from a previous build of the documentation. + If it is present, Quarto will reuse the outputs of previous computations for any files for which the source is unchanged. - Note that the validity of a `_freeze` folder depends on the Julia environment that it was created with, because different package versions may lead to different outputs. - In the GitHub release, the `Manifest.toml` is also provided, and you should also download this and place it in the root directory of the docs. + Note that the validity of a `_freeze` folder depends on the Julia environment that it was created with, because different package versions may lead to different outputs. + In the GitHub release, the `Manifest.toml` is also provided, and you should also download this and place it in the root directory of the docs. - If there isn't a suitably up-to-date `_freeze` folder in the releases, you can generate a new one by [triggering a run for the `create_release.yml` workflow](https://github.com/TuringLang/docs/actions/workflows/create_release.yml). - (You will need to have the appropriate permissions; please create an issue if you need help with this.) + If there isn't a suitably up-to-date `_freeze` folder in the releases, you can generate a new one by [triggering a run for the `create_release.yml` workflow](https://github.com/TuringLang/docs/actions/workflows/create_release.yml) (You will need to have the appropriate permissions; please create an issue if you need help with this). -## Troubleshooting build issues +## Troubleshooting build issues -As described in the [Quarto docs](https://quarto.org/docs/computations/julia.html#using-the-julia-engine), Quarto's Julia engine uses a worker process behind the scenes. 
-Sometimes this can result in issues with old package code not being unloaded (e.g. when package versions are upgraded).
-If you find that Quarto's execution is failing with errors that aren't reproducible via a normal REPL, try adding the `--execute-daemon-restart` flag to the `quarto render` command:
+As described in the [Quarto docs](https://quarto.org/docs/computations/julia.html#using-the-julia-engine), Quarto's Julia engine uses a worker process behind the scenes.
+Sometimes this can result in issues with old package code not being unloaded (e.g. when package versions are upgraded).
+If you find that Quarto's execution is failing with errors that aren't reproducible via a normal REPL, try adding the `--execute-daemon-restart` flag to the `quarto render` command:

 ```bash
 quarto render /path/to/index.qmd --execute-daemon-restart
-```
+```

-And also, kill any stray Quarto processes that are still running (sometimes it keeps running in the background):
+Also, kill any stray Quarto processes that are still running (sometimes Quarto keeps running in the background):

 ```bash
 pkill -9 -f quarto
-```
+```

-## License
+## License

-This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
+This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
\ No newline at end of file
diff --git a/_quarto.yml b/_quarto.yml
index 38c47bfd6..befb8ae14 100644
--- a/_quarto.yml
+++ b/_quarto.yml
@@ -208,7 +208,7 @@ format:
diff --git a/core-functionality/index.qmd b/core-functionality/index.qmd index 400418607..9d95b6269 100755 --- a/core-functionality/index.qmd +++ b/core-functionality/index.qmd @@ -266,7 +266,7 @@ using Turing # Add four processes to use for sampling. addprocs(4; exeflags="--project=$(Base.active_project())") -# Initialize everything on all the processes. +# Initialise everything on all the processes. # Note: Make sure to do this after you've already loaded Turing, # so each process does not have to precompile. # Parallel sampling may fail silently if you do not do this. @@ -329,7 +329,7 @@ Inputs to the model that have a value `missing` are treated as parameters, aka r ```{julia} @model function gdemo(x, ::Type{T}=Float64) where {T} if x === missing - # Initialize `x` if missing + # Initialise `x` if missing x = Vector{T}(undef, 2) end s² ~ InverseGamma(2, 3) @@ -344,7 +344,7 @@ model = gdemo(missing) c = sample(model, HMC(0.05, 20), 500) ``` -Note the need to initialize `x` when missing since we are iterating over its elements later in the model. +Note the need to initialise `x` when missing since we are iterating over its elements later in the model. The generated values for `x` can be extracted from the `Chains` object using `c[:x]`. Turing also supports mixed `missing` and non-`missing` values in `x`, where the missing ones will be treated as random variables to be sampled while the others get treated as observations. @@ -376,7 +376,7 @@ using Turing @model function generative(x=missing, ::Type{T}=Float64) where {T<:Real} if x === missing - # Initialize x when missing + # Initialise x when missing x = Vector{T}(undef, 10) end s² ~ InverseGamma(2, 3) @@ -597,10 +597,10 @@ logging is enabled as default but might slow down inference. It can be turned on or off by setting the keyword argument `progress` of `sample` to `true` or `false`. Moreover, you can enable or disable progress logging globally by calling `setprogress!(true)` or `setprogress!(false)`, respectively. -Turing uses heuristics to select an appropriate visualization backend. If you +Turing uses heuristics to select an appropriate visualisation backend. If you use Jupyter notebooks, the default backend is [ConsoleProgressMonitor.jl](https://github.com/tkf/ConsoleProgressMonitor.jl). In all other cases, progress logs are displayed with [TerminalLoggers.jl](https://github.com/c42f/TerminalLoggers.jl). Alternatively, -if you provide a custom visualization backend, Turing uses it instead of the +if you provide a custom visualisation backend, Turing uses it instead of the default backend. diff --git a/developers/compiler/design-overview/index.qmd b/developers/compiler/design-overview/index.qmd index 6b27454dc..8b3ab112b 100755 --- a/developers/compiler/design-overview/index.qmd +++ b/developers/compiler/design-overview/index.qmd @@ -12,14 +12,14 @@ using Pkg; Pkg.instantiate(); ``` -In this section, the current design of Turing's model "compiler" is described which enables Turing to perform various types of Bayesian inference without changing the model definition. The "compiler" is essentially just a macro that rewrites the user's model definition to a function that generates a `Model` struct that Julia's dispatch can operate on and that Julia's compiler can successfully do type inference on for efficient machine code generation. +In this section, the current design of Turing's model "compiler" is described, which enables Turing to perform various types of Bayesian inference without changing the model definition. 
The "compiler" is essentially just a macro that rewrites the user's model definition to a function that generates a `Model` struct that Julia's dispatch can operate on and that Julia's compiler can successfully do type inference on for efficient machine code generation. # Overview The following terminology will be used in this section: - `D`: observed data variables conditioned upon in the posterior, - - `P`: parameter variables distributed according to the prior distributions, these will also be referred to as random variables, + - `P`: parameter variables distributed according to the prior distributions, these are also referred to as random variables, - `Model`: a fully defined probabilistic model with input data `Turing`'s `@model` macro rewrites the user-provided function definition such that it can be used to instantiate a `Model` by passing in the observed data `D`. @@ -27,7 +27,7 @@ The following terminology will be used in this section: The following are the main jobs of the `@model` macro: 1. Parse `~` and `.~` lines, e.g. `y .~ Normal.(c*x, 1.0)` - 2. Figure out if a variable belongs to the data `D` and or to the parameters `P` + 2. Figure out if a variable belongs to the data `D` and/or to the parameters `P` 3. Enable the handling of missing data variables in `D` when defining a `Model` and treating them as parameter variables in `P` instead 4. Enable the tracking of random variables using the data structures `VarName` and `VarInfo` 5. Change `~`/`.~` lines with a variable in `P` on the LHS to a call to `tilde_assume` or `dot_tilde_assume` @@ -43,20 +43,15 @@ A `model::Model` is a callable struct that one can sample from by calling (model::Model)([rng, varinfo, sampler, context]) ``` -where `rng` is a random number generator (default: `Random.default_rng()`), `varinfo` is a data structure that stores information -about the random variables (default: `DynamicPPL.VarInfo()`), `sampler` is a sampling algorithm (default: `DynamicPPL.SampleFromPrior()`), -and `context` is a sampling context that can, e.g., modify how the log probability is accumulated (default: `DynamicPPL.DefaultContext()`). +where `rng` is a random number generator (default: `Random.default_rng()`), `varinfo` is a data structure that stores information about the random variables (default: `DynamicPPL.VarInfo()`), `sampler` is a sampling algorithm (default: `DynamicPPL.SampleFromPrior()`), and `context` is a sampling context that can, for example, modify how the log probability is accumulated (default: `DynamicPPL.DefaultContext()`). -Sampling resets the log joint probability of `varinfo` and increases the evaluation counter of `sampler`. If `context` is a `LikelihoodContext`, -only the log likelihood of `D` will be accumulated, whereas with `PriorContext` only the log prior probability of `P` is. With the `DefaultContext` the log joint probability of both `P` and `D` is accumulated. +Sampling resets the log joint probability of `varinfo` and increases the evaluation counter of `sampler`. If `context` is a `LikelihoodContext`, only the log likelihood of `D` will be accumulated, whereas with `PriorContext` only the log prior probability of `P` is. With the `DefaultContext` the log joint probability of both `P` and `D` is accumulated. The `Model` struct contains the four internal fields `f`, `args`, `defaults`, and `context`. 
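For intuition, the calling convention described above can be sketched as follows. This is a minimal, non-executed sketch, assuming some `model::Model` has already been constructed and using DynamicPPL's exported names, which may differ across versions:

```{julia}
#| eval: false
# Sketch only: evaluate a model under each context, following the calling
# convention `(model::Model)([rng, varinfo, sampler, context])` above.
using DynamicPPL, Random

rng = Random.default_rng()
vi = DynamicPPL.VarInfo()           # stores information about the random variables
spl = DynamicPPL.SampleFromPrior()  # the default sampling algorithm

model(rng, vi, spl, DynamicPPL.DefaultContext())     # log joint of `P` and `D`
model(rng, vi, spl, DynamicPPL.LikelihoodContext())  # log likelihood of `D` only
model(rng, vi, spl, DynamicPPL.PriorContext())       # log prior of `P` only
```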
When `model::Model` is called, then the internal function `model.f` is called as `model.f(rng, varinfo, sampler, context, model.args...)` (for multithreaded sampling, instead of `varinfo` a threadsafe wrapper is passed to `model.f`). -The positional and keyword arguments that were passed to the user-defined model function when the model was created are saved as a `NamedTuple` -in `model.args`. The default values of the positional and keyword arguments of the user-defined model functions, if any, are saved as a `NamedTuple` -in `model.defaults`. They are used for constructing model instances with different arguments by the `logprob` and `prob` string macros. -The `context` variable sets an evaluation context that can be used to control for instance whether log probabilities should be evaluated for the prior, likelihood, or joint probability. By default it is set to evaluate the log joint. +The positional and keyword arguments that were passed to the user-defined model function when the model was created are saved as a `NamedTuple` in `model.args`. The default values of the positional and keyword arguments of the user-defined model functions, if any, are saved as a `NamedTuple` in `model.defaults`. They are used for constructing model instances with different arguments by the `logprob` and `prob` string macros. +The `context` variable sets an evaluation context that can be used to control, for instance, whether log probabilities should be evaluated for the prior, likelihood, or joint probability. By default it is set to evaluate the log joint. # Example @@ -79,9 +74,7 @@ Let's take the following model as an example: end ``` -The above call of the `@model` macro defines the function `gauss` with positional arguments `x`, `y`, and `::Type{TV}`, rewritten in -such a way that every call of it returns a `model::Model`. Note that only the function body is modified by the `@model` macro, and the -function signature is left untouched. It is also possible to implement models with keyword arguments such as +The above call of the `@model` macro defines the function `gauss` with positional arguments `x`, `y`, and `::Type{TV}`, rewritten in such a way that every call of it returns a `model::Model`. Note that only the function body is modified by the `@model` macro, and the function signature is left untouched. It is also possible to implement models with keyword arguments such as ```{julia} #| eval: false @@ -94,8 +87,7 @@ end This would allow us to generate a model by calling `gauss(; x = rand(3))`. -If an argument has a default value `missing`, it is treated as a random variable. For variables which require an initialization because we -need to loop or broadcast over its elements, such as `x` above, the following needs to be done: +If an argument has a default value `missing`, it is treated as a random variable. For variables which require an initialisation because we need to loop or broadcast over its elements, such as `x` above, the following needs to be done: ```{julia} #| eval: false @@ -104,8 +96,7 @@ if x === missing end ``` -Note that since `gauss` behaves like a regular function it is possible to define additional dispatches in a second step as well. For -instance, we could achieve the same behaviour by +Note that since `gauss` behaves like a regular function it is possible to define additional dispatches in a second step as well. 
For instance, we could achieve the same behaviour by ```{julia} #| eval: false @@ -119,13 +110,11 @@ function gauss(::Missing, y=1.0, ::Type{TV}=Vector{Float64}) where {TV<:Abstract end ``` -If `x` is sampled as a whole from a distribution and not indexed, e.g., `x ~ Normal(...)` or `x ~ MvNormal(...)`, -there is no need to initialize it in an `if`-block. +If `x` is sampled as a whole from a distribution and not indexed, e.g., `x ~ Normal(...)` or `x ~ MvNormal(...)`, there is no need to initialise it in an `if`-block. ## Step 1: Break up the model definition -First, the `@model` macro breaks up the user-provided function definition using `DynamicPPL.build_model_info`. This function -returns a dictionary consisting of: +First, the `@model` macro breaks up the user-provided function definition using `DynamicPPL.build_model_info`. This function returns a dictionary consisting of: - `allargs_exprs`: The expressions of the positional and keyword arguments, without default values. - `allargs_syms`: The names of the positional and keyword arguments, e.g., `[:x, :y, :TV]` above. @@ -135,10 +124,7 @@ returns a dictionary consisting of: ## Step 2: Generate the body of the internal model function -In a second step, `DynamicPPL.generate_mainbody` generates the main part of the transformed function body using the user-provided function body -and the provided function arguments, without default values, for figuring out if a variable denotes an observation or a random variable. -Hereby the function `DynamicPPL.generate_tilde` replaces the `L ~ R` lines in the model and the function `DynamicPPL.generate_dot_tilde` replaces -the `@. L ~ R` and `L .~ R` lines in the model. +In a second step, `DynamicPPL.generate_mainbody` generates the main part of the transformed function body using the user-provided function body and the provided function arguments, without default values, for figuring out if a variable denotes an observation or a random variable. Hereby the function `DynamicPPL.generate_tilde` replaces the `L ~ R` lines in the model and the function `DynamicPPL.generate_dot_tilde` replaces the `@. L ~ R` and `L .~ R` lines in the model. In the above example, `p[1] ~ InverseGamma(2, 3)` is replaced with something similar to @@ -166,21 +152,15 @@ begin end ``` -Here the first line is a so-called line number node that enables more helpful error messages by providing users with the exact location -of the error in their model definition. Then the right hand side (RHS) of the `~` is assigned to a variable (with an automatically generated name). -We check that the RHS is a distribution or an array of distributions, otherwise an error is thrown. -Next we extract a compact representation of the variable with its name and index (or indices). Finally, the `~` expression is replaced with -a call to `DynamicPPL.tilde_assume` since the compiler figured out that `p[1]` is a random variable using the following -heuristic: +Here the first line is a so-called line number node that enables more helpful error messages by providing users with the exact location of the error in their model definition. Then the right hand side (RHS) of the `~` is assigned to a variable (with an automatically generated name). We check that the RHS is a distribution or an array of distributions, otherwise an error is thrown. +Next we extract a compact representation of the variable with its name and index (or indices). 
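This compact representation is a `VarName` (discussed further below). As a rough, hypothetical illustration using the exported `@varname` macro, which is not part of the generated code itself:

```{julia}
#| eval: false
# Sketch only: the compact representation extracted for `p[1]`.
using DynamicPPL

vn = @varname(p[1])  # a VarName{:p} carrying the index (1,)
```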
Finally, the `~` expression is replaced with a call to `DynamicPPL.tilde_assume` since the compiler figured out that `p[1]` is a random variable using the following heuristic: - 1. If the symbol on the LHS of `~`, `:p` in this case, is not among the arguments to the model, `(:x, :y, :T)` in this case, it is a random variable. - 2. If the symbol on the LHS of `~`, `:p` in this case, is among the arguments to the model but has a value of `missing`, it is a random variable. - 3. If the value of the LHS of `~`, `p[1]` in this case, is `missing`, then it is a random variable. - 4. Otherwise, it is treated as an observation. + 1. If the symbol on the LHS of `~`, `:p` in this case, is not among the arguments to the model, `(:x, :y, :T)` in this case, it is a random variable. + 2. If the symbol on the LHS of `~`, `:p` in this case, is among the arguments to the model but has a value of `missing`, it is a random variable. + 3. If the value of the LHS of `~`, `p[1]` in this case, is `missing`, then it is a random variable. + 4. Otherwise, it is treated as an observation. -The `DynamicPPL.tilde_assume` function takes care of sampling the random variable, if needed, and updating its value and the accumulated log joint -probability in the `_varinfo` object. If `L ~ R` is an observation, `DynamicPPL.tilde_observe` is called with the same arguments except the -random number generator `_rng` (since observations are never sampled). +The `DynamicPPL.tilde_assume` function takes care of sampling the random variable, if needed, and updating its value and the accumulated log joint probability in the `_varinfo` object. If `L ~ R` is an observation, `DynamicPPL.tilde_observe` is called with the same arguments except the random number generator `_rng` (since observations are never sampled). A similar transformation is performed for expressions of the form `@. L ~ R` and `L .~ R`. For instance, `@. x[1:2] ~ Normal(p[2], sqrt(p[1]))` is replaced with @@ -232,26 +212,22 @@ begin end ``` -The main difference in the expanded code between `L ~ R` and `@. L ~ R` is that the former doesn't assume `L` to be defined, it can be a new Julia variable in the scope, while the latter assumes `L` already exists. Moreover, `DynamicPPL.dot_tilde_assume` and `DynamicPPL.dot_tilde_observe` are called -instead of `DynamicPPL.tilde_assume` and `DynamicPPL.tilde_observe`. +The main difference in the expanded code between `L ~ R` and `@. L ~ R` is that the former doesn't assume `L` to be defined, it can be a new Julia variable in the scope, while the latter assumes `L` already exists. Moreover, `DynamicPPL.dot_tilde_assume` and `DynamicPPL.dot_tilde_observe` are called instead of `DynamicPPL.tilde_assume` and `DynamicPPL.tilde_observe`. ## Step 3: Replace the user-provided function body -Finally, we replace the user-provided function body using `DynamicPPL.build_output`. This function uses `MacroTools.combinedef` to reassemble -the user-provided function with a new function body. In the modified function body an anonymous function is created whose function body -was generated in step 2 above and whose arguments are +Finally, we replace the user-provided function body using `DynamicPPL.build_output`. This function uses `MacroTools.combinedef` to reassemble the user-provided function with a new function body. 
In the modified function body an anonymous function is created whose function body was generated in step 2 above and whose arguments are - a random number generator `_rng`, - a model `_model`, - a datastructure `_varinfo`, - a sampler `_sampler`, - a sampling context `_context`, - - and all positional and keyword arguments of the user-provided model function as positional arguments - without any default values. Finally, in the new function body a `model::Model` with this anonymous function as internal function is returned. + - and all positional and keyword arguments of the user-provided model function as positional arguments without any default values. Finally, in the new function body a `model::Model` with this anonymous function as internal function is returned. # `VarName` -In order to track random variables in the sampling process, `Turing` uses the `VarName` struct which acts as a random variable identifier generated at runtime. The `VarName` of a random variable is generated from the expression on the LHS of a `~` statement when the symbol on the LHS is in the set `P` of unobserved random variables. Every `VarName` instance has a type parameter `sym` which is the symbol of the Julia variable in the model that the random variable belongs to. For example, `x[1] ~ Normal()` will generate an instance of `VarName{:x}` assuming `x` is an unobserved random variable. Every `VarName` also has a field `indexing`, which stores the indices required to access the random variable from the Julia variable indicated by `sym` as a tuple of tuples. Each element of the tuple thereby contains the indices of one indexing operation (`VarName` also supports hierarchical arrays and range indexing). Some examples: +In order to track random variables in the sampling process, `Turing` uses the `VarName` struct which acts as a random variable identifier generated at runtime. The `VarName` of a random variable is generated from the expression on the LHS of a `~` statement when the symbol on the LHS is in the set `P` of unobserved random variables. Every `VarName` instance has a type parameter `sym` which is the symbol of the Julia variable in the model that the random variable belongs to. For example, `x[1] ~ Normal()` will generate an instance of `VarName{:x}` assuming `x` is an unobserved random variable. Every `VarName` also has a field `indexing`, which stores the indices required to access the random variable from the Julia variable indicated by `sym` as a tuple of tuples. Each element of the tuple thereby contains the indices of one indexing operation (`VarName` also supports hierarchical arrays and range indexing). Some examples: - `x ~ Normal()` will generate a `VarName(:x, ())`. - `x[1] ~ Normal()` will generate a `VarName(:x, ((1,),))`. @@ -281,25 +257,20 @@ Based on the type of `metadata`, the `VarInfo` is either aliased `UntypedVarInfo ## `Metadata` -The `Metadata` struct stores some metadata about the random variables sampled. This helps -query certain information about a variable such as: its distribution, which samplers -sample this variable, its value and whether this value is transformed to real space or -not. Let `md` be an instance of `Metadata`: +The `Metadata` struct stores some metadata about the random variables sampled. This helps query certain information about a variable such as: its distribution, which samplers sample this variable, its value and whether this value is transformed to real space or not. 
Let `md` be an instance of `Metadata`: - `md.vns` is the vector of all `VarName` instances. Let `vn` be an arbitrary element of `md.vns` - - `md.idcs` is the dictionary that maps each `VarName` instance to its index in - `md.vns`, `md.ranges`, `md.dists`, `md.orders` and `md.flags`. + - `md.idcs` is the dictionary that maps each `VarName` instance to its index in `md.vns`, `md.ranges`, `md.dists`, `md.orders` and `md.flags`. - `md.vns[md.idcs[vn]] == vn`. - `md.dists[md.idcs[vn]]` is the distribution of `vn`. - `md.gids[md.idcs[vn]]` is the set of algorithms used to sample `vn`. This was used by the Gibbs sampler. Since Turing v0.36 it is unused and will eventually be deleted. - `md.orders[md.idcs[vn]]` is the number of `observe` statements before `vn` is sampled. - `md.ranges[md.idcs[vn]]` is the index range of `vn` in `md.vals`. - `md.vals[md.ranges[md.idcs[vn]]]` is the linearized vector of values of corresponding to `vn`. - - `md.flags` is a dictionary of true/false flags. `md.flags[flag][md.idcs[vn]]` is the - value of `flag` corresponding to `vn`. + - `md.flags` is a dictionary of true/false flags. `md.flags[flag][md.idcs[vn]]` is the value of `flag` corresponding to `vn`. Note that in order to make `md::Metadata` type stable, all the `md.vns` must have the same symbol and distribution type. However, one can have a single Julia variable, e.g. `x`, that is a matrix or a hierarchical array sampled in partitions, e.g. `x[1][:] ~ MvNormal(zeros(2), I); x[2][:] ~ MvNormal(ones(2), I)`. The symbol `x` can still be managed by a single `md::Metadata` without hurting the type stability since all the distributions on the RHS of `~` are of the same type. However, in `Turing` models one cannot have this restriction, so we must use a type unstable `Metadata` if we want to use one `Metadata` instance for the whole model. This is what `UntypedVarInfo` does. A type unstable `Metadata` will still work but will have inferior performance. -To strike a balance between flexibility and performance when constructing the `spl::Sampler` instance, the model is first run by sampling the parameters in `P` from their priors using an `UntypedVarInfo`, i.e. a type unstable `Metadata` is used for all the variables. Then once all the symbols and distribution types have been identified, a `vi::TypedVarInfo` is constructed where `vi.metadata` is a `NamedTuple` mapping each symbol in `P` to a specialized instance of `Metadata`. So as long as each symbol in `P` is sampled from only one type of distributions, `vi::TypedVarInfo` will have fully concretely typed fields which brings out the peak performance of Julia. +To strike a balance between flexibility and performance when constructing the `spl::Sampler` instance, the model is first run by sampling the parameters in `P` from their priors using an `UntypedVarInfo`, i.e. a type unstable `Metadata` is used for all the variables. Then once all the symbols and distribution types have been identified, a `vi::TypedVarInfo` is constructed where `vi.metadata` is a `NamedTuple` mapping each symbol in `P` to a specialized instance of `Metadata`. So as long as each symbol in `P` is sampled from only one type of distributions, `vi::TypedVarInfo` will have fully concretely typed fields which brings out the peak performance of Julia. 
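A hedged sketch of inspecting this typed structure (internal fields such as `metadata` are implementation details and may change between DynamicPPL versions):

```{julia}
#| eval: false
# Sketch only: peek at the typed metadata of a model's VarInfo.
using DynamicPPL

vi = DynamicPPL.VarInfo(model)  # runs the model once; returns a TypedVarInfo
keys(vi.metadata)               # a NamedTuple with one entry per symbol in `P`
md = first(vi.metadata)         # a concretely typed `Metadata` instance
md.vns                          # the `VarName`s managed by this `Metadata`
```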
\ No newline at end of file diff --git a/developers/compiler/minituring-compiler/index.qmd b/developers/compiler/minituring-compiler/index.qmd index 0ac5ab9b5..2be489c4e 100755 --- a/developers/compiler/minituring-compiler/index.qmd +++ b/developers/compiler/minituring-compiler/index.qmd @@ -192,7 +192,7 @@ function assume(context::SamplingContext{<:MHSampler}, varinfo, dist, var_id) old_value = varinfo.values[var_id] # propose a random-walk step, i.e, add the current value to a random - # value sampled from a Normal distribution centered at 0 + # value sampled from a Normal distribution centred at 0 value = rand(context.rng, Normal(old_value, sampler.sigma)) logp = Distributions.logpdf(dist, value) varinfo[var_id] = (value, logp) @@ -206,7 +206,7 @@ In the first step we sample values from the prior distributions and in the follo The two functions are identified by the different arguments they take. ```{julia} -# The fist step: Sampling from the prior distributions +# The first step: Sampling from the prior distributions function AbstractMCMC.step( rng::Random.AbstractRNG, model::MiniModel, sampler::MHSampler; kwargs... ) @@ -222,7 +222,7 @@ function AbstractMCMC.step( model::MiniModel, sampler::MHSampler, prev_state::VarInfo; # is just the old trace - kwargs..., + kwargs... ) vi = prev_state new_vi = deepcopy(vi) @@ -242,7 +242,7 @@ function AbstractMCMC.step( end; ``` -To make it easier to analyze the samples and compare them with results from Turing, additionally we define a version of `AbstractMCMC.bundle_samples` for our model and sampler that returns a `MCMCChains.Chains` object of samples. +To make it easier to analyse the samples and compare them with results from Turing, additionally we define a version of `AbstractMCMC.bundle_samples` for our model and sampler that returns a `MCMCChains.Chains` object of samples. ```{julia} function AbstractMCMC.bundle_samples( @@ -252,7 +252,7 @@ function AbstractMCMC.bundle_samples( values = [sample.values for sample in samples] params = [key for key in keys(values[1]) if key ∉ keys(model.data)] vals = reduce(hcat, [value[p] for value in values] for p in params) - # Composing the `Chains` data-structure, of which analyzing infrastructure is provided + # Composing the `Chains` data-structure, of which analysing infrastructure is provided chains = Chains(vals, params) return chains end; @@ -303,4 +303,4 @@ end sample(turing_m(3.0), MH(ScalMat(2, 1.0)), 1_000_000, progress=false) ``` -As you can see, with our simple probabilistic programming language and custom samplers we get similar results as Turing. +As you can see, with our simple probabilistic programming language and custom samplers we get similar results as Turing. \ No newline at end of file diff --git a/developers/compiler/minituring-contexts/index.qmd b/developers/compiler/minituring-contexts/index.qmd index aa125b703..5439e719b 100755 --- a/developers/compiler/minituring-contexts/index.qmd +++ b/developers/compiler/minituring-contexts/index.qmd @@ -101,7 +101,7 @@ mini_m = MiniModel(m, (x=3.0,)) Previously in the mini Turing case, at this point we defined `SamplingContext`, a structure that holds a random number generator and a sampler, and gets passed to `observe` and `assume`. We then used it to implement a simple Metropolis-Hastings sampler. -The notion of a context may have seemed overly complicated just to implement the sampler, but there are other things we may want to do with a model than sample from the posterior. 
Having the context passing in place lets us do that without having to touch the above macro at all. For instance, let's say we want to evaluate the log joint probability of the model for a given set of data and parameters. Using a new context type we can use the previously defined `model` function, but change its behavior by changing what the `observe` and `assume` functions do. +The notion of a context may have seemed overly complicated just to implement the sampler, but there are other things we may want to do with a model than sample from the posterior. Having the context passing in place lets us do that without having to touch the above macro at all. For instance, let's say we want to evaluate the log joint probability of the model for a given set of data and parameters. Using a new context type we can use the previously defined `model` function, but change its behaviour by changing what the `observe` and `assume` functions do. @@ -223,8 +223,8 @@ function assume(context::SamplingContext{<:MHSampler}, varinfo, dist, var_id) sampler = context.sampler old_value = varinfo.values[var_id] - # propose a random-walk step, i.e, add the current value to a random - # value sampled from a Normal distribution centered at 0 + # propose a random-walk step, i.e, add the current value to a random + # value sampled from a Normal distribution centred at 0 value = rand(context.rng, Normal(old_value, sampler.sigma)) varinfo[var_id] = (value, NaN) # Once the value has been sampled, let the subcontext handle evaluating the log @@ -274,7 +274,7 @@ function AbstractMCMC.bundle_samples( values = [sample.values for sample in samples] params = [key for key in keys(values[1]) if key ∉ keys(model.data)] vals = reduce(hcat, [value[p] for value in values] for p in params) - # Composing the `Chains` data-structure, of which analyzing infrastructure is provided + # Composing the `Chains` data-structure, of which analysing infrastructure is provided chains = Chains(vals, params) return chains end; @@ -307,4 +307,4 @@ The use of contexts also goes far beyond just evaluating log probabilities and s All of the above are what Turing calls _parent contexts_, which is to say that they all keep a subcontext just like our above `SamplingContext` did. Their implementations of `assume` and `observe` call the implementation of the subcontext once they are done doing their own work of fixing/conditioning/prefixing/etc. Contexts are often chained, so that e.g. a `DebugContext` may wrap within it a `PrefixContext`, which may in turn wrap a `ConditionContext`, etc. The only contexts that _don't_ have a subcontext in the Turing are the ones for evaluating the prior, likelihood, and joint distributions. These are called _leaf contexts_. -The above version of mini Turing is still much simpler than the full Turing language, but the principles of how contexts are used are the same. +The above version of mini Turing is still much simpler than the full Turing language, but the principles of how contexts are used are the same. \ No newline at end of file diff --git a/developers/compiler/model-manual/index.qmd b/developers/compiler/model-manual/index.qmd index a3e404122..a21916b11 100755 --- a/developers/compiler/model-manual/index.qmd +++ b/developers/compiler/model-manual/index.qmd @@ -26,8 +26,7 @@ model = gdemo([1.5, 2.0]) The `@model` macro accepts a function definition and rewrites it such that call of the function generates a `Model` struct for use by the sampler. -However, models can be constructed by hand without the use of a macro. 
-Taking the `gdemo` model above as an example, the macro-based definition can be implemented also (a bit less generally) with the macro-free version +However, models can be constructed by hand without the use of a macro. Taking the `gdemo` model above as an example, the macro-based definition can be implemented as well (a bit less generally) with the macro-free version ```{julia} using DynamicPPL @@ -67,4 +66,4 @@ We can sample from this model in the same way: chain = sample(model2, NUTS(), 1000; progress=false) ``` -The subsequent pages in this section will show how the `@model` macro does this behind-the-scenes. +The subsequent pages in this section will show how the `@model` macro does this behind-the-scenes. \ No newline at end of file diff --git a/developers/contributing/index.qmd b/developers/contributing/index.qmd index 2b07adc26..df006a9bb 100755 --- a/developers/contributing/index.qmd +++ b/developers/contributing/index.qmd @@ -5,7 +5,7 @@ aliases: --- Turing is an open-source project and is [hosted on GitHub](https://github.com/TuringLang). -We welcome contributions from the community in all forms large or small: bug reports, feature implementations, code contributions, or improvements to documentation or infrastructure are all extremely valuable. +We welcome contributions from the community in all forms, large or small: bug reports, feature implementations, code contributions, or improvements to documentation or infrastructure are all extremely valuable. We would also very much appreciate examples of models written using Turing. ### How to get involved @@ -186,9 +186,8 @@ These conventions were created from a variety of sources including Python's [PEP #### A Word on Consistency -When adhering to the Blue style, it's important to realize that these are guidelines, not rules. This is [stated best in the PEP8](http://legacy.python.org/dev/peps/pep-0008/#a-foolish-consistency-is-the-hobgoblin-of-little-minds): +When adhering to the Blue style, it's important to realise that these are guidelines, not rules. This is [stated best in the PEP8](http://legacy.python.org/dev/peps/pep-0008/#a-foolish-consistency-is-the-hobgoblin-of-little-minds): > A style guide is about consistency. Consistency with this style guide is important. Consistency within a project is more important. Consistency within one module or function is most important. -> But most importantly: know when to be inconsistent – sometimes the style guide just doesn't apply. When in doubt, use your best judgment. Look at other examples and decide what looks best. And don't hesitate to ask! - +> But most importantly: know when to be inconsistent – sometimes the style guide just doesn't apply. When in doubt, use your best judgment. Look at other examples and decide what looks best. And don't hesitate to ask! \ No newline at end of file diff --git a/developers/inference/abstractmcmc-interface/index.qmd b/developers/inference/abstractmcmc-interface/index.qmd index 993936209..82682808b 100755 --- a/developers/inference/abstractmcmc-interface/index.qmd +++ b/developers/inference/abstractmcmc-interface/index.qmd @@ -59,7 +59,7 @@ using Random An interface extension (like the one we're writing right now) typically requires that you overload or implement several functions. Specifically, you should `import` the functions you intend to overload. This next code block accomplishes that. -From `Distributions`, we need `Sampleable`, `VariateForm`, and `ValueSupport`, three abstract types that define a distribution. 
Models in the interface are assumed to be subtypes of `Sampleable{VariateForm, ValueSupport}`. In this section our model is going be be extremely simple, so we will not end up using these except to make sure that the inference functions are dispatching correctly. +From `Distributions`, we need `Sampleable`, `VariateForm`, and `ValueSupport`, three abstract types that define a distribution. Models in the interface are assumed to be subtypes of `Sampleable{VariateForm, ValueSupport}`. In this section our model is going to be extremely simple, so we will not end up using these except to make sure that the inference functions are dispatching correctly. ### Sampler @@ -79,7 +79,7 @@ function MetropolisHastings(init_θ::Vector{<:Real}) end ``` -Above, we have defined a sampler that stores the initial parameterization of the prior, and a distribution object from which proposals are drawn. You can have a struct that has no fields, and simply use it for dispatching onto the relevant functions, or you can store a large amount of state information in your sampler. +Above, we have defined a sampler that stores the initial parameterisation of the prior, and a distribution object from which proposals are drawn. You can have a struct that has no fields, and simply use it for dispatching onto the relevant functions, or you can store a large amount of state information in your sampler. The general intuition for what to store in your sampler struct is that anything you may need to perform inference between samples but you don't want to store in a transition should go into the sampler struct. It's the only way you can carry non-sample related state information between `step!` calls. @@ -124,7 +124,7 @@ As a refresher, Metropolis-Hastings implements a very basic algorithm: 2. For ``t`` in ``[1,N],`` do - + Generate a proposal parameterization ``\theta^\prime_t \sim q(\theta^\prime_t \mid \theta_{t-1}).`` + + Generate a proposal parameterisation ``\theta^\prime_t \sim q(\theta^\prime_t \mid \theta_{t-1}).`` + Calculate the acceptance probability, ``\alpha = \text{min}\left[1,\frac{\pi(\theta'_t)}{\pi(\theta_{t-1})} \frac{q(\theta_{t-1} \mid \theta'_t)}{q(\theta'_t \mid \theta_{t-1})}) \right].`` @@ -163,19 +163,19 @@ function AbstractMCMC.step!( end ``` -The first `step!` function just packages up the initial parameterization inside the sampler, and returns it. We implicitly accept the very first parameterization. +The first `step!` function just packages up the initial parameterisation inside the sampler, and returns it. We implicitly accept the very first parameterisation. The other `step!` function performs the usual steps from Metropolis-Hastings. Included are several helper functions, `proposal` and `q`, which are designed to replicate the functions in the pseudocode above. - `proposal` generates a new proposal in the form of a `Transition`, which can be univariate if the value passed in is univariate, or it can be multivariate if the `Transition` given is multivariate. Proposals use a basic `Normal` or `MvNormal` proposal distribution. -- `q` returns the log density of one parameterization conditional on another, according to the proposal distribution. +- `q` returns the log density of one parameterisation conditional on another, according to the proposal distribution. - `step!` generates a new proposal, checks the acceptance probability, and then returns either the previous transition or the proposed transition. 
```{julia} #| eval: false # Define a function that makes a basic proposal depending on a univariate -# parameterization or a multivariate parameterization. +# parameterisation or a multivariate parameterisation. function propose(spl::MetropolisHastings, model::DensityModel, θ::Real) return Transition(model, θ + rand(spl.proposal)) end @@ -193,7 +193,7 @@ function q(spl::MetropolisHastings, θ::Vector{<:Real}, θcond::Vector{<:Real}) end q(spl::MetropolisHastings, t1::Transition, t2::Transition) = q(spl, t1.θ, t2.θ) -# Calculate the density of the model given some parameterization. +# Calculate the density of the model given some parameterisation. ℓπ(model::DensityModel, θ) = model.ℓπ(θ) ℓπ(model::DensityModel, t::Transition) = t.lp @@ -320,4 +320,4 @@ It looks like we're extremely close to our true parameters of `Normal(5,3)`, tho ## Conclusion -We've seen how to implement the sampling interface for general projects. Turing's interface methods are ever-evolving, so please open an issue at [AbstractMCMC](https://github.com/TuringLang/AbstractMCMC.jl) with feature requests or problems. +We've seen how to implement the sampling interface for general projects. Turing's interface methods are ever-evolving, so please open an issue at [AbstractMCMC](https://github.com/TuringLang/AbstractMCMC.jl) with feature requests or problems. \ No newline at end of file diff --git a/developers/inference/abstractmcmc-turing/index.qmd b/developers/inference/abstractmcmc-turing/index.qmd index 42c1fd308..642b623b0 100755 --- a/developers/inference/abstractmcmc-turing/index.qmd +++ b/developers/inference/abstractmcmc-turing/index.qmd @@ -35,9 +35,9 @@ n_samples = 1000 chn = sample(mod, alg, n_samples, progress=false) ``` -The function `sample` is part of the AbstractMCMC interface. As explained in the interface guide, building a sampling method that can be used by `sample` consists in overloading the structs and functions in `AbstractMCMC`. The interface guide also gives a standalone example of their implementation, [`AdvancedMH.jl`](). +The function `sample` is part of the AbstractMCMC interface. As explained in the interface guide, building a sampling method that can be used by `sample` consists of overloading the structs and functions in `AbstractMCMC`. The interface guide also gives a standalone example of their implementation, [`AdvancedMH.jl`](). -Turing sampling methods (most of which are written [here](https://github.com/TuringLang/Turing.jl/tree/main/src/mcmc)) also implement `AbstractMCMC`. Turing defines a particular architecture for `AbstractMCMC` implementations, that enables working with models defined by the `@model` macro, and uses DynamicPPL as a backend. The goal of this page is to describe this architecture, and how you would go about implementing your own sampling method in Turing, using Importance Sampling as an example. I don't go into all the details: for instance, I don't address selectors or parallelism. +Turing sampling methods (most of which are written [here](https://github.com/TuringLang/Turing.jl/tree/main/src/mcmc)) also implement `AbstractMCMC`. Turing defines a particular architecture for `AbstractMCMC` implementations, which enables working with models defined by the `@model` macro, and uses DynamicPPL as a backend. The goal of this page is to describe this architecture, and how you would go about implementing your own sampling method in Turing, using Importance Sampling as an example. I don't go into all the details: for instance, I don't address selectors or parallelism. 
First, we explain how Importance Sampling works in the abstract. Consider the model defined in the first code block. Mathematically, it can be written: @@ -124,7 +124,7 @@ When you call `sample(mod, alg, n_samples)`, Turing first uses `model` and `alg` When you define your own Turing sampling method, you must therefore build: -- a **sampler constructor** that uses a model and an algorithm to initialize an instance of `Sampler`. For Importance Sampling: +- a **sampler constructor** that uses a model and an algorithm to initialise an instance of `Sampler`. For Importance Sampling: ```{julia} #| eval: false @@ -158,7 +158,7 @@ $$ of the sample weights is a particularly important quantity: -- it is used to **normalize** the **empirical approximation** of the posterior distribution +- it is used to **normalise** the **empirical approximation** of the posterior distribution - its logarithm is the importance sampling **estimate** of the **log evidence** $\log p(x, y)$ To avoid having to compute it over and over again, `is.jl`defines an IS-specific concrete type `ISState` for sampler states, with an additional field `final_logevidence` containing @@ -326,4 +326,4 @@ We focus on the AbstractMCMC functions that are overridden in `is.jl` and execut - When the `n_samples` iterations are completed, `sample_end!` fills the `final_logevidence` field of `spl.state` - + It simply takes the logarithm of the average of the sample weights, using the log weights for numerical stability + + It simply takes the logarithm of the average of the sample weights, using the log weights for numerical stability \ No newline at end of file diff --git a/developers/inference/implementing-samplers/index.qmd b/developers/inference/implementing-samplers/index.qmd index 1d8aa3d6c..dd1ad604e 100644 --- a/developers/inference/implementing-samplers/index.qmd +++ b/developers/inference/implementing-samplers/index.qmd @@ -23,13 +23,13 @@ Note that we will implement this sampler in the [AbstractMCMC.jl](https://github ## Quick overview of MALA -We can view MALA as a single step of the leapfrog intergrator with resampling of momentum $p$ at every step.[^2] To make that statement a bit more concrete, we first define the *extended* target $\bar{\gamma}(x, p)$ as +We can view MALA as a single step of the leapfrog integrator with resampling of momentum $p$ at every step.[^2] To make that statement a bit more concrete, we first define the *extended* target $\bar{\gamma}(x, p)$ as \begin{equation*} \log \bar{\gamma}(x, p) \propto \log \gamma(x) + \log \gamma_{\mathcal{N}(0, M)}(p) \end{equation*} -where $\gamma_{\mathcal{N}(0, M)}$ denotes the density for a zero-centered Gaussian with covariance matrix $M$. +where $\gamma_{\mathcal{N}(0, M)}$ denotes the density for a zero-centred Gaussian with covariance matrix $M$. We then consider targeting this joint distribution over both $x$ and $p$ as follows. First we define the map @@ -100,7 +100,7 @@ end This gives us all of the properties we want for our MALA sampler with the exception of the computation of the *gradient* $\nabla \log \gamma(x)$. There is the method `LogDensityProblems.logdensity_and_gradient` which should return a 2-tuple where the first entry is the evaluation of the logdensity $\log \gamma(x)$ and the second entry is the gradient $\nabla \log \gamma(x)$. -There are two ways to "implement" this method: 1) we implement it by hand, which is feasible in the case of our `IsotropicNormalModel`, or b) we defer the implementation of this to a automatic differentiation backend. 
+There are two ways to "implement" this method: a) we implement it by hand, which is feasible in the case of our `IsotropicNormalModel`, or b) we defer the implementation of this to an automatic differentiation backend.

 To implement it by hand we can simply do
@@ -141,9 +141,8 @@ LogDensityProblems.logdensity(model_with_grad, x_example)
 ```

 We'll continue with the second approach in this tutorial since this is typically what one does in practice, because there are better hobbies to spend time on than deriving gradients by hand.

-At this point, one might wonder how we're going to tie this back to Turing.jl in the end. Effectively, when working with inference methods that only require log-density evaluations and / or higher-order information of the log-density, Turing.jl actually converts the user-provided `Model` into an object implementing the above methods for [LogDensityProblems.jl](https://github.com/tpapp/LogDensityProblems.jl). As a result, most samplers provided by Turing.jl are actually implemented to work with [LogDensityProblems.jl](https://github.com/tpapp/LogDensityProblems.jl), enabling their use both *within* Turing.jl and *outside* of Turing.jl! Morever, there exists similar conversions for Stan through BridgeStan and Stan[LogDensityProblems.jl](https://github.com/tpapp/LogDensityProblems.jl), which means that a sampler supporting the [LogDensityProblems.jl](https://github.com/tpapp/LogDensityProblems.jl) interface can easily be used on both Turing.jl *and* Stan models (in addition to user-provided models, as our `IsotropicNormalModel` above)!
+At this point, one might wonder how we're going to tie this back to Turing.jl in the end. Effectively, when working with inference methods that only require log-density evaluations and / or higher-order information of the log-density, Turing.jl actually converts the user-provided `Model` into an object implementing the above methods for [LogDensityProblems.jl](https://github.com/tpapp/LogDensityProblems.jl). As a result, most samplers provided by Turing.jl are actually implemented to work with [LogDensityProblems.jl](https://github.com/tpapp/LogDensityProblems.jl), enabling their use both *within* Turing.jl and *outside* of Turing.jl! Moreover, there exist similar conversions for Stan through BridgeStan and StanLogDensityProblems.jl, which means that a sampler supporting the [LogDensityProblems.jl](https://github.com/tpapp/LogDensityProblems.jl) interface can easily be used on both Turing.jl *and* Stan models (in addition to user-provided models, as our `IsotropicNormalModel` above)!

-Anyways, let's move on to actually implementing the sampler.
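As a hedged sketch of the conversion described above, we assume `DynamicPPL.LogDensityFunction` here, which wraps a `Model` in the LogDensityProblems.jl interface (the exact name and behaviour may differ between versions):

```{julia}
#| eval: false
# Sketch only: wrapping a Turing model in the LogDensityProblems.jl interface.
using Turing, DynamicPPL, LogDensityProblems

@model function mymodel()
    x ~ Normal()
end

f = DynamicPPL.LogDensityFunction(mymodel())
LogDensityProblems.dimension(f)          # number of (unconstrained) parameters
LogDensityProblems.logdensity(f, [0.5])  # log joint evaluated at x = 0.5
```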
## Implementing MALA in [AbstractMCMC.jl](https://github.com/TuringLang/AbstractMCMC.jl) @@ -343,7 +342,7 @@ samples_matrix = stack(sample -> sample.x, samples) hcat(mean(samples_matrix; dims=2), std(samples_matrix; dims=2)) ``` -Let's visualize the samples +Let's visualise the samples ```{julia} using StatsPlots @@ -459,7 +458,7 @@ end sample(demo(), externalsampler(sampler; unconstrained=false), 10_000; progress=false) ``` -As expected, we run into a `DomainError` at some point, while if we set `unconstrained=true`, letting Turing.jl transform the model to a unconstrained form behind the scenes, everything works as expected: +As expected, we run into a `DomainError` at some point, while if we set `unconstrained=true`, letting Turing.jl transform the model to an unconstrained form behind the scenes, everything works as expected: ```{julia} sample(demo(), externalsampler(sampler; unconstrained=true), 10_000; progress=false) diff --git a/developers/inference/variational-inference/index.qmd b/developers/inference/variational-inference/index.qmd index 0965b07c7..19185b950 100755 --- a/developers/inference/variational-inference/index.qmd +++ b/developers/inference/variational-inference/index.qmd @@ -78,7 +78,7 @@ $$ Otherwise, there might be a point $z_0 \sim q(z)$ such that $p(z_0 \mid \\{ x_i \\}_{i = 1}^n) = 0$, resulting in $\log\left(\frac{q(z)}{0}\right)$ which doesn't make sense! -One major problem: as we can see in the definition of the KL-divergence, we need $p(z \mid \\{ x_i \\}_{i = 1}^n)$ for any $z$ if we want to compute the KL-divergence between this and $q(z)$. We don't have that. The entire reason we even do Bayesian inference is that we don't know the posterior! Cleary this isn't going to work. _Or is it?!_ +One major problem: as we can see in the definition of the KL-divergence, we need $p(z \mid \\{ x_i \\}_{i = 1}^n)$ for any $z$ if we want to compute the KL-divergence between this and $q(z)$. We don't have that. The entire reason we even do Bayesian inference is that we don't know the posterior! Clearly this isn't going to work. _Or is it?!_ ## Computing KL-divergence without knowing the posterior @@ -151,7 +151,7 @@ $$ Hence, as long as we can sample from $q(z)$ somewhat efficiently, we can indeed minimize the KL-divergence! Neat, eh? -Sidenote: in the case where $q(z)$ is tractable but $\mathbb{H} \left(q(z) \right)$ is _not_ , we can use an Monte-Carlo estimate for this term too but this generally results in a higher-variance estimate. +Sidenote: in the case where $q(z)$ is tractable but $\mathbb{H} \left(q(z) \right)$ is _not_ , we can use a Monte-Carlo estimate for this term too but this generally results in a higher-variance estimate. Also, I fooled you real good: the ELBO _isn't_ an arbitrary name, hah! In fact it's an abbreviation for the **expected lower bound (ELBO)** because it, uhmm, well, it's the _expected_ lower bound (remember $\mathrm{D_{KL}} \ge 0$). Yup. @@ -174,7 +174,7 @@ $$ for some function $g\_{\theta}$ differentiable wrt. $\theta$. So all $q_{\theta} \in \mathscr{Q}\_{\Theta}$ are using the *same* reparameterization-function $g$ but each $q\_{\theta}$ correspond to different choices of $\theta$ for $f\_{\theta}$. -Under this assumption we can differentiate the sampling process by taking the derivative of $g\_{\theta}$ wrt. $\theta$, and thus we can differentiate the entire $\widehat{\mathrm{ELBO}}(q\_{\theta})$ wrt. $\theta$! 

 With all this nailed down, we eventually reach the section on **Automatic Differentiation Variational Inference (ADVI)**.
@@ -186,7 +186,7 @@
 So let's revisit the assumptions we've made at this point:

 2. $\mathscr{Q}\_{\Theta}$ is a space of _reparameterizable_ densities with $\bar{q}(z)$ as the base-density.

- 3. The parameterization function $g\_{\theta}$ is differentiable wrt. $\theta$.
+ 3. The parameterisation function $g\_{\theta}$ is differentiable wrt. $\theta$.

 4. Evaluation of the probability density $q\_{\theta}(z)$ is differentiable wrt. $\theta$.
@@ -335,7 +335,7 @@
 $$

 #### Back to VI

-So why is this is useful? Well, we're looking to generalize our approach using a normal distribution to cases where the supports don't match up. How about defining $q(z)$ by
+So why is this useful? Well, we're looking to generalise our approach using a normal distribution to cases where the supports don't match up. How about defining $q(z)$ by

 ::: {.column-page}
 $$
diff --git a/developers/transforms/bijectors/index.qmd b/developers/transforms/bijectors/index.qmd
index e8c8bb840..761bc2927 100644
--- a/developers/transforms/bijectors/index.qmd
+++ b/developers/transforms/bijectors/index.qmd
@@ -10,7 +10,7 @@
 using Pkg;
 Pkg.instantiate();
 ```

-All the above has purely been a mathematical discussion of how distributions can be transformed.
+All the above has been purely a mathematical discussion of how distributions can be transformed.
 Now, we turn to their implementation in Julia, specifically using the [Bijectors.jl package](https://github.com/TuringLang/Bijectors.jl).

 ## Bijectors.jl
@@ -260,7 +260,7 @@
 println("went out of bounds $n_oob_transformed/10000 times")
 ```

 In the subsections above, we've seen two different methods of sampling from a constrained distribution:

 1. Sample directly from the distribution and reject any samples outside of its support.
-2. Transform the distribution to an unconstrained one and sample from that instead.
+2. Transform the distribution to an unconstrained one and sample from that instead.

 (Note that both of these methods are applicable to other samplers as well, such as Hamiltonian Monte Carlo.)
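+
+As a minimal sketch of how the second method looks with Bijectors.jl (using `LogNormal` as the constrained distribution; the variable names are illustrative):
+
+```{julia}
+#| eval: false
+using Bijectors, Distributions
+
+dist = LogNormal()
+b = bijector(dist)    # maps the support of `dist` to ℝ (here: log)
+binv = inverse(b)     # maps ℝ back to the support (here: exp)
+
+y = randn()           # a draw in unconstrained space
+x = binv(y)           # transformed back; always lands inside the support
+
+# The push-forward distribution includes the Jacobian correction in its density.
+q = transformed(dist, b)
+logpdf(q, y)
+```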
@@ -337,4 +337,4 @@ println("mean: $(mean(samples_questionable_untransformed))")
 ```

 You can see that even though we used ten times more samples, the mean is quite wrong, which implies that our samples are not being drawn from the correct distribution.

-In the next page, we'll see how to use these transformations in the context of a probabilistic programming language, paying particular attention to their handling in DynamicPPL.
+On the next page, we'll see how to use these transformations in the context of a probabilistic programming language, paying particular attention to their handling in DynamicPPL.
\ No newline at end of file
diff --git a/developers/transforms/dynamicppl/index.qmd b/developers/transforms/dynamicppl/index.qmd
index a02dde7af..3de1e09e1 100644
--- a/developers/transforms/dynamicppl/index.qmd
+++ b/developers/transforms/dynamicppl/index.qmd
@@ -10,7 +10,7 @@
 using Pkg;
 Pkg.instantiate();
 ```

-In the final part of this chapter, we'll discuss the higher-level implications of constrained distributions in the Turing.jl framework.
+In the final part of this chapter, we will discuss the higher-level implications of constrained distributions in the Turing.jl framework.

 ## Linked and unlinked VarInfos in DynamicPPL
@@ -22,7 +22,7 @@
 Random.seed!(468);

 using Turing
 ```

-When we are performing Bayesian inference, we're trying to sample from a joint probability distribution, which isn't usually a single, well-defined distribution like in the rather simplified example above.
+When we are performing Bayesian inference, we are trying to sample from a joint probability distribution, which isn't usually a single, well-defined distribution like in the rather simplified example above.
 However, each random variable in the model will have its own distribution, and often some of these will be constrained.
 For example, if `b ~ LogNormal()` is a random variable in a model, then $p(b)$ will be zero for any $b \leq 0$.
 Consequently, any joint probability $p(b, c, \ldots)$ will also be zero for any combination of parameters where $b \leq 0$, and so that joint distribution is itself constrained.
@@ -40,7 +40,7 @@
 end
 model = demo()
 vi = VarInfo(model)
 vn_x = @varname(x)
-# Retrieve the 'internal' representation of x – we'll explain this later
+# Retrieve the 'internal' representation of x – we'll explain this later
 DynamicPPL.getindex_internal(vi, vn_x)
 ```
@@ -59,7 +59,7 @@
 logpdf(LogNormal(), DynamicPPL.getindex_internal(vi, vn_x))
 ```

 In DynamicPPL, the `link` function can be used to transform the variables.
-This function does three things: firstly, it transforms the variables; secondly, it updates the value of logp (by adding the Jacobian term); and thirdly, it sets a flag on the variables to indicate that it has been transformed.
+This function does three things: first, it transforms the variables; second, it updates the value of logp (by adding the Jacobian term); and third, it sets a flag on the variables to indicate that it has been transformed.
 Note that this acts on _all_ variables in the model, including unconstrained ones.
 (Unconstrained variables just have an identity transformation.)
@@ -123,8 +123,8 @@ This sequence of events is summed up in the following diagram, where `f(..., arg
 

-In the final part of this article, we'll take a more in-depth look at the internal DynamicPPL machinery that allows us to convert between representations and obtain the correct probability densities.
+In the final part of this article, we will take a more in-depth look at the internal DynamicPPL machinery that allows us to convert between representations and obtain the correct probability densities.
-Before that, though, we'll take a quick high-level look at how the HMC sampler in Turing.jl uses the functions introduced so far.
+Before that, though, we will take a quick high-level look at how the HMC sampler in Turing.jl uses the functions introduced so far.
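+
+As a quick recap before the case study, here is a minimal sketch of the linking machinery introduced above (reusing the `demo` model, `vi`, and `vn_x` from earlier; `DynamicPPL.link` is assumed to return a new, linked `VarInfo`):
+
+```{julia}
+#| eval: false
+vi_linked = DynamicPPL.link(vi, model)
+
+# The internal representation is now the transformed (here: log) value ...
+DynamicPPL.getindex_internal(vi_linked, vn_x)
+# ... and the variable is flagged as having been transformed.
+DynamicPPL.istrans(vi_linked, vn_x)
+```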
## Case study: HMC in Turing.jl @@ -249,7 +249,7 @@ graph TD D(["if istrans(vi, vn)"]):::boxStyle E["f = from_internal_transform(vi, vn, dist)"]:::boxStyle F["f = from_linked_internal_transform(vi, vn, dist)"]:::boxStyle - G["x, logjac = with_logabsdet_jacobian(f, getindex_internal(vi, vn, dist))"]:::boxStyle + G["x, logjac = with_logabsdet_jacobian(f, getindex_internal(vi, vn, dist))"]:::boxStyle H["return x, logpdf(dist, x) - logjac, vi"]:::boxStyle A -.->|@model| B @@ -415,4 +415,4 @@ In this chapter of the Turing docs, we've looked at: - the higher-level usage of transforms in DynamicPPL and Turing. This will hopefully have equipped you with a better understanding of how constrained variables are handled in the Turing framework. -With this knowledge, you should especially find it easier to navigate DynamicPPL's `VarInfo` type, which forms the backbone of model evaluation. +With this knowledge, you should especially find it easier to navigate DynamicPPL's `VarInfo` type, which forms the backbone of model evaluation. \ No newline at end of file diff --git a/faq/index.qmd b/faq/index.qmd index 8519afae9..bac8f2e06 100644 --- a/faq/index.qmd +++ b/faq/index.qmd @@ -76,7 +76,7 @@ end - **Assume statements** (sampling statements): Often crash unpredictably or produce incorrect results - **AD backend compatibility**: Many AD backends don't support threading. Check the [multithreaded column in ADTests](https://turinglang.org/ADTests/) for compatibility -For safe parallelism within models, consider vectorized operations instead of explicit threading. +For safe parallelism within models, consider vectorised operations instead of explicit threading. ## How do I check the type stability of my Turing model? @@ -140,7 +140,7 @@ The choice of AD backend can significantly impact performance. See: Small changes can have big performance impacts. Common culprits include: - Type instability introduced by the change -- Switching from vectorized to scalar operations (or vice versa) +- Switching from vectorised to scalar operations (or vice versa) - Inadvertently causing AD backend incompatibilities - Breaking assumptions that allowed compiler optimizations diff --git a/tutorials/bayesian-differential-equations/index.qmd b/tutorials/bayesian-differential-equations/index.qmd index 8479215e6..9c911281c 100755 --- a/tutorials/bayesian-differential-equations/index.qmd +++ b/tutorials/bayesian-differential-equations/index.qmd @@ -14,8 +14,8 @@ Pkg.instantiate(); A basic scientific problem is to mathematically model a system of interest, then compare this model to the observable reality around us. Such models often involve dynamical systems of differential equations. -In practice, these equations often have unkown parameters we would like to estimate. -The “forward problem” of simulation consists of solving the differential equations for a given set of parameters, while the “inverse problem," also known as parameter estimation, is the process of utilizing data to determine these model parameters. +In practice, these equations often have unknown parameters we would like to estimate. +The "[forward problem](https://en.wikipedia.org/wiki/Well-posed_problem)" of simulation consists of solving the differential equations for a given set of parameters, while the "[inverse problem](https://en.wikipedia.org/wiki/Inverse_problem)" of parameter estimation uses observed data to infer the unknown model parameters. Bayesian inference provides a robust approach to parameter estimation with quantified uncertainty. 
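+
+As a minimal sketch of the distinction (with hypothetical parameter values, and assuming an OrdinaryDiffEq.jl solver; the predator-prey model below is the tutorial's running example):
+
+```{julia}
+#| eval: false
+using OrdinaryDiffEq
+
+# Forward problem: given parameters p, simulate the system.
+function lotka_volterra!(du, u, p, t)
+    α, β, γ, δ = p
+    du[1] = α * u[1] - β * u[1] * u[2]   # prey
+    du[2] = δ * u[1] * u[2] - γ * u[2]   # predator
+end
+
+# Hypothetical initial conditions, time span, and parameter values.
+prob_fwd = ODEProblem(lotka_volterra!, [1.0, 1.0], (0.0, 10.0), [1.5, 1.0, 3.0, 1.0])
+sol = solve(prob_fwd, Tsit5())
+# Inverse problem: given noisy observations of `sol`, infer p (the topic of this tutorial).
+```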
 ```{julia}
@@ -78,9 +78,9 @@ plot(solve(prob, Tsit5()))

 We generate noisy observations to use for the parameter estimation tasks in this tutorial.
 With the [`saveat` argument](https://docs.sciml.ai/DiffEqDocs/stable/basics/common_solver_opts/) to the differential equation solver, we specify that the solution is stored only at `0.1` time units.
-To make the example more realistic, we generate data as random Poisson counts based on the "true" population densities of predator and prey from the simulation.
+To make the example more realistic, we generate data as random [Poisson](https://en.wikipedia.org/wiki/Poisson_distribution) counts based on the "true" population densities of predator and prey from the simulation.
 Poisson-distributed data are common in ecology (for instance, counts of animals detected by a camera trap).
-We'll assume that the rate $\lambda$, which parameterizes the Poisson distribution, is proportional to the underlying animal densities via a constant factor $q = 1.7$.
+We assume the Poisson rate parameter $\lambda$ is proportional to the underlying animal densities, with proportionality constant $q = 1.7$ (representing observation efficiency).

 ```{julia}
@@ -309,7 +309,7 @@ plot(chain_dde)
 ```

 Finally, we plot trajectories of 300 randomly selected samples from the posterior.
-Again, the dots indicate our observations, the colored lines are the "true" simulations without noise, and the gray lines are trajectories from the posterior samples.
+Again, the dots indicate our observations, the coloured lines are the "true" simulations without noise, and the grey lines are trajectories from the posterior samples.

 ```{julia}
 plot(; legend=false)
@@ -329,12 +329,12 @@
 The fit is pretty good even though the data was quite noisy to start.

 ## Scaling to Large Models: Adjoint Sensitivities

 DifferentialEquations.jl's efficiency for large stiff models has been shown in [multiple benchmarks](https://github.com/SciML/DiffEqBenchmarks.jl).
-To learn more about how to optimize solving performance for stiff problems you can take a look at the [docs](https://docs.sciml.ai/DiffEqDocs/stable/tutorials/advanced_ode_example/).
+To learn more about how to optimise solving performance for stiff problems, you can take a look at the [docs](https://docs.sciml.ai/DiffEqDocs/stable/tutorials/advanced_ode_example/).

 _Sensitivity analysis_ is provided by the [SciMLSensitivity.jl package](https://docs.sciml.ai/SciMLSensitivity/stable/), which forms part of SciML's differential equation suite.
 The model sensitivities are the derivatives of the solution with respect to the parameters.
 Specifically, the local sensitivity of the solution to a parameter is defined by how much the solution would change if the parameter were changed by a small amount.
-Sensitivity analysis provides a cheap way to calculate the gradient of the solution which can be used in parameter estimation and other optimization tasks.
+Sensitivity analysis provides a cheap way to calculate the gradient of the solution, which can be used in parameter estimation and other optimisation tasks.

 The sensitivity analysis methods in SciMLSensitivity.jl are based on automatic differentiation (AD), and are compatible with many of Julia's AD backends.
 More details on the mathematical theory that underpins these methods can be found in [the SciMLSensitivity documentation](https://docs.sciml.ai/SciMLSensitivity/stable/sensitivity_math/).
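+
+Concretely, for an ODE $\dot{u} = f(u, p, t)$, the local sensitivities $s_j(t) = \partial u(t; p) / \partial p_j$ themselves satisfy a linear ODE obtained by differentiating the original system:
+
+$$
+\frac{\mathrm{d}s_j}{\mathrm{d}t} = \frac{\partial f}{\partial u} s_j + \frac{\partial f}{\partial p_j}
+$$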
@@ -355,4 +355,4 @@
 In this case, SciMLSensitivity will automatically choose an appropriate sensitivity analysis algorithm for you.
 You can also manually specify an algorithm by providing the `sensealg` keyword argument to the `solve` function; the existing algorithms are covered in [this page of the SciMLSensitivity docs](https://docs.sciml.ai/SciMLSensitivity/stable/manual/differential_equation_sensitivities/).

-For more examples of adjoint usage on large parameter models, consult the [DiffEqFlux documentation](https://docs.sciml.ai/DiffEqFlux/stable/).
+For more examples of adjoint usage on large parameter models, consult the [DiffEqFlux documentation](https://docs.sciml.ai/DiffEqFlux/stable/).
\ No newline at end of file
diff --git a/tutorials/bayesian-linear-regression/index.qmd b/tutorials/bayesian-linear-regression/index.qmd
index c28148371..7c6cf5c66 100755
--- a/tutorials/bayesian-linear-regression/index.qmd
+++ b/tutorials/bayesian-linear-regression/index.qmd
@@ -12,7 +12,7 @@
 using Pkg;
 Pkg.instantiate();
 ```

-Turing is powerful when applied to complex hierarchical models, but it can also be put to task at common statistical procedures, like [linear regression](https://en.wikipedia.org/wiki/Linear_regression).
+Turing is powerful when applied to complex hierarchical models, but it can also be used for common statistical procedures, like [linear regression](https://en.wikipedia.org/wiki/Linear_regression).
 This tutorial covers how to implement a linear regression model in Turing.

 ## Set Up
@@ -26,7 +26,7 @@
 using Turing

 # Package for loading the data set.
 using RDatasets

-# Package for visualization.
+# Package for visualisation.
 using StatsPlots

 # Functionality for splitting the data.
@@ -35,7 +35,7 @@ using MLUtils: splitobs

 # Functionality for constructing arrays with identical elements efficiently.
 using FillArrays

-# Functionality for normalizing the data and evaluating the model predictions.
+# Functionality for normalising the data and evaluating the model predictions.
 using StatsBase

 # Functionality for working with scaled identity matrices.
@@ -54,7 +54,7 @@
 setprogress!(false)
 ```

 We will use the `mtcars` dataset from the [RDatasets](https://github.com/JuliaStats/RDatasets.jl) package.
 `mtcars` contains a variety of statistics on different car models, including their miles per gallon, number of cylinders, and horsepower, among others.

-We want to know if we can construct a Bayesian linear regression model to predict the miles per gallon of a car, given the other statistics it has.
+We want to know if we can construct a Bayesian linear regression model to predict the miles per gallon of a car, given the other statistics it possesses.
 Let us take a look at the data we have.

 ```{julia}
@@ -69,7 +69,7 @@ first(data, 6)
 size(data)
 ```

-The next step is to get our data ready for testing. We'll split the `mtcars` dataset into two subsets, one for training our model and one for evaluating our model. Then, we separate the targets we want to learn (`MPG`, in this case) and standardize the datasets by subtracting each column's means and dividing by the standard deviation of that column. The resulting data is not very familiar looking, but this standardization process helps the sampler converge far easier.
+The next step is to get our data ready for testing. We'll split the `mtcars` dataset into two subsets, one for training our model and one for evaluating our model.
Then, we separate the targets we want to learn (`MPG`, in this case) and standardise the datasets by subtracting each column's mean and dividing by the standard deviation of that column. This [standardisation](https://en.wikipedia.org/wiki/Feature_scaling) ensures all features have similar scales (mean 0, standard deviation 1), which helps the sampler explore the parameter space more efficiently.

 ```{julia}
 # Remove the model column.
@@ -85,12 +85,12 @@ test = Matrix(select(testset, Not(target)))
 train_target = trainset[:, target]
 test_target = testset[:, target]

-# Standardize the features.
+# Standardise the features.
 dt_features = fit(ZScoreTransform, train; dims=1)
 StatsBase.transform!(dt_features, train)
 StatsBase.transform!(dt_features, test)

-# Standardize the targets.
+# Standardise the targets.
 dt_targets = fit(ZScoreTransform, train_target)
 StatsBase.transform!(dt_targets, train_target)
 StatsBase.transform!(dt_targets, test_target);
@@ -115,9 +115,9 @@ where $\alpha$ is an intercept term common to all observations, $\boldsymbol{\be
 For $\sigma^2$, we assign a prior of `truncated(Normal(0, 100); lower=0)`.
 This is consistent with [Andrew Gelman's recommendations](http://www.stat.columbia.edu/%7Egelman/research/published/taumain.pdf) on noninformative priors for variance.
 The intercept term ($\alpha$) is assumed to be normally distributed with a mean of zero and a variance of three.
-This represents our assumptions that miles per gallon can be explained mostly by our assorted variables, but a high variance term indicates our uncertainty about that.
+This represents our assumption that miles per gallon can be explained mostly by our various variables, but a high variance term indicates our uncertainty about that.
 Each coefficient is assumed to be normally distributed with a mean of zero and a variance of 10.
-We do not know that our coefficients are different from zero, and we don't know which ones are likely to be the most important, so the variance term is quite high.
+We do not know that our coefficients are different from zero, and we do not know which ones are likely to be the most important, so the variance term is quite high.
 Lastly, each observation $y_i$ is distributed according to the calculated `mu` term given by $\alpha + \boldsymbol{\beta}^\mathsf{T}\boldsymbol{X_i}$.

 ```{julia}
@@ -176,17 +176,17 @@ using GLM

 train_with_intercept = hcat(ones(size(train, 1)), train)
 ols = lm(train_with_intercept, train_target)

-# Compute predictions on the training data set and unstandardize them.
+# Compute predictions on the training data set and unstandardise them.
 train_prediction_ols = GLM.predict(ols)
 StatsBase.reconstruct!(dt_targets, train_prediction_ols)

-# Compute predictions on the test data set and unstandardize them.
+# Compute predictions on the test data set and unstandardise them.
 test_with_intercept = hcat(ones(size(test, 1)), test)
 test_prediction_ols = GLM.predict(ols, test_with_intercept)
 StatsBase.reconstruct!(dt_targets, test_prediction_ols);
 ```

-The function below accepts a chain and an input matrix and calculates predictions. We use the samples of the model parameters in the chain starting with sample 200.
+The function below accepts a chain and an input matrix and calculates predictions. We use the samples from sample 200 onwards, discarding the initial samples as [burn-in](https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo) to allow the sampler to reach the typical set.

 ```{julia}
 # Make a prediction given an input vector.
@@ -197,10 +197,10 @@ function prediction(chain, x) end ``` -When we make predictions, we unstandardize them so they are more understandable. +When we make predictions, we unstandardise them so they are more understandable. ```{julia} -# Calculate the predictions for the training and testing sets and unstandardize them. +# Calculate the predictions for the training and testing sets and unstandardise them. train_prediction_bayes = prediction(chain, train) StatsBase.reconstruct!(dt_targets, train_prediction_bayes) test_prediction_bayes = prediction(chain, test) @@ -214,7 +214,7 @@ Now let's evaluate the loss for each method, and each prediction set. We will us $$ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^n {(y_i - \hat{y_i})^2} $$ -where $y_i$ is the actual value (true MPG) and $\hat{y_i}$ is the predicted value using either OLS or Bayesian linear regression. A lower SSE indicates a closer fit to the data. +where $y_i$ is the actual value (true MPG) and $\hat{y_i}$ is the predicted value using either OLS or Bayesian linear regression. A lower MSE indicates a closer fit to the data. ```{julia} println( @@ -248,4 +248,4 @@ let end ``` -As we can see above, OLS and our Bayesian model fit our training and test data set about the same. +As we can see above, OLS and our Bayesian model fit our training and test data set about the same. \ No newline at end of file diff --git a/tutorials/bayesian-logistic-regression/index.qmd b/tutorials/bayesian-logistic-regression/index.qmd index 0bdf325ae..8d46cf9ad 100755 --- a/tutorials/bayesian-logistic-regression/index.qmd +++ b/tutorials/bayesian-logistic-regression/index.qmd @@ -12,7 +12,7 @@ using Pkg; Pkg.instantiate(); ``` -[Bayesian logistic regression](https://en.wikipedia.org/wiki/Logistic_regression#Bayesian) is the Bayesian counterpart to a common tool in machine learning, logistic regression. +Bayesian logistic regression is the Bayesian counterpart to a common tool in machine learning, logistic regression. The goal of logistic regression is to predict a one or a zero for a given training item. An example might be predicting whether someone is sick or ill given their symptoms and personal information. @@ -27,13 +27,13 @@ using Turing, Distributions # Import RDatasets. using RDatasets -# Import MCMCChains, Plots, and StatsPlots for visualizations and diagnostics. +# Import MCMCChains, Plots, and StatsPlots for visualisations and diagnostics. using MCMCChains, Plots, StatsPlots # We need a logistic function, which is provided by StatsFuns. using StatsFuns: logistic -# Functionality for splitting and normalizing the data +# Functionality for splitting and normalising the data using MLDataUtils: shuffleobs, stratifiedobs, rescale! # Set a seed for reproducibility. @@ -69,7 +69,7 @@ first(data, 6) After we've done that tidying, it's time to split our dataset into training and testing sets, and separate the labels from the data. We separate our data into two halves, `train` and `test`. You can use a higher percentage of splitting (or a lower one) by modifying the `at = 0.05` argument. We have highlighted the use of only a 5% sample to show the power of Bayesian inference with small sample sizes. -We must rescale our variables so that they are centered around zero by subtracting each column by the mean and dividing it by the standard deviation. Without this step, Turing's sampler will have a hard time finding a place to start searching for parameter estimates. 
To do this we will leverage `MLDataUtils`, which also lets us effortlessly shuffle our observations and perform a stratified split to get a representative test set.
+We must rescale our variables so that they are centred around zero by subtracting each column's mean and dividing by its standard deviation. This rescaling ensures features are on comparable scales, which improves sampler initialisation and convergence. To do this we will leverage `MLDataUtils`, which also lets us effortlessly shuffle our observations and perform a [stratified split](https://en.wikipedia.org/wiki/Stratified_sampling) to get a representative test set.

 ```{julia}
 function split_data(df, target; at=0.70)
@@ -232,7 +232,7 @@ function prediction(x::Matrix, chain, threshold)
 end;
 ```

-Let's see how we did! We run the test matrix through the prediction function, and compute the [mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error) (MSE) for our prediction. The `threshold` variable sets the sensitivity of the predictions. For example, a threshold of 0.07 will predict a defualt value of 1 for any predicted value greater than 0.07 and no default if it is less than 0.07.
+Let's see how we did! We run the test matrix through the prediction function, and compute the [mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error) (MSE) for our prediction. The `threshold` variable sets the decision boundary for classification. For example, a threshold of 0.07 will predict a default (value of 1) for any predicted probability greater than 0.07 and no default otherwise. Lower thresholds increase sensitivity but may increase false positives.

 ```{julia}
 # Set the prediction threshold.
@@ -273,4 +273,4 @@
 The above shows that with a threshold of 0.07, we correctly predict a respectable portion of the defaults, and correctly identify most non-defaults.
 This is fairly sensitive to a choice of threshold, and you may wish to experiment with it.

-This tutorial has demonstrated how to use Turing to perform Bayesian logistic regression.
+This tutorial has demonstrated how to use Turing to perform Bayesian logistic regression.
\ No newline at end of file
diff --git a/tutorials/bayesian-neural-networks/index.qmd b/tutorials/bayesian-neural-networks/index.qmd
index c5652c057..e2a8774f9 100755
--- a/tutorials/bayesian-neural-networks/index.qmd
+++ b/tutorials/bayesian-neural-networks/index.qmd
@@ -28,8 +28,7 @@
 using LinearAlgebra
 using Random
 ```

-Our goal here is to use a Bayesian neural network to classify points in an artificial dataset.
-The code below generates data points arranged in a box-like pattern and displays a graph of the dataset we will be working with.
+Our goal here is to use a Bayesian neural network to classify points in an artificial dataset. The code below generates data points arranged in a box-like pattern and displays a graph of the dataset we will be working with.

 ```{julia}
 # Number of points to generate
@@ -73,9 +72,7 @@
 plot_data()
 ```

 ## Building a Neural Network

-The next step is to define a [feedforward neural network](https://en.wikipedia.org/wiki/Feedforward_neural_network) where we express our parameters as distributions, and not single points as with traditional neural networks.
-For this we will use `Dense` to define liner layers and compose them via `Chain`, both are neural network primitives from Lux.
-The network `nn_initial` we created has two hidden layers with `tanh` activations and one output layer with sigmoid (`σ`) activation, as shown below.
+The next step is to define a [feedforward neural network](https://en.wikipedia.org/wiki/Feedforward_neural_network) where we express our parameters as distributions, and not single points as with traditional neural networks. For this we will use `Dense` to define linear layers and compose them via `Chain`; both are neural network primitives from Lux. The network `nn_initial` we created has two hidden layers with `tanh` activations and one output layer with sigmoid (`σ`) activation, as shown below.

 ```{dot}
 //| echo: false
@@ -142,8 +139,7 @@ graph G {
 }
 ```

-The `nn_initial` is an instance that acts as a function and can take data as inputs and output predictions.
-We will define distributions on the neural network parameters.
+`nn_initial` is an instance that acts as a function and can take data as inputs and output predictions. We will define distributions on the neural network parameters.

 ```{julia}
 # Construct a neural network using Lux
@@ -163,8 +159,7 @@
 alpha = 0.09
 sigma = sqrt(1.0 / alpha)
 ```

-We also define a function to construct a named tuple from a vector of sampled parameters.
-(We could use [`ComponentArrays`](https://github.com/jonniedie/ComponentArrays.jl) here and broadcast to avoid doing this, but this way avoids introducing an extra dependency.)
+We also define a function to construct a named tuple from a vector of sampled parameters. (We could use [`ComponentArrays`](https://github.com/jonniedie/ComponentArrays.jl) here and broadcast to avoid doing this, but this way avoids introducing an extra dependency.)

 ```{julia}
 function vector_to_parameters(ps_new::AbstractVector, ps::NamedTuple)
@@ -213,8 +208,7 @@ n_iters = 2_000
 ch = sample(bayes_nn(reduce(hcat, xs), ts), NUTS(; adtype=AutoMooncake()), n_iters);
 ```

-Now we extract the parameter samples from the sampled chain as `θ` (this is of size `5000 x 20` where `5000` is the number of iterations and `20` is the number of parameters).
-We'll use these primarily to determine how good our model's classifier is.
+Now we extract the parameter samples from the sampled chain as `θ` (this is of size `2000 x 20` where `2000` is the number of iterations and `20` is the number of parameters). We'll use these primarily to determine how good our model's classifier is.

 ```{julia}
 # Extract all weight and bias parameters.
@@ -248,7 +242,7 @@
 fig
 ```

 The contour plot above shows that the MAP method is not too bad at classifying our data.

-Now we can visualize our predictions.
+Now we can visualise our predictions.

 $$
 p(\tilde{x} | X, \alpha) = \int_{\theta} p(\tilde{x} | \theta) p(\theta | X, \alpha) \approx \sum_{\theta \sim p(\theta | X, \alpha)}f_{\theta}(\tilde{x})
 $$
@@ -292,4 +286,4 @@
 anim = @gif for i in 1:n_end
 end every 5
 ```

-This has been an introduction to the applications of Turing and Lux in defining Bayesian neural networks.
+This has been an introduction to the applications of Turing and Lux in defining Bayesian neural networks.
\ No newline at end of file
diff --git a/tutorials/bayesian-poisson-regression/index.qmd b/tutorials/bayesian-poisson-regression/index.qmd
index dbbd5fb87..9a8df4430 100755
--- a/tutorials/bayesian-poisson-regression/index.qmd
+++ b/tutorials/bayesian-poisson-regression/index.qmd
@@ -26,7 +26,7 @@
 We start by importing the required libraries.

 #Import Turing, Distributions and DataFrames
 using Turing, Distributions, DataFrames, Distributed

-# Import MCMCChain, Plots, and StatsPlots for visualizations and diagnostics.
+# Import MCMCChains, Plots, and StatsPlots for visualisations and diagnostics.
 using MCMCChains, Plots, StatsPlots

 # Set a seed for reproducibility.
@@ -36,7 +36,7 @@
 Random.seed!(12);
 ```

 # Generating data

-We start off by creating a toy dataset. We take the case of a person who takes medicine to prevent excessive sneezing. Alcohol consumption increases the rate of sneezing for that person. Thus, the two factors affecting the number of sneezes in a given day are alcohol consumption and whether the person has taken his medicine. Both these variable are taken as boolean valued while the number of sneezes will be a count valued variable. We also take into consideration that the interaction between the two boolean variables will affect the number of sneezes
+We start off by creating a toy dataset. We take the case of a person who takes medicine to prevent excessive sneezing. Alcohol consumption increases the rate of sneezing for that person. Thus, the two factors affecting the number of sneezes in a given day are alcohol consumption and whether the person has taken his medicine. Both these variables are taken as boolean-valued, while the number of sneezes will be a count-valued variable. We also take into consideration that the interaction between the two boolean variables will affect the number of sneezes.

 5 random rows are printed from the generated data to get a gist of the data generated.
@@ -108,7 +108,7 @@ data_labels = df[:, :nsneeze]
 data
 ```

-We must recenter our data about 0 to help the Turing sampler in initialising the parameter estimates. So, normalising the data in each column by subtracting the mean and dividing by the standard deviation:
+We must standardise our data (centring about 0 with unit variance) to help the Turing sampler initialise parameter estimates effectively. We do this by subtracting the mean and dividing by the standard deviation for each column:

 ```{julia}
 # Rescale our matrices.
@@ -220,10 +220,9 @@
 The exponentiated mean of the coefficient `b1` is roughly half of that of `b2`.

 # Removing the Warmup Samples

 As can be seen from the plots above, the parameters converge to their final distributions after a few iterations.
-The initial values during the warmup phase increase the standard deviations of the parameters and are not required after we get the desired distributions.
-Thus, we remove these warmup values and once again view the diagnostics.
-To remove these warmup values, we take all values except the first 200.
-This is because we set the second parameter of the NUTS sampler (which is the number of adaptations) to be equal to 200.
+The initial samples during the [warmup phase](https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo) show higher variability and should be discarded before computing posterior statistics.
+Thus, we remove these warmup samples and view the diagnostics again.
+We discard the first 200 samples, corresponding to the number of adaptation steps configured for the NUTS sampler (they remain in the chain because of the `discard_adapt=false` argument above).

 ```{julia}
 chains_new = chain[201:end, :, :]
@@ -233,4 +232,4 @@
 plot(chains_new)
 ```

-As can be seen from the numeric values and the plots above, the standard deviation values have decreased and all the plotted values are from the estimated posteriors. The exponentiated mean values, with the warmup samples removed, have not changed by much and they are still in accordance with their intuitive meanings as described earlier.
+As can be seen from the numeric values and the plots above, the standard deviation values have decreased and all the plotted values are from the estimated posteriors. The exponentiated mean values, with the warmup samples removed, have not changed by much and they are still in accordance with their intuitive meanings as described earlier.
\ No newline at end of file
diff --git a/tutorials/bayesian-time-series-analysis/index.qmd b/tutorials/bayesian-time-series-analysis/index.qmd
index 0cdf201d6..4ed1a64aa 100755
--- a/tutorials/bayesian-time-series-analysis/index.qmd
+++ b/tutorials/bayesian-time-series-analysis/index.qmd
@@ -30,10 +30,10 @@
 or we can decompose $g(t)$ into a product of $m$ components

 $$g(t) = \prod_{i=1}^m g_i(t).$$

-We refer to this as *additive* or *multiplicative* decomposition respectively.
-This type of decomposition is great since it lets us reason about individual components, which makes encoding prior information and interpreting model predictions very easy.
+We refer to this as *additive* or *multiplicative* [decomposition](https://en.wikipedia.org/wiki/Decomposition_of_time_series), respectively.
+This type of decomposition allows us to reason about individual components separately, which simplifies encoding prior information and interpreting model predictions.
 Two common components are *trends*, which represent the overall change of the time series (often assumed to be linear),
-and *cyclic effects* which contribute oscillating effects around the trend.
+and *cyclic effects*, which contribute periodic oscillations around the trend.
 Let us simulate some data with an additive linear trend and oscillating effects.

 ```{julia}
diff --git a/tutorials/coin-flipping/index.qmd b/tutorials/coin-flipping/index.qmd
index 757525249..dc1674cd1 100755
--- a/tutorials/coin-flipping/index.qmd
+++ b/tutorials/coin-flipping/index.qmd
@@ -27,7 +27,7 @@
 using Random
 Random.seed!(12); # Set seed for reproducibility
 ```

-and to visualize our results.
+and to visualise our results.

 ```{julia}
 using StatsPlots
@@ -63,7 +63,7 @@ data[1:5]
 ```

 The following example illustrates the effect of updating our beliefs with every piece of new evidence we observe.

-Assume that we are unsure about the probability of heads in a coin flip. To get an intuitive understanding of what "updating our beliefs" is, we will visualize the probability of heads in a coin flip after each observed evidence.
+Assume that we are unsure about the probability of heads in a coin flip. To get an intuitive understanding of what "updating our beliefs" is, we will visualise the probability of heads in a coin flip after each piece of observed evidence.

 We begin by specifying a prior belief about the distribution of heads and tails in a coin toss. Here we choose a [Beta](https://en.wikipedia.org/wiki/Beta_distribution) distribution as prior distribution for the probability of heads. Before any coin flip is observed, we assume a uniform distribution $\operatorname{U}(0, 1) = \operatorname{Beta}(1, 1)$ of the probability of heads. I.e., every probability is equally likely initially.

 ```{julia}
 prior_belief = Beta(1, 1);
 ```

 With our priors set and our data at hand, we can perform Bayesian inference.

-This is a fairly simple process. We expose one additional coin flip to our model every iteration, such that the first run only sees the first coin flip, while the last iteration sees all the coin flips. In each iteration we update our belief to an updated version of the original Beta distribution that accounts for the new proportion of heads and tails. The update is particularly simple since our prior distribution is a [conjugate prior](https://en.wikipedia.org/wiki/Conjugate_prior). Note that a closed-form expression for the posterior (implemented in the `updated_belief` expression below) is not accessible in general and usually does not exist for more interesting models.
+This is a fairly simple process. We expose one additional coin flip to our model every iteration, such that the first run only sees the first coin flip, while the last iteration sees all the coin flips. In each iteration we update our belief to an updated version of the original Beta distribution that accounts for the new proportion of heads and tails. The update is particularly simple since our prior distribution is a [conjugate prior](https://en.wikipedia.org/wiki/Conjugate_prior), which allows for analytical posterior computation. Note that such closed-form expressions (as implemented in the `updated_belief` function below) are not available for most models, which is why we need sampling methods like MCMC.
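+
+For the Beta prior used here, the conjugate update has a particularly simple closed form: after observing $h$ heads and $t$ tails, the prior $\operatorname{Beta}(\alpha, \beta)$ becomes the posterior
+
+$$
+\operatorname{Beta}(\alpha + h, \beta + t).
+$$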

 ```{julia}
 function updated_belief(prior_belief::Beta, data::AbstractArray{Bool})
@@ -114,7 +114,7 @@
 The mean of the $\operatorname{Beta}(\alpha, \beta)$ distribution is

 $$\operatorname{E}[X] = \dfrac{\alpha}{\alpha+\beta}.$$

-This implies that the plot of the distribution will become centered around 0.5 for a large enough number of coin flips, as we expect $\alpha \approx \beta$.
+This implies that the plot of the distribution will become centred around 0.5 for a large enough number of coin flips, as we expect $\alpha \approx \beta$.

 The variance of the $\operatorname{Beta}(\alpha, \beta)$ distribution is

 $$\operatorname{var}[X] = \dfrac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}.$$

 Thus the variance of the distribution will approach 0 with more and more samples, as the denominator will grow faster than will the numerator. More samples means less variance.
-This implies that the distribution will reflect less uncertainty about the probability of receiving a heads and the plot will become more tightly centered around 0.5 for a large enough number of coin flips.
+This implies that the distribution will reflect less uncertainty about the probability of receiving a heads and the plot will become more tightly centred around 0.5 for a large enough number of coin flips.

 ### Coin Flipping With Turing

@@ -134,7 +134,7 @@
 To do so, we first need to load `Turing`.

 ```{julia}
 using Turing
 ```

-Additionally, we load `MCMCChains`, a library for analyzing and visualizing the samples with which we approximate the posterior distribution.
+Additionally, we load `MCMCChains`, a library for analysing and visualising the samples with which we approximate the posterior distribution.

 ```{julia}
 using MCMCChains
@@ -165,9 +165,9 @@ Here we defined a model that is not conditioned on any specific observations as
 rand(coinflip(; N))
 ```

-The model can be conditioned on some observations with `|`.
+The model can be conditioned on observations using the `|` operator, which fixes certain variables to observed values.
 See the [documentation of the `condition` syntax](https://turinglang.github.io/DynamicPPL.jl/stable/api/#Condition-and-decondition) in `DynamicPPL.jl` for more details.
-In the conditioned `model` the observations `y` are fixed to `data`.
+In the conditioned model below, the observations `y` are fixed to `data`.
```{julia} coinflip(y::AbstractVector{<:Real}) = coinflip(; N=length(y)) | (; y) @@ -205,10 +205,10 @@ Now we can build our plot: ``` ```{julia} -# Visualize a blue density plot of the approximate posterior distribution using HMC (see Chain 1 in the legend). +# Visualise a blue density plot of the approximate posterior distribution using HMC (see Chain 1 in the legend). density(chain; xlim=(0, 1), legend=:best, w=2, c=:blue) -# Visualize a green density plot of the posterior distribution in closed-form. +# Visualise a green density plot of the posterior distribution in closed-form. plot!( 0:0.01:1, pdf.(updated_belief(prior_belief, data), 0:0.01:1); @@ -223,10 +223,10 @@ plot!( c=:lightgreen, ) -# Visualize the true probability of heads in red. +# Visualise the true probability of heads in red. vline!([p_true]; label="True probability", c=:red) ``` As we can see, the samples obtained with Turing closely approximate the true posterior distribution. Hopefully this tutorial has provided an easy-to-follow, yet informative introduction to Turing's simpler applications. -More advanced usage is demonstrated in other tutorials. +More advanced usage is demonstrated in other tutorials. \ No newline at end of file diff --git a/tutorials/gaussian-mixture-models/index.qmd b/tutorials/gaussian-mixture-models/index.qmd index 99db86bc2..2cec11b69 100755 --- a/tutorials/gaussian-mixture-models/index.qmd +++ b/tutorials/gaussian-mixture-models/index.qmd @@ -57,8 +57,8 @@ scatter(x[1, :], x[2, :]; legend=false, title="Synthetic Dataset") We are interested in recovering the grouping from the dataset. More precisely, we want to infer the mixture weights, the parameters $\mu_1$ and $\mu_2$, and the assignment of each datum to a cluster for the generative Gaussian mixture model. -In a Bayesian Gaussian mixture model with $K$ components each data point $x_i$ ($i = 1,\ldots,N$) is generated according to the following generative process. -First we draw the model parameters, i.e., in our example we draw parameters $\mu_k$ for the mean of the isotropic normal distributions and the mixture weights $w$ of the $K$ clusters. +In a Bayesian [Gaussian mixture model](https://en.wikipedia.org/wiki/Mixture_model#Gaussian_mixture_model) with $K$ components, each data point $x_i$ ($i = 1,\ldots,N$) is generated according to the following generative process. +First we draw the model parameters: the cluster means $\mu_k$ and the mixture weights $w$ that determine the probability of each cluster. We use standard normal distributions as priors for $\mu_k$ and a Dirichlet distribution with parameters $\alpha_1 = \cdots = \alpha_K = 1$ as prior for $w$: $$ \begin{aligned} @@ -152,7 +152,7 @@ end ## Inferred Mixture Model -After sampling we can visualize the trace and density of the parameters of interest. +After sampling we can visualise the trace and density of the parameters of interest. We consider the samples of the location parameters $\mu_1$ and $\mu_2$ for the two clusters. @@ -227,7 +227,7 @@ We also inspect the samples of the mixture weights $w$. plot(chains[["w[1]", "w[2]"]]; legend=true) ``` -As the distributions of the samples for the parameters $\mu_1$, $\mu_2$, $w_1$, and $w_2$ are unimodal, we can safely visualize the density region of our model using the average values. +As the distributions of the samples for the parameters $\mu_1$, $\mu_2$, $w_1$, and $w_2$ are unimodal, we can safely visualise the density region of our model using the average values. ```{julia} # Model with mean of samples as parameters. 
diff --git a/tutorials/gaussian-process-latent-variable-models/index.qmd b/tutorials/gaussian-process-latent-variable-models/index.qmd
index e6a9c3f62..906332315 100755
--- a/tutorials/gaussian-process-latent-variable-models/index.qmd
+++ b/tutorials/gaussian-process-latent-variable-models/index.qmd
@@ -62,7 +62,7 @@
 dat[:, 2] = dat[:, 2] .^ 3 + 0.2 * dat[:, 2] .^ 4
 dat[:, 3] = 0.1 * exp.(dat[:, 3]) - 0.2 * dat[:, 3] .^ 2
 dat[:, 4] = 0.5 * log.(dat[:, 4]) .^ 2 + 0.01 * dat[:, 3] .^ 5

-# normalize data
+# normalise data
 dt = fit(ZScoreTransform, dat; dims=1);
 StatsBase.transform!(dt, dat);
 ```
diff --git a/tutorials/gaussian-processes-introduction/index.qmd b/tutorials/gaussian-processes-introduction/index.qmd
index 6db5a1388..9480b87da 100755
--- a/tutorials/gaussian-processes-introduction/index.qmd
+++ b/tutorials/gaussian-processes-introduction/index.qmd
@@ -14,7 +14,7 @@
 Pkg.instantiate();

 [JuliaGPs](https://github.com/JuliaGaussianProcesses/#welcome-to-juliagps) packages integrate well with Turing.jl because they implement the Distributions.jl interface.
-You should be able to understand what is going on in this tutorial if you know what a GP is.
+This tutorial assumes familiarity with [Gaussian processes](https://en.wikipedia.org/wiki/Gaussian_process); for an introduction, see [Rasmussen and Williams (2006)](http://www.gaussianprocess.org/gpml/).
 For a more in-depth understanding of the [JuliaGPs](https://github.com/JuliaGaussianProcesses/#welcome-to-juliagps) functionality used here, please consult the
diff --git a/tutorials/hidden-markov-models/index.qmd b/tutorials/hidden-markov-models/index.qmd
index 9448b5794..c83c6882b 100755
--- a/tutorials/hidden-markov-models/index.qmd
+++ b/tutorials/hidden-markov-models/index.qmd
@@ -16,7 +16,7 @@ This tutorial illustrates training Bayesian [hidden Markov models](https://en.wi
 The main goals are learning the transition matrix, emission parameter, and hidden states.
 For a more rigorous academic overview of hidden Markov models, see [An Introduction to Hidden Markov Models and Bayesian Networks](https://mlg.eng.cam.ac.uk/zoubin/papers/ijprai.pdf) (Ghahramani, 2001).

-In this tutorial, we assume there are $k$ discrete hidden states; the observations are continuous and normally distributed - centered around the hidden states. This assumption reduces the number of parameters to be estimated in the emission matrix.
+In this tutorial, we assume there are $k$ discrete hidden states; the observations are continuous and normally distributed, centred around the hidden states. This assumption reduces the number of parameters to be estimated in the emission matrix.

 Let's load the libraries we'll need, and set a random seed for reproducibility.
@@ -58,7 +58,7 @@
 $$
 m_i \sim \mathrm{Normal}(i, 0.5) \quad \text{where} \quad m = \{1,2,3\}
 $$

-Simply put, this says that we expect state one to emit values in a Normally distributed manner, where the mean of each state's emissions is that state's value. The variance of 0.5 helps the model converge more quickly — consider the case where we have a variance of 1 or 2. In this case, the likelihood of observing a 2 when we are in state 1 is actually quite high, as it is within a standard deviation of the true emission value. Applying the prior that we are likely to be tightly centered around the mean prevents our model from being too confused about the state that is generating our observations.
+Simply put, this says that we expect state one to emit values in a Normally distributed manner, where the mean of each state's emissions is that state's value. The variance of 0.5 helps the model converge more quickly — consider the case where we have a variance of 1 or 2. In this case, the likelihood of observing a 2 when we are in state 1 is actually quite high, as it is within a standard deviation of the true emission value. Applying the prior that we are likely to be tightly centred around the mean prevents our model from being too confused about the state that is generating our observations.

 The priors on our transition matrix are noninformative, using `T[i] ~ Dirichlet(ones(K)/K)`. The Dirichlet prior used in this way assumes that the state is likely to change to any other state with equal probability. As we'll see, this transition matrix prior will be overwritten as we observe data.
@@ -96,7 +96,7 @@
 end;
 ```

-We will use a combination of two samplers (HMC and Particle Gibbs) by passing them to the Gibbs sampler. The Gibbs sampler allows for compositional inference, where we can utilize different samplers on different parameters. (For API details of these samplers, please see [Turing.jl's API documentation](https://turinglang.org/Turing.jl/stable/api/Inference/).)
+We will use a combination of two samplers (HMC and Particle Gibbs) by passing them to the [Gibbs sampler](https://en.wikipedia.org/wiki/Gibbs_sampling). The Gibbs sampler allows for compositional inference, where different samplers can be applied to different parameters based on their properties. (For API details of these samplers, please see [Turing.jl's API documentation](https://turinglang.org/Turing.jl/stable/api/Inference/).)

 In this case, we use HMC for `m` and `T`, representing the emission and transition matrices respectively. We use the Particle Gibbs sampler for `s`, the state sequence. You may wonder why it is that we are not assigning `s` to the HMC sampler, and why it is that we need compositional Gibbs sampling at all.
diff --git a/tutorials/infinite-mixture-models/index.qmd b/tutorials/infinite-mixture-models/index.qmd
index 8b14aac49..464694389 100755
--- a/tutorials/infinite-mixture-models/index.qmd
+++ b/tutorials/infinite-mixture-models/index.qmd
@@ -12,9 +12,9 @@
 using Pkg;
 Pkg.instantiate();
 ```

-In many applications it is desirable to allow the model to adjust its complexity to the amount of data. Consider for example the task of assigning objects into clusters or groups. This task often involves the specification of the number of groups. However, often times it is not known beforehand how many groups exist. Moreover, in some applictions, e.g. modelling topics in text documents or grouping species, the number of examples per group is heavy tailed. This makes it impossible to predefine the number of groups and requiring the model to form new groups when data points from previously unseen groups are observed.
+In many applications it is desirable to allow the model to adjust its complexity to the amount of data. Consider for example the task of assigning objects into clusters or groups. This task often involves the specification of the number of groups. However, oftentimes it is not known beforehand how many groups exist. Moreover, in some applications, e.g. modelling topics in text documents or grouping species, the number of examples per group is heavy-tailed. This makes it impossible to predefine the number of groups and requires the model to form new groups when data points from previously unseen groups are observed.

-A natural approach for such applications is the use of non-parametric models. This tutorial will introduce how to use the Dirichlet process in a mixture of infinitely many Gaussians using Turing. For further information on Bayesian nonparametrics and the Dirichlet process we refer to the [introduction by Zoubin Ghahramani](http://mlg.eng.cam.ac.uk/pub/pdf/Gha12.pdf) and the book "Fundamentals of Nonparametric Bayesian Inference" by Subhashis Ghosal and Aad van der Vaart.
+A natural approach for such applications is the use of [non-parametric models](https://en.wikipedia.org/wiki/Nonparametric_statistics#Non-parametric_models), which can grow in complexity as more data are observed. This tutorial demonstrates how to use the Dirichlet process in a mixture of infinitely many Gaussians using Turing. For further information on Bayesian nonparametrics and the Dirichlet process, see the [introduction by Zoubin Ghahramani](http://mlg.eng.cam.ac.uk/pub/pdf/Gha12.pdf) and the book "Fundamentals of Nonparametric Bayesian Inference" by Subhashis Ghosal and Aad van der Vaart.

 ```{julia}
 using Turing
@@ -123,7 +123,7 @@
 $$

 Those equations show that the conditional prior for component assignments is proportional to the number of such observations, meaning that the Chinese restaurant process has a rich get richer property.

-To get a better understanding of this property, we can plot the cluster choosen by for each new observation drawn from the conditional prior.
+To get a better understanding of this property, we can plot the cluster chosen for each new observation drawn from the conditional prior.

 ```{julia}
 # Concentration parameter.
@@ -256,7 +256,7 @@ k = map(
 )

 plot(k; xlabel="Iteration", ylabel="Number of clusters", label="Chain 1")
 ```

-If we visualize the histogram of the number of clusters sampled from our posterior, we observe that the model seems to prefer 3 clusters, which is the true number of clusters. Note that the number of clusters in a Dirichlet process mixture model is not limited a priori and will grow to infinity with probability one. However, if conditioned on data the posterior will concentrate on a finite number of clusters enforcing the resulting model to have a finite amount of clusters. It is, however, not given that the posterior of a Dirichlet process Gaussian mixture model converges to the true number of clusters, given that data comes from a finite mixture model. See Jeffrey Miller and Matthew Harrison: [A simple example of Dirichlet process mixture inconsitency for the number of components](https://arxiv.org/pdf/1301.2708.pdf) for details.
+If we visualise the histogram of the number of clusters sampled from our posterior, we observe that the model seems to prefer 3 clusters, which is the true number of clusters. Note that the number of clusters in a Dirichlet process mixture model is not limited a priori and will grow to infinity with probability one. However, conditioned on data, the posterior will concentrate on a finite number of clusters, forcing the resulting model to have a finite number of clusters. It is, however, not given that the posterior of a Dirichlet process Gaussian mixture model converges to the true number of clusters, given that data comes from a finite mixture model. 
See Jeffrey Miller and Matthew Harrison: [A simple example of Dirichlet process mixture inconsistency for the number of components](https://arxiv.org/pdf/1301.2708.pdf) for details. ```{julia} histogram(k; xlabel="Number of clusters", legend=false) @@ -283,4 +283,4 @@ p(z_n = k | z_{1:n-1}) \propto \begin{cases} \end{cases} $$ -For more details see [this article](https://www.stats.ox.ac.uk/%7Eteh/research/npbayes/Teh2010a.pdf). +For more details see [this article](https://www.stats.ox.ac.uk/%7Eteh/research/npbayes/Teh2010a.pdf). \ No newline at end of file diff --git a/tutorials/multinomial-logistic-regression/index.qmd b/tutorials/multinomial-logistic-regression/index.qmd index e788475e8..3cc306a0d 100755 --- a/tutorials/multinomial-logistic-regression/index.qmd +++ b/tutorials/multinomial-logistic-regression/index.qmd @@ -12,9 +12,9 @@ using Pkg; Pkg.instantiate(); ``` -[Multinomial logistic regression](https://en.wikipedia.org/wiki/Multinomial_logistic_regression) is an extension of logistic regression. Logistic regression is used to model problems in which there are exactly two possible discrete outcomes. Multinomial logistic regression is used to model problems in which there are two or more possible discrete outcomes. +Multinomial logistic regression is an extension of logistic regression. Logistic regression is used to model problems in which there are exactly two possible discrete outcomes. Multinomial logistic regression is used to model problems in which there are two or more possible discrete outcomes. -In our example, we'll be using the iris dataset. The iris multiclass problem aims to predict the species of a flower given measurements (in centimeters) of sepal length and width and petal length and width. There are three possible species: Iris setosa, Iris versicolor, and Iris virginica. +In our example, we'll be using the iris dataset. The iris multiclass problem aims to predict the species of a flower given measurements (in centimetres) of sepal length and width and petal length and width. There are three possible species: Iris setosa, Iris versicolor, and Iris virginica. To start, let's import all the libraries we'll need. @@ -25,10 +25,10 @@ using Turing # Load RDatasets. using RDatasets -# Load StatsPlots for visualizations and diagnostics. +# Load StatsPlots for visualisations and diagnostics. using StatsPlots -# Functionality for splitting and normalizing the data. +# Functionality for splitting and normalising the data. using MLDataUtils: shuffleobs, splitobs, rescale! # We need a softmax function which is provided by NNlib. @@ -57,7 +57,7 @@ data = RDatasets.dataset("datasets", "iris"); data[rand(1:size(data, 1), 20), :] ``` -In this data set, the outcome `Species` is currently coded as a string. We convert it to a numerical value by using indices `1`, `2`, and `3` to indicate species `setosa`, `versicolor`, and `virginica`, respectively. +In this data set, the outcome `Species` is currently coded as a string. We convert it to a numerical value by using indices 1, 2, and 3 to indicate species setosa, versicolor, and virginica, respectively. ```{julia} # Recode the `Species` column. @@ -68,7 +68,7 @@ data[!, :Species_index] = indexin(data[!, :Species], species) data[rand(1:size(data, 1), 20), [:Species, :Species_index]] ``` -After we've done that tidying, it's time to split our dataset into training and testing sets, and separate the features and target from the data. 
Additionally, we must rescale our feature variables so that they are centered around zero by subtracting each column by the mean and dividing it by the standard deviation. Without this step, Turing's sampler will have a hard time finding a place to start searching for parameter estimates.
+After we've done that tidying, it's time to split our dataset into training and testing sets, and separate the features and target from the data. Additionally, we must rescale our feature variables so that they are centred around zero, by subtracting from each column its mean and dividing by its standard deviation. This standardisation improves sampler efficiency by ensuring all features are on comparable scales.

```{julia}
# Split our dataset 50%/50% into training/test sets.
@@ -84,7 +84,7 @@ test_features = Matrix(testset[!, features])
train_target = trainset[!, target]
test_target = testset[!, target]

-# Standardize the features.
+# Standardise the features.
μ, σ = rescale!(train_features; obsdim=1)
rescale!(test_features, μ, σ; obsdim=1);
```

@@ -97,7 +97,7 @@ Finally, we can define our model `logistic_regression`. It is a function that ta
- `y` is the element we want to predict;
- `σ` is the standard deviation we want to assume for our priors.

-We select the `setosa` species as the baseline class (the choice does not matter). Then we create the intercepts and vectors of coefficients for the other classes against that baseline. More concretely, we create scalar intercepts `intercept_versicolor` and `intersept_virginica` and coefficient vectors `coefficients_versicolor` and `coefficients_virginica` with four coefficients each for the features `SepalLength`, `SepalWidth`, `PetalLength` and `PetalWidth`. We assume a normal distribution with mean zero and standard deviation `σ` as prior for each scalar parameter. We want to find the posterior distribution of these, in total ten, parameters to be able to predict the species for any given set of features.
+We select the setosa species as the baseline class (the choice does not matter). Then we create the intercepts and vectors of coefficients for the other classes against that baseline. More concretely, we create scalar intercepts `intercept_versicolor` and `intersept_virginica` and coefficient vectors `coefficients_versicolor` and `coefficients_virginica`, each with four coefficients for the features `SepalLength`, `SepalWidth`, `PetalLength` and `PetalWidth`. We assume a normal distribution with mean zero and standard deviation `σ` as the prior for each scalar parameter. We want to find the posterior distribution of these ten parameters in total, to be able to predict the species for any given set of features.

```{julia}
# Bayesian multinomial logistic regression
@@ -232,4 +232,4 @@ for s in 1:3
end
```

-This tutorial has demonstrated how to use Turing to perform Bayesian multinomial logistic regression.
+This tutorial has demonstrated how to use Turing to perform Bayesian multinomial logistic regression.
\ No newline at end of file
diff --git a/tutorials/probabilistic-pca/index.qmd b/tutorials/probabilistic-pca/index.qmd
index 6f8982923..4eae66da7 100755
--- a/tutorials/probabilistic-pca/index.qmd
+++ b/tutorials/probabilistic-pca/index.qmd
@@ -21,7 +21,7 @@ For example, we have a data matrix $\mathbf{X} \in \mathbb{R}^{N \times D}$, and
The goal is to understand $\mathbf{X}$ through a lower dimensional subspace (e.g. a two-dimensional subspace for visualisation convenience) spanned by the principal components.
In order to project the original data matrix into low dimensions, we need to find the principal directions in which most of the variation of $\mathbf{X}$ lies.
-Traditionally, this is implemented via [singular value decomposition (SVD)](https://en.wikipedia.org/wiki/Singular_value_decomposition) which provides a robust and accurate computational framework for decomposing matrix into products of rotation-scaling-rotation matrices, particularly for large datasets(see an illustration [here](https://intoli.com/blog/pca-and-svd/)):
+Traditionally, this is implemented via [singular value decomposition (SVD)](https://en.wikipedia.org/wiki/Singular_value_decomposition), which provides a robust and accurate computational framework for decomposing a matrix into products of rotation-scaling-rotation matrices, particularly for large datasets (see an illustration [here](https://intoli.com/blog/pca-and-svd/)):

$$
\mathbf{X}_{N \times D} = \mathbf{U}_{N \times r} \times \boldsymbol{\Sigma}_{r \times r} \times \mathbf{V}^T_{r \times D}
$$
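To make the notation concrete, here is a small self-contained sketch (not part of the original tutorial; a random matrix stands in for real data) of computing this decomposition and the coordinates along the first two principal directions:

```{julia}
using LinearAlgebra, Statistics

X = randn(100, 4)          # stand-in N × D data matrix
Xc = X .- mean(X; dims=1)  # centre each column (feature)
U, S, V = svd(Xc)          # Xc = U * Diagonal(S) * V'
scores = Xc * V[:, 1:2]    # coordinates along the first two principal directions
```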
@@ -41,9 +41,9 @@ The percentage of variations explained can be calculated using the ratios of sin
Here we take a probabilistic perspective.
For more details and a mathematical derivation, we recommend Bishop's textbook (Christopher M. Bishop, Pattern Recognition and Machine Learning, 2006).

-The idea of proabilistic PCA is to find a latent variable $z$ that can be used to describe the hidden structure in a dataset.[^2]
+The idea of probabilistic PCA is to find a latent variable $z$ that can be used to describe the hidden structure in a dataset.[^2]
Consider a data set $\mathbf{X}_{D \times N}=\{x_i\}$ with $i=1,2,...,N$ data points, where each data point $x_i$ is $D$-dimensional (i.e. $x_i \in \mathbb{R}^D$).
-Note that, here we use the flipped version of the data matrix. We aim to represent the original $n$ dimensional vector using a lower dimensional a latent variable $z_i \in \mathcal{R}^k$.
+Note that here we use the flipped version of the data matrix. We aim to represent each original $D$-dimensional vector using a lower-dimensional latent variable $z_i \in \mathbb{R}^k$.

We first assume that each latent variable $z_i$ is normally distributed:

@@ -58,12 +58,12 @@ x_i | z_i \sim \mathcal{N}(\mathbf{W} z_i + \boldsymbol{μ}, \sigma^2 \mathbf{I}
$$

where the projection matrix $\mathbf{W}_{D \times k}$ accommodates the principal axes.
-The above formula expresses $x_i$ as a linear combination of the basis columns in the projection matrix `W`, where the combination coefficients sit in `z_i` (they are the coordinats of `x_i` in the new $k$-dimensional space.).
+The above formula expresses $x_i$ as a linear combination of the basis columns of the projection matrix $\mathbf{W}$, where the combination coefficients sit in $z_i$ (they are the coordinates of $x_i$ in the new $k$-dimensional space).
We can also express the above formula in matrix form: $\mathbf{X}_{D \times N} \approx \mathbf{W}_{D \times k} \mathbf{Z}_{k \times N}$.
We are interested in inferring $\mathbf{W}$, $μ$ and $\sigma$.

-Classical PCA is the specific case of probabilistic PCA when the covariance of the noise becomes infinitesimally small, i.e. $\sigma^2 \to 0$.
-Probabilistic PCA generalizes classical PCA, this can be seen by marginalizing out the the latent variable.[^2]
+Classical [PCA](https://en.wikipedia.org/wiki/Principal_component_analysis) is the limiting case of probabilistic PCA when the noise variance approaches zero, i.e. $\sigma^2 \to 0$.
+Probabilistic PCA generalises classical PCA; this connection can be seen by marginalising out the latent variable.[^2]
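Before building the Turing model, it can help to simulate from this generative story directly. The following is a toy sketch (the sizes `D`, `k`, and `N` are arbitrary choices, not values from the tutorial):

```{julia}
using LinearAlgebra, Random
Random.seed!(42)

D, k, N = 5, 2, 100
W = randn(D, k)                     # principal axes
μ = randn(D)                        # mean offset
σ = 0.1                             # observation noise scale
Z = randn(k, N)                     # z_i ~ N(0, I)
X = W * Z .+ μ .+ σ .* randn(D, N)  # x_i | z_i ~ N(W z_i + μ, σ² I)
```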
## The gene expression example

In the first example, we illustrate:

- how to specify the probabilistic model and
- how to perform inference on $\mathbf{W}$, $\boldsymbol{\mu}$ and $\sigma$ using MCMC.

-We use simulated gemnome data to demonstrate these.
-The simulation is inspired by biological measurement of expression of genes in cells, and each cell is characterized by different gene features.
+We use simulated genome data to demonstrate these.
+The simulation is inspired by biological measurements of gene expression in cells, where each cell is characterised by different gene features.
While the human genome is (mostly) identical between all the cells in the body, there exist interesting differences in gene expression in different human tissues and disease conditions.
One way to investigate certain diseases is to look at differences in gene expression in cells from patients and healthy controls (usually from the same tissue).

Usually, we can assume that the changes in gene expression only affect a subset of all genes (and these can be linked to diseases in some way).
One of the challenges for this kind of data is to explore the underlying structure, e.g. to make the connection between a certain state (healthy/disease) and gene expression.
-This becomes difficult when the dimensions is very large (up to 20000 genes across 1000s of cells). So in order to find structure in this data, it is useful to project the data into a lower dimensional space.
+This becomes difficult when the dimension is very large (up to 20000 genes across 1000s of cells). So in order to find structure in this data, it is useful to project the data into a lower dimensional space.

Regardless of the biological background, the more abstract problem formulation is to project the data living in a high-dimensional space onto a representation in a lower-dimensional space where most of the variation is concentrated in the first few dimensions.
We use PCA to explore the underlying structure or patterns which may not necessarily be obvious from looking at the raw data itself.

@@ -93,7 +93,7 @@ using Turing
using Mooncake
using LinearAlgebra, FillArrays

-# Packages for visualization
+# Packages for visualisation
using DataFrames, StatsPlots, Measures

# Set a seed for reproducibility.
@@ -109,7 +109,7 @@ You can install them via `using Pkg; Pkg.add("package_name")`.

## Package usages:
We use `DataFrames` for instantiating matrices, `LinearAlgebra` and `FillArrays` to perform matrix operations;
`Turing` for model specification and MCMC sampling, `Mooncake` for automatic differentiation when sampling.
-`StatsPlots` for visualising the resutls. `, Measures` for setting plot margin units.
+`StatsPlots` is used for visualising the results, and `Measures` for setting plot margin units.
As all examples involve sampling, for reproducibility we set a fixed seed using the `Random` standard library.
:::

@@ -118,9 +118,9 @@ Here, we simulate the biological gene expression problem described earlier.
We simulate 60 cells, each with 9 gene features.
This is a simplified problem with only a few cells and genes for demonstration purposes, which is not comparable to the complexity of real-life data (e.g. thousands of features for each individual).
-Even so, spotting the structures or patterns in a 9-feature space would be a challenging task; it would be nice to reduce the dimentionality using p-PCA.
+Even so, spotting the structures or patterns in a 9-feature space would be a challenging task; it would be nice to reduce the dimensionality using p-PCA.

-By design, we mannually divide the 60 cells into two groups. the first 3 gene features of the first 30 cells have mean 10, while those of the last 30 cells have mean 10.
+By design, we manually divide the 60 cells into two groups: the first 3 gene features of the first 30 cells have their means shifted to 10, while the last 3 gene features of the remaining 30 cells have their means shifted to 10.
These two groups of cells differ in the expression of genes.

```{julia}
@@ -134,7 +134,7 @@ mat_exp[1:(n_genes ÷ 3), 1:(n_cells ÷ 2)] .+= 10
mat_exp[(2 * (n_genes ÷ 3) + 1):end, (n_cells ÷ 2 + 1):end] .+= 10
```

-To visualize the $(D=9) \times (N=60)$ data matrix `mat_exp`, we use the `heatmap` plot.
+To visualise the $(D=9) \times (N=60)$ data matrix `mat_exp`, we use the `heatmap` plot.

```{julia}
heatmap(
@@ -157,7 +157,7 @@ Note that:

#### Step 3: Create the pPCA model

Here we construct the probabilistic model `pPCA()`.
-As per the p-PCA formula, we think of each row (i.e. each gene feature) following a $N=60$ dimensional multivariate normal distribution centered around the corresponding row of $\mathbf{W}_{D \times k} \times \mathbf{Z}_{k \times N} + \boldsymbol{\mu}_{D \times N}$.
+As per the p-PCA formula, we think of each row (i.e. each gene feature) as following an $N=60$-dimensional multivariate normal distribution centred around the corresponding row of $\mathbf{W}_{D \times k} \times \mathbf{Z}_{k \times N} + \boldsymbol{\mu}_{D \times N}$.

```{julia}
@model function pPCA(X::AbstractMatrix{<:Real}, k::Int)
@@ -179,7 +179,7 @@ end;

The function `pPCA()` accepts:

-  1. an data array $\mathbf{X}$ (with no. of instances x dimension no. of features, NB: it is a transpose of the original data matrix);
+  1. a data array $\mathbf{X}$ (of size number of instances × number of features; NB: it is a transpose of the original data matrix);
  2. an integer $k$ which indicates the dimension of the latent space (the space the original feature matrix is projected onto).

Specifically:
@@ -264,7 +264,7 @@ This is satisfying as we have just projected the original 9-dimensional space on
This is the desirable property of PCA: it picks out the principal axes along which most of the (original) data variation clusters, and removes those less relevant.
If we choose the reduced space dimension $k$ to be exactly $D$ (the original data dimension), we would recover exactly the same original data matrix `mat_exp`, i.e. all information will be preserved.

-Now we have represented the original high-dimensional data in two dimensions, without lossing the key information about the two groups of cells in the input data.
+Now we have represented the original high-dimensional data in two dimensions, without losing the key information about the two groups of cells in the input data.
Finally, the benefit of performing PCA is that we can analyse and visualise the dimension-reduced data in the projected, low-dimensional space.
We save the dimension-reduced matrix $\mathbf{Z}$ as a `DataFrame`, rename the columns, and visualise the first two dimensions.

@@ -288,7 +288,7 @@ In the case of PCA, there exist a lot of heuristics to make that choice.
For example, we can tune the number of principal components using empirical methods such as cross-validation, based on criteria such as the MSE between the posterior-predicted (e.g. mean prediction) data matrix and the original data matrix, or the percentage of variation explained [^3].
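The latter heuristic is sketched below (on a random stand-in matrix, not the tutorial's data): pick the smallest $k$ whose leading singular values explain a chosen fraction of the total variance.

```{julia}
using LinearAlgebra, Statistics

A = randn(9, 60)                           # stand-in D × N data matrix
Ac = A .- mean(A; dims=2)                  # centre each feature (row)
s = svdvals(Ac)
explained = cumsum(s .^ 2) ./ sum(s .^ 2)  # cumulative fraction of variance explained
k = findfirst(>=(0.9), explained)          # smallest k explaining at least 90%
```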
For p-PCA, this can be done in an elegant and principled way, using a technique called *Automatic Relevance Determination* (ARD).
-ARD can help pick the correct number of principal directions by regularizing the solution space using a parameterized, data-dependent prior distribution that effectively prunes away redundant or superfluous features [^4].
+ARD can help pick the correct number of principal directions by regularising the solution space using a parameterised, data-dependent prior distribution that effectively prunes away redundant or superfluous features [^4].
Essentially, we are using a specific prior over the factor loadings $\mathbf{W}$ that allows us to prune away dimensions in the latent space. The prior is determined by a precision hyperparameter $\alpha$. Here, smaller values of $\alpha$ correspond to more important components.
You can find more details about this in, for example, Bishop (2006) [^5].

```{julia}
@@ -312,9 +312,9 @@ end;
```

Instead of drawing samples of each entry in $\mathbf{W}$ from a standard normal, this time we repeatedly draw $D$ samples from the $D$-dimensional MND, forming a $D \times D$ matrix $\mathbf{W}$.
-This matrix is a function of $\alpha$ as the samples are drawn from the MND parameterized by $\alpha$.
+This matrix is a function of $\alpha$, as the samples are drawn from the MND parameterised by $\alpha$.
We also introduce a hyper-parameter $\tau$ which is the precision in the sampling distribution.
-We also re-paramterise the sampling distribution, i.e. each dimension across all instances is a 60-dimensional multivariate normal distribution. Re-parameterisation can sometimes accelrate the sampling process.
+We also re-parametrise the sampling distribution, i.e. each dimension across all instances is a 60-dimensional multivariate normal distribution. Re-parametrisation can sometimes accelerate the sampling process.

We instantiate the model and ask Turing to sample from it using the NUTS sampler.
The sample trajectories of $\alpha$ are plotted using the `plot` function from the package `StatsPlots`.

```{julia}
@@ -327,7 +327,7 @@ plot(group(chain_ppcaARD, :α); margin=6.0mm)
```

Again, we do some inference diagnostics.
Here we look at the convergence of the chains for the $α$ parameter.
This parameter determines the relevance of individual components.
-We see that the chains have converged and the posterior of the $\alpha$ parameters is centered around much smaller values in two instances.
+We see that the chains have converged and the posterior of the $\alpha$ parameters is centred around much smaller values in two instances.
In the following, we will use the mean of the small values to select the *relevant* dimensions (remember that smaller values of $\alpha$ correspond to more important components).
We can clearly see from the values of $\alpha$ that there should be two dimensions (corresponding to $\bar{\alpha}_3=\bar{\alpha}_5≈0.05$) for this dataset.

@@ -380,11 +380,11 @@ When you are in doubt about the number of dimensions to project onto, ARD might

## Final comments.

p-PCA is a linear map that transforms the data between the original and projected spaces.
-It can also thought as a matrix factorisation method, in which $\mathbf{X}=(\mathbf{W} \times \mathbf{Z})^T$. The projection matrix can be understood as a new basis in the projected space, and $\mathbf{Z}$ are the new coordinates.
+It can also be thought of as a matrix factorisation method, in which $\mathbf{X}=(\mathbf{W} \times \mathbf{Z})^T$. The projection matrix can be understood as a new basis in the projected space, and $\mathbf{Z}$ contains the new coordinates.
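As a quick sanity check of this factorisation view, one can compare a reconstruction against the original matrix. A sketch, assuming hypothetical posterior-mean estimates `W_est`, `Z_est`, and `μ_est` have been extracted from the chain (these names are not defined in the tutorial):

```{julia}
#| eval: false
using LinearAlgebra
X_hat = W_est * Z_est .+ μ_est                    # reconstructed D × N matrix
rel_err = norm(mat_exp .- X_hat) / norm(mat_exp)  # relative reconstruction error
```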
[^1]: Gilbert Strang, *Introduction to Linear Algebra*, 5th Ed., Wellesley-Cambridge Press, 2016.
[^2]: Probabilistic PCA by TensorFlow, "https://www.tensorflow.org/probability/examples/Probabilistic_PCA".
[^3]: Gareth M. James, Daniela Witten, Trevor Hastie, Robert Tibshirani, *An Introduction to Statistical Learning*, Springer, 2013.
[^4]: David Wipf, Srikantan Nagarajan, *A New View of Automatic Relevance Determination*, NIPS 2007.
-[^5]: Christopher Bishop, *Pattern Recognition and Machine Learning*, Springer, 2006.
+[^5]: Christopher Bishop, *Pattern Recognition and Machine Learning*, Springer, 2006.
\ No newline at end of file
diff --git a/tutorials/variational-inference/index.qmd b/tutorials/variational-inference/index.qmd
index 7d89ba904..940ea30ca 100755
--- a/tutorials/variational-inference/index.qmd
+++ b/tutorials/variational-inference/index.qmd
@@ -12,7 +12,7 @@ using Pkg;
Pkg.instantiate();
```

-This post will look at **variational inference (VI)**, an optimization approach to _approximate_ Bayesian inference, and how to use it in Turing.jl as an alternative to other approaches such as MCMC.
+This post will look at **variational inference (VI)**, an optimisation approach to _approximate_ Bayesian inference, and how to use it in Turing.jl as an alternative to other approaches such as MCMC.
This post will focus on the usage of VI in Turing rather than the principles and theory underlying VI.
If you are interested in understanding the mathematics, you can check out [our write-up]({{}}) or any other resource online (there are a lot of great ones).

@@ -27,12 +27,12 @@ q_init = q_fullrank_gaussian(m) # initial variational approximation
vi(m, q_init, 1000) # perform VI with the default algorithm on `m` for 1000 iterations
```
Thus, it's no more work than standard MCMC sampling in Turing.
-The default algorithm uses stochastic gradient descent to minimize the (exclusive) KL divergence.
-This is commonly referred to as *automatic differentiation variational inference*[^KTRGB2017], *stochastic gradient VI*[^TL2014], and *black-box variational inference*[^RGB2014] with the reparameterization gradient[^KW2014][^RMW2014][^TL2014].
+The default algorithm uses stochastic gradient descent to minimise the (exclusive) [KL divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence).
+This approach is commonly referred to as *automatic differentiation variational inference* (ADVI)[^KTRGB2017], *stochastic gradient VI*[^TL2014], and *black-box variational inference*[^RGB2014] with the reparameterization gradient[^KW2014][^RMW2014][^TL2014].
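The returned approximation behaves like an ordinary distribution object. For example (a sketch, binding the four values that `vi` returns, as documented further below):

```{julia}
#| eval: false
q_avg, q_last, info, state = vi(m, q_init, 1000)
samples = rand(q_avg, 1_000)  # draw 1,000 samples from the variational approximation
```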
To get a bit more into what we can do with VI, let's look at a more concrete example. We will reproduce the [tutorial on Bayesian linear regression]({{}}) using VI instead of MCMC.
-After that, we will discuss how to customize the behavior of `vi` for more advanced usage.
+After that, we will discuss how to customise the behaviour of `vi` for more advanced usage.

Let's first import the relevant packages:

@@ -75,11 +75,11 @@ function split_data(df, at=0.70)
end

# A handy helper function to rescale our dataset.
-function standardize(x)
+function standardise(x)
    return (x .- mean(x; dims=1)) ./ std(x; dims=1)
end
-function standardize(x, orig)
+function standardise(x, orig)
    return (x .- mean(orig; dims=1)) ./ std(orig; dims=1)
end

@@ -101,9 +101,9 @@ select!(data, Not(:Model))
train, test = split_data(data, 0.7)
train_unstandardized = copy(train)

-# Standardize both datasets.
-std_train = standardize(Matrix(train))
-std_test = standardize(Matrix(test), Matrix(train))
+# Standardise both datasets.
+std_train = standardise(Matrix(train))
+std_test = standardise(Matrix(test), Matrix(train))

# Save dataframe versions of our dataset.
train_cut = DataFrame(std_train, names(data))
@@ -147,18 +147,18 @@ m = linear_regression(train, train_label, n_obs, n_vars);

## Basic Usage
To run VI, we must first set a *variational family*. For instance, the most commonly used family is the mean-field Gaussian family.
-For this, Turing provides functions that automatically construct the initialization corresponding to the model `m`:
+For this, Turing provides functions that automatically construct the initialisation corresponding to the model `m`:
```{julia}
q_init = q_meanfield_gaussian(m);
```
-`vi` will automatically recognize the variational family through the type of `q_init`.
+`vi` will automatically recognise the variational family through the type of `q_init`.
Here is the detailed documentation for the constructor:
```{julia}
@doc(Variational.q_meanfield_gaussian)
```
-As we can see, the precise initialization can be customized through the keyword arguments.
+As we can see, the precise initialisation can be customised through the keyword arguments.
Let's run VI with the default setting:

```{julia}
n_iters = 1000
q_avg, q_last, info, state = vi(m, q_init, n_iters; show_progress=false);
```
The default setting uses the `AdvancedVI.RepGradELBO` objective, which corresponds to a variant of what is known as *automatic differentiation VI*[^KTRGB2017] or *stochastic gradient VI*[^TL2014] or *black-box VI*[^RGB2014] with the reparameterization gradient[^KW2014][^RMW2014][^TL2014].
-The default optimizer we use is `AdvancedVI.DoWG`[^KMJ2023] combined with a proximal operator.
+The default optimiser we use is `AdvancedVI.DoWG`[^KMJ2023] combined with a proximal operator.
(The use of proximal operators with VI on a location-scale family is discussed in detail by J. Domke[^D2020][^DGG2023] and others[^KOWMG2023].)
We will take a deeper look into the returned values and the keyword arguments in the following subsections. First, here is the full documentation for `vi`:

```{julia}
@@ -176,7 +176,7 @@ First, here is the full documentation for `vi`:
```

## Values Returned by `vi`
-The main output of the algorithm is `q_avg`, the average of the parameters generated by the optimization algorithm.
+The main output of the algorithm is `q_avg`, the average of the parameters generated by the optimisation algorithm.
For computing `q_avg`, the default setting uses what is known as polynomial averaging[^SZ2013].
Usually, `q_avg` will perform better than the last-iterate `q_last`.
For instance, we can compare the ELBO of the two:

@@ -188,7 +188,7 @@ For instance, we can compare the ELBO of the two:
```

We can see that `ELBO_q_avg` is slightly better.
-Now, `info` contains information generated during optimization that could be useful for diagnostics.
+Now, `info` contains information generated during optimisation that could be useful for diagnostics.
For the default setting, which is `RepGradELBO`, it contains the ELBO estimated at each step, which can be plotted as follows:

```{julia}
@@ -199,7 +199,7 @@ Furthermore, at each step, the ELBO is evaluated on `q_last`, not `q_avg`, which
To obtain more accurate ELBO estimates evaluated on `q_avg`, we have to define a custom callback function.

## Custom Callback Functions
-To inspect the progress of optimization in more detail, one can define a custom callback function.
+To inspect the progress of optimisation in more detail, one can define a custom callback function.
For example, the following callback function estimates the ELBO on `q_avg` every 10 steps with a larger number of samples:

```{julia}
@@ -232,7 +232,7 @@ We can see that the ELBO values are less noisy and progress more smoothly due to

## Using Different Optimisers
The default optimiser we use is a proximal variant of DoWG[^KMJ2023]. For Gaussian variational families, this works well as a default option.
-Sometimes, the step size of `AdvancedVI.DoWG` could be too large, resulting in unstable behavior.
+Sometimes, the step size of `AdvancedVI.DoWG` could be too large, resulting in unstable behaviour.
(In this case, we recommend trying `AdvancedVI.DoG`[^IHC2023].)
Or, for whatever reason, it might be desirable to use a different optimiser.
Our implementation supports any optimiser that implements the [Optimisers.jl](https://fluxml.ai/Optimisers.jl/stable/) interface.

@@ -272,7 +272,7 @@ q_init_fr = q_fullrank_gaussian(m);

The term *full-rank* might seem a bit peculiar since covariance matrices are always full-rank.
This term, however, traditionally comes from the fact that full-rank families use full-rank factors in addition to the diagonal of the covariance.

-In contrast to the mean-field family, the full-rank family will often result in more computation per optimization step and slower convergence, especially in high dimensions:
+In contrast to the mean-field family, the full-rank family will often result in more computation per optimisation step and slower convergence, especially in high dimensions:

```{julia}
q_fr, _, info_fr, _ = vi(m, q_init_fr, n_iters; show_progress=false, callback)
@@ -283,10 +283,10 @@ Plots.plot!(elbo_fr, xlabel="Iterations", ylabel="ELBO", label="Full-Rank", ylim
```

However, we can see that the full-rank families achieve a higher ELBO in the end.
Due to the relationship between the ELBO and the Kullback-Leibler divergence, this indicates that the full-rank covariance is much more accurate.
-This trade-off between statistical accuracy and optimization speed is often referred to as the *statistical-computational trade-off*.
+This trade-off between statistical accuracy and optimisation speed is often referred to as the *statistical-computational trade-off*.
The fact that we can control this trade-off through the choice of variational family is a strength, rather than a limitation, of variational inference.

-We can also visualize the covariance matrix.
+We can also visualise the covariance matrix.
```{julia}
heatmap(cov(rand(q_fr, 100_000), dims=2))
```
@@ -330,7 +330,7 @@ avg[union(sym2range[:intercept]...)]

avg[union(sym2range[:coefficients]...)]
```

-For further convenience, we can wrap the samples into a `Chains` object to summarize the results.
+For further convenience, we can wrap the samples into a `Chains` object to summarise the results.
```{julia}
varinf = Turing.DynamicPPL.VarInfo(m)
vns_and_values = Turing.DynamicPPL.varname_and_value_leaves(Turing.DynamicPPL.values_as(varinf, OrderedDict))
@@ -512,11 +512,11 @@ Also, the coverage of full-rank VI and MCMC is much better the crude mean-field
[^D2020]: Domke, J. (2020). Provable smoothness guarantees for black-box variational inference. In *Proceedings of the International Conference on Machine Learning*. PMLR.
[^DGG2023]: Domke, J., Gower, R., & Garrigos, G. (2023). Provable convergence guarantees for black-box variational inference. In *Advances in Neural Information Processing Systems*, 36.
[^IHC2023]: Ivgi, M., Hinder, O., & Carmon, Y. (2023). DoG is SGD’s best friend: A parameter-free dynamic step size schedule. In *Proceedings of the International Conference on Machine Learning*. PMLR.
-[^KB2014]: Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In *Proceedings of the International Conference on Learning Representations*.
+[^KB2014]: Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In *Proceedings of the International Conference on Learning Representations*.
[^KOWMG2023]: Kim, K., Oh, J., Wu, K., Ma, Y., & Gardner, J. (2023). On the convergence of black-box variational inference. In *Advances in Neural Information Processing Systems*, 36.
[^KTRGB2017]: Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., & Blei, D. M. (2017). Automatic differentiation variational inference. *Journal of Machine Learning Research*, 18(14).
[^KW2014]: Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. In *Proceedings of the International Conference on Learning Representations.*
[^RGB2014]: Ranganath, R., Gerrish, S., & Blei, D. (2014). Black box variational inference. In *Proceedings of the International Conference on Artificial Intelligence and Statistics*. PMLR.
[^RMW2014]: Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In *Proceedings of the International Conference on Machine Learning*. PMLR.
-[^SZ2013]: Shamir, O., & Zhang, T. (2013). Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In *Proceedings of the International Conference on Machine Learning.* PMLR.
+[^SZ2013]: Shamir, O., & Zhang, T. (2013). Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In *Proceedings of the International Conference on Machine Learning.* PMLR.
[^TL2014]: Titsias, M., & Lázaro-Gredilla, M. (2014). Doubly stochastic variational Bayes for non-conjugate inference. In *Proceedings of the International Conference on Machine Learning*. PMLR.
diff --git a/usage/automatic-differentiation/index.qmd b/usage/automatic-differentiation/index.qmd
index 5ae168473..7fce899db 100755
--- a/usage/automatic-differentiation/index.qmd
+++ b/usage/automatic-differentiation/index.qmd
@@ -16,7 +16,7 @@ Pkg.instantiate();

Automatic differentiation (AD) is a technique used in Turing.jl to evaluate the gradient of a function at a given set of arguments.
In the context of Turing.jl, the function being differentiated is the log probability density of a model, and the arguments are the parameters of the model (i.e. the values of the random variables).
-The gradient of the log probability density is used by various algorithms in Turing.jl, such as HMC (including NUTS), mode estimation (which uses gradient-based optimization), and variational inference.
+The gradient of the log probability density is used by various algorithms in Turing.jl, such as HMC (including NUTS), mode estimation (which uses gradient-based optimisation), and variational inference.

The Julia ecosystem has a number of AD libraries. You can switch between these using the unified [ADTypes.jl](https://github.com/SciML/ADTypes.jl/) interface, which, for a given AD backend, provides types such as `AutoBackend` (see [the documentation](https://docs.sciml.ai/ADTypes/stable/) for more details).
diff --git a/usage/custom-distribution/index.qmd b/usage/custom-distribution/index.qmd
index f28eab9b0..1b26170d8 100755
--- a/usage/custom-distribution/index.qmd
+++ b/usage/custom-distribution/index.qmd
@@ -13,10 +13,9 @@ using Pkg;
Pkg.instantiate();
```

-`Turing.jl` supports the use of distributions from the Distributions.jl package.
-By extension, it also supports the use of customized distributions by defining them as subtypes of `Distribution` type of the Distributions.jl package, as well as corresponding functions.
+`Turing.jl` supports the use of distributions from the Distributions.jl package. By extension, it also supports the use of customised distributions by defining them as subtypes of the `Distribution` type in the Distributions.jl package, as well as corresponding functions.

-This page shows a workflow of how to define a customized distribution, using our own implementation of a simple `Uniform` distribution as a simple example.
+This page shows a workflow of how to define a customised distribution, using our own implementation of a simple `Uniform` distribution as an example.

```{julia}
#| output: false
@@ -49,20 +48,15 @@ In most cases, it may be required to define some helper functions.

### Domain Transformation

-Certain samplers, such as `HMC`, require the domain of the priors to be unbounded.
-Therefore, to use our `CustomUniform` as a prior in a model we also need to define how to transform samples from `[0, 1]` to `ℝ`.
-To do this, we need to define the corresponding `Bijector` from `Bijectors.jl`, which is what `Turing.jl` uses internally to deal with constrained distributions.
+Certain samplers, such as `HMC`, require the domain of the priors to be unbounded. Therefore, to use our `CustomUniform` as a prior in a model, we also need to define how to transform samples from [0, 1] to ℝ. To do this, we need to define the corresponding `Bijector` from `Bijectors.jl`, which is what `Turing.jl` uses internally to deal with constrained distributions.

-To transform from `[0, 1]` to `ℝ` we can use the `Logit` bijector:
+To transform from [0, 1] to ℝ, we can use the `Logit` bijector:

```{julia}
Bijectors.bijector(d::CustomUniform) = Logit(0.0, 1.0)
```

-In the present example, `CustomUniform` is a subtype of `ContinuousUnivariateDistribution`.
-The procedure for subtypes of `ContinuousMultivariateDistribution` and `ContinuousMatrixDistribution` is exactly the same.
-For example, `Wishart` defines a distribution over positive-definite matrices and so `bijector` returns a `PDBijector` when called with a `Wishart` distribution as an argument.
-For discrete distributions, there is no need to define a bijector; the `Identity` bijector is used by default.
+In the present example, `CustomUniform` is a subtype of `ContinuousUnivariateDistribution`. The procedure for subtypes of `ContinuousMultivariateDistribution` and `ContinuousMatrixDistribution` is exactly the same.
For example, `Wishart` defines a distribution over positive-definite matrices, and so `bijector` returns a `PDBijector` when called with a `Wishart` distribution as an argument. For discrete distributions, there is no need to define a bijector; the `Identity` bijector is used by default.

As an alternative to the above, for `UnivariateDistribution` we could define the `minimum` and `maximum` of the distribution:

```{julia}
@@ -71,7 +65,7 @@ Distributions.minimum(d::CustomUniform) = 0.0
Distributions.maximum(d::CustomUniform) = 1.0
```

-and `Bijectors.jl` will return a default `Bijector` called `TruncatedBijector` which makes use of `minimum` and `maximum` derive the correct transformation.
+and `Bijectors.jl` will return a default `Bijector` called `TruncatedBijector`, which makes use of `minimum` and `maximum` to derive the correct transformation.

Internally, Turing basically does the following when it needs to convert a constrained distribution to an unconstrained distribution, e.g. when sampling using `HMC`:

@@ -82,8 +76,7 @@ transformed_dist = transformed(dist, b) # results in distribution with transform
```

and then we can call `rand` and `logpdf` as usual, where
-
 - `rand(transformed_dist)` returns a sample in the unconstrained space, and
 - `logpdf(transformed_dist, y)` returns the log density of the original distribution, but with `y` living in the unconstrained space.

-To read more about Bijectors.jl, check out [its documentation](https://turinglang.org/Bijectors.jl/stable/).
+To read more about Bijectors.jl, check out [its documentation](https://turinglang.org/Bijectors.jl/stable/).
\ No newline at end of file
diff --git a/usage/external-samplers/index.qmd b/usage/external-samplers/index.qmd
index 3e278b40e..af742a0cc 100755
--- a/usage/external-samplers/index.qmd
+++ b/usage/external-samplers/index.qmd
@@ -16,7 +16,7 @@ Pkg.instantiate();

`Turing` provides several wrapped samplers from external sampling libraries, e.g., HMC samplers from `AdvancedHMC`.
These wrappers allow new users to seamlessly sample statistical models without leaving `Turing`.
-However, these wrappers might only sometimes be complete, missing some functionality from the wrapped sampling library.
+However, these wrappers might not always be complete, missing some functionality from the wrapped sampling library.
Moreover, users might want to use samplers currently not wrapped within `Turing`.

For these reasons, `Turing` also makes running external samplers on Turing models easy without any necessary modifications or wrapping!
@@ -44,7 +44,7 @@ model = funnel() | (; x);

Users can use any sampling algorithm to sample this model if it follows the `AbstractMCMC` API.
Before discussing how this is done in practice, it is helpful to give a high-level description of the process.
Imagine that we created an instance of an external sampler that we will call `spl` such that `typeof(spl)<:AbstractMCMC.AbstractSampler`.
-In order to avoid type ambiguity within Turing, at the moment it is necessary to declare `spl` as an external sampler to Turing `espl = externalsampler(spl)`, where `externalsampler(s::AbstractMCMC.AbstractSampler)` is a Turing function that types our external sampler adequately.
+In order to avoid type ambiguity within Turing, at the moment, it is necessary to declare `spl` as an external sampler to Turing via `espl = externalsampler(spl)`, where `externalsampler(s::AbstractMCMC.AbstractSampler)` is a Turing function that types our external sampler adequately.
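Schematically, the pattern looks like this (a sketch; `spl` stands for any `AbstractMCMC.AbstractSampler` instance you have constructed):

```{julia}
#| eval: false
espl = externalsampler(spl)         # wrap the external sampler for Turing
chain = sample(model, espl, 1_000)  # then sample as usual
```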
An excellent place to start showing how this is done in practice is the sampling library `AdvancedMH` ([`AdvancedMH`'s GitHub](https://github.com/TuringLang/AdvancedMH.jl)) for Metropolis-Hastings (MH) methods.
Let's say we want to use a random walk Metropolis-Hastings sampler without specifying the proposal distributions.

@@ -76,7 +76,7 @@ However, the native HMC sampler within Turing only allows the user to specify the

Thankfully, we can use Turing's support for external samplers to define an HMC sampler with a custom mass matrix in `AdvancedHMC` and then use it to sample our Turing model.

We can use the library `Pathfinder`[^2] ([`Pathfinder`'s GitHub](https://github.com/mlcolab/Pathfinder.jl)) to construct our estimate of the mass matrix.
-`Pathfinder` is a variational inference algorithm that first finds the maximum a posteriori (MAP) estimate of a target posterior distribution and then uses the trace of the optimization to construct a sequence of multivariate normal approximations to the target distribution.
+`Pathfinder` is a variational inference algorithm that first finds the maximum a posteriori (MAP) estimate of a target posterior distribution and then uses the trace of the optimisation to construct a sequence of multivariate normal approximations to the target distribution.
In this process, `Pathfinder` computes an estimate of the mass matrix that the user can access.

You can see an example of how to use `Pathfinder` with Turing in [`Pathfinder`'s docs](https://mlcolab.github.io/Pathfinder.jl/stable/examples/turing/).

@@ -124,7 +124,7 @@ Thus, in your external sampler, you can access the inner object with `model.logd

As shown above, there must be two `step` methods:

 - A method that performs the first step, carrying out any initialisation it needs; and
-  - A method that performs the following steps and takes an extra input, `state`, which carries the initialization information.
+  - A method that performs the following steps and takes an extra input, `state`, which carries the initialisation information.

The output of both of these methods must be a tuple containing:

 - a 'transition', which is essentially the 'visible output' of the sampler: this object is later used to construct an `MCMCChains.Chains`;
@@ -146,4 +146,4 @@ In general, we recommend that the `AbstractMCMC` interface is implemented direct
However, any DynamicPPL- or Turing-specific functionality is best implemented in a `MySamplerTuringExt` extension.
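To make the shape of this interface concrete, here is a minimal sketch of the two `step` methods for a toy random-walk Metropolis-Hastings kernel. All names here are hypothetical, the transition returned is a bare parameter vector, and extra glue may be needed before Turing can assemble an `MCMCChains.Chains` from it:

```{julia}
#| eval: false
using AbstractMCMC, LogDensityProblems, Random

struct ToyRWMH <: AbstractMCMC.AbstractSampler
    σ::Float64  # proposal standard deviation
end

struct ToyRWMHState
    θ::Vector{Float64}
    logp::Float64
end

# First step: no `state` argument; draw an arbitrary starting point.
function AbstractMCMC.step(
    rng::Random.AbstractRNG, model::AbstractMCMC.LogDensityModel, spl::ToyRWMH; kwargs...
)
    ℓ = model.logdensity
    θ = randn(rng, LogDensityProblems.dimension(ℓ))
    state = ToyRWMHState(θ, LogDensityProblems.logdensity(ℓ, θ))
    return θ, state  # (transition, state)
end

# Subsequent steps: take the previous `state`, propose, and accept or reject.
function AbstractMCMC.step(
    rng::Random.AbstractRNG, model::AbstractMCMC.LogDensityModel, spl::ToyRWMH,
    state::ToyRWMHState; kwargs...
)
    θ′ = state.θ .+ spl.σ .* randn(rng, length(state.θ))
    logp′ = LogDensityProblems.logdensity(model.logdensity, θ′)
    if log(rand(rng)) < logp′ - state.logp  # MH acceptance for a symmetric proposal
        state = ToyRWMHState(θ′, logp′)
    end
    return state.θ, state  # (transition, state)
end
```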
[^1]: Xu et al., [AdvancedHMC.jl: A robust, modular and efficient implementation of advanced HMC algorithms](http://proceedings.mlr.press/v118/xu20a/xu20a.pdf), 2019
-[^2]: Zhang et al., [Pathfinder: Parallel quasi-Newton variational inference](https://arxiv.org/abs/2108.03782), 2021
+[^2]: Zhang et al., [Pathfinder: Parallel quasi-Newton variational inference](https://arxiv.org/abs/2108.03782), 2021
\ No newline at end of file
diff --git a/usage/mode-estimation/index.qmd b/usage/mode-estimation/index.qmd
index a592a83dd..f888807a9 100755
--- a/usage/mode-estimation/index.qmd
+++ b/usage/mode-estimation/index.qmd
@@ -12,7 +12,7 @@ using Pkg;
Pkg.instantiate();
```

-After defining a statistical model, in addition to sampling from its distributions, one may be interested in finding the parameter values that maximise for instance the posterior distribution density function or the likelihood. This is called mode estimation. Turing provides support for two mode estimation techniques, [maximum likelihood estimation](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation) (MLE) and [maximum a posterior](https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation) (MAP) estimation.
+After defining a statistical model, in addition to sampling from its distributions, one may be interested in finding the parameter values that maximise, for instance, the posterior density function or the likelihood. This is called mode estimation. Turing provides support for two mode estimation techniques, [maximum likelihood estimation](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation) (MLE) and [maximum a posteriori](https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation) (MAP) estimation.

To demonstrate mode estimation, let us load Turing and declare a model:

@@ -37,7 +37,7 @@ data = [1.5, 2.0]
model = gdemo(data)
```

-Finding the maximum aposteriori or maximum likelihood parameters is as simple as
+Finding the maximum a posteriori or maximum likelihood parameters is as simple as

```{julia}
# Generate a MLE estimate.
@@ -56,7 +56,7 @@ The estimates are returned as instances of the `ModeResult` type. It has the fie

## Controlling the optimisation process

-Under the hood `maximum_likelihood` and `maximum_a_posteriori` use the [Optimization.jl](https://github.com/SciML/Optimization.jl) package, which provides a unified interface to many other optimisation packages. By default Turing typically uses the [LBFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS) method from [Optim.jl](https://github.com/JuliaNLSolvers/Optim.jl) to find the mode estimate, but we can easily change that:
+Under the hood, `maximum_likelihood` and `maximum_a_posteriori` use the [Optimization.jl](https://github.com/SciML/Optimization.jl) package, which provides a unified interface to many other optimisation packages. By default Turing typically uses the [LBFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS) method from [Optim.jl](https://github.com/JuliaNLSolvers/Optim.jl) to find the mode estimate, but we can easily change that:

```{julia}
using OptimizationOptimJL: NelderMead
@@ -66,7 +66,7 @@ using OptimizationNLopt: NLopt.LD_TNEWTON_PRECOND_RESTART
@show maximum_likelihood(model, LD_TNEWTON_PRECOND_RESTART());
```

-The above are just two examples, Optimization.jl supports [many more](https://docs.sciml.ai/Optimization/stable/).
+The above are just two examples; Optimization.jl supports [many more](https://docs.sciml.ai/Optimization/stable/).

We can also help the optimisation by giving it a starting point we know is close to the final solution, or by specifying an automatic differentiation method

@@ -80,13 +80,13 @@ maximum_likelihood(
When providing values to arguments like `initial_params`, the parameters are typically specified in the order in which they appear in the code of the model, so in this case first `s²` then `m`. More precisely, it is the order returned by `Turing.Inference.getparams(model, DynamicPPL.VarInfo(model))`.
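For instance (a sketch; the starting values below are arbitrary guesses, not estimates from this page):

```{julia}
#| eval: false
# Start the optimisation at s² = 1.0 and m = 2.0 (s² first, then m).
maximum_likelihood(model; initial_params=[1.0, 2.0])
```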
-We can also do constrained optimisation, by providing either intervals within which the parameters must stay, or costraint functions that they need to respect.
+We can also do constrained optimisation, by providing either intervals within which the parameters must stay, or constraint functions that they need to respect. For instance, here's how one can find the MLE with the constraint that the variance must be less than 0.01 and the mean must be between -1 and 1:

```{julia}
maximum_likelihood(model; lb=[0.0, -1.0], ub=[0.01, 1.0])
```

-The arguments for lower (`lb`) and upper (`ub`) bounds follow the arguments of `Optimization.OptimizationProblem`, as do other parameters for providing [constraints](https://docs.sciml.ai/Optimization/stable/tutorials/constraints/), such as `cons`. Any extraneous keyword arguments given to `maximum_likelihood` or `maximum_a_posteriori` are passed to `Optimization.solve`. Some often useful ones are `maxiters` for controlling the maximum number of iterations and `abstol` and `reltol` for the absolute and relative convergence tolerances:
+The arguments for lower (`lb`) and upper (`ub`) bounds follow the arguments of `Optimization.OptimizationProblem`, as do other parameters for providing [constraints](https://docs.sciml.ai/Optimization/stable/tutorials/constraints/), such as `cons`. Any extraneous keyword arguments given to `maximum_likelihood` or `maximum_a_posteriori` are passed to `Optimization.solve`. Some often useful ones are `maxiters` for controlling the maximum number of iterations, and `abstol` and `reltol` for the absolute and relative convergence tolerances:

```{julia}
badly_converged_mle = maximum_likelihood(
@@ -100,11 +100,11 @@ We can check whether the optimisation converged using the `optim_result` field o
@show badly_converged_mle.optim_result;
```

-For more details, such as a full list of possible arguments, we encourage the reader to read the docstring of the function `Turing.Optimisation.estimate_mode`, which is what `maximum_likelihood` and `maximum_a_posteriori` call, and the documentation of [Optimization.jl](https://docs.sciml.ai/Optimization/stable/).
+For more details, such as a full list of possible arguments, we encourage the reader to read the docstring of the function `Turing.Optimisation.estimate_mode`, which is what `maximum_likelihood` and `maximum_a_posteriori` call, and the documentation of [Optimization.jl](https://docs.sciml.ai/Optimization/stable/).

## Analysing your mode estimate

-Turing extends several methods from `StatsBase` that can be used to analyze your mode estimation results. Methods implemented include `vcov`, `informationmatrix`, `coeftable`, `params`, and `coef`, among others.
+Turing extends several methods from `StatsBase` that can be used to analyse your mode estimation results. Methods implemented include `vcov`, `informationmatrix`, `coeftable`, `params`, and `coef`, among others.

For example, let's examine our ML estimate from above using `coeftable`:

@@ -123,4 +123,4 @@ You can begin sampling your chain from an MLE/MAP estimate by extracting the vec
#| eval: false
map_estimate = maximum_a_posteriori(model)
chain = sample(model, NUTS(), 1_000; initial_params=map_estimate.values.array)
-```
+```
\ No newline at end of file
diff --git a/usage/modifying-logprob/index.qmd b/usage/modifying-logprob/index.qmd
index 892ac7ba8..5f50cc565 100755
--- a/usage/modifying-logprob/index.qmd
+++ b/usage/modifying-logprob/index.qmd
@@ -12,10 +12,7 @@ using Pkg;
Pkg.instantiate();
```

-Turing accumulates log probabilities internally in an internal data structure that is accessible through the internal variable `__varinfo__` inside of the model definition.
-To avoid users having to deal with internal data structures, Turing provides the `@addlogprob!` macro which increases the accumulated log probability.
-For instance, this allows you to -[include arbitrary terms in the likelihood](https://github.com/TuringLang/Turing.jl/issues/1332) +Turing accumulates log probabilities internally in an internal data structure that is accessible through the internal variable `__varinfo__` within the model definition. To avoid users having to deal with internal data structures, Turing provides the `@addlogprob!` macro which increases the accumulated log probability. For instance, this allows you to [include arbitrary terms in the likelihood](https://github.com/TuringLang/Turing.jl/issues/1332) ```{julia} using Turing @@ -47,10 +44,9 @@ using LinearAlgebra end ``` -Note that `@addlogprob! (p::Float64)` adds `p` to the log likelihood. -If instead you want to add to the log prior, you can use +Note that `@addlogprob! (p::Float64)` adds `p` to the log likelihood. If instead you want to add to the log prior, you can use ```{julia} #| eval: false @addlogprob! (; logprior=value_goes_here) -``` +``` \ No newline at end of file diff --git a/usage/performance-tips/index.qmd b/usage/performance-tips/index.qmd index 2445b58b5..f08d731b5 100755 --- a/usage/performance-tips/index.qmd +++ b/usage/performance-tips/index.qmd @@ -31,7 +31,7 @@ using Turing end ``` -can be directly expressed more efficiently using a simple transformation: +can be directly expressed more efficiently with a simple transformation: ```{julia} using FillArrays @@ -138,4 +138,4 @@ In this particular model, the following call to `getindex` should be highlighted │ %120 = p::AbstractVector │ %121 = Base.getindex(%120, 1)::Any [...] -``` +``` \ No newline at end of file diff --git a/usage/probability-interface/index.qmd b/usage/probability-interface/index.qmd index 7ec1b0bab..3a1703d29 100644 --- a/usage/probability-interface/index.qmd +++ b/usage/probability-interface/index.qmd @@ -37,7 +37,7 @@ dataset[1:5] ## Conditioning and Deconditioning -Bayesian models can be transformed with two main operations, conditioning and deconditioning (also known as marginalization). +Bayesian models can be transformed with two main operations, conditioning and deconditioning (also known as marginalisation). Conditioning takes a variable and fixes its value as known. We do this by passing a model and a collection of conditioned variables to `|`, or its alias, `condition`: @@ -78,7 +78,7 @@ For illustrative purposes, however, we do not use this function in the examples ## Probabilities and Densities -We often want to calculate the (unnormalized) probability density for an event. +We often want to calculate the (unnormalised) probability density for an event. This probability might be a prior, a likelihood, or a posterior (joint) density. DynamicPPL provides convenient functions for this. To begin, let's define a model `gdemo`, condition it on a dataset, and draw a sample. @@ -179,4 +179,4 @@ end cross_val(dataset) ``` -[^1]: See [ParetoSmooth.jl](https://github.com/TuringLang/ParetoSmooth.jl) for a faster and more accurate implementation of cross-validation than the one provided here. +[^1]: See [ParetoSmooth.jl](https://github.com/TuringLang/ParetoSmooth.jl) for a faster and more accurate implementation of cross-validation than the one provided here. 
\ No newline at end of file
diff --git a/usage/sampler-visualisation/index.qmd b/usage/sampler-visualisation/index.qmd
index 8f6cda974..958bf1f74 100755
--- a/usage/sampler-visualisation/index.qmd
+++ b/usage/sampler-visualisation/index.qmd
@@ -114,7 +114,7 @@ setprogress!(false)

### Gibbs

-Gibbs sampling tends to exhibit a "jittery" trajectory. The example below combines `HMC` and `PG` sampling to traverse the posterior.
+Gibbs sampling tends to exhibit a "jittery" trajectory. The example below combines HMC and PG sampling to traverse the posterior.

```{julia}
c = sample(model, Gibbs(:s² => HMC(0.01, 5), :m => PG(20)), 1000)
plot_sampler(c)
```

### HMC

-Hamiltonian Monte Carlo (HMC) sampling is a typical sampler to use, as it tends to be fairly good at converging in a efficient manner. It can often be tricky to set the correct parameters for this sampler however, and the `NUTS` sampler is often easier to run if you don't want to spend too much time fiddling with step size and and the number of steps to take. Note however that `HMC` does not explore the positive values μ very well, likely due to the leapfrog and step size parameter settings.
+Hamiltonian Monte Carlo (HMC) is a commonly used sampler, as it tends to converge fairly efficiently. It can often be tricky to set the correct parameters for this sampler, however, and the NUTS sampler is often easier to run if you don't want to spend too much time fiddling with the step size and the number of steps to take. Note, however, that HMC does not explore the positive values of μ very well, likely due to the leapfrog and step size parameter settings.

```{julia}
c = sample(model, HMC(0.01, 10), 1000)
@@ -196,4 +196,4 @@ Next, we plot using 50 particles.

```{julia}
c = sample(model, PG(50), 1000)
plot_sampler(c)
-```
+```
\ No newline at end of file
diff --git a/usage/tracking-extra-quantities/index.qmd b/usage/tracking-extra-quantities/index.qmd
index 40017716d..8ce77151c 100755
--- a/usage/tracking-extra-quantities/index.qmd
+++ b/usage/tracking-extra-quantities/index.qmd
@@ -15,7 +15,7 @@ Pkg.instantiate();

Often, there are quantities in models that we might be interested in viewing the values of, but which are not random variables in the model that are explicitly drawn from a distribution.

-As a motivating example, the most natural parameterization for a model might not be the most computationally feasible.
+As a motivating example, the most natural parameterisation for a model might not be the most computationally feasible.
Consider the following (efficiently reparametrized) implementation of Neal's funnel [(Neal, 2003)](https://arxiv.org/abs/physics/0009028):

```{julia}
diff --git a/usage/troubleshooting/index.qmd b/usage/troubleshooting/index.qmd
index 380378cd5..ff988c180 100755
--- a/usage/troubleshooting/index.qmd
+++ b/usage/troubleshooting/index.qmd
@@ -126,7 +126,7 @@ When the log probability density of the model is calculated, `a` is sampled from
However, `x` is _always_ a `Vector{Float64}`, and the call `x[1] = a` attempts to insert a `Dual` object into a `Vector{Float64}`, which is not allowed.

::: {.callout-note}
-In more depth: the basic premise of ForwardDiff is that functions have to accept `Real` parameters instead of `Float64` (since `Dual` is a subtype of `Real`).
+In more detail: the basic premise of ForwardDiff is that functions have to accept `Real` parameters instead of `Float64` (since `Dual` is a subtype of `Real`).
Here, the line `x[1] = a` is equivalent to `setindex!(x, a, 1)`, and although the method `setindex!(::Vector{Float64}, ::Real, ...)` does exist, it attempts to convert the `Real` into a `Float64`, which is where it fails. ::: @@ -158,4 +158,4 @@ A better solution is to pass a type as a parameter to the model: b ~ MvNormal(x, I) end sample(forwarddiff_working2(), NUTS(; adtype=AutoForwardDiff()), 10) -``` +``` \ No newline at end of file
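Pieced together from the fragments above, the working model might look as follows (a sketch; the lines other than those quoted in the diff are assumptions, not the tutorial's exact code):

```{julia}
#| eval: false
using Turing, LinearAlgebra

@model function forwarddiff_working2(::Type{T}=Float64) where {T}
    x = Vector{T}(undef, 1)  # element type follows the AD type, so Duals fit
    a ~ Normal()
    x[1] = a
    b ~ MvNormal(x, I)
end
sample(forwarddiff_working2(), NUTS(; adtype=AutoForwardDiff()), 10)
```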