
Commit 7102906

jeremiahpslewis, cscherrer, mschauer, gdalle authored
Doc adjustments (#169)
* Tweaks
* Further tweaks
* Update docs/src/affine.md
  Co-authored-by: Chad Scherrer <[email protected]>
* Update docs/src/affine.md
* Update docs/src/affine.md
* Update docs/src/affine.md
* Update docs/src/affine.md
  Co-authored-by: Chad Scherrer <[email protected]>
* Update docs/src/affine.md
* Update docs/src/affine.md
  Co-authored-by: Chad Scherrer <[email protected]>
* Update docs/src/affine.md
  Co-authored-by: Moritz Schauer <[email protected]>
* Update docs/src/affine.md
  Co-authored-by: Chad Scherrer <[email protected]>

Co-authored-by: Moritz Schauer <[email protected]>
Co-authored-by: Guillaume Dalle <[email protected]>
1 parent 9f10340 commit 7102906


docs/src/affine.md

Lines changed: 21 additions & 17 deletions
@@ -1,27 +1,31 @@
# Affine Transformations

-It's very common for measures to be parameterized by `μ` and `σ`, for example as in `Normal(μ=3, σ=4)` or `StudentT(ν=1, μ=3, σ=4)`. In this context, `μ` and `σ` do not always refer to the mean and standard deviation (the `StudentT` above is equivalent to a Cauchy, so both are undefined).
+It's very common for measures to use parameters `μ` and `σ`, for example as in `Normal(μ=3, σ=4)` or `StudentT(ν=1, μ=3, σ=4)`. In this context, `μ` and `σ` need not always refer to the mean and standard deviation (the `StudentT` measure specified above is equivalent to a [Cauchy](https://en.wikipedia.org/wiki/Cauchy_distribution) measure, so both mean and standard deviation are undefined).

-Rather, `μ` is a "location parameter", and `σ` is a "scale parameter". Together these determine an affine transformation
+In general, `μ` is a "location parameter", and `σ` is a "scale parameter". Together these parameters determine an affine transformation.

```math
f(z) = σ z + μ
```

-Here are below, we'll use ``z`` to represent an "un-transformed" variable, typically coming from a measure like `Normal()` with no location or scale parameters.
+Starting with the above definition, we'll use ``z`` to represent an "un-transformed" variable, typically coming from a measure which has neither a location nor a scale parameter, for example `Normal()`.

-Affine transforms are often incorrectly referred to as "linear". Linearity requires ``f(ax + by) = a f(x) + b f(y)`` for scalars ``a`` and ``b``, which only holds for the above ``f`` if ``μ=0``.
+Affine transformations are often ambiguously referred to as "linear transformations". In fact, an affine transformation is ["the composition of two functions: a translation and a linear map"](https://en.wikipedia.org/wiki/Affine_transformation#Representation) in the stricter algebraic sense: for a function `f` to be linear requires
+``f(ax + by) == a f(x) + b f(y)``
+for scalars ``a`` and ``b``. For an affine function
+``f(z) = σ * z + μ``, where the linear map is given by ``σ`` and the translation by ``μ``,
+linearity holds only if the translation component ``μ`` is equal to zero.
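
As a quick numerical check of this last point, here is a small sketch in plain Julia (illustrative only, not code from MeasureTheory.jl): with a nonzero translation ``μ`` the affine map fails the linearity test, while the pure scaling map passes it.

```julia
# Sketch: an affine map f(z) = σz + μ is linear only when μ == 0.
σ, μ = 2.0, 3.0
f(z) = σ * z + μ      # affine: scaling plus translation
g(z) = σ * z          # purely linear: scaling only

a, b, x, y = 2.0, 3.0, 2.0, 4.0

f(a * x + b * y) ≈ a * f(x) + b * f(y)   # false, because μ ≠ 0
g(a * x + b * y) ≈ a * g(x) + b * g(y)   # true
```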


## Cholesky-based parameterizations

-If the "un-transformed" `z` is a scalar, things are relatively simple. But it's important our approach handle the multivariate case as well.
+If the "un-transformed" `z` is univariate, things are relatively simple. But it's important that our approach handle the multivariate case as well.

-In the literature, it's common for a multivariate normal distribution to be parameterized by a mean `μ` and covariance matrix `Σ`. This is mathematically convenient, but can be very awkward from a computational perspective.
+In the literature, it's common for a multivariate normal distribution to be parameterized by a mean `μ` and covariance matrix `Σ`. This is mathematically convenient, but leads to an ``O(n^3)`` [Cholesky decomposition](https://en.wikipedia.org/wiki/Cholesky_decomposition), which becomes a significant bottleneck to compute as ``n`` gets large.

While MeasureTheory.jl includes (or will include) a parameterization using `Σ`, we prefer to work in terms of its Cholesky decomposition ``σ``.

-Using "``σ``" for this may seem strange at first, so we should explain the notation. Let ``σ`` be a lower-triangular matrix satisfying
+To see the relationship between our ``σ`` parameterization and the likely more familiar ``Σ`` parameterization, let ``σ`` be a lower-triangular matrix satisfying

```math
σ σᵗ = Σ
@@ -33,23 +37,23 @@ Then given a (multivariate) standard normal ``z``, the covariance matrix of ``σ
𝕍[σ z + μ] = Σ
```

-Comparing to the one dimensional case where
+The one-dimensional case where we have

```math
𝕍[σ z + μ] = σ²
```

-shows that the lower Cholesky factor of the covariance generalizes the concept of standard deviation, justifying the notation.
+shows that the lower Cholesky factor of the covariance generalizes the concept of standard deviation, completing the link between ``σ`` and ``Σ``.
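
To make the notation concrete, the relationship can be checked numerically. The following is an illustrative sketch using only Julia's standard libraries (not MeasureTheory.jl code): it builds a covariance matrix ``Σ``, takes its lower Cholesky factor, and confirms that ``σ z + μ`` applied to standard normal draws has (approximately) covariance ``Σ``.

```julia
using LinearAlgebra, Random, Statistics

Random.seed!(42)
A = randn(3, 3)
Σ = Symmetric(A * A' + I)     # an arbitrary positive-definite covariance matrix
μ = [1.0, -2.0, 0.5]

σ = cholesky(Σ).L             # lower-triangular factor with σ σᵗ = Σ
σ * σ' ≈ Σ                    # true

# Transform standard normal draws via z ↦ σz + μ and check the sample covariance.
Z = randn(3, 100_000)
X = σ * Z .+ μ
isapprox(cov(X; dims = 2), Σ; rtol = 0.05)   # true, up to Monte Carlo error
```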


## The "Cholesky precision" parameterization

-The ``(μ,σ)`` parameterization is especially convenient for random sampling. Any `z ~ Normal()` determines an `x ~ Normal(μ,σ)` through
+The ``(μ,σ)`` parameterization is especially convenient for random sampling. Any `z ~ Normal()` determines an `x ~ Normal(μ,σ)` through the affine transformation

```math
x = σ z + μ
```

-On the other hand, the log-density computation is not quite so simple. Starting with an ``x``, we need to find ``z`` using
+The log-density computation for a `Normal` with parameters ``μ``, ``σ`` does not follow as directly. Starting with an ``x``, we need to find ``z`` using

```math
z = σ⁻¹ (x - μ)
@@ -63,19 +67,19 @@ logdensity(d::Normal{(:μ,:σ)}, x) = logdensity(d.σ \ (x - d.μ)) - logdet(d.

Here the `- logdet(σ)` is the "log absolute Jacobian", required to account for the stretching of the space.
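
The role of that Jacobian term can be verified directly. Below is an illustrative sketch in plain Julia (not the package's implementation): the standard-normal log-density evaluated at ``z = σ⁻¹(x - μ)``, minus the log-determinant of ``σ``, agrees with the textbook multivariate normal log-density with covariance ``Σ = σσᵗ``.

```julia
using LinearAlgebra, Random

Random.seed!(1)
n = 3
A = randn(n, n)
Σ = Symmetric(A * A' + I)     # covariance matrix
σ = cholesky(Σ).L             # lower Cholesky factor
μ = randn(n)
x = randn(n)

# Log-density of a standard (zero-mean, identity-covariance) normal at z
stdnormal_logpdf(z) = -0.5 * dot(z, z) - 0.5 * length(z) * log(2π)

# Change-of-variables form: solve for z, then subtract the log Jacobian
lhs = stdnormal_logpdf(σ \ (x - μ)) - logdet(σ)

# Textbook multivariate normal log-density with mean μ and covariance Σ
rhs = -0.5 * dot(x - μ, Σ \ (x - μ)) - 0.5 * logdet(Σ) - 0.5 * n * log(2π)

lhs ≈ rhs   # true
```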

-The above requires solving a linear system, which adds some overhead. Even with the convenience of a lower triangular system, it's still not quite a efficient as a multiplication.
+The above requires solving a linear system, which adds some overhead. Even with the convenience of a lower triangular system, it's still not quite as efficient as multiplication.

-In addition to the covariance ``Σ``, it's also common to parameterize a multivariate normal by its _precision matrix_, ``Ω = Σ⁻¹``. Similarly to our use of ``σ``, we'll use ``ω`` for the lower Cholesky factor of ``Ω``.
+In addition to the covariance ``Σ``, it's also common to parameterize a multivariate normal by its _precision matrix_, defined as the inverse of the covariance matrix, ``Ω = Σ⁻¹``. Similar to our use of ``σ`` for the lower Cholesky factor of ``Σ``, we'll use ``ω`` for the lower Cholesky factor of ``Ω``.

-This allows a more efficient log-density,
+This parameterization enables more efficient calculation of the log-density using only multiplication and addition,

```julia
logdensity(d::Normal{(:μ,:ω)}, x) = logdensity(d.ω * (x - d.μ)) + logdet(d.ω)
```
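
This, too, can be sanity-checked numerically. In the sketch below (plain Julia, illustrative only), ``ω`` is taken concretely as ``σ⁻¹``, one choice consistent with the formula above; whatever factor the package stores internally, the point is that the ``(μ,ω)`` form needs only a multiplication and a `logdet`, yet matches the solve-based ``(μ,σ)`` form.

```julia
using LinearAlgebra, Random

Random.seed!(2)
n = 3
A = randn(n, n)
Σ = Symmetric(A * A' + I)
σ = cholesky(Σ).L
ω = inv(σ)                    # assumed here: ω = σ⁻¹, so ω * (x - μ) == σ \ (x - μ)
μ = randn(n)
x = randn(n)

stdnormal_logpdf(z) = -0.5 * dot(z, z) - 0.5 * length(z) * log(2π)

lp_solve    = stdnormal_logpdf(σ \ (x - μ)) - logdet(σ)   # (μ, σ): triangular solve
lp_multiply = stdnormal_logpdf(ω * (x - μ)) + logdet(ω)   # (μ, ω): multiplication only

lp_solve ≈ lp_multiply   # true
```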

## `AffineTransform`

-Transforms like ``z → σ z + μ`` and ``z → ω \ z + μ`` can be represented using an `AffineTransform`. For example,
+Transforms like ``z → σ z + μ`` and ``z → ω \ z + μ`` can be specified in MeasureTheory.jl using an `AffineTransform`. For example,

```julia
julia> f = AffineTransform((μ=3., σ=2.))
@@ -85,9 +89,9 @@ julia> f(1.0)
5.0
```

-In the scalar case this is relatively simple to invert. But if `σ` is a matrix, this would require matrix inversion. Adding to this complication is that lower triangular matrices are not closed under matrix inversion.
+In the univariate case this is relatively simple to invert. But if `σ` is a matrix, matrix inversion becomes necessary. This is not always possible, as lower triangular matrices are not closed under matrix inversion, so an inverse of the same form is not guaranteed to exist.
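
For intuition in the univariate case mentioned above, here is a tiny sketch in plain Julia (not the package's `AffineTransform` or its `inv` method): inverting ``x = σ z + μ`` amounts to subtracting the shift and then dividing by the scale.

```julia
# Sketch of inverting a scalar affine transform by hand.
σ, μ = 2.0, 3.0
f(z) = σ * z + μ               # forward transform
f_inv(x) = (x - μ) / σ         # inverse: subtract the shift, divide by the scale

f(1.0)            # 5.0
f_inv(f(1.0))     # 1.0
```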

-Our multiple parameterizations make it convenient to deal with these issues. The inverse transform of a ``(μ,σ)`` transform will be in terms of ``(μ,ω)``, and vice-versa. So
+With multiple parameterizations of a given family of measures, we can work around these issues. The inverse transform of a ``(μ,σ)`` transform will be in terms of ``(μ,ω)``, and vice-versa. So

```julia
julia> f⁻¹ = inv(f)
