Chapters/Distributions.qmd
Lines changed: 79 additions & 1 deletion
@@ -136,6 +136,84 @@ pz.Normal(30, 4).plot_ppf();
136
136
```
137
137
138
138
139
+
## Point estimates
140
+
141
+
Sometimes, rather than visualizing distributions, we summarize them using a few numbers. This is useful when dealing with many distributions, when only a brief summary is needed, or when the distributions are high-dimensional and hard to plot.
142
+
143
+
To summarize a distribution we usually use some measure of central tendency and some measure of dispersion. Common measures of central tendency include the `mean`, `median`, and `mode`. @fig-point-estimates shows these three measures for a `Gamma` distribution.
144
+
145
+
```{python}
146
+
#| code-fold: true
147
+
#| label: fig-point-estimates
148
+
#| fig-cap: "A gamma distribution with its mean, median, and mode."
```
The mean is the average value of the distribution, the median is the value that splits the probability mass into two equal halves, and the mode is the value with the highest probability density. In symmetric unimodal distributions like the `Normal`, these three measures are equal. But in skewed distributions like the `Gamma`, they can be quite different. The mean is probably the most commonly used measure of central tendency, but it can be sensitive to outliers. The median is more robust to outliers and is often preferred for skewed distributions. The mode is useful when we are interested in the most probable value of the distribution, although it is used less often than the other two. In practice, the choice of which measure to use depends on the context and the specific characteristics of the distribution being analyzed.
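
As a rough numerical check of how the three measures separate under skew, here is a minimal sketch. It uses SciPy rather than the PreliZ and ArviZ calls used elsewhere in this chapter, and the parameters (`alpha = 2`, `scale = 1`) are illustrative assumptions, not the values behind the figure.

```python
from scipy import stats

# Illustrative parameters (an assumption, not the values used in the figure).
alpha, scale = 2.0, 1.0
gamma = stats.gamma(a=alpha, scale=scale)

mean = gamma.mean()          # equals alpha * scale
median = gamma.median()      # no closed form; SciPy computes it numerically
mode = (alpha - 1) * scale   # closed form, valid when alpha >= 1

print(f"mode={mode:.3f}  median={median:.3f}  mean={mean:.3f}")
```

For a right-skewed distribution such as this one, the values are ordered mode < median < mean: the long right tail pulls the mean furthest from the peak.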
162
+
163
+
In ArviZ we can compute these point estimates from samples using the `azp.mean()`, `azp.median()`, and `azp.mode()` functions.
Point estimates also show up in other places in ArviZ, for example in `plot_dist()` or when calling `summary()`. These functions have a `point_estimate` argument to choose which point estimate to use. The default value is controlled globally by:
171
+
172
+
```{python}
173
+
azp.rcParams["stats.point_estimate"]
174
+
```
175
+
176
+
## Credible intervals
177
+
178
+
To describe the uncertainty in our estimates, we usually want to complement point estimates with some measure of dispersion. A common approach is to use the standard deviation or the variance. The standard deviation is usually preferred over the variance because it is in the same units as the original data. However, both measures can be misleading for skewed distributions or distributions with heavy tails. Hence, other measures of dispersion are often preferred, such as the median absolute deviation ([MAD](https://en.wikipedia.org/wiki/Median_absolute_deviation)) or the interquartile range ([IQR](https://en.wikipedia.org/wiki/Interquartile_range)).
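
To make the robustness argument concrete, the following sketch compares the three measures on a sample before and after a single extreme value is appended. It relies on NumPy and SciPy (`scipy.stats.median_abs_deviation` and `scipy.stats.iqr`) rather than ArviZ, and the contaminating value of 1000 is a hypothetical choice made only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
clean = rng.normal(loc=0.0, scale=1.0, size=1_000)
# Same sample with one extreme value appended (hypothetical contamination).
contaminated = np.append(clean, 1_000.0)


def dispersion(x):
    """Return (standard deviation, MAD, IQR) for a 1-D sample."""
    return np.std(x), stats.median_abs_deviation(x), stats.iqr(x)


sd_clean, mad_clean, iqr_clean = dispersion(clean)
sd_cont, mad_cont, iqr_cont = dispersion(contaminated)

print(f"SD : {sd_clean:.2f} -> {sd_cont:.2f}")
print(f"MAD: {mad_clean:.2f} -> {mad_cont:.2f}")
print(f"IQR: {iqr_clean:.2f} -> {iqr_cont:.2f}")
```

A single extreme value inflates the standard deviation by more than an order of magnitude, while the MAD and the IQR barely move because they depend on ranks rather than on squared distances.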
179
+
180
+
One issue with using a single number to summarize uncertainty is that it does not provide information about the shape of the distribution. For example, two distributions can have the same standard deviation but very different shapes. And for bounded distributions like the `Gamma`, the standard deviation can be misleading because it does not take into account that negative values are not allowed.
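
A small sketch of both problems, using SciPy with illustrative parameters chosen here (they are assumptions, not values from this chapter): a Normal and a Gamma with the same mean and the same standard deviation, but very different shapes.

```python
from scipy import stats

# Both distributions have mean 1 and standard deviation 1 (illustrative choice).
normal = stats.norm(loc=1.0, scale=1.0)
gamma = stats.gamma(a=1.0, scale=1.0)  # support is [0, inf)

for name, dist in [("Normal", normal), ("Gamma", gamma)]:
    skew = float(dist.stats(moments="s"))
    print(f"{name}: mean={dist.mean():.1f}  sd={dist.std():.1f}  skewness={skew:.1f}")

# A naive "mean minus two standard deviations" bound for the Gamma is negative,
# even though the distribution has no mass below zero.
naive_lower = gamma.mean() - 2 * gamma.std()
print(f"naive lower bound: {naive_lower:.1f}, P(x < 0) = {gamma.cdf(0.0):.1f}")
```

The standard deviation alone cannot tell these two distributions apart, and a symmetric "mean plus or minus a few standard deviations" summary of the Gamma extends into values the distribution cannot take.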
181
+
182
+
One popular way to summarize the uncertainty in a distribution is to use intervals. For example, we may want to report that 90% of the values lie within a certain range. In Bayesian statistics, these are often called `credible intervals`. In principle, we can define infinitely many intervals containing a given mass, so we need some additional constraint to build useful intervals. Two common types of credible intervals are the `equal-tailed interval` (ETI) and the `highest-density interval` (HDI).
183
+
184
+
* ETI: The interval that contains a given percentage of the distribution, with equal probability in both tails. For example, a 90% equal-tailed interval
185
+
has 90% of the distribution between the lower and upper bounds, with 5% of the distribution in each tail.
186
+
187
+
* HDI: The interval that contains a given mass and where all points inside the interval have a higher density than any point outside the interval. For unimodal distributions, we can equivalently think of it as the shortest interval containing a given portion of the probability mass.
188
+
189
+
For asymmetric distributions, the HDI is usually preferred over the ETI because it better represents the most credible values of the distribution.
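
The two definitions can be sketched directly from samples. The code below is a minimal NumPy illustration, not ArviZ's implementation: the helper names `eti_from_samples` and `hdi_from_samples` are hypothetical, the HDI is computed as the shortest window of sorted samples (which assumes a unimodal distribution), and the Gamma parameters are arbitrary.

```python
import numpy as np


def eti_from_samples(samples, prob):
    """Equal-tailed interval: cut (1 - prob) / 2 of the mass from each tail."""
    tail = (1 - prob) / 2
    low, high = np.quantile(samples, [tail, 1 - tail])
    return low, high


def hdi_from_samples(samples, prob):
    """Shortest interval holding a fraction `prob` of the samples (unimodal case)."""
    x = np.sort(np.asarray(samples))
    n = x.size
    k = int(np.ceil(prob * n))          # number of samples inside the interval
    widths = x[k - 1:] - x[: n - k + 1]  # width of every window of k samples
    i = int(np.argmin(widths))           # start of the narrowest window
    return x[i], x[i + k - 1]


rng = np.random.default_rng(0)
samples = rng.gamma(shape=2.0, scale=1.0, size=10_000)  # right-skewed example

eti_lo, eti_hi = eti_from_samples(samples, 0.9)
hdi_lo, hdi_hi = hdi_from_samples(samples, 0.9)
print(f"ETI: ({eti_lo:.2f}, {eti_hi:.2f})  width={eti_hi - eti_lo:.2f}")
print(f"HDI: ({hdi_lo:.2f}, {hdi_hi:.2f})  width={hdi_hi - hdi_lo:.2f}")
```

For this right-skewed sample the HDI is narrower than the ETI and is shifted toward zero, where the density is highest; for a symmetric unimodal distribution the two intervals would nearly coincide.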
190
+
191
+
```{python}
192
+
#| code-fold: true
193
+
#| label: credible-intervals
194
+
#| fig-cap: "HDI (top) and ETI (bottom) credible intervals for a Gamma distribution. The thin line of the interval shows the 89% interval and the thick line shows the 50% interval."
```
In ArviZ we can compute credible intervals from samples using the `azp.hdi()` and `azp.eti()` functions.
202
+
203
+
```{python}
204
+
azp.hdi(pois_data), azp.eti(pois_data)
205
+
```
206
+
207
+
These, as well as other functions, have a `ci_prob` argument to choose the mass of the credible interval. The default value is globally controlled by:
208
+
209
+
```{python}
210
+
azp.rcParams["stats.ci_prob"]
211
+
```
212
+
213
+
Common choices for the mass of credible intervals are 95%, 90%, and 50%. ArviZ's default of 89% is a friendly reminder that no value is inherently better than the others; the key is to be consistent and transparent about your choice [@mcelreath_2020]. One practical advantage of using 89% as a default is that it requires fewer samples to estimate reliably than, for example, 95% intervals. In @sec-mcmc-diagnostics we discuss how the reliability of estimates like credible intervals depends on the number of samples used.
214
+
215
+
Credible intervals also show up in other places in ArviZ, for example in `plot_dist()` or when calling `summary()`.
216
+
139
217
## Distributions in ArviZ {#sec-dist}
140
218
141
219
The PMF/PDF, CDF, and PPF are convenient ways to represent distributions for which we know the analytical form. But in practice, we often work with distributions whose analytical form we don't know. Instead, we have a set of samples from the distribution. A clear example is a posterior distribution computed using an MCMC method. For those cases, we still want useful visualizations that we can use for ourselves or to show others. Some common methods are:
@@ -256,4 +334,4 @@ azp.plot_dist(
256
334
);
257
335
```
258
336
259
-
The number of quantiles is something you will need to choose by yourself, usually, it is a good idea to keep this number relatively small and "round", as the main feature of a quantile dot plot is that finding probability intervals reduces to counting dots [@kay_2016; @fernandes_2018]. It is easier to count and compute proportions if you have 10, or 20 dots than if you have 11 or 57. But sometimes a larger number could be a good idea too. When we are interested in the tails of a distribution, using more quantiles can help. A choice like 100 is often a good default because each dot represents exactly 1% of the distribution, ensuring small probabilities can still be estimated accurately. 100 is the default in ArviZ.
337
+
The number of quantiles is something you will need to choose yourself. Usually, it is a good idea to keep this number relatively small and "round", as the main feature of a quantile dot plot is that finding probability intervals reduces to counting dots [@kay_2016; @fernandes_2018]. It is easier to count and compute proportions if you have 10 or 20 dots than if you have 11 or 57. But sometimes a larger number could be a good idea too. When we are interested in the tails of a distribution, using more quantiles can help. A choice like 100 is often a good default because each dot represents exactly 1% of the distribution, ensuring small probabilities can still be estimated accurately. 100 is the default in ArviZ.
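
The "counting dots" idea can be sketched numerically. The example below uses SciPy and a `Normal(30, 4)` distribution; placing the dots at the midpoint quantiles `(i + 0.5) / n_dots` and the threshold of 26 are assumptions made for illustration, and ArviZ may place its dots differently.

```python
import numpy as np
from scipy import stats

dist = stats.norm(loc=30, scale=4)
n_dots = 20                                   # each dot stands for 1/20 = 5%

# Assumed dot positions: quantiles at the midpoint of each 5% slice.
probs = (np.arange(n_dots) + 0.5) / n_dots
dots = dist.ppf(probs)

threshold = 26                                # hypothetical value of interest
dots_below = int(np.sum(dots < threshold))
approx = dots_below / n_dots                  # probability estimated by counting
exact = dist.cdf(threshold)

print(f"{dots_below} of {n_dots} dots below {threshold}: {approx:.2f} (exact {exact:.3f})")
```

With 20 dots the estimate can only move in steps of 5%, so it is off by at most one dot's worth of probability; using 100 dots narrows that step to 1%.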