
Commit fbc76b7

Authored by drbenvincent (Benjamin T. Vincent) and OriolAbril
Bayesian copula estimation example notebook (#257)
* create truncated regression example
* delete truncated regression example from main branch
* initial commit of copula notebook
* obeying the law
* update notebook label to match notebook filename
* replace HTML for figure 1 + simpler logos
* replace direct URL link to other notebook
* use rng
* re-run to show graphviz images
* try out computation with full posterior
* clean up
* remove extra comma in ":tags:"
* fix typo: tranform_data() -> transform_data()
* fix final subtitle
* ran pre-commit checks
* make it run under v5
* Add Eric's suggested text
* move from Case Study section to How To section
* try to fix image showing as missing in the rendered docs
* duplicate gates logo for the existing binning notebook
* SEED -> rng
* update notebook anchors

---------

Co-authored-by: Benjamin T. Vincent <[email protected]>
Co-authored-by: Benjamin T. Vincent <[email protected]>
Co-authored-by: Oriol (ZBook) <[email protected]>
1 parent 5eba5cd commit fbc76b7

File tree

9 files changed (+1352, -20 lines)


examples/case_studies/binning.ipynb

Lines changed: 20 additions & 19 deletions
Large diffs are not rendered by default.

examples/case_studies/binning.myst.md

Lines changed: 2 additions & 1 deletion
```diff
@@ -11,6 +11,7 @@ kernelspec:
 ---
 
 (awkward_binning)=
+(binning)=
 # Estimating parameters of a distribution from awkwardly binned data
 :::{post} Oct 23, 2021
 :tags: binned data, case study, parameter estimation
@@ -27,7 +28,7 @@ Very often this data can be in a form that we, as data scientists, would not con
 
 So we are faced with a problem: we have datasets with counts of our measure of interest (whether that be age, income, BMI, or whatever), but they are binned, and they have been binned _differently_. This notebook presents a solution to this problem that [PyMC Labs](https://www.pymc-labs.io) worked on, supported by the [Gates Foundation](https://www.gatesfoundation.org/). We _can_ make inferences about the parameters of a population level distribution.
 
-![](gates.png)
+![](gates_labs_logos.png)
 
 +++
 
```

examples/case_studies/gates.png

-37.3 KB (binary file deleted; not shown)

examples/case_studies/gates_labs_logos.png

+385 KB (binary image added)

examples/conf.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -43,6 +43,7 @@
     "page_footer.md",
     "**/*.myst.md",
 ]
+numfig = True
 
 
 def remove_index(app):
```

examples/howto/copula-estimation.ipynb

Lines changed: 1034 additions & 0 deletions
Large diffs are not rendered by default.
examples/howto/copula-estimation.myst.md

Lines changed: 295 additions & 0 deletions
@@ -0,0 +1,295 @@
---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
---

(copula-estimation)=
# Bayesian copula estimation: Describing correlated joint distributions

:::{post} December 2023
:tags: copula, parameter estimation
:category: intermediate
:author: Eric Ma, Benjamin T. Vincent
:::

## The problem
When we deal with multiple variables (e.g. $a$ and $b$) we often want to describe the joint distribution $P(a, b)$ parametrically. If we are lucky, this joint distribution might be 'simple' in some way. For example, it could be that $a$ and $b$ are statistically independent, in which case we can factorize the joint distribution as $P(a, b) = P(a) P(b)$ and then just need appropriate parametric descriptions for $P(a)$ and $P(b)$. Even when that does not hold, $P(a, b)$ might still be described well by a simple multivariate distribution, such as a multivariate normal.

However, real datasets very often have complex correlational structure in $P(a, b)$, meaning that neither of these approaches is available to us, so alternative methods are required.
## Copulas to the rescue
This is where [copulas](https://en.wikipedia.org/wiki/Copula_(probability_theory)) come in. These allow you to describe a complex distribution $P(a, b)$ with correlational structure using a simple multivariate distribution (such as a multivariate Gaussian), two marginal distributions, and some transformations. For a very accessible introduction to copulas, we recommend [this blog post](https://twiecki.io/blog/2018/05/03/copulas/) by Thomas Wiecki.

This notebook covers how we can describe a distribution $P(a, b)$ with correlational structure using Bayesian methods to infer the parameters of the copula. The general approach we will take is shown in the schematic below.
- At the bottom, we have our **observation space** where the data lives.
- We can assume that this data was generated through the process from top to bottom: we have a multivariate normal distribution in **multivariate normal space** which is transformed in two stages to produce our data in observation space.
- We have access to data in **observation space**, but we can make inferences about the parameters in **multivariate normal space** by transforming from one space to the other (see the sketch after the schematic).

:::{figure-md} copula-fig-target

<img src="copula_schematic.png" alt="copula schematic" width="100%">

Schematic of a 2D Gaussian copula. Our complex distribution P(a, b) in observation space (bottom) is modelled as being generated by a 2D Gaussian copula in multivariate normal space (top). Mapping from multivariate normal space to observation space (downwards) is done by the normal cumulative distribution function followed by the inverse cumulative distribution function of the marginal distributions. The reverse, inference, process (upwards) is done through the cumulative distribution function of the marginal distributions followed by the inverse cumulative distribution function of the normal distribution.
:::
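
To make the schematic mappings concrete before we build them out with plots below, here is a minimal numerical sketch of a round trip through the three spaces, using scipy and an arbitrarily chosen exponential marginal (an illustration only, not part of the notebook's pipeline):

```python
from scipy.stats import expon, norm

z = 0.3  # a value in multivariate normal space (one margin)

# downwards: multivariate normal space -> uniform space -> observation space
u = norm.cdf(z)
x = expon.ppf(u)

# upwards (the inference direction): observation -> uniform -> normal space
u_back = expon.cdf(x)
z_back = norm.ppf(u_back)
assert abs(z - z_back) < 1e-9  # the round trip recovers the original value
```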

+++

This notebook will describe how to make inferences about copulas based on bivariate data with rich correlational structure. We at [PyMC Labs](https://www.pymc-labs.com) completed this work as part of a larger project with the [Gates Foundation](https://www.gatesfoundation.org), some of which has also been outlined {ref}`here <binning>`.

![](gates_labs_logos.png)

```{code-cell} ipython3
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pymc as pm
import pytensor.tensor as pt
import seaborn as sns

from scipy.stats import expon, multivariate_normal, norm
```

```{code-cell} ipython3
%config InlineBackend.figure_format = 'retina'
az.style.use("arviz-darkgrid")
plt.rcParams.update({"font.size": 14, "figure.constrained_layout.use": False})
SEED = 43
rng = np.random.default_rng(SEED)
```

## Data generating process
Before diving into inference, we will spend some time fleshing out the steps in the schematic figure above. First, we will demonstrate the generative model, describing a multivariate normal copula and transforming samples from it into observation space. Second, we will show how the inverse transformations let us move back from observation space to multivariate normal space. Once we have these details pinned down, we proceed to the inference process in a later section.

Now we will define the properties of our Gaussian copula with a dictionary, `θ`, containing the keys `a_dist`, `b_dist`, and `rho`.

- `rho` is the correlation coefficient of the multivariate normal copula.
- `a_dist` and `b_dist` hold the marginal distributions of $a$ and $b$, as frozen scipy distribution objects with their parameters baked in.

Note that we implicitly define the multivariate normal to have zero mean and unit variance. This is because these moments do not survive the transformation through 'uniform space', the second step in our copula schematic above, as the short sketch below illustrates.
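
Here is that short sketch: whatever location and scale a distribution has, the CDF transform of that distribution's own samples is uniform on $[0, 1]$, so this information cannot survive into uniform space (the `loc` and `scale` values below are arbitrary):

```python
import numpy as np
from scipy.stats import norm

rng_demo = np.random.default_rng(43)
samples = norm(loc=3, scale=5).rvs(1000, random_state=rng_demo)
u = norm(loc=3, scale=5).cdf(samples)
# prints ≈ 0.5 and ≈ 1/12, the mean and variance of Uniform(0, 1),
# regardless of the loc and scale chosen above
print(u.mean(), u.var())
```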

```{code-cell} ipython3
# define properties of our copula
b_scale = 2
θ = {"a_dist": norm(), "b_dist": expon(scale=1 / b_scale), "rho": 0.9}
```

First, we define the true multivariate normal and draw some samples from it.

```{code-cell} ipython3
n_samples = 5000

# draw random samples in multivariate normal space
mu = [0, 0]
cov = [[1, θ["rho"]], [θ["rho"], 1]]
x = multivariate_normal(mu, cov).rvs(n_samples, random_state=rng)
a_norm = x[:, 0]
b_norm = x[:, 1]

sns.jointplot(x=a_norm, y=b_norm, height=6, kind="hex");
```

Our first transformation (the normal CDF) takes data from multivariate normal space into uniform space. Note how the marginal distributions become uniform, while the correlational structure from multivariate normal space is preserved in the joint density below.

```{code-cell} ipython3
# make marginals uniform
a_unif = norm(loc=0, scale=1).cdf(a_norm)
b_unif = norm(loc=0, scale=1).cdf(b_norm)
sns.jointplot(x=a_unif, y=b_unif, height=6, kind="hex");
```
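
For a check that goes beyond the visual one, each margin can be tested against the uniform distribution with a Kolmogorov–Smirnov test (a small sketch building on `a_unif` and `b_unif` from the cell above; `kstest` is not among the notebook's imports):

```python
from scipy.stats import kstest

# large p-values are consistent with the marginals being Uniform(0, 1)
print(kstest(a_unif, "uniform"))
print(kstest(b_unif, "uniform"))
```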

Our final transformation (the inverse CDF of the marginal distributions) gives rise to $a$ and $b$ in observation space.

```{code-cell} ipython3
# transform to observation space
a = θ["a_dist"].ppf(a_unif)
b = θ["b_dist"].ppf(b_unif)
sns.jointplot(x=a, y=b, height=6, kind="hex", xlim=(-4, 4), ylim=(0, 3));
```

## Copula inference process
To understand the approach taken, we will walk through the inverse process, going from observation space back to multivariate normal space.

```{code-cell} ipython3
# observation -> uniform space
a1 = θ["a_dist"].cdf(a)
b1 = θ["b_dist"].cdf(b)
sns.jointplot(x=a1, y=b1, kind="hex", height=6);
```

```{code-cell} ipython3
# uniform -> MvNormal space
a2 = norm(loc=0, scale=1).ppf(a1)
b2 = norm(loc=0, scale=1).ppf(b1)
sns.jointplot(x=a2, y=b2, kind="hex", height=6);
```

So now we have worked through what we outlined in Figure 1. We stepped through the data generating process in detail, going from multivariate normal space to observation space, and we then saw how the inverse (inference) process goes from observation space back to multivariate normal space. This is the approach we use in the PyMC models below.

+++

## PyMC models for copula and marginal estimation

We will conduct inference about the parameters in multivariate normal space, with plausible parameter values constrained by the data in observation space. However, we also use our observations of $a$ and $b$ to constrain estimates of the parameters of the marginal distributions.

In our experimentation, we explored models that estimated the parameters of the marginals and the covariance parameter of the copula simultaneously, but we found this unstable. The solution used below proved more robust, and involves a 2-step process:

1. Estimate the parameters of the marginal distributions.
2. Estimate the covariance parameter of the copula, using point estimates of the marginal distribution parameters from step 1.

```{code-cell} ipython3
coords = {"obs_id": np.arange(len(a))}
with pm.Model(coords=coords) as marginal_model:
    """
    Assumes observed data in variables `a` and `b`
    """
    # marginal estimation
    a_mu = pm.Normal("a_mu", mu=0, sigma=1)
    a_sigma = pm.Exponential("a_sigma", lam=0.5)
    pm.Normal("a", mu=a_mu, sigma=a_sigma, observed=a, dims="obs_id")

    b_scale = pm.Exponential("b_scale", lam=0.5)
    pm.Exponential("b", lam=1 / b_scale, observed=b, dims="obs_id")

pm.model_graph.model_to_graphviz(marginal_model)
```

```{code-cell} ipython3
with marginal_model:
    marginal_idata = pm.sample(random_seed=rng)

az.plot_posterior(
    marginal_idata, var_names=["a_mu", "a_sigma", "b_scale"], ref_val=[0, 1.0, 1 / 2.0]
);
```

In the copula model below, you can see that we set up a prior over the covariance parameter. The posterior distribution over this parameter is constrained by the data in multivariate normal space. In order to do that, we need to transform the observations `[a, b]` from observation space to multivariate normal space, which we store in `data`.

On using point estimates: as you'll see in the code below, we have opted to use point estimates from Step 1 rather than the full posterior. This is a simplification we opted for due to complexities in tensor shape handling when passing posterior distributions in as parameters of a distribution.

During notebook review, however, [@OriolAbril](https://github.com/OriolAbril) (one of the maintainers of the PyMC Examples repository) correctly pointed out that exponentiating the logcdf of a data point evaluated under a single distribution built from point estimates _will not necessarily_ equal the expectation of exponentiating the logcdf of that data point evaluated under the many possible distributions constructed from a full posterior. To ensure timely progress on the notebook, we show the code as-is, but we leave this note both for our future selves to update the notebook later and as an opportunity for future readers to contribute by modifying the example to address this point.
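
For reference, one hedged sketch of that posterior-propagation idea is shown below: rather than evaluating the marginal CDFs at point estimates, average the CDFs over posterior draws for each observation. It builds on `marginal_idata`, `a`, and `b` from above, assumes an ArviZ version that provides `az.extract`, and is an illustration rather than the notebook's method; whether the probit of this averaged CDF is then the right quantity to feed to the copula model is exactly the open question raised above.

```python
import arviz as az
from scipy.stats import expon, norm

# draws from the marginal posteriors, shaped (1, n_draws) for broadcasting
post = az.extract(marginal_idata, num_samples=500)
a_mu_draws = post["a_mu"].values[None, :]
a_sigma_draws = post["a_sigma"].values[None, :]
b_scale_draws = post["b_scale"].values[None, :]

# for each observation, E[F(x | θ)] averaged over posterior draws of θ
a_u = norm(loc=a_mu_draws, scale=a_sigma_draws).cdf(a[:, None]).mean(axis=1)
b_u = expon(scale=b_scale_draws).cdf(b[:, None]).mean(axis=1)
```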

```{code-cell} ipython3
def transform_data(marginal_idata):
    # point estimates
    a_mu = marginal_idata.posterior["a_mu"].mean().item()
    a_sigma = marginal_idata.posterior["a_sigma"].mean().item()
    b_scale = marginal_idata.posterior["b_scale"].mean().item()
    # transformations from observation space -> uniform space
    __a = pt.exp(pm.logcdf(pm.Normal.dist(mu=a_mu, sigma=a_sigma), a))
    __b = pt.exp(pm.logcdf(pm.Exponential.dist(lam=1 / b_scale), b))
    # uniform space -> multivariate normal space
    _a = pm.math.probit(__a)
    _b = pm.math.probit(__b)
    # join into an Nx2 matrix
    data = pt.math.stack([_a, _b], axis=1).eval()
    return data, a_mu, a_sigma, b_scale


data, a_mu, a_sigma, b_scale = transform_data(marginal_idata)
```

```{code-cell} ipython3
coords.update({"param": ["a", "b"], "param_bis": ["a", "b"]})
with pm.Model(coords=coords) as copula_model:
    # Prior on the covariance of the multivariate normal
    chol, corr, stds = pm.LKJCholeskyCov(
        "chol",
        n=2,
        eta=2.0,
        sd_dist=pm.Exponential.dist(1.0),
        compute_corr=True,
    )
    cov = pm.Deterministic("cov", chol.dot(chol.T), dims=("param", "param_bis"))

    # Likelihood function
    pm.MvNormal("N", mu=0.0, cov=cov, observed=data, dims=("obs_id", "param"))

pm.model_graph.model_to_graphviz(copula_model)
```

```{code-cell} ipython3
with copula_model:
    copula_idata = pm.sample(random_seed=rng, tune=2000, cores=1)

az.plot_posterior(copula_idata, var_names=["cov"], ref_val=[1.0, θ["rho"], θ["rho"], 1.0]);
```

You can see that we have successfully recovered the covariance matrix of the multivariate normal copula that was used to generate the sample data.

In the section below, we will use this information to sample from our parametric description of $P(a, b)$.
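
As a cross-check against `rho` directly, note that because we passed `compute_corr=True`, PyMC also registers the correlation matrix as a deterministic (with PyMC's default naming, `chol_corr`, after the `chol` variable), so the posterior correlation can be summarized without reading it off the covariance plot:

```python
# posterior summary of the copula correlation matrix
az.summary(copula_idata, var_names=["chol_corr"], round_to=2)
```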

+++

## Comparing inferences against the true data
Finally, we can do a visual check to see whether our inferences (red) match up with the original observed data (black).

```{code-cell} ipython3
# data munging to extract covariance estimates from copula_idata in a useful shape
d = {k: v.values.reshape((-1, *v.shape[2:])) for k, v in copula_idata.posterior[["cov"]].items()}

# generate (a, b) samples
ab = np.vstack([multivariate_normal([0, 0], cov).rvs() for cov in d["cov"]])

# transform to uniform space
uniform_a = norm().cdf(ab[:, 0])
uniform_b = norm().cdf(ab[:, 1])

# transform to observation space
# the marginal parameters a_mu, a_sigma, b_scale are point estimates from the marginal estimation step
ppc_a = norm(loc=a_mu, scale=a_sigma).ppf(uniform_a)
ppc_b = expon(scale=b_scale).ppf(uniform_b)
```

```{code-cell} ipython3
# plot data in black
ax = az.plot_pair(
    {"a": a, "b": b},
    marginals=True,
    kind="kde",
    scatter_kwargs={"alpha": 0.1},
    kde_kwargs=dict(
        contour_kwargs=dict(colors="k", linestyles="--"), contourf_kwargs=dict(alpha=0)
    ),
    marginal_kwargs=dict(color="k", plot_kwargs=dict(ls="--")),
)

# plot inferences in red, on the same axes
axs = az.plot_pair(
    {"a": ppc_a, "b": ppc_b},
    marginals=True,
    kind="kde",
    scatter_kwargs={"alpha": 0.01},
    kde_kwargs=dict(contour_kwargs=dict(colors="r", linestyles="-"), contourf_kwargs=dict(alpha=0)),
    marginal_kwargs=dict(color="r"),
    ax=ax,
);
```

## Acknowledgements
We would like to acknowledge [Jonathan Sedar](https://github.com/jonsedar), [Junpeng Lao](https://github.com/junpenglao), and [Oriol Abril](https://github.com/OriolAbril) for useful advice during the development of this notebook.

+++

## Authors
* Authored by [Eric Ma](https://www.pymc-labs.com/team) & [Benjamin T. Vincent](https://github.com/drbenvincent) in November, 2023 ([pymc-examples#257](https://github.com/pymc-devs/pymc-examples/pull/257)).

+++

## Watermark

```{code-cell} ipython3
%load_ext watermark
%watermark -n -u -v -iv -w -p pytensor,xarray
```

:::{include} ../page_footer.md
:::

examples/howto/copula_schematic.png

+398 KB (binary image added)

examples/howto/gates_labs_logos.png

+385 KB (binary image added)
