diff --git a/lectures/_toc.yml b/lectures/_toc.yml
index 4681e02d6..a8dc9e4ee 100644
--- a/lectures/_toc.yml
+++ b/lectures/_toc.yml
@@ -15,6 +15,7 @@ parts:
    numbered: true
    chapters:
    - file: prob_matrix
+   - file: stats_examples
    - file: lln_clt
    - file: prob_meaning
    - file: multi_hyper
diff --git a/lectures/prob_matrix.md b/lectures/prob_matrix.md
index 783a9a262..b53c7788f 100644
--- a/lectures/prob_matrix.md
+++ b/lectures/prob_matrix.md
@@ -11,11 +11,13 @@ kernelspec:
   name: python3
---

+
+
# Elementary Probability with Matrices

This lecture uses matrix algebra to illustrate some basic ideas about probability theory.

-After providing somewhat informal definitions of the underlying objects, we'll use matrices and vectors to describe probability distributions.
+After introducing the underlying objects, we'll use matrices and vectors to describe probability distributions.

-Among concepts that we'll be studying include
+Among the concepts that we'll study are

@@ -31,7 +33,9 @@ Among concepts that we'll be studying include
- parameters that define a probability distribution
- sufficient statistics as data summaries

-We'll use a matrix to represent a bivariate probability distribution and a vector to represent a univariate probability distribution
+We'll use a matrix to represent a bivariate or multivariate probability distribution and a vector to represent a univariate probability distribution.
+
+This {doc}`companion lecture <stats_examples>` describes some popular probability distributions and uses Python to sample from them.

In addition to what's in Anaconda, this lecture will need the following libraries:

@@ -59,7 +63,14 @@ set_matplotlib_formats('retina')

We'll briefly define what we mean by a **probability space**, a **probability measure**, and a **random variable**.

-For most of this lecture, we sweep these objects into the background, but they are there underlying the other objects that we'll mainly focus on.
+For most of this lecture, we sweep these objects into the background.
+
+```{note}
+Nevertheless, they'll be lurking beneath the **induced distributions** of random variables that we'll focus on here. These deeper objects are essential for defining and analysing the concepts of stationarity and ergodicity that underlie laws of large numbers. For a relatively
+nontechnical presentation of some of these results, see this chapter from Lars Peter Hansen and Thomas J. Sargent's online monograph titled "Risk, Uncertainty, and Values".
+```
+
+

-Let $\Omega$ be a set of possible underlying outcomes and let $\omega \in \Omega$ be a particular underlying outcomes.
+Let $\Omega$ be a set of possible underlying outcomes and let $\omega \in \Omega$ be a particular underlying outcome.

@@ -87,6 +98,11 @@ where ${\mathcal G}$ is the subset of $\Omega$ for which $X(\omega) \in A$.

We call this the induced probability distribution of random variable $X$.

+Instead of working explicitly with an underlying probability space $(\Omega,\mathcal{F})$ and probability measure $\mu$,
+applied statisticians often proceed simply by specifying a form for an induced distribution of a random variable $X$.
+
+That is how we'll proceed in this lecture and in many subsequent lectures.
+

## What Does Probability Mean?

@@ -133,7 +149,7 @@ as a short-hand way of saying that the random variable $X$ is described by the p

-Consider drawing a sample $x_0, x_1, \dots , x_{N-1}$ of $N$ independent and identically distributoed draws of $X$.
+Consider drawing a sample $x_0, x_1, \dots , x_{N-1}$ of $N$ independent and identically distributed draws of $X$.

-What do the "identical" and "independent" mean in IID or iid ("identically and independently distributed)?
+What do "identical" and "independent" mean in IID or iid ("identically and independently distributed")?
- "identical" means that each draw is from the same distribution.
-- "independent" means that joint distribution equal products of marginal distributions, i.e.,
+- "independent" means that the joint distribution equals the product of the marginal distributions, i.e.,

@@ -145,7 +161,7 @@ $$
\end{aligned}
$$

-We define an e **empirical distribution** as follows.
+We define an **empirical distribution** as follows.

For each $i = 0,\dots,I-1$, let

@@ -158,7 +174,7 @@ N & = \sum^{I-1}_{i=0} N_i \quad \text{total number of draws},\\
$$

-Key ideas that justify connecting probability theory with statistics are laws of large numbers and central limit theorems
+Key concepts that connect probability theory with statistics are laws of large numbers and central limit theorems.

**LLN:**

@@ -173,7 +189,9 @@ Key ideas that justify connecting probability theory with statistics are laws o
- For "frequentist" statisticians, **anticipated relative frequency** is **all** that a probability distribution means.

-- But for a Bayesian it means something more or different.
+- But for a Bayesian it means something else -- something partly subjective and personal.
+
+  * we say "partly" because a Bayesian also pays attention to relative frequencies

## Representing Probability Distributions

@@ -195,7 +213,7 @@ $$
F(x) = \int_{-\infty}^{x}f(t)dt
$$

-Here $B$ is a set of possible $X$'s whose probability we want to compute.
+Here $B$ is a set of possible $X$'s whose probability of occurring we want to compute.

When a probability density exists, a probability distribution can be characterized either by its CDF or by its density.

@@ -265,10 +283,12 @@ where $\theta $ is a vector of parameters that is of much smaller dimension than

**Remarks:**

+- A **statistical model** is a joint probability distribution characterized by a list of **parameters**.
- The concept of **parameter** is intimately related to the notion of **sufficient statistic**.
-- Sufficient statistics are nonlinear functions of a data set.
-- Sufficient statistics are designed to summarize all **information** about parameters that is contained in a data set.
-- They are important tools that AI uses to summarize a **big data** set
+- A **statistic** is a function, often nonlinear, of a data set.
+- **Sufficient statistics** summarize all **information** that a data set contains about the parameters of a statistical model.
+  * Note that a sufficient statistic corresponds to a particular statistical model.
+  * Sufficient statistics are key tools that AI uses to summarize or compress a **big data** set.
- R. A.
Fisher provided a rigorous definition of **information** -- see @@ -406,562 +426,151 @@ $$ =\frac{ \sum_{i}f_{ij} }{ \sum_{i}f_{ij}}=1 $$ -**Remark:** The mathematics of conditional probability implies **Bayes' Law**: +**Remark:** The mathematics of conditional probability implies: $$ \textrm{Prob}\{X=i|Y=j\} =\frac{\textrm{Prob}\{X=i,Y=j\}}{\textrm{Prob}\{Y=j\}}=\frac{\textrm{Prob}\{Y=j|X=i\}\textrm{Prob}\{X=i\}}{\textrm{Prob}\{Y=j\}} $$ -For the joint distribution {eq}`eq:example101discrete` - -$$ -\textrm{Prob}\{X=0|Y=1\} =\frac{ .1}{.1+.5}=\frac{.1}{.6} -$$ - -## Statistical Independence - -Random variables X and Y are statistically **independent** if - -$$ -\textrm{Prob}\{X=i,Y=j\}={f_ig_j} -$$ - -where - -$$ -\begin{aligned} -\textrm{Prob}\{X=i\} &=f_i\ge0, \sum{f_i}=1 \cr -\textrm{Prob}\{Y=j\} & =g_j\ge0, \sum{g_j}=1 -\end{aligned} -$$ - -Conditional distributions are - -$$ -\begin{aligned} -\textrm{Prob}\{X=i|Y=j\} & =\frac{f_ig_j}{\sum_{i}f_ig_j}=\frac{f_ig_j}{g_j}=f_i \\ -\textrm{Prob}\{Y=j|X=i\} & =\frac{f_ig_j}{\sum_{j}f_ig_j}=\frac{f_ig_j}{f_i}=g_j -\end{aligned} -$$ - - -## Means and Variances - -The mean and variance of a discrete random variable $X$ are - -$$ -\begin{aligned} -\mu_{X} & \equiv\mathbb{E}\left[X\right] -=\sum_{k}k \textrm{Prob}\{X=k\} \\ -\sigma_{X}^{2} & \equiv\mathbb{D}\left[X\right]=\sum_{k}\left(k-\mathbb{E}\left[X\right]\right)^{2}\textrm{Prob}\{X=k\} -\end{aligned} -$$ - -A continuous random variable having density $f_{X}(x)$) has mean and variance - -$$ -\begin{aligned} -\mu_{X} & \equiv\mathbb{E}\left[X\right]=\int_{-\infty}^{\infty}xf_{X}(x)dx \\ -\sigma_{X}^{2}\equiv\mathbb{D}\left[X\right] & =\mathrm{E}\left[\left(X-\mu_{X}\right)^{2}\right]=\int_{-\infty}^{\infty}\left(x-\mu_{X}\right)^{2}f_{X}(x)dx -\end{aligned} -$$ - - -## Generating Random Numbers - -Suppose we have at our disposal a pseudo random number that draws a uniform random variable, i.e., one with probability distribution - -$$ -\textrm{Prob}\{\tilde{X}=i\}=\frac{1}{I},\quad i=0,\ldots,I-1 -$$ - -How can we transform $\tilde{X}$ to get a random variable $X$ for which $\textrm{Prob}\{X=i\}=f_i,\quad i=0,\ldots,I-1$, - where $f_i$ is an arbitary discrete probability distribution on $i=0,1,\dots,I-1$? - -The key tool is the inverse of a cumulative distribution function (CDF). - -Observe that the CDF of a distribution is monotone and non-decreasing, taking values between $0$ and $1$. - -We can draw a sample of a random variable $X$ with a known CDF as follows: - -- draw a random variable $u$ from a uniform distribution on $[0,1]$ -- pass the sample value of $u$ into the **"inverse"** target CDF for $X$ -- $X$ has the target CDF - - -Thus, knowing the **"inverse"** CDF of a distribution is enough to simulate from this distribution. - ```{note} -The "inverse" CDF needs to exist for this method to work. +This can be interpreted as a version of what a Bayesian calls **Bayes' Law**. ``` -The inverse CDF is - -$$ -F^{-1}(u)\equiv\inf \{x\in \mathbb{R}: F(x) \geq u\} \quad(00$. - -Its density function is +For the joint distribution {eq}`eq:example101discrete` $$ -\quad f(x)=\lambda e^{-\lambda x} +\textrm{Prob}\{X=0|Y=1\} =\frac{ .1}{.1+.5}=\frac{.1}{.6} $$ -Its CDF is - -$$ -F(x)=\int_{0}^{\infty}\lambda e^{-\lambda x}=1-e^{-\lambda x} -$$ -Let $U$ follow a uniform distribution on $[0,1]$. +## Transition Probability Matrix -$X$ is a random variable such that $U=F(X)$. +Consider the following joint probability distribution of two random variables. 
-The distribution $X$ can be deduced from

+Let $X,Y$ be discrete random variables with joint distribution

$$
-\begin{aligned}
-U& =F(X)=1-e^{-\lambda X}\qquad\\
-\implies & \quad -U=e^{-\lambda X}\\
-\implies& \quad \log(1-U)=-\lambda X\\
-\implies & \quad X=\frac{(1-U)}{-\lambda}
-\end{aligned}
+\textrm{Prob}\{X=i,Y=j\} = \rho_{ij}
$$

-Let's draw $u$ from $U[0,1]$ and calculate $x=\frac{log(1-U)}{-\lambda}$.
-
-
-We'll check whether $X$ seems to follow a **continuous geometric** (exponential) distribution.
-
-Let's check with `numpy`.

+where $i = 0,\dots,I-1; j = 0,\dots,J-1$ and

-```{code-cell} ipython3
-n, λ = 1_000_000, 0.3
-
-# draw uniform numbers
-u = np.random.rand(n)
-
-# transform
-x = -np.log(1-u)/λ
-
-# draw geometric distributions
-x_g = np.random.exponential(1 / λ, n)
-
-# plot and compare
-plt.hist(x, bins=100, density=True)
-plt.show()
-```
-
-```{code-cell} ipython3
-plt.hist(x_g, bins=100, density=True, alpha=0.6)
-plt.show()
-```
-
-**Geometric distribution**
-
-Let $X$ distributed geometrically, that is

$$
-\begin{aligned}
-\textrm{Prob}(X=i) & =(1-\lambda)\lambda^i,\quad\lambda\in(0,1), \quad i=0,1,\dots \\
- & \sum_{i=0}^{\infty}\textrm{Prob}(X=i)=1\longleftrightarrow(1- \lambda)\sum_{i=0}^{\infty}\lambda^i=\frac{1-\lambda}{1-\lambda}=1
-\end{aligned}
+\sum_i\sum_j \rho_{ij} = 1, \quad \rho_{ij} \geqslant 0.
$$

-Its CDF is given by

+An associated conditional distribution is

$$
-\begin{aligned}
-\textrm{Prob}(X\le i)& =(1-\lambda)\sum_{j=0}^{i}\lambda^i\\
-& =(1-\lambda)[\frac{1-\lambda^{i+1}}{1-\lambda}]\\
-& =1-\lambda^{i+1}\\
-& =F(X)=F_i \quad
-\end{aligned}
+\textrm{Prob}\{X=i\vert Y=j\} = \frac{\rho_{ij}}{ \sum_{i}\rho_{ij}}
+= \frac{\textrm{Prob}\{X=i, Y=j\}}{\textrm{Prob}\{Y=j\}}
$$

-Again, let $\tilde{U}$ follow a uniform distribution and we want to find $X$ such that $F(X)=\tilde{U}$.
-
-Let's deduce the distribution of $X$ from

+We can define a transition probability matrix

$$
-\begin{aligned}
-\tilde{U} & =F(X)=1-\lambda^{x+1}\\
-1-\tilde{U} & =\lambda^{x+1}\\
-\log(1-\tilde{U})& =(x+1)\log\lambda\\
-\frac{\log(1-\tilde{U})}{\log\lambda}& =x+1\\
-\frac{\log(1-\tilde{U})}{\log\lambda}-1 &=x
-\end{aligned}
+p_{ij}=\textrm{Prob}\{Y=j|X=i\}= \frac{\rho_{ij}}{ \sum_{j}\rho_{ij}}
$$

-However, $\tilde{U}=F^{-1}(X)$ may not be an integer for any $x\geq0$.
-
-So let

+whose entries, in the case $I=J=2$, form the matrix

$$
-x=\lceil\frac{\log(1-\tilde{U})}{\log\lambda}-1\rceil
+\left[
+  \begin{matrix}
+    p_{00} & p_{01}\\
+    p_{10} & p_{11}
+  \end{matrix}
+\right]
$$

-where $\lceil . \rceil$ is the ceiling function.
-
-Thus $x$ is the smallest integer such that the discrete geometric CDF is greater than or equal to $\tilde{U}$.
-
-We can verify that $x$ is indeed geometrically distributed by the following `numpy` program.
-
-```{note}
-The exponential distribution is the continuous analog of geometric distribution.
-```

-```{code-cell} ipython3
-n, λ = 1_000_000, 0.8
-
-# draw uniform numbers
-u = np.random.rand(n)
-
-# transform
-x = np.ceil(np.log(1-u)/np.log(λ) - 1)
-
-# draw geometric distributions
-x_g = np.random.geometric(1-λ, n)
-
-# plot and compare
-plt.hist(x, bins=150, density=True)
-plt.show()
-```
-
-```{code-cell} ipython3
-np.random.geometric(1-λ, n).max()
-```
-
-```{code-cell} ipython3
-np.log(0.4)/np.log(0.3)
-```

+The first row is the probability that $Y=j, j=0,1$ conditional on $X=0$.

-```{code-cell} ipython3
-plt.hist(x_g, bins=150, density=True, alpha=0.6)
-plt.show()
-```

+The second row is the probability that $Y=j, j=0,1$ conditional on $X=1$.
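+
+To make this concrete, here is a minimal sketch that builds a transition matrix from a joint distribution; the numbers in `ρ` are made up purely for illustration.
+
+```{code-cell} ipython3
+# a hypothetical joint distribution ρ_ij, with rows indexing X and columns indexing Y
+ρ = np.array([[0.3, 0.2],
+              [0.1, 0.4]])
+
+# marginal distribution of X: sum over j within each row
+marginal_X = ρ.sum(axis=1)
+
+# transition matrix with entries p_ij = ρ_ij / Σ_j ρ_ij
+P = ρ / marginal_X[:, np.newaxis]
+
+print(P)
+print(P.sum(axis=1))  # each row of P sums to 1
+```
+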
-## Some Discrete Probability Distributions

+Note that
+- $\sum_{j}p_{ij}= \frac{ \sum_{j}\rho_{ij}}{ \sum_{j}\rho_{ij}}=1$, so each row of the transition matrix is a probability distribution (not so for each column).

-Let's write some Python code to compute means and variances of some univariate random variables.
-
-We'll use our code to

+## Application: Forecasting a Time Series

-- compute population means and variances from the probability distribution
-- generate a sample of $N$ independently and identically distributed draws and compute sample means and variances
-- compare population and sample means and variances

+Suppose that there are two time periods.

-## Geometric distribution

+- $t=0$ "today"
+- $t=1$ "tomorrow"

-$$
-\textrm{Prob}(X=k)=(1-p)^{k-1}p,k=1,2, \ldots
-$$

+Let $X(0)$ be a random variable to be realized at $t=0$, $X(1)$ be a random variable to be realized at $t=1$.

-$\implies$

+Suppose that

$$
\begin{aligned}
-\mathbb{E}(X) & =\frac{1}{p}\\\mathbb{D}(X) & =\frac{1-p}{p^2}
+\text{Prob} \{X(0)=i,X(1)=j\} &=f_{ij}\geq 0, \quad i=0,\cdots,I-1; \; j=0,\cdots,J-1\\
+\sum_{i}\sum_{j}f_{ij}&=1
\end{aligned}
$$

-We draw observations from the distribution and compare the sample mean and variance with the theoretical results.

+$f_{ij}$ is a joint distribution over $[X(0), X(1)]$.

-```{code-cell} ipython3
-# specify parameters
-p, n = 0.3, 1_000_000
-
-# draw observations from the distribution
-x = np.random.geometric(p, n)
-
-# compute sample mean and variance
-μ_hat = np.mean(x)
-σ2_hat = np.var(x)
-
-print("The sample mean is: ", μ_hat, "\nThe sample variance is: ", σ2_hat)
-
-# compare with theoretical results
-print("\nThe population mean is: ", 1/p)
-print("The population variance is: ", (1-p)/(p**2))
-```

+A conditional distribution is

-### Newcomb–Benford distribution

+$$\text{Prob} \{X(1)=j|X(0)=i\}= \frac{f_{ij}}{ \sum_{j}f_{ij}}$$

-The **Newcomb–Benford law** fits many data sets, e.g., reports of incomes to tax authorities, in which
-the leading digit is more likely to be small than large.

+**Remark:**
+- This formula is a workhorse for applied economic forecasters.

-See

+## Statistical Independence

-A Benford probability distribution is

+Random variables $X$ and $Y$ are statistically **independent** if

$$
-\textrm{Prob}\{X=d\}=\log _{10}(d+1)-\log _{10}(d)=\log _{10}\left(1+\frac{1}{d}\right)
+\textrm{Prob}\{X=i,Y=j\}={f_ig_j}
$$

-where $d\in\{1,2,\cdots,9\}$ can be thought of as a **first digit** in a sequence of digits.

+where

-This is a well defined discrete distribution since we can verify that probabilities are nonnegative and sum to $1$.

$$
\begin{aligned}
+\textrm{Prob}\{X=i\} &=f_i\ge0, \quad \sum_i{f_i}=1 \cr
+\textrm{Prob}\{Y=j\} & =g_j\ge0, \quad \sum_j{g_j}=1
\end{aligned}
$$

-$$
-\log_{10}\left(1+\frac{1}{d}\right)\geq0,\quad\sum_{d=1}^{9}\log_{10}\left(1+\frac{1}{d}\right)=1
-$$

+Conditional distributions are

-The mean and variance of a Benford distribution are

$$
\begin{aligned}
-\mathbb{E}\left[X\right] &=\sum_{d=1}^{9}d\log_{10}\left(1+\frac{1}{d}\right)\simeq3.4402 \\
-\mathbb{V}\left[X\right] & =\sum_{d=1}^{9}\left(d-\mathbb{E}\left[X\right]\right)^{2}\log_{10}\left(1+\frac{1}{d}\right)\simeq6.0565
+\textrm{Prob}\{X=i|Y=j\} & =\frac{f_ig_j}{\sum_{i}f_ig_j}=\frac{f_ig_j}{g_j}=f_i \\
+\textrm{Prob}\{Y=j|X=i\} & =\frac{f_ig_j}{\sum_{j}f_ig_j}=\frac{f_ig_j}{f_i}=g_j
\end{aligned}
$$

-We verify the above and compute the mean and variance using `numpy`.
- -```{code-cell} ipython3 -Benford_pmf = np.array([np.log10(1+1/d) for d in range(1,10)]) -k = np.array(range(1,10)) - -# mean -mean = np.sum(Benford_pmf * k) - -# variance -var = np.sum([(k-mean)**2 * Benford_pmf]) - -# verify sum to 1 -print(np.sum(Benford_pmf)) -print(mean) -print(var) -``` - -```{code-cell} ipython3 -# plot distribution -plt.plot(range(1,10), Benford_pmf, 'o') -plt.title('Benford\'s distribution') -plt.show() -``` - -### Pascal (negative binomial) distribution - -Consider a sequence of independent Bernoulli trials. - -Let $p$ be the probability of success. - -Let $X$ be a random variable that represents the number of failures before we get $r$ success. - -Its distribution is +where $$ \begin{aligned} -X & \sim NB(r,p) \\ -\textrm{Prob}(X=k;r,p) & = \begin{pmatrix}k+r-1 \\ r-1 \end{pmatrix}p^r(1-p)^{k} +\textrm{Prob}\{X=i\} &=f_i\ge0, \sum{f_i}=1 \cr +\textrm{Prob}\{Y=j\} & =g_j\ge0, \sum{g_j}=1 \end{aligned} $$ -Here, we choose from among $k+r-1$ possible outcomes because the last draw is by definition a success. - -We compute the mean and variance to be - +Conditional distributions are $$ \begin{aligned} -\mathbb{E}(X) & = \frac{k(1-p)}{p} \\ -\mathbb{V}(X) & = \frac{k(1-p)}{p^2} +\textrm{Prob}\{X=i|Y=j\} & =\frac{f_ig_j}{\sum_{i}f_ig_j}=\frac{f_ig_j}{g_j}=f_i \\ +\textrm{Prob}\{Y=j|X=i\} & =\frac{f_ig_j}{\sum_{j}f_ig_j}=\frac{f_ig_j}{f_i}=g_j \end{aligned} $$ -```{code-cell} ipython3 -# specify parameters -r, p, n = 10, 0.3, 1_000_000 - -# draw observations from the distribution -x = np.random.negative_binomial(r, p, n) - -# compute sample mean and variance -μ_hat = np.mean(x) -σ2_hat = np.var(x) - -print("The sample mean is: ", μ_hat, "\nThe sample variance is: ", σ2_hat) -print("\nThe population mean is: ", r*(1-p)/p) -print("The population variance is: ", r*(1-p)/p**2) -``` - -## Continuous Random Variables - -### Univariate Gaussian distribution -We write - -$$ -X \sim N(\mu,\sigma^2) -$$ - -to indicate the probability distribution - -$$f(x|u,\sigma^2)=\frac{1}{\sqrt{2\pi \sigma^2}}e^{[-\frac{1}{2\sigma^2}(x-u)^2]} $$ - -In the below example, we set $\mu = 0, \sigma = 0.1$. 
- -```{code-cell} ipython3 -# specify parameters -μ, σ = 0, 0.1 - -# specify number of draws -n = 1_000_000 - -# draw observations from the distribution -x = np.random.normal(μ, σ, n) - -# compute sample mean and variance -μ_hat = np.mean(x) -σ_hat = np.std(x) - -print("The sample mean is: ", μ_hat) -print("The sample standard deviation is: ", σ_hat) -``` - -```{code-cell} ipython3 -# compare -print(μ-μ_hat < 1e-3) -print(σ-σ_hat < 1e-3) -``` - -### Uniform Distribution - -$$ -\begin{aligned} -X & \sim U[a,b] \\ -f(x)& = \begin{cases} \frac{1}{b-a}, & a \leq x \leq b \\ \quad0, & \text{otherwise} \end{cases} -\end{aligned} -$$ +## Means and Variances -The population mean and variance are +The mean and variance of a discrete random variable $X$ are $$ \begin{aligned} -\mathbb{E}(X) & = \frac{a+b}{2} \\ -\mathbb{V}(X) & = \frac{(b-a)^2}{12} +\mu_{X} & \equiv\mathbb{E}\left[X\right] +=\sum_{k}k \textrm{Prob}\{X=k\} \\ +\sigma_{X}^{2} & \equiv\mathbb{D}\left[X\right]=\sum_{k}\left(k-\mathbb{E}\left[X\right]\right)^{2}\textrm{Prob}\{X=k\} \end{aligned} $$ -```{code-cell} ipython3 -# specify parameters -a, b = 10, 20 - -# specify number of draws -n = 1_000_000 - -# draw observations from the distribution -x = a + (b-a)*np.random.rand(n) - -# compute sample mean and variance -μ_hat = np.mean(x) -σ2_hat = np.var(x) - -print("The sample mean is: ", μ_hat, "\nThe sample variance is: ", σ2_hat) -print("\nThe population mean is: ", (a+b)/2) -print("The population variance is: ", (b-a)**2/12) -``` - -## A Mixed Discrete-Continuous Distribution - -We'll motivate this example with a little story. - - -Suppose that to apply for a job you take an interview and either pass or fail it. - -You have $5\%$ chance to pass an interview and you know your salary will uniformly distributed in the interval 300~400 a day only if you pass. - -We can describe your daily salary as a discrete-continuous variable with the following probabilities: - -$$ -P(X=0)=0.95 -$$ - -$$ -P(300\le X \le 400)=\int_{300}^{400} f(x)\, dx=0.05 -$$ - -$$ -f(x) = 0.0005 -$$ - -Let's start by generating a random sample and computing sample moments. 
- -```{code-cell} ipython3 -x = np.random.rand(1_000_000) -# x[x > 0.95] = 100*x[x > 0.95]+300 -x[x > 0.95] = 100*np.random.rand(len(x[x > 0.95]))+300 -x[x <= 0.95] = 0 - -μ_hat = np.mean(x) -σ2_hat = np.var(x) - -print("The sample mean is: ", μ_hat, "\nThe sample variance is: ", σ2_hat) -``` - -The analytical mean and variance can be computed: - -$$ -\begin{aligned} -\mu &= \int_{300}^{400}xf(x)dx \\ -&= 0.0005\int_{300}^{400}xdx \\ -&= 0.0005 \times \frac{1}{2}x^2\bigg|_{300}^{400} -\end{aligned} -$$ +A continuous random variable having density $f_{X}(x)$) has mean and variance $$ \begin{aligned} -\sigma^2 &= 0.95\times(0-17.5)^2+\int_{300}^{400}(x-17.5)^2f(x)dx \\ -&= 0.95\times17.5^2+0.0005\int_{300}^{400}(x-17.5)^2dx \\ -&= 0.95\times17.5^2+0.0005 \times \frac{1}{3}(x-17.5)^3 \bigg|_{300}^{400} +\mu_{X} & \equiv\mathbb{E}\left[X\right]=\int_{-\infty}^{\infty}xf_{X}(x)dx \\ +\sigma_{X}^{2}\equiv\mathbb{D}\left[X\right] & =\mathrm{E}\left[\left(X-\mu_{X}\right)^{2}\right]=\int_{-\infty}^{\infty}\left(x-\mu_{X}\right)^{2}f_{X}(x)dx \end{aligned} $$ -```{code-cell} ipython3 -mean = 0.0005*0.5*(400**2 - 300**2) -var = 0.95*17.5**2+0.0005/3*((400-17.5)**3-(300-17.5)**3) -print("mean: ", mean) -print("variance: ", var) -``` - -## Matrix Representation of Some Bivariate Distributions +## Matrix Representations of Some Bivariate Distributions Let's use matrices to represent a joint distribution, conditional distribution, marginal distribution, and the mean and variance of a bivariate random variable. @@ -1497,53 +1106,6 @@ $$ where $ f_{X}*g_{Y} $ denotes the **convolution** of the $f_X$ and $g_Y$ functions. -## Transition Probability Matrix - -Consider the following joint probability distribution of two random variables. - -Let $X,Y$ be discrete random variables with joint distribution - -$$ -\textrm{Prob}\{X=i,Y=j\} = \rho_{ij} -$$ - -where $i = 0,\dots,I-1; j = 0,\dots,J-1$ and - -$$ -\sum_i\sum_j \rho_{ij} = 1, \quad \rho_{ij} \geqslant 0. -$$ - -An associated conditional distribution is - -$$ -\textrm{Prob}\{Y=i\vert X=j\} = \frac{\rho_{ij}}{ \sum_{i}\rho_{ij}} -= \frac{\textrm{Prob}\{Y=j, X=i\}}{\textrm{Prob}\{ X=i\}} -$$ - -We can define a transition probability matrix - -$$ -p_{ij}=\textrm{Prob}\{Y=j|X=i\}= \frac{\rho_{ij}}{ \sum_{j}\rho_{ij}} -$$ - -where - -$$ -\left[ - \begin{matrix} - p_{11} & p_{12}\\ - p_{21} & p_{22} - \end{matrix} -\right] -$$ - -The first row is the probability of $Y=j, j=0,1$ conditional on $X=0$. - -The second row is the probability of $Y=j, j=0,1$ conditional on $X=1$. - -Note that -- $\sum_{j}\rho_{ij}= \frac{ \sum_{j}\rho_{ij}}{ \sum_{j}\rho_{ij}}=1$, so each row of $\rho$ is a probability distribution (not so for each column. - ## Coupling Start with a joint distribution @@ -1912,30 +1474,3 @@ print(c2_ymtb) We have verified that both joint distributions, $c_1$ and $c_2$, have identical marginal distributions of $X$ and $Y$, respectively. So they are both couplings of $X$ and $Y$. - -## Time Series - -Suppose that there are two time periods. - -- $t=0$ "today" -- $t=1$ "tomorrow" - -Let $X(0)$ be a random variable to be realized at $t=0$, $X(1)$ be a random variable to be realized at $t=1$. - -Suppose that - -$$ -\begin{aligned} -\text{Prob} \{X(0)=i,X(1)=j\} &=f_{ij}≥0,i=0,\cdots,I-1\\ -\sum_{i}\sum_{j}f_{ij}&=1 -\end{aligned} -$$ - -$f_{ij} $ is a joint distribution over $[X(0), X(1)]$. 
- -A conditional distribution is - -$$\text{Prob} \{X(1)=j|X(0)=i\}= \frac{f_{ij}}{ \sum_{j}f_{ij}}$$ - -**Remark:** -- This is a key formula for a theory of optimally predicting a time series. diff --git a/lectures/stats_examples.md b/lectures/stats_examples.md new file mode 100644 index 000000000..922538628 --- /dev/null +++ b/lectures/stats_examples.md @@ -0,0 +1,540 @@ +--- +jupytext: + text_representation: + extension: .myst + format_name: myst + format_version: 0.13 + jupytext_version: 1.13.8 +kernelspec: + display_name: Python 3 (ipykernel) + language: python + name: python3 +--- + +# Some Probability Distributions + +This lecture is a supplement to {doc}`this lecture on statistics with matrices `. + +It describes some popular distributions and uses Python to sample from them. + +It also describes a way to sample from an arbitrary probability distribution that you make up by +transforming a sample from a uniform probability distribution. + + +In addition to what's in Anaconda, this lecture will need the following libraries: + +```{code-cell} ipython3 +--- +tags: [hide-output] +--- +!pip install prettytable +``` + +As usual, we'll start with some imports + +```{code-cell} ipython3 +import numpy as np +import matplotlib.pyplot as plt +import prettytable as pt +from mpl_toolkits.mplot3d import Axes3D +from matplotlib_inline.backend_inline import set_matplotlib_formats +set_matplotlib_formats('retina') +``` + + +## Some Discrete Probability Distributions + + +Let's write some Python code to compute means and variances of some univariate random variables. + +We'll use our code to + +- compute population means and variances from the probability distribution +- generate a sample of $N$ independently and identically distributed draws and compute sample means and variances +- compare population and sample means and variances + +## Geometric distribution + +$$ +\textrm{Prob}(X=k)=(1-p)^{k-1}p,k=1,2, \ldots +$$ + +$\implies$ + +$$ +\begin{aligned} +\mathbb{E}(X) & =\frac{1}{p}\\\mathbb{D}(X) & =\frac{1-p}{p^2} +\end{aligned} +$$ + +We draw observations from the distribution and compare the sample mean and variance with the theoretical results. + +```{code-cell} ipython3 +# specify parameters +p, n = 0.3, 1_000_000 + +# draw observations from the distribution +x = np.random.geometric(p, n) + +# compute sample mean and variance +μ_hat = np.mean(x) +σ2_hat = np.var(x) + +print("The sample mean is: ", μ_hat, "\nThe sample variance is: ", σ2_hat) + +# compare with theoretical results +print("\nThe population mean is: ", 1/p) +print("The population variance is: ", (1-p)/(p**2)) +``` + + +## Pascal (negative binomial) distribution + +Consider a sequence of independent Bernoulli trials. + +Let $p$ be the probability of success. + +Let $X$ be a random variable that represents the number of failures before we get $r$ success. + +Its distribution is + +$$ +\begin{aligned} +X & \sim NB(r,p) \\ +\textrm{Prob}(X=k;r,p) & = \begin{pmatrix}k+r-1 \\ r-1 \end{pmatrix}p^r(1-p)^{k} +\end{aligned} +$$ + +Here, we choose from among $k+r-1$ possible outcomes because the last draw is by definition a success. 
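+
+As a quick sanity check on this formula, the sketch below compares it with `scipy.stats.nbinom` (SciPy ships with Anaconda), which uses the same "number of failures before the $r$-th success" convention; the particular values of $r$, $p$, and $k$ here are made up for illustration.
+
+```{code-cell} ipython3
+from math import comb
+from scipy.stats import nbinom
+
+r, p = 10, 0.3   # hypothetical parameter values
+
+for k in range(5):
+    # pmf from the formula above: C(k+r-1, r-1) * p^r * (1-p)^k
+    pmf_formula = comb(k + r - 1, r - 1) * p**r * (1 - p)**k
+    print(k, pmf_formula, nbinom.pmf(k, r, p))
+```
+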
+We compute the mean and variance to be
+
+$$
+\begin{aligned}
+\mathbb{E}(X) & = \frac{r(1-p)}{p} \\
+\mathbb{V}(X) & = \frac{r(1-p)}{p^2}
+\end{aligned}
+$$
+
+```{code-cell} ipython3
+# specify parameters
+r, p, n = 10, 0.3, 1_000_000
+
+# draw observations from the distribution
+x = np.random.negative_binomial(r, p, n)
+
+# compute sample mean and variance
+μ_hat = np.mean(x)
+σ2_hat = np.var(x)
+
+print("The sample mean is: ", μ_hat, "\nThe sample variance is: ", σ2_hat)
+print("\nThe population mean is: ", r*(1-p)/p)
+print("The population variance is: ", r*(1-p)/p**2)
+```
+
+## Newcomb–Benford distribution
+
+The **Newcomb–Benford law** fits many data sets, e.g., reports of incomes to tax authorities, in which
+the leading digit is more likely to be small than large.
+
+See <https://en.wikipedia.org/wiki/Benford%27s_law>
+
+A Benford probability distribution is
+
+$$
+\textrm{Prob}\{X=d\}=\log _{10}(d+1)-\log _{10}(d)=\log _{10}\left(1+\frac{1}{d}\right)
+$$
+
+where $d\in\{1,2,\cdots,9\}$ can be thought of as a **first digit** in a sequence of digits.
+
+This is a well-defined discrete distribution since we can verify that probabilities are nonnegative and sum to $1$.
+
+$$
+\log_{10}\left(1+\frac{1}{d}\right)\geq0,\quad\sum_{d=1}^{9}\log_{10}\left(1+\frac{1}{d}\right)=1
+$$
+
+The mean and variance of a Benford distribution are
+
+$$
+\begin{aligned}
+\mathbb{E}\left[X\right] &=\sum_{d=1}^{9}d\log_{10}\left(1+\frac{1}{d}\right)\simeq3.4402 \\
+\mathbb{V}\left[X\right] & =\sum_{d=1}^{9}\left(d-\mathbb{E}\left[X\right]\right)^{2}\log_{10}\left(1+\frac{1}{d}\right)\simeq6.0565
+\end{aligned}
+$$
+
+We verify the above and compute the mean and variance using `numpy`.
+
+```{code-cell} ipython3
+Benford_pmf = np.array([np.log10(1+1/d) for d in range(1,10)])
+k = np.array(range(1,10))
+
+# mean
+mean = np.sum(Benford_pmf * k)
+
+# variance
+var = np.sum([(k-mean)**2 * Benford_pmf])
+
+# verify that the probabilities sum to 1
+print(np.sum(Benford_pmf))
+print(mean)
+print(var)
+```
+
+```{code-cell} ipython3
+# plot the distribution
+plt.plot(range(1,10), Benford_pmf, 'o')
+plt.title('Benford\'s distribution')
+plt.show()
+```
+
+Now let's turn to some continuous random variables.
+
+## Univariate Gaussian distribution
+
+We write
+
+$$
+X \sim N(\mu,\sigma^2)
+$$
+
+to indicate the probability distribution
+
+$$f(x|\mu,\sigma^2)=\frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{1}{2\sigma^2}(x-\mu)^2} $$
+
+In the example below, we set $\mu = 0, \sigma = 0.1$.
+ +```{code-cell} ipython3 +# specify parameters +μ, σ = 0, 0.1 + +# specify number of draws +n = 1_000_000 + +# draw observations from the distribution +x = np.random.normal(μ, σ, n) + +# compute sample mean and variance +μ_hat = np.mean(x) +σ_hat = np.std(x) + +print("The sample mean is: ", μ_hat) +print("The sample standard deviation is: ", σ_hat) +``` + +```{code-cell} ipython3 +# compare +print(μ-μ_hat < 1e-3) +print(σ-σ_hat < 1e-3) +``` + +## Uniform Distribution + +$$ +\begin{aligned} +X & \sim U[a,b] \\ +f(x)& = \begin{cases} \frac{1}{b-a}, & a \leq x \leq b \\ \quad0, & \text{otherwise} \end{cases} +\end{aligned} +$$ + +The population mean and variance are + +$$ +\begin{aligned} +\mathbb{E}(X) & = \frac{a+b}{2} \\ +\mathbb{V}(X) & = \frac{(b-a)^2}{12} +\end{aligned} +$$ + +```{code-cell} ipython3 +# specify parameters +a, b = 10, 20 + +# specify number of draws +n = 1_000_000 + +# draw observations from the distribution +x = a + (b-a)*np.random.rand(n) + +# compute sample mean and variance +μ_hat = np.mean(x) +σ2_hat = np.var(x) + +print("The sample mean is: ", μ_hat, "\nThe sample variance is: ", σ2_hat) +print("\nThe population mean is: ", (a+b)/2) +print("The population variance is: ", (b-a)**2/12) +``` + +## A Mixed Discrete-Continuous Distribution + +We'll motivate this example with a little story. + + +Suppose that to apply for a job you take an interview and either pass or fail it. + +You have $5\%$ chance to pass an interview and you know your salary will uniformly distributed in the interval 300~400 a day only if you pass. + +We can describe your daily salary as a discrete-continuous variable with the following probabilities: + +$$ +P(X=0)=0.95 +$$ + +$$ +P(300\le X \le 400)=\int_{300}^{400} f(x)\, dx=0.05 +$$ + +$$ +f(x) = 0.0005 +$$ + +Let's start by generating a random sample and computing sample moments. + +```{code-cell} ipython3 +x = np.random.rand(1_000_000) +# x[x > 0.95] = 100*x[x > 0.95]+300 +x[x > 0.95] = 100*np.random.rand(len(x[x > 0.95]))+300 +x[x <= 0.95] = 0 + +μ_hat = np.mean(x) +σ2_hat = np.var(x) + +print("The sample mean is: ", μ_hat, "\nThe sample variance is: ", σ2_hat) +``` + +The analytical mean and variance can be computed: + +$$ +\begin{aligned} +\mu &= \int_{300}^{400}xf(x)dx \\ +&= 0.0005\int_{300}^{400}xdx \\ +&= 0.0005 \times \frac{1}{2}x^2\bigg|_{300}^{400} +\end{aligned} +$$ + +$$ +\begin{aligned} +\sigma^2 &= 0.95\times(0-17.5)^2+\int_{300}^{400}(x-17.5)^2f(x)dx \\ +&= 0.95\times17.5^2+0.0005\int_{300}^{400}(x-17.5)^2dx \\ +&= 0.95\times17.5^2+0.0005 \times \frac{1}{3}(x-17.5)^3 \bigg|_{300}^{400} +\end{aligned} +$$ + +```{code-cell} ipython3 +mean = 0.0005*0.5*(400**2 - 300**2) +var = 0.95*17.5**2+0.0005/3*((400-17.5)**3-(300-17.5)**3) +print("mean: ", mean) +print("variance: ", var) +``` + + +## Drawing a Random Number from a Particular Distribution + +Suppose we have at our disposal a pseudo random number that draws a uniform random variable, i.e., one with probability distribution + +$$ +\textrm{Prob}\{\tilde{X}=i\}=\frac{1}{I},\quad i=0,\ldots,I-1 +$$ + +How can we transform $\tilde{X}$ to get a random variable $X$ for which $\textrm{Prob}\{X=i\}=f_i,\quad i=0,\ldots,I-1$, + where $f_i$ is an arbitary discrete probability distribution on $i=0,1,\dots,I-1$? + +The key tool is the inverse of a cumulative distribution function (CDF). + +Observe that the CDF of a distribution is monotone and non-decreasing, taking values between $0$ and $1$. 
+We can draw a sample of a random variable $X$ with a known CDF as follows:
+
+- draw a random variable $u$ from a uniform distribution on $[0,1]$
+- pass the sample value of $u$ into the **"inverse"** target CDF for $X$
+- $X$ has the target CDF
+
+Thus, knowing the **"inverse"** CDF of a distribution is enough to simulate from this distribution.
+
+```{note}
+The "inverse" CDF needs to exist for this method to work.
+```
+
+The inverse CDF is
+
+$$
+F^{-1}(u)\equiv\inf \{x\in \mathbb{R}: F(x) \geq u\} \quad (0<u<1)
+$$
+
+**Exponential distribution**
+
+Let $X$ follow an exponential distribution with parameter $\lambda > 0$.
+
+Its density function is
+
+$$
+f(x)=\lambda e^{-\lambda x}
+$$
+
+Its CDF is
+
+$$
+F(x)=\int_{0}^{x}\lambda e^{-\lambda t}dt=1-e^{-\lambda x}
+$$
+
+Let $U$ follow a uniform distribution on $[0,1]$.
+
+$X$ is a random variable such that $U=F(X)$.
+
+The distribution of $X$ can be deduced from
+
+$$
+\begin{aligned}
+U& =F(X)=1-e^{-\lambda X}\\
+\implies & \quad 1-U=e^{-\lambda X}\\
+\implies& \quad \log(1-U)=-\lambda X\\
+\implies & \quad X=\frac{\log(1-U)}{-\lambda}
+\end{aligned}
+$$
+
+Let's draw $u$ from $U[0,1]$ and calculate $x=\frac{\log(1-u)}{-\lambda}$.
+
+We'll check whether $X$ seems to follow a **continuous geometric** (exponential) distribution.
+
+Let's check with `numpy`.
+
+```{code-cell} ipython3
+n, λ = 1_000_000, 0.3
+
+# draw uniform numbers
+u = np.random.rand(n)
+
+# transform
+x = -np.log(1-u)/λ
+
+# draw exponential samples directly, for comparison
+x_g = np.random.exponential(1 / λ, n)
+
+# plot and compare
+plt.hist(x, bins=100, density=True)
+plt.show()
+```
+
+```{code-cell} ipython3
+plt.hist(x_g, bins=100, density=True, alpha=0.6)
+plt.show()
+```
+
+**Geometric distribution**
+
+Let $X$ be distributed geometrically, that is
+
+$$
+\begin{aligned}
+\textrm{Prob}(X=i) & =(1-\lambda)\lambda^i,\quad\lambda\in(0,1), \quad i=0,1,\dots \\
+ & \sum_{i=0}^{\infty}\textrm{Prob}(X=i)=1\longleftrightarrow(1- \lambda)\sum_{i=0}^{\infty}\lambda^i=\frac{1-\lambda}{1-\lambda}=1
+\end{aligned}
+$$
+
+Its CDF is given by
+
+$$
+\begin{aligned}
+\textrm{Prob}(X\le i)& =(1-\lambda)\sum_{j=0}^{i}\lambda^j\\
+& =(1-\lambda)\left[\frac{1-\lambda^{i+1}}{1-\lambda}\right]\\
+& =1-\lambda^{i+1}\\
+& =F(i)\equiv F_i
+\end{aligned}
+$$
+
+Again, let $\tilde{U}$ follow a uniform distribution and we want to find $x$ such that $F(x)=\tilde{U}$.
+
+Let's deduce the distribution of $X$ from
+
+$$
+\begin{aligned}
+\tilde{U} & =F(X)=1-\lambda^{x+1}\\
+1-\tilde{U} & =\lambda^{x+1}\\
+\log(1-\tilde{U})& =(x+1)\log\lambda\\
+\frac{\log(1-\tilde{U})}{\log\lambda}& =x+1\\
+\frac{\log(1-\tilde{U})}{\log\lambda}-1 &=x
+\end{aligned}
+$$
+
+However, the $x$ that solves $F(x)=\tilde{U}$ may not be an integer.
+
+So we let
+
+$$
+x=\left\lceil\frac{\log(1-\tilde{U})}{\log\lambda}-1\right\rceil
+$$
+
+where $\lceil . \rceil$ is the ceiling function.
+
+Thus $x$ is the smallest integer such that the discrete geometric CDF is greater than or equal to $\tilde{U}$.
+
+We can verify that $x$ is indeed geometrically distributed by the following `numpy` program.
+
+```{note}
+The exponential distribution is the continuous analog of the geometric distribution.
+```
+
+```{code-cell} ipython3
+n, λ = 1_000_000, 0.8
+
+# draw uniform numbers
+u = np.random.rand(n)
+
+# transform
+x = np.ceil(np.log(1-u)/np.log(λ) - 1)
+
+# draw geometric samples directly, for comparison
+x_g = np.random.geometric(1-λ, n)
+
+# plot and compare
+plt.hist(x, bins=150, density=True)
+plt.show()
+```
+
+```{code-cell} ipython3
+plt.hist(x_g, bins=150, density=True, alpha=0.6)
+plt.show()
+```
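+
+Finally, let's return to the question posed at the start of this section: how to sample from an **arbitrary** discrete distribution $f_0, \dots, f_{I-1}$ that has no convenient closed-form inverse CDF.
+
+The minimal sketch below (the distribution `f` is made up for illustration) applies the inverse-CDF recipe numerically: draw $u$ uniform on $[0,1]$ and return the smallest $i$ whose cumulative probability $F_i$ is at least $u$.
+
+```{code-cell} ipython3
+f = np.array([0.1, 0.2, 0.4, 0.3])   # a hypothetical discrete distribution
+
+# cumulative distribution function F_i = f_0 + ... + f_i
+F = np.cumsum(f)
+
+# inverse-CDF sampling: smallest i with F_i >= u
+u = np.random.rand(1_000_000)
+x = np.searchsorted(F, u)
+
+# compare empirical frequencies with f
+print(np.bincount(x) / len(x))
+```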