
Commit b0dd401

finish estimates post
1 parent ef6a059 commit b0dd401

10 files changed: +362 -43 lines

_pages/category-intro-data-science.md

Lines changed: 1 addition & 1 deletion
@@ -5,4 +5,4 @@ permalink: /intro-data-science/
 taxonomy: Intro-Data-Science
 ---
 
-Sample post listing for the category.
+Series of posts introducing data science.

_pages/category-prob-mod.md

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
---
title: Probabilistic Models
layout: category
permalink: /probabilistic-models/
taxonomy: Probabilistic-Models
---

Series of posts surrounding probabilistic models.
File renamed without changes.

_pages/tag-estimation.md

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
---
title: Estimation
layout: tag
permalink: /tags/estimation/
taxonomy: estimation
---

_pages/tag-regression.md

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
---
title: Regression
layout: tag
permalink: /tags/regression/
taxonomy: regression
---

_posts/2025-02-20-intro.md

Lines changed: 1 addition & 0 deletions
@@ -7,6 +7,7 @@ categories:
 tags:
 - regression
 - classification
+- estimation
 toc: true
 toc_label: "Table of Contents"
 # toc_icon:
Lines changed: 0 additions & 2 deletions
@@ -14,8 +14,6 @@ toc_label: "Table of Contents"
 # teaser:
 ---
 
-# Single Variable Linear Regression
-
 ## Linear Model and Residual Assumption
 
 Single variable linear regression models a response variable $y$ as a linear function of a regressor variable $x$ plus a random component.

_posts/2025-03-01-reg1.md

Lines changed: 0 additions & 40 deletions
This file was deleted.

_posts/2025-03-05-estimates.md

Lines changed: 265 additions & 0 deletions
@@ -0,0 +1,265 @@
---
layout: single
title: "Statistical Estimation"
excerpt: "How to estimate a parameter of a probability distribution"
categories:
- Intro-Data-Science
tags:
- estimation
toc: true
toc_label: "Table of Contents"
# toc_icon:
# header:
# image:
# teaser:
---

# What is an estimate?

An **estimate** (or estimator) predicts a parameter of a probability distribution from sample data. This post discusses point estimates, which predict a single value for the parameter.
Interval estimates (or confidence intervals) instead provide a range of values that contains the parameter with a set level of confidence.

## Bias

The **bias** of an estimate is the expected difference between the estimate and the true parameter value. An estimator is **unbiased** if its bias is zero.

$$Bias(\hat{\theta}) = \mathbb{E}[\hat{\theta} - \theta] = \mathbb{E}[\hat{\theta}] - \theta
$$

## Example

$X_1, X_2, ..., X_n$ are i.i.d. (independent and identically distributed) random variables with mean $\mu$. Given a sample dataset $x_1, ..., x_n$, one estimate for $\mu$ is the sample mean.

$$\hat{\mu}_1 = \frac{1}{n} \sum_{i=1}^n x_i
$$

Another estimate is a weighted average that gives sample $i$ weight $i$ (a time-weighted average if the samples arrive in order).

$$\hat{\mu}_2 = \frac{1\cdot x_1 + 2\cdot x_2 + ... + n \cdot x_n}{1+2+...+n} = \frac{\sum_{i=1}^n i \cdot x_i}{\sum_{i=1}^n i} = \frac{\sum_{i=1}^n i \cdot x_i}{n(n+1)/2}
$$

Both are unbiased estimates.

$$Bias(\hat{\mu}_1) = \mathbb{E}[\frac{1}{n} \sum_{i=1}^n x_i] - \mu = \frac{1}{n} \sum_{i=1}^n \mathbb{E}[x_i] - \mu = \frac{1}{n} n \mu - \mu = 0
$$

$$\begin{aligned}
Bias(\hat{\mu}_2) & = \mathbb{E}[\frac{\sum_{i=1}^n i \cdot x_i}{\sum_{i=1}^n i}] - \mu = \frac{\sum_{i=1}^n i \cdot \mathbb{E}[x_i]}{\sum_{i=1}^n i} - \mu
\\
& = \frac{\sum_{i=1}^n i \cdot \mu}{\sum_{i=1}^n i} - \mu = \frac{\mu \sum_{i=1}^n i}{\sum_{i=1}^n i} - \mu = \mu - \mu = 0
\end{aligned}
$$
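
A quick simulation sketch (assuming NumPy, with an arbitrary choice of normal data and $\mu = 5$) illustrates that both estimators are centered on $\mu$, although the weighted average has higher variance.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n, trials = 5.0, 20, 100_000

# Weights 1, 2, ..., n for the weighted-average estimator
w = np.arange(1, n + 1)

est1 = np.empty(trials)
est2 = np.empty(trials)
for t in range(trials):
    x = rng.normal(loc=mu, scale=2.0, size=n)   # i.i.d. sample with mean mu
    est1[t] = x.mean()                          # mu_hat_1: sample mean
    est2[t] = (w * x).sum() / w.sum()           # mu_hat_2: weighted average

print("empirical bias of mu_hat_1:", est1.mean() - mu)  # ~0
print("empirical bias of mu_hat_2:", est2.mean() - mu)  # ~0
print("variance of mu_hat_1:", est1.var())  # smaller
print("variance of mu_hat_2:", est2.var())  # larger, though still unbiased
```
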
# Method of Moments Estimation

Method of moments is a simple approach that produces intuitive estimates of the parameters.
It matches the $k$-th population moment, calculated from the pmf/pdf, to the $k$-th sample moment, calculated from the sample.
The $k$-th population moment of a random variable $X$ is $$\mathbb{E}[X^k]$$. The $k$-th sample moment is the mean observed value of $x^k$ in a sample, $$\frac{1}{n}\sum_{i=1}^{n} x_i^k$$.

The method of moments estimates the parameters as the values that make the population moments equal to the sample moments. To estimate $k$ parameters, the first $k$ moments are used.

$$
\begin{cases}
\mathbb{E}[X] = \frac{1}{n} \sum_{i=1}^{n} x_i \\
\mathbb{E}[X^2] = \frac{1}{n} \sum_{i=1}^{n} x_i^2 \\
\vdots \\
\mathbb{E}[X^k] = \frac{1}{n} \sum_{i=1}^{n} x_i^k
\end{cases}
$$

Shorthand notation for the $k$-th sample moment is $$\bar{X^k}$$.
Method of moments depends on expressing the population moments as functions of the parameters, which may be impossible if the pdf has no closed-form integral.

## Exponential

To estimate a single parameter, only the first moment (the mean) is needed.
For example, the method of moments estimate for $\lambda$ when $$X\sim \text{Exponential}(\lambda)$$ uses the mean $$\mathbb{E}[X]=1/\lambda$$ to produce the estimate:

$$\frac{1}{\hat{\lambda}}=\frac{1}{n} \sum_{i=1}^{n} x_i $$

$$\hat{\lambda}=\frac{n}{\sum_{i=1}^{n} x_i} $$
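
A minimal numerical sketch of this estimate, assuming NumPy and an arbitrary true rate of $\lambda = 2.5$:

```python
import numpy as np

rng = np.random.default_rng(1)
true_lambda = 2.5
x = rng.exponential(scale=1 / true_lambda, size=10_000)  # scale = 1/lambda

lambda_hat = len(x) / x.sum()   # method of moments: n / sum(x_i) = 1 / sample mean
print(lambda_hat)               # close to 2.5
```
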
## Normal
Method of moments can quickly estimate the parameters $$\mu, \sigma^2$$ of the Normal distribution.
$\mu$ is the mean of the distribution for a random variable $X$, $$\mathbb{E}[X]$$, and $$\sigma^2$$ is the variance of the distribution, $\text{Var}(X)$.
These parameters could still be estimated for a non-normal random variable; they would still represent the mean and variance, but would not appear directly in the pmf/pdf.

Recall the computation of variance

$$\text{Var}(X)=\mathbb{E}[X^2]-(\mathbb{E}[X])^2$$

Then the second moment $$\mathbb{E}[X^2]$$ can be written in terms of the parameters

$$\mathbb{E}[X^2] = \text{Var}(X) + (\mathbb{E}[X])^2 = \sigma^2 + \mu^2$$

The method of moments system is

$$
\begin{cases}
\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i \\
\hat{\sigma}^2 + \hat{\mu}^2 = \frac{1}{n} \sum_{i=1}^{n} x_i^2
\end{cases}
$$

In the first equation, the true population mean $\mu$ is estimated as the sample mean $\hat{\mu}=\bar{X}$. Substituting the first equation into the second gives

$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - (\frac{1}{n} \sum_{i=1}^{n} x_i) ^ 2 $$

This means that the method of moments estimate for $\sigma^2$ is a biased version of the sample variance, since

$$ \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{X})^2 $$

The unbiased estimate of the true population variance is the sample variance $$S^2$$, which divides by $n-1$ instead of $n$.

$$ S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{X})^2, \qquad \mathbb{E}[S^2] = \sigma^2 $$

The method of moments estimate instead satisfies

$$ \mathbb{E}[\hat{\sigma}^2] = \frac{n-1}{n} \sigma^2 $$

so its bias decreases with $n$, and for large $n$ it is close to unbiased.

$$Bias(\hat{\sigma}^2) = \mathbb{E}[\hat{\sigma}^2 - \sigma^2] = (\frac{n-1}{n} - 1) \sigma^2 = -\frac{\sigma^2}{n} $$
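
A short simulation, assuming NumPy and arbitrary values for $\sigma^2$ and $n$, makes the bias visible; `ddof` selects the $n$ or $n-1$ denominator in `np.var`.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2, n, trials = 0.0, 4.0, 10, 200_000

mom = np.empty(trials)   # 1/n denominator (method of moments)
s2 = np.empty(trials)    # 1/(n-1) denominator (sample variance)
for t in range(trials):
    x = rng.normal(mu, np.sqrt(sigma2), size=n)
    mom[t] = np.var(x, ddof=0)
    s2[t] = np.var(x, ddof=1)

print("E[sigma_hat^2] ~", mom.mean(), "vs (n-1)/n * sigma^2 =", (n - 1) / n * sigma2)
print("E[S^2]         ~", s2.mean(), "vs sigma^2 =", sigma2)
```
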
# Maximum Likelihood Estimation
**Maximum Likelihood Estimation (MLE)** fits a [data model]({% link _posts/2025-02-20-intro.md %}) by choosing the parameter estimates that maximize the likelihood of the data sample given the parameters.

$$\hat{\theta} = \arg \max_\theta L(\theta)$$

The likelihood function $L(\theta)$ represents the probability of observing the sample data given that $\theta$ is the true parameter value.
Let $f(x; \theta)$ be the pmf/pdf of a random variable $X$ with parameter value $\theta$, and let $x_1, x_2, ..., x_n$ be a dataset. Then the likelihood is

$$L(\theta; x_1, ..., x_n) = \prod_{i=1}^{n} f(x_i; \theta) $$

To find the maximum likelihood solution $\hat{\theta}$, it is easier to maximize the loglikelihood function

$$l(\theta) = \log L(\theta) = \log \prod_{i=1}^{n} f(x_i; \theta) = \sum_{i=1}^{n} \log(f(x_i; \theta)) $$

The input that maximizes the loglikelihood must also maximize the likelihood, since the log is monotonically increasing and does not change the position of the maximum.

$$\hat{\theta} = \arg \max_\theta L(\theta) = \arg \max_\theta l(\theta)$$

The loglikelihood function is often concave, so the maximum can be found by taking the first derivative (gradient) and setting it equal to zero.

## Exponential
For example, let's find the MLE of the Exponential distribution. $X\sim \text{Exponential}(\lambda)$ has pdf

$$ f(x; \lambda) = \lambda e^{-\lambda x}, \quad x \geq 0 $$

The loglikelihood function and its derivative follow.

$$l(\lambda) = \sum_{i=1}^{n} \log(\lambda e^{-\lambda x_i}) = n \log \lambda - \lambda \sum_{i=1}^n x_i$$

$$ \frac{dl}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i $$

The MLE sets the derivative equal to zero

$$\frac{n}{\lambda} - \sum_{i=1}^{n} x_i = 0 $$

$$ \hat{\lambda} = \frac{n}{\sum_{i=1}^{n} x_i} $$
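
As a sanity check, the loglikelihood can also be maximized numerically. This sketch assumes NumPy and SciPy and an arbitrary true rate of $\lambda = 2.5$; the numerical maximizer should agree with the closed form.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
x = rng.exponential(scale=1 / 2.5, size=5_000)  # data with true lambda = 2.5

def neg_loglik(lam):
    # negative of l(lambda) = n*log(lambda) - lambda * sum(x_i)
    return -(len(x) * np.log(lam) - lam * x.sum())

result = minimize_scalar(neg_loglik, bounds=(1e-6, 100.0), method="bounded")
print("numerical MLE:", result.x)
print("closed form:  ", len(x) / x.sum())
```
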
## Normal

For a normal distribution with mean $\mu$ and variance $\sigma^2$, the pdf is

$$ f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}} $$

The likelihood is

$$ L(\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} $$

The loglikelihood is

$$ l(\mu, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 $$

To find the MLEs, we take the partial derivatives of $l$ with respect to $\mu$ and $\sigma^2$ and set them equal to zero. For $\mu$

$$ \frac{\partial l}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu) = 0 $$

$$ \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i $$

For $\sigma^2$

$$ \frac{\partial l}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (x_i - \mu)^2 = 0 $$

$$ \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2 $$

For the Exponential and Normal distributions, the maximum likelihood estimates are the same as the method of moments estimates.
That is a good illustration of why method of moments is a useful rule of thumb for simple distributions.
However, MLE is generally the better method and finds the optimal solution for a wider range of distributions.
Even when the likelihood is not concave, a local maximum found from the first-order conditions is often good enough.
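
A brief numerical check, assuming NumPy and arbitrary true values $\mu = 1$ and $\sigma^2 = 9$: `np.mean` and `np.var` with `ddof=0` compute exactly the $\hat{\mu}$ and $\hat{\sigma}^2$ above.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=1.0, scale=3.0, size=5_000)  # true mu = 1, sigma^2 = 9

mu_hat = np.mean(x)             # MLE / method of moments estimate of mu
sigma2_hat = np.var(x, ddof=0)  # MLE / method of moments estimate of sigma^2 (1/n denominator)
print(mu_hat, sigma2_hat)       # approximately 1 and 9
```
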
# Estimate Evaluation
How should an estimate be selected? MLE produces good estimates, but they can be biased or have higher variance than other estimates. The "best" estimate could be considered the estimate with the lowest MSE, or the most efficient estimate. Even when no estimates besides the MLE are considered, the MSE, efficiency, and sufficiency of the MLE describe its quality.

## Mean Squared Error (MSE)

In addition to bias, mean squared error is a metric to evaluate estimates. **Mean squared error (MSE)** is the mean of the squared difference between the estimate and the true parameter value.

$$
\text{MSE}(\hat{\theta}) = \mathbb{E}[(\hat{\theta} - \theta)^2] \qquad \text{(definition)}
$$

The definition can be simplified for use in computations.

$$
\begin{aligned}
\text{MSE}(\hat{\theta})
& = \mathbb{E}[(\hat{\theta} - \theta)^2] \\
& = \mathbb{E}[(\hat{\theta} - \mathbb{E}\hat{\theta} + \mathbb{E}\hat{\theta} - \theta)^2] \\
& = \mathbb{E}[(\hat{\theta} - \mathbb{E}\hat{\theta})^2 + 2(\hat{\theta} - \mathbb{E}\hat{\theta})(\mathbb{E}\hat{\theta} - \theta) + (\mathbb{E}\hat{\theta} - \theta)^2] \\
& = \mathbb{E}[(\hat{\theta} - \mathbb{E}\hat{\theta})^2] + 2\,\mathbb{E}[\hat{\theta} - \mathbb{E}\hat{\theta}]\,(\mathbb{E}\hat{\theta} - \theta) + (\mathbb{E}\hat{\theta} - \theta)^2 \\
& = \mathbb{E}[(\hat{\theta} - \mathbb{E}\hat{\theta})^2] + 2(0)(\mathbb{E}\hat{\theta} - \theta) + (\mathbb{E}\hat{\theta} - \theta)^2 \\
\text{MSE}(\hat{\theta}) & = Var(\hat{\theta}) + Bias(\hat{\theta})^2 \quad \text{(computation)}
\end{aligned}
$$

Thus, MSE is a metric that considers both variance and bias, and it can be used to compare estimates. However, when making predictions, lack of bias is often preferable to lower variance, so the best estimate is an unbiased estimate with the lowest MSE and thus the lowest variance. This is called a minimum variance unbiased estimate, or **efficient estimate**.
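
The decomposition can be checked by simulation. This sketch, assuming NumPy and the two Normal variance estimators from earlier, estimates each term empirically and confirms MSE = Var + Bias^2.

```python
import numpy as np

rng = np.random.default_rng(5)
sigma2, n, trials = 4.0, 10, 200_000

est = np.empty((trials, 2))
for t in range(trials):
    x = rng.normal(0.0, np.sqrt(sigma2), size=n)
    est[t] = [np.var(x, ddof=0), np.var(x, ddof=1)]  # sigma_hat^2 and S^2

for name, e in zip(["sigma_hat^2 (1/n)", "S^2 (1/(n-1))"], est.T):
    mse = np.mean((e - sigma2) ** 2)
    var, bias = e.var(), e.mean() - sigma2
    print(f"{name}: MSE={mse:.3f}  Var+Bias^2={var + bias**2:.3f}")
```
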
## Efficient Estimate and Cramer-Rao Bound

An efficient estimate achieves the minimum variance among all unbiased estimators.
The minimum possible variance is not a straightforward computation from a model, but a lower bound is known.
The **Cramer-Rao Bound** provides a lower bound on the variance of any unbiased estimator $\hat{\theta}$ of a parameter $\theta$.
An estimate can be compared against this bound to determine whether its variance is relatively small and the estimate relatively good.
The Cramer-Rao Bound is

$$
\text{Var}(\hat{\theta}) \geq \frac{1}{I(\theta)}
$$

where $I(\theta)$ is the Fisher Information, the expected squared derivative of the loglikelihood (see [MLE](#maximum-likelihood-estimation)) of the data $X$ given $\theta$

$$
I(\theta) = \mathbb{E} \left[ \left( \frac{\partial}{\partial \theta} \log L(X; \theta) \right)^2 \right]
$$

The Fisher Information quantifies the amount of information that an observable random variable $X$ provides about an unknown parameter $\theta$.
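
For a concrete case, take Normal data with known variance $\sigma^2$: the Fisher Information for $\mu$ from $n$ observations is $n/\sigma^2$, so the bound is $\sigma^2/n$, and the sample mean attains it. A small simulation sketch (assuming NumPy, with arbitrary $\sigma^2 = 4$ and $n = 25$) confirms this.

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma2, n, trials = 0.0, 4.0, 25, 100_000

means = rng.normal(mu, np.sqrt(sigma2), size=(trials, n)).mean(axis=1)

fisher_info = n / sigma2            # I(mu) for n i.i.d. Normal(mu, sigma^2) observations
print("Cramer-Rao bound:", 1 / fisher_info)   # sigma^2 / n
print("Var(sample mean):", means.var())       # approximately equal -> efficient
```
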
## Sufficient Estimate and Fisher-Neyman Theorem

Let $X=(X_1,X_2,...,X_n)$ be a vector of i.i.d. random variables, $T(X)$ be a statistic computed from $X$, and $x$ denote an outcome/value of $X$. Then $X_i$ has pmf/pdf $f(x_i;\theta)$ with parameter $\theta$ and $X$ has joint pmf/pdf $f(x;\theta) $ $=$ $f(x_1,x_2,...,x_n; \theta) $ $ =$ $\prod_{i=1}^n f(x_i; \theta)$.

Definition: $T(X)$ is a **sufficient statistic** if the probability distribution of $X$ given a value for $T(X)$ is constant with respect to $\theta$. In math notation, $f(x \| T(X)=T(x))$ is not a function of $\theta$. In other words, the statistic $T(X)$ provides as much information about $\theta$ as the entire data sample $X$. An estimate for $\theta$ calculated from a data sample is not necessarily a sufficient statistic, but any efficient estimate for $\theta$ must be a function of a sufficient statistic $T(X)$.

The Fisher-Neyman Factorization Theorem provides a shortcut to prove that a statistic is sufficient. The theorem states that $T(X)$ is a sufficient statistic for $\theta$ if the joint pmf/pdf factors as $f(x; \theta)$ $=$ $g(T(x), \theta) h(x)$ for some $g$ that depends on the data only through $T(x)$ and some $h$ that is a function of $x$ alone and does not involve $\theta$.

For example, let $X_1, X_2, ..., X_n \sim \text{Poisson} (\lambda)$. The maximum likelihood estimate for $\lambda$ is the sample mean $\frac{1}{n} \sum_{i=1}^n x_i$.
This is a function of $T(x)$ $ =$ $\sum_{i=1}^n x_i$, which is a sufficient statistic for $\lambda$. The sufficiency can be shown by either the definition or the theorem.

Recall $X_i \sim \text{Poisson} (\lambda)$ has pmf

$$
f\left(x_i\right)=\frac{e^{-\lambda} \lambda^{x_i}}{x_{i}!}
$$

and $T(X)=\sum_{i=1}^n X_i$ follows a $\text{Poisson}(n\lambda)$ distribution.

To prove sufficiency using the definition, show that $f(x \| T(X)=s)$ is equivalent to a function without $\lambda$. I'm using $s$ since the statistic is the sample sum.

$$f\big(x\big| \sum_{i=1}^nx_i = s\big) = \frac{\prod_{i=1}^n \frac{e^{-\lambda}\lambda^{x_i}}{x_{i}!}}{\frac{e^{-n \lambda} \cdot (n \lambda)^{s} }{s!}}
$$

$$= \frac{\frac{e^{-n\lambda} \cdot \lambda^{\sum_{i=1}^n x_i}}{\prod_{i=1}^n x_{i}!}}{\frac{e^{-n \lambda} \cdot (n \lambda)^{s} }{s!}}
$$

$$= \frac{\frac{e^{-n\lambda} \cdot \lambda^{s}}{\prod_{i=1}^n x_{i}!}}{\frac{e^{-n \lambda} \cdot n^s \lambda^{s} }{s!}}
$$

$$= \frac{s! \cdot n^{-s}}{\prod_{i=1}^n x_{i}!}
$$

A proof using the theorem is much faster.

$$f(x;\lambda) = \prod_{i=1}^n f(x_i) = \prod_{i=1}^n \frac{e^{-\lambda} \lambda^{x_i} }{x_{i}!}
= \frac{e^{-n\lambda} \lambda^{\sum_{i=1}^n x_i}}{\prod_{i=1}^n x_{i}!} $$

The numerator is a function $g$ of $T(x) = \sum_{i=1}^n x_i$ and $\lambda$, and the denominator is a function $h$ of $x$ that does not involve $\lambda$.
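
The definition can also be checked by simulation: conditional on the sum $s$, the distribution of the sample does not depend on $\lambda$. This sketch (assuming NumPy, with arbitrary choices $n = 3$ and $s = 6$) estimates $P(X_1 = k \mid \sum x_i = s)$ for two different rates; it is $\text{Binomial}(s, 1/n)$ either way.

```python
import numpy as np

rng = np.random.default_rng(7)
n, s, trials = 3, 6, 300_000

for lam in (1.0, 3.0):
    samples = rng.poisson(lam, size=(trials, n))
    hits = samples[samples.sum(axis=1) == s]          # condition on T(X) = sum = s
    # Distribution of X_1 given the sum; matches Binomial(s, 1/n) for any lambda
    freqs = np.bincount(hits[:, 0], minlength=s + 1) / len(hits)
    print(f"lambda={lam}: P(X_1=k | sum={s}) ~ {np.round(freqs, 3)}")
```
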
