lectures/wald_friedman.md
+41 −36 (41 additions, 36 deletions)
@@ -44,9 +44,9 @@ In the spirit of {doc}`this earlier lecture <prob_meaning>`, the present lecture
In this lecture, we describe Wald's formulation of the problem from the perspective of a statistician
working within the Neyman-Pearson tradition of a frequentist statistician who thinks about testing hypotheses and consequently uses laws of large numbers to investigate limiting properties of particular statistics under a given **hypothesis**, i.e., a vector of **parameters** that pins down a particular member of a manifold of statistical models that interest the statistician.

- * From {doc}`this earlier lecture on frequentist and bayesian statistics<prob_meaning>`, please remember that a frequentist statistician routinely calculates functions of sequences of random variables, conditioning on a vector of parameters.
+ * From {doc}`this lecture on frequentist and bayesian statistics<prob_meaning>`, please remember that a frequentist statistician routinely calculates functions of sequences of random variables, conditioning on a vector of parameters.

- In {doc}`this sequel <wald_friedman_2>` we'll discuss another formulation that adopts the perspective of a **Bayesian statistician** who views parameters as vectors of random variables that are jointly distributed with observable variables that he is concerned about.
+ In {doc}`this related lecture <wald_friedman_2>` we'll discuss another formulation that adopts the perspective of a **Bayesian statistician** who views parameters as random variables that are jointly distributed with observable variables that he is concerned about.

Because we are taking a frequentist perspective that is concerned about relative frequencies conditioned on alternative parameter values, i.e.,
alternative **hypotheses**, key ideas in this lecture
@@ -85,7 +85,7 @@ during World War II, when they worked at the US Government's
Statistical Research Group at Columbia University.

```{note}
- See pages 25 and 26 of Allen Wallis's 1980 article {cite}`wallis1980statistical` about the Statistical Research Group at Columbia University during World War II for his account of the episode and for important contributions that Harold Hotelling made to formulating the problem. Also see chapter 5 of Jennifer Burns book about
+ See pages 25 and 26 of Allen Wallis's 1980 article {cite}`wallis1980statistical` about the Statistical Research Group at Columbia University during World War II for his account of the episode and for important contributions that Harold Hotelling made to formulating the problem. Also see chapter 5 of Jennifer Burns' book about
Milton Friedman {cite}`Burns_2023`.
```
@@ -117,16 +117,18 @@ Let's listen to Milton Friedman tell us what happened
> because it is obviously superior beyond what was hoped for
> $\ldots$.

- Friedman and Wallis worked on the problem but, after realizing that
- they were not able to solve it, they told Abraham Wald about the problem.
+ Friedman and Wallis worked on the problem for a while but didn't completely solve it.

- That started Wald on the path that led him to *Sequential Analysis* {cite}`Wald47`.
+ Realizing that, they told Abraham Wald about the problem.
+
+ That set Wald on a path that led him to create *Sequential Analysis* {cite}`Wald47`.

## Neyman-Pearson Formulation

It is useful to begin by describing the theory underlying the test
- that Navy Captain G. S. Schuyler had been told to use and that led him
- to approach Milton Friedman and Allan Wallis to convey his conjecture
+ that the U.S. Navy told Captain G. S. Schuyler to use.
+
+ Captain Schuyler's doubts motivated him to tell Milton Friedman and Allen Wallis his conjecture
that superior practical procedures existed.

Evidently, the Navy had told Captain Schuyler to use what was then a state-of-the-art
@@ -275,7 +277,7 @@ Here is how Wald introduces the notion of a sequential test
## Wald's Sequential Formulation

- In contradistinction to Neyman and Pearson's formulation of the problem, in Wald's formulation
+ By way of contrast to Neyman and Pearson's formulation of the problem, in Wald's formulation

- The sample size $n$ is not fixed but rather a random variable.
@@ -296,7 +298,7 @@ The density of a Beta probability distribution with parameters $a$ and $b$ is
$$
f(z; a, b) = \frac{\Gamma(a+b) z^{a-1} (1-z)^{b-1}}{\Gamma(a) \Gamma(b)}
\quad \text{where} \quad
- \Gamma(t) := \int_{0}^{\infty} x^{t-1} e^{-x} dx
+ \Gamma(p) := \int_{0}^{\infty} x^{p-1} e^{-x} dx
$$

The next figure shows two beta distributions.
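As a quick illustration of this density, here is a minimal sketch of how two beta distributions might be plotted. The parameter values and the use of `scipy.stats.beta` are assumptions made for illustration, not taken from the lecture's code.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

# Illustrative parameter choices (assumptions, not the lecture's values)
a0, b0 = 1, 1     # parameters of the density f0
a1, b1 = 3, 1.2   # parameters of the density f1

z = np.linspace(0.001, 0.999, 400)
fig, ax = plt.subplots()
ax.plot(z, beta.pdf(z, a0, b0), label="$f_0$")
ax.plot(z, beta.pdf(z, a1, b1), label="$f_1$")
ax.set_xlabel("$z$")
ax.set_ylabel("density")
ax.legend()
plt.show()
```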
@@ -363,19 +365,17 @@ chooses among three distinct actions:
$z_{k+1}$

- Wald proceeds as follows.
-
- He defines
+ Wald defines

- $p_{0m} = f_0(z_1) \cdots f_0(z_m)$
- $p_{1m} = f_1(z_1) \cdots f_1(z_m)$
- $L_{m} = \frac{p_{1m}}{p_{0m}}$

Here $\{L_m\}_{m=0}^\infty$ is a **likelihood ratio process**.

- One of Wald's sequential decision rule is parameterized by two real numbers $B < A$.
+ Wald's sequential decision rule is parameterized by real numbers $B < A$.

- For a given pair $A, B$ the decision rule is
+ For a given pair $A, B$, the decision rule is

$$
\begin{aligned}
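The body of this display is cut off by the hunk. Based on the definitions above, a hedged sketch of the sequential test it describes might look as follows; the convention that crossing $A$ leads to accepting $f_1$ while crossing $B$ leads to accepting $f_0$, and the helper names `sprt` and `draw_z`, are assumptions for illustration rather than the lecture's own code.

```python
import numpy as np

def sprt(f0, f1, draw_z, A, B, max_draws=10_000, seed=0):
    """
    Sketch of a sequential probability ratio test with thresholds B < A.

    f0, f1 : density functions under the two hypotheses
    draw_z : function taking a numpy Generator and returning one observation z
    """
    rng = np.random.default_rng(seed)
    log_L = 0.0                          # log of the likelihood ratio L_m
    for m in range(1, max_draws + 1):
        z = draw_z(rng)
        log_L += np.log(f1(z)) - np.log(f0(z))
        if log_L >= np.log(A):           # strong evidence for f1
            return "accept f1", m
        if log_L <= np.log(B):           # strong evidence for f0
            return "accept f0", m
    return "no decision", max_draws
```

Working with $\log L_m$ rather than $L_m$ itself avoids numerical underflow and overflow when many observations are accumulated.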
@@ -429,12 +429,14 @@ In particular, Wald constructs a mathematical argument that leads him to conclud
> the number of observations required by the test.

+ We'll write some Python code to help us illustrate Wald's claims about how $\alpha$ and $\beta$ are related to the parameters $A$ and $B$
+ that characterize his sequential probability ratio test.

## Simulations

- In this section, we experiment with different distributions $f_0$ and $f_1$ to examine how Wald's test performs under various conditions.
+ We experiment with different distributions $f_0$ and $f_1$ to examine how Wald's test performs under various conditions.

- The goal of these simulations is to understand trade-offs between decision speed and accuracy associated with Wald's **sequential probability ratio test**.
+ Our goal in conducting these simulations is to understand trade-offs between decision speed and accuracy associated with Wald's **sequential probability ratio test**.
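For concreteness, Wald's classic approximations express the thresholds in terms of the error probabilities. The helper below is a sketch of that relationship; the function name and the formulas $A \approx (1-\beta)/\alpha$, $B \approx \beta/(1-\alpha)$ follow the standard SPRT convention and are not taken verbatim from the lecture's code.

```python
def wald_thresholds(alpha, beta):
    """
    Wald's approximate link between the error probabilities and the
    SPRT thresholds: A ~ (1 - beta) / alpha, B ~ beta / (1 - alpha).
    """
    A = (1 - beta) / alpha   # upper threshold (accept f1 when L_m >= A)
    B = beta / (1 - alpha)   # lower threshold (accept f0 when L_m <= B)
    return A, B

# For example, alpha = beta = 0.05 gives A = 19.0 and B ~ 0.0526
print(wald_thresholds(0.05, 0.05))
```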
We can see a clear pattern in the stopping times that depends on how "separated" the two distributions are.

+ Notice that the stopping times are smaller when the two distributions are farther apart.
+
+ This makes sense.
+
+ When two distributions are "far apart", it should not take too long to decide which one is generating the data.

- We can link this to the discussion of [Kullback–Leibler divergence](rel_entropy) in {doc}`likelihood_ratio_process`.
+ When two distributions are "close", it should take longer to decide which one is generating the data.

- Intuitively, KL divergence is large when the distribution from one distribution to another is
- large.
+ It is tempting to link this pattern to our discussion of [Kullback–Leibler divergence](rel_entropy) in {doc}`likelihood_ratio_process`.

- When two distributions are "far apart", it should not take long to decide which one is generating the data.
+ While KL divergence is larger when two distributions differ more, KL divergence is not symmetric, meaning that the KL divergence of distribution $f$ from distribution $g$ is not necessarily equal to the KL
+ divergence of $g$ from $f$.

- When two distributions are "close" to each other, it takes longer to decide which one is generating the data.
+ If we want a symmetric measure of divergence that is actually a metric, we can instead use the [Jensen-Shannon distance](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jensenshannon.html).

- However, KL divergence is not symmetric, meaning that the divergence from one distribution to another is not necessarily the same as the reverse.
+ That is what we shall do now.

- To measure the discrepancy between two distributions, we use a metric
- called [Jensen-Shannon distance](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jensenshannon.html) and plot it against the average stopping times.
+ We shall compute the Jensen-Shannon distance and plot it against the average stopping times.
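A minimal sketch of that computation, assuming the two densities are evaluated on a common grid; the Beta parameters below are illustrative, and `scipy.spatial.distance.jensenshannon` normalizes its inputs, so raw density values on the grid are acceptable.

```python
import numpy as np
from scipy.stats import beta
from scipy.spatial.distance import jensenshannon

grid = np.linspace(0.001, 0.999, 500)
p = beta.pdf(grid, 1, 1)     # density f0 (illustrative parameters)
q = beta.pdf(grid, 3, 1.2)   # density f1 (illustrative parameters)

# Jensen-Shannon distance: symmetric in p and q, and a true metric
js = jensenshannon(p, q)
print(f"Jensen-Shannon distance: {js:.4f}")
```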
```{code-cell} ipython3
def kl_div(h, f):
@@ -843,7 +848,7 @@ plt.show()
The plot demonstrates a clear negative correlation between relative entropy and mean stopping time.

- As the KL divergence increases (distributions become more separated), the mean stopping time decreases exponentially.
+ As Jensen-Shannon divergence increases (distributions become more separated), the mean stopping time decreases exponentially.
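A rough way to check this pattern is to simulate stopping times for several alternative densities and compare them with the corresponding Jensen-Shannon distances. Everything below (the thresholds, the parameter values, and the assumption that the data come from $f_0 = \text{Beta}(1,1)$) is an illustrative sketch rather than the lecture's code.

```python
import numpy as np
from scipy.stats import beta
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
grid = np.linspace(0.001, 0.999, 500)

def mean_stopping_time(a1, b1, A=20.0, B=0.05, n_reps=200):
    """Average SPRT stopping time when the data are drawn from f0 = Beta(1, 1)."""
    times = []
    for _ in range(n_reps):
        log_L, m = 0.0, 0
        while np.log(B) < log_L < np.log(A):
            z = rng.uniform()                     # a draw from f0 = Beta(1, 1)
            log_L += np.log(beta.pdf(z, a1, b1))  # log f1(z) - log f0(z), and log f0(z) = 0
            m += 1
        times.append(m)
    return np.mean(times)

for a1, b1 in [(1.5, 1.2), (3.0, 1.2), (5.0, 1.2)]:
    js = jensenshannon(beta.pdf(grid, 1, 1), beta.pdf(grid, a1, b1))
    print(f"JS distance {js:.3f} -> mean stopping time {mean_stopping_time(a1, b1):.1f}")
```

In this sketch, larger Jensen-Shannon distances should come with smaller average stopping times, consistent with the claim above.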
Below are sampled examples from the experiments above