Commit cf0d75c: Tom's edits of first wald_friedman lecture
1 parent: 7c9af26

1 file changed: lectures/wald_friedman.md (41 additions, 36 deletions)
@@ -44,9 +44,9 @@ In the spirit of {doc}`this earlier lecture <prob_meaning>`, the present lecture
 In this lecture, we describe Wald's formulation of the problem from the perspective of a statistician
 working within the Neyman-Pearson tradition of a frequentist statistician who thinks about testing hypotheses and consequently uses laws of large numbers to investigate limiting properties of particular statistics under a given **hypothesis**, i.e., a vector of **parameters** that pins down a particular member of a manifold of statistical models that interest the statistician.
 
-* From {doc}`this earlier lecture on frequentist and bayesian statistics<prob_meaning>`, please remember that a frequentist statistician routinely calculates functions of sequences of random variables, conditioning on a vector of parameters.
+* From {doc}`this lecture on frequentist and Bayesian statistics<prob_meaning>`, please remember that a frequentist statistician routinely calculates functions of sequences of random variables, conditioning on a vector of parameters.
 
-In {doc}`this sequel <wald_friedman_2>` we'll discuss another formulation that adopts the perspective of a **Bayesian statistician** who views parameters as vectors of random variables that are jointly distributed with observable variables that he is concerned about.
+In {doc}`this related lecture <wald_friedman_2>` we'll discuss another formulation that adopts the perspective of a **Bayesian statistician** who views parameters as random variables that are jointly distributed with observable variables that he is concerned about.
 
 Because we are taking a frequentist perspective that is concerned about relative frequencies conditioned on alternative parameter values, i.e.,
 alternative **hypotheses**, key ideas in this lecture
@@ -85,7 +85,7 @@ during World War II, when they worked at the US Government's
 Statistical Research Group at Columbia University.
 
 ```{note}
-See pages 25 and 26 of Allen Wallis's 1980 article {cite}`wallis1980statistical` about the Statistical Research Group at Columbia University during World War II for his account of the episode and for important contributions that Harold Hotelling made to formulating the problem. Also see chapter 5 of Jennifer Burns book about
+See pages 25 and 26 of Allen Wallis's 1980 article {cite}`wallis1980statistical` about the Statistical Research Group at Columbia University during World War II for his account of the episode and for important contributions that Harold Hotelling made to formulating the problem. Also see chapter 5 of Jennifer Burns' book about
 Milton Friedman {cite}`Burns_2023`.
 ```
 
@@ -117,16 +117,18 @@ Let's listen to Milton Friedman tell us what happened
 > because it is obviously superior beyond what was hoped for
 > $\ldots$.
 
-Friedman and Wallis worked on the problem but, after realizing that
-they were not able to solve it, they told Abraham Wald about the problem.
+Friedman and Wallis worked on the problem for a while but didn't completely solve it.
 
-That started Wald on the path that led him to *Sequential Analysis* {cite}`Wald47`.
+Realizing that, they told Abraham Wald about the problem.
+
+That set Wald on a path that led him to create *Sequential Analysis* {cite}`Wald47`.
 
 ## Neyman-Pearson Formulation
 
 It is useful to begin by describing the theory underlying the test
-that Navy Captain G. S. Schuyler had been told to use and that led him
-to approach Milton Friedman and Allan Wallis to convey his conjecture
+that the U.S. Navy told Captain G. S. Schuyler to use.
+
+Captain Schuyler's doubts motivated him to tell Milton Friedman and Allen Wallis his conjecture
 that superior practical procedures existed.
 
 Evidently, the Navy had told Captain Schuyler to use what was then a state-of-the-art
@@ -275,7 +277,7 @@ Here is how Wald introduces the notion of a sequential test
 
 ## Wald's Sequential Formulation
 
-In contradistinction to Neyman and Pearson's formulation of the problem, in Wald's formulation
+By way of contrast to Neyman and Pearson's formulation of the problem, in Wald's formulation
 
 
 - The sample size $n$ is not fixed but rather a random variable.
@@ -296,7 +298,7 @@ The density of a Beta probability distribution with parameters $a$ and $b$ is
 $$
 f(z; a, b) = \frac{\Gamma(a+b) z^{a-1} (1-z)^{b-1}}{\Gamma(a) \Gamma(b)}
 \quad \text{where} \quad
-\Gamma(t) := \int_{0}^{\infty} x^{t-1} e^{-x} dx
+\Gamma(p) := \int_{0}^{\infty} x^{p-1} e^{-x} dx
 $$
 
 The next figure shows two beta distributions.
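As a quick sanity check on this formula (a minimal sketch, not part of the lecture's code; it assumes `numpy` and `scipy` are available and the parameter values are purely illustrative), one can evaluate the density directly from the Gamma-function expression and compare it with `scipy.stats.beta.pdf`:

```python
import numpy as np
from scipy.special import gamma
from scipy.stats import beta

def beta_density(z, a, b):
    # Beta density written exactly as in the formula above
    return gamma(a + b) * z**(a - 1) * (1 - z)**(b - 1) / (gamma(a) * gamma(b))

z = np.linspace(0.05, 0.95, 5)
a, b = 2.0, 5.0                      # illustrative parameters only
print(beta_density(z, a, b))
print(beta.pdf(z, a, b))             # agrees with the hand-rolled formula
```
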
@@ -363,19 +365,17 @@ chooses among three distinct actions:
 $z_{k+1}$
 
 
-Wald proceeds as follows.
-
-He defines
+Wald defines
 
 - $p_{0m} = f_0(z_1) \cdots f_0(z_m)$
 - $p_{1m} = f_1(z_1) \cdots f_1(z_m)$
 - $L_{m} = \frac{p_{1m}}{p_{0m}}$
 
 Here $\{L_m\}_{m=0}^\infty$ is a **likelihood ratio process**.
 
-One of Wald's sequential decision rule is parameterized by two real numbers $B < A$.
+Wald's sequential decision rule is parameterized by real numbers $B < A$.
 
-For a given pair $A, B$ the decision rule is
+For a given pair $A, B$, the decision rule is
 
 $$
 \begin{aligned}
@@ -429,12 +429,14 @@ In particular, Wald constructs a mathematical argument that leads him to conclude
 > the number of observations required by the test.
 
 
+We'll write some Python code to help us illustrate Wald's claims about how $\alpha$ and $\beta$ are related to the parameters $A$ and $B$
+that characterize his sequential probability ratio test.
 
 ## Simulations
 
-In this section, we experiment with different distributions $f_0$ and $f_1$ to examine how Wald's test performs under various conditions.
+We experiment with different distributions $f_0$ and $f_1$ to examine how Wald's test performs under various conditions.
 
-The goal of these simulations is to understand trade-offs between decision speed and accuracy associated with Wald's **sequential probability ratio test**.
+Our goal in conducting these simulations is to understand trade-offs between decision speed and accuracy associated with Wald's **sequential probability ratio test**.
 
 Specifically, we will watch how:
 
@@ -457,9 +459,9 @@ SPRTParams = namedtuple('SPRTParams',
 
 Now we can run the simulation following Wald's recommendation.
 
-We use the log-likelihood ratio and compare it to the logarithms of the thresholds $\log(A)$ and $\log(B)$.
+We'll compare the log-likelihood ratio to logarithms of the thresholds $\log(A)$ and $\log(B)$.
 
-Below is the algorithm for the simulation.
+The following algorithm underlies our simulations.
 
 1. Compute thresholds $A = \frac{1-\beta}{\alpha}$, $B = \frac{\beta}{1-\alpha}$ and work with $\log A$, $\log B$.
 
589591
590592
As anticipated in the passage above in which Wald discussed the quality of
591593
$a(\alpha, \beta), b(\alpha, \beta)$ given in approximation {eq}`eq:Waldrule`,
592-
we find that the algorithm "overshoots" the error rates by giving us a
593-
lower type I and type II error rates than the target values.
594+
we find that the algorithm actually gives
595+
**lower** type I and type II error rates than the target values.
594596
595597
```{note}
596598
For recent work on the quality of approximation {eq}`eq:Waldrule`, see, e.g., {cite}`fischer2024improving`.
@@ -624,11 +626,11 @@ axes[1].set_ylabel("frequency")
 plt.show()
 ```
 
-In this simple case, the stopping time stays below 10.
+In this example, the stopping time stays below 10.
 
-We can also examine a $2 \times 2$ "confusion matrix" whose diagonal elements
-show the number of times when Wald's rule results in correct acceptance and
-rejection of the null hypothesis.
+We can construct a $2 \times 2$ "confusion matrix" whose diagonal elements
+count the number of times that Wald's decision rule correctly accepts and
+rejects the null hypothesis.
 
 ```{code-cell} ipython3
 # Accept H0 when H0 is true (correct)
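The lecture's own code cell (truncated in this hunk) tallies those counts from the simulation output. As a stand-alone illustration with made-up decision vectors (the array names and values below are hypothetical, not the lecture's `results` object), the matrix can be assembled like this:

```python
import numpy as np

# hypothetical decisions coded as 0 = accept H0, 1 = reject H0
decisions_h0_true = np.array([0, 0, 1, 0, 0])   # data generated under H0
decisions_h1_true = np.array([1, 1, 0, 1, 1])   # data generated under H1

confusion = np.array([
    [(decisions_h0_true == 0).sum(), (decisions_h0_true == 1).sum()],  # correct accept, type I error
    [(decisions_h1_true == 0).sum(), (decisions_h1_true == 1).sum()],  # type II error, correct reject
])
print(confusion)   # diagonal entries count the correct decisions
```
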
@@ -769,21 +771,24 @@ plot_sprt_results(results_2, params_2)
 plot_sprt_results(results_3, params_3)
 ```
 
-We can see a clear pattern in the stopping times and how close "separated" the two distributions are.
+Notice that the stopping times are shorter when the two distributions are farther apart.
+
+This makes sense.
+
+When two distributions are "far apart", it should not take too long to decide which one is generating the data.
 
-We can link this to the discussion of [Kullback–Leibler divergence](rel_entropy) in {doc}`likelihood_ratio_process`.
+When two distributions are "close", it should take longer to decide which one is generating the data.
 
-Intuitively, KL divergence is large when the distribution from one distribution to another is
-large.
+It is tempting to link this pattern to our discussion of [Kullback–Leibler divergence](rel_entropy) in {doc}`likelihood_ratio_process`.
 
-When two distributions are "far apart", it should not take long to decide which one is generating the data.
+While KL divergence is larger when two distributions differ more, KL divergence is not symmetric, meaning that the KL divergence of distribution $f$ from distribution $g$ is not necessarily equal to the KL
+divergence of $g$ from $f$.
 
-When two distributions are "close" to each other, it takes longer to decide which one is generating the data.
+If we want a symmetric measure of divergence that is actually a metric, we can instead use [Jensen-Shannon distance](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jensenshannon.html).
 
-However, KL divergence is not symmetric, meaning that the divergence from one distribution to another is not necessarily the same as the reverse.
+That is what we shall do now.
 
-To measure the discrepancy between two distributions, we use a metric
-called [Jensen-Shannon distance](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jensenshannon.html) and plot it against the average stopping times.
+We shall compute Jensen-Shannon distance and plot it against the average stopping times.
 
 ```{code-cell} ipython3
 def kl_div(h, f):
@@ -843,7 +848,7 @@ plt.show()
 
 The plot demonstrates a clear negative correlation between relative entropy and mean stopping time.
 
-As the KL divergence increases (distributions become more separated), the mean stopping time decreases exponentially.
+As Jensen-Shannon divergence increases (distributions become more separated), the mean stopping time decreases exponentially.
 
 Below are sampled examples from the experiments above
@@ -972,7 +977,7 @@ plot_likelihood_paths(params_3, n_highlight=10, n_background=100)
 
 Next, let's adjust the decision thresholds $A$ and $B$ and examine how the mean stopping time and the type I and type II error rates change.
 
-In the code below, we break Wald's rule by adjusting the thresholds $A$ and $B$ using factors $A_f$ and $B_f$.
+In the code below, we adjust Wald's rule by modifying the thresholds $A$ and $B$ using factors $A_f$ and $B_f$.
 
 ```{code-cell} ipython3
 @njit(parallel=True)
