diff --git a/lectures/_config.yml b/lectures/_config.yml index b94706698..23a9a0c04 100644 --- a/lectures/_config.yml +++ b/lectures/_config.yml @@ -94,6 +94,9 @@ sphinx: intro: - "https://intro.quantecon.org/" - null + advanced: + - "https://python-advanced.quantecon.org" + - null mathjax3_config: tex: macros: diff --git a/lectures/likelihood_ratio_process.md b/lectures/likelihood_ratio_process.md index c0c845359..23bf00433 100644 --- a/lectures/likelihood_ratio_process.md +++ b/lectures/likelihood_ratio_process.md @@ -3,8 +3,10 @@ jupytext: text_representation: extension: .md format_name: myst + format_version: 0.13 + jupytext_version: 1.17.1 kernelspec: - display_name: Python 3 + display_name: Python 3 (ipykernel) language: python name: python3 --- @@ -29,26 +31,30 @@ kernelspec: This lecture describes likelihood ratio processes and some of their uses. -We'll use a setting described in {doc}`this lecture `. +We'll study the same setting that is also used in {doc}`this lecture on exchangeability `. Among things that we'll learn are -* A peculiar property of likelihood ratio processes * How a likelihood ratio process is a key ingredient in frequentist hypothesis testing * How a **receiver operator characteristic curve** summarizes information about a false alarm probability and power in frequentist hypothesis testing +* How a statistician can combine frequentist probabilities of type I and type II errors to form posterior probabilities of erroneous model selection or missclassification of individuals * How during World War II the United States Navy devised a decision rule that Captain Garret L. Schyler challenged, a topic to be studied in {doc}`this lecture ` +* A peculiar property of likelihood ratio processes -Let's start by importing some Python tools. +Let's start by importing some Python tools. -```{code-cell} ipython +```{code-cell} ipython3 import matplotlib.pyplot as plt plt.rcParams["figure.figsize"] = (11, 5) #set default figure size import numpy as np from numba import vectorize, jit from math import gamma from scipy.integrate import quad +from scipy.optimize import brentq +from scipy.stats import beta as beta_dist + ``` ## Likelihood Ratio Process @@ -102,8 +108,7 @@ observations up to and including time $t$. Sometimes for shorthand we'll write $L_t = L(w^t)$. -Notice that the likelihood process satisfies the *recursion* or -*multiplicative decomposition* +Notice that the likelihood process satisfies the *recursion* $$ L(w^t) = \ell (w_t) L (w^{t-1}) . @@ -116,9 +121,9 @@ Pearson {cite}`Neyman_Pearson`. To help us appreciate how things work, the following Python code evaluates $f$ and $g$ as two different beta distributions, then computes and simulates an associated likelihood ratio process by generating a sequence $w^t$ from one of the two -probability distributionss, for example, a sequence of IID draws from $g$. +probability distributions, for example, a sequence of IID draws from $g$. -```{code-cell} python3 +```{code-cell} ipython3 # Parameters in the two beta distributions. F_a, F_b = 1, 1 G_a, G_b = 3, 1.2 @@ -133,7 +138,7 @@ f = jit(lambda x: p(x, F_a, F_b)) g = jit(lambda x: p(x, G_a, G_b)) ``` -```{code-cell} python3 +```{code-cell} ipython3 @jit def simulate(a, b, T=50, N=500): ''' @@ -153,17 +158,18 @@ def simulate(a, b, T=50, N=500): return l_arr ``` +(nature_likeli)= ## Nature Permanently Draws from Density g We first simulate the likelihood ratio process when nature permanently draws from $g$. -```{code-cell} python3 +```{code-cell} ipython3 l_arr_g = simulate(G_a, G_b) l_seq_g = np.cumprod(l_arr_g, axis=1) ``` -```{code-cell} python3 +```{code-cell} ipython3 N, T = l_arr_g.shape for i in range(N): @@ -181,8 +187,9 @@ To see it this more clearly clearly, we plot over time the fraction of paths $L\left(w^{t}\right)$ that fall in the interval $\left[0, 0.01\right]$. -```{code-cell} python3 +```{code-cell} ipython3 plt.plot(range(T), np.sum(l_seq_g <= 0.01, axis=0) / N) +plt.show() ``` Despite the evident convergence of most probability mass to a @@ -245,7 +252,7 @@ To illustrate this peculiar property, we simulate many paths and calculate the unconditional mean of $L\left(w^t\right)$ by averaging across these many paths at each $t$. -```{code-cell} python3 +```{code-cell} ipython3 l_arr_g = simulate(G_a, G_b, N=50000) l_seq_g = np.cumprod(l_arr_g, axis=1) ``` @@ -292,22 +299,24 @@ Simulations below confirm this conclusion. Please note the scale of the $y$ axis. -```{code-cell} python3 +```{code-cell} ipython3 l_arr_f = simulate(F_a, F_b, N=50000) l_seq_f = np.cumprod(l_arr_f, axis=1) ``` -```{code-cell} python3 +```{code-cell} ipython3 N, T = l_arr_f.shape plt.plot(range(T), np.mean(l_seq_f, axis=0)) +plt.show() ``` We also plot the probability that $L\left(w^t\right)$ falls into the interval $[10000, \infty)$ as a function of time and watch how fast probability mass diverges to $+\infty$. -```{code-cell} python3 +```{code-cell} ipython3 plt.plot(range(T), np.sum(l_seq_f > 10000, axis=0) / N) +plt.show() ``` ## Likelihood Ratio Test @@ -331,12 +340,17 @@ We specify Neyman and Pearson proved that the best way to test this hypothesis is to use a **likelihood ratio test** that takes the form: +- accept $H_0$ if $L(W^t) > c$, - reject $H_0$ if $L(W^t) < c$, -- accept $H_0$ otherwise. -where $c$ is a given discrimination threshold, to be chosen in a way we'll soon describe. -This test is *best* in the sense that it is a **uniformly most powerful** test. +where $c$ is a given discrimination threshold. + +Setting $c =1$ is a common choice. + +We'll discuss consequences of other choices of $c$ below. + +This test is *best* in the sense that it is **uniformly most powerful**. To understand what this means, we have to define probabilities of two important events that allow us to characterize a test associated with a given @@ -344,13 +358,22 @@ threshold $c$. The two probabilities are: -- Probability of detection (= power = 1 minus probability - of Type II error): +- Probability of a Type I error in which we reject $H_0$ when it is true: + $$ - 1-\beta \equiv \Pr\left\{ L\left(w^{t}\right)c\mid q=g\right\} $$ +These two probabilities underly the following two concepts: + + - Probability of false alarm (= significance level = probability of Type I error): @@ -358,6 +381,14 @@ The two probabilities are: \alpha \equiv \Pr\left\{ L\left(w^{t}\right) 1$ + When we hold $c$ fixed at $c=1$, the following graph shows that * the probability of detection monotonically increases with increases in $t$ * the probability of a false alarm monotonically decreases with increases in $t$. -```{code-cell} python3 +```{code-cell} ipython3 PD = np.empty(T) PFA = np.empty(T) @@ -474,7 +511,7 @@ curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic). Below, we plot receiver operating characteristic curves for different sample sizes $t$. -```{code-cell} python3 +```{code-cell} ipython3 PFA = np.arange(0, 100, 1) for t in range(1, 15, 4): @@ -520,7 +557,7 @@ $0.05$. The required sample size for making a decision is then determined by a target probability of detection, for example, $0.9$, as depicted in the following graph. -```{code-cell} python3 +```{code-cell} ipython3 PFA = 0.05 PD = np.empty(T) @@ -599,13 +636,13 @@ We’ll now experiment with an $h$ is also a beta distribution We’ll start by setting parameters $G_a$ and $G_b$ so that $h$ is closer to $g$ -```{code-cell} python3 +```{code-cell} ipython3 H_a, H_b = 3.5, 1.8 h = jit(lambda x: p(x, H_a, H_b)) ``` -```{code-cell} python3 +```{code-cell} ipython3 x_range = np.linspace(0, 1, 100) plt.plot(x_range, f(x_range), label='f') plt.plot(x_range, g(x_range), label='g') @@ -618,7 +655,7 @@ plt.show() Let’s compute the Kullback–Leibler discrepancies by quadrature integration. -```{code-cell} python3 +```{code-cell} ipython3 def KL_integrand(w, q, h): m = q(w) / h(w) @@ -626,7 +663,7 @@ def KL_integrand(w, q, h): return np.log(m) * q(w) ``` -```{code-cell} python3 +```{code-cell} ipython3 def compute_KL(h, f, g): Kf, _ = quad(KL_integrand, 0, 1, args=(f, h)) @@ -635,7 +672,7 @@ def compute_KL(h, f, g): return Kf, Kg ``` -```{code-cell} python3 +```{code-cell} ipython3 Kf, Kg = compute_KL(h, f, g) Kf, Kg ``` @@ -645,7 +682,7 @@ We have $K_g < K_f$. Next, we can verify our conjecture about $L\left(w^t\right)$ by simulation. -```{code-cell} python3 +```{code-cell} ipython3 l_arr_h = simulate(H_a, H_b) l_seq_h = np.cumprod(l_arr_h, axis=1) ``` @@ -653,10 +690,10 @@ l_seq_h = np.cumprod(l_arr_h, axis=1) The figure below plots over time the fraction of paths $L\left(w^t\right)$ that fall in the interval $[0,0.01]$. -Notice that it converges to 1 as expected when $g$ is closer to +Notice that, as expected, it converges to 1 when $g$ is closer to $h$ than $f$ is. -```{code-cell} python3 +```{code-cell} ipython3 N, T = l_arr_h.shape plt.plot(range(T), np.sum(l_seq_h <= 0.01, axis=0) / N) ``` @@ -664,17 +701,17 @@ plt.plot(range(T), np.sum(l_seq_h <= 0.01, axis=0) / N) We can also try an $h$ that is closer to $f$ than is $g$ so that now $K_g$ is larger than $K_f$. -```{code-cell} python3 +```{code-cell} ipython3 H_a, H_b = 1.2, 1.2 h = jit(lambda x: p(x, H_a, H_b)) ``` -```{code-cell} python3 +```{code-cell} ipython3 Kf, Kg = compute_KL(h, f, g) Kf, Kg ``` -```{code-cell} python3 +```{code-cell} ipython3 l_arr_h = simulate(H_a, H_b) l_seq_h = np.cumprod(l_arr_h, axis=1) ``` @@ -682,15 +719,365 @@ l_seq_h = np.cumprod(l_arr_h, axis=1) Now probability mass of $L\left(w^t\right)$ falling above $10000$ diverges to $+\infty$. -```{code-cell} python3 +```{code-cell} ipython3 N, T = l_arr_h.shape plt.plot(range(T), np.sum(l_seq_h > 10000, axis=0) / N) ``` +## Hypothesis Testing and Classification + +We now describe how a statistician can combine frequentist probabilities of type I and type II errors in order to + +* compute an anticipated frequency of selecting a wrong model based on a sample length $T$ +* compute an anticipated error rate in a classification problem + +We consider a situation in which nature generates data by mixing known densities $f$ and $g$ with known mixing +parameter $\pi_{-1} \in (0,1)$ so that the random variable $w$ is drawn from the density + +$$ +h (w) = \pi_{-1} f(w) + (1-\pi_{-1}) g(w) +$$ + +We assume that the statistician knows the densities $f$ and $g$ and also the mixing parameter $\pi_{-1}$. + +Below, we'll set $\pi_{-1} = .5$, although much of the analysis would follow through with other settings of $\pi_{-1} \in (0,1)$. + +We assume that $f$ and $g$ both put positive probabilities on the same intervals of possible realizations of the random variable $W$. + + + +In the simulations below, we specify that $f$ is a $\text{Beta}(1, 1)$ distribution and that $g$ is $\text{Beta}(3, 1.2)$ distribution, +just as we did often earlier in this lecture. + +We consider two alternative timing protocols. + + * Timing protocol 1 is for the model selection problem + * Timing protocol 2 is for the individual classification problem + +**Protocol 1:** Nature flips a coin once at time $t=-1$ and with probability $\pi_{-1}$ generates a sequence $\{w_t\}_{t=1}^T$ +of IID draws from $f$ and with probability $1-\pi_{-1}$ generates a sequence $\{w_t\}_{t=1}^T$ +of IID draws from $g$. + +Let's write some Python code that implements timing protocol 1. + +```{code-cell} ipython3 +def protocol_1(π_minus_1, T, N=1000): + """ + Simulate Protocol 1: + Nature decides once at t=-1 which model to use. + """ + + # On-off coin flip for the true model + true_models_F = np.random.rand(N) < π_minus_1 + + sequences = np.empty((N, T)) + + n_f = np.sum(true_models_F) + n_g = N - n_f + if n_f > 0: + sequences[true_models_F, :] = np.random.beta(F_a, F_b, (n_f, T)) + if n_g > 0: + sequences[~true_models_F, :] = np.random.beta(G_a, G_b, (n_g, T)) + + return sequences, true_models_F +``` + +**Protocol 2.** At each time $t \geq 0$, nature flips a coin and with probability $\pi_{-1}$ draws $w_t$ from $f$ and with probability $1-\pi_{-1}$ draws $w_t$ from $g$. + +Here is Python code that we'll use to implement timing protocol 2. + +```{code-cell} ipython3 +def protocol_2(π_minus_1, T, N=1000): + """ + Simulate Protocol 2: + Nature decides at each time step which model to use. + """ + + # Coin flips for each time t upto T + true_models_F = np.random.rand(N, T) < π_minus_1 + + sequences = np.empty((N, T)) + + n_f = np.sum(true_models_F) + n_g = N * T - n_f + if n_f > 0: + sequences[true_models_F] = np.random.beta(F_a, F_b, n_f) + if n_g > 0: + sequences[~true_models_F] = np.random.beta(G_a, G_b, n_g) + + return sequences, true_models_F +``` + +**Remark:** Under protocol 2, the $\{w_t\}_{t=1}^T$ is a sequence of IID draws from $h(w)$. Under protocol 1, the the $\{w_t\}_{t=1}^T$ is +not IID. It is **conditionally IID** -- meaning that with probability $\pi_{-1}$ it is a sequence of IID draws from $f(w)$ and with probability $1-\pi_{-1}$ it is a sequence of IID draws from $g(w)$. For more about this, see {doc}`this lecture about exchangeability `. + +We again deploy a **likelihood ratio process** with time $t$ component being the likelihood ratio + +$$ +\ell (w_t)=\frac{f\left(w_t\right)}{g\left(w_t\right)},\quad t\geq1. +$$ + +The **likelihood ratio process** for sequence $\left\{ w_{t}\right\} _{t=1}^{\infty}$ is + +$$ +L\left(w^{t}\right)=\prod_{i=1}^{t} \ell (w_i), +$$ + +For shorthand we'll write $L_t = L(w^t)$. + +In the next cell, we write the likelihood ratio calculation that we have done [previously](nature_likeli) into a function + +```{code-cell} ipython3 +def compute_likelihood_ratios(sequences): + """ + Compute likelihood ratios for given sequences. + """ + + l_ratios = f(sequences) / g(sequences) + L_cumulative = np.cumprod(l_ratios, axis=1) + return l_ratios, L_cumulative +``` + +## Model Selection Mistake Probability + +We first study a problem that assumes timing protocol 1. + +Consider a decision maker who wants to know whether model $f$ or model $g$ governs a data set of length $T$ observations. + +The decision makers has observed a sequence $\{w_t\}_{t=1}^T$. + +On the basis of that observed sequence, a likelihood ratio test selects model $f$ when + $L_T \geq 1 $ and model $g$ when $L_T < 1$. + +When model $f$ generates the data, the probability that the likelihood ratio test selects the wrong model is + +$$ +p_f = {\rm Prob}\left(L_T < 1\Big| f\right) = \alpha_T . +$$ + +When model $g$ generates the data, the probability that the likelihood ratio test selects the wrong model is + +$$ +p_g = {\rm Prob}\left(L_T \geq 1 \Big|g \right) = \beta_T. +$$ + +We can construct a probability that the likelihood ratio selects the wrong model by assigning a Bayesian prior probability of $\pi_{-1} = .5$ that nature selects model $f$ then averaging $p_f$ and $p_g$ to form the Bayesian posterior probability of a detection error equal to + +$$ +p(\textrm{wrong decision}) = {1 \over 2} (\alpha_T + \beta_T) . +$$ (eq:detectionerrorprob) + +Now let's simulate the protocol 1 and compute the error probabilities + +```{code-cell} ipython3 +# Set parameters +π_minus_1 = 0.5 +T_max = 30 +N_simulations = 10_000 + +sequences_p1, true_models_p1 = protocol_1( + π_minus_1, T_max, N_simulations) +l_ratios_p1, L_cumulative_p1 = compute_likelihood_ratios(sequences_p1) + +# Compute error probabilities for different sample sizes +T_range = np.arange(1, T_max + 1) + +# Boolean masks for true models +mask_f = true_models_p1 +mask_g = ~true_models_p1 + +# Select cumulative likelihoods for each model +L_f = L_cumulative_p1[mask_f, :] +L_g = L_cumulative_p1[mask_g, :] + +α_T = np.mean(L_f < 1, axis=0) +β_T = np.mean(L_g >= 1, axis=0) + +error_prob = 0.5 * (α_T + β_T) + +# Plot results +fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5)) + +ax1.plot(T_range, α_T, 'b-', + label=r'$\alpha_T$', linewidth=2) +ax1.plot(T_range, β_T, 'r-', + label=r'$\beta_T$', linewidth=2) +ax1.set_xlabel('$T$') +ax1.set_ylabel('error probability') +ax1.legend() + +ax2.plot(T_range, error_prob, 'g-', + label=r'$\frac{1}{2}(\alpha_T+\beta_T)$', linewidth=2) +ax2.set_xlabel('$T$') +ax2.set_ylabel('error probability') +ax2.legend() + +plt.tight_layout() +plt.show() + +print(f"At T={T_max}:") +print(f"α_{T_max} = {α_T[-1]:.4f}") +print(f"β_{T_max} = {β_T[-1]:.4f}") +print(f"Model selection error probability = {error_prob[-1]:.4f}") +``` + + +Notice how the model selection error probability approaches zero as $T$ grows. + +## Classification + +We now consider a problem that assumes timing protocol 2. + +A decision maker wants to classify components of an observed sequence $\{w_t\}_{t=1}^T$ as having been drawn from either $f$ or $g$. + +The decision maker uses the following classification rule: + +$$ +\begin{aligned} +w_t & \ {\rm is \ from \ f \ if \ } l_t > 1 \\ +w_t & \ {\rm is \ from \ g \ if \ } l_t \leq 1 . +\end{aligned} +$$ + +Under this rule, the expected misclassification rate is + +$$ +p(\textrm{misclassification}) = {1 \over 2} (\tilde \alpha_t + \tilde \beta_t) +$$ (eq:classerrorprob) + +where $\tilde \alpha_t = {\rm Prob}(l_t < 1 \mid f)$ and $\tilde \beta_t = {\rm Prob}(l_t \geq 1 \mid g)$. + +Now let's simulate protocol 2 and compute the classification error probability. + +```{code-cell} ipython3 +sequences_p2, true_sources_p2 = protocol_2( + π_minus_1, T_max, N_simulations) +l_ratios_p2, _ = compute_likelihood_ratios(sequences_p2) + +# Find decision boundary where f(w) = g(w) +root = brentq(lambda w: f(w) / g(w) - 1, 0.001, 0.999) + +# Compute theoretical tilde α_t and tilde β_t +def α_integrand(w): + """Integrand for tilde α_t = P(l_t < 1 | f)""" + return f(w) if f(w) / g(w) < 1 else 0 + +def β_integrand(w): + """Integrand for tilde β_t = P(l_t >= 1 | g)""" + return g(w) if f(w) / g(w) >= 1 else 0 + +# Compute the integrals +α_theory, _ = quad(α_integrand, 0, 1, limit=100) +β_theory, _ = quad(β_integrand, 0, 1, limit=100) + +theory_error = 0.5 * (α_theory + β_theory) + +print(f"theoretical tilde α_t = {α_theory:.4f}") +print(f"theoretical tilde β_t = {β_theory:.4f}") +print(f"theoretical classification error probability = {theory_error:.4f}") +``` + +Since for each $t$, the decision boundary is the same, we can plot the distributions of $f$ and $g$ and the decision boundary + +```{code-cell} ipython3 +:tags: [hide-input] + +fig, ax = plt.subplots(figsize=(7, 6)) + +w_range = np.linspace(0.001, 0.999, 1000) +f_values = [f(w) for w in w_range] +g_values = [g(w) for w in w_range] +ratio_values = [f(w)/g(w) for w in w_range] + +ax.plot(w_range, f_values, 'b-', + label=r'$f(w) \sim Beta(1,1)$', linewidth=2) +ax.plot(w_range, g_values, 'r-', + label=r'$g(w) \sim Beta(3,1.2)$', linewidth=2) + +type1_prob = 1 - beta_dist.cdf(root, F_a, F_b) +type2_prob = beta_dist.cdf(root, G_a, G_b) + +w_type1 = w_range[w_range >= root] +f_type1 = [f(w) for w in w_type1] +ax.fill_between(w_type1, 0, f_type1, alpha=0.3, color='blue', + label=fr'$\tilde \alpha_t = {type1_prob:.2f}$') + +w_type2 = w_range[w_range <= root] +g_type2 = [g(w) for w in w_type2] +ax.fill_between(w_type2, 0, g_type2, alpha=0.3, color='red', + label=fr'$\tilde \beta_t = {type2_prob:.2f}$') + +ax.axvline(root, color='green', linestyle='--', alpha=0.7, + label=f'decision boundary: $w=${root:.3f}') + +ax.set_xlabel('w') +ax.set_ylabel('probability density') +ax.legend() + +plt.tight_layout() +plt.show() +``` + +To the left of the green vertical line $f < g $, so $l_t < 1$; therefore a $w_t$ that falls to the left of the green line is classified as a type $g$ individual. + + * The shaded orange area equals $\beta$ -- the probability of classifying someone as a type $g$ individual when it is really a type $f$ individual. + +To the right of the green vertical line $g > f$, so $l_t >1 $; therefore a $w_t$ that falls to the right of the green line is classified as a type $f$ individual. + + * The shaded blue area equals $\alpha$ -- the probability of classifying someone as a type $f$ when it is really a type $g$ individual. + + + +Let's see the classification algorithm performs in simulated data. + +```{code-cell} ipython3 +accuracy = np.empty(T_max) + +for t in range(T_max): + predictions = (l_ratios_p2[:, t] >= 1) + actual = true_sources_p2[:, t] + accuracy[t] = np.mean(predictions == actual) + +plt.figure(figsize=(10, 6)) +plt.plot(range(1, T_max + 1), accuracy, + 'b-', linewidth=2, label='empirical accuracy') +plt.axhline(1 - theory_error, color='r', linestyle='--', + label=f'theoretical accuracy = {1 - theory_error:.4f}') +plt.xlabel('$t$') +plt.ylabel('accuracy') +plt.legend() +plt.ylim(0.5, 1.0) +plt.show() +``` + +Let's watch decisions made by the two protocols as more and more observations accrue. + +```{code-cell} ipython3 +fig, ax = plt.subplots(figsize=(7, 6)) + +ax.plot(T_range, error_prob, linewidth=2, + label='Protocol 1') +ax.plot(T_range, 1-accuracy, linestyle='--', linewidth=2, + label=f'Protocol 2') +ax.set_ylabel('error probability') +ax.legend() +plt.show() +``` + +From the figure above, we can see: + +- For both protocols, the error probability starts at the same level, subject to a little randomness. + +- For protocol 1, the error probability decreases as the sample size increaes because we are just making **one** decision -- i.e., selecting whether $f$ or $g$ governs **all** individuals. More data provides better evidence. + +- For protocol 2, the error probability remains constant because we are making **many** decisions -- one classification decision for each observation. + +**Remark:** Think about how laws of large numbers are applied to compute error probabilities for the model selection problem and the classification problem. + ## Sequels -Likelihood processes play an important role in Bayesian learning, as described in {doc}`this lecture ` -and as applied in {doc}`this lecture `. +Likelihood processes play an important role in Bayesian learning, as described in {doc}`likelihood_bayes` +and as applied in {doc}`odu`. -Likelihood ratio processes appear again in [this lecture](https://python-advanced.quantecon.org/additive_functionals.html), which contains another illustration +Likelihood ratio processes appear again in {doc}`advanced:additive_functionals`, which contains another illustration of the **peculiar property** of likelihood ratio processes described above. diff --git a/lectures/prob_matrix.md b/lectures/prob_matrix.md index b53c7788f..4d7a964da 100644 --- a/lectures/prob_matrix.md +++ b/lectures/prob_matrix.md @@ -464,11 +464,11 @@ $$ An associated conditional distribution is $$ -\textrm{Prob}\{Y=i\vert X=j\} = \frac{\rho_{ij}}{ \sum_{i}\rho_{ij}} +\textrm{Prob}\{Y=i\vert X=j\} = \frac{\rho_{ij}}{ \sum_{j}\rho_{ij}} = \frac{\textrm{Prob}\{Y=j, X=i\}}{\textrm{Prob}\{ X=i\}} $$ -We can define a transition probability matrix +We can define a transition probability matrix $P$ with $i,j$ component $$ p_{ij}=\textrm{Prob}\{Y=j|X=i\}= \frac{\rho_{ij}}{ \sum_{j}\rho_{ij}} @@ -490,7 +490,7 @@ The first row is the probability that $Y=j, j=0,1$ conditional on $X=0$. The second row is the probability that $Y=j, j=0,1$ conditional on $X=1$. Note that -- $\sum_{j}\rho_{ij}= \frac{ \sum_{j}\rho_{ij}}{ \sum_{j}\rho_{ij}}=1$, so each row of $\rho$ is a probability distribution (not so for each column). +- $\sum_{j}\rho_{ij}= \frac{ \sum_{j}\rho_{ij}}{ \sum_{j}\rho_{ij}}=1$, so each row of the transition matrix $P$ is a probability distribution (not so for each column).