|
25 | 25 | "highlight-style: pygments\n", |
26 | 26 | "date-format: full\n", |
27 | 27 | "lang: zh\n", |
28 | | - "bibliography: [../../references.bib]\n", |
| 28 | + "bibliography: [../../coding_projects/digital_processing_of_speech_signals/hmm.bib]\n", |
29 | 29 | "format: \n", |
30 | 30 | " html:\n", |
31 | 31 | " code-fold: false\n", |
|
924 | 924 | "cell_type": "markdown", |
925 | 925 | "metadata": {}, |
926 | 926 | "source": [ |
927 | | - "- What are the definitions of the forward probability and the backward probability?\n", |
| 927 | + "\n", |
 | 928 | + "- We need to understand the meaning of the symbols in the formula given by the exercise. What are the definitions of the forward probability and the backward probability?\n", |
928 | 929 | "- What is the meaning of the formula that Li Hang gives?\n", |
929 | 930 | "\n", |
930 | | - "We need to understand the meaning of the symbols in the formula given by the exercise.\n" |
 | 931 | + "First, let us be clear that the question posed by this exercise is still the evaluation (probability computation) problem: given a hidden Markov model (HMM) $\\lambda=(A,B,\\pi)$ and an observation sequence $O=(o_1,o_2,\\ldots,o_T)$, we want to compute the probability $P(O|\\lambda)$ that the model generates this observation sequence.\n", |
| 932 | + "\n", |
 | 933 | + "Let us first review the definitions of the forward and backward probabilities.\n", |
| 934 | + "\n", |
 | 935 | + "If we do not exploit the fact that the observation sequence can be decomposed, and instead treat O and I as monolithic wholes, we get the brute-force direct computation method that Li Hang presents first in his book (conceptually feasible but computationally infeasible) [@LiHang_2022], which Rabiner's paper calls the \"straightforward way\" [@Rabiner_1989]. With further observation, two dynamic-programming algorithms for this problem can be derived: the forward algorithm and the backward algorithm.\n", |
| 936 | + "\n", |
 | 937 | + "**In this exercise, however, Li Hang appears to give a new algorithm that uses both the forward probability and the backward probability!**\n", |
| 938 | + "\n", |
| 939 | + "::: {.callout-note}\n", |
 | 940 | + "Note that this is different from the backpropagation algorithm used to compute gradients in deep learning. The backpropagation algorithm popularized by Hinton does also save computation, but there the model needs both a forward pass and a backward pass; here, the forward algorithm and the backward algorithm are two independent algorithms, each of which can solve the problem on its own. Unfortunately, both Li Hang [@LiHang_2022] and Rabiner [@Rabiner_1989] lump them together under the somewhat misleading names \"forward-backward algorithm\" and \"The Forward-Backward Procedure\".\n", |
| 941 | + "\n", |
 | 942 | + "We raise this point in particular because Hinton has recently proposed the so-called \"Forward-Forward Algorithm\", in which two forward passes are indeed performed. So we must be precise when naming methods.\n", |
| 943 | + "::: \n", |
| 944 | + "\n", |
 | 945 | + "#### Review of the forward algorithm\n", |
| 946 | + "\n", |
 | 947 | + "The forward algorithm is a typical example of dynamic programming (the Viterbi algorithm is not the only dynamic-programming algorithm here).\n", |
 | 948 | + "First, there are several ways to compute $P(O|\\lambda)$ by summation; it does not have to come from marginalizing $P(O, I|\\lambda)$ directly. We can consider the so-called forward probability $P(O_{1:t}, I_t = q_i | \\lambda)$, where the colon in $O_{1:t}$ is slice notation and $P(X=x)$ denotes the probability that the random variable $X$ takes the value $x$. This probability is written $\\alpha_{t}(i)$; it is a function of $t$ and $i$, where $q_i$ denotes the $i$-th state.\n", |
| 949 | + "\n", |
 | 950 | + "If we can compute $P(O_{1:T}, I_T = q_i | \\lambda)$ for every $i$, then we can obtain $P(O|\\lambda)$ by summing over $i$.\n", |
| 951 | + "\n", |
 | 952 | + "However, $P(O_{1:T}, I_T = q_i | \\lambda)$ is itself not easy to compute directly. Looking at the Bayesian network @fig-hmm-demo, we only know $P(O_t | I_t)$ and $P(I_t | I_{t-1})$; even with $I_T$ fixed to $q_i$ (so that $O_T$ no longer depends on anything earlier), handling $O_{1:T}$ still requires summing over the $T-1$ remaining hidden states $I_1, \\ldots, I_{T-1}$.\n", |
| 953 | + "\n", |
 | 954 | + "By inspecting the Bayesian network, we first simplify the problem to $P(O_{1:T}, I_T = q_i | \\lambda) = P(O_{1:T-1}, I_T = q_i | \\lambda) \\times P(O_T | I_T = q_i, \\lambda)$. But what do we do with $P(O_{1:T-1}, I_T = q_i | \\lambda)$ next?\n", |
| 955 | + "\n", |
 | 956 | + "At this point we decompose the problem into overlapping subproblems: what if we already knew $P(O_{1:T-1}, I_{T-1} = q_j | \\lambda)$ for every $j$?\n", |
| 957 | + "\n", |
 | 958 | + "Once we see this, the answer becomes clear: with one more use of the transition matrix $A$, the law of total probability and the product rule give $P(O_{1:T-1}, I_T = q_i | \\lambda) = \\sum_{j=1}^N P(O_{1:T-1}, I_{T-1} = q_j | \\lambda) \\cdot a_{ji}$.\n", |
| 960 | + "\n", |
 | 961 | + "The same argument applies at every time step $t$, so\n", |
| 962 | + "\n", |
 | 963 | + "$P(O_{1:t}, I_t = q_i | \\lambda) = \\left[\\sum_{j=1}^N P(O_{1:t-1}, I_{t-1} = q_j | \\lambda) \\cdot a_{ji}\\right] \\times b_i(O_t)$\n", |
| 964 | + "\n", |
 | 965 | + "With this, we have derived the forward algorithm. In the notation of Li Hang's book, it can be summarized as the following steps:\n", |
| 966 | + "\n", |
| 967 | + "$$\n", |
| 968 | + "\\boxed{\n", |
| 969 | + "\\begin{aligned}\n", |
 | 970 | + "&\\text{Input: hidden Markov model } \\lambda, \\text{ observation sequence } O. \\\\\n", |
 | 971 | + "&\\text{Output: observation sequence probability } P(O|\\lambda). \\\\\n", |
 | 972 | + "&\\text{(1) Initialization} \\\\\n", |
| 973 | + "&\\alpha_1(i) = \\pi_i b_i(o_1), \\quad i = 1, 2, \\ldots, N \\quad \\\\\n", |
 | 974 | + "&\\text{(2) Recursion} \\\\\n", |
 | 975 | + "&\\text{For } t = 1, 2, \\ldots, T - 1: \\\\\n", |
| 976 | + "&\\alpha_{t+1}(i) = \\left[ \\sum_{j=1}^{N} \\alpha_t(j) a_{ji} \\right] b_i(o_{t+1}), \\quad i = 1, 2, \\ldots, N \\quad \\\\\n", |
 | 977 | + "&\\text{(3) Termination} \\\\\n", |
| 978 | + "&P(O|\\lambda) = \\sum_{i=1}^{N} \\alpha_T(i) \\quad \n", |
| 979 | + "\\end{aligned}\n", |
| 980 | + "}\n", |
| 981 | + "$$\n", |
| 982 | + "\n", |
 | 983 | + "The forward algorithm is widely used; for example, Zhang adopts it to compute HMM probabilities in [@Zhang_2024].\n", |
| 984 | + "\n", |
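 | | + "To make the recursion concrete, here is a minimal NumPy sketch of the forward algorithm. The function name `forward` and the array names `A`, `B`, `pi`, `obs` are our own illustrative choices (assuming discrete observations given as 0-based indices), not code from the book:\n", |
 | | + "\n", |
 | | + "```python\n", |
 | | + "import numpy as np\n", |
 | | + "\n", |
 | | + "def forward(A, B, pi, obs):\n", |
 | | + "    # A:  (N, N) transition matrix, A[i, j] = a_ij\n", |
 | | + "    # B:  (N, M) emission matrix,   B[j, k] = b_j(k)\n", |
 | | + "    # pi: (N,)   initial state distribution\n", |
 | | + "    # obs: observation indices o_1, ..., o_T (0-based)\n", |
 | | + "    T, N = len(obs), A.shape[0]\n", |
 | | + "    alpha = np.zeros((T, N))\n", |
 | | + "    alpha[0] = pi * B[:, obs[0]]                  # (1) initialization\n", |
 | | + "    for t in range(1, T):                         # (2) recursion\n", |
 | | + "        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]\n", |
 | | + "    return alpha, alpha[-1].sum()                 # (3) termination\n", |
 | | + "```\n", |
 | | + "\n", |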
 | 985 | + "#### Review of the backward algorithm\n", |
| 986 | + "\n", |
 | 987 | + "The idea of the backward algorithm is essentially the same as that of the forward algorithm, only the direction of the recursion is reversed; at the end it sums over the possible states at time 1 to obtain $P(O|\\lambda)$. The forward algorithm computes $\\alpha_t(i)$ from front to back, while the backward algorithm computes $\\beta_t(i)$ from back to front.\n", |
| 988 | + "\n", |
 | 989 | + "First, we define the backward probability $\\beta$ following Li Hang's book [@LiHang_2022]:\n", |
| 990 | + "\n", |
| 991 | + "> $$\n", |
| 992 | + "> \\beta_t(i) = P(o_{t+1}, o_{t+2}, \\cdots, o_T \\mid i_t = q_i, \\lambda)\n", |
| 993 | + "> $$\n", |
| 994 | + "\n", |
 | 995 | + "That is, it is the probability of the remaining part of the observation sequence, conditioned on being in state $q_i$ at time $t$.\n", |
| 996 | + "\n", |
 | 997 | + "The key is again the recursion. We recurse backwards from time $T-1$, computing $\\beta_t(i)$ for each time $t$. At time $t$, if the model is in state $i$, then the probability of the subsequent observations depends on the probability $a_{ij}$ of moving from state $i$ to each possible next state $j$, the probability $b_j(o_{t+1})$ of observing $o_{t+1}$ in state $j$, and the backward probability $\\beta_{t+1}(j)$ at the next time step.\n", |
| 998 | + "\n", |
 | 999 | + "The recursion is therefore:\n", |
| 1000 | + "\n", |
| 1001 | + "$$\n", |
| 1002 | + "\\beta_t(i) = \\sum_{j=1}^{N} a_{ij} b_j(o_{t+1}) \\beta_{t+1}(j), \\quad \\forall i = 1, 2, \\ldots, N; \\ t = T-1, T-2, \\ldots, 1 \\quad \n", |
| 1003 | + "$$\n", |
| 1004 | + "\n", |
 | 1005 | + "The full algorithm is as follows:\n", |
| 1006 | + "\n", |
| 1007 | + "$$\n", |
| 1008 | + "\\boxed{\n", |
| 1009 | + "\\begin{aligned}\n", |
 | 1010 | + "&\\text{Input: hidden Markov model } \\lambda, \\text{ observation sequence } O. \\\\\n", |
 | 1011 | + "&\\text{Output: observation sequence probability } P(O|\\lambda). \\\\\n", |
 | 1012 | + "&\\text{(1) Initialization} \\\\\n", |
| 1013 | + "&\\beta_T(i) = 1, \\quad \\forall i = 1, 2, \\ldots, N \\quad \\\\\n", |
 | 1014 | + "&\\text{(2) Recursion} \\\\\n", |
 | 1015 | + "&\\text{For } t = T-1, T-2, \\ldots, 1: \\\\\n", |
| 1016 | + "&\\beta_t(i) = \\sum_{j=1}^{N} a_{ij} b_j(o_{t+1}) \\beta_{t+1}(j), \\quad \\forall i = 1, 2, \\ldots, N \\quad \\\\\n", |
 | 1017 | + "&\\text{(3) Termination} \\\\\n", |
| 1018 | + "&P(O|\\lambda) = \\sum_{i=1}^{N} \\pi_i b_i(o_1) \\beta_1(i) \\quad \n", |
| 1019 | + "\\end{aligned}\n", |
| 1020 | + "}\n", |
 | 1021 | + "$$\n", |
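 | | + "\n", |
 | | + "Mirroring the forward sketch above, a minimal NumPy version of the backward recursion (same illustrative `A`, `B`, `pi`, `obs` conventions, not code from the book) might look like this:\n", |
 | | + "\n", |
 | | + "```python\n", |
 | | + "def backward(A, B, pi, obs):\n", |
 | | + "    # Same conventions as forward(); returns beta (T x N) and P(O|lambda).\n", |
 | | + "    T, N = len(obs), A.shape[0]\n", |
 | | + "    beta = np.zeros((T, N))\n", |
 | | + "    beta[-1] = 1.0                                     # (1) initialization\n", |
 | | + "    for t in range(T - 2, -1, -1):                     # (2) recursion\n", |
 | | + "        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])\n", |
 | | + "    return beta, (pi * B[:, obs[0]] * beta[0]).sum()   # (3) termination\n", |
 | | + "```\n" |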
931 | 1022 | ] |
932 | 1023 | }, |
933 | 1024 | { |
|
941 | 1032 | "cell_type": "markdown", |
942 | 1033 | "metadata": {}, |
943 | 1034 | "source": [ |
944 | | - "### Extension questions\n", |
 | 1035 | + "Let us first look at the formula given in the exercise:\n", |
945 | 1036 | "\n", |
| 1037 | + "> $P(O|\\lambda)=\\sum_{i=1}^{N}\\sum_{j=1}^{N}\\alpha_t(i)a_{ij}b_j(o_{t+1})\\beta_{t+1}(j), \\quad t=1,2,\\cdots,T-1$\n", |
946 | 1038 | "\n", |
947 | | - "\n" |
 | 1039 | + "where\n", |
| 1040 | + "$\\beta_t(i) = P(o_{t+1}, o_{t+2}, \\cdots, o_T \\mid i_t = q_i, \\lambda)$, $\\alpha_t(i) = P(o_1, o_2, \\cdots, o_t, i_t = q_i | \\lambda)$" |
948 | 1041 | ] |
949 | 1042 | }, |
950 | 1043 | { |
951 | | - "cell_type": "code", |
952 | | - "execution_count": null, |
| 1044 | + "cell_type": "markdown", |
953 | 1045 | "metadata": {}, |
954 | | - "outputs": [], |
955 | | - "source": [] |
| 1046 | + "source": [ |
 | 1047 | + "The formula in the exercise says: consider any time $t$ and the next time $t+1$, with the state at time $t$ set to $q_i$ and the state at time $t+1$ set to $q_j$. First, the probability of this state transition is $a_{ij}$, which is easy to understand. The probability that the state at time $t+1$ emits $o_{t+1}$ is $b_j(o_{t+1})$, which is also easy to understand.\n", |
| 1048 | + "\n", |
 | 1049 | + "What the exercise asks is: if we stitch together the forward probability at time $t$ and the backward probability at time $t+1$, can we obtain the probability of the whole sequence $O$? The answer is yes. We now prove the formula using basic probability rules.\n", |
| 1050 | + "\n", |
| 1051 | + "--- \n" |
| 1052 | + ] |
| 1053 | + }, |
| 1054 | + { |
| 1055 | + "cell_type": "markdown", |
| 1056 | + "metadata": {}, |
| 1057 | + "source": [ |
 | 1058 | + "To prove the identity, we either expand the left-hand side to obtain the right-hand side, or combine the right-hand side to obtain the left-hand side.\n", |
| 1059 | + "\n", |
 | 1060 | + "Here we choose to expand the left-hand side. We can expand $P(O|\\lambda)$ at positions $t$ and $t+1$.\n", |
| 1061 | + "\n", |
 | 1062 | + "By the law of total probability, we can write $P(O|\\lambda)$ as a sum over all possible pairs of states at times $t$ and $t+1$:\n", |
| 1063 | + "\n", |
| 1064 | + "$$\n", |
| 1065 | + "P(O|\\lambda) = \\sum_{i=1}^{N} \\sum_{j=1}^{N} P(o_1, \\ldots, o_t, i_t = q_i, i_{t+1} = q_j, o_{t+1}, \\ldots, o_T | \\lambda)\n", |
| 1066 | + "$$\n", |
| 1067 | + "\n", |
 | 1068 | + "Here we have introduced the intermediate state variables $i_t = q_i$ and $i_{t+1} = q_j$.\n", |
| 1069 | + "\n", |
 | 1070 | + "Note that the notation in Li Hang's book is slightly overloaded here: do not confuse the index $i$ (as in $q_i$ and $\\alpha_t(i)$) with the state variable $i_t$.\n", |
| 1071 | + "\n", |
 | 1072 | + "At this point we can already see the double-sum structure of the right-hand side.\n", |
| 1073 | + "\n", |
| 1074 | + "---\n" |
| 1075 | + ] |
| 1076 | + }, |
| 1077 | + { |
| 1078 | + "cell_type": "markdown", |
| 1079 | + "metadata": {}, |
| 1080 | + "source": [ |
 | 1081 | + "Next, we only need to show that this joint probability factorizes into known quantities:\n", |
| 1082 | + "\n", |
| 1083 | + "$$\n", |
| 1084 | + "P(o_1, \\ldots, o_t, i_t = q_i, i_{t+1} = q_j, o_{t+1}, \\ldots, o_T | \\lambda) = \\alpha_t(i) \\cdot a_{ij} \\cdot b_j(o_{t+1}) \\cdot \\beta_{t+1}(j)\n", |
| 1085 | + "$$\n", |
| 1086 | + "\n", |
 | 1087 | + "To prove this, we recall some facts about Bayesian networks [@Zhou_2016].\n", |
| 1088 | + "\n", |
 | 1089 | + "The Bayesian network shows the dependencies among these variables. The important points are:\n", |
| 1090 | + "\n", |
 | 1091 | + "1. Each hidden state $i_{t}$ depends only on the previous hidden state $i_{t-1}$.\n", |
 | 1092 | + "2. Each observation $o_t$ depends only on the corresponding hidden state $i_t$.\n", |
| 1093 | + "\n", |
 | 1094 | + "The dependency structure of the Bayesian network is therefore:\n", |
| 1095 | + "\n", |
| 1096 | + "$$ i_1 \\rightarrow i_2 \\rightarrow \\ldots \\rightarrow i_T $$\n", |
 | 1097 | + "$$ i_1 \\rightarrow o_1, \\quad i_2 \\rightarrow o_2, \\ldots, i_T \\rightarrow o_T $$\n", |
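 | | + "\n", |
 | | + "As a concrete worked example of this structure (our own illustration for $T = 3$, not taken from the book), the chain rule over this graph factorizes the joint probability of a particular state sequence $(i_1, i_2, i_3)$ and the observations as:\n", |
 | | + "\n", |
 | | + "$$\n", |
 | | + "P(o_1, o_2, o_3, i_1, i_2, i_3 \\mid \\lambda) = \\pi_{i_1} b_{i_1}(o_1) \\, a_{i_1 i_2} b_{i_2}(o_2) \\, a_{i_2 i_3} b_{i_3}(o_3)\n", |
 | | + "$$\n" |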
| 1098 | + ] |
| 1099 | + }, |
| 1100 | + { |
| 1101 | + "cell_type": "markdown", |
| 1102 | + "metadata": {}, |
| 1103 | + "source": [ |
 | 1104 | + "By the chain rule for Bayesian networks, the joint probability factorizes: each random variable appearing in the joint needs only its parent nodes as conditioning variables, with no influence from other nodes, and the resulting factors are simply multiplied together. We can therefore decompose the joint probability as follows:\n", |
| 1105 | + "\n", |
| 1106 | + "$$\n", |
| 1107 | + "P(o_1, \\ldots, o_t, i_t = q_i, i_{t+1} = q_j, o_{t+1}, \\ldots, o_T | \\lambda) = P(o_1, \\ldots, o_t, i_t = q_i | \\lambda) \n", |
| 1108 | + "\\\\ \\cdot P(i_{t+1} = q_j | i_t = q_i, \\lambda) \n", |
| 1109 | + "\\\\ \\cdot P(o_{t+1}, \\ldots, o_T | i_{t+1} = q_j, \\lambda)\n", |
| 1110 | + "$$\n", |
| 1111 | + "\n", |
 | 1112 | + "Many students get confused at this point: here we have only three factors, but the target expression has four. Where did the fourth one go?\n", |
| 1113 | + "\n", |
 | 1114 | + "The reason is that, by definition, $\\beta_t(i) = P(o_{t+1}, o_{t+2}, \\cdots, o_T \\mid i_t = q_i, \\lambda)$,\n", |
 | 1115 | + "so $\\beta_{t+1}(j) = P(o_{t+2}, o_{t+3}, \\cdots, o_T \\mid i_{t+1} = q_j, \\lambda)$,\n", |
| 1116 | + "\n", |
 | 1117 | + "whereas the third factor in the product above is $P(o_{t+1}, \\ldots, o_T | i_{t+1} = q_j, \\lambda)$.\n", |
| 1118 | + "\n", |
 | 1119 | + "It is exactly this difference that produces the extra factor $b_j(o_{t+1})$, because\n", |
| 1120 | + "\n", |
| 1121 | + "$$\n", |
| 1122 | + "P(o_{t+1}, \\ldots, o_T | i_{t+1} = q_j, \\lambda) = P(o_{t+2}, o_{t+3}, \\cdots, o_T \\mid i_{t+1} = q_j, \\lambda) \\times P(o_{t+1} \\mid i_{t+1} = q_j, \\lambda)\n", |
| 1123 | + "$$\n", |
| 1124 | + "\n", |
 | 1125 | + "With this, we have shown that\n", |
| 1126 | + "\n", |
| 1127 | + "\n", |
| 1128 | + "$$\n", |
| 1129 | + "P(o_1, \\ldots, o_t, i_t = q_i, i_{t+1} = q_j, o_{t+1}, \\ldots, o_T | \\lambda) = \\alpha_t(i) \\cdot a_{ij} \\cdot b_j(o_{t+1}) \\cdot \\beta_{t+1}(j)\n", |
| 1130 | + "$$" |
| 1131 | + ] |
| 1132 | + }, |
| 1133 | + { |
| 1134 | + "cell_type": "markdown", |
| 1135 | + "metadata": {}, |
| 1136 | + "source": [ |
| 1137 | + "\n", |
| 1138 | + "\n", |
| 1139 | + "--- \n" |
| 1140 | + ] |
| 1141 | + }, |
| 1142 | + { |
| 1143 | + "cell_type": "markdown", |
| 1144 | + "metadata": {}, |
| 1145 | + "source": [ |
 | 1146 | + "Combining the results above and substituting $P(o_1, \\ldots, o_t, i_t = q_i, i_{t+1} = q_j, o_{t+1}, \\ldots, o_T | \\lambda) = \\alpha_t(i) \\cdot a_{ij} \\cdot b_j(o_{t+1}) \\cdot \\beta_{t+1}(j)$ into the total-probability expansion, we obtain\n", |
| 1147 | + "\n", |
| 1148 | + "$$\n", |
| 1149 | + "P(O|\\lambda) = \\sum_{i=1}^{N} \\sum_{j=1}^{N} \\alpha_t(i) \\cdot a_{ij} \\cdot b_j(o_{t+1}) \\cdot \\beta_{t+1}(j)\n", |
| 1150 | + "$$\n", |
| 1151 | + "\n", |
| 1152 | + "\n", |
 | 1153 | + "This formula shows that, at every time $t$, the probability of the observation sequence can be computed by combining the forward variable $\\alpha_t(i)$, the transition probability $a_{ij}$, the emission probability $b_j(o_{t+1})$, and the backward variable $\\beta_{t+1}(j)$.\n", |
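 | | + "\n", |
 | | + "As a sanity check, the identity can be verified numerically with the `forward` and `backward` sketches from the review section, on a small toy model (all numbers below are made up for illustration):\n", |
 | | + "\n", |
 | | + "```python\n", |
 | | + "A = np.array([[0.5, 0.2, 0.3],\n", |
 | | + "              [0.3, 0.5, 0.2],\n", |
 | | + "              [0.2, 0.3, 0.5]])\n", |
 | | + "B = np.array([[0.5, 0.5],\n", |
 | | + "              [0.4, 0.6],\n", |
 | | + "              [0.7, 0.3]])\n", |
 | | + "pi = np.array([0.2, 0.4, 0.4])\n", |
 | | + "obs = [0, 1, 0]                        # toy observation indices\n", |
 | | + "\n", |
 | | + "alpha, p_forward = forward(A, B, pi, obs)\n", |
 | | + "beta, p_backward = backward(A, B, pi, obs)\n", |
 | | + "for t in range(len(obs) - 1):          # the identity holds for every t\n", |
 | | + "    # sum_{i,j} alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j)\n", |
 | | + "    p_t = (alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1]).sum()\n", |
 | | + "    assert np.isclose(p_t, p_forward) and np.isclose(p_t, p_backward)\n", |
 | | + "```\n" |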
| 1154 | + ] |
| 1155 | + }, |
| 1156 | + { |
| 1157 | + "cell_type": "markdown", |
| 1158 | + "metadata": {}, |
| 1159 | + "source": [ |
 | 1160 | + "### Extension questions\n", |
| 1161 | + "\n", |
| 1162 | + "\n", |
| 1163 | + "\n" |
| 1164 | + ] |
956 | 1165 | }, |
957 | 1166 | { |
958 | 1167 | "cell_type": "code", |
|