<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Hexo</title>
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:type" content="website">
<meta property="og:title" content="Hexo">
<meta property="og:url" content="http://example.com/index.html">
<meta property="og:site_name" content="Hexo">
<meta property="og:locale" content="en_US">
<meta property="article:author" content="zhengchenyu">
<meta name="twitter:card" content="summary">
<link rel="alternate" href="/atom.xml" title="Hexo" type="application/atom+xml">
<link rel="shortcut icon" href="/favicon.png">
<link rel="stylesheet" href="/css/style.css">
<link rel="stylesheet" href="/fancybox/jquery.fancybox.min.css">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/fork-awesome@1.2.0/css/fork-awesome.min.css">
<meta name="generator" content="Hexo 8.1.1"></head>
<body>
<div id="container">
<div id="wrap">
<header id="header">
<div id="banner"></div>
<div id="header-outer" class="outer">
<div id="header-title" class="inner">
<h1 id="logo-wrap">
<a href="/" id="logo">Hexo</a>
</h1>
</div>
<div id="header-inner" class="inner">
<nav id="main-nav">
<a id="main-nav-toggle" class="nav-icon"><span class="fa fa-bars"></span></a>
<a class="main-nav-link" href="/">Home</a>
<a class="main-nav-link" href="/archives">Archives</a>
</nav>
<nav id="sub-nav">
<a class="nav-icon" href="/atom.xml" title="RSS Feed"><span class="fa fa-rss"></span></a>
<a class="nav-icon nav-search-btn" title="Search"><span class="fa fa-search"></span></a>
</nav>
<div id="search-form-wrap">
<form action="//google.com/search" method="get" accept-charset="UTF-8" class="search-form"><input type="search" name="q" class="search-form-input" placeholder="Search"><button type="submit" class="search-form-submit"></button><input type="hidden" name="sitesearch" value="http://example.com"></form>
</div>
</div>
</div>
</header>
<div class="outer">
<section id="main">
<article id="post-ErasuceCode算法揭秘" class="h-entry article article-type-post" itemprop="blogPost" itemscope itemtype="https://schema.org/BlogPosting">
<div class="article-meta">
<a href="/2025/02/07/ErasuceCode%E7%AE%97%E6%B3%95%E6%8F%AD%E7%A7%98/" class="article-date">
<time class="dt-published" datetime="2025-02-07T10:58:15.000Z" itemprop="datePublished">2025-02-07</time>
</a>
</div>
<div class="article-inner">
<header class="article-header">
<h1 itemprop="name">
<a class="p-name article-title" href="/2025/02/07/ErasuceCode%E7%AE%97%E6%B3%95%E6%8F%AD%E7%A7%98/">Demystifying the ErasureCode Algorithm</a>
</h1>
</header>
<div class="e-content article-entry" itemprop="articleBody">
<blockquote>
<p>This article focuses on putting theory into practice: implementing the ErasureCode algorithm without using any third-party library. It only briefly explains the theory it relies on and marks the references, without detailed derivations, so please read Chapter 4 of reference [1] before reading this article.<br>The code for this article lives at <a target="_blank" rel="noopener" href="https://github.com/zhengchenyu/SimpleErasureCode">https://github.com/zhengchenyu/SimpleErasureCode</a>. SimpleErasureCode is written for ease of understanding and stays consistent with this article, so it is not optimized. Key steps in the article link to the corresponding code.</p>
</blockquote>
<p>EC algorithms are widely used in both storage and communication. In distributed storage, to survive machine failures, data is traditionally stored as 3 redundant replicas, and those 3 machines tolerate at most 2 failures. With EC RS-6-3, 9 blocks redundantly store 6 blocks of data, those 9 machines tolerate at most 3 failures, and the storage used is roughly halved. By using EC, distributed storage cuts the number of redundant replicas and saves a great deal of storage while barely reducing availability.</p>
<p>Implementing an EC algorithm is an important test of whether one truly understands it, and that understanding also helps greatly in engineering. The author's original motivation for implementing EC was to solve the problem raised in HADOOP-19180.</p>
<h1 id="1-算法概述"><a href="#1-算法概述" class="headerlink" title="1 算法概述"></a>1 Algorithm Overview</h1><p>This article uses the RS-6-3 algorithm as an example: there are 6 data blocks and 3 parity blocks. An EC algorithm must answer two basic questions:</p>
<ul>
<li>How are the parity blocks generated from the data blocks?</li>
<li>How are lost data blocks recovered from the remaining data and parity blocks?</li>
</ul>
<h2 id="1-1-生成校验块"><a href="#1-1-生成校验块" class="headerlink" title="1.1 生成校验块"></a>1.1 Generating the parity blocks</h2><p>Generating the parity blocks is the encoding process (<a target="_blank" rel="noopener" href="https://github.com/zhengchenyu/SimpleErasureCode/blob/24be76083a0c3172f1d2fe7af8e1ad972935657f/src/main/java/zcy/ec/coder/ErasureCoder.java#L29">encode</a>). Take one byte from each of the 6 data blocks, giving d<sub>0</sub>,d<sub>1</sub>,d<sub>2</sub>,d<sub>3</sub>,d<sub>4</sub>,d<sub>5</sub>. As shown below, multiplying the encode matrix (<a target="_blank" rel="noopener" href="https://github.com/zhengchenyu/SimpleErasureCode/blob/24be76083a0c3172f1d2fe7af8e1ad972935657f/src/main/java/zcy/ec/coder/ErasureCoder.java#L14C24-L14C36">encodeMatrix</a>) by this data column yields the original data d<sub>0</sub>,d<sub>1</sub>,d<sub>2</sub>,d<sub>3</sub>,d<sub>4</sub>,d<sub>5</sub> together with the parity bytes c<sub>0</sub>,c<sub>1</sub>,c<sub>2</sub> (<a target="_blank" rel="noopener" href="https://github.com/zhengchenyu/SimpleErasureCode/blob/24be76083a0c3172f1d2fe7af8e1ad972935657f/src/main/java/zcy/ec/coder/ErasureCoder.java#L50">encoding step</a>). Applying this operation to every byte position of the data blocks yields the parity blocks.</p>
<img src="/images/ec/ec编码矩阵.png" width=50% height=50% text-align=center/>
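The matrix-vector view of encoding can be sketched in a few lines. This is an illustrative toy, not the article's Java code: it uses ordinary integer arithmetic and a made-up RS-2-1 layout so the shape of the computation is clear, whereas a real coder performs the same multiply-accumulate in GF(2^8).

```python
def encode(matrix, data):
    """Multiply a (k+m) x k encode matrix by k data symbols: the first k
    outputs reproduce the data, the last m are parity."""
    return [sum(coef * d for coef, d in zip(row, data)) for row in matrix]

# Toy RS-2-1 encode matrix: identity rows on top, one parity row below.
encode_matrix = [
    [1, 0],
    [0, 1],
    [1, 1],  # parity row
]
print(encode(encode_matrix, [3, 5]))  # [3, 5, 8]
```

Running the same loop over every byte position of the data blocks is exactly the encode step described above.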
<h2 id="1-2-恢复数据块"><a href="#1-2-恢复数据块" class="headerlink" title="1.2 恢复数据块"></a>1.2 Recovering data blocks</h2><p>Recovering data blocks is the decoding process (<a target="_blank" rel="noopener" href="https://github.com/zhengchenyu/SimpleErasureCode/blob/24be76083a0c3172f1d2fe7af8e1ad972935657f/src/main/java/zcy/ec/coder/ErasureCoder.java#L65">decode</a>). Suppose d<sub>2</sub>,d<sub>3</sub>,d<sub>4</sub> are lost; we need to recover them from d<sub>0</sub>,d<sub>1</sub>,d<sub>5</sub>,c<sub>0</sub>,c<sub>1</sub>,c<sub>2</sub>. Delete the corresponding rows from the equation above, writing d<sub>2?</sub>,d<sub>3?</sub>,d<sub>4?</sub> for the lost blocks. This gives:</p>
<img src="/images/ec/ec解码过程1.png" width=50% height=50% text-align=center/>
<p>Multiply both sides of the equation by the inverse of the trimmed matrix; this inverse is the decode matrix (<a target="_blank" rel="noopener" href="https://github.com/zhengchenyu/SimpleErasureCode/blob/24be76083a0c3172f1d2fe7af8e1ad972935657f/src/main/java/zcy/ec/coder/ErasureCoder.java#L117C18-L117C30">decodeMatrix</a>). We obtain:</p>
<img src="/images/ec/ec解码过程2.png" width=50% height=50% text-align=center/>
<p>Once the decode matrix is obtained, the computation is the same as encoding, except that the input is d<sub>0</sub>,d<sub>1</sub>,d<sub>5</sub>,c<sub>0</sub>,c<sub>1</sub>,c<sub>2</sub> and the output is d<sub>0</sub>,d<sub>1</sub>,d<sub>2</sub>,d<sub>3</sub>,d<sub>4</sub>,d<sub>5</sub>.</p>
<h1 id="2-关于矩阵"><a href="#2-关于矩阵" class="headerlink" title="2 关于矩阵"></a>2 About the matrix</h1><p>The previous chapter introduced the basic EC encode and decode flow, but some problems remain in an actual implementation. This chapter discusses the matrix-related ones.</p>
<h2 id="2-1-如何选择矩阵"><a href="#2-1-如何选择矩阵" class="headerlink" title="2.1 如何选择矩阵"></a>2.1 Choosing the matrix</h2><p>From the analysis in Chapter 1, the encode matrix is a 9 * 6 matrix, and the decode matrix is obtained by deleting any three of its rows and inverting the result. So when defining the encode matrix we must guarantee that every 6 * 6 matrix obtained by deleting any three rows of the 9 * 6 matrix is invertible.<br>For ease of computation, the upper half of the encode matrix is the 6 * 6 identity matrix, which is invertible and satisfies the condition. What remains is to find a suitable 3 * 6 lower half. This article uses a Vandermonde matrix.</p>
<img src="/images/ec/ec范德蒙矩阵.png" width=50% height=50% text-align=center/>
<blockquote>
<p>Since MathType does not support recent versions of macOS, and markdown formulas often end up garbled after tweaking, the formulas here are pasted as images…</p>
</blockquote>
<p>The determinant of the Vandermonde matrix is:</p>
<img src="/images/ec/ec范德蒙矩阵行列式.png" width=50% height=50% text-align=center/>
<p>As long as the a<sub>i</sub> are pairwise distinct and non-zero, the Vandermonde matrix is invertible, meaning that any 3 of its rows are linearly independent. So we take the first three rows and set a<sub>1</sub>=1, a<sub>2</sub>=2, a<sub>3</sub>=3, obtaining the following encode matrix:</p>
<img src="/images/ec/ec真实编码矩阵.png" width=50% height=50% text-align=center/>
<p>We have shown that the first 6 rows are linearly independent of one another, as are the last 3. If we take n rows from the first 6 and 6-n rows from the last 3, are those 6 rows still linearly independent? Suppose n is 5: the five identity rows each span a single coordinate axis, and since the Vandermonde rows contain no zero element, each has a component along the missing sixth axis, so the set cannot become linearly dependent. The same reasoning applies for n = 4 and n = 3. Therefore any 6 rows of this 9 * 6 matrix form an invertible 6 * 6 matrix.</p>
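Building the 9 * 6 encode matrix described above (identity on top, Vandermonde rows with a = 1, 2, 3 below) can be sketched as follows. Plain integer powers are shown for clarity; a GF(2^8) implementation would compute the same powers with field multiplication, and rs_encode_matrix is a hypothetical name, not from the article's repository.

```python
def rs_encode_matrix(k, m):
    """k identity rows pass the data symbols through; m Vandermonde rows
    [a^0, a^1, ..., a^(k-1)] for a = 1..m produce the parity symbols."""
    identity = [[1 if i == j else 0 for j in range(k)] for i in range(k)]
    vandermonde = [[a ** j for j in range(k)] for a in range(1, m + 1)]
    return identity + vandermonde

matrix = rs_encode_matrix(6, 3)
print(matrix[7])  # Vandermonde row for a=2: [1, 2, 4, 8, 16, 32]
```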
<h2 id="2-2-高斯消元求逆"><a href="#2-2-高斯消元求逆" class="headerlink" title="2.2 高斯消元求逆"></a>2.2 Inversion by Gaussian elimination</h2><p>This article inverts matrices with the easy-to-follow Gaussian elimination method (<a target="_blank" rel="noopener" href="https://github.com/zhengchenyu/SimpleErasureCode/blob/24be76083a0c3172f1d2fe7af8e1ad972935657f/src/main/java/zcy/ec/coder/math/Matrix.java#L73">inverse</a>). Let the current 6 * 6 matrix be A, and append the 6 * 6 identity matrix E on the right to form [A | E]. Multiplying this by A<sup>-1</sup> would give [E | A<sup>-1</sup>]. So we only need to transform the A half into the identity matrix, and the E half then becomes A<sup>-1</sup>. (<a target="_blank" rel="noopener" href="https://github.com/zhengchenyu/SimpleErasureCode/blob/24be76083a0c3172f1d2fe7af8e1ad972935657f/src/main/java/zcy/ec/coder/math/Matrix.java#L73C24-L73C31">Gaussian elimination inverse</a>)<br>For each row in turn, perform the following steps:</p>
<ul>
<li>(1) For row i, take the value at row i, column i. (<a target="_blank" rel="noopener" href="https://github.com/zhengchenyu/SimpleErasureCode/blob/24be76083a0c3172f1d2fe7af8e1ad972935657f/src/main/java/zcy/ec/coder/math/Matrix.java#L120">Step1</a>)</li>
<li>(2) Compute that value’s multiplicative inverse, and multiply every element of the row by it. (<a target="_blank" rel="noopener" href="https://github.com/zhengchenyu/SimpleErasureCode/blob/24be76083a0c3172f1d2fe7af8e1ad972935657f/src/main/java/zcy/ec/coder/math/Matrix.java#L126">Step2</a>)</li>
<li>(3) For every other row j, apply a linear transformation using row i to eliminate the value at row j, column i to 0.</li>
</ul>
<p>After these three steps, column i contains a 1 in row i and 0 everywhere else. Performing the steps for every row turns the left half into the identity matrix, and the right half then becomes A<sup>-1</sup>.</p>
<blockquote>
<p>Over the rationals, multiplying by a multiplicative inverse is division, and adding an additive inverse is subtraction. Phrasing the steps this way describes the arithmetic of a Galois field more accurately.</p>
</blockquote>
<p>Processing row 0 proceeds as follows; the other rows are handled the same way.</p>
<img src="/images/ec/ec高斯消元处理第0行.png" width=50% height=50% text-align=center/>
<p>There is one special case: suppose we are processing row 2 of the matrix above and the element at row 2, column 2 is 0. Since 0 has no multiplicative inverse, we find a row below row 2 whose column 2 is non-zero and add it to row 2, which keeps the algorithm running (<a target="_blank" rel="noopener" href="https://github.com/zhengchenyu/SimpleErasureCode/blob/24be76083a0c3172f1d2fe7af8e1ad972935657f/src/main/java/zcy/ec/coder/math/Matrix.java#L105">setup for step1</a>). A row swap might be more efficient, but addition is used here for ease of understanding.</p>
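The three elimination steps, including the zero-pivot fix-up, can be sketched over exact rational arithmetic. This is a sketch under that assumption, not the article's Java code: a field-based implementation would run the identical steps with GF(2^8) add/mul, using the field's multiplicative inverse in place of division.

```python
from fractions import Fraction

def gauss_jordan_inverse(a):
    """Invert a square matrix via [A | E] -> [E | A^-1], using exact
    rational arithmetic so no precision is lost."""
    n = len(a)
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(a)]
    for i in range(n):
        # setup for step 1: if the pivot is 0, add a lower row whose
        # i-th column is non-zero (the article adds rather than swaps)
        if aug[i][i] == 0:
            for r in range(i + 1, n):
                if aug[r][i] != 0:
                    aug[i] = [x + y for x, y in zip(aug[i], aug[r])]
                    break
        # step 2: scale row i by the pivot's multiplicative inverse
        inv = 1 / aug[i][i]
        aug[i] = [x * inv for x in aug[i]]
        # step 3: eliminate column i from every other row
        for r in range(n):
            if r != i and aug[r][i] != 0:
                factor = aug[r][i]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[i])]
    return [row[n:] for row in aug]  # the right half is now A^-1
```

For example, `gauss_jordan_inverse([[2, 1], [1, 1]])` yields `[[1, -1], [-1, 2]]`.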
<h1 id="3-伽罗华域"><a href="#3-伽罗华域" class="headerlink" title="3 伽罗华域"></a>3 Galois fields</h1><p>One problem remains with the matrix arithmetic: the data is stored in bytes, values from 0 to 255. Ordinary rational arithmetic easily overflows that range. We therefore need a new arithmetic system satisfying two conditions:</p>
<ul>
<li>Every value in the system, including every computed result, stays within a finite set, i.e. no overflow.</li>
<li>The original values can be recovered with a finite number of operations, i.e. decodable.</li>
</ul>
<p>This is what Galois (finite) fields provide. A Galois field guarantees that every computed result stays within a finite set, which satisfies the no-overflow requirement. Moreover, every operation in a Galois field has a corresponding inverse operation: for example, multiplying a by b and then by b’s multiplicative inverse yields a again. That satisfies the decodability requirement.</p>
<p>For ease of understanding, we first introduce GF(7) briefly, then explain the GF(2<sup>8</sup>) actually used.</p>
<blockquote>
<p>This chapter contains no rigorous proofs. For instance, when 7 is replaced by an arbitrarily large prime, how are the corresponding theorems proved? Honestly, the author does not know, and the reference offers no rigorous proof either; it only notes that composite numbers cannot serve as the modulus. But for the finite sets used in engineering, exhaustive checking easily demonstrates that the algorithm works.</p>
</blockquote>
<h2 id="3-1-GF-7"><a href="#3-1-GF-7" class="headerlink" title="3.1 GF(7)"></a>3.1 GF(7)</h2><p>First, two definitions:</p>
<ul>
<li>Additive inverse: given x, if there exists x’ such that x+x’=x’+x=0, then x’ is the additive inverse of x.</li>
<li>Multiplicative inverse: given x, if there exists x’ such that x * x’=x’ * x=e, then x’ is the multiplicative inverse of x, where e is the identity element of the group. For GF(7), e is 1.</li>
</ul>
<p>We define a new kind of arithmetic: arithmetic mod 7. Additions and multiplications are computed normally and the result is reduced mod 7. For example, 8 + 3 = 11 mod 7 = 4, and 8 * 3 = 24 mod 7 = 3.<br>For the inverses, we can exhaustively enumerate the additions and multiplications and read the inverses off the results. For example, 4 + 3 = 7 mod 7 = 0, so 4 and 3 are each other’s additive inverse; 2 * 4 = 8 mod 7 = 1, so 2 and 4 are each other’s multiplicative inverse.<br>The table below enumerates all additions and multiplications in GF(7); from it the additive and multiplicative inverse of every element can be read off.</p>
<img src="/images/ec/ecgf7table.png" width=50% height=50% text-align=center/>
<p>Let us check this against our requirements. Staying in range holds by definition. For decodability, pick a number, say 5: multiplying 5 by 3 and then by 3’s multiplicative inverse 5 gives 5 * 3 * 5 = 75 mod 7 = 5. The commutative law of multiplication makes this easy to prove in general, and decoding with additive inverses works the same way.</p>
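The exhaustive-search idea behind the GF(7) tables can be sketched directly; gf7_add_inverse and gf7_mul_inverse are hypothetical helper names, not from the article's repository.

```python
MOD = 7  # GF(7): arithmetic mod a prime

def gf7_add_inverse(x):
    """The y with (x + y) mod 7 == 0, found by exhaustion."""
    return next(y for y in range(MOD) if (x + y) % MOD == 0)

def gf7_mul_inverse(x):
    """The y with (x * y) mod 7 == 1, found by exhaustion (x != 0)."""
    return next(y for y in range(MOD) if (x * y) % MOD == 1)

# decode property from the text: 5 * 3, then * inverse(3), stays 5
assert (5 * 3 * gf7_mul_inverse(3)) % MOD == 5
print(gf7_add_inverse(4), gf7_mul_inverse(2))  # 3 4
```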
<h2 id="3-2-GF-28"><a href="#3-2-GF-28" class="headerlink" title="3.2 GF(28)"></a>3.2 GF(2<sup>8</sup>)</h2><h3 id="3-2-1-基本原理"><a href="#3-2-1-基本原理" class="headerlink" title="3.2.1 基本原理"></a>3.2.1 Basic principle</h3><p>The stored values are 0-255, so can we use GF(256) directly, i.e. arithmetic mod 256? No: 256 is composite. For example 16 * 16 = 256 mod 256 = 0, so 16 has no multiplicative inverse and decoding becomes impossible.<br>We therefore introduce polynomial arithmetic in which the coefficients of each power follow GF(2), with an irreducible polynomial as the modulus. A polynomial a is irreducible if there are no two polynomials b and c, both different from 1, such that b * c = a. The additive and multiplicative inverses of every polynomial can again be found by exhaustion; in fact, as long as every element of the set has a one-to-one multiplicative inverse, our requirements are met. Table 4.6 of reference [1] demonstrates the validity of GF(2<sup>3</sup>) by exhaustion.<br>What does polynomial arithmetic have to do with encoding bytes? Since the coefficients follow GF(2), for a polynomial a<sub>0</sub> + a<sub>1</sub>x + a<sub>2</sub>x<sup>2</sup> + a<sub>3</sub>x<sup>3</sup> + a<sub>4</sub>x<sup>4</sup> + a<sub>5</sub>x<sup>5</sup> + a<sub>6</sub>x<sup>6</sup> + a<sub>7</sub>x<sup>7</sup>, every a<sub>i</sub> is 0 or 1. Let a<sub>0</sub> be bit 0, a<sub>1</sub> bit 1, and so on: the coefficient vector is exactly a byte. This ties the bytes we store to polynomial arithmetic.</p>
<h3 id="3-2-2-计算"><a href="#3-2-2-计算" class="headerlink" title="3.2.2 计算"></a>3.2.2 Computation</h3><p>This section shows how to compute in GF(2<sup>8</sup>). Addition and multiplication use polynomial arithmetic directly; for the additive and multiplicative inverses, we enumerate the additions and multiplications and build lookup tables.<br>The irreducible polynomial used in this article is x<sup>8</sup> + x<sup>4</sup> + x<sup>3</sup> + x + 1.<br>Let f(x) = x<sup>6</sup> + x<sup>4</sup> + x<sup>2</sup> + x + 1 and g(x) = x<sup>7</sup> + x + 1.<br>For addition, each coefficient follows GF(2), and GF(2) addition is simply XOR. So f(x)+g(x) = x<sup>7</sup> + x<sup>6</sup> + x<sup>4</sup> + x<sup>2</sup>. Concretely, this is the bitwise XOR of f’s binary form 0b01010111 with g’s binary form 0b10000011. (<a target="_blank" rel="noopener" href="https://github.com/zhengchenyu/SimpleErasureCode/blob/24be76083a0c3172f1d2fe7af8e1ad972935657f/src/main/java/zcy/ec/coder/operator/GaloisFieldComputation.java#L101">GF(2<sup>8</sup>)</a>)<br>For multiplication, first consider h(x) = a<sub>0</sub> + a<sub>1</sub>x + a<sub>2</sub>x<sup>2</sup> + a<sub>3</sub>x<sup>3</sup> + a<sub>4</sub>x<sup>4</sup> + a<sub>5</sub>x<sup>5</sup> + a<sub>6</sub>x<sup>6</sup> + a<sub>7</sub>x<sup>7</sup> and compute h(x) * x = (a<sub>0</sub>x + a<sub>1</sub>x<sup>2</sup> + a<sub>2</sub>x<sup>3</sup> + a<sub>3</sub>x<sup>4</sup> + a<sub>4</sub>x<sup>5</sup> + a<sub>5</sub>x<sup>6</sup> + a<sub>6</sub>x<sup>7</sup> + a<sub>7</sub>x<sup>8</sup>) mod (x<sup>8</sup> + x<sup>4</sup> + x<sup>3</sup> + x + 1). There are two cases:</p>
<ul>
<li>(1) a<sub>7</sub> equals 0<br>The product needs no further reduction, so h(x) * x = a<sub>0</sub>x + a<sub>1</sub>x<sup>2</sup> + a<sub>2</sub>x<sup>3</sup> + a<sub>3</sub>x<sup>4</sup> + a<sub>4</sub>x<sup>5</sup> + a<sub>5</sub>x<sup>6</sup> + a<sub>6</sub>x<sup>7</sup>. In byte form, [a<sub>7</sub>a<sub>6</sub>a<sub>5</sub>a<sub>4</sub>a<sub>3</sub>a<sub>2</sub>a<sub>1</sub>a<sub>0</sub>] * [00000010] = [a<sub>6</sub>a<sub>5</sub>a<sub>4</sub>a<sub>3</sub>a<sub>2</sub>a<sub>1</sub>a<sub>0</sub>0], i.e. a left shift by one bit. (<a target="_blank" rel="noopener" href="https://github.com/zhengchenyu/SimpleErasureCode/blob/24be76083a0c3172f1d2fe7af8e1ad972935657f/src/main/java/zcy/ec/coder/operator/GaloisFieldComputation.java#L85">a<sub>7</sub> equals 0</a>)</li>
<li>(2) a<sub>7</sub> is not 0<br>h(x) * x = (a<sub>0</sub>x + a<sub>1</sub>x<sup>2</sup> + a<sub>2</sub>x<sup>3</sup> + a<sub>3</sub>x<sup>4</sup> + a<sub>4</sub>x<sup>5</sup> + a<sub>5</sub>x<sup>6</sup> + a<sub>6</sub>x<sup>7</sup> + x<sup>8</sup>) mod (x<sup>8</sup> + x<sup>4</sup> + x<sup>3</sup> + x + 1) = a<sub>0</sub>x + a<sub>1</sub>x<sup>2</sup> + a<sub>2</sub>x<sup>3</sup> + a<sub>3</sub>x<sup>4</sup> + a<sub>4</sub>x<sup>5</sup> + a<sub>5</sub>x<sup>6</sup> + a<sub>6</sub>x<sup>7</sup> + x<sup>8</sup> - x<sup>8</sup> - x<sup>4</sup> - x<sup>3</sup> - x - 1 = 1 + (a<sub>0</sub>+1)x + a<sub>1</sub>x<sup>2</sup> + (a<sub>2</sub>+1)x<sup>3</sup> + (a<sub>3</sub>+1)x<sup>4</sup> + a<sub>4</sub>x<sup>5</sup> + a<sub>5</sub>x<sup>6</sup> + a<sub>6</sub>x<sup>7</sup>. In byte form, [a<sub>7</sub>a<sub>6</sub>a<sub>5</sub>a<sub>4</sub>a<sub>3</sub>a<sub>2</sub>a<sub>1</sub>a<sub>0</sub>] * [00000010] = [a<sub>6</sub>a<sub>5</sub>a<sub>4</sub>a<sub>3</sub>a<sub>2</sub>a<sub>1</sub>a<sub>0</sub>0] ^ [00011011], i.e. shift left one bit, then XOR with the modulus byte with its highest-order bit dropped. (<a target="_blank" rel="noopener" href="https://github.com/zhengchenyu/SimpleErasureCode/blob/24be76083a0c3172f1d2fe7af8e1ad972935657f/src/main/java/zcy/ec/coder/operator/GaloisFieldComputation.java#L88">a<sub>7</sub> is not 0</a>)</li>
</ul>
<blockquote>
<p>Subtraction here means adding the additive inverse, and in GF(2) every element is its own additive inverse.</p>
</blockquote>
<p>f(x) * g(x) = ((x<sup>6</sup> + x<sup>4</sup> + x<sup>2</sup> + x + 1) * (x<sup>7</sup> + x + 1)). We can decompose this into the sum of the following three products:</p>
<ul>
<li>(1) (x<sup>6</sup> + x<sup>4</sup> + x<sup>2</sup> + x + 1) * x<sup>7</sup></li>
<li>(2) (x<sup>6</sup> + x<sup>4</sup> + x<sup>2</sup> + x + 1) * x</li>
<li>(3) (x<sup>6</sup> + x<sup>4</sup> + x<sup>2</sup> + x + 1) * 1<br>Case (2) was covered above, and (3) needs no computation at all. For (1), we apply the method of (2) recursively (<a target="_blank" rel="noopener" href="https://github.com/zhengchenyu/SimpleErasureCode/blob/24be76083a0c3172f1d2fe7af8e1ad972935657f/src/main/java/zcy/ec/coder/operator/GaloisFieldComputation.java#L73">fxxn</a>). Adding the three partial results gives the final product (<a target="_blank" rel="noopener" href="https://github.com/zhengchenyu/SimpleErasureCode/blob/24be76083a0c3172f1d2fe7af8e1ad972935657f/src/main/java/zcy/ec/coder/operator/GaloisFieldComputation.java#L63">_mul</a>).</li>
</ul>
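The shift-and-XOR multiplication described above can be sketched as follows. xtime and gf_mul are hypothetical names, not the article's Java code; the sketch assumes the article's modulus x^8 + x^4 + x^3 + x + 1, i.e. 0x11B with the high bit included.

```python
def xtime(b):
    """Multiply a field element by x (the byte 0b00000010) in GF(2^8),
    reducing by x^8 + x^4 + x^3 + x + 1 (0x11B) when a7 was set."""
    b <<= 1
    if b & 0x100:       # case (2): a7 was 1, so reduce
        b ^= 0x11B      # equivalent to XOR with 0b00011011 after the shift
    return b & 0xFF

def gf_mul(f, g):
    """Multiply two field elements by decomposing g into powers of x."""
    result = 0
    while g:
        if g & 1:
            result ^= f  # field addition is XOR
        f = xtime(f)     # advance f to the next power of x
        g >>= 1
    return result

# f(x) = 0b01010111 and g(x) = 0b10000011, as in the text
print(hex(0x57 ^ 0x83))         # 0xd4 : f(x) + g(x)
print(hex(gf_mul(0x57, 0x83)))  # 0xc1 : f(x) * g(x)
```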
<p>At this point, every piece of the ErasureCode algorithm has been explained. This article described, top-down, how to go from theory to practice in implementing ErasureCode. The complete code is at <a target="_blank" rel="noopener" href="https://github.com/zhengchenyu/SimpleErasureCode">SimpleErasureCode</a>.</p>
<p>References:</p>
<ul>
<li>[1] William Stallings. Cryptography and Network Security: Principles and Practice, 6th Edition [M], Chapter 4</li>
<li>[2]<a target="_blank" rel="noopener" href="https://drmingdrmer.github.io/tech/distributed/2017/02/01/ec.html">https://drmingdrmer.github.io/tech/distributed/2017/02/01/ec.html</a></li>
</ul>
</div>
<footer class="article-footer">
<a data-url="http://example.com/2025/02/07/ErasuceCode%E7%AE%97%E6%B3%95%E6%8F%AD%E7%A7%98/" data-id="cuid5DFj29mQpLC0a7B5fVb3W" data-title="ErasuceCode算法揭秘" class="article-share-link"><span class="fa fa-share">Share</span></a>
<ul class="article-tag-list" itemprop="keywords"><li class="article-tag-list-item"><a class="article-tag-list-link" href="/tags/EC/" rel="tag">EC</a></li></ul>
</footer>
</div>
</article>
<article id="post-机器学习-2-8-监督学习之k近邻算法" class="h-entry article article-type-post" itemprop="blogPost" itemscope itemtype="https://schema.org/BlogPosting">
<div class="article-meta">
<a href="/2025/02/07/%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0-2-8-%E7%9B%91%E7%9D%A3%E5%AD%A6%E4%B9%A0%E4%B9%8Bk%E8%BF%91%E9%82%BB%E7%AE%97%E6%B3%95/" class="article-date">
<time class="dt-published" datetime="2025-02-07T08:06:18.199Z" itemprop="datePublished">2025-02-07</time>
</a>
</div>
<div class="article-inner">
<div class="e-content article-entry" itemprop="articleBody">
<hr>
<h2 id="title-机器学习-2-8-监督学习之k近邻算法date-2017-10-15-20-29-06tags-机器学习"><a href="#title-机器学习-2-8-监督学习之k近邻算法date-2017-10-15-20-29-06tags-机器学习" class="headerlink" title="title: 机器学习-2.8-监督学习之k近邻算法date: 2017-10-15 20:29:06tags: [机器学习]"></a>title: Machine Learning 2.8 - Supervised Learning: k-Nearest Neighbors<br>date: 2017-10-15 20:29:06<br>tags: [Machine Learning]</h2>
<h1 id="K近邻算法"><a href="#K近邻算法" class="headerlink" title="K近邻算法"></a>k-Nearest Neighbors</h1><p>k-nearest neighbors (k-NN) is a common supervised learning algorithm. Its mechanism is very simple: given a test sample, find the k training samples closest to it under some distance metric, then decide the class from those k samples, for example by taking the class that occurs most often among them.</p>
<p>Below we use k-NN to recognize handwritten digits. Each digit is a 32*32 bitmap, which we treat as a 1024-dimensional vector. The distance between any two samples is then simply the distance between two points in that 1024-dimensional space. We find the 10 samples closest to the test sample and take their most frequent class as the predicted class. The code is as follows:</p>
<pre><code class="language-python">import numpy as np
import os
from numpy import linalg

def makeTrain(dir):
    dirs = os.listdir(dir)
    files = filter(lambda item: not os.path.isdir(os.path.join(dir, item)), dirs)
    mat = []
    label = []
    for file in files:
        arr = []
        f = open(dir + "/" + file)
        while True:
            line = f.readline()
            if not line:
                break
            for i in range(len(line) - 1):  # skip the trailing '\n'
                arr.append(int(line[i]))
        f.close()
        mat.append(arr)
        label.append(int(file.split("_")[0]))
    return (np.array(mat), np.array(label))

def testClassify(dir):
    dirs = os.listdir(dir)
    files = filter(lambda item: not os.path.isdir(os.path.join(dir, item)), dirs)
    right = 0
    wrong = 0
    for file in files:
        arr = []
        f = open(dir + "/" + file)
        while True:
            line = f.readline()
            if not line:
                break
            for i in range(len(line) - 1):  # skip the trailing '\n'
                arr.append(int(line[i]))
        f.close()
        testLabel = file.split("_")[0]
        label = classify(np.array(arr), 10)
        if testLabel == label:
            right = right + 1
        else:
            wrong = wrong + 1
    print("right=", right, ", wrong=", wrong)

def classify(vec, k):
    # distance from the test vector to every training sample
    m = len(g_label)
    dis = []
    for i in range(m):
        dis.append([linalg.norm(vec - g_train[i]), g_label[i]])
    dis = sorted(dis, key=lambda v: v[0])
    # count the labels of the k nearest samples in a dict
    dic = {}
    for j in range(k):
        if not str(dis[j][1]) in dic:
            dic[str(dis[j][1])] = 1
        else:
            dic[str(dis[j][1])] = dic[str(dis[j][1])] + 1
    return max(dic.items(), key=lambda x: x[1])[0]

if __name__ == "__main__":
    # 1 build the training data
    (g_train, g_label) = makeTrain("/Users/zcy/Desktop/study/git/mlearning/res/trainingDigits1")
    # 2 test
    testClassify("/Users/zcy/Desktop/study/git/mlearning/res/testDigits1")
</code></pre>
<p>Of the 946 test samples, 926 were classified correctly and 20 incorrectly, an accuracy of 97.89%.</p>
</div>
<footer class="article-footer">
<a data-url="http://example.com/2025/02/07/%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0-2-8-%E7%9B%91%E7%9D%A3%E5%AD%A6%E4%B9%A0%E4%B9%8Bk%E8%BF%91%E9%82%BB%E7%AE%97%E6%B3%95/" data-id="cuid5zV4UwetSFyrDJdnGOIvW" data-title="" class="article-share-link"><span class="fa fa-share">Share</span></a>
</footer>
</div>
</article>
<article id="post-RSS-远程Merge的设计EN" class="h-entry article article-type-post" itemprop="blogPost" itemscope itemtype="https://schema.org/BlogPosting">
<div class="article-meta">
<a href="/2023/12/25/RSS-%E8%BF%9C%E7%A8%8BMerge%E7%9A%84%E8%AE%BE%E8%AE%A1EN/" class="article-date">
<time class="dt-published" datetime="2023-12-25T10:58:15.000Z" itemprop="datePublished">2023-12-25</time>
</a>
</div>
<div class="article-inner">
<header class="article-header">
<h1 itemprop="name">
<a class="p-name article-title" href="/2023/12/25/RSS-%E8%BF%9C%E7%A8%8BMerge%E7%9A%84%E8%AE%BE%E8%AE%A1EN/">RSS-Remote Merge Design</a>
</h1>
</header>
<div class="e-content article-entry" itemprop="articleBody">
<h1 id="1-Default-shuffle"><a href="#1-Default-shuffle" class="headerlink" title="1 Default shuffle"></a>1 Default shuffle</h1><blockquote>
<p>Note: The first chapter briefly introduces how the default shuffle works, in order to find where local disks are used before designing remote merge. If you are already familiar with this, you can skip this part.</p>
</blockquote>
<p>We will analyze the shuffle of MapReduce, Tez, and Spark in turn.</p>
<h2 id="1-1-MapReduce"><a href="#1-1-MapReduce" class="headerlink" title="1.1 MapReduce"></a>1.1 MapReduce</h2><p>The map writes records to memory. When memory exceeds the threshold, the in-memory data is spilled to a disk file, with records written in partitionid+key order. After the map has processed all records, it spills whatever is still in memory to a disk file, then reads all the spilled files and merges them in partitionid+key order to obtain the sorted records.</p>
<blockquote>
<p>Note: The purpose of sorting by partitionid is that when the reduce side fetches data from the map side, it can read as sequentially as possible. For MR, Tez, and Spark, whether or not records are sorted by key, as long as they are partitioned they need to be sorted by partitionid.</p>
</blockquote>
<p>The reduce pulls the records of its partition from each map, remotely or locally; each pulled chunk is called a MapOutput. Normally memory is used directly, but if memory exceeds the threshold these records are written to disk. The reduce then merges the MapOutputs with a min-heap k-way merge sort to obtain globally sorted records. During the merge, intermediate results may be spilled to disk when memory exceeds the threshold, and if too many files have been spilled, additional merges are triggered.</p>
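The reduce-side min-heap k-way merge can be illustrated with Python's heapq.merge. This is only a sketch of the idea with made-up data, not MapReduce's actual Java implementation: each MapOutput is already sorted, and the heap repeatedly emits the smallest head element among them.

```python
import heapq

# Each spilled MapOutput is already sorted by key; the reduce side streams
# them through a min-heap k-way merge to get one globally sorted run.
map_outputs = [
    [("a", 1), ("c", 3)],
    [("b", 2), ("c", 4)],
    [("a", 5), ("d", 6)],
]
merged = list(heapq.merge(*map_outputs, key=lambda kv: kv[0]))
print([k for k, _ in merged])  # ['a', 'a', 'b', 'c', 'c', 'd']
```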
<h2 id="1-2-Tez"><a href="#1-2-Tez" class="headerlink" title="1.2 Tez"></a>1.2 Tez</h2><p>There are two cases in Tez: (1) ordered io and (2) unordered io.</p>
<p>Ordered io is the same as in MapReduce, so we skip it here.</p>
<p>Unordered io is generally used for hash joins and other situations where keys need not be sorted. Unordered io adopts a write-and-consume approach: the map writes records directly to a file, possibly through a cache, and the reduce side can read and use the data as soon as it arrives.</p>
<h2 id="1-3-Spark"><a href="#1-3-Spark" class="headerlink" title="1.3 Spark"></a>1.3 Spark</h2><p>Spark’s shuffle is more complex and more flexible. Some tasks require neither sort nor combine, so Spark lets users determine the shuffle logic according to their needs.</p>
<h3 id="1-3-1-Shuffle-write-operation"><a href="#1-3-1-Shuffle-write-operation" class="headerlink" title="1.3.1 Shuffle write operation"></a>1.3.1 Shuffle write operation</h3><p>When writing shuffle data, three writers are supported:</p>
<ul>
<li>(1) BypassMergeSortShuffleWriter</li>
</ul>
<p>A temporary file is generated for each partition. When a record is written, its partition is found and the record is appended directly to the corresponding temporary file. When all data has been processed, the temporary files are written into one final file in partition order and then deleted.</p>
<ul>
<li>(2) UnsafeShuffleWriter</li>
</ul>
<p>UnsafeShuffleWriter implements its logic mainly through ShuffleExternalSorter. When a record is written, it is serialized immediately and the serialized bytes are copied into requested memory. At the same time, the record’s address and partition are recorded in memory (inMemSorter).</p>
<p>When memory reaches the threshold, a spill is performed. Based on the information in memory (inMemSorter), we can easily produce the records sorted by partition and write them to a file.</p>
<p>When all records have been processed, the records still in memory are spilled to a file. Finally, all spilled files are aggregated once: since each spilled file is already sorted by partition, the corresponding sections of all spilled files are copied into the final file in partition order. The resulting final file is sorted by partition.</p>
<ul>
<li>(3) SortShuffleWriter</li>
</ul>
<p>SortShuffleWriter mainly implements specific logic through ExternalSorter. ExternalSorter decides whether to combine and sort based on the user’s needs.</p>
<p>When a record is written, it is inserted directly into memory. If combine is required, the in-memory structure is a map; otherwise it is a buffer.</p>
<p>If the estimated memory use exceeds the threshold, a spill is triggered and the records are spilled to disk. This requires sorting, and the comparator depends on the user’s needs: if keyOrdering is set, records are sorted by key; if keyOrdering is not set but an aggregator (i.e. combine) is, keys are sorted by their hashcode, which groups equal keys together and makes combining easy; if neither is set, records are sorted by partition only.</p>
<p>When all records have been written, the spill files are read and merged into one globally ordered file.<br><br>Comparison of the three writers</p>
<table>
<thead>
<tr>
<th>writer</th>
<th>advantages</th>
<th>disadvantages</th>
<th>scene</th>
</tr>
</thead>
<tbody><tr>
<td>BypassMergeSortShuffleWriter</td>
<td>(1) Only serialized once. <br>(2) Using hashmap-like data structure, inserting data is fast.</td>
<td>(1) Combine and sort are not supported <br>(2) Each partition must generate a temporary file, which will generate too many temporary files.</td>
<td>Suitable for situations where the number of partitions is small (default is less than or equal to 200) and there is no combine.</td>
</tr>
<tr>
<td>UnsafeShuffleWriter</td>
<td>(1) Only serialized once. <br>(2) The number of files spilled to disk is limited and is no longer based on the number of partitions, and can support larger partitions.</td>
<td>(1) Combine and sort are not supported. <br>(2) The record write order is disrupted, so supportsRelocationOfSerializedObjects is required.</td>
<td>Applicable to situations where combine does not exist, and supportsRelocationOfSerializedObjects is true, and the maximum number of supported partitions is 16777216.</td>
</tr>
<tr>
<td>SortShuffleWriter</td>
<td>(1) Supports combine, sort <br> (2) Suitable for all scenarios <br> (3) The number of files spilled to disk is limited</td>
<td>(1) Multiple serializations are required</td>
<td>Suitable for all scenarios.</td>
</tr>
</tbody></table>
<h3 id="1-3-2-shuffle-read"><a href="#1-3-2-shuffle-read" class="headerlink" title="1.3.2 shuffle read"></a>1.3.2 shuffle read</h3><p>Currently there is only one implementation, BlockStoreShuffleReader, which is similar to MapReduce.<br>The reduce pulls the records of its partition from each map, remotely or locally. Normally the records go directly to memory, but if the size of a fetched block exceeds the threshold, disk is used.<br>Then, according to the user’s needs, the records are combined and/or sorted, finally forming the record iterator the user requires.<br>Combine and sort use ExternalAppendOnlyMap and ExternalSorter respectively; when memory exceeds the threshold, the data is spilled to local disk.</p>
<h2 id="1-4-Summary"><a href="#1-4-Summary" class="headerlink" title="1.4 Summary"></a>1.4 Summary</h2><p>(1) About the semantics of each framework</p>
<p>For MapReduce and Tez’s ordered io, shuffle is a special case of Spark’s sorted shuffle; Tez’s unordered io is essentially a special case of Spark’s unsorted shuffle. In essence the semantics of the frameworks are the same, with Spark’s being the most general.</p>
<p>(2) Where are local disk files generated?</p>
<p>After analyzing the three computing frameworks, we learned that the following processes will use disks:</p>
<ul>
<li>(1) Map may generate intermediate temporary files because the memory exceeds the threshold.</li>
<li>(2) The map will eventually generate disk files to provide shuffle services.</li>
<li>(3) When reduce pulls records, disk files may be generated because the threshold is exceeded.</li>
<li>(4) When merging on the reduce side, temporary disk files may be generated for global sorting.</li>
</ul>
<p>Uniffle has already solved (1) and (2). For (3), with well-tuned parameters it is hard to generate disk files. So only (4) needs to be discussed in this article.</p>
<h1 id="2-Plans"><a href="#2-Plans" class="headerlink" title="2 Plans"></a>2 Plans</h1><p>To solve the problem that the reduce-side merge may spill to disk, there are two main solutions:</p>
<ul>
<li>(1) Merge on Shuffle Server</li>
<li>(2) Reduce side Merge on demand</li>
</ul>
<h2 id="2-1-Option-1-Merge-on-ShuffleServer"><a href="#2-1-Option-1-Merge-on-ShuffleServer" class="headerlink" title="2.1 Option 1: Merge on ShuffleServer"></a>2.1 Option 1: Merge on ShuffleServer</h2><p>Move the merge process of reduce to the ShuffleServer side. ShuffleServer merges the locally sorted records sent from the map side into one globally sorted record sequence, and the reduce side reads directly in that order.</p>
<ul>
<li>Advantages: Does not require much memory or many network RPCs.</li>
<li>Disadvantages: ShuffleServer needs to parse the Key, Value and Comparator, and cannot perform combine.</li>
</ul>
<h2 id="2-2-Option-2-On-demand-Merge-on-the-Reduce-side"><a href="#2-2-Option-2-On-demand-Merge-on-the-Reduce-side" class="headerlink" title="2.2 Option 2: On-demand Merge on the Reduce side"></a>2.2 Option 2: On-demand Merge on the Reduce side</h2><img src="/images/rss/on_demand_merge.png" width=50% height=50% text-align=center/>
Since memory on the reduce side is limited, merging there must avoid spilling data to disk. So when reduce fetches segments, it reads only part of each segment into a buffer and merges all those buffers. Whenever the buffer of some segment is exhausted, the next buffer of that segment is fetched and added to the ongoing merge.
The problem is that the number of times the reduce side reads data from ShuffleServer is approximately segments_num * (segment_size / buffer_size), which is a large value for big jobs. Too many RPCs means degraded performance.
<blockquote>
<p>The segment here refers to the sorted record collection, which can be understood as the block in which the records have been sorted according to key.</p>
</blockquote>
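<p>The RPC estimate above can be computed concretely. The sizes below are hypothetical, chosen only to show the order of magnitude for a large job.</p>

```python
def estimated_rpcs(segments_num, segment_size, buffer_size):
    # Each reduce re-reads every segment buffer-by-buffer, so the read count
    # is roughly segments_num * (segment_size / buffer_size).
    return segments_num * (segment_size // buffer_size)

# 5000 map segments of 256 MiB each, fetched in 1 MiB buffers:
print(estimated_rpcs(5000, 256 * 1024 * 1024, 1024 * 1024))  # 1280000
```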
<ul>
<li>Advantages: Shuffle Server does not need to do anything extra.</li>
<li>Disadvantages: Too many RPCs.</li>
</ul>
<p><strong>This article chooses option 1, and the following content mainly discusses option 1.</strong></p>
<h1 id="3-Demand-analysis"><a href="#3-Demand-analysis" class="headerlink" title="3 Demand analysis"></a>3 Demand analysis</h1><h2 id="3-1-What-types-of-tasks-require-remote-merge"><a href="#3-1-What-types-of-tasks-require-remote-merge" class="headerlink" title="3.1 What types of tasks require remote merge?"></a>3.1 What types of tasks require remote merge?</h2><p>Currently, uniffle’s map side no longer spills to disk, so this article mainly considers the reduce side. The situations divide as follows:</p>
<ul>
<li>(1) Spark’s non-sorted, non-aggregated shuffle and Tez unordered io need no global aggregation or sorting and require very little memory. The current version of uniffle will not touch disk if the relevant settings are reasonable, so the existing uniffle solution suffices. This article does not discuss this case.</li>
<li>(2) Spark sorting or aggregation tasks, Tez ordered io and MapReduce need global sorting or aggregation; memory may be insufficient and records may be spilled to disk. This article mainly discusses this case.<br><strong>In summary, remote merge is only used for shuffles that require sorting or aggregation.</strong></li>
</ul>
<h2 id="3-2-How-does-ShuffleServer-sort"><a href="#3-2-How-does-ShuffleServer-sort" class="headerlink" title="3.2 How does ShuffleServer sort?"></a>3.2 How does ShuffleServer sort?</h2><p>For sorting, the map side generally sorts its output to obtain a set of partially sorted records, called a segment here. Reduce then obtains all segments and merges them; Spark, MR and Tez all use minimum-heap K-way merge sort. Remote sorting can use the same method.</p>
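<p>The minimum-heap K-way merge mentioned above can be sketched in a few lines of Python (a simplified illustration, not ShuffleServer code): the heap holds one candidate record per segment, and repeatedly popping the smallest yields a globally sorted stream.</p>

```python
import heapq

def k_way_merge(segments):
    """Merge locally sorted segments into one globally sorted stream."""
    iters = [iter(s) for s in segments]
    heap = []
    for i, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first, i))   # one candidate per segment
    while heap:
        record, i = heapq.heappop(heap)
        yield record
        nxt = next(iters[i], None)             # refill from the same segment
        if nxt is not None:
            heapq.heappush(heap, (nxt, i))

print(list(k_way_merge([[1, 4, 7], [2, 5, 8], [3, 6, 9]])))
# [1, 2, 3, 4, 5, 6, 7, 8, 9]
```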
<p>BufferManager and FlushManager maintain block information in memory and disk. We only need to add MergeManager to ShuffleServer and merge the blocks under the same Shuffle to obtain globally sorted Records.</p>
<p>Introducing sorting on the ShuffleServer produces a side effect: the Shuffle’s KeyClass, ValueClass and KeyComparator need to be passed to ShuffleServer.</p>
<h2 id="3-3-How-does-ShuffleServer-combine"><a href="#3-3-How-does-ShuffleServer-combine" class="headerlink" title="3.3 How does ShuffleServer combine?"></a>3.3 How does ShuffleServer combine?</h2><p>Combine is generally a user-defined operation, so ShuffleServer is prohibited from performing it. But if combine is performed on the reduce side, wouldn’t that violate our goal of avoiding spills to disk on the task side? In fact, we do not have to use ExternalAppendOnlyMap for combine: if the records obtained from ShuffleServer are sorted by key, the same keys are already grouped together, and only a small amount of memory is needed to combine them.</p>
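<p>The point that sorted input makes combine cheap can be shown with a short sketch (illustrative Python, not the actual reduce-side code): since equal keys are adjacent in a key-sorted stream, each key group can be combined and emitted immediately, so memory never holds more than one group.</p>

```python
import itertools

def streaming_combine(sorted_records, combine):
    """Combine a key-sorted stream of (key, value) pairs group by group."""
    for key, group in itertools.groupby(sorted_records, key=lambda kv: kv[0]):
        values = (v for _, v in group)
        acc = next(values)
        for v in values:
            acc = combine(acc, v)
        yield key, acc

records = [("a", 1), ("a", 2), ("b", 5), ("c", 1), ("c", 1)]
print(list(streaming_combine(records, lambda x, y: x + y)))
# [('a', 3), ('b', 5), ('c', 2)]
```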
<h2 id="3-4-How-does-Writer-write"><a href="#3-4-How-does-Writer-write" class="headerlink" title="3.4 How does Writer write?"></a>3.4 How does Writer write?</h2><p>The writer writes exactly as it does today; no changes are needed.</p>
<h2 id="3-5-How-does-Reader-read"><a href="#3-5-How-does-Reader-read" class="headerlink" title="3.5 How does Reader read?"></a>3.5 How does Reader read?</h2><p>Currently, Uniffle’s shuffle reader uses the blockid as the read marker, which makes it easy to verify that accurate and complete records were obtained. For remote merge, MergeManager has merged the original block collection into a new sequence of records sorted by key, so the blockids generated by the map side can no longer be used.<br>We therefore use a new way to read records: when MergeManager performs the global merge, an index is generated, and the reader reads according to this index.</p>
<blockquote>
<p>Note: In principle, using the key as the read index is more semantic, and the first version of the demo program was implemented that way. However, that proposal did not handle data skew well, so the plan was abandoned.</p>
</blockquote>
<h1 id="4-Scheme-Design"><a href="#4-Scheme-Design" class="headerlink" title="4 Scheme Design"></a>4 Scheme Design</h1><h2 id="4-1-Basic-procedures-for-RemoteMerge"><a href="#4-1-Basic-procedures-for-RemoteMerge" class="headerlink" title="4.1 Basic procedures for RemoteMerge"></a>4.1 Basic procedures for RemoteMerge</h2><img src="/images/rss/remote_merge_structure.png" width=50% height=50% text-align=center/>
<p>The following introduces the process of Remote Merge:</p>
<ul>
<li>(1) Register<br>When the AM/Driver calls the registerShuffle method, it additionally registers the keyClass, valueClass and keyComparator. This information is mainly used by ShuffleServer to parse and sort records during merging.</li>
<li>(2) sendShuffleData<br>The sendShuffleData logic is basically consistent with existing RSS tasks. The only difference is the use of unified serializers and deserializers, which ensures ShuffleServer can parse records regardless of the framework.</li>
<li>(3) buffer and flush<br>The shuffle server stores the data in its cache, or flushes it to the local or a remote file system through the flush manager. The original ShuffleServer logic is reused here.</li>
<li>(4) reportUniqueBlocks<br>A new API, reportUniqueBlocks, is provided. The reduce side deduplicates the blocks generated by the maps and sends the valid block set to ShuffleServer through reportUniqueBlocks. After ShuffleServer receives the valid block set, it triggers the remote merge. The results are written to the bufferPool like normal blocks and may be flushed to disk when necessary. Each result block is an ordinary block, but for convenience it is called a merged block here. Merged blocks record the results sorted by key, so they must be read in ascending blockid order.</li>
<li>(5) getSortedShuffleData<br>Reduce reads merged blocks in the order of their block serial numbers, then decides, based on certain conditions, when to use them for the reduce computation.</li>
</ul>
<h2 id="4-2-Analyze-the-process-from-the-perspective-of-Record"><a href="#4-2-Analyze-the-process-from-the-perspective-of-Record" class="headerlink" title="4.2 Analyze the process from the perspective of Record"></a>4.2 Analyze the process from the perspective of Record</h2><p>We can use WordCount as an example to explain the flow of record in the entire process. In this example, there are two partitions and one reduce, that is, one reduce processes the data of two partitions.</p>
<img src="/images/rss/remote_merge_from_record_perspective.jpg" width=50% height=50% text-align=center/>
<ul>
<li>(1) Map side<br>After the map side processes the document data, it sorts the records. For Map1, since there are two partitions, records with odd keys are written to block101 and records with even keys to block102. The same goes for Map2. Note that the records in these blocks are already sorted.</li>
<li>(2) Map side sends data<br>The map side sends the blocks to ShuffleServer through sendShuffleData, and ShuffleServer stores them in the bufferPool.<br>Note that at registration time, besides the app named APP1, an app named APP1@RemoteMerge is also registered; it is introduced below.</li>
<li>(3) ShuffleServer-side merge<br>After reduce starts, it calls reportUniqueBlocks to report the available block set, which triggers the corresponding partition in ShuffleServer to merge. The result of the merge is the globally sorted record collection of that partition.<br>Where are the merge results stored? The merge happens in memory: whenever a certain number of records have been merged, the results are written to a new block. To distinguish them from the original appid, this group of blocks is managed under an appid ending with “@RemoteMerge”. The blockids of these new blocks increase from 1 and reflect the global sort order: the records inside each block are sorted, and the records in the block with blockid=1 are all less than or equal to the records in the block with blockid=2.</li>
<li>(4) Reduce-side reading<br>According to the previous analysis, the reduce side only needs to read the blocks managed by the appid ending with “@RemoteMerge”, starting from blockid=1 and proceeding in blockid order. Reduce computes in order, and since the data obtained from ShuffleServer is already sorted, only a small amount of data needs to be fetched at a time. This enables on-demand reading from ShuffleServer, which greatly reduces memory usage.<br>There are two special situations here, detailed in 5.5.</li>
</ul>
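<p>Step (3) above can be sketched as follows (an illustrative Python model, not MergeManager itself): key-sorted input blocks are merged into one globally sorted stream, which is then cut into new blocks whose blockids increase from 1.</p>

```python
import heapq

def remote_merge(blocks, records_per_block):
    """Merge sorted blocks and re-cut the result into numbered merged blocks."""
    stream = heapq.merge(*blocks)   # every input block is already sorted by key
    merged_blocks = {}              # blockid -> records (managed under APP1@RemoteMerge)
    blockid, current = 1, []
    for record in stream:
        current.append(record)
        if len(current) == records_per_block:
            merged_blocks[blockid] = current
            blockid, current = blockid + 1, []
    if current:
        merged_blocks[blockid] = current
    return merged_blocks

# Partition-0 blocks from Map1 and Map2, as (key, count) records sorted by key:
block101 = [(1, 1), (3, 2), (5, 1)]
block201 = [(1, 2), (3, 1), (7, 1)]
merged = remote_merge([block101, block201], records_per_block=4)
print(merged[1])  # [(1, 1), (1, 2), (3, 1), (3, 2)]
print(merged[2])  # [(5, 1), (7, 1)]
```

Reading blockid 1, 2, … in ascending order then yields the globally sorted record sequence.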
<h1 id="5-Plan"><a href="#5-Plan" class="headerlink" title="5 Plan"></a>5 Plan</h1><h2 id="5-1-Unified-serializer"><a href="#5-1-Unified-serializer" class="headerlink" title="5.1 Unified serializer"></a>5.1 Unified serializer</h2><p>Since Merge needs to be performed on the ShuffleServer side, a unified serializer that is independent of the computing framework needs to be extracted. Two types of serializers are extracted here: (1) Writable (2) Kryo. Writable serialization is used for classes that handle the org.apache.hadoop.io.Writable interface, used in the MR and TEZ frameworks. Kryo can serialize most classes and is generally used in the Spark framework.</p>
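<p>The essence of a framework-independent record format can be shown with a small sketch (a hypothetical length-prefixed layout in Python, not the actual Writable/Kryo serializers): as long as both sides agree on the framing, ShuffleServer can split a byte stream back into records without knowing the framework that produced it.</p>

```python
import struct

def write_record(buf, key_bytes, value_bytes):
    """Append one record as [key_len][key][value_len][value]."""
    buf += struct.pack(">i", len(key_bytes)) + key_bytes
    buf += struct.pack(">i", len(value_bytes)) + value_bytes
    return buf

def read_records(buf):
    """Yield (key, value) byte pairs back out of the framed buffer."""
    offset = 0
    while offset < len(buf):
        (klen,) = struct.unpack_from(">i", buf, offset); offset += 4
        key = buf[offset:offset + klen]; offset += klen
        (vlen,) = struct.unpack_from(">i", buf, offset); offset += 4
        value = buf[offset:offset + vlen]; offset += vlen
        yield key, value

buf = b""
buf = write_record(buf, b"hello", b"1")
buf = write_record(buf, b"world", b"2")
print(list(read_records(buf)))  # [(b'hello', b'1'), (b'world', b'2')]
```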
<h2 id="5-2-RecordsFileWriter-RecordFileReader"><a href="#5-2-RecordsFileWriter-RecordFileReader" class="headerlink" title="5.2 RecordsFileWriter/RecordFileReader"></a>5.2 RecordsFileWriter/RecordFileReader</h2><p>Provides abstract methods for processing Records</p>
<h2 id="5-3-Merger"><a href="#5-3-Merger" class="headerlink" title="5.3 Merger"></a>5.3 Merger</h2><p>Provides basic Merger service to merge multiple data streams according to key. Minimum heap K-way merge sorting is used to merge and sort the data streams that have been partially sorted.</p>
<h2 id="5-4-MergeManager"><a href="#5-4-MergeManager" class="headerlink" title="5.4 MergeManager"></a>5.4 MergeManager</h2><p>Used to merge Records on the server side.</p>
<h2 id="5-5-Tools-for-reading-sorted-data"><a href="#5-5-Tools-for-reading-sorted-data" class="headerlink" title="5.5 Tools for reading sorted data"></a>5.5 Tools for reading sorted data</h2><p>Generally speaking, when the reduce side reads data, it can feed it directly to the downstream computation. But there are two special situations:<br>(1) When combine is needed during the merge, we must wait for all records of the same key to arrive, combine them, and only then send the result downstream.<br>(2) For Spark and Tez, a reduce may read data from multiple partitions, so we need to merge the data of those partitions once more on the reduce side before sending it downstream.<br>RMRecordsReader is the tool for reading sorted data. The general structure is as follows:<br><img src="/images/rss/rm_records_reader.png" width=50% height=50% text-align=center/><br>The figure depicts a single reduce processing two partitions. A RecordsFetcher thread reads the blocks of its partition, parses them into records, and sends them to a combineBuffer. A RecordCombiner reads records from the combineBuffer and, once all records of a key have been collected, performs the combine operation; the result is sent to a mergedBuffer. RecordMerger takes all mergedBuffers and merge-sorts them once more in memory, producing the globally sorted result for downstream use.</p>
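<p>The two stages described above can be modeled with a small sketch (illustrative Python, single-threaded where the real RMRecordsReader uses fetcher/combiner/merger threads): each partition stream is combined group by group, then the combined streams are merge-sorted once more.</p>

```python
import heapq
import itertools

def combine_partition(sorted_records, combine):
    # RecordCombiner stage: equal keys are adjacent in a sorted stream, so a
    # key can be combined and emitted as soon as its group ends.
    for key, group in itertools.groupby(sorted_records, key=lambda kv: kv[0]):
        acc = None
        for _, v in group:
            acc = v if acc is None else combine(acc, v)
        yield key, acc

def rm_records_read(partitions, combine):
    # RecordMerger stage: merge the already sorted, combined partition streams.
    combined = [combine_partition(p, combine) for p in partitions]
    return heapq.merge(*combined, key=lambda kv: kv[0])

# One reduce consuming two partitions of (key, count) records:
p0 = [(2, 1), (2, 3), (4, 1)]
p1 = [(1, 2), (3, 1), (3, 1)]
print(list(rm_records_read([p0, p1], lambda a, b: a + b)))
# [(1, 2), (2, 4), (3, 2), (4, 1)]
```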
<h2 id="5-6-Framework-adaptation"><a href="#5-6-Framework-adaptation" class="headerlink" title="5.6 Framework adaptation"></a>5.6 Framework adaptation</h2><p>Compatible with MR, Tez, and Spark architectures.</p>
<blockquote>
<p>I have used online applications to conduct large-scale stress testing on MR and Tez. Spark has only been tested on some basic examples so far and still needs extensive testing.</p>
</blockquote>
<h2 id="5-7-Isolated-classloader"><a href="#5-7-Isolated-classloader" class="headerlink" title="5.7 Isolated classloader"></a>5.7 Isolated classloader</h2><p>For different versions of keyclass, valueclass and comparatorclass, use isolated classloaders to load them.</p>
<blockquote>
<p>Chinese version document: <a target="_blank" rel="noopener" href="https://zhengchenyu.github.io/2023/12/25/RSS-%E8%BF%9C%E7%A8%8BMerge%E7%9A%84%E8%AE%BE%E8%AE%A1/">https://zhengchenyu.github.io/2023/12/25/RSS-%E8%BF%9C%E7%A8%8BMerge%E7%9A%84%E8%AE%BE%E8%AE%A1/</a></p>
</blockquote>
</div>
<footer class="article-footer">
<a data-url="http://example.com/2023/12/25/RSS-%E8%BF%9C%E7%A8%8BMerge%E7%9A%84%E8%AE%BE%E8%AE%A1EN/" data-id="cuidw-SriAesP7L5rq8ZzV61k" data-title="RSS-Remote Merge Design" class="article-share-link"><span class="fa fa-share">Share</span></a>
<ul class="article-tag-list" itemprop="keywords"><li class="article-tag-list-item"><a class="article-tag-list-link" href="/tags/RSS/" rel="tag">RSS</a></li></ul>
</footer>
</div>
</article>
<article id="post-RSS-远程Merge的设计" class="h-entry article article-type-post" itemprop="blogPost" itemscope itemtype="https://schema.org/BlogPosting">
<div class="article-meta">
<a href="/2023/12/25/RSS-%E8%BF%9C%E7%A8%8BMerge%E7%9A%84%E8%AE%BE%E8%AE%A1/" class="article-date">
<time class="dt-published" datetime="2023-12-25T10:58:15.000Z" itemprop="datePublished">2023-12-25</time>
</a>
</div>
<div class="article-inner">
<header class="article-header">
<h1 itemprop="name">
<a class="p-name article-title" href="/2023/12/25/RSS-%E8%BF%9C%E7%A8%8BMerge%E7%9A%84%E8%AE%BE%E8%AE%A1/">RSS-远程Merge的设计</a>
</h1>
</header>
<div class="e-content article-entry" itemprop="articleBody">
<h1 id="1-默认的shuffle"><a href="#1-默认的shuffle" class="headerlink" title="1 默认的shuffle"></a>1 默认的shuffle</h1><blockquote>
<p>注: 第一节简单介绍了主流框架默认的shuffle的原理,目的是为了说明哪里会使用本地磁盘,从而设计远程Merge。如果足够了解,可以忽略这部分内容。</p>
</blockquote>
<p>我们依次对MapReduce, Tez, Spark的shuffle进行分析。</p>
<h2 id="1-1-MapReduce"><a href="#1-1-MapReduce" class="headerlink" title="1.1 MapReduce"></a>1.1 MapReduce</h2><p>Map将Record写入到内存,当内存超过阈值的时候,会将内存的数据spill到磁盘文件中,按照partitionid+key的顺序依次将Record写入到磁盘文件。当Map处理完所有Record后, 会将当前内存中的数据spill到磁盘文件。然后读取所有的spill到磁盘的文件,并按照partitionid+key的顺序进行merge,得到排序的Records。</p>
<blockquote>
<p>注: 按照partitionid排序的目的是Reduce端从Map端获取数据的时候,尽可能顺序读。对于MR、Tez、Spark, 无论是否排序,只要有分区,都需要按照partitionid进行排序。</p>
</blockquote>
<p>Reduce端会从Map端以远程或本地的方式拉取对应分区的Records,称之为MapOutput。正常情况下会直接使用内存,如果内存超过阈值会将这些Records写入到磁盘。然后Reduce端会对所有MapOutput使用最小堆K路归并排序进行Merge操作,得到全局排序的Records。Merge的过程中,可能会因为内存超过阈值,将临时的结果spill到磁盘。另外如果spill到磁盘的文件过多,也会触发额外的merge。</p>
<h2 id="1-2-Tez"><a href="#1-2-Tez" class="headerlink" title="1.2 Tez"></a>1.2 Tez</h2><p>Tez的情况略微复杂。Tez分为ordered io和unordered io。</p>
<p>ordered io与MapReduce相同,不再展开分析。</p>
<p>unordered io一般用于hashjoin等不需要按key排序的情况。非排序的io采用来之即用方案:Map直接将Record写入文件,或者先经过缓存再写入文件;Reduce端读数据的时候也是读之即用。</p>
<h2 id="1-3-Spark"><a href="#1-3-Spark" class="headerlink" title="1.3 Spark"></a>1.3 Spark</h2><p>spark的shuffle更复杂,也更合理。部分任务是不需要sort和combine的,因此spark用户可以按需求决定shuffle的逻辑。</p>
<h3 id="1-3-1-Shuffle写操作"><a href="#1-3-1-Shuffle写操作" class="headerlink" title="1.3.1 Shuffle写操作"></a>1.3.1 Shuffle写操作</h3><p>写shuffle数据的时候,支持三种writer:</p>
<ul>
<li>(1) BypassMergeSortShuffleWriter</li>
</ul>
<p>为每个partition都生成一个临时文件。在写Record的时候,找到对应的分区直接写入对应的临时文件。然后当所有数据都处理完成后,将这些临时的文件按照分区的顺序依次写入到一个最终的文件,并删除临时文件。</p>
<ul>
<li>(2) UnsafeShuffleWriter</li>
</ul>
<p>UnsafeShuffleWriter主要通过ShuffleExternalSorter来实现具体的逻辑。当写Record的时候,直接序列化操作,并将序列化的字节拷贝到申请的内存中。同时也会将Record的地址和分区记录到内存中(inMemSorter)。</p>
<p>当内存的Record达到阈值的时候,会进行spill操作。根据内存(inMemSorter)中的信息,我们很容易得到一个按照分区排序的Record,并写到文件中。</p>
<p>当处理完所有Record,我们会把当前内存中的Records spill到文件中。最后对所有spilled文件进行一次聚合。由于之前spilled的文件已经是按照分区排序的,所以我们可以按照分区的顺序依次将所有spill文件的对应字节拷贝到最终文件。这样得到的最终文件即为按分区排序的文件。</p>
<ul>
<li>(3) SortShuffleWriter</li>
</ul>
<p>SortShuffleWriter主要通过ExternalSorter来实现具体逻辑。ExternalSorter根据用户的需求决定是否combine和sort。</p>
<p>当写Record的时候,会直接插入到内存中。如果需要combine,内存架构是map,否则是buffer。</p>
<p>如果当前评估内存大于阈值会触发spill操作。spill操作的时候,会将Record排序,然后spill到磁盘。具体的比较器会根据用户需求的不同使用不同的值。如果设置了keyordering会按照key进行排序。如果没有设置keyordering,但是设置了aggregator(即combine),则按照key的hashcode进行排序,这样保证相同的key组织在一起,便于combine操作。如果keyordering和aggregator都没有设置,则会按照partition进行排序。</p>
<p>所有Record都写完时,需要读取spill的文件,并合并成一个全局有序的文件。</p>
<p>三种writer的比较</p>
<table>
<thead>
<tr>
<th>writer</th>
<th>优点</th>
<th>缺点</th>
<th>场景</th>
</tr>
</thead>
<tbody><tr>
<td>BypassMergeSortShuffleWriter</td>
<td>(1) 只经过一次序列化。<br>(2) 采用类hashmap的数据结构,插入数据快。</td>
<td>(1) 不支持combine <br>(2) 每个分区都要对应生成一个临时文件,会产生过多的临时文件。</td>
<td>适合分区数较少(默认小于等于200)且没有combine的情况。</td>
</tr>
<tr>
<td>UnsafeShuffleWriter</td>
<td>(1) 只经过一次序列化。<br>(2) spill到磁盘的文件数目有限,不再基于分区数,可以支持更大的分区。</td>
<td>(1) 不支持combine <br>(2) 写入顺序Record顺序会打乱,要求supportsRelocationOfSerializedObjects。</td>
<td>适用于没有combine的情况,且支持supportsRelocationOfSerializedObjects,并且支持最大支持分区数为16777216。</td>
</tr>
<tr>
<td>SortShuffleWriter</td>
<td>(1) 支持combine <br> (2) 适合于所有场景 <br> (3) spill到磁盘的文件数目有限</td>
<td>(1) 需要进行多次序列化</td>
<td>适用于所有场景。</td>
</tr>
</tbody></table>
<h3 id="1-3-2-shuffle读"><a href="#1-3-2-shuffle读" class="headerlink" title="1.3.2 shuffle读"></a>1.3.2 shuffle读</h3><p>当前只有BlockStoreShuffleReader一个实现。实现与MapReduce类似。<br>Reduce端会从Map端以远程或本地的方式拉取对应分区的Records。正常情况下会直接写到内存中,但如果要获取的block大小超过阈值则会使用磁盘。<br>然后会根据用户的需求决定是否进行combine或sort,最终形成一个用户要求的record iterator。<br>combine和sort分别使用了ExternalAppendOnlyMap和ExternalSorter,当内存超过阈值后,会将数据spill到本地磁盘中。</p>
<h2 id="1-4-总结"><a href="#1-4-总结" class="headerlink" title="1.4 总结"></a>1.4 总结</h2><p>(1) 关于各个框架语义</p>
<p>对于MapReduce和Tez的ordered io, 实质就是spark的排序的特例。对于Tez的unordered io,实质上就是spark的非排序的特例。实质各个框架上语义是相同的, spark更加泛化。</p>
<p>(2) 哪里会产生本地磁盘文件? </p>
<p>分析三种计算框架后,我们得知如下过程会使用磁盘:</p>
<ul>
<li>(1) Map由于内存超越阈值,可能会产生中间的临时文件。</li>
<li>(2) Map端最终必然会产生磁盘文件,用于提供shuffle服务。</li>
<li>(3) Reduce拉取records时候,可能因为超越阈值,产生磁盘文件。</li>
<li>(4) Reduce端Merge的时候,可能会产生临时的磁盘文件,用于全局排序。</li>
</ul>
<p>而事实上, uniffle已经解决(1), (2)。对于(3), 如果有效的调整参数,是很难产生磁盘文件的。事实上只有(4)是本文需要讨论的。</p>
<h1 id="2-方案的选择"><a href="#2-方案的选择" class="headerlink" title="2 方案的选择"></a>2 方案的选择</h1><p>为了解决在Reduce端Merge可能会spill到磁盘的问题,主要有两个方案:</p>
<ul>
<li>(1) Shuffle Server端进行Merge</li>
<li>(2) Reduce端按需Merge</li>
</ul>
<h2 id="2-1-方案1-ShuffleServer端进行Merge"><a href="#2-1-方案1-ShuffleServer端进行Merge" class="headerlink" title="2.1 方案1: ShuffleServer端进行Merge"></a>2.1 方案1: ShuffleServer端进行Merge</h2><p>将Reduce的Merge过程移到ShuffleServer端,ShuffleServer会对Map端发来的局部排序后的Records进行Merge,合并成一个全局排序的Records序列。Reduce端直接按照Records序列的顺序读取。</p>
<ul>
<li>优点: 不需要过多内存和网络RPC。</li>
<li>缺点: Shuffle Server端需要解析Key, Value和Comparator。Shuffle端不能combine。</li>
</ul>
<h2 id="2-2-方案2-Reduce端按需Merge"><a href="#2-2-方案2-Reduce端按需Merge" class="headerlink" title="2.2 方案2: Reduce端按需Merge"></a>2.2 方案2: Reduce端按需Merge</h2><img src="/images/rss/on_demand_merge.png" width=50% height=50% text-align=center/>
由于Reduce端内存有限,为了避免在Reduce端进行Merge的时候spill数据到磁盘,Reduce在获取segment时只能读取每个segment的部分buffer,然后对所有buffer进行Merge。当某一个segment的部分buffer读取完成,会继续读取这个segment的下一块buffer,将这块buffer继续加到merge过程中。
这样有一个问题,Reduce端从ShuffleServer读取数据的次数大约为segments_num * (segment_size / buffer_size),对于大任务这是一个很大的值。过多的RPC意味着性能的下降。
<blockquote>
<p>这里的segment是指排序后record集合,可以理解为record已经按照key排序后的block。</p>
</blockquote>
<ul>
<li>优点: Shuffle Server不需要做额外的任何事情。</li>
<li>缺点: 过多的RPC。</li>
</ul>
<p><strong>本文选择方案1,接下来的内容主要针对于方案1进行讨论。</strong></p>
<h1 id="3-需求分析"><a href="#3-需求分析" class="headerlink" title="3 需求分析"></a>3 需求分析</h1><h2 id="3-1-哪类任务需要远程merge"><a href="#3-1-哪类任务需要远程merge" class="headerlink" title="3.1 哪类任务需要远程merge?"></a>3.1 哪类任务需要远程merge?</h2><p>当前uniffle的map端操作已经不再需要磁盘操作。本文主要考虑reduce端的情况。主要分如下几种情况:</p>
<ul>
<li>(1) 对于spark的非排序且非聚集、tez unordered io,Record是来之即用的,不需要有任何的全局的聚合和排序操作,只需要非常少的内存。当前版本的uniffle在内存设置合理的情况下是不会使用磁盘的。使用当前uniffle的方案即可。本文不会讨论这方面的内容。</li>
<li>(2) 对于spark的排序或聚集任务、tez ordered io、mapreduce,由于需要全局排序或聚集,内存可能不够用,可能会将Record spill到磁盘。本文主要讨论这种情况。<br><strong>综上可知,remote merge仅用于需要排序或聚集的shuffle。</strong></li>
</ul>
<h2 id="3-2-ShuffleServer如何进行排序"><a href="#3-2-ShuffleServer如何进行排序" class="headerlink" title="3.2 ShuffleServer如何进行排序?"></a>3.2 ShuffleServer如何进行排序?</h2><p>对于排序类的操作,一般会在Map进行排序得到一组局部排序的记录,这里称之为segment。然后Reduce会获取所有的segment, 并进行归并,Spark, MR, Tez都是用了最小堆K路归并排序。远程排序依旧可以使用这种方式。</p>
<p>BufferManager和FlushManager维护着block在内存和磁盘中的信息。我们只需要在需要在ShuffleServer中新增MergeManager,并将同一个Shuffle下的block进行Merge,得到全局排序的Records即可。</p>
<p>在ShuffleServer端引入排序后产生一个副作用: 即需要将该Shuffle的KeyClass和ValueClass以及KeyComparator传递给ShuffleServer。</p>
<h2 id="3-3-ShuffleServer禁止combine"><a href="#3-3-ShuffleServer禁止combine" class="headerlink" title="3.3 ShuffleServer禁止combine"></a>3.3 ShuffleServer禁止combine</h2><p>Combine一般都是用户自定义的操作,因此禁止ShuffleServer端进行Combine操作。</p>
<h1 id="4-架构设计"><a href="#4-架构设计" class="headerlink" title="4 架构设计"></a>4 架构设计</h1><h2 id="4-1-RemoteMerge的基本流程"><a href="#4-1-RemoteMerge的基本流程" class="headerlink" title="4.1 RemoteMerge的基本流程"></a>4.1 RemoteMerge的基本流程</h2><img src="/images/rss/remote_merge_structure.png" width=50% height=50% text-align=center/>
<p>下面介绍一下Remote Merge的流程:</p>
<ul>
<li>(1) 注册<br>AM/Driver调用registerShuffle方法,会额外注册keyClass, valueClass和keyComparator. 这些信息主要用于ShuffleServer在合并的时候对Record进行解析和排序。</li>
<li>(2) sendShuffleData<br>sendShuffleData逻辑与现有的RSS任务基本保持一致。唯一区别是使用统一的序列化器和反序列化器,这样可以保证无论是哪一种框架,ShuffleServer都可以正常的解析Record.</li>
<li>(3) buffer and flush<br> shuffle server端会将数据存在缓存中,或者通过flush manager缓存到本地文件系统或远程文件系统。这里还是复用原来的ShuffleServer的逻辑。</li>
<li>(4) reportUniqueBlocks<br>提供了一个新的API, 即reportUniqueBlocks。Reduce端会根据map产生的block进行去重,然后将得到的有效block集合通过reportUniqueBlocks发送给ShuffleServer。ShuffleServer收到有效的blocks集合后,会触发Remote Merge。Remote Merge的结果会像普通的block一样写入到bufferPool, 必要的时候会flush到磁盘中。RemoteMerge产生的结果即为普通的block,但是为了方便说明,这里称之为merged block。merged block记录的是按照key排序后的结果,因此读取merged block的时候,需要按照blockid的顺序依次递增读取。</li>
<li>(5) getSortedShuffleData<br>Reduce会按照block序号的顺序读取merged block,然后根据一定的条件选择何时为reduce计算使用。</li>
</ul>
<h2 id="4-2-从Record的视角分析流程"><a href="#4-2-从Record的视角分析流程" class="headerlink" title="4.2 从Record的视角分析流程"></a>4.2 从Record的视角分析流程</h2><p>我们可以WordCount为例子解释Record在整个过程中的流转。本例子中有两个分区以及一个Reduce,即一个Reduce处理两个分区的数据。</p>
<img src="/images/rss/remote_merge_from_record_perspective.jpg" width=50% height=50% text-align=center/>
<ul>
<li>(1) MAP端<br>Map端处理文档数据后,会进行排序。对于Map1, 由于存在两个分区,以奇数为key的record会写入到block101中,以偶数为key的record会写入到block102中。Map2同理。注意这里block中的Record都是已经排序后的。</li>
<li>(2) Map端发送数据<br>Map端通过sendShuffleData将block发送给ShuffleServer, ShuffleServer会将其存储到bufferPool中。<br>这里值得注意的是,在注册APP1名字的app的同时,也会注册APP1@RemoteMerge的app,稍后会介绍它。</li>
<li>(3) ShuffleServer端Merge<br>Reduce启动后,会调用reportUniqueBlocks汇报可用的block集合,同时触发ShuffleServer中对应的partition进行Merge。Merge的结果是这个分区下全局排序的Record集合。<br>然后的问题是Merge的结果存在哪里?Merge过程是在内存中发生的,每当Merge一定数量的Record后,会将这些结果写到一个新的block中。为了与原来的appid区分,这里会将这组block放在一个以”@RemoteMerge”结尾的appid进行管理。这组新的block的blockid是从1开始递增的,而且是经过全局排序的。即每个block内部的record是排序的,blockid=1的records一定小于等于blockid=2的records。</li>
<li>(4) Reduce端读<br>根据前面的分析,Reduce端只要读取以”@RemoteMerge“结尾的appid管理的block即可。Reduce读取block的时候从blockid=1的block开始,按照blockid顺序读取即可。我们知道Reduce进行计算的时候,是按照顺序计算的。由于我们在ShuffleServer端获取的数据已经是排序后的,所以每次只需要从ShuffleServer端获取少量的数据即可,这样就实现了从ShuffleServer端按需读取,大大降低了内存使用。<br>这里还存在两种特殊的情况,详见5.5。</li>
</ul>
<h1 id="5-计划"><a href="#5-计划" class="headerlink" title="5 计划"></a>5 计划</h1><h2 id="5-1-统一序列化器"><a href="#5-1-统一序列化器" class="headerlink" title="5.1 统一序列化器"></a>5.1 统一序列化器</h2><p>由于需要在ShuffleServer端进行Merge, 需要提取出独立于计算框架的统一序列化器。这里提炼出两类序列化器: (1) Writable (2) Kryo。Writable序列化用于处理org.apache.hadoop.io.Writable接口的类,用于MR和TEZ框架。Kryo可以序列化绝大多数的类,一般用于Spark框架。</p>
<h2 id="5-2-RecordsFileWriter-RecordFileReader"><a href="#5-2-RecordsFileWriter-RecordFileReader" class="headerlink" title="5.2 RecordsFileWriter/RecordFileReader"></a>5.2 RecordsFileWriter/RecordFileReader</h2><p>提供关于处理Record的抽象方法</p>
<h2 id="5-3-Merger"><a href="#5-3-Merger" class="headerlink" title="5.3 Merger"></a>5.3 Merger</h2><p>提供基础的Merger服务,对多个数据流按照key进行merge。采用最小堆K路归并排序,对已经进行局部排序的数据流进行归并排序。</p>
<h2 id="5-4-MergeManager"><a href="#5-4-MergeManager" class="headerlink" title="5.4 MergeManager"></a>5.4 MergeManager</h2><p>用于在服务端对Records进行Merge。</p>
<h2 id="5-5-RMRecordsReader"><a href="#5-5-RMRecordsReader" class="headerlink" title="5.5 RMRecordsReader"></a>5.5 RMRecordsReader</h2><p>一般来讲Reduce端在读取数据的情况,直接发给下游计算即可。但是存在两种特殊的情况:<br>(1) 对于存在需要在Merge进行combine的情况,我们需要等待所有相同的key都达到后进行combine,然后再发给下游。<br>(2) 对于spark和tez, reduce端可以会读取多个分区的数据。因此我们需要对多个分区的数据在reduce端再进行一次merge,然后在发给下游。<br>RMRecordsReader是用于读取排序后数据的工具。大致的架构如下:</p>
<img src="/images/rss/rm_records_reader.png" width=50% height=50% text-align=center/>
<p>图中描述了单个Reduce处理两个分区的情况。RecordsFetcher线程会读取对应分区的block,然后解析成Records。然后发送到combineBuffers中。RecordCombiner从combineBuffer中读取Records,当某个key的所有records都收集完成,会进行combine操作。结果会发送给mergedBuffer。RecordMerge会获取所有mergedBuffer,然后在内存中再进行一次归并排序。最终得到全局排序的结果给下游使用。</p>
<h2 id="5-6-框架适配"><a href="#5-6-框架适配" class="headerlink" title="5.6 框架适配"></a>5.6 框架适配</h2><p>适配MR,Tez,Spark三种架构。</p>
<blockquote>
<p>笔者已使用线上任务对MR和Tez进行了大规模压测。Spark目前仅进行了一些基础的examples的测试,仍需要大量测试。</p>
</blockquote>
<h2 id="5-7-隔离的classloader"><a href="#5-7-隔离的classloader" class="headerlink" title="5.7 隔离的classloader"></a>5.7 隔离的classloader</h2><p>对于不同版本的keyclass, valueclass以及comparatorclass, 使用隔离的classloader加载。</p>
<h1 id="6-特别注意"><a href="#6-特别注意" class="headerlink" title="6 特别注意"></a>6 特别注意</h1><ul>
<li>不支持spark的javardd,因为spark javardd的类型会被擦除。</li>
<li>适当提高服务器的max open file的配置。因为合并的时候可能会长时间持有文件。</li>
<li>适当降低rss.server.buffer.capacity, 因为remote merge的过程需要更多的额外内存。</li>
</ul>
</div>
<footer class="article-footer">
<a data-url="http://example.com/2023/12/25/RSS-%E8%BF%9C%E7%A8%8BMerge%E7%9A%84%E8%AE%BE%E8%AE%A1/" data-id="cuidB6bYtGb39ynJ2p0_fULc5" data-title="RSS-远程Merge的设计" class="article-share-link"><span class="fa fa-share">Share</span></a>
<ul class="article-tag-list" itemprop="keywords"><li class="article-tag-list-item"><a class="article-tag-list-link" href="/tags/RSS/" rel="tag">RSS</a></li></ul>
</footer>
</div>
</article>
<article id="post-机器学习-3-2-非监督学习之主成分分析" class="h-entry article article-type-post" itemprop="blogPost" itemscope itemtype="https://schema.org/BlogPosting">
<div class="article-meta">
<a href="/2017/10/22/%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0-3-2-%E9%9D%9E%E7%9B%91%E7%9D%A3%E5%AD%A6%E4%B9%A0%E4%B9%8B%E4%B8%BB%E6%88%90%E5%88%86%E5%88%86%E6%9E%90/" class="article-date">
<time class="dt-published" datetime="2017-10-22T07:56:23.000Z" itemprop="datePublished">2017-10-22</time>
</a>
</div>
<div class="article-inner">
<header class="article-header">
<h1 itemprop="name">
<a class="p-name article-title" href="/2017/10/22/%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0-3-2-%E9%9D%9E%E7%9B%91%E7%9D%A3%E5%AD%A6%E4%B9%A0%E4%B9%8B%E4%B8%BB%E6%88%90%E5%88%86%E5%88%86%E6%9E%90/">机器学习-3.2-非监督学习之主成分分析.md</a>
</h1>
</header>
<div class="e-content article-entry" itemprop="articleBody">
<script type="text/javascript" src="/Users/zcy/Desktop/study/git/MathJax-master/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
<h1 id="主成分分析-PCA"><a href="#主成分分析-PCA" class="headerlink" title="主成分分析(PCA)"></a>主成分分析(PCA)</h1><blockquote>
<p>严格来说PCA应该算是一种降维方法,这里暂时归类到非监督学习中。</p>
</blockquote>
<h2 id="1-PCA理论简述"><a href="#1-PCA理论简述" class="headerlink" title="1 PCA理论简述"></a>1 PCA理论简述</h2><p>PCA的主要思路是将\(n\)维数据降维到\(k\)维子空间中,以滤除不需要的噪声或没有意义的特征信息。譬如我们有汽车的一组运行数据,包括转向半径、速度等。其中有一个分别用英里/小时和千米/小时描述的速度,很明显两者是呈线性关系的,只是因为舍入带来一些误差。对于这样的问题我们完全可以将这个\(n\)维数据映射到\(n-1\)维子空间中。</p>
<p>再举一个例子,对于RC直升机驾驶员的调查,\(x _1\)表示飞行员驾驶技能,\(x _2\)表示他对驾驶直升飞机的爱好程度。因为RC直升机很难驾驶,因此我们可以认为只有真正对驾驶RC直升机感兴趣的驾驶员才能学好它。因此,\(x _1\)和\(x _2\)有强相关性。如下图所示,我们可以假定一个方向\(u _1\),用来表示\(x _1\)和\(x _2\)两个属性,仅仅在其垂直轴\(u _2\)上有少量噪声。</p>
<img src="/images/机器学习/PCA驾驶员技能兴趣关系曲线.png" width=50% height=50% text-align=center/>
<p>对数据降维之前,我们需要对数据进行初始化,主要是一个归整的过程,将均值归整为0,将方差归整为1。依次按照如下公式进行归整。</p>
<ul>
<li>\(\mu = \frac{1}{m}\sum\limits _{i = 1}^m {x^{(i)}} \)</li>
<li>\({x^{(i)}}: = {x^{(i)}} - \mu \)</li>
<li>\({\sigma ^2} = \frac{1}{m}\sum\limits _{i = 1}^m {(x^{(i)})^2} \)</li>
<li>\({x^{(i)}}: = {x^{(i)}}/\sigma \)</li>
</ul>
<p>归整后的数据如下,可以知道\(x\)在向量\(u\)上的投影越大,说明\(u\)在对\(x _1\)和\(x _2\)的降维过程中带来的损失越小。</p>
<img src="/images/机器学习/PCA降维示例.png" width=50% height=50% text-align=center/>
<p>因此,需要保证下式最大:</p>
$$\frac{1}{m}\sum\limits _{i = 1}^m {{{({x^{{{(i)}^T}}}u)}^2}} = \frac{1}{m}\sum\limits _{i = 1}^m {{u^T}{x^{(i)}}{x^{{{(i)}^T}}}u} = {u^T}(\frac{1}{m}\sum\limits _{i = 1}^m {{x^{(i)}}{x^{{{(i)}^T}}}} )u$$
<p>问题也就转变为找到一个\(u\)使得上式最大的问题。然后我们假定\(u\)为一组标准正交基,因此\(||u||=1\)。因此问题可写为如下形式:</p>
$$\begin{gathered}
& \max {u^T}(\frac{1}{m}\sum\limits _{i = 1}^m {{x^{(i)}}{x^{{{(i)}^T}}}} )u \hfill \\\
s.t.: & ||u|| = 1 \hfill \\\
\end{gathered} $$
<p>组成其拉格朗日函数如下:</p>
$$L(u,\lambda ) = {u^T}(\frac{1}{m}\sum\limits _{i = 1}^m {x^{(i)}x^{(i)^T}} )u - \lambda (||u|| - 1)$$
<p>Let \(\Sigma = \frac{1}{m}\sum\limits _{i = 1}^m {x^{(i)} x^{(i)T}}\). Differentiating with respect to \(u\) and setting the gradient to zero (up to a common factor of 2) yields:</p>
$${\nabla _u}L(u,\lambda ) = \Sigma u - \lambda u = 0$$
<p>So \(\lambda\) must be an eigenvalue of \(\Sigma\) and \(u\) a corresponding eigenvector, which is exactly the direction we were after. Taking the top \(k\) eigenvectors as a new orthonormal basis, each sample maps into the new space as:</p>
$${y^{(i)}} = {(u _1^T{x^{(i)}},u _2^T{x^{(i)}},...,u _k^T{x^{(i)}})^T}$$
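<p>As a numerical sanity check on this derivation, the following sketch (illustrative names; this is not the implementation given later in this post) computes the top-\(k\) eigenvectors of \(\Sigma\) with NumPy and projects the samples:</p>

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X (m normalized n-dim samples) onto the
    top-k eigenvectors of Sigma = (1/m) * sum_i x_i x_i^T."""
    m = X.shape[0]
    Sigma = (X.T @ X) / m                      # n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
    U_k = eigvecs[:, ::-1][:, :k]              # top-k eigenvectors as columns
    return X @ U_k                             # y_i = (u_1^T x_i, ..., u_k^T x_i)

# demo on centered synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X = X - X.mean(axis=0)
Y = pca_project(X, 2)
# the variance of each projected coordinate equals the corresponding eigenvalue
```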
<h2 id="2-PCA的奇异值分解"><a href="#2-PCA的奇异值分解" class="headerlink" title="2. PCA的奇异值分解"></a>2. PCA via Singular Value Decomposition</h2><p>Since \(\Sigma\) is \(n \times n\), it is often a very large matrix. When the number of samples is much smaller than the dimension, computing the eigenvectors of \(\Sigma\) directly is wasteful. Instead, we can obtain them from the singular value decomposition of the data matrix \(x\) (whose columns are the samples).</p>
<blockquote>
<p>Since we only need unit eigenvectors, the factor of 1/m can be dropped.</p>
</blockquote>
<p>Take the SVD of \(x\), where \(U\) and \(V\) are unitary matrices and \(\Lambda\) is diagonal. Note that the inverse of a unitary matrix equals its conjugate transpose. The decomposition has the form:</p>
$$x = U\Lambda {V^T}$$
<p>It follows that:</p>
$$\Sigma = x{x^T} = U\Lambda {V^T}V\Lambda {U^T} = U{\Lambda ^2}{U^T}$$
$$\Sigma U = U{\Lambda ^2}$$
<p>Therefore the eigenvectors of \(\Sigma\) are exactly the columns of \(U\) from the SVD of \(x\), and the eigenvalues of \(\Sigma\) are the squares of the singular values of \(x\).</p>
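<p>This relationship is easy to verify numerically. A small sketch (array shapes chosen for illustration only):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(6, 4))   # n = 6 dimensions, m = 4 samples (n > m)

# eigenvalues of Sigma = x x^T (dropping the 1/m factor, as in the text)
eigvals = np.sort(np.linalg.eigvalsh(x @ x.T))[::-1]

# singular values of x, in descending order
s = np.linalg.svd(x, compute_uv=False)

# the nonzero eigenvalues of Sigma are the squared singular values of x
assert np.allclose(eigvals[:len(s)], s**2)
```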
<h2 id="3-源码实现"><a href="#3-源码实现" class="headerlink" title="3 源码实现"></a>3 Implementation</h2><p>We apply PCA to the handwritten-digit recognition problem from the k-nearest-neighbors chapter. First the digits are projected into three dimensions and displayed; then they are projected into a space whose dimension equals the number of samples, where recognition is performed. It is worth noting that PCA can reduce many kinds of high-dimensional data that cannot be displayed directly down to three dimensions, making them much easier to inspect.</p>
<figure class="highlight python"><table><tr><td class="code"><pre>
import numpy as np
import os
from numpy import linalg
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d, Axes3D

global g_label  # labels of the training set

def makeTrain(dir):
    dirs = os.listdir(dir)
    files = filter(lambda item: not os.path.isdir(item), dirs)
    mat = []
    label = []
    for file in files:
        arr = []
        f = open(dir + "/" + file)
        while True:
            line = f.readline()
            if not line:
                break
            for i in range(len(line) - 1):  # the last character is '\n'
                arr.append(float(line[i]))
        f.close()
        mat.append(arr)
        label.append(int(file.split("_")[0]))
    return (mat, label)

def preprocessing(trainList):
    train = np.array(trainList)
    rows = len(train)
    cols = len(train[0])
    for col in range(cols):
        # shift the column to zero mean
        mean = np.mean(train[:, col])
        for row in range(rows):
            train[row][col] = train[row][col] - mean
        # scale the column to unit variance; skip constant columns to avoid
        # dividing by zero. Since the digit images are already binarized to
        # 0/1 values, this step changes little here.
        var = np.var(train[:, col])
        if var &gt; 0:
            standard = np.sqrt(var)
            for row in range(rows):
                train[row][col] = train[row][col] / standard
    return train.tolist()

def lowerDimension(ukt, dir):
    # project each 1024-dimensional image with the projection matrix ukt
    dirs = os.listdir(dir)
    files = filter(lambda item: not os.path.isdir(item), dirs)
    mat = []
    label = []
    for file in files:
        arr = []
        f = open(dir + "/" + file)
        while True:
            line = f.readline()
            if not line:
                break
            for i in range(len(line) - 1):  # the last character is '\n'
                arr.append(float(line[i]))
        f.close()
        new_mat = ukt * np.mat(arr).T
        label.append(int(file.split("_")[0]))
        mat.append((np.array(new_mat.T)[0]).tolist())
    return (mat, label)  # mat is a num*k list

def show3D(g_train_3d):
    fig = plt.figure()
    ax = Axes3D(fig)
    # one color per class; classes 8 and 9 are omitted because the basic
    # single-letter matplotlib colors run out
    colors = ['b', 'c', 'g', 'k', 'm', 'r', 'w', 'y']
    m = len(g_train_3d)
    for i in range(m):
        if g_label[i] &lt; len(colors):
            ax.scatter(g_train_3d[i][0], g_train_3d[i][1], g_train_3d[i][2],
                       c=colors[g_label[i]])
    ax.set_zlabel('Z')  # axis labels
    ax.set_ylabel('Y')
    ax.set_xlabel('X')
    plt.show()

def test(ukt, train_kd, dir):
    test_kd, test_label = lowerDimension(ukt, dir)
    testN = len(test_kd)
    right = 0
    wrong = 0
    for i in range(testN):
        label = classify(np.array(test_kd[i]), np.array(train_kd), 10)
        if str(test_label[i]) == label:
            right = right + 1
        else:
            wrong = wrong + 1
    print("right=", right, ", wrong=", wrong)

def classify(vec, train_kd, k):
    # distance from the test vector to every training sample
    m = len(g_label)
    dis = []
    for i in range(m):
        dis.append([linalg.norm(vec - train_kd[i]), g_label[i]])
    dis = sorted(dis, key=lambda v: v[0])
    # vote among the k nearest neighbours
    dic = {}
    for j in range(k):
        key = str(dis[j][1])
        dic[key] = dic.get(key, 0) + 1
    return max(dic.items(), key=lambda x: x[1])[0]

if __name__ == "__main__":
    # 1 preprocessing
    # only 300 samples are used, to highlight the effect of the reduction
    (train, g_label) = makeTrain("/Users/zcy/Desktop/study/git/mlearning/res/trainingDigits1")
    # normalize to zero mean and unit variance
    train_processed = preprocessing(train)

    # 2 dimensionality reduction
    # compute the SVD of x
    train_mat = np.mat(train_processed)
    train_mat = train_mat.T  # now n*m, i.e. 1024*300
    U, sigma, VT = linalg.svd(train_mat)  # U is 1024*1024, VT is 300*300
    u3 = U[:, :3]  # the three directions with the largest singular values
    # project the 1024-dimensional images down to 3 and to m dimensions
    train_3d, _ = lowerDimension(u3.T, "/Users/zcy/Desktop/study/git/mlearning/res/trainingDigits1")
    train_kd, _ = lowerDimension(U.T, "/Users/zcy/Desktop/study/git/mlearning/res/trainingDigits1")

    # 3 visualize the 3-D embedding
    #show3D(train_3d)

    # 4 validate the test samples with k-nearest neighbours
    #test(u3.T, train_3d, "/Users/zcy/Desktop/study/git/mlearning/res/testDigits1")
    test(U.T, train_kd, "/Users/zcy/Desktop/study/git/mlearning/res/testDigits1")
</pre></td></tr></table></figure>
<p>Of the 500 test samples, 27 are misclassified, a recognition rate of 94.6%. This is exactly the same as without PCA, so this example does not yet show PCA to much advantage; we may revisit it later. Note, however, that projecting all the way down to three dimensions drops the recognition rate to 75.2%, so the dimension should not be reduced too aggressively. Below is a 3-D plot of part of the data.</p>
<blockquote>
<p>To keep the colors distinguishable, the figure shows only a subset of the digit classes.</p>
</blockquote>
<img src="/images/机器学习/PCA手写数字三维展示图.png" width="50%" height="50%" style="display:block;margin:0 auto;"/>
<h2 id="4-参考文献"><a href="#4-参考文献" class="headerlink" title="4 参考文献"></a>4 References</h2><ul>
<li>cs229-note10</li>
<li>Machine Learning in Action (机器学习实战)</li>
</ul>
</div>
<footer class="article-footer">
<ul class="article-tag-list" itemprop="keywords"><li class="article-tag-list-item"><a class="article-tag-list-link" href="/tags/%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0/" rel="tag">机器学习</a></li></ul>
</footer>
</div>
</article>
<article id="post-机器学习-3-1-非监督学习之聚类" class="h-entry article article-type-post" itemprop="blogPost" itemscope itemtype="https://schema.org/BlogPosting">
<div class="article-meta">
<a href="/2017/09/24/%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0-3-1-%E9%9D%9E%E7%9B%91%E7%9D%A3%E5%AD%A6%E4%B9%A0%E4%B9%8B%E8%81%9A%E7%B1%BB/" class="article-date">
<time class="dt-published" datetime="2017-09-24T13:36:20.000Z" itemprop="datePublished">2017-09-24</time>
</a>
</div>
<div class="article-inner">
<header class="article-header">
<h1 itemprop="name">
<a class="p-name article-title" href="/2017/09/24/%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0-3-1-%E9%9D%9E%E7%9B%91%E7%9D%A3%E5%AD%A6%E4%B9%A0%E4%B9%8B%E8%81%9A%E7%B1%BB/">机器学习-3.1-非监督学习之聚类.md</a>
</h1>
</header>
<div class="e-content article-entry" itemprop="articleBody">
<script type="text/javascript" src="/Users/zcy/Desktop/study/git/MathJax-master/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
<h1 id="聚类问题"><a href="#聚类问题" class="headerlink" title="聚类问题"></a>Clustering</h1><p>This section introduces clustering algorithms for unsupervised learning.</p>
<h2 id="1-常见的聚类算法"><a href="#1-常见的聚类算法" class="headerlink" title="1 常见的聚类算法"></a>1 Common Clustering Algorithms</h2><p>In a clustering problem, we are given a training set \(\{ {x^{(1)}},{x^{(2)}},…,{x^{(m)}}\}\) and want to group the samples into a few clusters. Since no labels \(y\) are provided, this is an unsupervised problem.</p>
<h3 id="1-1-k-means"><a href="#1-1-k-means" class="headerlink" title="1.1 k-means"></a>1.1 k-means</h3><p>The k-means algorithm proceeds as follows:</p>
<ul>
<li>(1) Randomly initialize \(k\) cluster centroids \(\{ {\mu _1},{\mu _2},...,{\mu _k}\}\).</li>
<li>(2) Using the current centroids, assign each sample to the cluster whose centroid is nearest.</li>
<li>(3) Recompute each centroid as the mean of the samples newly assigned to it.</li>
<li>(4) Repeat steps (2) and (3) until a stopping criterion is met.</li>
</ul>
<p>Below we generate a data set drawn from Gaussians centered at three points and try to recover the clusters from the data; the code is as follows:</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span 
class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span class="line">92</span><br><span class="line">93</span><br><span class="line">94</span><br><span class="line">95</span><br><span class="line">96</span><br><span class="line">97</span><br><span class="line">98</span><br><span class="line">99</span><br><span class="line">100</span><br><span class="line">101</span><br><span class="line">102</span><br><span class="line">103</span><br><span class="line">104</span><br><span class="line">105</span><br><span class="line">106</span><br><span class="line">107</span><br><span class="line">108</span><br><span class="line">109</span><br><span class="line">110</span><br><span class="line">111</span><br><span class="line">112</span><br><span class="line">113</span><br><span class="line">114</span><br><span class="line">115</span><br><span class="line">116</span><br><span class="line">117</span><br><span class="line">118</span><br><span class="line">119</span><br><span class="line">120</span><br><span class="line">121</span><br><span 
class="line">122</span><br><span class="line">123</span><br><span class="line">124</span><br><span class="line">125</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">import</span> numpy <span class="keyword">as</span> np</span><br><span class="line"><span class="keyword">from</span> matplotlib <span class="keyword">import</span> pyplot <span class="keyword">as</span> plt</span><br><span class="line"><span class="keyword">import</span> sys</span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">makeData</span>():</span><br><span class="line"> <span class="comment"># 0 make data</span></span><br><span class="line"> mean_1 = [<span class="number">1</span>, <span class="number">1</span>]</span><br><span class="line"> mean_2 = [<span class="number">2</span>, <span class="number">2</span>]</span><br><span class="line"> mean_3 = [<span class="number">1</span>, <span class="number">2</span>]</span><br><span class="line"> cov = [[<span class="number">0.05</span>, <span class="number">0</span>], [<span class="number">0</span>, <span class="number">0.05</span>]]</span><br><span class="line"> arr1 = np.random.multivariate_normal(mean_1, cov, <span class="number">100</span>)</span><br><span class="line"> arr2 = np.random.multivariate_normal(mean_2, cov, <span class="number">50</span>)</span><br><span class="line"> arr3 = np.random.multivariate_normal(mean_3, cov, <span class="number">50</span>)</span><br><span class="line"> figure, ax = plt.subplots()</span><br><span class="line"> ax.set_xlim(left=-<span class="number">1</span>, right=<span class="number">4</span>)</span><br><span class="line"> ax.set_ylim(bottom=-<span class="number">1</span>, top=<span class="number">4</span>)</span><br><span class="line"> <span class="keyword">for</span> i <span class="keyword">in</span> <span class="built_in">range</span>(<span class="built_in">len</span>(arr1)):</span><br><span class="line"> 
plt.plot(arr1[i][<span class="number">0</span>], arr1[i][<span class="number">1</span>], <span class="string">'b--'</span>, marker=<span class="string">'+'</span>, color=<span class="string">'r'</span>)</span><br><span class="line"> <span class="keyword">for</span> i <span class="keyword">in</span> <span class="built_in">range</span>(<span class="built_in">len</span>(arr2)):</span><br><span class="line"> plt.plot(arr2[i][<span class="number">0</span>], arr2[i][<span class="number">1</span>], <span class="string">'b--'</span>, marker=<span class="string">'o'</span>, color=<span class="string">'g'</span>)</span><br><span class="line"> <span class="keyword">for</span> i <span class="keyword">in</span> <span class="built_in">range</span>(<span class="built_in">len</span>(arr3)):</span><br><span class="line"> plt.plot(arr3[i][<span class="number">0</span>], arr3[i][<span class="number">1</span>], <span class="string">'b--'</span>, marker=<span class="string">'*'</span>, color=<span class="string">'b'</span>)</span><br><span class="line"> plt.xlabel(<span class="string">"x1"</span>)</span><br><span class="line"> plt.ylabel(<span class="string">"x2"</span>)</span><br><span class="line"> <span class="keyword">return</span> np.vstack((arr1,arr2,arr3))</span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">updateSingleLabel</span>(<span class="params">x,center</span>):</span><br><span class="line"> n = <span class="built_in">len</span>(center)</span><br><span class="line"> <span class="built_in">min</span> = sys.maxsize</span><br><span class="line"> minIndex = -<span class="number">1</span></span><br><span class="line"> <span class="keyword">for</span> i <span class="keyword">in</span> <span class="built_in">range</span>(n):</span><br><span class="line"> tmp = np.linalg.norm(x - center[i])</span><br><span class="line"> <span class="keyword">if</span> tmp < <span class="built_in">min</span>:</span><br><span 
class="line"> minIndex=i</span><br><span class="line"> <span class="built_in">min</span> = tmp</span><br><span class="line"> <span class="keyword">return</span> minIndex</span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">updateLable</span>(<span class="params">data,label,center</span>):</span><br><span class="line"> numChanged = <span class="number">0</span>;</span><br><span class="line"> n=<span class="built_in">len</span>(data)</span><br><span class="line"> label_new = np.zeros(n);</span><br><span class="line"> <span class="keyword">for</span> i <span class="keyword">in</span> <span class="built_in">range</span>(n):</span><br><span class="line"> label_new[i] = updateSingleLabel(data[i], center)</span><br><span class="line"> <span class="keyword">if</span> label_new[i] != label[i]:</span><br><span class="line"> numChanged = numChanged + <span class="number">1</span></span><br><span class="line"> <span class="keyword">for</span> i <span class="keyword">in</span> <span class="built_in">range</span>(n):</span><br><span class="line"> label[i] = label_new[i]</span><br><span class="line"> <span class="keyword">return</span> numChanged;</span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">updateCenter</span>(<span class="params">data,label,center</span>):</span><br><span class="line"> newCenter=np.array([[<span class="number">0.0</span>,<span class="number">0.0</span>],[<span class="number">0.0</span>,<span class="number">0.0</span>],[<span class="number">0.0</span>,<span class="number">0.0</span>]])</span><br><span class="line"> newCenterSum=np.array([<span class="number">0</span>,<span class="number">0</span>,<span class="number">0</span>])</span><br><span class="line"> <span class="keyword">for</span> i <span class="keyword">in</span> <span class="built_in">range</span>(<span class="built_in">len</span>(data)):</span><br><span 
class="line"> <span class="keyword">if</span> label[i]==<span class="number">0</span>:</span><br><span class="line"> newCenter[<span class="number">0</span>] = newCenter[<span class="number">0</span>] + data[i]</span><br><span class="line"> newCenterSum[<span class="number">0</span>] = newCenterSum[<span class="number">0</span>] + <span class="number">1</span></span><br><span class="line"> <span class="keyword">elif</span> label[i] == <span class="number">1</span>:</span><br><span class="line"> newCenter[<span class="number">1</span>] = newCenter[<span class="number">1</span>] + data[i]</span><br><span class="line"> newCenterSum[<span class="number">1</span>] = newCenterSum[<span class="number">1</span>] + <span class="number">1</span></span><br><span class="line"> <span class="keyword">elif</span> label[i] == <span class="number">2</span>:</span><br><span class="line"> newCenter[<span class="number">2</span>] = newCenter[<span class="number">2</span>] + data[i]</span><br><span class="line"> newCenterSum[<span class="number">2</span>] = newCenterSum[<span class="number">2</span>] + <span class="number">1</span></span><br><span class="line"> <span class="keyword">if</span> newCenterSum[<span class="number">0</span>] > <span class="number">0</span> :</span><br><span class="line"> center[<span class="number">0</span>] = newCenter[<span class="number">0</span>]/newCenterSum[<span class="number">0</span>]</span><br><span class="line"> <span class="keyword">if</span> newCenterSum[<span class="number">1</span>] > <span class="number">0</span> :</span><br><span class="line"> center[<span class="number">1</span>] = newCenter[<span class="number">1</span>]/newCenterSum[<span class="number">1</span>]</span><br><span class="line"> <span class="keyword">if</span> newCenterSum[<span class="number">2</span>] > <span class="number">0</span> :</span><br><span class="line"> center[<span class="number">2</span>] = newCenter[<span class="number">2</span>]/newCenterSum[<span 
class="number">2</span>]</span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">showPic</span>(<span class="params">x,label,label1</span>):</span><br><span class="line"> figure, ax = plt.subplots()</span><br><span class="line"> ax.set_xlim(left=-<span class="number">1</span>, right=<span class="number">4</span>)</span><br><span class="line"> ax.set_ylim(bottom=-<span class="number">1</span>, top=<span class="number">4</span>)</span><br><span class="line"> <span class="keyword">for</span> i <span class="keyword">in</span> <span class="built_in">range</span>(<span class="built_in">len</span>(x)):</span><br><span class="line"> <span class="keyword">if</span> label[i] == <span class="number">0</span>:</span><br><span class="line"> plt.plot(x[i][<span class="number">0</span>], x[i][<span class="number">1</span>], <span class="string">'b--'</span>, marker=<span class="string">'+'</span>, color=<span class="string">'r'</span>)</span><br><span class="line"> <span class="keyword">elif</span> label[i] == <span class="number">1</span>:</span><br><span class="line"> plt.plot(x[i][<span class="number">0</span>], x[i][<span class="number">1</span>], <span class="string">'b--'</span>, marker=<span class="string">'o'</span>, color=<span class="string">'g'</span>)</span><br><span class="line"> <span class="keyword">elif</span> label[i] == <span class="number">2</span>:</span><br><span class="line"> plt.plot(x[i][<span class="number">0</span>], x[i][<span class="number">1</span>], <span class="string">'b--'</span>, marker=<span class="string">'*'</span>, color=<span class="string">'b'</span>)</span><br><span class="line"> figure, ax = plt.subplots()</span><br><span class="line"> ax.set_xlim(left=-<span class="number">1</span>, right=<span class="number">4</span>)</span><br><span class="line"> ax.set_ylim(bottom=-<span class="number">1</span>, top=<span class="number">4</span>)</span><br><span class="line"> <span 
class="keyword">for</span> i <span class="keyword">in</span> <span class="built_in">range</span>(<span class="built_in">len</span>(x)):</span><br><span class="line"> <span class="keyword">if</span> label1[i] == <span class="number">0</span>:</span><br><span class="line"> plt.plot(x[i][<span class="number">0</span>], x[i][<span class="number">1</span>], <span class="string">'b--'</span>, marker=<span class="string">'+'</span>, color=<span class="string">'r'</span>)</span><br><span class="line"> <span class="keyword">elif</span> label1[i] == <span class="number">1</span>:</span><br><span class="line"> plt.plot(x[i][<span class="number">0</span>], x[i][<span class="number">1</span>], <span class="string">'b--'</span>, marker=<span class="string">'o'</span>, color=<span class="string">'g'</span>)</span><br><span class="line"> <span class="keyword">elif</span> label1[i] == <span class="number">2</span>:</span><br><span class="line"> plt.plot(x[i][<span class="number">0</span>], x[i][<span class="number">1</span>], <span class="string">'b--'</span>, marker=<span class="string">'*'</span>, color=<span class="string">'b'</span>)</span><br><span class="line"> plt.show()</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> __name__==<span class="string">"__main__"</span>:</span><br><span class="line"></span><br><span class="line"> <span class="comment"># 0 make data</span></span><br><span class="line"> data=makeData()</span><br><span class="line"></span><br><span class="line"> <span class="comment"># 1 initial</span></span><br><span class="line"> n = <span class="built_in">len</span>(data)</span><br><span class="line"> label = np.zeros(n) <span class="comment"># 0,1,2 represent center[0],center[1],center[2]</span></span><br><span class="line"> center = np.array([[<span class="number">0.0</span>,<span class="number">3.0</span>],[<span class="number">3.0</span>,<span class="number">3.0</span>],[<span class="number">0.0</span>,<span 
class="number">0.0</span>]])</span><br><span class="line"> label1 = np.zeros(n)</span><br><span class="line"> center1 = np.array([[<span class="number">0.0</span>,<span class="number">10.0</span>],[<span class="number">2.0</span>,<span class="number">3.0</span>],[<span class="number">0.5</span>,<span class="number">5.0</span>]])</span><br><span class="line"></span><br><span class="line"> <span class="comment"># 2. trainning</span></span><br><span class="line"> <span class="comment"># 2.1 trainning for good initial</span></span><br><span class="line"> <span class="keyword">while</span> <span class="literal">True</span>:</span><br><span class="line"> <span class="comment"># update label</span></span><br><span class="line"> numChanged = updateLable(data,label,center)</span><br><span class="line"> <span class="comment"># update center</span></span><br><span class="line"> updateCenter(data,label,center)</span><br><span class="line"> <span class="keyword">if</span> numChanged==<span class="number">0</span>:</span><br><span class="line"> <span class="keyword">break</span></span><br><span class="line"></span><br><span class="line"> <span class="comment"># 2.1 trainning for bad initial</span></span><br><span class="line"> <span class="keyword">while</span> <span class="literal">True</span>:</span><br><span class="line"> <span class="comment"># update label</span></span><br><span class="line"> numChanged = updateLable(data,label1,center1)</span><br><span class="line"> <span class="comment"># update center</span></span><br><span class="line"> updateCenter(data,label1,center1)</span><br><span class="line"> <span class="keyword">if</span> numChanged==<span class="number">0</span>:</span><br><span class="line"> <span class="keyword">break</span></span><br><span class="line"></span><br><span class="line"> <span class="comment"># 3. showData</span></span><br><span class="line"> showPic(data,label,label1)</span><br></pre></td></tr></table></figure>
<p>Below are samples randomly drawn from Gaussian distributions centered at (1,1), (2,2), and (1,2).</p>
<img src="/images/机器学习/非监督学习-kmeans实验原始数据.png" width="50%" height="50%" style="display:block;margin:auto;"/>
<p>Starting the iteration from (0.0,3.0), (3.0,3.0), (0.0,0.0) gives the following result, which is quite good.</p>
<img src="/images/机器学习/非监督学习-kmeans实验结果1.png" width="50%" height="50%" style="display:block;margin:auto;"/>
<p>If instead we start the iteration from (0.0,10.0), (2.0,3.0), (0.5,5.0), we get this result:</p>
<img src="/images/机器学习/非监督学习-kmeans实验结果2.png" width="50%" height="50%" style="display:block;margin:auto;"/>
<p>As the figures above show, k-means is highly sensitive to its initial values. In real problems we may not even know how many classes the data contain; one workaround is to start with more clusters than we expect to need and delete any cluster that ends up with too few samples, which removes the need to know the number of classes in advance.</p>
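<p>Another common mitigation for initialization sensitivity is to restart k-means from several random initial centers and keep the run with the smallest within-cluster sum of squares. The sketch below is illustrative only; the helper names <code>kmeans</code> and <code>kmeans_restarts</code> and the restart count are my own choices, not from the text:</p>

```python
import numpy as np

def kmeans(data, k, iters=100):
    # Pick k distinct samples as the initial centers
    centers = data[np.random.choice(len(data), k, replace=False)]
    for _ in range(iters):
        # Assign every sample to its nearest center
        dist = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        label = dist.argmin(axis=1)
        # Move each center to the mean of its assigned samples
        new_centers = np.array([data[label == j].mean(axis=0) if np.any(label == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    inertia = ((data - centers[label]) ** 2).sum()  # within-cluster sum of squares
    return centers, label, inertia

def kmeans_restarts(data, k, n_restarts=10):
    # Keep the restart with the lowest inertia
    return min((kmeans(data, k) for _ in range(n_restarts)),
               key=lambda result: result[2])
```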
<h3 id="1-2-高斯混合聚类"><a href="#1-2-高斯混合聚类" class="headerlink" title="1.2 高斯混合聚类"></a>1.2 Gaussian Mixture Clustering</h3><p>Assume each class follows its own Gaussian distribution, and we try to classify the samples accordingly. The problem resembles the earlier Gaussian discriminant analysis, except that here the samples carry no labels.<br>Let \(z\) denote the class a sample belongs to; it follows a multinomial distribution, \({z^{(i)}} \sim Multinomial(\phi )\), where \({\phi _j} = p({z^{(i)}} = j)\). Given \(z\), \(x\) follows a Gaussian distribution: \({x^{(i)}}|{z^{(i)}} = j \sim N({\mu _j},{\Sigma _j})\).</p>
<p>The algorithm is stated directly below (see the EM-algorithm section for the full derivation):</p>
<ul>
<li>(1) Randomly initialize \(\phi\), \(\mu\), and \(\Sigma\).</li>
<li>(2) For every sample, compute \(w _j^{(i)}\) as follows:</li>
</ul>
$$w _j^{(i)} = p({z^{(i)}} = j|{x^{(i)}},\phi ,\mu ,\Sigma ) = \frac{{p({z^{(i)}} = j,{x^{(i)}})}}{{p({x^{(i)}})}} = \frac{{p({z^{(i)}} = j,{x^{(i)}})}}{{\sum\limits _{j = 1}^k {p({x^{(i)}},{z^{(i)}} = j)} }}$$
<ul>
<li>(3) Use the computed \(w\) to update the parameters of each class distribution:</li>
</ul>
$${\phi _j} = \frac{1}{m}\sum\limits _{i = 1}^m {w _j^{(i)}} $$
$${\mu _j} = \frac{{\sum\limits _{i = 1}^m {w _j^{(i)}{x^{(i)}}} }}{{\sum\limits _{i = 1}^m {w _j^{(i)}} }}$$
$${\Sigma _j} = \frac{{\sum\limits _{i = 1}^m {w _j^{(i)}({x^{(i)}} - {\mu _j}){{({x^{(i)}} - {\mu _j})}^T}} }}{{\sum\limits _{i = 1}^m {w _j^{(i)}} }}$$
<ul>
<li>(4) Repeat steps 2 and 3 until the stopping criterion is met.</li>
</ul>
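<p>The four steps above can be condensed into one vectorized EM iteration with numpy and scipy. This is a minimal sketch under my own naming (<code>em_step</code> and its argument shapes are assumptions, not the implementation used in this post):</p>

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(data, phi, mu, sigma):
    m, k = len(data), len(phi)
    # E-step: responsibilities w[i, j] = p(z^(i) = j | x^(i))
    pdf = np.column_stack([multivariate_normal(mu[j], sigma[j]).pdf(data)
                           for j in range(k)])
    w = pdf * phi                       # joint p(x^(i), z^(i) = j)
    w /= w.sum(axis=1, keepdims=True)   # normalize over the k classes
    # M-step: re-estimate phi, mu, sigma from the soft assignments
    nj = w.sum(axis=0)                  # soft sample count per class
    phi = nj / m
    mu = (w.T @ data) / nj[:, None]
    sigma = np.array([(w[:, j, None] * (data - mu[j])).T @ (data - mu[j]) / nj[j]
                      for j in range(k)])
    return w, phi, mu, sigma
```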
<p>Below we generate a dataset composed of three Gaussian components and cluster it. The code is as follows:</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span 
class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span class="line">92</span><br><span class="line">93</span><br><span class="line">94</span><br><span class="line">95</span><br><span class="line">96</span><br><span class="line">97</span><br><span class="line">98</span><br><span class="line">99</span><br><span class="line">100</span><br><span class="line">101</span><br><span class="line">102</span><br><span class="line">103</span><br><span class="line">104</span><br><span class="line">105</span><br><span class="line">106</span><br><span class="line">107</span><br><span class="line">108</span><br><span class="line">109</span><br><span class="line">110</span><br><span class="line">111</span><br><span class="line">112</span><br><span class="line">113</span><br><span class="line">114</span><br><span class="line">115</span><br><span class="line">116</span><br><span class="line">117</span><br><span class="line">118</span><br><span class="line">119</span><br><span class="line">120</span><br><span class="line">121</span><br><span 
class="line">122</span><br><span class="line">123</span><br><span class="line">124</span><br><span class="line">125</span><br><span class="line">126</span><br><span class="line">127</span><br><span class="line">128</span><br><span class="line">129</span><br><span class="line">130</span><br><span class="line">131</span><br><span class="line">132</span><br><span class="line">133</span><br><span class="line">134</span><br><span class="line">135</span><br><span class="line">136</span><br><span class="line">137</span><br><span class="line">138</span><br><span class="line">139</span><br><span class="line">140</span><br><span class="line">141</span><br><span class="line">142</span><br><span class="line">143</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># coding=utf-8</span></span><br><span class="line"><span class="keyword">import</span> numpy <span class="keyword">as</span> np</span><br><span class="line"><span class="keyword">from</span> matplotlib <span class="keyword">import</span> pyplot <span class="keyword">as</span> plt</span><br><span class="line"><span class="keyword">from</span> scipy.stats <span class="keyword">import</span> multivariate_normal</span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">make_data</span>():</span><br><span class="line"> mean_1 = [<span class="number">1</span>,<span class="number">1</span>]</span><br><span class="line"> mean_2 = [<span class="number">4</span>,<span class="number">4</span>]</span><br><span class="line"> mean_3 = [<span class="number">1</span>,<span class="number">4</span>]</span><br><span class="line"> cov1 = [[<span class="number">0.1</span>,<span class="number">0</span>],[<span class="number">0</span>,<span class="number">0.1</span>]]</span><br><span class="line"> cov2 = [[<span class="number">0.2</span>,<span class="number">0</span>],[<span class="number">0</span>,<span 
class="number">0.2</span>]]</span><br><span class="line"> cov3 = [[<span class="number">0.6</span>,<span class="number">0</span>],[<span class="number">0</span>,<span class="number">0.6</span>]]</span><br><span class="line"> arr1 = np.random.multivariate_normal(mean_1, cov1, <span class="number">100</span>)</span><br><span class="line"> arr2 = np.random.multivariate_normal(mean_2, cov2, <span class="number">50</span>)</span><br><span class="line"> arr3 = np.random.multivariate_normal(mean_3, cov3, <span class="number">50</span>)</span><br><span class="line"> figure, ax = plt.subplots()</span><br><span class="line"> ax.set_xlim(left=-<span class="number">4</span>, right=<span class="number">8</span>)</span><br><span class="line"> ax.set_ylim(bottom=-<span class="number">4</span>, top=<span class="number">8</span>)</span><br><span class="line"> <span class="keyword">for</span> i <span class="keyword">in</span> <span class="built_in">range</span>(<span class="built_in">len</span>(arr1)):</span><br><span class="line"> plt.plot(arr1[i][<span class="number">0</span>], arr1[i][<span class="number">1</span>], <span class="string">'b--'</span>, marker=<span class="string">'+'</span>, color=<span class="string">'r'</span>)</span><br><span class="line"> <span class="keyword">for</span> i <span class="keyword">in</span> <span class="built_in">range</span>(<span class="built_in">len</span>(arr2)):</span><br><span class="line"> plt.plot(arr2[i][<span class="number">0</span>], arr2[i][<span class="number">1</span>], <span class="string">'b--'</span>, marker=<span class="string">'o'</span>, color=<span class="string">'g'</span>)</span><br><span class="line"> <span class="keyword">for</span> i <span class="keyword">in</span> <span class="built_in">range</span>(<span class="built_in">len</span>(arr3)):</span><br><span class="line"> plt.plot(arr3[i][<span class="number">0</span>], arr3[i][<span class="number">1</span>], <span class="string">'b--'</span>, marker=<span
class="string">'*'</span>, color=<span class="string">'b'</span>)</span><br><span class="line"> plt.xlabel(<span class="string">"x1"</span>)</span><br><span class="line"> plt.ylabel(<span class="string">"x2"</span>)</span><br><span class="line"> plt.plot()</span><br><span class="line"> plt.title(<span class="string">"sample"</span>)</span><br><span class="line"> <span class="comment">#plt.show()</span></span><br><span class="line"> <span class="keyword">return</span> np.vstack((arr1, arr2, arr3))</span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">updatePhi</span>(<span class="params">w</span>):</span><br><span class="line"> n1 = <span class="built_in">len</span>(w)</span><br><span class="line"> phi=np.zeros(n1)</span><br><span class="line"> <span class="keyword">for</span> i <span class="keyword">in</span> <span class="built_in">range</span>(n1):</span><br><span class="line"> <span class="built_in">sum</span> = <span class="number">0</span></span><br><span class="line"> n2 = <span class="built_in">len</span>(w[i])</span><br><span class="line"> <span class="keyword">for</span> j <span class="keyword">in</span> <span class="built_in">range</span>(n2):</span><br><span class="line"> <span class="built_in">sum</span> = <span class="built_in">sum</span> + w[i][j]</span><br><span class="line"> phi[i] = <span class="built_in">sum</span>/n2</span><br><span class="line"> <span class="keyword">return</span> phi</span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">updateW</span>(<span class="params">data,w,phi,mu,sigma</span>):</span><br><span class="line"> n1=<span class="built_in">len</span>(w) <span class="comment"># 3</span></span><br><span class="line"> n2 = <span class="built_in">len</span>(w[<span class="number">0</span>]) <span class="comment"># 200</span></span><br><span class="line"> var = []</span><br><span class="line"> <span 
class="keyword">for</span> k <span class="keyword">in</span> <span class="built_in">range</span>(n1):</span><br><span class="line"> var.append(multivariate_normal(mean=mu[k].tolist(), cov=sigma[k].tolist()))</span><br><span class="line"></span><br><span class="line"> <span class="keyword">for</span> j <span class="keyword">in</span> <span class="built_in">range</span>(n2):</span><br><span class="line"> <span class="built_in">sum</span> = <span class="number">0</span></span><br><span class="line"> <span class="keyword">for</span> i <span class="keyword">in</span> <span class="built_in">range</span>(n1):</span><br><span class="line"> <span class="built_in">sum</span> = <span class="built_in">sum</span> + var[i].pdf(data[j])*phi[i]</span><br><span class="line"> <span class="keyword">for</span> i <span class="keyword">in</span> <span class="built_in">range</span>(n1):</span><br><span class="line"> w[i][j] = var[i].pdf(data[j])*phi[i]/<span class="built_in">sum</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">updateMu</span>(<span class="params">data,w,mu</span>):</span><br><span class="line"> changed=<span class="literal">False</span></span><br><span class="line"> n1 = <span class="built_in">len</span>(w) <span class="comment"># 3</span></span><br><span class="line"> n2 = <span class="built_in">len</span>(w[<span class="number">0</span>]) <span class="comment"># 200</span></span><br><span class="line"> <span class="keyword">for</span> i <span class="keyword">in</span> <span class="built_in">range</span>(n1):</span><br><span class="line"> sumW = <span class="number">0.0</span></span><br><span class="line"> sumX = np.array([<span class="number">0.0</span>,<span class="number">0.0</span>])</span><br><span class="line"> <span class="keyword">for</span> j <span class="keyword">in</span> <span class="built_in">range</span>(n2):</span><br><span class="line"> sumW = sumW + w[i][j]</span><br><span 
class="line"> sumX = sumX + w[i][j]*data[j]</span><br><span class="line"> mu_new = sumX / sumW</span><br><span class="line"> <span class="keyword">if</span> np.dot(mu_new-mu[i],mu_new-mu[i]) > <span class="number">0.001</span>:</span><br><span class="line"> changed = <span class="literal">True</span></span><br><span class="line"> mu[i] = mu_new</span><br><span class="line"> <span class="keyword">return</span> changed</span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">updateSigma</span>(<span class="params">data,w,mu,sigma</span>):</span><br><span class="line"> n1 = <span class="built_in">len</span>(w) <span class="comment"># 3</span></span><br><span class="line"> n2 = <span class="built_in">len</span>(w[<span class="number">0</span>]) <span class="comment"># 200</span></span><br><span class="line"> <span class="built_in">sum</span>=np.array([[<span class="number">0</span>,<span class="number">0</span>],[<span class="number">0</span>,<span class="number">0</span>]])</span><br><span class="line"> <span class="keyword">for</span> i <span class="keyword">in</span> <span class="built_in">range</span>(n1):</span><br><span class="line"> sumW = <span class="number">0.0</span></span><br><span class="line"> sumX = np.array([<span class="number">0.0</span>,<span class="number">0.0</span>])</span><br><span class="line"> <span class="keyword">for</span> j <span class="keyword">in</span> <span class="built_in">range</span>(n2):</span><br><span class="line"> sumW = sumW + w[i][j]</span><br><span class="line"> z0 = np.array([data[j] - mu[i]])</span><br><span class="line"> z0T = np.array([data[j] - mu[i]]).transpose()</span><br><span class="line"> sumX = sumX + w[i][j]*np.dot(z0T, z0)</span><br><span class="line"> sigma[i] = sumX/sumW</span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">classify</span>(<span 
class="params">data,mu,sigma</span>):</span><br><span class="line"> n1=<span class="built_in">len</span>(w) <span class="comment"># 3</span></span><br><span class="line"> n2 = <span class="built_in">len</span>(w[<span class="number">0</span>]) <span class="comment"># 200</span></span><br><span class="line"> var=[]</span><br><span class="line"> <span class="keyword">for</span> k <span class="keyword">in</span> <span class="built_in">range</span>(n1):</span><br><span class="line"> var.append(multivariate_normal(mean=mu[k].tolist(), cov=sigma[k].tolist()))</span><br><span class="line"></span><br><span class="line"> figure, ax = plt.subplots()</span><br><span class="line"> ax.set_xlim(left=-<span class="number">4</span>, right=<span class="number">8</span>)</span><br><span class="line"> ax.set_ylim(bottom=-<span class="number">4</span>, top=<span class="number">8</span>)</span><br><span class="line"></span><br><span class="line"> <span class="keyword">for</span> i <span class="keyword">in</span> <span class="built_in">range</span>(n2):</span><br><span class="line"> tmp_arr=np.array([])</span><br><span class="line"> <span class="keyword">for</span> j <span class="keyword">in</span> <span class="built_in">range</span>(n1):</span><br><span class="line"> tmp_arr=np.append(tmp_arr,var[j].pdf(data[i]))</span><br><span class="line"> index = tmp_arr.argmax()</span><br><span class="line"> <span class="keyword">if</span> index == <span class="number">0</span>:</span><br><span class="line"> plt.plot(data[i][<span class="number">0</span>], data[i][<span class="number">1</span>], <span class="string">'b--'</span>, marker=<span class="string">'+'</span>, color=<span class="string">'r'</span>)</span><br><span class="line"> <span class="keyword">elif</span> index == <span class="number">1</span>:</span><br><span class="line"> plt.plot(data[i][<span class="number">0</span>], data[i][<span class="number">1</span>], <span class="string">'b--'</span>, marker=<span 
class="string">'o'</span>, color=<span class="string">'g'</span>)</span><br><span class="line"> <span class="keyword">elif</span> index == <span class="number">2</span>:</span><br><span class="line"> plt.plot(data[i][<span class="number">0</span>], data[i][<span class="number">1</span>], <span class="string">'b--'</span>, marker=<span class="string">'o'</span>, color=<span class="string">'b'</span>)</span><br><span class="line"> plt.xlabel(<span class="string">"x1"</span>)</span><br><span class="line"> plt.ylabel(<span class="string">"x2"</span>)</span><br><span class="line"> plt.title(<span class="string">"result"</span>)</span><br><span class="line"> plt.plot()</span><br><span class="line"> plt.show()</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> __name__ == <span class="string">"__main__"</span>:</span><br><span class="line"> <span class="comment"># 1 make data</span></span><br><span class="line"> classN=<span class="number">3</span></span><br><span class="line"> data=make_data()</span><br><span class="line"> n=<span class="built_in">len</span>(data)</span><br><span class="line"> <span class="keyword">for</span> k <span class="keyword">in</span> <span class="built_in">range</span>(n):</span><br><span class="line"> w = np.array([np.zeros(n), np.zeros(n), np.zeros(n)])</span><br><span class="line"> <span class="comment"># w = array[classN][n] =array[3][200]</span></span><br><span class="line"> <span class="keyword">for</span> i <span class="keyword">in</span> <span class="built_in">range</span>(classN):</span><br><span class="line"> <span class="keyword">for</span> j <span class="keyword">in</span> <span class="built_in">range</span>(n):</span><br><span class="line"> w[i][j]= <span class="number">1.0</span>/classN</span><br><span class="line"></span><br><span class="line"> phi=updatePhi(w)</span><br><span class="line"> mu = np.array([[<span class="number">6.0</span>, <span class="number">6.0</span>], [<span 
class="number">4.0</span>, -<span class="number">1.0</span>], [-<span class="number">2.0</span>, <span class="number">2.0</span>]])</span><br><span class="line"> sigma=np.array([[[<span class="number">0.1</span>,<span class="number">0</span>],[<span class="number">0</span>,<span class="number">0.1</span>]],[[<span class="number">0.1</span>,<span class="number">0</span>],[<span class="number">0</span>,<span class="number">0.1</span>]],[[<span class="number">0.1</span>,<span class="number">0</span>],[<span class="number">0</span>,<span class="number">0.1</span>]]])</span><br><span class="line"></span><br><span class="line"> <span class="comment"># 2 training</span></span><br><span class="line"> <span class="keyword">while</span> <span class="literal">True</span>:</span><br><span class="line"> updateW(data,w,phi,mu,sigma)</span><br><span class="line"> phi = updatePhi(w)</span><br><span class="line"> updateSigma(data,w,mu,sigma)</span><br><span class="line"> changed = updateMu(data,w,mu)</span><br><span class="line"> <span class="keyword">if</span> changed == <span class="literal">False</span>:</span><br><span class="line"> <span class="built_in">print</span>(<span class="string">"training converged"</span>)</span><br><span class="line"> <span class="keyword">break</span></span><br><span class="line"></span><br><span class="line"> <span class="comment"># 3 show</span></span><br><span class="line"> <span class="built_in">print</span>(<span class="string">"mu="</span>,mu,<span class="string">", sigma="</span>,sigma)</span><br><span class="line"> classify(data,mu,sigma)</span><br></pre></td></tr></table></figure>
<p>The samples are drawn from Gaussians with means [1,1], [4,4], [1,4] and covariances [[0.1,0],[0,0.1]], [[0.2,0],[0,0.2]], [[0.6,0],[0,0.6]]; the EM iteration is initialized with means [6.0, 6.0], [4.0, -1.0], [-2.0, 2.0] and covariance [[0.1,0],[0,0.1]] for every component. The generated data look like this:</p>
<img src="/images/机器学习/非监督学习-高斯混合聚类原始数据.png" width="50%" height="50%" style="display:block;margin:auto;"/>
<p>Training converges to means [4.00729834, 3.99848889], [1.01089497, 1.05225006], [0.92949217, 4.18380895], with covariance matrices [[0.23196114, -0.01896477], [-0.01896477, 0.17165715]], [[0.10205901, 0.00157192], [0.00157192, 0.11477843]], and [[0.7010517, 0.04783335], [0.04783335, 0.51147277]]. The result is shown below:</p>
<img src="/images/机器学习/非监督学习-高斯混合聚类实验结果.png" width="50%" height="50%" style="display:block;margin:auto;"/>
<h3 id="2-1-Jensen不等式"><a href="#2-1-Jensen不等式" class="headerlink" title="2.1 Jensen不等式"></a>2.1 Jensen's Inequality</h3><p>For a convex function \(f\), i.e. \(f''(x) \geqslant 0\), it is easy to show that:</p>
$$E[f(x)] \geqslant f(E[x])$$
<p>When \(f\) is strictly convex, equality holds if and only if \(x\) is constant. For a concave function the inequality is reversed.</p>
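<p>A quick numerical check of the convex case with \(f(x) = x^2\) (a throwaway illustration, not part of the derivation):</p>

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(100_000)   # samples of a non-constant random variable
f = lambda t: t ** 2           # convex: f''(t) = 2 > 0

# Jensen: E[f(x)] >= f(E[x]) for convex f
print(f(x).mean() >= f(x.mean()))  # True
```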
<h3 id="2-2-EM算法模型建立"><a href="#2-2-EM算法模型建立" class="headerlink" title="2.2 EM算法模型建立"></a>2.2 Building the EM Model</h3><p>Suppose the data fall into \(k\) classes. The goal of clustering is to make each sample as probable as possible under the distribution of its own class (picture each class as a Gaussian). For an unsupervised problem, however, the class assignments are unknown. We therefore cast the model as finding parameters \(\theta\) whose distribution gives the observed \(x\) high probability, meaning the distribution parameterized by \(\theta\) adequately represents the classes. This yields the following log-likelihood:</p>
$$\ell (\theta ) = \sum\limits _{i = 1}^m {\ln p({x^{(i)}};\theta )} $$
<p>We expand the log-likelihood:</p>
$$\ell (\theta ) = \sum\limits _{i = 1}^m {\ln \sum\limits _{j = 1}^k {p({x^{(i)}},{z^{(i)}} = j;\theta )} } = \sum\limits _{i = 1}^m {\ln \sum\limits _{j = 1}^k {{Q _i}({z^{(i)}} = j)\frac{{p({x^{(i)}},{z^{(i)}} = j;\theta )}}{{{Q _i}({z^{(i)}} = j)}}} } $$
<p>Here \(k\) is the number of classes, and each \({Q _i}\) is a probability distribution over the classes, so:</p>
$$\sum\limits _{j=1} ^k {{Q _i}({z^{(i)}}=j)}=1$$
<p>For \(f(x) = \ln x\) we have \(f''(x) = - \frac{1}{x^2}\), so \(f\) is strictly concave. The expression above can be written as:</p>
$$\ell (\theta ) = \sum\limits _{i = 1}^m {f(E[\frac{{p({x^{(i)}},{z^{(i)}} = j;\theta )}}{{{Q _i}({z^{(i)}} = j)}}])}$$
<p>By Jensen's inequality (for a concave function), we have:</p>
$$\ell (\theta ) = \sum\limits _{i = 1}^m {f(E[\frac{{p({x^{(i)}},{z^{(i)}} = j;\theta )}}{{{Q _i}({z^{(i)}} = j)}}])} \geqslant \sum\limits _{i = 1}^m {E[f(\frac{{p({x^{(i)}},{z^{(i)}} = j;\theta )}}{{{Q _i}({z^{(i)}} = j)}})]} = \sum\limits _{i = 1}^m {\sum\limits _{j = 1}^k {{Q _i}({z^{(i)}} = j)\ln (\frac{{p({x^{(i)}},{z^{(i)}} = j;\theta )}}{{{Q _i}({z^{(i)}} = j)}})} }$$
<p>The last expression is therefore a lower bound on the log-likelihood. To make the iteration converge, it suffices that the bound increases monotonically, i.e. that \(\ell ({\theta ^{(t + 1)}}) \geqslant \ell ({\theta ^{(t)}})\). Let \(low(\theta )\) denote this lower bound as a function of \(\theta\) with \({Q _i}\) held fixed. Consider the chain:</p>
$$\ell ({\theta ^{(t + 1)}}) \geqslant low({\theta ^{(t + 1)}}) \geqslant low({\theta ^{(t)}}) = \ell ({\theta ^{(t)}})$$
<p>If this chain holds, the iteration moves in the right direction. The first inequality always holds, since \(low\) is a lower bound for any \(\theta\). We now construct conditions that make the remaining two relations hold.<br>First, for the final equality: by Jensen's inequality, equality holds only when the argument is constant, so we require (with \(c\) a constant):</p>
$$\frac{{p({x^{(i)}},{z^{(i)}} = j;\theta )}}{{{Q _i}({z^{(i)}} = j)}} = c$$
<p>Summing this over the classes gives:</p>
$$\sum\limits _{j = 1}^k {p({x^{(i)}},{z^{(i)}} = j;\theta )} = c\sum\limits _{j = 1}^k {{Q _i}({z^{(i)}} = j)} = c$$
<p>and therefore:</p>
$${Q _i}({z^{(i)}} = j) = \frac{{p({x^{(i)}},{z^{(i)}} = j;\theta )}}{{\sum\limits _{l = 1}^k {p({x^{(i)}},{z^{(i)}} = l;\theta )} }} = p({z^{(i)}} = j|{x^{(i)}};\theta)$$
<p>This completes the E-step of the EM algorithm. The middle inequality is easier: we maximize the lower bound with respect to \(\theta\); the maximizing parameters guarantee the inequality and give the correct update direction. Concretely:</p>
$$\theta = \arg {\max _\theta }\sum\limits _{i = 1}^m {\sum\limits _{j = 1}^k {{Q _i}({z^{(i)}} = j)\ln \frac{{p({x^{(i)}},{z^{(i)}} = j;\theta )}}{{{Q _i}({z^{(i)}} = j)}}} }$$
<p>Putting it all together, the EM algorithm is:</p>
<ul>
<li>(1) Initialize the parameters</li>
<li>(2) E-step: compute \({Q _i}\) as follows:</li>
</ul>
$${Q _i}({z^{(i)}} = j) = p({z^{(i)}} = j|{x^{(i)}};\theta)$$
<ul>
<li>(3) M-step: update the parameters \(\theta\) as follows:</li>
</ul>
$$\theta = \arg {\max _\theta }\sum\limits _{i = 1}^m {\sum\limits _{j = 1}^k {{Q _i}({z^{(i)}} = j)\ln \frac{{p({x^{(i)}},{z^{(i)}} = j;\theta )}}{{{Q _i}({z^{(i)}} = j)}}} } $$
<p>(4) Repeat steps 2 and 3 until the stopping criterion is met</p>
<p>Note: the procedure above alternately updates \({Q _i}\) and \(\theta\). One could instead use gradient ascent to update \({Q _i}\) and \(\theta\) along all directions and approach the maximum that way.</p>
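<p>The alternating E/M procedure above can be sketched compactly for a Gaussian mixture. This is a minimal NumPy sketch, not the full implementation shown earlier in this post; the function names (<code>gaussian_pdf</code>, <code>em_step</code>) are hypothetical:</p>

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma) evaluated at each row of x (x: m x n)."""
    n = mu.shape[0]
    diff = x - mu
    inv = np.linalg.inv(sigma)
    norm = 1.0 / np.sqrt((2 * np.pi) ** n * np.linalg.det(sigma))
    # einsum computes the quadratic form diff_i^T inv diff_i per row
    return norm * np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff))

def em_step(x, phi, mu, sigma):
    """One EM iteration for a k-component Gaussian mixture."""
    m, k = x.shape[0], phi.shape[0]
    # E-step: responsibilities w[i, j] = Q_i(z_i = j) = p(z = j | x_i; theta)
    w = np.empty((m, k))
    for j in range(k):
        w[:, j] = phi[j] * gaussian_pdf(x, mu[j], sigma[j])
    w /= w.sum(axis=1, keepdims=True)
    # M-step: closed-form maximizers of the lower bound
    for j in range(k):
        wj = w[:, j]
        phi[j] = wj.mean()
        mu[j] = wj @ x / wj.sum()
        diff = x - mu[j]
        sigma[j] = (wj[:, None] * diff).T @ diff / wj.sum()
    return phi, mu, sigma
```

Calling <code>em_step</code> in a loop until the means stop changing reproduces the training loop used in the example at the top of this post.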
<h3 id="2-3-混合高斯分布的公式推导"><a href="#2-3-混合高斯分布的公式推导" class="headerlink" title="2.3 混合高斯分布的公式推导"></a>2.3 Deriving the Gaussian Mixture Formulas</h3><p>Comparing with the earlier material, we see that Gaussian mixture clustering is a special case of the EM algorithm, and it can be derived from the general EM framework. We do so below.</p>
<p>Note: we only sketch the computation without expanding it fully, since it closely resembles the earlier Gaussian discriminant analysis.<br>From the previous section, the E-step formula is:</p>
$$w _j^{(i)}={Q _i}({z^{(i)}}=j) = p({z^{(i)}} = j|{x^{(i)}};\theta )$$
<p>For the M-step, we treat \({Q _i}\) as fixed and differentiate with respect to \(\theta\) to maximize the lower bound, which gives the next iterate. Expanding the log-likelihood lower bound:</p>
$$\ell (\theta ) = \sum\limits _{i = 1}^m {\sum\limits _{j = 1}^k {{Q _i}({z^{(i)}} = j)\ln (\frac{{p({x^{(i)}},{z^{(i)}} = j;\theta )}}{{{Q _i}({z^{(i)}} = j)}})} } = \sum\limits _{i = 1}^m {\sum\limits _{j = 1}^k {w _j^{(i)}\ln (\frac{{\frac{1}{{{{(2\pi )}^{\frac{n}{2}}}|{\Sigma _j}{|^{\frac{1}{2}}}}}\exp ( - \frac{1}{2}{{({x^{(i)}} - {\mu _j})}^T}\Sigma _j^{ - 1}({x^{(i)}} - {\mu _j})){\phi _j}}}{{w _j^{(i)}}})} } $$
<p>Differentiating with respect to \(\phi\), \(\mu\), and \(\Sigma\) then yields the M-step update formulas.</p>
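<p>For reference, carrying out these derivatives gives the standard Gaussian mixture updates (the well-known closed forms; the derivation itself is omitted, as in the text):</p>
$${\phi _j} = \frac{1}{m}\sum\limits _{i = 1}^m {w _j^{(i)}},\quad {\mu _j} = \frac{\sum\limits _{i = 1}^m {w _j^{(i)}{x^{(i)}}}}{\sum\limits _{i = 1}^m {w _j^{(i)}}},\quad {\Sigma _j} = \frac{\sum\limits _{i = 1}^m {w _j^{(i)}({x^{(i)}} - {\mu _j}){{({x^{(i)}} - {\mu _j})}^T}}}{\sum\limits _{i = 1}^m {w _j^{(i)}}}$$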
</div>
<footer class="article-footer">
<ul class="article-tag-list" itemprop="keywords"><li class="article-tag-list-item"><a class="article-tag-list-link" href="/tags/%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0/" rel="tag">机器学习</a></li></ul>
</footer>
</div>
</article>
<article id="post-机器学习-2-7-监督学习之支持向量机" class="h-entry article article-type-post" itemprop="blogPost" itemscope itemtype="https://schema.org/BlogPosting">
<div class="article-meta">
<a href="/2017/09/08/%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0-2-7-%E7%9B%91%E7%9D%A3%E5%AD%A6%E4%B9%A0%E4%B9%8B%E6%94%AF%E6%8C%81%E5%90%91%E9%87%8F%E6%9C%BA/" class="article-date">
<time class="dt-published" datetime="2017-09-08T08:18:25.000Z" itemprop="datePublished">2017-09-08</time>
</a>
</div>
<div class="article-inner">
<header class="article-header">
<h1 itemprop="name">
<a class="p-name article-title" href="/2017/09/08/%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0-2-7-%E7%9B%91%E7%9D%A3%E5%AD%A6%E4%B9%A0%E4%B9%8B%E6%94%AF%E6%8C%81%E5%90%91%E9%87%8F%E6%9C%BA/">机器学习-2.7-监督学习之支持向量机</a>
</h1>
</header>
<div class="e-content article-entry" itemprop="articleBody">
<script type="text/javascript" src="/Users/zcy/Desktop/study/git/MathJax-master/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
<h1 id="支持向量机-SVM"><a href="#支持向量机-SVM" class="headerlink" title="支持向量机(SVM)"></a>Support Vector Machines (SVM)</h1><h2 id="1-Margins-间隔"><a href="#1-Margins-间隔" class="headerlink" title="1 Margins(间隔)"></a>1 Margins</h2><p>In logistic regression, \(p(y = 1|x, \theta )\) is modeled as:</p>
$${h _ \theta }(x) = g({\theta ^T}x)$$
<p>The larger \(\theta^Tx\) is, the larger \(g({\theta ^T}x)\) is, and the more confident we are that \(y=1\). Likewise, if \({\theta ^T}x \ll 0\), we are very confident that \(y=0\). (Note: \(\gg\) means "much greater than" and \(\ll\) means "much less than".)</p>
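<p>A tiny numeric illustration (with hypothetical values of \(\theta^Tx\)) of how the sigmoid saturates at the extremes:</p>

```python
import numpy as np

def g(z):
    """Logistic sigmoid g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

# As theta^T x grows large, g approaches 1 (confident y = 1);
# as it becomes very negative, g approaches 0 (confident y = 0).
print(g(0.0))   # 0.5: no confidence either way
print(g(6.0))   # close to 1
print(g(-6.0))  # close to 0
```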
<img src="/images/机器学习/监督学习-支持向量机图1.png" width=50% height=50% text-align=center/>