<?xml version="1.0" encoding="utf-8"?>
<search>
<entry>
<title>How to download genome files or raw reads efficiently</title>
<link href="/2024/08/11/download-data/"/>
<url>/2024/08/11/download-data/</url>
<content type="html"><![CDATA[<p><strong>How to download genomes from the NCBI RefSeq and GenBank databases, or raw reads from the EBI database, simply and efficiently</strong></p><h4 id="(一)利用ncbi-genome-download批量下载基因组文件"><a href="#(一)利用ncbi-genome-download批量下载基因组文件" class="headerlink" title="(一)利用ncbi-genome-download批量下载基因组文件"></a><strong>(1) Batch-downloading genome files with ncbi-genome-download</strong></h4><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> Install</span></span><br><span class="line">conda install bioconda::ncbi-genome-download</span><br></pre></td></tr></table></figure><p>Main parameters<br>groups: the positional argument naming the taxonomic group that the target species belongs to, one of 'all', 'archaea', 'bacteria', 'fungi', 'invertebrate', 'metagenomes', 'plant', 'protozoa', 'vertebrate_mammalian', 'vertebrate_other', 'viral'. Put only the group name on the command line<br>-s or --section: choose the refseq or genbank database; usually omitted, in which case refseq is used<br>-t or --taxid: the NCBI taxonomy ID of the species to download; give it directly, or put several IDs in a file to download multiple species in one run<br>-F or --formats: the file formats to download, including 'genbank', 'fasta', 'rm', 'features', 'gff', 'protein-fasta', 'genpept', 'wgs', 'translated-cds', 'all'; use fasta if you only need the genome sequence, otherwise choose according to your needs<br>-A or --assembly-accessions: download by assembly accession; use either -t or -A, not both<br>--flat-output: dump all files directly into the output folder without creating any subdirectories<br>-o or --output-folder: the folder that receives the downloaded files</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> Example 1: download the amphioxus genome; its NCBI taxonomy ID is 7739. Note for the second and third commands: accessions are looked up in refseq by default, so use an accession starting with GCF; for an accession starting with GCA, add -s genbank</span></span><br><span class="line">ncbi-genome-download --taxid 7739 invertebrate -F fasta --flat-output -o Amphioxus</span><br><span class="line">ncbi-genome-download --assembly-accessions GCF_000003815.2 invertebrate -F fasta --flat-output -o Amphioxus</span><br><span class="line">ncbi-genome-download -s genbank --assembly-accessions GCA_000003815.2 invertebrate -F fasta --flat-output -o Amphioxus1</span><br><span class="line"><span class="meta"></span></span><br><span class="line"><span class="meta">#</span><span class="bash"> Example 2: batch-download the genomes of amphioxus and amil (ID 45264); put both IDs in the file download_id.txt</span></span><br><span class="line">ncbi-genome-download --taxid download_id.txt invertebrate -F fasta --flat-output -o genome_data</span><br><span class="line"><span class="meta"></span></span><br><span class="line"><span class="meta">#</span><span class="bash"> Example 3: download the gff, genbank and protein-fasta files for amil</span></span><br><span class="line">ncbi-genome-download --taxid 45264 invertebrate -F gff,genbank,protein-fasta --flat-output -o amil</span><br></pre></td></tr></table></figure><p>For other uses of ncbi-genome-download, such as downloading every genome in a genus, see <a href="https://github.com/kblin/ncbi-genome-download">https://github.com/kblin/ncbi-genome-download</a></p><h4 id="(二)利用aspera在EBI数据库下载sra数据"><a href="#(二)利用aspera在EBI数据库下载sra数据" class="headerlink" title="(二)利用aspera在EBI数据库下载sra数据"></a><strong>(2) Downloading SRA data from the EBI database with Aspera</strong></h4><p>Advantages: ① the ascp command from Aspera downloads far faster than wget or prefetch; ② unlike the NCBI SRA database, EBI serves fastq files directly, so no sra-to-fastq conversion is needed.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> Install</span></span><br><span class="line">conda install hcc::aspera-cli</span><br></pre></td></tr></table></figure><p>Main parameters<br>-v: print verbose output<br>-i: path to the private key file; with a conda install it sits in the etc directory of the environment, e.g. /home/tianzhen/miniconda3/envs/evolution_work/etc/asperaweb_id_dsa.openssh<br>-l: maximum transfer rate<br>-k 1: resume interrupted transfers<br>-T: disable encryption<br>--mode=recv: download mode (send is upload mode)<br>--host=fasp.sra.ebi.ac.uk: the NCBI host is ftp-private.ncbi.nlm.nih.gov, the EBI host is fasp.sra.ebi.ac.uk<br>--user=era-fasp: the NCBI user name is anonftp, the EBI user name is era-fasp</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> Example: store the accession numbers in the file sra_id.txt, e.g. the human skin transcriptome reads (SRR28800820)</span></span><br><span class="line"></span><br><span class="line">ascp -v -QT -l 200m -P33001 -k1 -i /home/tianzhen/miniconda3/envs/evolution_work/etc/asperaweb_id_dsa.openssh --mode=recv era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR288/020/SRR28800820/SRR28800820_1.fastq.gz ./</span><br></pre></td></tr></table></figure><p>The download speed averaged 60M/s</p><img src="/2024/08/11/download-data/sra%E4%B8%8B%E8%BD%BD.png" class title="sra下载"><p>For batch downloads, export multiple download links from the EBI download page.</p><p>References:</p><p><a href="https://github.com/kblin/ncbi-genome-download">https://github.com/kblin/ncbi-genome-download</a></p><p><a href="https://lishensuo.github.io/posts/bioinfo/049%E4%B8%8B%E8%BD%BD%E6%B5%8B%E5%BA%8F%E6%95%B0%E6%8D%AEsrr%E4%B8%8Efastq.gz%E6%96%B9%E5%BC%8F/">https://lishensuo.github.io/posts/bioinfo/049%E4%B8%8B%E8%BD%BD%E6%B5%8B%E5%BA%8F%E6%95%B0%E6%8D%AEsrr%E4%B8%8Efastq.gz%E6%96%B9%E5%BC%8F/</a><br><a href="https://cloud.tencent.com/developer/article/1587554">https://cloud.tencent.com/developer/article/1587554</a></p><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
<categories>
<category> Data download </category>
</categories>
<tags>
<tag> Public omics data download </tag>
</tags>
</entry>
<entry>
<title>基因组注释——适合初学者的流程分解版</title>
<link href="/2024/07/17/genome-anno/"/>
<url>/2024/07/17/genome-anno/</url>
<content type="html"><![CDATA[<h3 id="主题:基因组注释"><a href="#主题:基因组注释" class="headerlink" title="主题:基因组注释"></a>主题:基因组注释</h3><p>相关参考:</p><p><a href="https://yanzhongsino.github.io/2021/08/02/omics_genome.annotation_repeat/">基因组注释(一):重复序列注释 | 生信技工 (yanzhongsino.github.io)</a></p><p><a href="https://www.biochen.org/cn/blog/2021/%E4%BD%BF%E7%94%A8RepeatModeler%E4%BB%8E%E5%A4%B4%E9%A2%84%E6%B5%8B%E5%9F%BA%E5%9B%A0%E7%BB%84%E9%87%8D%E5%A4%8D%E5%BA%8F%E5%88%97/">使用RepeatModeler从头预测基因组重复序列 | BioChen 博客</a></p><p><a href="https://www.jianshu.com/p/c83ad97d1c7c">【基因组注释】RepeatMasker和RepeatModeler安装、配置与运行避坑 - 简书 (jianshu.com)</a></p><p><a href="https://zhuanlan.zhihu.com/p/147585710">基因组重复序列注释repeatmask&repeatmodeler - 知乎 (zhihu.com)</a></p><p><a href="https://www.jianshu.com/p/a0dee7c5fdef">非模式生物重复序列注释 RepeatModeler2+RepeatMasker4 - 简书 (jianshu.com)</a></p><p><a href="https://phantom-aria.github.io/2023/03/16/a.html">基因组注释(2)——散在重复序列注释 - 我的小破站 (phantom-aria.github.io)</a></p><p><a href="https://zhuanlan.zhihu.com/p/379464361">使用AUGSTUS+Geneid+GeneMark+GeMoMa+GenomeThreader+Exonerate进行基因结构预测 - 知乎 (zhihu.com)</a></p><p><a href="https://www.jianshu.com/p/b75b8e253552">基因组结构注释软件列表 - 简书 (jianshu.com)</a></p><p><a href="https://phantom-aria.github.io/2023/03/12/a.html">基因组注释(1)——串联重复序列注释 - 我的小破站 (phantom-aria.github.io)</a></p><h4 id="1-重复序列注释"><a href="#1-重复序列注释" class="headerlink" title="1 重复序列注释"></a>1 重复序列注释</h4><h5 id="1-1-重复序列的主要分类"><a href="#1-1-重复序列的主要分类" class="headerlink" title="1.1 重复序列的主要分类"></a>1.1 重复序列的主要分类</h5><p>重复序列包括串联重复(tandem repeats)和散布重复(dispersed repeats)。前者按照重复的长度可划分为卫星序列(>100bp)、小卫星序列(>10bp and<100bp)和微卫星序列(SSR,<10bp),没有固定的长度划分标准。后者也叫做转座子(transposable element,TE),包括DNA转座子(DNA transposons,也叫做Ⅱ型转座子)和RNA转座子(Retrotransposon,也叫做Ⅰ型转座子、反转录转座子),其中RNA转座子包括长末端重复序列LTR、长散布核元件LINE和短散布核元件SINE。</p><h5 id="1-2-注释软件"><a href="#1-2-注释软件" class="headerlink" title="1.2 注释软件"></a>1.2 注释软件</h5><p>TRF:检测序列中串联重复序列,全称为Tandem Repeat 
Finder;</p><p>GMATA:分析基因组中的SSR序列;</p><p>RepeatScout:从头注释重复序列的核心组件;</p><p>RECON:从头注释重复序列的核心组件;</p><p>RepeatModeler:主要针对非模式物种的从头注释;RepeatMasker version 4.1.5</p><p>RepeatMasker:通过重复序列数据库(包括Dfam和Repbase等)和从头注释得到的数据库进行注释得到整合的结果;</p><p><em>文献中软件使用的案例:</em></p><p>①The rough-toothed dolphin genome provides new insights into the genetic mechanism of its rough teeth (Huang, et al., 2023)</p><img src="/2024/07/17/genome-anno/image-20240325141415995.png" class title="image-20240325141415995"><p>②Chromosome-level genome assembly of hadal snailfish reveals mechanisms of deep-sea adaptation in vertebrates (Xu, et al., 2023)</p><img src="/2024/07/17/genome-anno/image-20240325141612852.png" class title="image-20240325141612852"><p>③Genome assembly of the deep-sea coral Lophelia pertusa (Herrera, et al., 2022)</p><img src="/2024/07/17/genome-anno/image-20240325142112181.png" class title="image-20240325142112181"><p>④The FirstDraft Genome of a Cold-Water Coral Trachythela sp. (Alcyonacea: Stolonifera: Clavulariidae) (Zhou, et al., 2020)</p><img src="/2024/07/17/genome-anno/image-20240325142530238.png" class title="image-20240325142530238"><h5 id="1-3-用于同源注释的数据库"><a href="#1-3-用于同源注释的数据库" class="headerlink" title="1.3 用于同源注释的数据库"></a>1.3 用于同源注释的数据库</h5><p>①Dfam:RepeatMasker自带的库,提供了大量已知的重复元件家族的信息;</p><p>②Repbase:用于存储和管理已知重复序列的数据库,在网上可以下载到2018年的版本,更新的版本则需要注册费下载;</p><p>下载并配置repbase库,进入到/home/tianzhen/miniconda3/envs/buscopy3.9/share/RepeatMasker/目录下,运行配置命令</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">perl ./configure</span><br></pre></td></tr></table></figure><img src="/2024/07/17/genome-anno/image-20240325102127133.png" class title="image-20240325102127133"><p>③dc20170127-rb20170127数据库:不太了解,仅在几篇推文中看到过;</p><p>④通过RepeatModeler对自己的物种进行从头注释的数据库;</p><h5 id="1-4-软件安装与使用"><a href="#1-4-软件安装与使用" class="headerlink" title="1.4 软件安装与使用"></a>1.4 
软件安装与使用</h5><p>①首先注释串联重复序列,利用GMATA和TRF两种软件。</p><p>GMATA安装非常简单,没有conda途径。下载链接<a href="https://sourceforge.net/projects/gmata/%EF%BC%8C%E8%A7%A3%E5%8E%8B%E5%90%8E%E4%B8%BA%E4%B8%80%E7%B3%BB%E5%88%97perl%E8%84%9A%E6%9C%AC%E5%92%8C%E9%85%8D%E7%BD%AE%E6%96%87%E4%BB%B6%E7%AD%89%EF%BC%8C%E9%A6%96%E5%85%88%E4%BF%AE%E6%94%B9default_cfg.txt%E6%96%87%E4%BB%B6%E4%B8%AD%E5%8F%82%E6%95%B0%E8%AE%BE%E7%BD%AE%EF%BC%8C%E5%B0%86[set]:doprimer_smt%E3%80%81[set]:elctPCR%E5%92%8C[set]:mk2gff3%E7%9A%84ModulRun">https://sourceforge.net/projects/gmata/,解压后为一系列perl脚本和配置文件等,首先修改default_cfg.txt文件中参数设置,将[set]:doprimer_smt、[set]:elctPCR和[set]:mk2gff3的ModulRun</a> = Y改为ModulRun = N,三者是设计SSR引物和模拟PCR的时候需要调用的,若设置为Y还得需要下载其他依赖。</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">#在default_cfg.txt文件默认参数中,每个SSR-motif的标准是:最小长度为2,最大长度为10,最少重复5次</span></span><br><span class="line">perl gmata.pl -c default_cfg.txt -i /project/tianzhenWu/z189/03genome_annotation/purged.fa</span><br><span class="line"></span><br><span class="line"><span class="comment">#利用trf(Tandem Repeats Finder, Version 4.09)预测串联重复序列,参数如下</span></span><br><span class="line">conda install bioconda::trf</span><br><span class="line">trf ../purged.fa 2 7 7 80 10 50 500 -f -d -h -r</span><br><span class="line"></span><br><span class="line"><span class="comment">#提取trf结果为gff3文件</span></span><br><span class="line">perl ~/scripts/repeat_to_gff.pl purged.fa.2.7.7.80.10.50.500.dat</span><br></pre></td></tr></table></figure><p>结果解读:GMATA和TRF的结果具有重叠性,前者的默认参数预测了以2-10bp为单位的串联重复(SSR),后者的默认参数则会预测2000bp以下为单位的串联重复,这里我后续的分析并非以SSR为主,因此只用了TRF的结果。TRF结果的统计见推文<a 
href="https://phantom-aria.github.io/2023/03/12/a.html">基因组注释(1)——串联重复序列注释 - 我的小破站 (phantom-aria.github.io)</a>,内含脚本,写得非常详细 [点赞]。</p><p>②针对散在重复序列,结合从头注释和同源注释的思路。先利用RepeatModeler进行自我比对,建立本地数据库,然后利用RepeatMasker根据公共数据库近缘物种的重复序列和本地自建的数据库进行重复序列的从头注释。</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">#利用conda安装RepeatModeler的2.0.5版本,根据其依赖包的要求配置python环境,本文以python3.9版本为例</span></span><br><span class="line">conda install repeatmodeler=2.0.5</span><br><span class="line"></span><br><span class="line"><span class="comment">#从头预测转座子</span></span><br><span class="line">BuildDatabase -name z189 -engine ncbi ../purged.fa</span><br><span class="line">RepeatModeler -pa 20 -database z189 -engine ncbi</span><br></pre></td></tr></table></figure><p>RepeatModeler运行五轮结束(可能也会运行6轮),日志文件如下:</p><img src="/2024/07/17/genome-anno/image-20240327094818877.png" class title="image-20240327094818877"><p>结果主要包括z189-families.fa、z189-families.stk和RM_29354.TueMar261429412024目录,目录下包括各种过程文件。</p><p>利用RepeatModeler的结果z189-families.fa文件运行RepeatMasker,此时运用-lib参数指定目标文件,然而当利用重复序列数据库运行RepeatMasker时,则运用-species参数指定目标物种,</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">RepeatMasker ../purged.fa -lib z189-families.fa -e ncbi -pa 40 -poly -html -gff -dir ./denove/</span><br></pre></td></tr></table></figure><p>利用重复序列数据库近缘物种的重复序列运行RepeatMasker,先找到目标物种的近缘种是否在数据库中,然后提取近缘物种的重复序列,形成近缘物种重复序列数据库,将其与上面RepeatModeler产生的自身比对重复序列数据库合并,形成最终用于从头注释的库repeat_for_anno.fasta。</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span 
class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line">python famdb.py -i Libraries/RepeatMaskerLib.h5 lineage -ad Anthozoa <span class="comment">#查找信息</span></span><br><span class="line"></span><br><span class="line">python famdb.py -i Libraries/RepeatMaskerLib.h5 families -f embl -a -d Anthozoa > Anthozoa.embl <span class="comment">#将信息存储为embl格式文件</span></span><br><span class="line"></span><br><span class="line">buildRMLibFromEMBL.pl Anthozoa.embl > Anthozoa.fasta <span class="comment">#创建近缘物种重复序列</span></span><br><span class="line"></span><br><span class="line">cat /home/tianzhen/miniconda3/envs/buscopy3.9/share/RepeatMasker/Anthozoa.fasta ../z189-families.fa > repeat_for_anno.fasta <span class="comment">#将RepeatMasker重复序列库和近缘物种重复序列库合并</span></span><br><span class="line"></span><br><span class="line"><span class="comment">##进行软屏蔽,并输出重复序列表,加上-a参数输出比对文件</span></span><br><span class="line">RepeatMasker -a -xsmall -nolow -norna -html -gff -dir ./xsmall_output -lib repeat_for_anno.fasta -e ncbi -pa 50 -poly ../../purged.fa</span><br></pre></td></tr></table></figure><p>the neutral mutation rate (μ) was calculated using r8s v1.8.1</p><h4 id="2-非编码RNA注释"><a href="#2-非编码RNA注释" class="headerlink" title="2 非编码RNA注释"></a>2 非编码RNA注释</h4><h5 id="2-1-非编码rRNA分类"><a href="#2-1-非编码rRNA分类" class="headerlink" title="2.1 非编码rRNA分类"></a>2.1 非编码rRNA分类</h5><p>非编码RNA (ncRNA)包括rRNA, tRNA和其他小RNA (miRNA、snRNA等)。更详细的分类可以在Rfam网站进行查询<a href="https://rfam.org/search#tabview=tab5">Rfam: Search Rfam</a>。</p><h5 id="2-2-注释软件"><a href="#2-2-注释软件" class="headerlink" title="2.2 注释软件"></a>2.2 注释软件</h5><p>①Rnammer:一种用于预测原核生物和真核生物基因组中的RNA基因的工具,能够识别并定位16S/18S、5S和23S/28S 
rRNA基因序列,该软件利用隐马尔可夫模型(HMM)来识别RNA基因的保守结构特征,从而进行预测和定位。</p><p>②tRNAscan-SE:用于预测原核生物和真核生物基因组中转运RNA(tRNA)基因的工具,其结合了序列比对和结构预测的方法,能够准确地识别tRNA基因序列并预测其二级结构。tRNAscan-SE软件具有较高的准确性和灵敏度,为基因组注释、生物信息学研究和进化分析提供重要的信息。</p><p>③INFERNAL:运用Cmsan程序查找各类非编码RNA,包括miRNA、snRNA等。</p><h5 id="2-3-用于注释的数据库"><a href="#2-3-用于注释的数据库" class="headerlink" title="2.3 用于注释的数据库"></a>2.3 用于注释的数据库</h5><p>RFAM数据库是一个专门用于存储和注释非编码RNA序列的数据库<a href="https://rfam.org/">Rfam: The RNA families database</a>。可以在这个页面<a href="https://rfam.org/search#tabview=tab5">Rfam: Search Rfam</a>查看不同非编码RNA的分类,以及通过sequence search功能来查看我们所感兴趣的序列片段是否属于某一个非编码RNA家族。</p><h5 id="2-4-软件安装与使用"><a href="#2-4-软件安装与使用" class="headerlink" title="2.4 软件安装与使用"></a>2.4 软件安装与使用</h5><h6 id="2-4-1-利用Infernal软件注释ncRNA"><a href="#2-4-1-利用Infernal软件注释ncRNA" class="headerlink" title="2.4.1 利用Infernal软件注释ncRNA"></a>2.4.1 利用Infernal软件注释ncRNA</h6><p>参考<a href="https://yanzhongsino.github.io/2022/04/22/omics_genome.annotation_ncRNA/">基因组注释(四):非编码RNA的注释-用Infernal软件对Rfam 12进行RNA注释 | 生信技工 (yanzhongsino.github.io)</a></p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">conda install infernal=1.1.5</span><br></pre></td></tr></table></figure><p>在Rfam网站<a href="https://rfam.org/">Rfam: The RNA families database</a>下载RNA family数据库h:版本为Rfam 14.10 (November 2023, 4170 families),下载配套的clanin文件</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span 
class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br></pre></td><td class="code"><pre><span class="line">wget -c https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz</span><br><span class="line"></span><br><span class="line">wget -c https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.clanin</span><br><span class="line"></span><br><span class="line">gunzip Rfam.cm.gz <span class="comment">#解压</span></span><br><span class="line"></span><br><span class="line">cmpress Rfam.cm <span class="comment">#建立索引并生成Rfam.cm.i1f, Rfam.cm.i1i, Rfam.cm.i1m, Rfam.cm.i1p</span></span><br><span class="line"></span><br><span class="line">cmscan -Z 1650 --cut_ga --rfam --nohmmonly --fmt 2 --tblout sample.tblout -o sample.result --clanin Rfam.clanin Rfam.cm ../purged.fa.masked <span class="comment">#-Z参数根据基因组大小来定,基因组大小的2倍,以Mb单位选一个整数</span></span><br><span class="line"></span><br><span class="line">perl infernal-tblout2gff.pl --cmscan --fmt2 sample.tblout >sample.infernal.ncRNA.gff3 <span class="comment">#将结果转成gff3文件</span></span><br><span class="line"></span><br><span class="line"><span class="comment">#统计各类ncRNA总数</span></span><br><span class="line">awk <span class="string">'BEGIN{OFS="\t";}{if(FNR==1) print "target_name\taccession\tquery_name\tquery_start\tquery_end\tstrand\tscore\tEvalue"; if(FNR>2 && $20!="=" && $0!~/^#/) print $2,$3,$4,$10,$11,$12,$17,$18; }'</span> sample.tblout >sample.tblout.xls</span><br><span class="line"></span><br><span class="line"><span class="comment">#在rfam官网下载所有的Entry types</span></span><br><span class="line">cat rfam.txt | awk <span class="string">'BEGIN {FS=OFS="\t"}{split($3,x,";");class=x[2];print $1,$2,$3,$4,class}'</span> > rfam_anno.txt</span><br><span class="line"></span><br><span class="line"><span 
class="comment">#统计结果</span></span><br><span class="line">awk <span class="string">'BEGIN{OFS=FS="\t"}ARGIND==1{a[$2]=$5;}ARGIND==2{type=a[$1]; if(type=="") type="Others"; count[type]+=1;}END{for(type in count) print type, count[type];}'</span> rfam_anno.txt sample.tblout.xls >sample.ncRNA.statistic</span><br><span class="line"></span><br><span class="line"><span class="comment">#统计具体的分类信息,比如miRNA</span></span><br><span class="line">grep <span class="string">"miRNA"</span> rfam_anno.txt |cut -f1 >miRNA.tem</span><br><span class="line">grep -f miRNA.tem sample.tblout.xls >miRNA.txt</span><br><span class="line">awk <span class="string">'{sum += (int($5) - int($4) >= 0 ? int($5) - int($4) : int($4) - int($5)) + 1} END {print sum}'</span> miRNA.txt <span class="comment">#打印miRNA序列总长数值到屏幕</span></span><br></pre></td></tr></table></figure><p>本部分详细解读请查阅生信技工的博客,非常详细且无报错<a href="https://yanzhongsino.github.io/2022/04/22/omics_genome.annotation_ncRNA/">基因组注释(四):非编码RNA的注释-用Infernal软件对Rfam 12进行RNA注释 | 生信技工 (yanzhongsino.github.io)</a></p><h6 id="2-4-2-利用rnammer-注释rRNA"><a href="#2-4-2-利用rnammer-注释rRNA" class="headerlink" title="2.4.2 利用rnammer 注释rRNA"></a>2.4.2 利用rnammer 注释rRNA</h6><p>rnammer软件不能通过conda进行安装,目前只允许院校研究人员通过单位邮件来申请,申请地址为<a href="https://services.healthtech.dtu.dk/services/RNAmmer-1.2/">RNAmmer 1.2 - DTU Health Tech - Bioinformatic Services</a>,提交后立即发送下载链接到邮箱,下载链接4小时内有效,下载链接如下。</p><img src="/2024/07/17/genome-anno/image-20240328203933759.png" class title="image-20240328203933759"><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">wget -c rnammer-1.2.Unix.tar.gz <span class="comment">#下载软件</span></span><br><span class="line"></span><br><span class="line">tar zxf rnammer-1.2.Unix.tar.gz <span 
class="comment">#解压软件包,其中rnammer文件为一个perl脚本,执行该脚本便可以预测rRNA,但是需要修改其中的两个绝对路径,以便脚本运行时能够找到相应的程序。而修改的绝对路径包括本perl脚本本身的安装目录以及序列检索所需要的Hmmer程序。rnammer要求hmmer版本为2.3.2,若已安装最新版本的hmmer,则需要重新安装hmmer-2.3.2。另外其他不同推文表示该软件依赖perl模块,我利用conda安装了perl-xml-simple便可运行成功,也有推文表示需要perl4版本下的模块Perl4::CoreLibs,这个我没有安装,也可运行成功。</span></span><br></pre></td></tr></table></figure><p>安装hmmer2.3.2和perl模块</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line">conda install perl-xml-simple <span class="comment">#安装perl-xml-simple</span></span><br><span class="line"></span><br><span class="line">wget http://eddylab.org/software/hmmer/hmmer-2.3.2.tar.gz <span class="comment">#下载hmmer-2.3.2</span></span><br><span class="line"></span><br><span class="line">tar -xf hmmer-2.3.2.tar.gz</span><br><span class="line"></span><br><span class="line"><span class="built_in">cd</span> hmmer-2.3.2</span><br><span class="line"></span><br><span class="line">./configure</span><br><span class="line"></span><br><span class="line">make</span><br><span class="line"></span><br><span class="line"><span class="built_in">cd</span> src;hmmsearch <span class="comment">#打印帮助文件则表示安装成功</span></span><br></pre></td></tr></table></figure><p>在rnammer文件的目录下修改rnammer中的绝对路径</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span 
class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line">perl -p -i -e <span class="string">'s/(my \$INSTALL_PATH).*/$1 = \"\/project\/tianzhenWu\/software\/rnammer\";/'</span> rnammer <span class="comment">#将\/project\/tianzhenWu\/software\/rnammer换为rnammer的目录</span></span><br><span class="line"></span><br><span class="line">perl -p -i -e <span class="string">'s/^(\s+\$HMMSEARCH_BINARY).*/$1 = \"\/project\/tianzhenWu\/software\/hmmer-2.3.2\/src\/hmmsearch\";/'</span> rnammer <span class="comment">#将\/project\/tianzhenWu\/software\/hmmer-2.3.2\/src\/换位hmmsearch的目录</span></span><br><span class="line"></span><br><span class="line"><span class="comment">#另外,修改该目录下core-rnammer的多线运行限制,可以缩短运行时间</span></span><br><span class="line">perl -p -i -e <span class="string">'s/--cpu 1//g'</span> core-rnammer</span><br><span class="line"></span><br><span class="line"><span class="comment">#运行脚本,注释rRNA</span></span><br><span class="line">perl /project/tianzhenWu/software/rnammer/rnammer -S euk -multi -m lsu,ssu,tsu -gff rRNA.gff2 -f rRNA.fasta -h rRNA.hmmreport -xml rRNA.xml /project/tianzhenWu/z189/03genome_annotation/03ncrna/purged.fa.masked</span><br></pre></td></tr></table></figure><p>此部分主要参考<a href="https://matinnuhamunada.github.io/posts/2022/01/rnammer">How to Install RNAmmer in Prokka - Matin Nuhamunada</a>、<a href="https://www.jianshu.com/p/ba28e4674961">本地rnammer-1.2使用 - 简书 (jianshu.com)</a>和<a href="http://www.chenlianfu.com/?p=1979">RNAmmer的安装和使用 | 陈连福的生信博客 (chenlianfu.com)</a></p><h6 id="2-4-3-利用tRNAscan-SE注释tRNA"><a href="#2-4-3-利用tRNAscan-SE注释tRNA" class="headerlink" title="2.4.3 利用tRNAscan-SE注释tRNA"></a>2.4.3 利用tRNAscan-SE注释tRNA</h6><p>利用conda安装</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span 
class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br></pre></td><td class="code"><pre><span class="line">conda install trnascan-se=2.0.12</span><br><span class="line"></span><br><span class="line"><span class="comment">#基础用法</span></span><br><span class="line">Basic Options</span><br><span class="line"> -E : search <span class="keyword">for</span> eukaryotic tRNAs (default)</span><br><span class="line"> -B : search <span class="keyword">for</span> bacterial tRNAs</span><br><span class="line"> -A : search <span class="keyword">for</span> archaeal tRNAs</span><br><span class="line"> -M <model> : search <span class="keyword">for</span> mitochondrial tRNAs</span><br><span class="line"> options: mammal, vert</span><br><span class="line"> -O : search <span class="keyword">for</span> other organellar tRNAs</span><br><span class="line"> -G : use general tRNA model (cytoslic tRNAs from all 3 domains included)</span><br><span class="line"> -L : search using the legacy method (tRNAscan, EufindtRNA, and COVE)</span><br><span class="line"> use with -E, -B, -A, -O, or -G</span><br><span class="line"> -I : search using Infernal (default)</span><br><span class="line"> use with -E, -B, -A, -O, or -G</span><br><span class="line"> -o <file> : save final results <span class="keyword">in</span> <file></span><br><span class="line"> -f <file> : save tRNA secondary structures to 
<file></span><br><span class="line"> -m <file> : save statistics summary <span class="keyword">for</span> run <span class="keyword">in</span> <file></span><br><span class="line"> (speed, <span class="comment"># tRNAs found in each part of search, etc)</span></span><br><span class="line"> -H : show both primary and secondary structure components to</span><br><span class="line"> covariance model bit scores</span><br><span class="line"> -q : quiet mode (credits & run option selections suppressed)</span><br><span class="line"></span><br><span class="line"> -h : <span class="built_in">print</span> full list (long) of available options</span><br><span class="line"> </span><br><span class="line"><span class="comment">#运行软件</span></span><br><span class="line">tRNAscan-SE -E -o tRNA.out -f tRNA.ss -m tRNA.stats ../purged.fa.masked</span><br></pre></td></tr></table></figure><h4 id="3-编码基因注释"><a href="#3-编码基因注释" class="headerlink" title="3 编码基因注释"></a>3 编码基因注释</h4><h5 id="3-1-基因结构预测的方案"><a href="#3-1-基因结构预测的方案" class="headerlink" title="3.1 基因结构预测的方案"></a>3.1 基因结构预测的方案</h5><p>一般包括三部分:①利用近缘物种的蛋白进行同源注释;②利用转录组或isoform测序数据进行基因模型预测;③通过训练已有的数据来从头预测基因组的基因结构。</p><h5 id="3-2-参考文章示例"><a href="#3-2-参考文章示例" class="headerlink" title="3.2 参考文章示例"></a>3.2 参考文章示例</h5><p>①A draft genome assembly of reef-building octocoral Heliopora coerulea (Ip, et al., 2023)</p><img src="/2024/07/17/genome-anno/image-20240328220942281.png" class title="image-20240328220942281"><p>②Penaeid shrimp genome provides insights into benthic adaptation and frequent molting (Zhang, et al., 2019)</p><img src="/2024/07/17/genome-anno/image-20240328221247850.png" class title="image-20240328221247850"><img src="/2024/07/17/genome-anno/image-20240328221258675.png" class title="image-20240328221258675"><p>③Chromosome-level genome assembly of the silver pomfret Pampus argenteus (Wei, et al., 2024)</p><img src="/2024/07/17/genome-anno/image-20240328222148370.png" class title="image-20240328222148370"><h5 id="3-3-基因结构预测的软件"><a 
href="#3-3-基因结构预测的软件" class="headerlink" title="3.3 基因结构预测的软件"></a>3.3 Software for gene structure prediction</h5><p>These tools are listed in detail in the Jianshu post <a href="https://www.jianshu.com/p/b75b8e253552">基因组结构注释软件列表 - 简书 (jianshu.com)</a> by <a href="https://www.jianshu.com/u/740f4b0f11e9">Mr_我爱读文献</a>, so they are not repeated here.</p><h5 id="3-4-软件安装及其使用"><a href="#3-4-软件安装及其使用" class="headerlink" title="3.4 软件安装及其使用"></a>3.4 Software installation and usage</h5><h6 id="3-4-1基于转录组数据注释"><a href="#3-4-1基于转录组数据注释" class="headerlink" title="3.4.1基于转录组数据注释"></a>3.4.1 Transcriptome-based annotation</h6><p>Strategy 1: HISAT2 + StringTie + TransDecoder</p><p>See <a href="https://zhuanlan.zhihu.com/p/659049169">基因结构预测 |TransDecoder和PASA基于转录组预测及转录本去冗余 - 知乎 (zhihu.com)</a> and <a href="https://www.zhouxiaozhao.cn/2020/11/24/2020-11-24-annotion(3)/">基因结构注释(3):转录组注释 - 生信学习 | Zhou Xiaozhao = 小钊の笔记 = 前天是小兔子,昨天是小鹿,今天是你</a>.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># The required tools can be installed with conda</span></span><br><span class="line">conda install fastp=0.23.4</span><br><span class="line">conda install hisat2=2.2.1</span><br><span class="line">conda install samtools=1.19.2</span><br><span class="line">conda install stringtie=2.2.1</span><br><span class="line"></span><br><span class="line"><span class="comment"># Quality-control the short-read RNA-seq data</span></span><br><span class="line">fastp -i 
Unknown_BD751-01T0002_good_1.fq.gz -o output.R1.fq -I Unknown_BD751-01T0002_good_2.fq.gz -O output.R2.fq -w 8</span><br><span class="line"></span><br><span class="line"><span class="comment"># Build a HISAT2 index for the repeat-masked genome</span></span><br><span class="line">hisat2-build ../purged.fa.masked z189.genome.index</span><br><span class="line"></span><br><span class="line"><span class="comment"># Map the paired-end reads to the genome</span></span><br><span class="line">hisat2 -p 20 --dta --no-mixed -x z189.genome.index -1 output.R1.fq -2 output.R2.fq --no-unal -S z189.sam 2>z189.summary.txt</span><br><span class="line"></span><br><span class="line"><span class="comment"># Convert the alignments to BAM and sort them with samtools</span></span><br><span class="line">samtools view -b z189.sam -o z189.bam</span><br><span class="line">samtools sort z189.bam -o z189.sort.bam</span><br><span class="line"></span><br><span class="line"><span class="comment"># Assemble transcripts with StringTie</span></span><br><span class="line">stringtie -p 20 -o 02out.gtf z189.sort.bam</span><br></pre></td></tr></table></figure><p>Finally, annotate the protein-coding regions with TransDecoder; see <a href="https://www.jianshu.com/p/fdd547223ed5">使用TransDecoder寻找转录本中的编码区 - 简书 (jianshu.com)</a>.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br></pre></td><td class="code"><pre><span class="line">conda install 
bioconda::transdecoder=5.5.0</span><br><span class="line">conda install bioconda::diamond=2.1.9</span><br><span class="line"></span><br><span class="line"><span class="comment"># Extract transcript FASTA sequences from the GTF file</span></span><br><span class="line">gtf_genome_to_cdna_fasta.pl 02out.gtf ../purged.fa.masked > transcripts.fasta</span><br><span class="line"><span class="comment"># Convert the GTF file to GFF3</span></span><br><span class="line">gtf_to_alignment_gff3.pl 02out.gtf > 02out.gff3</span><br><span class="line"><span class="comment"># Predict long open reading frames in the transcripts</span></span><br><span class="line">TransDecoder.LongOrfs -t transcripts.fasta</span><br><span class="line"></span><br><span class="line"><span class="comment"># Compare the predicted ORFs (from start codon to stop codon) against UniProt proteins to check the predictions and find homology support. This requires downloading the UniProt Swiss-Prot protein set</span></span><br><span class="line">wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz</span><br><span class="line">gunzip uniprot_sprot.fasta.gz</span><br><span class="line"><span class="comment"># Build a DIAMOND database from the protein file before searching</span></span><br><span class="line">diamond makedb --<span class="keyword">in</span> uniprot_sprot.fasta --db uniprot_sprot.fasta</span><br><span class="line"><span class="comment"># Run the search; DIAMOND blastp is very fast</span></span><br><span class="line">diamond blastp -d uniprot_sprot.fasta -q transcripts.fasta.transdecoder_dir/longest_orfs.pep --evalue 1e-5 --max-target-seqs 1 > blastp.outfmt6</span><br><span class="line"></span><br><span class="line"><span class="comment"># Predict the likely coding regions</span></span><br><span class="line">TransDecoder.Predict -t transcripts.fasta --retain_blastp_hits blastp.outfmt6</span><br><span class="line"><span class="comment"># Generate the genome-based coding-region annotation file</span></span><br><span class="line">cdna_alignment_orf_to_genome_orf.pl transcripts.fasta.transdecoder.gff3 02out.gff3 transcripts.fasta > 
transcripts.fasta.transdecoder.genome.gff3</span><br></pre></td></tr></table></figure><p>This produces the transcriptome-based annotation GFF file, transcripts.fasta.transdecoder.genome.gff3.</p><p>Note: there is no script for converting this file to an EVM-format GFF file. When integrating the GFF files, the approach is to merge it with the ab initio GFF file and label it as OTHER_PREDICTION transdecoder in the weights file.</p><p>Strategy 2 (can annotate alternative transcripts): Trinity + PASA</p><p>First, assemble the transcriptome de novo with Trinity:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line">sh trinity.sh <span class="comment"># the contents of trinity.sh are shown below</span></span><br><span class="line"></span><br><span class="line"><span class="comment">## trinity.sh</span></span><br><span class="line"><span class="comment">#!/bin/bash</span></span><br><span class="line">Trinity --seqType fq \</span><br><span class="line">--left Unknown_BD751-01T0002_good_1.fq.gz \</span><br><span class="line">--right Unknown_BD751-01T0002_good_2.fq.gz \</span><br><span class="line">--CPU 20 \</span><br><span class="line">--max_memory 30G</span><br></pre></td></tr></table></figure><h6 id="3-4-2-基于近缘物种的同源蛋白注释"><a href="#3-4-2-基于近缘物种的同源蛋白注释" class="headerlink" title="3.4.2 基于近缘物种的同源蛋白注释"></a>3.4.2 Homology-based annotation with proteins from related species</h6><p>① miniprot: download and extract protein sequences from closely related species (published studies typically use about five species), then map the proteins to the genome to obtain a GFF file. See Heng Li's official documentation, <a href="https://github.com/lh3/miniprot?tab=readme-ov-file#usage">lh3/miniprot: Align proteins to genomes with splicing and frameshift (github.com)</a>; the recommended usage is shown below.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span 
class="comment"># Install miniprot with conda</span></span><br><span class="line">conda install miniprot=0.13</span><br><span class="line"></span><br><span class="line"><span class="comment"># miniprot indexing is slow and memory-intensive, so build the index in advance</span></span><br><span class="line">miniprot -t8 -d ref.mpi ../purged.fa.masked</span><br><span class="line">miniprot -t8 -I --gff ref.mpi protein.faa > out.gff3</span><br></pre></td></tr></table></figure><p>This produces out.gff3, the homology-based annotation from related-species proteins.</p><p>Convert this file to an EVM-format GFF file with the following script, available at <a href="https://github.com/EVidenceModeler/EVidenceModeler/blob/master/EvmUtils/misc/miniprot_GFF_2_EVM_GFF3.py">EVidenceModeler/EvmUtils/misc/miniprot_GFF_2_EVM_GFF3.py at master · EVidenceModeler/EVidenceModeler (github.com)</a></p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">python ~/scripts/miniprot_GFF_2_EVM_GFF3.py out.gff3 > out_evm.gff3</span><br></pre></td></tr></table></figure><p>② GeMoMa: this requires the genome assembly (FASTA), the GFF3 annotation and the protein sequences of a closely related species, plus the RNA-seq read mapping file (SAM format) produced above.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Install with conda, preferably in a new environment</span></span><br><span class="line">conda create -n gemoma python=3</span><br><span class="line">conda activate gemoma</span><br><span class="line">conda install bioconda::gemoma=1.9</span><br></pre></td></tr></table></figure><p>GeMoMa can be run in several ways:</p><p>① The GeMoMa GeMoMaPipeline command followed by parameters; see <a href="https://zhuanlan.zhihu.com/p/688384179">GeMoMa:基因同源预测软件 - 知乎 (zhihu.com)</a></p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">GeMoMa GeMoMaPipeline threads=10 outdir=`<span class="built_in">pwd</span>` GeMoMa.Score=ReAlign 
AnnotationFinalizer.r=NO o=<span class="literal">true</span> t=scaffold.fa a=Genome1.gff g=Genome1.fa</span><br></pre></td></tr></table></figure><p>② Running the jar directly with java; see <a href="https://www.jianshu.com/p/8d795097d859">基因注释工具GeMoMa - 简书 (jianshu.com)</a></p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">java -jar GeMoMa-1.9.jar CLI GeMoMaPipeline threads=4 outdir=output GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=<span class="literal">true</span> t=target.fa a=ref.gff g=ref.fa</span><br></pre></td></tr></table></figure><p>The first two methods kept failing with errors, so I tried a third.</p><p>③ Find the absolute path of pipeline.sh, for example /public/home/wutianzhen/miniconda3/pkgs/gemoma-1.9-hdfd78af_0/share/gemoma-1.9-0/pipeline.sh, and run that script; see <a href="https://www.jianshu.com/p/6d9d9f0c38a6">gemoma安装与简单用法 - 简书 (jianshu.com)</a>. The contents of pipeline.sh are shown below: supply all 8 arguments if you have RNA-seq data, or only the first 6 if you do not.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br></pre></td><td 
class="code"><pre><span class="line"><span class="meta">#!/bin/bash</span></span><br><span class="line">jar=$(<span class="built_in">eval</span> <span class="string">"ls /public/home/wutianzhen/miniconda3/pkgs/gemoma-1.9-hdfd78af_0/share/gemoma-1.9-0/GeMoMa-*.jar"</span>) <span class="comment"># use the absolute path to GeMoMa-1.9.jar after ls here; otherwise the script only works when run from the directory that contains the jar</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># This script allows to run the complete GeMoMa pipeline multithreaded from the command line.</span></span><br><span class="line"><span class="comment"># The final prediction is located in ${out}/filtered_predictions.gff.</span></span><br><span class="line"><span class="comment">#</span></span><br><span class="line"><span class="comment"># A simple example without RNA-seq using tblastn is</span></span><br><span class="line"><span class="comment"># ./pipeline.sh tblastn tests/gemoma/target-fragment.fasta tests/gemoma/ref-annotation.gff tests/gemoma/ref-fragment.fasta 1 results/sw-pipeline</span></span><br><span class="line"></span><br><span class="line"><span class="comment">#parameters</span></span><br><span class="line"><span class="keyword">if</span> [ <span class="string">"<span class="variable">$1</span>"</span> == <span class="string">"tblastn"</span> ] <span class="comment"># use tblastn as the search method</span></span><br><span class="line"><span class="keyword">then</span></span><br><span class="line"> tblastn=<span class="literal">true</span>;</span><br><span class="line"><span class="keyword">else</span></span><br><span class="line"> tblastn=<span class="literal">false</span>;</span><br><span class="line"><span class="keyword">fi</span></span><br><span class="line">target_genome=<span class="variable">$2</span> <span class="comment"># our own (target) genome file; its extension must be one of (fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</span></span><br><span class="line">ref_annotation=<span class="variable">$3</span> <span 
class="comment"># reference genome annotation file</span></span><br><span class="line">ref_genome=<span class="variable">$4</span> <span class="comment"># reference genome sequence file</span></span><br><span class="line">threads=<span class="variable">$5</span> <span class="comment"># number of threads to use</span></span><br><span class="line">out=<span class="variable">$6</span> <span class="comment"># output directory</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> [ <span class="variable">$#</span> -ne 6 ]; <span class="keyword">then</span></span><br><span class="line"> lib=<span class="variable">$7</span>; <span class="comment"># RNA-seq library type: FR_UNSTRANDED for unstranded libraries, FR_FIRST_STRAND for stranded libraries (sense strand), FR_SECOND_STRAND for stranded libraries (antisense strand)</span></span><br><span class="line"> reads=<span class="variable">$8</span>; <span class="comment"># SAM or BAM file of mapped RNA-seq reads</span></span><br><span class="line"> <span class="built_in">echo</span> <span class="string">"GeMoMa using RNA-seq data: library type="</span> <span class="variable">$lib</span> <span class="string">"mapped reads="</span> <span class="variable">$reads</span></span><br><span class="line"> time java -jar <span class="variable">$jar</span> CLI GeMoMaPipeline threads=<span class="variable">$threads</span> t=<span class="variable">$target_genome</span> s=own a=<span class="variable">$ref_annotation</span> g=<span class="variable">$ref_genome</span> tblastn=<span class="variable">${tblastn}</span> outdir=<span class="variable">$out</span> r=MAPPED ERE.s=<span class="variable">$lib</span> ERE.m=<span class="variable">$reads</span> ERE.c=<span class="literal">true</span> AnnotationFinalizer.r=NO</span><br><span class="line"><span class="keyword">else</span></span><br><span class="line"> <span class="built_in">echo</span> <span class="string">"GeMoMa without RNA-seq data"</span></span><br><span class="line"> time java -jar <span class="variable">$jar</span> CLI GeMoMaPipeline threads=<span class="variable">$threads</span> t=<span 
class="variable">$target_genome</span> s=own a=<span class="variable">$ref_annotation</span> g=<span class="variable">$ref_genome</span> tblastn=<span class="variable">${tblastn}</span> outdir=<span class="variable">$out</span> AnnotationFinalizer.r=NO</span><br><span class="line"><span class="keyword">fi</span></span><br></pre></td></tr></table></figure><p>Run the script:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">sh /public/home/wutianzhen/miniconda3/pkgs/gemoma-1.9-hdfd78af_0/share/gemoma-1.9-0/pipeline.sh tblastn purged.fa.masked.fas genomic.gff dgig.fna 5 ./result FR_UNSTRANDED ../../02Transcriptome/z189.bam</span><br></pre></td></tr></table></figure><p>The third method ran successfully, although I still hit two errors along the way: one caused by the path to GeMoMa-1.9.jar and the other by the file extension of the target genome. The errors from the first two methods may have had the same causes.</p><p>When GeMoMa finishes, it produces the file final_annotation.gff.</p><p>Convert this file to an EVM-format GFF file with the following script, available at <a href="https://github.com/EVidenceModeler/EVidenceModeler/blob/master/EvmUtils/misc/GeMoMa_gff_to_gff3.pl">EVidenceModeler/EvmUtils/misc/GeMoMa_gff_to_gff3.pl at master · EVidenceModeler/EVidenceModeler (github.com)</a></p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">perl ~/scripts/GeMoMa_gff_to_gff3.pl final_annotation.gff > final_annotation_evm.gff</span><br></pre></td></tr></table></figure><p>The script failed because three modules were missing: Gene_obj.pm, Nuc_translator.pm and Longest_orf.pm.</p><p>These modules can be downloaded from <a href="https://github.com/EVidenceModeler/EVidenceModeler/blob/master/PerlLib/Longest_orf.pm">EVidenceModeler/PerlLib/Longest_orf.pm at master · EVidenceModeler/EVidenceModeler (github.com)</a>.</p><p>Place them in a directory already listed in @INC, which you can print with the following command:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">perl -e '{print "$_\n" foreach @INC}'</span><br></pre></td></tr></table></figure><img 
src="/2024/07/17/genome-anno/image-20240419111329510.png" class title="image-20240419111329510"><p>Placing the downloaded Perl modules in /public/home/wutianzhen/miniconda3/envs/annotation/lib/perl5/core_perl/ is enough.</p><h6 id="3-4-3-从头基因结构预测"><a href="#3-4-3-从头基因结构预测" class="headerlink" title="3.4.3 从头基因结构预测"></a>3.4.3 Ab initio gene structure prediction</h6><p>Train a gene model with Augustus; the genome sequence and GFF3 annotation of a closely related species can be downloaded as training data.</p><p>See <a href="https://www.jianshu.com/p/6f7b2998600c">Augustus 基因从头预测 - 简书 (jianshu.com)</a> and <a href="https://docs.hpc.sjtu.edu.cn/app/bioinformatics/augustus.html">AUGUSTUS - 上海交大超算平台用户手册 Documentation (sjtu.edu.cn)</a>.</p><p>Installing into a fresh conda environment is recommended.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span 
class="line">42</span><br><span class="line">43</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Install the required tools</span></span><br><span class="line">conda install bioconda::augustus=3.5</span><br><span class="line">conda install gffread=0.12.7</span><br><span class="line">conda install seqkit=2.8.1</span><br><span class="line"></span><br><span class="line"><span class="comment"># First extract the protein sequences</span></span><br><span class="line">gffread genomic.gff -g GCF_004324835.1_DenGig_1.0_genomic.fna -y dgig.pep</span><br><span class="line"></span><br><span class="line"><span class="comment"># Generate a GenBank-format file</span></span><br><span class="line">gff2gbSmallDNA.pl genomic.gff GCF_004324835.1_DenGig_1.0_genomic.fna 1000 gene.raw.gb</span><br><span class="line"></span><br><span class="line"><span class="comment"># Run a trial training to catch problematic genes</span></span><br><span class="line">etraining --species=generic --stopCodonExcludedFromCDS=<span class="literal">false</span> gene.raw.gb 2> train.err</span><br><span class="line"></span><br><span class="line"><span class="comment"># Filter out the erroneous gene structures</span></span><br><span class="line">cat train.err | perl -pe <span class="string">'s/.*in sequence (\S+): .*/$1/'</span> >badgenes.lst</span><br><span class="line">filterGenes.pl badgenes.lst gene.raw.gb > genes.gb</span><br><span class="line"></span><br><span class="line"><span class="comment"># Extract the proteins that passed the filter</span></span><br><span class="line">grep <span class="string">'/gene'</span> genes.gb |sort |uniq |sed <span class="string">'s/\/gene=//g'</span> |sed <span class="string">'s/\"//g'</span> |awk <span class="string">'{print $1}'</span> >geneSet.lst</span><br><span class="line">seqkit grep -f geneSet.lst dgig.pep >geneSet.lst.fa <span class="comment"># 19708 protein models</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># Build a BLAST database from these proteins and run an all-vs-all blastp. If two genes share identity >= 70%, keep only one of them, giving a second filtered GFF file, gene_filter.gff3</span></span><br><span class="line">makeblastdb -<span 
class="keyword">in</span> geneSet.lst.fa -dbtype prot -parse_seqids -out geneSet.lst.fa</span><br><span class="line">blastp -db geneSet.lst.fa -query geneSet.lst.fa -out geneSet.lst.fa.blastp -evalue 1e-5 -outfmt 6 -num_threads 8</span><br><span class="line"></span><br><span class="line">awk <span class="string">'$3 >= 70 && $1 != $2 {print $2}'</span> geneSet.lst.fa.blastp | sort | uniq > filtered_lines.txt</span><br><span class="line">grep -v -w -F -f filtered_lines.txt genomic.gff > gene_filter.gff3</span><br><span class="line"></span><br><span class="line"><span class="comment"># Convert gene_filter.gff3 to GenBank format</span></span><br><span class="line">gff2gbSmallDNA.pl gene_filter.gff3 GCF_004324835.1_DenGig_1.0_genomic.fna 1000 genes.gb.filter</span><br><span class="line"></span><br><span class="line"><span class="comment"># Randomly split the filtered file into a test set and a training set; 100 is the number of genes in the test set, and the rest form the training set</span></span><br><span class="line">randomSplit.pl genes.gb.filter 100</span><br><span class="line"></span><br><span class="line"><span class="comment"># Train the model</span></span><br><span class="line">new_species.pl --species=z189</span><br><span class="line">etraining --species=z189 genes.gb.filter.train</span><br><span class="line"></span><br><span class="line"><span class="comment"># Evaluate on the test set</span></span><br><span class="line">augustus --species=z189 genes.gb.filter.test | tee firsttest.out</span><br><span class="line">augustus --species=nematostella_vectensis genes.gb.filter.test | tee firsttest_nvec.out</span><br><span class="line">augustus --species=human genes.gb.filter.test | tee firsttest_human.out</span><br></pre></td></tr></table></figure><p>The test results are as follows:</p><img src="/2024/07/17/genome-anno/image-20240412151218427.png" class title="image-20240412151218427"><p>Results with the pre-trained parameters of the related species nematostella_vectensis:</p><img src="/2024/07/17/genome-anno/image-20240412151807169.png" class title="image-20240412151807169"><p>Results with the human parameters:</p><img 
src="/2024/07/17/genome-anno/image-20240412152306450.png" class title="image-20240412152306450"><p>The comparison shows that the newly trained model outperforms the pre-trained parameters of the related species, but gene-level sensitivity is only 37%, so it was unclear whether to move on to the next step or to optimize the training first.</p><p>Other posts suggest that below 20% you should add more training data and keep optimizing, and that optimization only gains a few percentage points, so this initial training result should be usable for ab initio gene annotation.</p><p>Next comes CRF (conditional random field) training. When running CRF training, back up all parameters in /species/target species/. Compare prediction accuracy before and after CRF training: if it improves, keep the CRF parameters; if it drops, fall back to the result of the previous step.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">etraining --species=z189 --CRF=1 genes.gb.filter.train</span><br><span class="line">augustus --species=z189 genes.gb.filter.test | tee secondtest.out.withCRF</span><br></pre></td></tr></table></figure><p>Run the gene structure prediction to obtain the final GFF file, augustus_z189_gene.gff.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">augustus --species=z189 --gff3=on purged.fa.masked >augustus_z189.gff <span class="comment"># default parameters</span></span><br><span class="line"></span><br><span class="line">perl ~/scripts/augustus_gff3_to_evm_gff3.pl augustus_z189.gff > augustus_z189_gene.gff <span class="comment"># convert the GFF to EVM format with this script; download link: https://github.com/Zhanmengtao/bin/blob/master/augustus_gff3_to_evm_gff3.pl</span></span><br></pre></td></tr></table></figure><h6 id="3-4-5-将以上不同软件的注释结果进行整合"><a href="#3-4-5-将以上不同软件的注释结果进行整合" class="headerlink" title="3.4.5 将以上不同软件的注释结果进行整合"></a>3.4.5 Integrating the annotations from the tools above</h6><p>See <a href="https://www.jianshu.com/p/ca4ac8700e32">EVM 对预测结果进行整合 - 简书 (jianshu.com)</a></p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">wget -c 
https://github.com/EVidenceModeler/EVidenceModeler/archive/refs/heads/master.zip</span><br><span class="line">unzip master.zip</span><br><span class="line">cd EVidenceModeler-master/</span><br><span class="line">EVidenceModeler -h</span><br></pre></td></tr></table></figure><p>Prepare the three sets of prediction results: transcript_alignments.gff3, protein_alignments.gff3 and gene_prediction.gff3.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">perl -p -i -e <span class="string">'s/^#.*//s'</span> gene_prediction.gff3 transcript_alignments.gff3 protein_alignments.gff3</span><br></pre></td></tr></table></figure><p>Edit the weights file for the annotation evidence. Because the TransDecoder GFF is merged with the ab initio predictions, it is listed as OTHER_PREDICTION, as noted in section 3.4.1:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line">vi weights.txt</span><br><span class="line"></span><br><span class="line"><span class="comment"># Contents:</span></span><br><span class="line"></span><br><span class="line">PROTEIN miniprot 2</span><br><span class="line">PROTEIN GAF 1</span><br><span class="line">PROTEIN GeMoMa 1</span><br><span class="line">OTHER_PREDICTION transdecoder 8</span><br><span class="line">ABINITIO_PREDICTION AUGUSTUS 1</span><br></pre></td></tr></table></figure><p>Run the pipeline step by step:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span 
class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Merge the ab initio and transcriptome-based annotations</span></span><br><span class="line">cat transcript_alignments.gff3 gene_prediction.gff3 > gene_prediction_combined.gff3</span><br><span class="line"></span><br><span class="line"><span class="comment"># Create a separate directory for each contig and partition the GFF3 files accordingly</span></span><br><span class="line">perl EVidenceModeler-master/EvmUtils/partition_EVM_inputs.pl --partition_dir ./partition --genome ../purged.fa.masked --gene_predictions gene_prediction_combined.gff3 --protein_alignments protein_alignments.gff3 --segmentSize 100000 --overlapSize 10000 --partition_listing partitions_list.out</span><br><span class="line"></span><br><span class="line"><span class="comment"># Generate the list of EVM commands to run in batch</span></span><br><span class="line">perl EVidenceModeler-master/EvmUtils/write_EVM_commands.pl --partitions partitions_list.out --genome ../purged.fa.masked --gene_predictions gene_prediction_combined.gff3 --protein_alignments protein_alignments.gff3 --output_file_name evm.out --weights `<span class="built_in">pwd</span>`/weights.txt > commands.list</span><br><span class="line"></span><br><span class="line"><span class="comment"># Run EVM</span></span><br><span class="line">perl EVidenceModeler-master/EvmUtils/execute_EVM_commands.pl commands.list | tee evm_run.log</span><br><span class="line"></span><br><span class="line"><span class="comment"># Recombine the partial outputs</span></span><br><span class="line">perl EVidenceModeler-master/EvmUtils/recombine_EVM_partial_outputs.pl --partitions partitions_list.out --output_file_name evm.out</span><br><span class="line"></span><br><span class="line"><span class="comment"># Convert the results to GFF3</span></span><br><span class="line">perl EVidenceModeler-master/EvmUtils/convert_EVM_outputs_to_GFF3.pl --partitions partitions_list.out --output_file_name evm.out --genome ../purged.fa.masked</span><br><span class="line">find . -regex <span class="string">".*evm.out.gff3"</span> -exec cat {} \; | bedtools sort -i - > EVM.all.gff</span><br><span class="line"></span><br><span class="line"><span class="comment"># Filter the GFF file, removing sequences shorter than 50 aa</span></span><br><span class="line">conda install gffread</span><br><span class="line">conda install bioconda::bioawk</span><br><span class="line">gffread EVM.all.gff -g ../purged.fa.masked -y tr_cds.fa</span><br><span class="line">bioawk -c fastx <span class="string">'length($seq) < 50 {print $name}'</span> tr_cds.fa | sed <span class="string">'s/evm.model.//g'</span> > short_aa_gene_list.txt <span class="comment"># the command from another post did not work for this step, so it was modified</span></span><br><span class="line">grep -v -w -f short_aa_gene_list.txt EVM.all.gff > z189.gff</span><br></pre></td></tr></table></figure><p>This yields the preliminary structural annotation file, z189.gff.</p><h6 id="3-4-6-注释文件的优化与调整"><a href="#3-4-6-注释文件的优化与调整" class="headerlink" title="3.4.6 注释文件的优化与调整"></a>3.4.6 Refining and adjusting the annotation file</h6><p>The relevant TBtools plugins can be used for refinement.</p><h6 id="3-5-busco评估基因注释质量"><a href="#3-5-busco评估基因注释质量" class="headerlink" title="3.5 busco评估基因注释质量"></a>3.5 Assessing annotation quality with BUSCO</h6><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span 
class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Using the eukaryota lineage dataset</span></span><br><span class="line">busco -i tr_cds.fa -l /project/tianzhenWu/z189/02genome_assemble/without-hic/fasta_asm/nextpolish/01_rundir/eukaryota_odb10/ -m prot -o busco_out -c 40</span><br><span class="line"></span><br><span class="line"><span class="comment"># Using the metazoa lineage dataset</span></span><br><span class="line">busco -i tr_cds.fa -l /project/tianzhenWu/z189/02genome_assemble/without-hic/fasta_asm/nextpolish/01_rundir/metazoa_odb10/ -m prot -o busco_out_m -c 40</span><br></pre></td></tr></table></figure><p>Results with the eukaryota dataset:</p><img src="/2024/07/17/genome-anno/image-20240419202621349.png" class title="image-20240419202621349"><h6 id="3-6-对基因特征进行可视化,包括基因长度、基因间长度、外显子个数等"><a href="#3-6-对基因特征进行可视化,包括基因长度、基因间长度、外显子个数等" class="headerlink" title="3.6 对基因特征进行可视化,包括基因长度、基因间长度、外显子个数等"></a>3.6 Visualizing gene features, including gene length, intergenic length and exon count</h6><p><a href="https://zhuanlan.zhihu.com/p/614002824">玩转基因组 | 利用R语言统计外显子数量和基因长度并可视化 - 知乎 (zhihu.com)</a></p><p><a href="https://www.jianshu.com/p/323612a6ed0a">统计注释出的基因数目,基因长度,外显子和CDS长度 - 简书 (jianshu.com)</a></p><p>Convert the GFF file to GTF:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">gffread z189.gff -T -o z189.gtf</span><br></pre></td></tr></table></figure><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Total length, count and mean length of gene models</span></span><br><span class="line"></span><br><span class="line">cat 
maker_rnd3.gff | awk <span class="string">'{ if ($3 == "gene") print $0 }'</span> | awk <span class="string">'{ sum += ($5 - $4) } END { print sum, NR, sum / NR }'</span></span><br><span class="line"></span><br><span class="line">308328514 42256 7296.68</span><br><span class="line"></span><br><span class="line"><span class="comment"># Total length, count, and mean length of the exons</span></span><br><span class="line"></span><br><span class="line">cat z189.gff | awk <span class="string">'{ if ($3 == "exon") print $0 }'</span> | awk <span class="string">'{ sum += ($5 - $4) } END { print sum, NR, sum / NR }'</span></span><br><span class="line"></span><br><span class="line">50327093 198915 253.008</span><br></pre></td></tr></table></figure><h4 id="4-基因功能注释"><a href="#4-基因功能注释" class="headerlink" title="4 基因功能注释"></a>4 Functional annotation of genes</h4><p>Commonly used databases include NR, UniProt (Swiss-Prot, TrEMBL), eggNOG, KOG, KEGG, GO, and Pfam. See <a href="https://phantom-aria.github.io/2023/09/19/a.html#1-5-Swissprot%E6%95%B0%E6%8D%AE%E5%BA%93">Genome annotation (6): functional annotation with the online eggNOG-mapper - 我的小破站 (phantom-aria.github.io)</a>.</p><h5 id="4-1-利用NR动物数据库注释"><a href="#4-1-利用NR动物数据库注释" class="headerlink" title="4.1 利用NR动物数据库注释"></a>4.1 Annotation against an animal subset of NR</h5><p>Download the NR database, following <a href="https://www.bioinfo-scrounger.com/archives/207/">Building an NR sub-database and extracting sequences for a given taxon from NR | KeepNotes blog (bioinfo-scrounger.com)</a> and <a href="https://www.jianshu.com/p/f78a98587f8c">2022-09-03 Building an NR animal database, method 2 - Jianshu (jianshu.com)</a>.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">conda install -y -c hcc aspera-cli</span><br><span class="line">conda install -y -c bioconda sra-tools</span><br><span class="line">ascp -v -k 1 -T -l 200m -i /public/home/wutianzhen/miniconda3/envs/download/etc/asperaweb_id_dsa.openssh anonftp@ftp.ncbi.nlm.nih.gov:/blast/db/FASTA/nr.gz ./ <span 
class="comment"># use the absolute path to the key file asperaweb_id_dsa.openssh</span></span><br></pre></td></tr></table></figure><p>Download the taxonomy files, then extract the Cnidaria sub-database (taxid 6073).</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line">conda install -c bioconda taxonkit</span><br><span class="line">conda install -c bioconda csvtk</span><br><span class="line">wget -c ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz</span><br><span class="line">wget -c https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz</span><br><span class="line">tar -zxvf taxdump.tar.gz</span><br><span class="line">cp names.dmp ~/.taxonkit</span><br><span class="line">cp nodes.dmp ~/.taxonkit</span><br><span class="line">taxonkit list --ids 6073 --indent <span class="string">""</span> > cnidaria.taxid.txt</span><br><span class="line">wc -l cnidaria.taxid.txt <span class="comment">## prints: 15570 cnidaria.taxid.txt</span></span><br><span class="line">zcat prot.accession2taxid.gz |csvtk -t grep -f taxid -P cnidaria.taxid.txt |csvtk -t cut -f accession.version > cnidaria.taxid.acc.txt</span><br><span class="line">seqkit grep -f cnidaria.taxid.acc.txt nr -o cnidaria.fas <span class="comment"># 755344 sequences in total</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># Align with DIAMOND</span></span><br><span class="line">diamond makedb --<span class="keyword">in</span> cnidaria.fas -d cnidaria_diamond.dmnd</span><br><span class="line">diamond blastp --db cnidaria_diamond.dmnd 
--query z189.pep.fa --out nr.tab --outfmt 6 --sensitive --max-target-seqs 1 --evalue 1e-5 --id 30 --block-size 20 --index-chunks 1</span><br></pre></td></tr></table></figure><h5 id="4-2-利用eggNOG数据库进行GO、KEGG、Pfam注释"><a href="#4-2-利用eggNOG数据库进行GO、KEGG、Pfam注释" class="headerlink" title="4.2 利用eggNOG数据库进行GO、KEGG、Pfam注释"></a>4.2 GO, KEGG, and Pfam annotation with the eggNOG database</h5><p>Use the web version, <a href="http://eggnog-mapper.embl.de/">eggNOG-mapper (embl.de)</a>. It is simple to operate; set the e-value threshold to 1e-5.</p><h5 id="4-3-利用swissprot数据库进行注释"><a href="#4-3-利用swissprot数据库进行注释" class="headerlink" title="4.3 利用swissprot数据库进行注释"></a>4.3 Annotation against the Swiss-Prot database</h5><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz</span><br><span class="line">gzip -d uniprot_sprot.fasta.gz</span><br><span class="line"></span><br><span class="line">diamond makedb --<span class="keyword">in</span> uniprot_sprot.fasta -d uniprot_diamond.dmnd</span><br><span class="line">diamond blastp --db uniprot_diamond.dmnd --query ../05function_annotation/z189.pep.fa --out swissprot.tab --outfmt 6 --sensitive --max-target-seqs 1 --evalue 1e-5 --id 30 --block-size 20 --index-chunks 1</span><br></pre></td></tr></table></figure><h5 id="4-4-利用KOG数据库进行注释"><a href="#4-4-利用KOG数据库进行注释" class="headerlink" title="4.4 利用KOG数据库进行注释"></a>4.4 Annotation against the KOG database</h5><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">wget https://ftp.ncbi.nih.gov/pub/COG/KOG/fun.txt</span><br><span 
class="line">wget https://ftp.ncbi.nih.gov/pub/COG/KOG/kog</span><br><span class="line">wget https://ftp.ncbi.nih.gov/pub/COG/KOG/kyva</span><br><span class="line">wget https://ftp.ncbi.nih.gov/pub/COG/KOG/twog</span><br><span class="line"></span><br><span class="line">diamond makedb --<span class="keyword">in</span> kyva -d kog_diamond.dmnd</span><br><span class="line">diamond blastp --db kog_diamond.dmnd --query ../05function_annotation/z189.pep.fa --out kog.tab --outfmt 6 --sensitive --max-target-seqs 1 --evalue 1e-5 --id 30 --block-size 20 --index-chunks 1</span><br></pre></td></tr></table></figure><p>Collate the results in Excel.</p><p>To draw a Venn diagram, use <a href="https://jvenn.toulouse.inrae.fr/app/example.html">jvenn (inrae.fr)</a>, which supports up to six sets.</p><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
<tags>
<tag> genome annotation </tag>
</tags>
</entry>
<entry>
<title>Testing for introgression and incomplete lineage sorting with QuIBL</title>
<link href="/2024/04/09/QuIBL%E6%96%B9%E6%B3%95%E6%A3%80%E9%AA%8C%E5%9F%BA%E5%9B%A0%E6%B8%90%E6%B8%97%E5%92%8C%E4%B8%8D%E5%AE%8C%E5%85%A8%E8%B0%B1%E7%B3%BB%E5%88%86%E9%80%89/"/>
<url>/2024/04/09/QuIBL%E6%96%B9%E6%B3%95%E6%A3%80%E9%AA%8C%E5%9F%BA%E5%9B%A0%E6%B8%90%E6%B8%97%E5%92%8C%E4%B8%8D%E5%AE%8C%E5%85%A8%E8%B0%B1%E7%B3%BB%E5%88%86%E9%80%89/</url>
<content type="html"><![CDATA[<p><strong>QuIBL is a method for detecting introgression that was introduced in the 2019 Science study of a butterfly radiation. It is run through the Python script QuIBL.py.</strong></p><h4 id="1、安装"><a href="#1、安装" class="headerlink" title="1、安装"></a>1. Installation</h4><p>Download the package from GitHub: <a href="https://github.com/miriammiyagi/QuIBL">https://github.com/miriammiyagi/QuIBL</a></p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">wget https://github.com/miriammiyagi/QuIBL/archive/refs/heads/master.zip</span><br><span class="line"></span><br><span class="line">unzip master.zip</span><br></pre></td></tr></table></figure><p>The script requires Python 2.7 and the following packages: joblib, ete3, itertools, sys, numpy, math, ConfigParser, csv, and multiprocessing.</p><p>Create a Python 2 environment:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">conda create -n python2.7 python=2.7</span><br><span class="line">conda activate python2.7</span><br></pre></td></tr></table></figure><p>Run the bundled example to find out which dependencies are missing:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">python QuIBL.py ./Small_Test_Example/sampleInputFile.txt</span><br></pre></td></tr></table></figure><p>The errors from repeated runs of the example showed that ete3 and joblib were the problem. Following the author's recommendation, install the specific versions ete3==3.0.0b35 and joblib==0.11 with conda:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">conda install ete3==3.0.0b35</span><br><span class="line">conda install joblib==0.11</span><br></pre></td></tr></table></figure><p>Run the example again; this time it completes successfully:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span 
class="line">python QuIBL.py ./Small_Test_Example/sampleInputFile.txt</span><br></pre></td></tr></table></figure><img src="/2024/04/09/QuIBL%E6%96%B9%E6%B3%95%E6%A3%80%E9%AA%8C%E5%9F%BA%E5%9B%A0%E6%B8%90%E6%B8%97%E5%92%8C%E4%B8%8D%E5%AE%8C%E5%85%A8%E8%B0%B1%E7%B3%BB%E5%88%86%E9%80%89/image-20240408201814957.png" class title="image-20240408201814957"><h4 id="2、输入文件"><a href="#2、输入文件" class="headerlink" title="2、输入文件"></a>2. Input files</h4><p>Running QuIBL is fairly simple; only two files need to be prepared.</p><p>① sampleInputFile.txt, the parameter file. Adapt it from the example configuration shipped with the software. The parameters mean the following:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><span class="line">treefile: The path to the trees to be analyzed.</span><br><span class="line"></span><br><span class="line">numdistributions: The number of branch length distributions in the mixture to test. For now, only two is supported (this corresponds to one ILS and one non-ILS distribution).</span><br><span class="line"></span><br><span class="line">likelihoodthresh: The maximum change in likelihood allowed for the gradient ascent search for theta to stop.</span><br><span class="line"></span><br><span class="line">numsteps: The number of total EM steps. 
For thousands of trees, we recommend trying around 50.</span><br><span class="line"></span><br><span class="line">gradascentscalar: The factor to shrink the stepsize when a gradient ascent step fails.</span><br><span class="line"></span><br><span class="line">totaloutgroup: The name of the ultimate outgroup of your sample. All trees are assumed to be rooted using this taxon.</span><br><span class="line"></span><br><span class="line">multiproc: Accepts `True` or `False` and either turns multiprocessing on or off.</span><br><span class="line"></span><br><span class="line">OutputPath: Where the output gets written.</span><br><span class="line"></span><br><span class="line">maxcores: The maximum number of cores QuIBL is allowed to use.</span><br></pre></td></tr></table></figure><p>② The tree file smallTestTrees.txt, which holds the topology and branch lengths for four species.</p><p>To extract the subtrees in batch from a file containing many gene trees (one tree per line), run the following ete3 script in a Python 3 environment and write its output to an empty file, smallTestTrees.txt. Remember to strip the internal-node labels, which differ between gene trees.</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">from</span> ete3 <span class="keyword">import</span> Tree</span><br><span class="line"></span><br><span class="line"><span class="comment"># Taxa to keep in each subtree</span></span><br><span class="line">subtree_taxa = [<span class="string">"amil"</span>, <span class="string">"aflo"</span>, <span class="string">"aint"</span>, <span class="string">"aawi"</span>]</span><br><span class="line"></span><br><span class="line"><span class="comment"># Open the gene-tree file and read it line by line</span></span><br><span class="line"><span class="keyword">with</span> <span class="built_in">open</span>(<span 
class="string">'alltree.rooted.txt'</span>, <span class="string">'r'</span>) <span class="keyword">as</span> file:</span><br><span class="line">    <span class="keyword">for</span> line <span class="keyword">in</span> file:</span><br><span class="line">        t = Tree(line)</span><br><span class="line">        t.prune(subtree_taxa, preserve_branch_length=<span class="literal">True</span>)</span><br><span class="line">        <span class="built_in">print</span>(t.write())</span><br></pre></td></tr></table></figure><h4 id="3、运行QuIBL-py脚本"><a href="#3、运行QuIBL-py脚本" class="headerlink" title="3、运行QuIBL.py脚本"></a>3. Running the QuIBL.py script</h4><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">python QuIBL.py my_analysis/sampleInputFile.txt</span><br></pre></td></tr></table></figure><h4 id="4、结果解读"><a href="#4、结果解读" class="headerlink" title="4、结果解读"></a>4. Interpreting the results</h4><p>My results were as follows:</p><table><thead><tr><th align="center">triplet</th><th align="center">outgroup</th><th align="center">C1</th><th align="center">C2</th><th align="center">mixprop1</th><th>mixprop2</th><th align="center">lambda2Dist</th><th align="center">lambda1Dist</th><th align="center">BIC2Dist</th><th align="center">BIC1Dist</th><th align="center">count</th></tr></thead><tbody><tr><td align="center">aint_aawi_aflo</td><td align="center">aint</td><td align="center">0</td><td align="center">0.760561129</td><td align="center">0.801436519</td><td>0.198563481</td><td align="center">0.003752992</td><td align="center">0.004153213</td><td align="center">-33700.04628</td><td align="center">-33728.42803</td><td align="center">3762</td></tr><tr><td align="center">aint_aawi_aflo</td><td align="center">aawi</td><td align="center">0</td><td align="center">10.96069501</td><td align="center">0.975929055</td><td>0.024070945</td><td align="center">0.00378819</td><td align="center">0.004690472</td><td align="center">-20079.01796</td><td align="center">-19744.4163</td><td 
align="center">2264</td></tr><tr><td align="center">aint_aawi_aflo</td><td align="center">aflo</td><td align="center">0</td><td align="center">10.88296285</td><td align="center">0.975452203</td><td>0.024547797</td><td align="center">0.0035074</td><td align="center">0.004364335</td><td align="center">-15592.90708</td><td align="center">-15335.1854</td><td align="center">1730</td></tr></tbody></table><p>Suppose the species-tree topology is ((aawi, aflo), aint). To assess introgression between aint and aflo, look at the second row, where mixprop2 gives the proportion attributed to introgression. If BIC2Dist minus BIC1Dist is less than -10, the model that includes introgression is supported. Likewise, for introgression between aint and aawi, look at the third row.</p><table><thead><tr><th>Explanation from the paper's supplementary material:</th></tr></thead><tbody><tr><td>triplet: The three-taxon subset considered. Species abbreviations separated by underscores. Outgroup: Species inferred to be the outgroup in the triplet gene tree topology tested.</td></tr><tr><td>Cx: Inferred species tree branch length for (1) the ILS case and (2) the non-ILS case. The ILS case is forced to be 0, as all lineages must be in the same population.</td></tr><tr><td>Topology proportions: Inferred mixture proportion for the ILS and non-ILS distributions. These values sum to 1.</td></tr><tr><td>numTrees: Frequency of the topology in the sample. BICx: Raw BIC values for each model. dBIC: difference in BIC value between the models. <strong>dBIC < -10 implies that the ILS+introgression model is a better fit for the data.</strong></td></tr><tr><td>total non-ILS: topology non-ILS proportion * (numTrees/total trees in sample). 
This value represents the genome-wide introgression fraction.</td></tr></tbody></table><h4 id="5、穷尽运行物种树的所有三物种组合,批量运行QuIBL-py"><a href="#5、穷尽运行物种树的所有三物种组合,批量运行QuIBL-py" class="headerlink" title="5、穷尽运行物种树的所有三物种组合,批量运行QuIBL.py"></a>5. Running QuIBL.py in batch over every three-species combination in the species tree</h4><p>① Enumerate every three-species combination from the species tree. Running the script 01_comb_4spes.py below produces out_four_species_array.txt, which lists each triplet together with the outgroup (four taxa per line).</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># coding=utf-8</span></span><br><span class="line"><span class="keyword">from</span> itertools <span class="keyword">import</span> combinations</span><br><span class="line"><span class="keyword">from</span> ete3 <span class="keyword">import</span> Tree</span><br><span class="line"><span class="comment"># The ingroup taxa</span></span><br><span class="line">elements = [<span class="string">'aawi'</span>, <span class="string">'acer'</span>, <span class="string">'acyt'</span>, <span class="string">'adig'</span>, <span class="string">'aflo'</span>, <span class="string">'ahya'</span>, <span class="string">'aint'</span>, <span class="string">'amic'</span>, <span class="string">'amil'</span>, <span class="string">'amur'</span>, <span class="string">'anas'</span>, <span class="string">'apal'</span>, <span class="string">'asel'</span>, <span 
class="string">'aten'</span>, <span class="string">'ayon'</span>]</span><br><span class="line"></span><br><span class="line"><span class="comment"># All combinations of three taxa</span></span><br><span class="line">combs_3 = <span class="built_in">list</span>(combinations(elements, <span class="number">3</span>))</span><br><span class="line"></span><br><span class="line"><span class="comment"># Append the outgroup 'mcap' to each triplet to form a four-taxon set</span></span><br><span class="line">combs_4 = [(comb[<span class="number">0</span>], comb[<span class="number">1</span>], comb[<span class="number">2</span>], <span class="string">'mcap'</span>) <span class="keyword">for</span> comb <span class="keyword">in</span> combs_3]</span><br><span class="line"></span><br><span class="line"><span class="comment"># Write one combination per line</span></span><br><span class="line"><span class="keyword">with</span> <span class="built_in">open</span>(<span class="string">'out_four_species_array.txt'</span>, <span class="string">'w'</span>) <span class="keyword">as</span> file:</span><br><span class="line">    <span class="keyword">for</span> comb <span class="keyword">in</span> combs_4:</span><br><span class="line">        file.write(<span class="string">' '</span>.join(comb) + <span class="string">'\n'</span>)</span><br></pre></td></tr></table></figure><p>② For each four-taxon set in out_four_species_array.txt, extract the corresponding subtree, with branch lengths, from the file of rooted gene trees. Running the script 02_extract_subtree_for_4spes.py below writes one subtree file per combination, named "out_subtree_" + str(b) + ".txt", where b counts up from 1.</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span 
class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># coding=utf-8</span></span><br><span class="line"><span class="keyword">from</span> ete3 <span class="keyword">import</span> Tree</span><br><span class="line"></span><br><span class="line">file_path = <span class="string">'out_four_species_array.txt'</span></span><br><span class="line">b = <span class="number">1</span></span><br><span class="line"><span class="keyword">with</span> <span class="built_in">open</span>(file_path, <span class="string">'r'</span>) <span class="keyword">as</span> file:</span><br><span class="line">    <span class="comment"># Read the combinations file line by line</span></span><br><span class="line">    <span class="keyword">for</span> line <span class="keyword">in</span> file:</span><br><span class="line">        <span class="comment"># Split the line on spaces to get the four taxon names</span></span><br><span class="line">        subtree_taxa = 
line.strip().split(<span class="string">' '</span>)</span><br><span class="line">        <span class="comment"># Uncomment to print the taxa parsed from this line</span></span><br><span class="line">        <span class="comment">#print(subtree_taxa)</span></span><br><span class="line"></span><br><span class="line">        <span class="comment"># Example of a hard-coded subtree_taxa</span></span><br><span class="line">        <span class="comment">#subtree_taxa = ["amil", "aflo", "aint", "aawi"]</span></span><br><span class="line"></span><br><span class="line">        result_array = []</span><br><span class="line">        <span class="comment"># Open the gene-tree file and prune every tree to the four taxa</span></span><br><span class="line">        <span class="keyword">with</span> <span class="built_in">open</span>(<span class="string">'alltree.rooted.txt'</span>, <span class="string">'r'</span>) <span class="keyword">as</span> tree_file:</span><br><span class="line">            <span class="keyword">for</span> tree_line <span class="keyword">in</span> tree_file:</span><br><span class="line">                t = Tree(tree_line)</span><br><span class="line">                t.prune(subtree_taxa, preserve_branch_length=<span class="literal">True</span>)</span><br><span class="line">                a = t.write()</span><br><span class="line">                result_array.append(a)</span><br><span class="line"></span><br><span class="line">        filename = <span class="string">"out_subtree_"</span>+<span class="built_in">str</span>(b)+<span class="string">".txt"</span></span><br><span class="line">        <span class="keyword">with</span> <span class="built_in">open</span>(filename, <span class="string">'w'</span>) <span 
class="keyword">as</span> file:</span><br><span class="line">            <span class="keyword">for</span> element <span class="keyword">in</span> result_array:</span><br><span class="line">                file.write(element + <span class="string">'\n'</span>)</span><br><span class="line">        b+=<span class="number">1</span></span><br></pre></td></tr></table></figure><p>This produces one gene-tree file per four-taxon set. Post-process them to strip the labels on the ancestral (internal) nodes:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">sed -i <span class="string">'s/)\([0-9]*\):/):/g'</span> out_subtree_*</span><br></pre></td></tr></table></figure><p>③ Generate a sampleInputFile.txt-style configuration file for each tree file. Adjust the paths in the shell command below to your own layout.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">for</span> file <span class="keyword">in</span> out_subtree*; <span class="keyword">do</span> <span class="built_in">echo</span> -e <span class="string">"[Input]\ntreefile: /project/tianzhenWu/acropora_06four_cluster_gene/03_building_tree/4species/7756trees/<span class="variable">$file</span>\nnumdistributions: 2\nlikelihoodthresh: 0.01\nnumsteps: 50\ngradascentscalar: 0.5\ntotaloutgroup: mcap\nmultiproc: True\nmaxcores:70\n\n[Output]\nOutputPath: /project/tianzhenWu/acropora_06four_cluster_gene/03_building_tree/4species/QuIBL-master/my_455_annalysis/out<span class="variable">$file</span>.csv\n"</span> > /project/tianzhenWu/acropora_06four_cluster_gene/03_building_tree/4species/QuIBL-master/my_455_annalysis/run<span class="variable">$file</span>.txt;<span class="keyword">done</span></span><br></pre></td></tr></table></figure><p>④ Run the jobs in batch</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span 
class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Build the batch script 03_run.sh, one QuIBL command per configuration file</span></span><br><span class="line"><span class="keyword">for</span> file <span class="keyword">in</span> runout_subtree*;<span class="keyword">do</span> <span class="built_in">echo</span> -e <span class="string">"python QuIBL.py /project/tianzhenWu/acropora_06four_cluster_gene/03_building_tree/4species/QuIBL-master/my_455_annalysis/<span class="variable">$file</span>"</span>;<span class="keyword">done</span> > 03_run.sh</span><br><span class="line"></span><br><span class="line"><span class="comment"># Run the commands in parallel with ParaFly, which can be installed with conda in either a Python 2 or a Python 3 environment</span></span><br><span class="line">ParaFly -c ./my_455_annalysis/03_run.sh -CPU 40</span><br></pre></td></tr></table></figure><p>⑤ Collect the results</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">for</span> file <span class="keyword">in</span> *.csv; <span class="keyword">do</span> sed -n <span class="string">'2,4p'</span> <span class="string">"<span class="variable">$file</span>"</span> >> results_qulbl.txt; <span class="keyword">done</span></span><br></pre></td></tr></table></figure><p><strong>Reference: <a href="https://github.com/miriammiyagi/QuIBL">miriammiyagi/QuIBL: Quantifying Introgression via Branch Lengths (github.com)</a></strong></p><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
<categories>
<category> Reticulate evolution </category>
</categories>
<tags>
<tag> QuIBL </tag>
</tags>
</entry>
<entry>
<title>Visualizing conflict between the coalescent species tree and gene trees</title>
<link href="/2024/01/20/%E6%BA%AF%E7%A5%96%E6%A0%91%E4%B8%8E%E5%9F%BA%E5%9B%A0%E6%A0%91%E4%B9%8B%E9%97%B4%E5%86%B2%E7%AA%81%E7%9A%84%E5%8F%AF%E8%A7%86%E5%8C%96/"/>
<url>/2024/01/20/%E6%BA%AF%E7%A5%96%E6%A0%91%E4%B8%8E%E5%9F%BA%E5%9B%A0%E6%A0%91%E4%B9%8B%E9%97%B4%E5%86%B2%E7%AA%81%E7%9A%84%E5%8F%AF%E8%A7%86%E5%8C%96/</url>
<content type="html"><![CDATA[<h1 id="溯祖树与基因树之间冲突的可视化"><a href="#溯祖树与基因树之间冲突的可视化" class="headerlink" title="溯祖树与基因树之间冲突的可视化"></a><strong>Visualizing conflict between the coalescent species tree and gene trees</strong></h1><p>This mainly uses the PhyParts software and the phypartspiecharts.py script.</p><p>Reference pages:</p><p><a href="https://bitbucket.org/blackrim/phyparts/src/master/README.md">blackrim / phyparts / README.md — Bitbucket</a></p><p><a href="https://github.com/mossmatters/phyloscripts/tree/master/phypartspiecharts">phyloscripts/phypartspiecharts at master · mossmatters/phyloscripts · GitHub</a></p><p><a href="https://hackmd.io/@mossmatters/ry9PP6_2u#IQTree">Abronia HybSeq Phylogeny - HackMD</a></p><p>The English pages above explain everything very clearly; this note is only a starting point that makes them easier to find.</p><h3 id="一、phyparts的安装(首先保证运行环境有maven,否则无法安装)"><a href="#一、phyparts的安装(首先保证运行环境有maven,否则无法安装)" class="headerlink" title="一、phyparts的安装(首先保证运行环境有maven,否则无法安装)"></a><strong>I. Installing PhyParts (Maven must be available in the environment first, or the build fails)</strong></h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line">git <span class="built_in">clone</span> https://bitbucket.org/blackrim/phyparts.git</span><br><span class="line"><span class="built_in">cd</span> phyparts</span><br><span class="line">sh mvn_cmdline.sh <span class="comment"># the build fails</span></span><br><span class="line"></span><br><span class="line"><span class="comment">## A dependency is missing: install Maven</span></span><br><span class="line"></span><br><span class="line">conda install conda-forge::maven</span><br><span class="line"></span><br><span class="line">sh mvn_cmdline.sh <span class="comment"># the build succeeds</span></span><br><span class="line"></span><br><span class="line">java -jar target/phyparts-0.0.1-SNAPSHOT-jar-with-dependencies.jar <span class="comment"># no error means the installation worked</span></span><br></pre></td></tr></table></figure><h3 id="二、phyparts使用"><a 
href="#二、phyparts使用" class="headerlink" title="二、phyparts使用"></a><strong>II. Using PhyParts</strong></h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">java -jar /project/tianzhenWu/software/phyparts/phyparts/target/phyparts-0.0.1-SNAPSHOT-jar-with-dependencies.jar -a 1 -v -d rooted_7756_gene_tree/ -m rooted_astral_tree/astral.tree</span><br><span class="line"></span><br><span class="line"><span class="comment"># Error: Tree is invalid: missing concluding semicolon. Exiting.</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># After troubleshooting, the cause turned out to be that the species tree was not rooted</span></span><br></pre></td></tr></table></figure><h3 id="三、可视化"><a href="#三、可视化" class="headerlink" title="三、可视化"></a>III. Visualization</h3><p>Download the phypartspiecharts.py script from <a href="https://github.com/mossmatters/phyloscripts/blob/master/phypartspiecharts/phypartspiecharts.py">phyloscripts/phypartspiecharts/phypartspiecharts.py at master · mossmatters/phyloscripts · GitHub</a></p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">python phypartspiecharts.py rooted_astral_tree/astral.tree out 7756</span><br><span class="line"></span><br><span class="line"><span class="comment"># Error: raise NewickError("Unexpected newick format '%s' " %subnw[0:50])</span></span><br><span class="line"><span class="comment"># ete3.parser.newick.NewickError: Unexpected newick format '[&label=1]:2.800269'</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># Cause: the species tree contains [&label=1], which must be removed. Also make sure the current environment is Python 3 and that ete3 is available</span></span><br></pre></td></tr></table></figure><p>Adding further options:</p><figure class="highlight plaintext"><table><tr><td 
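class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">python phypartspiecharts.py rooted_astral_tree/astral.tree_for_keshihua out 7756 --to_csv --no_ladderize --svg_name pies2.svg</span><br><span class="line"></span><br><span class="line">The --colors COLORS option, tried with [#9DDOC7 #8AB1D2 #E58579], had no effect. Possible causes, not verified here: an unquoted # starts a comment in the shell, and #9DDOC7 contains the letter O rather than a zero.</span><br><span class="line"></span><br><span class="line">The data in the csv file can later be used to redraw the pie charts in another tool, or the svg can be printed to pdf and the colors adjusted by hand in Adobe Illustrator.</span><br></pre></td></tr></table></figure><p>The resulting svg figure is the one shown as the cover image. In the default color scheme of the pie charts on the tree nodes, blue is the proportion of gene trees that support the node; green is the proportion of gene trees that share the most common topology among those conflicting with the node; red is the proportion of gene trees with any other conflicting topology; and gray is the proportion of uninformative gene trees.</p><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>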
<categories>
<category> Reticulate evolution </category>
</categories>
<tags>
<tag> PhyParts </tag>
</tags>
</entry>
<entry>
<title>Using the phylogenetic network inference tools PhyloNetworks and PhyloNet, and the errors encountered</title>
<link href="/2023/10/07/phylonetwork/"/>
<url>/2023/10/07/phylonetwork/</url>
<content type="html"><![CDATA[<h5 id="一、系统发育网络推断软件PhyloNetworks的使用流程以及解决报错"><a href="#一、系统发育网络推断软件PhyloNetworks的使用流程以及解决报错" class="headerlink" title="一、系统发育网络推断软件PhyloNetworks的使用流程以及解决报错"></a>I. PhyloNetworks workflow for phylogenetic network inference, and fixes for the errors encountered</h5><p>For full documentation see the official site <a href="http://crsl4.github.io/PhyloNetworks.jl/latest/">Home · PhyloNetworks.jl (crsl4.github.io)</a>. For a Chinese walkthrough (introduction, installation, usage) see <a href="https://yanzhongsino.github.io/2022/04/14/bioinfo_geneflow_PhyloNetworks/">系统发育网络推断 —— PhyloNetworks | 生信技工 (yanzhongsino.github.io)</a>. This note follows those two sources, and the steps were tested to run without errors.</p><h5 id="1、安装"><a href="#1、安装" class="headerlink" title="1、安装"></a>1. Installation</h5><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">wget https://julialang-s3.julialang.org/bin/linux/x64/1.7/julia-1.7.2-linux-x86_64.tar.gz <span class="comment"># download</span></span><br><span class="line">tar -xzf julia-1.7.2-linux-x86_64.tar.gz <span class="comment"># unpack</span></span><br><span class="line">julia-1.7.2/bin/julia -h <span class="comment"># no error means Julia is installed</span></span><br><span class="line"></span><br><span class="line">julia-1.7.2/bin/julia <span class="comment"># start the Julia REPL, similar to the interactive mode of Python or R</span></span><br><span class="line">julia> using Pkg <span class="comment"># loads the package manager, like library() in R or import in Python</span></span><br><span class="line">julia> Pkg.add(<span class="string">"PhyloNetworks"</span>) <span class="comment"># install PhyloNetworks</span></span><br><span class="line">julia> Pkg.add(<span class="string">"PhyloPlots"</span>)</span><br></pre></td></tr></table></figure><p><strong>The following error appeared during installation</strong></p><figure class="highlight julia"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td 
class="code"><pre><span class="line">ERROR: Unable to automatically install 'Bzip2' from '/home/manu/.julia/packages/Bzip2_jll/<span class="number">2</span>H8pU/Artifacts.toml</span><br></pre></td></tr></table></figure><p>Following the GitHub issue <a href="https://github.com/JuliaLang/Pkg.jl/issues/1705">ERROR: Unable to automatically install Artifacts.toml issue? · Issue #1705 · JuliaLang/Pkg.jl (github.com)</a>, the workaround below resolved the error in my case. Enter these commands in the Julia REPL:</p><figure class="highlight julia"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">using</span> Pkg</span><br><span class="line"></span><br><span class="line">Pkg.PlatformEngines.probe_platform_engines!()</span><br><span class="line"></span><br><span class="line">Pkg.PlatformEngines.download(<span class="string">"https://github.com/JuliaBinaryWrappers/MKL_jll.jl/releases/download/MKL-v2020.0.166%2B0/MKL.v2020.0.166.x86_64-apple-darwin14.tar.gz"</span>, <span class="string">"MKL_jll.tar.gz"</span>; verbose=<span class="literal">true</span>)</span><br></pre></td></tr></table></figure><p>Running Pkg.add("PhyloNetworks") again then succeeded.</p><h5 id="2、软件使用"><a href="#2、软件使用" class="headerlink" title="2、软件使用"></a>2. Usage</h5><p>① Prepare two files. The first is the multi-gene-tree file alltree.rooted.txt, which is used to compute the table of concordance factors (CF); the CF table is the input for network inference. The second is astral.tre, the tree obtained by summarizing the gene trees with ASTRAL, which serves as the starting point for the network search. The following steps are run in the Julia REPL.</p><p>Build the CF table tableCF.csv from the gene-tree file:</p><figure class="highlight julia"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">using</span> PhyloNetworks</span><br><span class="line"><span class="keyword">using</span> CSV</span><br><span 
class="line">iqtrees=joinpath(<span class="string">"alltree.rooted.txt"</span>) <span class="comment"># path to the multi-gene-tree file</span></span><br><span class="line">genetrees = readMultiTopology(iqtrees) <span class="comment"># parse the gene trees</span></span><br><span class="line">q,t = countquartetsintrees(genetrees) <span class="comment"># count quartets across the gene trees to get the CFs of every four-taxon set</span></span><br><span class="line">df = writeTableCF(q,t) <span class="comment"># collect the CF values into the data frame df</span></span><br><span class="line">CSV.write(<span class="string">"tableCF.csv"</span>, df) <span class="comment"># save df as tableCF.csv</span></span><br></pre></td></tr></table></figure><p>② Build the starting network</p><figure class="highlight julia"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">using</span> PhyloNetworks </span><br><span class="line">astralfile= joinpath(<span class="string">"astral.tre"</span>) <span class="comment">## path to the ASTRAL tree summarized from the gene trees</span></span><br><span class="line">astraltree = readMultiTopology(astralfile)[<span class="number">1</span>] <span class="comment"># take the first tree in the file</span></span><br><span class="line">CF = readTableCF(<span class="string">"tableCF.csv"</span>) <span class="comment"># read the CF table</span></span><br><span class="line">net0 = snaq!(astraltree,CF, hmax=<span class="number">0</span>, filename=<span class="string">"net0"</span>, seed=<span class="number">1234</span>) <span class="comment"># run the estimation. We first impose the constraint of at most 0 hybrid node, that is, we ask for a tree</span></span><br></pre></td></tr></table></figure><p><strong>The first attempt at running net0 = snaq!(astraltree,CF, hmax=0, filename="net0", seed=1234) raised the following error</strong></p><figure class="highlight julia"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">:MethodError: no method matching 
snaq!(::<span class="built_in">Vector</span>{HybridNetwork}, ::DataCF; hmax=<span class="number">0</span>, filename=<span class="string">"net0"</span>, seed=<span class="number">1234</span>)</span><br></pre></td></tr></table></figure><p>The cause was the command used to read the ASTRAL tree. Written as astraltree = readMultiTopology(astralfile), without the trailing [1], it returns a Vector{HybridNetwork} (as the error message shows), whereas snaq! expects a single network. With astraltree = readMultiTopology(astralfile)[1] the call runs normally.</p><p>③ Use the resulting net0 as the starting point to infer a network with hmax=1</p><figure class="highlight julia"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">net1 = snaq!(net0, CF, hmax=<span class="number">1</span>, filename=<span class="string">"net1"</span>, seed=<span class="number">1235</span>) </span><br></pre></td></tr></table></figure><p>④ Iterate to obtain net2, net3, and so on</p><figure class="highlight julia"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">net2 = snaq!(net1, CF, hmax=<span class="number">2</span>, filename=<span class="string">"net2"</span>, seed=<span class="number">1236</span>)</span><br><span class="line">net3 = snaq!(net2, CF, hmax=<span class="number">3</span>, filename=<span class="string">"net3"</span>, seed=<span class="number">1237</span>)</span><br><span class="line">...... 
</span><br><span class="line"><span class="comment"># this step-by-step approach is practical when the best hmax is below 5</span></span><br></pre></td></tr></table></figure><p>Alternatively, when many h values need to be run, scripts can be used. The scripts below do not iterate: the run for every h value starts from the same starting tree, astral.tre.</p><p>For h = 1, name the file runSNaQ_h1.jl, with the following content:</p><figure class="highlight julia"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">#!/usr/bin/env julia</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># file "runSNaQ.jl". 
run in the shell like this in general:</span></span><br><span class="line"><span class="comment"># julia runSNaQ.jl hvalue nruns</span></span><br><span class="line"><span class="comment"># example for h=2 and default 10 runs:</span></span><br><span class="line"><span class="comment"># julia runSNaQ.jl 2</span></span><br><span class="line"><span class="comment"># or example for h=3 and 50 runs:</span></span><br><span class="line"><span class="comment"># julia runSNaQ.jl 3 50</span></span><br><span class="line"></span><br><span class="line">length(<span class="literal">ARGS</span>) > <span class="number">0</span> ||</span><br><span class="line"> error(<span class="string">"need 1 or 2 arguments: # reticulations (h) and # runs (optional, 10 by default)"</span>)</span><br><span class="line">h = parse(<span class="built_in">Int</span>, <span class="literal">ARGS</span>[<span class="number">1</span>])</span><br><span class="line">nruns = <span class="number">10</span></span><br><span class="line"><span class="keyword">if</span> length(<span class="literal">ARGS</span>) > <span class="number">1</span></span><br><span class="line"> nruns = parse(<span class="built_in">Int</span>, <span class="literal">ARGS</span>[<span class="number">2</span>])</span><br><span class="line"><span class="keyword">end</span></span><br><span class="line">outputfile = string(<span class="string">"net"</span>, h, <span class="string">"_"</span>, nruns, <span class="string">"runs"</span>) <span class="comment"># example: "net2_10runs"</span></span><br><span class="line">seed = <span class="number">1234</span> + h <span class="comment"># change as desired! 
Best to have it different for different h</span></span><br><span class="line"><span class="meta">@info</span> <span class="string">"will run SNaQ with h=<span class="variable">$h</span>, # of runs=<span class="variable">$nruns</span>, seed=<span class="variable">$seed</span>, output will go to: <span class="variable">$outputfile</span>"</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">using</span> Distributed</span><br><span class="line">addprocs(nruns)</span><br><span class="line"><span class="meta">@everywhere</span> <span class="keyword">using</span> PhyloNetworks</span><br><span class="line">net0_h1 = readTopology(<span class="string">"astral.tre"</span>); <span class="comment"># read the starting tree; the variable is named net0_h1 for h=1 so that runs stay distinguishable when several scripts run in parallel</span></span><br><span class="line"><span class="keyword">using</span> DataFrames, CSV</span><br><span class="line">df_sp = DataFrame(CSV.File(<span class="string">"tableCF.csv"</span>, pool=<span class="literal">false</span>); copycols=<span class="literal">false</span>); <span class="comment"># read the CF table</span></span><br><span class="line">d_sp = readTableCF!(df_sp);</span><br><span class="line">net_h1 = snaq!(net0_h1, d_sp, hmax=h, filename=outputfile, seed=seed, runs=nruns) </span><br></pre></td></tr></table></figure><p>For h = 2, name the file runSNaQ_h2.jl, with the following content:</p><figure class="highlight julia"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span 
class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">#!/usr/bin/env julia</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># file "runSNaQ.jl". run in the shell like this in general:</span></span><br><span class="line"><span class="comment"># julia runSNaQ.jl hvalue nruns</span></span><br><span class="line"><span class="comment"># example for h=2 and default 10 runs:</span></span><br><span class="line"><span class="comment"># julia runSNaQ.jl 2</span></span><br><span class="line"><span class="comment"># or example for h=3 and 50 runs:</span></span><br><span class="line"><span class="comment"># julia runSNaQ.jl 3 50</span></span><br><span class="line"></span><br><span class="line">length(<span class="literal">ARGS</span>) > <span class="number">0</span> ||</span><br><span class="line"> error(<span class="string">"need 1 or 2 arguments: # reticulations (h) and # runs (optional, 10 by default)"</span>)</span><br><span class="line">h = parse(<span class="built_in">Int</span>, <span class="literal">ARGS</span>[<span class="number">1</span>])</span><br><span class="line">nruns = <span class="number">10</span></span><br><span class="line"><span class="keyword">if</span> length(<span class="literal">ARGS</span>) > <span class="number">1</span></span><br><span class="line"> nruns = parse(<span class="built_in">Int</span>, <span class="literal">ARGS</span>[<span class="number">2</span>])</span><br><span class="line"><span class="keyword">end</span></span><br><span class="line">outputfile = string(<span class="string">"net"</span>, h, <span class="string">"_"</span>, nruns, <span class="string">"runs"</span>) 
<span class="comment"># example: "net2_10runs"</span></span><br><span class="line">seed = <span class="number">1234</span> + h <span class="comment"># change as desired! Best to have it different for different h</span></span><br><span class="line"><span class="meta">@info</span> <span class="string">"will run SNaQ with h=<span class="variable">$h</span>, # of runs=<span class="variable">$nruns</span>, seed=<span class="variable">$seed</span>, output will go to: <span class="variable">$outputfile</span>"</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">using</span> Distributed</span><br><span class="line">addprocs(nruns)</span><br><span class="line"><span class="meta">@everywhere</span> <span class="keyword">using</span> PhyloNetworks</span><br><span class="line">net0_h2 = readTopology(<span class="string">"astral.tre"</span>); <span class="comment"># rename net0_h2 for each h value</span></span><br><span class="line"><span class="keyword">using</span> DataFrames, CSV</span><br><span class="line">df_sp = DataFrame(CSV.File(<span class="string">"tableCF.csv"</span>, pool=<span class="literal">false</span>); copycols=<span class="literal">false</span>);</span><br><span class="line">d_sp = readTableCF!(df_sp);</span><br><span class="line">net_h2 = snaq!(net0_h2, d_sp, hmax=h, filename=outputfile, seed=seed, runs=nruns) <span class="comment"># rename net0_h2 (and net_h2) for each h value</span></span><br></pre></td></tr></table></figure><p>Create the jl files for h = 3, 4, 5, 6, ... in the same way.</p><p>Write the run commands into a file run_julia.sh in the same directory:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span 
class="line">julia runSNaQ_h1.jl 1 # the number is the h value</span><br><span class="line">julia runSNaQ_h2.jl 2</span><br><span class="line">julia runSNaQ_h3.jl 3</span><br><span class="line">julia runSNaQ_h4.jl 4</span><br><span class="line">julia runSNaQ_h5.jl 5</span><br><span class="line">julia runSNaQ_h6.jl 6</span><br><span class="line">julia runSNaQ_h7.jl 7</span><br><span class="line">julia runSNaQ_h8.jl 8</span><br><span class="line">julia runSNaQ_h9.jl 9</span><br><span class="line">julia runSNaQ_h10.jl 10</span><br><span class="line">julia runSNaQ_h11.jl 11</span><br><span class="line">julia runSNaQ_h12.jl 12</span><br></pre></td></tr></table></figure><p>Run them in the background with nohup, or with screen combined with ParaFly:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">screen -S julia</span><br><span class="line">ParaFly -c run_julia.sh -CPU 12</span><br></pre></td></tr></table></figure><h5 id="3、结果解读"><a href="#3、结果解读" class="headerlink" title="3、结果解读"></a>3. Interpreting the results</h5><p>① Choosing the best phylogenetic network</p><p><strong>Compare the -loglik values across the hmax values; lower is better</strong>. Because the score generally keeps decreasing as h grows, tabulate the -loglik of each h (a line chart in Excel works well) and pick the h beyond which the improvement levels off.</p><p>② Visualization, assuming the result for h = 3 is the best network</p><figure class="highlight julia"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">using</span> PhyloNetworks</span><br><span class="line"><span class="keyword">using</span> PhyloPlots</span><br><span class="line"><span class="keyword">using</span> RCall</span><br><span class="line"></span><br><span class="line">net = readTopology(<span 
class="string">"net3_10runs.out"</span>) <span class="comment"># store the best network in the variable net</span></span><br><span class="line">writeTopology(net, <span class="string">"bestnet_h3.tre"</span>) <span class="comment"># write the best network to bestnet_h3.tre</span></span><br><span class="line">rootatnode!(net,<span class="string">"mcap"</span>) <span class="comment"># root the network at the outgroup</span></span><br><span class="line">imagefilename = <span class="string">"snaqplot_net_root.svg"</span> <span class="comment"># output file name</span></span><br><span class="line"><span class="string">R"svg"</span>(imagefilename, width=<span class="number">4</span>, height=<span class="number">3</span>) <span class="comment"># open an SVG graphics device in R</span></span><br><span class="line"><span class="string">R"par"</span>(mar=[<span class="number">0</span>,<span class="number">0</span>,<span class="number">0</span>,<span class="number">0</span>])</span><br><span class="line">plot(net, showgamma=<span class="literal">true</span>, showedgenumber=<span class="literal">true</span>);</span><br><span class="line"><span class="string">R"dev.off()"</span>; <span class="comment"># close the device, which writes the svg file</span></span><br></pre></td></tr></table></figure><h5 id="二、系统发育网络推断软件PhyloNet的使用(2024-05-29更新)"><a href="#二、系统发育网络推断软件PhyloNet的使用(2024-05-29更新)" class="headerlink" title="二、系统发育网络推断软件PhyloNet的使用(2024.05.29更新)"></a>II. Using the phylogenetic network inference tool PhyloNet (updated 2024.05.29)</h5><p>Official site: <a href="https://phylogenomics.rice.edu/html/phylonetTutorial.html">PhyloNet Tutorial (rice.edu)</a></p><p>For detailed usage see <a href="https://wiki.rice.edu/confluence/pages/viewpage.action?pageId=39500205#PhylonetTutorial(SSB2020)-3.BasicUsage">Phylonet Tutorial (SSB 2020) - Phylonet - Rice University Campus Wiki</a></p><h5 id="1、下载安装:下载jar文件后,若系统java版本大于等于1-8-0就可以使用"><a href="#1、下载安装:下载jar文件后,若系统java版本大于等于1-8-0就可以使用" class="headerlink" title="1、下载安装:下载jar文件后,若系统java版本大于等于1.8.0就可以使用"></a>1. Download and installation: once the jar file is downloaded, it runs as long as the system Java version is 1.8.0 or later</h5><h5 id="2、PhyloNet软件使用"><a href="#2、PhyloNet软件使用" class="headerlink" 
title="2、PhyloNet软件使用"></a>2. Running PhyloNet</h5><p>A single NEXUS file is all that is needed: script.nex holds both the gene trees and the commands to run. For example:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">#NEXUS</span></span><br><span class="line"></span><br><span class="line">BEGIN TREES;</span><br><span class="line"></span><br><span class="line">Tree gt0=(a:2.299,((e:1.329,f:1.329):0.77,(c:1.684,(b:1.232,d:1.232):0.451):0.416):0.2);</span><br><span class="line">Tree gt1=(a:2.085,((e:1.376,f:1.376):0.696,(d:1.487,(b:1.241,c:1.241):0.246):0.585):0.013);</span><br><span class="line">Tree gt2=(((c:0.52,d:0.52):1.243,(e:1.403,f:1.403):0.36):1.025,(a:1.863,b:1.863):0.925);</span><br><span class="line">Tree gt3=(a:2.82,((b:1.051,(c:0.86,d:0.86):0.19):1.357,(e:1.365,f:1.365):1.043):0.412);</span><br><span class="line">Tree 
gt4=((e:1.3,f:1.3):0.994,(a:1.869,(b:1.255,(c:0.849,d:0.849):0.405):0.615):0.425);</span><br><span class="line">Tree gt5=(a:2.46,((b:1.077,c:1.077):0.857,(f:1.141,(d:0.505,e:0.505):0.636):0.793):0.526);</span><br><span class="line">Tree gt6=(a:2.025,((b:1.111,c:1.111):0.416,(f:1.304,(d:0.727,e:0.727):0.577):0.223):0.498);</span><br><span class="line">Tree gt7=((d:1.526,(e:1.415,f:1.415):0.111):0.982,(a:2.188,(b:1.532,c:1.532):0.656):0.32);</span><br><span class="line">Tree gt8=(a:2.234,((b:1.057,c:1.057):0.766,(f:1.301,(d:0.849,e:0.849):0.452):0.522):0.411);</span><br><span class="line">Tree gt9=((e:1.644,(b:1.361,(c:0.503,d:0.503):0.858):0.283):0.787,(a:2.226,f:2.226):0.205);</span><br><span class="line">Tree gt10=(a:2.917,((e:1.683,(c:0.961,d:0.961):0.722):0.886,(b:1.779,f:1.779):0.79):0.348);</span><br><span class="line">Tree gt11=(a:2.391,((b:1.041,(c:0.602,d:0.602):0.439):0.516,(e:1.164,f:1.164):0.393):0.834);</span><br><span class="line">Tree gt12=((b:1.21,c:1.21):1.622,(a:2.443,(f:1.804,(d:0.583,e:0.583):1.221):0.639):0.389);</span><br><span class="line">Tree gt13=(a:2.047,((b:1.025,c:1.025):0.519,(f:1.295,(d:0.738,e:0.738):0.556):0.249):0.503);</span><br><span class="line">Tree gt14=(a:2.58,((d:0.919,e:0.919):0.834,(f:1.503,(b:1.228,c:1.228):0.275):0.251):0.827);</span><br><span class="line">Tree gt15=((f:1.267,(d:0.871,e:0.871):0.396):1.67,(a:2.181,(b:1.362,c:1.362):0.819):0.756);</span><br><span class="line">Tree gt16=(a:3.016,(b:1.892,(c:1.816,(f:1.479,(d:0.812,e:0.812):0.667):0.337):0.076):1.124);</span><br><span class="line">Tree gt17=((f:1.186,(d:0.721,e:0.721):0.465):1.822,(a:2.031,(b:1.13,c:1.13):0.902):0.977);</span><br><span class="line">Tree gt18=((c:1.51,(f:1.166,(d:0.521,e:0.521):0.645):0.345):1.218,(a:2.073,b:2.073):0.655);</span><br><span class="line">Tree gt19=(a:2.329,((b:1.354,c:1.354):0.467,(f:1.392,(d:0.955,e:0.955):0.437):0.429):0.508);</span><br><span class="line">Tree 
gt20=(a:3.31,((e:1.083,f:1.083):2.08,(b:1.923,(c:0.538,d:0.538):1.385):1.241):0.146);</span><br><span class="line"></span><br><span class="line">END;</span><br><span class="line"></span><br><span class="line"></span><br><span class="line">BEGIN PHYLONET;</span><br><span class="line"></span><br><span class="line">InferNetwork_MPL (all) 2 -pl 8;</span><br><span class="line"></span><br><span class="line">END;</span><br></pre></td></tr></table></figure><p>① Preparing script.nex takes the following steps: a. Align, trim, and build a tree for each of the available sequence sets; see <a href="https://wu-tz.github.io/2021/12/26/pipeline-of-phylogeny/">test_pipeline-of-phylogeny | TianzhenWu’ Blog (wu-tz.github.io)</a>. b. Root each of the resulting treefiles; see <a href="https://wu-tz.github.io/2023/10/05/%E5%88%A9%E7%94%A8ete3%E6%89%B9%E9%87%8F%E5%AF%B9%E5%9F%BA%E5%9B%A0%E6%A0%91%E5%AE%9A%E6%A0%B9/">利用ete3批量对基因树定根并检测单系性 | TianzhenWu’ Blog (wu-tz.github.io)</a>. c. Put all treefiles into one file and use a shell command to add a prefix such as "Tree gt1=" to each line. d. Add the opening and closing lines of the NEXUS format by hand. e. Append the PHYLONET block with the program options at the end of the file; for the choice of method see <a href="https://phylogenomics.rice.edu/html/phylonetTutorial.html">https://phylogenomics.rice.edu/html/phylonetTutorial.html</a>.</p><p>② Run it with the following command:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">java -jar PhyloNetv3_8_2.jar script.nex</span><br></pre></td></tr></table></figure><h5 id="3、PhyloNet结果解读"><a href="#3、PhyloNet结果解读" class="headerlink" title="3、PhyloNet结果解读"></a>3. Interpreting the PhyloNet results</h5><p>Each run outputs five candidate networks, each reported with its Total log probability; the larger this value, the better supported the network.</p><h5 id="4、-Visualizing-a-Phylogenetic-Network"><a href="#4、-Visualizing-a-Phylogenetic-Network" class="headerlink" title="4、 Visualizing a Phylogenetic Network"></a>4. Visualizing a Phylogenetic Network</h5><p>A phylogenetic network in Rich Newick format can be visualized in <a href="http://ab.inf.uni-tuebingen.de/software/dendroscope/">Dendroscope</a> or <a href="https://icytree.org/">icytree</a>. 
The former needs downloading, and the latter is online. However, Dendroscope cannot recognize inheritance probabilities (branch lengths are fine), and icytree sometimes can and sometimes cannot. You need to remove those probabilities manually from the Rich Newick string, or use the option "-di" so that PhyloNet returns a network that Dendroscope takes directly. Source: <a href="https://wiki.rice.edu/confluence/pages/viewpage.action?pageId=39500205#PhylonetTutorial(SSB2020)-3.BasicUsage">https://wiki.rice.edu/confluence/pages/viewpage.action?pageId=39500205#PhylonetTutorial(SSB2020)-3.BasicUsage</a></p><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
<categories>
<category> Reticulate evolution </category>
</categories>
<tags>
<tag> PhyloNetworks </tag>
</tags>
</entry>
<entry>
<title>Batch rooting gene trees and checking monophyly with ete3</title>
<link href="/2023/10/05/%E5%88%A9%E7%94%A8ete3%E6%89%B9%E9%87%8F%E5%AF%B9%E5%9F%BA%E5%9B%A0%E6%A0%91%E5%AE%9A%E6%A0%B9/"/>
<url>/2023/10/05/%E5%88%A9%E7%94%A8ete3%E6%89%B9%E9%87%8F%E5%AF%B9%E5%9F%BA%E5%9B%A0%E6%A0%91%E5%AE%9A%E6%A0%B9/</url>
<content type="html"><![CDATA[<h4 id="利用ete3的set-outgroup函数批量对基因树定根并利用check-monophyly函数检测单系性"><a href="#利用ete3的set-outgroup函数批量对基因树定根并利用check-monophyly函数检测单系性" class="headerlink" title="利用ete3的set_outgroup函数批量对基因树定根并利用check_monophyly函数检测单系性"></a>Batch rooting gene trees with ete3's set_outgroup function and checking monophyly with check_monophyly</h4><h5 id="对溯祖法得到的多个基因树进行定根,将多个基因树文件cat到一个文件中,得到alltree-txt,利用以下脚本对其批量定根。"><a href="#对溯祖法得到的多个基因树进行定根,将多个基因树文件cat到一个文件中,得到alltree-txt,利用以下脚本对其批量定根。" class="headerlink" title="对溯祖法得到的多个基因树进行定根,将多个基因树文件cat到一个文件中,得到alltree.txt,利用以下脚本对其批量定根。"></a>To root the many gene trees used in a coalescent analysis, cat the gene-tree files into a single file, alltree.txt, and root them in batch with the script below.</h5><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">### install the ete3 package first</span></span><br><span class="line"><span class="keyword">from</span> ete3 <span class="keyword">import</span> Tree</span><br><span class="line"><span class="keyword">for</span> i <span class="keyword">in</span> <span class="built_in">range</span>(<span class="number">7121</span>): <span class="comment"># pre-create empty output files 0.txt to 7120.txt for the rooted trees; adjust the number to your gene count (t.write below also creates each file on its own)</span></span><br><span class="line">    <span class="keyword">with</span> <span class="built_in">open</span>(<span class="string">'./%s'</span>%i + <span class="string">'.txt'</span>,<span class="string">"a"</span>) <span class="keyword">as</span> f:</span><br><span class="line">        f.write(<span class="string">""</span>)</span><br><span class="line"></span><br><span class="line">n=<span class="number">0</span></span><br><span class="line"><span class="keyword">with</span> <span class="built_in">open</span>(<span 
class="string">'alltree.txt'</span>,<span class="string">'r'</span>) <span class="keyword">as</span> f:</span><br><span class="line">    <span class="keyword">for</span> line <span class="keyword">in</span> f:</span><br><span class="line">        t = Tree(line)</span><br><span class="line">        t.set_outgroup(t&<span class="string">"mcap"</span>) <span class="comment"># set the outgroup: change the species name inside the quotes; for an outgroup of two species use t.set_outgroup(t.get_common_ancestor("mcap1", "mcap2"))</span></span><br><span class="line">        t.write(outfile=<span class="built_in">str</span>(n)+<span class="string">".txt"</span>) <span class="comment"># write the rooted tree from each line to its own file</span></span><br><span class="line">        n=n+<span class="number">1</span></span><br></pre></td></tr></table></figure><p>Save this as a Python script and run it, then merge the per-tree files for downstream analyses. The pattern below matches only the numbered files, since a bare *.txt would also pick up alltree.txt and the output file itself.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">sed <span class="string">''</span> [0-9]*.txt > alltree.rooted.txt</span><br></pre></td></tr></table></figure><h5 id="检测单系性"><a href="#检测单系性" class="headerlink" title="检测单系性"></a>Checking monophyly</h5><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">from</span> ete3 <span class="keyword">import</span> Tree</span><br><span class="line"><span class="keyword">with</span> <span class="built_in">open</span>(<span class="string">'alltree.rooted.txt'</span>,<span class="string">'r'</span>) <span class="keyword">as</span> f: <span class="comment"># alltree.rooted.txt holds all gene trees, one tree per line</span></span><br><span class="line">    <span class="keyword">for</span> line <span class="keyword">in</span> f:</span><br><span class="line">        t = Tree(line)</span><br><span class="line">        <span class="built_in">print</span>(t.check_monophyly(values=[<span class="string">"aawi"</span>, <span 
class="string">"asub"</span>, <span class="string">"aflo"</span>, <span class="string">"agem"</span>, <span class="string">"aint"</span>, <span class="string">"apal"</span>, <span class="string">"adig"</span>, <span class="string">"alor"</span>, <span class="string">"aacu"</span>, <span class="string">"anas"</span>, <span class="string">"amic"</span>, <span class="string">"amil"</span>, <span class="string">"asel"</span>, <span class="string">"acyt"</span>, <span class="string">"ahya"</span>, <span class="string">"amur"</span>, <span class="string">"aech"</span>], target_attr=<span class="string">"name"</span>)) <span class="comment">#利用check_monophyly函数检测单系性,修改方括号内的物种名即可</span></span><br></pre></td></tr></table></figure><p>将上述写为python脚本并运行,输出内容为true或者false,便可判断所列物种在每个基因树中是否为单系。</p><p>参考</p><p><a href="http://etetoolkit.org/docs/latest/tutorial/tutorial_trees.html#tree-rooting">使用树数据结构 — ETE 工具包 - 树的分析和可视化 (etetoolkit.org)</a></p><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
<categories>
<category> Batch processing methods </category>
</categories>
<tags>
<tag> ete3 </tag>
</tags>
</entry>
<entry>
<title>Gene family expansion and contraction analysis with CAFE5</title>
<link href="/2023/09/02/%E5%88%A9%E7%94%A8cafe5%E8%BF%9B%E8%A1%8C%E5%9F%BA%E5%9B%A0%E5%AE%B6%E6%97%8F%E6%89%A9%E5%BC%A0%E6%94%B6%E7%BC%A9%E5%88%86%E6%9E%90/"/>
<url>/2023/09/02/%E5%88%A9%E7%94%A8cafe5%E8%BF%9B%E8%A1%8C%E5%9F%BA%E5%9B%A0%E5%AE%B6%E6%97%8F%E6%89%A9%E5%BC%A0%E6%94%B6%E7%BC%A9%E5%88%86%E6%9E%90/</url>
<content type="html"><![CDATA[<h3 id="利用cafe5进行基因家族扩张收缩分析"><a href="#利用cafe5进行基因家族扩张收缩分析" class="headerlink" title="Gene family expansion and contraction analysis with CAFE5"></a>Gene family expansion and contraction analysis with CAFE5</h3><p><em>Tested on September 1, 2023</em></p><blockquote><p>Starting from the OrthoFinder orthogroup clustering results, CAFE5 shows which gene families expanded or contracted at specific nodes of the tree.</p></blockquote><blockquote><p>Software installation</p></blockquote>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Install and run in a separate conda environment to avoid conflicts</span></span><br><span class="line">conda activate biosoft</span><br><span class="line">conda install -c bioconda cafe</span><br><span class="line"></span><br><span class="line"><span class="comment"># Check that the installation succeeded</span></span><br><span class="line">cafe5 -h</span><br></pre></td></tr></table></figure>
<blockquote><p>Two input files are needed: 1. a tree with divergence times, produced by MCMCtree; 2. the orthologous gene family clustering, derived from the OrthoFinder results.</p></blockquote>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># step 1: prepare the tree file; FigTree.tre comes from the MCMCtree output</span></span><br><span class="line"></span><br><span class="line">grep <span class="string">"UTREE 1 ="</span> FigTree.tre | sed -E -e <span class="string">"s/\[[^]]*\]//g"</span> -e <span class="string">"s/[ \t]//g"</span> -e <span class="string">"/^$/d"</span> -e <span class="string">"s/UTREE1=//"</span> > tree.txt</span><br><span class="line"></span><br><span class="line"><span class="comment"># step 2: convert the OrthoFinder2 result file Orthogroups.GeneCount.tsv into the CAFE gene family table (orthomcl2cafe.tab here); the file is in /home/tianzhen/2023-redo-cafe/OrthoFinder/Results_Aug27/Orthogroups</span></span><br><span class="line"></span><br><span class="line">awk -v OFS=<span class="string">"\t"</span> <span class="string">'{$NF=null;print $1,$0}'</span> Orthogroups.GeneCount.tsv |sed -E -e <span class="string">'s/Orthogroup/desc/'</span> -e <span class="string">'s/_[^\t]+//g'</span> > orthomcl2cafe.tab</span><br><span class="line"> </span><br><span class="line"><span class="comment"># Alternatively, build orthomcl2cafe.tab with a perl script; Orthogroups.txt is at /OrthoFinder/Results_Nov22/Orthogroups/Orthogroups.txt</span></span><br><span class="line"></span><br><span class="line">perl orthoMCL2cafe.pl Orthogroups.txt > orthomcl2cafe.tab</span><br><span class="line"></span><br><span class="line"><span class="comment"># step 3: remove gene families whose copy number differs too much between species; otherwise cafe5 fails</span></span><br><span class="line"></span><br><span class="line">python ~/scripts/cafetutorial_clade_and_size_filter.py -i orthomcl2cafe.tab -o gene_family_filter.txt -s</span><br></pre></td></tr></table></figure>
<blockquote><p>Running cafe5</p></blockquote>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># step 1: estimate lambda first; this step can probably be skipped, because cafe5 estimates lambda itself whenever -l is not given</span></span><br><span class="line">cafe5 -i orthomcl2cafe.tab -t tree.txt -p -o singlelambda</span><br><span class="line"></span><br><span class="line"><span class="comment"># cafe5 may fail here because some gene families are too large; the error output looks like this:</span></span><br><span class="line">Families with largest size differentials:</span><br><span class="line">OG0000385: 97</span><br><span class="line">OG0000296: 97</span><br><span class="line">OG0000029: 96</span><br><span class="line">OG0000030: 95</span><br><span class="line">OG0000016: 93</span><br><span class="line">OG0000142: 92</span><br><span class="line">OG0000232: 91</span><br><span class="line">OG0000035: 91</span><br><span class="line">OG0000155: 89</span><br><span class="line">OG0000115: 89</span><br><span class="line">OG0000027: 88</span><br><span class="line">OG0000079: 86</span><br><span class="line">OG0000034: 83</span><br><span class="line">OG0000044: 82</span><br><span class="line">OG0000055: 80</span><br><span class="line">OG0000067: 79</span><br><span class="line">OG0000943: 78</span><br><span class="line">OG0000050: 77</span><br><span class="line">OG0000110: 76</span><br><span class="line">OG0000088: 76</span><br><span class="line">You may want to try removing the top few families with the largest difference</span><br><span class="line">between the max and min counts and <span class="keyword">then</span> re-run the analysis.</span><br><span class="line"></span><br><span class="line">Failed to initialize any reasonable values</span><br><span class="line"></span><br><span class="line"><span class="comment"># Paste the families listed above into a file so that they can be removed from orthomcl2cafe.tab</span></span><br><span class="line">vi families_largest.txt</span><br><span class="line">sed -i <span class="string">'s/:.*//g'</span> families_largest.txt</span><br><span class="line"><span class="keyword">for</span> i <span class="keyword">in</span> `cat families_largest.txt`;<span class="keyword">do</span> sed -i <span class="string">"/<span class="variable">$i</span>/d"</span> orthomcl2cafe.tab;<span class="keyword">done</span></span><br><span class="line"><span class="comment"># One-command alternative (the file names are placeholders): grep -v -f family.remove.id.txt input.tab >input.tab.new</span></span><br><span class="line"><span class="comment"># Re-run to estimate lambda</span></span><br><span class="line">cafe5 -i orthomcl2cafe.tab -t tree.txt -p -o singlelambda</span><br><span class="line"></span><br><span class="line"><span class="comment"># step 2: run with k set to 2 through 5 and keep the result from the best k</span></span><br><span class="line">cafe5 -i orthomcl2cafe.tab -t tree.txt -p -k 2 -l 0.0001 -o k2p</span><br><span class="line">cafe5 -i orthomcl2cafe.tab -t tree.txt -p -k 3 -l 0.0001 -o k3p</span><br><span class="line">cafe5 -i orthomcl2cafe.tab -t tree.txt -p -k 4 -l 0.0001 -o k4p</span><br><span class="line">cafe5 -i orthomcl2cafe.tab -t tree.txt -p -k 5 -l 0.0001 -o k5p</span><br></pre></td></tr></table></figure>
<blockquote><p>Check the likelihood in each run's Gamma_results.txt and use the run with the best k for downstream analysis: the highest lnL, which is the smallest value when the file reports -lnL.</p><p>Next, extract the gene family sequences for the node of interest.</p></blockquote>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Extract the families that expanded at node 35 from Gamma_change.tab (column 37)</span></span><br><span class="line">cat ../Gamma_change.tab |cut -f1,37|grep <span class="string">"+[1-9]"</span> > cr.expanded </span><br><span class="line"></span><br><span class="line"><span class="comment"># Using the sample ID and node number, extract the gene family trees with a significant expansion or contraction on the sample branch (Gamma_asr.tre marks a change as significant at p<0.05 by default)</span></span><br><span class="line">grep <span class="string">"<35>\*"</span> ../Gamma_asr.tre > cr_significant_trees.tre </span><br><span class="line"></span><br><span class="line"><span class="comment"># Extract the OG IDs that changed significantly on the sample branch (p<0.05 by default)</span></span><br><span class="line">grep -E -o <span class="string">"OG[0-9]+"</span> cr_significant_trees.tre > cr_significant.ogs </span><br><span class="line"></span><br><span class="line"><span class="comment"># Extract every orthogroup ID with a significant expansion or contraction at p<0.05 (adjust as needed; p<0.05 and p<0.01 are common choices)</span></span><br><span class="line">awk <span class="string">'$2 <0.05 {print $1}'</span> ../Gamma_family_results.txt >p0.05_significant.ogs </span><br><span class="line"></span><br><span class="line"><span class="comment"># Keep the OG IDs that changed significantly on the sample branch and also pass the p<0.05 family-level test</span></span><br><span class="line">grep -f cr_significant.ogs p0.05_significant.ogs > cr_p0.05_significant.ogs </span><br><span class="line"></span><br><span class="line"><span class="comment"># Extract the orthogroup IDs that expanded significantly in the sample species</span></span><br><span class="line">grep -f cr_p0.05_significant.ogs cr.expanded |cut -f1 > cr.expanded.significant</span><br><span class="line"></span><br><span class="line"><span class="comment"># Extract the list of significantly expanded genes, assuming the gene ID prefix is amil</span></span><br><span class="line">grep -f cr.expanded.significant ~/2023-redo-cafe/OrthoFinder/Results_Aug27/Orthogroups/Orthogroups.txt|sed <span class="string">"s/ /\n/g"</span>|grep <span class="string">"amil"</span> |sort -k 1.3n |uniq > cr.amil.expanded.significant.genes</span><br><span class="line"><span class="comment"># For an ancestral node, extract the genes of every species; the loop runs the same command once per species prefix</span></span><br><span class="line">species=<span class="string">"adig aequ afen alor amil apall atene atenu dgig disc edia fung gasp gfas hvul nvec ofav pdae pdam plut pspe pverr rmue spis xeni"</span></span><br><span class="line"><span class="keyword">for</span> sp <span class="keyword">in</span> <span class="variable">$species</span>; <span class="keyword">do</span></span><br><span class="line">  grep -f cr.expanded.significant ~/2023-redo-cafe/OrthoFinder/Results_Aug27/Orthogroups/Orthogroups.txt|sed <span class="string">"s/ /\n/g"</span>|grep <span class="string">"<span class="variable">$sp</span>"</span> |sort -k 1.3n |uniq > cr.<span class="variable">${sp}</span>.expanded.significant.genes</span><br><span class="line"><span class="keyword">done</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># Extract the sequences of all significantly expanded genes for annotation (run from a subdirectory in the same shell session, hence the ../ paths)</span></span><br><span class="line"><span class="keyword">for</span> sp <span class="keyword">in</span> <span class="variable">$species</span>; <span class="keyword">do</span></span><br><span class="line">  seqkit grep -f ../cr.<span class="variable">${sp}</span>.expanded.significant.genes /home/tianzhen/2023-redo-cafe/<span class="variable">${sp}</span>.fasta > cr.<span class="variable">${sp}</span>.expanded.significant.pep.fas</span><br><span class="line"><span class="keyword">done</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># Merge the sequences above into one FASTA file for functional annotation</span></span><br><span class="line">cat cr.*.expanded.significant.pep.fas > all_sequences_for_anno.fas</span><br></pre></td></tr></table></figure>
<p>References</p><p><a href="https://github.com/hahnlab/CAFE5">hahnlab/CAFE5: Version 5 of the CAFE phylogenetics software (github.com)</a></p><p><a href="https://yanzhongsino.github.io/2021/10/29/bioinfo_gene.family_CAFE5/">Analyzing gene family expansion and contraction with CAFE5 | 生信技工 (yanzhongsino.github.io)</a></p><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
<categories>
<category> Analysis notes </category>
</categories>
<tags>
<tag> cafe5 </tag>
</tags>
</entry>
<entry>
<title>ete3</title>
<link href="/2023/04/04/ete3/"/>
<url>/2023/04/04/ete3/</url>
<content type="html"><![CDATA[<h3 id="如何利用ete3包从系统发育树中提取子树"><a href="#如何利用ete3包从系统发育树中提取子树" class="headerlink" title="How to extract a subtree from a phylogenetic tree with ete3"></a>How to extract a subtree from a phylogenetic tree with ete3</h3><p><strong>ete3: a Python package for building, comparing, annotating, manipulating and visualizing phylogenetic trees.</strong></p><span id="more"></span><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">conda install ete3 # install with conda</span><br><span class="line">python # start the interactive Python interpreter in the directory that holds the tree file</span><br><span class="line">import ete3 # import the ete3 package</span><br><span class="line">t = ete3.Tree("tree.txt") # load the original tree as t</span><br><span class="line">subtree_taxa = ["Orbicella_annularis","Pocillipora_damicornis","Stylophora_pistillata","Hydra_vulgaris","Acropora_digitifera","Acropora_millepora","Acropora_tenuis","Porites_lutea","Renilla_muelleri","Discosoma_santahelenae","Amplexidiscus_fenestrafer","Dendronephthya_sinaiensis","Nematostella_vectensis","Exaiptasia_pallida","Actinia_equina"] # leaf names to keep; they must match the names in tree.txt exactly</span><br><span class="line">t.prune(subtree_taxa,preserve_branch_length=True) # prune t in place down to these taxa, keeping branch lengths</span><br><span class="line">print(t) # show the subtree topology</span><br><span class="line">t.write(outfile="subtree.txt") # save the subtree to a file</span><br></pre></td></tr></table></figure><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
<tags>
<tag> Methods </tag>
</tags>
</entry>
<entry>
<title>learn_perl</title>
<link href="/2022/12/17/learn-perl/"/>
<url>/2022/12/17/learn-perl/</url>
<content type="html"><![CDATA[<p><strong>Perl是一种高效处理文本文件的脚本语言,下面记录了一些常用的基因序列处理的perl函数或工具</strong></p><img src="/2022/12/17/learn-perl/perl.jpg" class title="perl"><span id="more"></span><h3 id="数据类型"><a href="#数据类型" class="headerlink" title="数据类型"></a>数据类型</h3><h4 id="1、scalar"><a href="#1、scalar" class="headerlink" title="1、scalar"></a>1、scalar</h4><p>大小写转换工具:</p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">uc</span> $seq_of_BRAC2; <span class="comment">#转大写</span></span><br><span class="line"><span class="keyword">lc</span> $seq_of_BRAC2; <span class="comment">#转小写</span></span><br><span class="line">$seq_of_BRAC2 =~ <span class="regexp">tr/atgc/ATGC/</span>; <span class="comment">#转大写</span></span><br><span class="line">$seq_of_BRAC2 =~ <span class="regexp">tr/ATGC/atgc/</span>; <span class="comment">#转小写</span></span><br></pre></td></tr></table></figure><p>利用tr工具统计4种碱基的个数</p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">my</span> $count = $seperate_dna =~ <span class="regexp">tr/atgc/atgc/</span>; <span class="comment">#统计$seperate_dna中小写atgc的数目</span></span><br><span class="line"><span class="keyword">my</span> $count = $seperate_dna =~ <span class="regexp">tr/ATGC/ATGC/</span>; <span class="comment">#统计$seperate_dna中大写ATGC的数目</span></span><br><span class="line"><span class="keyword">my</span> $count = $seperate_dna =~ <span class="regexp">tr/A/A/</span>; <span class="comment">#统计$seperate_dna中A的数目</span></span><br></pre></td></tr></table></figure><p>取反向互补序列</p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span 
class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">my</span> $reversed_zika_DNA = <span class="keyword">reverse</span>($zika_DNA);</span><br><span class="line">$reversed_zika_DNA =~ <span class="regexp">tr/ATCGatcg/TAGCTAGC/</span>;</span><br></pre></td></tr></table></figure><p>利用点符号连接字符串</p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">my</span> $M_codon = <span class="string">"AUG"</span>;</span><br><span class="line"><span class="keyword">my</span> $S_codon = <span class="string">"UCA"</span>;</span><br><span class="line"></span><br><span class="line"><span class="keyword">my</span> $RNA_seq = $M_codon.$S_codon;</span><br></pre></td></tr></table></figure><p>统计字符串长度</p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">my</span> $zika_DNA = <span class="string">"AGTTGTTGATCTGTGTGAGT"</span>;</span><br><span class="line"><span class="keyword">my</span> $zika_DNA_lenth = <span class="keyword">length</span>($zika_DNA);</span><br></pre></td></tr></table></figure><p>替换掉序列中的空格及子片段</p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">$seq_of_BRAC2 =~ <span class="regexp">s/\s//g</span>;</span><br><span class="line">$zika_DNA =~ <span class="regexp">s/atg/ATG/g</span>;</span><br><span class="line">$zika_DNA =~ <span class="regexp">s/[0123456789]//g</span>;</span><br></pre></td></tr></table></figure><p>格式化打印输出内容,分别打印3个最长长度为15,10,10的字符串</p><figure class="highlight perl"><table><tr><td 
class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">printf</span> <span class="string">"%15s %10s %10s \n"</span>,<span class="string">"Amino acid"</span>,<span class="string">"1-letter"</span>,<span class="string">"codon"</span>;</span><br></pre></td></tr></table></figure><p>利用index函数的返回值判断一个字符串是否包含另一个字符串,返回-1表示不包含,返回其他数值表示$shorter_seq在$seq中的索引</p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">my</span> $index = <span class="keyword">index</span>($seq,$shorter_seq);</span><br><span class="line"><span class="keyword">if</span> ($index eq -<span class="number">1</span>){</span><br><span class="line"><span class="keyword">print</span> <span class="string">"the second sequence is not a substring of the first string"</span>;</span><br><span class="line">}</span><br><span class="line"><span class="keyword">else</span>{</span><br><span class="line"><span class="keyword">print</span> <span class="string">"the second sequence is a substring of the first string"</span>;</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>取一个字符串在另一个字符串的索引值</p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">my</span> $first_index = <span class="keyword">index</span>($up_dna,$motif); <span class="comment">#取第一个索引值</span></span><br><span class="line"><span class="keyword">my</span> $second_index = <span class="keyword">index</span>($up_dna,$motif,($first_index + <span class="keyword">length</span>($motif))); 
<span class="comment"># the third argument is the position to start searching from</span></span><br><span class="line"><span class="keyword">my</span> $last_index = <span class="keyword">rindex</span>($up_dna,$motif); <span class="comment"># position of the last occurrence; returns -1 if not found</span></span><br></pre></td></tr></table></figure><p>Use substr to extract a fragment from a DNA sequence by position</p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">my</span> $separate_dna = <span class="keyword">substr</span>($dna,$first_index + <span class="number">4</span>,$last_index - $first_index - <span class="number">4</span>); <span class="comment"># the three arguments are the DNA string, the start position of the fragment, and the length of the fragment</span></span><br></pre></td></tr></table></figure><h4 id="2、array"><a href="#2、array" class="headerlink" title="2、array"></a>2. array</h4><p>The first element of an array has index 0, so $stop_codon[0] is the first element of @stop_codon</p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">my</span> @stop_codon = (<span class="string">"TAA"</span>,<span class="string">"TAG"</span>);</span><br><span class="line"><span class="keyword">print</span> <span class="string">"Stop codons are @stop_codon\n"</span>;</span><br><span class="line"><span class="keyword">my</span> $first_stop_codon = $stop_codon[<span class="number">0</span>];</span><br><span class="line">$stop_codon[<span class="number">2</span>] = <span class="string">"TGA"</span>; <span class="comment"># assigning to the next free index appends an element</span></span><br></pre></td></tr></table></figure><p>Add elements to or remove elements from an array</p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span 
class="keyword">my</span> @aa = (<span class="string">"GAA"</span>,<span class="string">"GAG"</span>);</span><br><span class="line"><span class="keyword">push</span> (@aa,<span class="string">"GAU"</span>); <span class="comment"># append an element to the end of the array</span></span><br><span class="line"><span class="keyword">unshift</span> (@aa,<span class="string">"GAC"</span>); <span class="comment"># insert an element at the front</span></span><br><span class="line"><span class="keyword">pop</span> @aa; <span class="comment"># remove the last element</span></span><br><span class="line"><span class="keyword">shift</span> (@aa); <span class="comment"># remove the first element</span></span><br></pre></td></tr></table></figure><p>Sort the elements of an array with sort (the default order is string comparison)</p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">@input_line = <span class="keyword">sort</span>(@input_line);</span><br><span class="line"></span><br><span class="line"><span class="keyword">my</span> @sorted_numbers = <span class="keyword">sort</span> { $a <=> $b } @unsorted_numbers;</span><br><span class="line"><span class="keyword">my</span> @sorted_numbers1 = <span class="keyword">sort</span> { $b <=> $a } @unsorted_numbers;</span><br><span class="line"><span class="comment"># sort { $a <=> $b } sorts numerically in ascending order; sort { $b <=> $a } in descending order</span></span><br></pre></td></tr></table></figure><p>Process each element of an array in turn with foreach</p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">foreach</span> <span class="keyword">my</span> $i (@input_line){</span><br><span class="line"><span class="keyword">print</span> <span class="string">"$i\n"</span>;</span><br><span 
class="line">}</span><br></pre></td></tr></table></figure><p>Count the elements of an array with the scalar function</p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">my</span> $number_of_element = <span class="keyword">scalar</span> @aa;</span><br></pre></td></tr></table></figure><p>Split a string into an array with split</p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">my</span> $BRAC2_seq = <span class="string">"gggtgcgacgattcattgttttcggacaag"</span>;</span><br><span class="line"><span class="keyword">my</span> @nucleotides = <span class="keyword">split</span>(<span class="regexp">//</span>,$BRAC2_seq); <span class="comment"># split on the pattern between the slashes; an empty pattern yields one base per element</span></span><br></pre></td></tr></table></figure><p>Join the elements of an array into one string with join (the inverse of split)</p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">my</span> $BRAC2_seq = <span class="keyword">join</span>(<span class="string">''</span>,@nucleotides); <span class="comment"># the quoted string is the separator; here it is empty</span></span><br></pre></td></tr></table></figure><h4 id="3、hash"><a href="#3、hash" class="headerlink" title="3、hash"></a>3. hash</h4><p>Store the keys and the values of a hash in separate arrays</p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">my</span> %restriction_enzymes = (<span class="string">"EcoRI"</span> => <span class="string">"GAATTC"</span>,</span><br><span class="line"><span 
class="string">"AluI"</span> => <span class="string">"AGCT"</span>,</span><br><span class="line"><span class="string">"NotI"</span> => <span class="string">"GCGGCCGC"</span>,</span><br><span class="line"><span class="string">"TaqI"</span> => <span class="string">"TCGA"</span>);</span><br><span class="line"><span class="keyword">my</span> @key_list = <span class="keyword">keys</span> %restriction_enzymes;</span><br><span class="line"><span class="keyword">print</span> <span class="string">"@key_list\n"</span>;</span><br><span class="line"><span class="keyword">my</span> @value_list = <span class="keyword">values</span> %restriction_enzymes;</span><br><span class="line"><span class="keyword">print</span> <span class="string">"@value_list\n"</span>;</span><br></pre></td></tr></table></figure><p>Delete an element from a hash</p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">delete</span> $restriction_enzymes{<span class="string">"TaqI"</span>};</span><br></pre></td></tr></table></figure><h3 id="循环与判断"><a href="#循环与判断" class="headerlink" title="循环与判断"></a>Loops and conditionals</h3><p>Perl's if-else structure: if (condition) { statements; } else { statements; }</p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">my</span> $dna_segment = <span class="string">"ATGACATGA"</span>;</span><br><span class="line"><span class="keyword">my</span> $codon1 = <span class="keyword">substr</span>($dna_segment,<span class="number">0</span>,<span class="number">3</span>);</span><br><span class="line"><span class="keyword">if</span>( $codon1 eq <span class="string">"ATG"</span> ){</span><br><span 
class="line"><span class="keyword">print</span> <span class="string">"codon $codon1 is a start codon.\n"</span>;</span><br><span class="line">}</span><br><span class="line"><span class="keyword">else</span> {</span><br><span class="line"><span class="keyword">print</span> <span class="string">"codon $codon1 is not a start codon.\n"</span>;</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>Perl's if-elsif-else structure: if (condition) { statements; } elsif (condition) { statements; } else { statements; }</p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">my</span> $dna_segment = <span class="string">"ATGACATGACCAATAA"</span>;</span><br><span class="line"><span class="keyword">my</span> $codon = <span class="keyword">substr</span>($dna_segment,-<span class="number">3</span>,<span class="number">3</span>);</span><br><span class="line"><span class="keyword">if</span>($codon eq <span class="string">"ATG"</span>){</span><br><span class="line"><span class="keyword">print</span> <span class="string">"codon $codon is a start codon\n"</span>;</span><br><span class="line">}</span><br><span class="line"><span class="keyword">elsif</span>(($codon eq <span class="string">"TAA"</span>) <span class="keyword">or</span> ($codon eq <span class="string">"TAG"</span>) <span class="keyword">or</span> ($codon eq <span class="string">"TGA"</span>)){</span><br><span class="line"><span class="keyword">print</span> <span class="string">"codon $codon is a stop codon\n"</span>;</span><br><span class="line">}</span><br><span class="line"><span class="keyword">else</span>{</span><br><span 
class="line"><span class="keyword">print</span> <span class="string">"Codon $codon is neither a start nor a stop codon\n"</span>;</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>Perl's while loop structure: while (condition) { statements; }</p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">while</span>($index >= <span class="number">0</span>){</span><br><span class="line">$reversed = $reversed.substr($zika_dna,$index,<span class="number">1</span>);</span><br><span class="line">$index = $index - <span class="number">1</span>;</span><br><span class="line">} <span class="comment"># builds the reversed sequence by stepping the index backwards in a while loop</span></span><br></pre></td></tr></table></figure><p>Infinite loop (an empty condition is always true)</p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">while</span>( ){</span><br><span class="line">commands;</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>Perl's foreach loop structure: foreach my $i (array) { statements; }</p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">my</span> @bases = (<span class="string">"T"</span>,<span class="string">"C"</span>,<span class="string">"A"</span>,<span class="string">"G"</span>);</span><br><span class="line"><span class="keyword">foreach</span> <span class="keyword">my</span> $base (@bases){</span><br><span class="line"><span class="keyword">print</span> <span class="string">"$base"</span>;</span><br><span 
class="line">}</span><br></pre></td></tr></table></figure><p>Perl's for loop structure: for (initialization; condition; step) { statements; }</p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">my</span> $population = <span class="number">425</span>;</span><br><span class="line"><span class="keyword">for</span> (<span class="keyword">my</span> $year = <span class="number">0</span>; $year <= <span class="number">28</span>; $year++){</span><br><span class="line"><span class="keyword">print</span> <span class="string">"at year $year, the population is $population\n"</span>;</span><br><span class="line">$population = $population + $population * <span class="number">0.0194</span>;</span><br><span class="line">}</span><br></pre></td></tr></table></figure><h3 id="文件操作"><a href="#文件操作" class="headerlink" title="文件操作"></a>File operations</h3><p>Open a file through the filehandle FF. $! is a special variable that holds the operating system's error for the last failed call (the error number in numeric context, the error message in string context). The die function prints the message you give it to standard error, the stream reserved for such messages, then terminates the program immediately with a non-zero exit code.</p><p>Reading a file: the "<" mode opens the file named on the right for reading through the filehandle on the left; a while loop reads the content line by line; the close function closes the filehandle.</p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">my</span> $filename = <span class="string">"read_from_file.pl"</span>;</span><br><span class="line"><span class="keyword">open</span> (FF, <span class="string">"<"</span>, <span class="string">"$filename"</span>) <span class="keyword">or</span> <span class="keyword">die</span> <span class="string">"Cannot open $filename to read: $!"</span>;</span><br><span class="line"><span class="keyword">while</span> (<span 
class="keyword">my</span> $line = <FF>){</span><br><span class="line"><span class="keyword">print</span> <span class="string">"$line"</span>;</span><br><span class="line">}</span><br><span class="line"><span class="keyword">close</span> (FF);</span><br></pre></td></tr></table></figure><p>Writing a file: the ">" mode sends whatever is printed to the filehandle on the left into the file named on the right</p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">my</span> $filename = <span class="string">"a.fas"</span>;</span><br><span class="line"><span class="keyword">open</span> (FF, <span class="string">">"</span>, $filename) <span class="keyword">or</span> <span class="keyword">die</span> <span class="string">"cannot open $filename to write: $!"</span>;</span><br></pre></td></tr></table></figure><p>Modes accepted by the open function</p><table><thead><tr><th>Mode</th><th>Description</th></tr></thead><tbody><tr><td><</td><td>Open read-only, positioned at the start of the file</td></tr><tr><td>></td><td>Open for writing, positioned at the start of the file, and truncate the file to zero length. Create the file if it does not exist</td></tr><tr><td>>></td><td>Open for writing (append), positioned at the end of the file. Create the file if it does not exist</td></tr><tr><td>+<</td><td>Open for reading and writing, positioned at the start of the file</td></tr><tr><td>+></td><td>Open for reading and writing, positioned at the start of the file, and truncate the file to zero length. Create the file if it does not exist</td></tr><tr><td>+>></td><td>Open for reading and writing, positioned at the end of the file. Create the file if it does not exist</td></tr></tbody></table><p>Write content to the file through the filehandle FF</p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">print</span> FF <span class="string">"ATG\n"</span>;</span><br><span class="line"><span class="keyword">close</span> (FF); </span><br></pre></td></tr></table></figure><p>Open a directory with the opendir function</p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span 
class="line"><span class="keyword">opendir</span> ( DIR, $dirname ) || <span class="keyword">die</span> <span class="string">"Error in opening dir $dirname\n"</span>;</span><br><span class="line"><span class="keyword">while</span>( ($filename = <span class="keyword">readdir</span>(DIR))) {</span><br><span class="line"> <span class="keyword">print</span>(<span class="string">"$filename\n"</span>);</span><br><span class="line">}</span><br><span class="line"><span class="keyword">closedir</span>(DIR);</span><br></pre></td></tr></table></figure><p>Test whether a file exists</p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">if</span> (-e $filename){</span><br><span class="line"><span class="keyword">print</span> <span class="string">"Rosetta partial genome is written to $filename file successfully!\n"</span>;</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>Other file test operators: two or more tests can be combined in one expression with the "and" (&&) or "or" (||) operators. Some of the other file test operators are:</p><p><code>-r checks if the file is readable</code></p><p><code>-w checks if the file is writeable</code></p><p><code>-x checks if the file is executable</code></p><p><code>-z checks if the file is empty</code></p><p><code>-f checks if the file is a plain file</code></p><p><code>-d checks if the file is a directory</code></p><p><code>-l checks if the file is a symbolic link</code></p><h3 id="正则匹配"><a href="#正则匹配" class="headerlink" title="正则匹配"></a>Regular expression matching</h3><p>1. Test whether a string matches a regular expression</p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">if</span> ($seq =~ <span 
class="regexp">m/$motif/</span>){</span><br><span class="line"><span class="keyword">print</span> <span class="string">"found the motif\n"</span>;</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>2. After a successful match, the text captured by the first pair of parentheses in the pattern is stored automatically in the variable $1.</p><p>3. undef and the defined function: undef is similar to NULL in a database. It means there is no value at all; the variable is entirely undefined. It is not the same as the empty string or the number 0; it is a separate kind of value. Note that $1 keeps its value from the last successful match, so checking the result of the match itself is more robust than checking defined $1.</p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">$seq =~ <span class="regexp">m/($motif)/</span>;</span><br><span class="line"><span class="keyword">if</span> (<span class="keyword">defined</span> $1){</span><br><span class="line"><span class="keyword">print</span> <span class="string">"found the motif $1\n"</span>;</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>4. Test whether a string is a blank line; ^\s*$ matches a line that is empty or contains only whitespace</p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">if</span> ($motif =~ <span class="regexp">m/^\s*$/</span>){</span><br><span class="line"><span class="keyword">last</span>;</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>5. The g modifier requests global matching.</p><p>6. while ($seq =~ m/($motif)/g){} loops over every match in turn.</p><p>7. After a global (g) match, the pos() function returns the offset at which the last match ended, as in the following example:</p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span 
class="line">14</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">my</span> $seq = <span class="string">"AATGAAGGGCCGCTACGATAAGGAACTTCGTAATTTCAG"</span>;</span><br><span class="line"><span class="keyword">print</span> <span class="string">"seq = $seq\n"</span>;</span><br><span class="line"></span><br><span class="line"><span class="keyword">my</span> $motif = <span class="string">"[AT]{3,6}"</span>;</span><br><span class="line"><span class="keyword">my</span> $match_motif;</span><br><span class="line"><span class="keyword">my</span> $match_loc;</span><br><span class="line"><span class="keyword">my</span> $number_of_match = <span class="number">0</span>;</span><br><span class="line"></span><br><span class="line"><span class="keyword">while</span> ($seq =~ <span class="regexp">m/($motif)/g</span>){</span><br><span class="line">$match_motif = $1;</span><br><span class="line">$match_loc = <span class="keyword">pos</span>($seq) - <span class="keyword">length</span>($match_motif);</span><br><span class="line">$number_of_match = $number_of_match + <span class="number">1</span>;</span><br><span class="line"><span class="keyword">print</span> <span class="string">"match $number_of_match : $match_motif at $match_loc\n"</span>;</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>8. Use a regular expression to print a sequence in lines of at most $magic_number bases (10 bases per line in this case)</p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">while</span> ($seq =~ <span class="regexp">m/(.{1,$magic_number})/g</span>){</span><br><span class="line"><span class="keyword">print</span> <span class="string">"$1\n"</span>;</span><br><span class="line">}</span><br></pre></td></tr></table></figure><h3 id="正则表达式"><a href="#正则表达式" class="headerlink" title="正则表达式"></a>Regular expressions</h3><h4 id="1、正则字符及其含义"><a href="#1、正则字符及其含义" class="headerlink" 
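title="1、正则字符及其含义"></a>1. Regular expression characters and their meanings</h4><table><thead><tr><th>Symbol or character</th><th>Meaning</th></tr></thead><tbody><tr><td>.</td><td>Any character except a newline</td></tr><tr><td>\n</td><td>Newline</td></tr><tr><td>\t</td><td>Tab</td></tr><tr><td>\s</td><td>Any whitespace character, including space, newline, and tab</td></tr><tr><td>\S</td><td>Any non-whitespace character</td></tr><tr><td>\d</td><td>Any digit</td></tr><tr><td>\D</td><td>Any non-digit character</td></tr><tr><td>\w</td><td>Any single word character: a letter, a digit, or an underscore</td></tr><tr><td>\W</td><td>Any single non-word character</td></tr><tr><td>*</td><td>Match the preceding item zero or more times</td></tr><tr><td>+</td><td>Match the preceding item one or more times</td></tr><tr><td>?</td><td>Match the preceding item zero times or once / after a quantifier, make it non-greedy</td></tr><tr><td>{}</td><td>Repeat an exact number of times</td></tr><tr><td>{,}</td><td>Repeat between a minimum and a maximum number of times</td></tr><tr><td>()</td><td>Capture / group</td></tr><tr><td>\1</td><td>Backreference to the first capture</td></tr><tr><td>\2</td><td>Backreference to the second capture</td></tr><tr><td>\n</td><td>Backreference to the nth capture (n is a number)</td></tr><tr><td>^</td><td>Anchor the match at the start of the string</td></tr><tr><td>$</td><td>Anchor the match at the end of the string</td></tr><tr><td>[]</td><td>Any one character from a set of characters</td></tr><tr><td>[^]</td><td>Any one character not in the set</td></tr><tr><td>|</td><td>Or (alternation)</td></tr><tr><td>\</td><td>Escape character</td></tr><tr><td>(?=…)</td><td>Positive look-ahead. Matches if … matches next, but doesn’t consume any of the string</td></tr><tr><td>(?!…)</td><td>Negative look-ahead. 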
Matches if … doesn’t match next</td></tr></tbody></table><h4 id="2、正则表达式在DNA序列中的示例"><a href="#2、正则表达式在DNA序列中的示例" class="headerlink" title="2、正则表达式在DNA序列中的示例"></a>2. Regular expression examples on DNA sequences</h4><table><thead><tr><th>Regular expression</th><th>Example</th><th>Meaning</th></tr></thead><tbody><tr><td>AGA</td><td>T<strong>AGA</strong>TC</td><td>Matches AGA</td></tr><tr><td>^AGA</td><td><strong>AGA</strong>TGC</td><td>Matches AGA at the start of the string</td></tr><tr><td>TAA$</td><td>AAG<strong>TAA</strong></td><td>Matches TAA at the end of the string</td></tr><tr><td>A.T</td><td>AA<strong>ACT</strong>G</td><td>Matches A, then any one character (except a newline), then T</td></tr><tr><td>A.*T</td><td>C<strong>ATATCT</strong></td><td>Matches A followed by any number of characters, then T (greedy)</td></tr><tr><td>A.*?T</td><td>C<strong>AT</strong>ATCT</td><td>Matches A followed by any number of characters, then T (non-greedy)</td></tr><tr><td>(A.*?T)</td><td>C<strong>AT</strong>ATCT</td><td>Captures A followed by any number of characters, then T (non-greedy)</td></tr><tr><td>A{5}</td><td>T<strong>AAAAA</strong>TC</td><td>Matches five consecutive A's</td></tr><tr><td>TA{2,4}CG</td><td>C<strong>TAAACG</strong>A</td><td>Matches T, then two to four A's, then CG</td></tr><tr><td>[AT]CG</td><td>CCT<strong>TCG</strong>A</td><td>Matches ACG or TCG</td></tr><tr><td>(AA|CC|TT)CG</td><td>A<strong>CCCG</strong>TA</td><td>Matches AA or CC or TT, followed by CG</td></tr><tr><td>A(CG){3}T</td><td>G<strong>ACGCGCGT</strong>A</td><td>Matches A, then three CG's, then T</td></tr><tr><td>((.)(.)\3\2)</td><td>GA<strong>ATTA</strong>C</td><td>Captures four consecutive characters in which the first and fourth are the same and the second and third are the same</td></tr><tr><td>A\.T</td><td>TCG<strong>A.T</strong>AA</td><td>Matches the literal text A.T, because the dot is escaped</td></tr><tr><td>AAA(?=TAG|TGA|TAA)</td><td>T<strong>AAA</strong>TGAT</td><td>Matches an AAA that is followed by TAG, TGA, or TAA</td></tr></tbody></table><h3 id="Perl模块"><a href="#Perl模块" class="headerlink" title="Perl模块"></a>Perl modules</h3><h3 id="子程序"><a href="#子程序" class="headerlink" title="子程序"></a>Subroutines</h3><h3 id="参数传递"><a href="#参数传递" class="headerlink" title="参数传递"></a>Passing arguments</h3><p>Use the input redirection operator "<" to feed the contents of a file into a variable</p><p><code>perl inputs_as_an_array.pl < temp.txt</code></p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td 
class="code"><pre><span class="line"><span class="keyword">my</span> @input_line = <STDIN>; <span class="comment"># reads temp.txt into the @input_line array, one line per element</span></span><br><span class="line"><span class="comment"># without redirection, type the elements of @input_line at the keyboard: Enter starts the next element, CTRL+D ends the input</span></span><br></pre></td></tr></table></figure><p>Pass arguments into a script with @ARGV. When a Perl script runs, the arguments given on the command line are stored in the built-in array @ARGV, which is Perl's default array for receiving arguments. There can be several arguments: $ARGV[0] is the first one received and $ARGV[1] is the second. Usage (the two placeholders stand for the actual first and second argument values): <code>perl my.pl $ARGV[0] $ARGV[1]</code></p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">if</span> ($#ARGV < <span class="number">0</span>){</span><br><span class="line"><span class="keyword">die</span> <span class="string">"please provide a command line argument\n"</span>;</span><br><span class="line">} <span class="comment"># check that at least one command-line argument was given</span></span><br><span class="line"><span class="keyword">my</span> $seq = $ARGV[<span class="number">0</span>]; <span class="comment"># $ARGV[0] is the first argument</span></span><br><span class="line"><span class="keyword">my</span> $shorter_seq = $ARGV[<span class="number">1</span>]; <span class="comment"># $ARGV[1] is the second argument</span></span><br></pre></td></tr></table></figure><p>Read from standard input with STDIN</p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">print</span> <span class="string">"Enter a motif to search for: "</span>;</span><br><span class="line">$motif = <STDIN>;</span><br></pre></td></tr></table></figure><h3 id="其他函数"><a href="#其他函数" class="headerlink" title="其他函数"></a>Other functions</h3><p>1. Perl's chomp function removes the trailing newline from a line.</p><p>2. Using eval: if the block contains a syntax error or a runtime error, or a "die" statement is executed, "eval" 
returns undef in scalar context or an empty list in list context. The variable $@ holds the error message.</p><p>3. Using Perl's qr operator: it creates a regular expression. It quotes its STRING as a regular expression. STRING is interpolated in the same way as PATTERN in m/PATTERN/, and the operator returns a Perl value that can be used in place of the corresponding /STRING/ expression.</p><figure class="highlight perl"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">chomp</span>($motif);</span><br><span class="line"><span class="keyword">eval</span>{ <span class="regexp">qr/$motif/</span>};</span><br><span class="line"><span class="keyword">if</span> ($@){</span><br><span class="line"><span class="keyword">print</span> <span class="string">"motif $motif is an illegal regular expression!\n"</span>;</span><br><span class="line">}</span><br></pre></td></tr></table></figure><h3 id="参考连接"><a href="#参考连接" class="headerlink" title="参考连接"></a>References</h3><p><a href="https://www.perlforbiologists.org/#">Perl for Biologists</a></p><p><a href="https://blog.csdn.net/u014703817/article/details/32702957">perl函数说明(eval)_易水寒江的博客-CSDN博客_eval perl</a></p><p><a href="https://www.cnblogs.com/zhaoyangjian724/p/6199997.html">perl eval - czcb - 博客园 (cnblogs.com)</a></p><p><a href="http://www.manongjc.com/detail/31-asywzopegrdjwrh.html">Perl qr实例讲解 - 码农教程 (manongjc.com)</a></p><p><a href="https://www.learnfk.com/perl/perl-pos.html">Perl pos函数 - 基础教程 - 无涯教程网 (learnfk.com)</a></p><p><a href="https://blog.csdn.net/gsjthxy/article/details/89003539">[Perl]Perl贪婪匹配、非贪婪匹配、占有优先匹配的區別和應用_元直数字电路验证的博客-CSDN博客_perl 贪婪匹配</a></p><p><a href="https://cdn.modb.pro/db/538048">perl–用die处理致命错误&用warn送出警告信息&自动检测致命错误 - 墨天轮 (modb.pro)</a></p><p><a href="https://blog.csdn.net/cumao2792/article/details/108574913">如何判断Perl中是否存在文件_cumao2792的博客-CSDN博客</a></p><p><a href="https://blog.csdn.net/u012299594/article/details/81914265">Perl文件目录操作_Hello Hunk的博客-CSDN博客_perl怎么进入文件夹</a></p><link rel="stylesheet" 
href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
<tags>
<tag> perl </tag>
</tags>
</entry>
<entry>
<title>Extracting the lines that differ between two files</title>
<link href="/2022/09/24/3%E7%A7%8Dlinux%E4%B8%8B%E6%8F%90%E5%8F%96%E4%B8%A4%E4%B8%AA%E6%96%87%E4%BB%B6%E7%9A%84%E4%B8%8D%E5%90%8C%E8%A1%8C%E5%86%85%E5%AE%B9%E7%9A%84%E6%96%B9%E6%B3%95/"/>
<url>/2022/09/24/3%E7%A7%8Dlinux%E4%B8%8B%E6%8F%90%E5%8F%96%E4%B8%A4%E4%B8%AA%E6%96%87%E4%BB%B6%E7%9A%84%E4%B8%8D%E5%90%8C%E8%A1%8C%E5%86%85%E5%AE%B9%E7%9A%84%E6%96%B9%E6%B3%95/</url>
<content type="html"><![CDATA[<p>Scenario: prank.sh.completed is a subset of prank.sh:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">cat prank.sh prank.sh.completed | sort | uniq -d >temp.txt</span><br><span class="line"></span><br><span class="line">cat prank.sh temp.txt | sort | uniq -u > different.txt</span><br></pre></td></tr></table></figure><span id="more"></span><p>(The rest of this post is an unrelated note that is tagging along.)</p><p>A couple of days ago I ran an analysis with CAFE 4.2, following local and online tutorials.</p><p><em><u><strong>I downloaded all sorts of ready-made scripts from the internet, which was really convenient.</strong></u></em></p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">cafe 01cafe.sh</span><br></pre></td></tr></table></figure><p>The command above failed (I forgot to record the error). Cause: the λ categories for the ancestral positions of the tree were missing. Adding them fixed it.</p><p>When extracting the IDs of the expanded and contracted gene families, the cafetutorial_report_analysis.py script failed (I forgot to record the error). Cause: the cafecore module was missing. Fix: search for cafecore.py on GitHub, download it, and put it in the same directory as the Python script.</p><p>Running it again produced the following error</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">File "./cafecore.py", line 8</span><br><span class="line"><!DOCTYPE html></span><br><span class="line">^SyntaxError: invalid syntax </span><br></pre></td></tr></table></figure><p>Thanks to Stack Overflow user user2357112, who pointed out that the cause was that the cafecore.py script had not actually been downloaded</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">“You're not downloading the script. 
You're downloading a GitHub web page with the script and a whole bunch of other stuff on it, like GitHub navigation and a search bar and clickable line numbers.”</span><br></pre></td></tr></table></figure><p>So I downloaded the actual Python cafecore.py script rather than the HTML page. After that, the script ran successfully on its own, since it printed its usage information (shown below):</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br></pre></td><td class="code"><pre><span class="line">python cafetutorial_report_analysis.py</span><br><span class="line"></span><br><span class="line">|**Error 1: -i must be defined |</span><br><span class="line"></span><br><span class="line">usage: cafetutorial_report_analysis.py [-h] [-i INPUT_FILE]</span><br><span class="line"> [-e USER_ERR_START] [-d USER_TMP_DIR]</span><br><span class="line"> [-f 
FIRST_RUN] [-c CURVE_OPTION]</span><br><span class="line"> [-t ERROR_TRIES] [-l USER_LOG_FILE]</span><br><span class="line"> [-o OUTPUT_FILE] [-s IND_MIN]</span><br><span class="line"> [-v VERBOSE]</span><br><span class="line"></span><br><span class="line">optional arguments:</span><br><span class="line"> -h, --help show this help message and exit</span><br><span class="line"> -i INPUT_FILE A CAFE shell script with the full CAFE path in the</span><br><span class="line"> shebang line, the load, tree, and lambda commands. These</span><br><span class="line"> lines will be read and incorporated into the caferror</span><br><span class="line"> shell script.</span><br><span class="line"> -e USER_ERR_START The starting point for the grid search. Should be between</span><br><span class="line"> 0 and 1. Default: 0.4</span><br><span class="line"> -d USER_TMP_DIR A directory in which all caferror files will be stored.</span><br><span class="line"> If none is specified, it will default to caferror_X, with</span><br><span class="line"> X being some integer one higher than the last directory.</span><br><span class="line"> -f FIRST_RUN Boolean option to perform a pre-error model run (1) or</span><br><span class="line"> not (0). Default: 0</span><br><span class="line"> -c CURVE_OPTION Boolean option. caferror can either perform the grid</span><br><span class="line"> search (0) or search a pre-specified space (1). Default:</span><br><span class="line"> 0</span><br><span class="line"> -t ERROR_TRIES A list of error values to search over. Note: -c MUST be</span><br><span class="line"> set to 1 to use these values. Enter as a comma delimited</span><br><span class="line"> string, ie -t 0.1,0.2,0.3</span><br><span class="line"> -l USER_LOG_FILE Specify the name for caferror's log file here. Default:</span><br><span class="line"> caferrorLog.txt</span><br><span class="line"> -o OUTPUT_FILE Output file which stores only the error model and score</span><br><span class="line"> for each run. 
Default: caferror_default_output.txt</span><br><span class="line"> -s IND_MIN Boolean option to specify whether to perform only the</span><br><span class="line"> global error search (0) or continue with individual</span><br><span class="line"> species minimizations (1). Default: 0</span><br><span class="line"> -v VERBOSE Boolean option to have detailed information for each CAFE</span><br><span class="line"> run printed to the screen (1) or not (0). Default: 1</span><br></pre></td></tr></table></figure><p>当加上输入文件以及参数时,又有了新的报错:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">python python_scripts/cafetutorial_report_analysis.py -i reports/report_run1.cafe -o reports/summary_run1 -r 0</span><br></pre></td></tr></table></figure><p>报错内容:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"> File "cafetutorial_report_analysis.py", line 8, in <module></span><br><span class="line"> import sys, os, argparse, cafecore as cafecore</span><br><span class="line"> File "./cafecore.py", line 420, in <module></span><br><span class="line"> treestring = Tree[Tree.index("("):];</span><br><span class="line">NameError: name 'Tree' is not defined</span><br></pre></td></tr></table></figure><p>查看输入文件的格式与脚本内容,苦思冥想之后将脚本中匹配tree改为匹配Tree,因为输入文件中是T大写,重新运行上述脚本,意料之中,又有了新的报错(如下):</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">File "cafetutorial_report_analysis.py", line 8, in <module></span><br><span class="line"> import sys, os, argparse, cafecore as cafecore</span><br><span 
class="line"> File "./cafecore.py", line 491, in <module></span><br><span class="line"> printWrite(caferrorLog, 1, "# CAFE path set as:", CafePath, pad);</span><br></pre></td></tr></table></figure><p>咨询了大佬,可能是没有全路径的问题,加上全路径后,还是报错,瞬时全身乏力,通关无望。</p><p>于是找到之前做过案例的文件,查看相关文件是否有差异。将本地所有python脚本上传并运行,直接运行成功。</p><p><em><strong><u>最终发现,有时候网上的脚本真坑人。</u></strong></em></p><p>没午休,回想这些报错,头又大了!</p><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
<tags>
<tag> shell notes </tag>
</tags>
</entry>
<entry>
<title>Drawing a simple synteny plot with TBtools</title>
<link href="/2022/07/14/circos/"/>
<url>/2022/07/14/circos/</url>
<content type="html"><![CDATA[<p>学习目标:长颈鹿基因组文章中的基因组共线性关系图。</p><img src="/2022/07/14/circos/qiwang.jpg" class title="This is an example image"><h3 id="首先测试一下TBtools中在Graphics下面Advanced-Circos工具。"><a href="#首先测试一下TBtools中在Graphics下面Advanced-Circos工具。" class="headerlink" title="首先测试一下TBtools中在Graphics下面Advanced Circos工具。"></a><strong>首先测试一下TBtools中在Graphics下面Advanced Circos工具。</strong></h3><p>该工具下的三个输入框分别输入三个文件,文件中每列以制表符分隔开来。</p><img src="/2022/07/14/circos/qiwang.jpg" class title="qiwang"><p>①染色体的名字及其长度,共两列;</p><p>②染色体上所有元件特征的位置信息,分别为染色体名称、序列元件名称、序列原件的起始位置和序列原件的终止位置,共四列;</p><p>③同一元件在不同染色体的位置信息,分别为染色体1的名称、元件在染色体1的起始位置、染色体1的终止位置、染色体2的名称、元件在染色体2的起始位置、染色体2的终止位置。</p><p>为了方便测试,简单编辑了三个文件,格式如下</p><img src="/2022/07/14/circos/circos1.jpg" class title="circos1"><p>将这三个文件分别拖入输入框后,点击绘图,得到以下结果,表明没有软件报错,运行环境正常,如果用基因组大数据来绘图时,若文件格式正确,应该不会出问题。</p><img src="/2022/07/14/circos/circos2.jpg" class title="circos2"><h3 id="接下来是如何在公共数据库中获取基因组数据,以及如何进行数据转换,得到circos工具需要的格式。"><a href="#接下来是如何在公共数据库中获取基因组数据,以及如何进行数据转换,得到circos工具需要的格式。" class="headerlink" title="接下来是如何在公共数据库中获取基因组数据,以及如何进行数据转换,得到circos工具需要的格式。"></a><strong>接下来是如何在公共数据库中获取基因组数据,以及如何进行数据转换,得到circos工具需要的格式。</strong></h3><p>以多孔鹿角珊瑚(Acropora millepora)基因组为例,该物种基因组是目前珊瑚虫纲为数不多的组装为染色体水平的基因组,包括已装配的14个染色体和为装配的部分。</p><p>NCBI下载GFF(general feature format)文件后,发现14个染色体的长度在以“##sequence-region NC_”开头的行中,所以可以利用此特征进行grep得到染色体长度信息,这样就可以轻松准备好第一个文件。</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">grep "##sequence-region NC_" GCF_013753865.1_Amil_v2.1_genomic.gff |awk '{print $2,$4}' OFS="\t" > coral-chrlen.txt</span><br></pre></td></tr></table></figure><p>第二个文件需要染色体上所有元件特征的位置信息,基因组上的元件一般包括gene、exon、mRNA等,一般取gene进行可视化,那么需要将GFF文件中装配在染色体上的并且第三列为gene的行提取出来。也就是提出”NC_xxxx.x Gnomon gene“格式的行,因此可以得到染色体上所有基因的所有信息。</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span 
class="line">grep "NC_" GCF_013753865.1_Amil_v2.1_genomic.gff |grep $'\t'Gnomon$'\t'gene$'\t' > chr-gene.txt</span><br></pre></td></tr></table></figure><img src="/2022/07/14/circos/circos3.jpg" class title="circos3"><p>根据文件特征,利用awk和grep提取第二个文件的信息。</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">awk -F'[\t;]' -v OFS="\t" '{print $1,$13,$4,$5}' chr-gene.txt | sed 's/gene=//g' > coral-genome-feature-list.txt</span><br></pre></td></tr></table></figure><p>此处得到的文件中,存在极少部分的行,其第二列并不是基因名,这是GFF文件中这一行与其他行的格式不一样导致的,在本测试中有97行出现错误,即基因名字为”gbkey=Gene“,考虑到这些基因不足总基因数的0.5%,选择直接去除这些行。</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">sed -i '/gbkey=Gene/d' coral-genome-feature-list.txt #删除带有gbkey=Gene的行</span><br></pre></td></tr></table></figure><p>下面制作第三个文件</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line">awk '{print $2}' coral-genome-feature-list.txt > all-genes-list.txt #将所有基因提取出来</span><br><span class="line"></span><br><span class="line">for i in `cat all-genes-list.txt`;do num=`grep $i coral-genome-feature-list.txt|wc -l`;if [ $num -ne 1 ];then echo $i;fi;done > linked-gene.txt #将具有连接关系的基因打印出来,由于本次测试是一个基因组,没有相同的基因位于两个不同的染色体上。因此为了展示,手动修改一个基因,即将第二个染色体的第一个基因修改为第一个染色体第一个基因的名字。重新运行此步骤</span><br><span class="line"></span><br><span class="line">sort linked-gene.txt |uniq > linked-gene-final.txt #去重复,得到想要展示的基因名称列表</span><br><span class="line"></span><br><span class="line">for i in `cat linked-gene-final.txt`;do grep $i 
coral-genome-feature-list.txt;done |sed 'N;s/\n/ \t/' > coral-linked-info.txt #得到第三个文件的雏形</span><br><span class="line"></span><br><span class="line">for i in `cat linked-gene-final.txt`;do sed -i -e "s/\t$i//g" -e "s/\s\t/\t/g" coral-linked-info.txt;done #得到最终的第三个输入文件</span><br></pre></td></tr></table></figure><h3 id="最后将得到的三个文件导入到TBtools软件中,绘图。"><a href="#最后将得到的三个文件导入到TBtools软件中,绘图。" class="headerlink" title="最后将得到的三个文件导入到TBtools软件中,绘图。"></a><strong>最后将得到的三个文件导入到TBtools软件中,绘图。</strong></h3><p>由于基因太多,基因名展示在图上重叠在一起一片漆黑,调整图片位置都卡好久,但是整体上看是符合预期的。因此,此方法适合选择部分基因来展示,不适合所有基因都进行共线性展示的情况。</p><img src="/2022/07/14/circos/circos4.jpg" class title="circos4"><h3 id="为了方便展示,并美化图,将genome-feature-list文件简化,得到以下。"><a href="#为了方便展示,并美化图,将genome-feature-list文件简化,得到以下。" class="headerlink" title="为了方便展示,并美化图,将genome-feature-list文件简化,得到以下。"></a><strong>为了方便展示,并美化图,将genome-feature-list文件简化,得到以下。</strong></h3><img src="/2022/07/14/circos/circos5.jpg" class title="circos5"><p>进一步调整颜色得到最终的circos图,根据自己的需要调整图片各元素的颜色和位置。</p><img src="/2022/07/14/circos/circos6.jpg" class title="circos6"><h3 id="为什么做circos图?"><a href="#为什么做circos图?" 
class="headerlink" title="为什么做circos图?"></a><strong>为什么做circos图?</strong></h3><p>展示基因组各种元件的信息,比如新测基因组与已测基因组的共线性情况;</p><p>表示基因的复制情况等;</p><p>……</p><p>虽然TBtools不如circos软件的命令行模式灵活,但该工具为非编程用户提供了极大的便利,值得学习。</p><p>若绘制以下复杂的基因组信息图,建议使用circos命令行模式的软件。</p><img src="/2022/07/14/circos/7.png" class width="7"><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash">计算染色体长度</span></span><br><span class="line"><span class="meta">#</span><span class="bash">生成染色体文件 7列</span></span><br><span class="line"><span class="meta">#</span><span class="bash">生成窗口文件, 窗口大小50Kb</span></span><br><span class="line"><span class="meta">#</span><span class="bash">计算每个窗口平均GC含量</span></span><br><span class="line"><span class="meta">#</span><span class="bash">计算每个窗口基因条数</span></span><br><span class="line"><span class="meta">#</span><span class="bash">计算每个窗口重复序列含量</span></span><br><span class="line"><span class="meta">#</span><span class="bash">共线性模块鉴定</span></span><br></pre></td></tr></table></figure><h3 id="傻瓜式利用TBtools绘制两基因组的共线性关系图"><a href="#傻瓜式利用TBtools绘制两基因组的共线性关系图" class="headerlink" title="傻瓜式利用TBtools绘制两基因组的共线性关系图"></a><strong>傻瓜式利用TBtools绘制两基因组的共线性关系图</strong></h3><p>要想制作推文最初的C图,可以使用Graphics中的Comparative genomecs下的两个工具,即One Step MCScanX和Dual Systeny Plot for MCScanX,前者用于处理绘图所需的正确格式的文件,后者用于两个基因组关系的绘图。</p><p>只需要分别准备两个物种的基因组文件和GFF文件即可,这四个文件拖入第一个工具,运行耗时比较久(半小时左右?),这样就得到第二个工具的输入文件,然后拖入点击秒出图。</p><img src="/2022/07/14/circos/2-1.jpg" class title="2-1"><img src="/2022/07/14/circos/2-2.jpg" class title="2-2"><p>此工具不用任何思考,只需点点点就能得到这个。</p><img src="/2022/07/14/circos/2-3.jpg" class 
title="2-3"><p>注:</p><p>①最好使用两个染色体水平的基因组作图,上面的基因组仅组装到contig水平,所以看不出什么进化事件;</p><p>②可以自行编辑第一步骤得到的ctl文件,通过保留想要展示的片段,可将其他片段的名字直接删除;</p><p>③颜色可以编辑。</p><p>④NGenomeSyn是专门绘制基因组共线性的命令行软件,功能更加详细,更加灵活。</p><h3 id="参考:"><a href="#参考:" class="headerlink" title="参考:"></a><strong>参考:</strong></h3><p><a href="https://www.jianshu.com/p/9e4ad7d4881b">基因组Circos图绘制 - 简书 (jianshu.com)</a></p><p><a href="https://wenku.baidu.com/view/9564df5701768e9951e79b89680203d8ce2f6adc.html">TBtools绘制Circos图小攻略 - 百度文库 (baidu.com)</a></p><p><a href="http://events.jianshu.io/p/2334a6346941">用TBtools,快速高效实现基因组共线性分析与可视化, 赞! - 简书 (jianshu.io)</a></p><p><a href="https://www.jianshu.com/p/45bece9a0518?u_atoken=e36c9fae-6516-444d-8da6-020253c10ad5&u_asession=01RZQiEOOy5y2FZHSLGFuNm6o66TPOTCURqsOSgM91FwilwBRdfoa-E7OHE3in8YfdX0KNBwm7Lovlpxjd_P_q4JsKWYrT3W_NKPr8w6oU7K8awO_gvLOfVsK91fZVeAlxyuYfe7vWV-zsHJifFo5DumBkFo3NEHBv0PZUm6pbxQU&u_asig=05MnJ8y49Xl-DNbLqKN3ifjC9q2vGIV6OHqf9Su8nn0C3Xb8xiMVw20TZhbcICcUTvoA-bFWqXeIGcpO0swmce1Vv092BNyGYTt6Kc0QohIJhJFDhJgu5XRQUdG9FMAJ6MJ0Sq6IKMJ_1nFNQSQh4W6SRnMn4xAY1-5tBawniJyQz9JS7q8ZD7Xtz2Ly-b0kmuyAKRFSVJkkdwVUnyHAIJzUw15I7yC7kCIaO9J9SWLL5voCuZL7lUVMjY79jYx8u2fqft3yiexPr1Pj5ASov3mu3h9VXwMyh6PgyDIVSG1W_3XTunNP28J065ybLQMiTTIRox2heX479OC2z3E-OFfsT7f3EhRgdlsw_jrTQ1f-hlskxi08aOPMtTDIEjMXOAmWspDxyAEEo4kbsryBKb9Q&u_aref=MGJwi+xsSXP/s2pFuWdmzWWTGmU=">如何高效而且优雅地比较多物种的不同基因组区域? - 简书 (jianshu.com)</a></p><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
<tags>
<tag> Methods </tag>
</tags>
</entry>
<entry>
<title>Word clouds</title>
<link href="/2022/03/10/wordcloud/"/>
<url>/2022/03/10/wordcloud/</url>
<content type="html"><![CDATA[<p>利用wordcloud2包制作词云图,对数据的词频进行可视化。</p><span id="more"></span><p>准备excel表格,包括关键词和频数。例如:</p><table><thead><tr><th align="center">word</th><th align="center">freq</th></tr></thead><tbody><tr><td align="center">分子遗传学</td><td align="center">9</td></tr><tr><td align="center">分类学</td><td align="center">9</td></tr><tr><td align="center">生物地理学</td><td align="center">8</td></tr><tr><td align="center">古生物学</td><td align="center">6</td></tr><tr><td align="center">生态学</td><td align="center">10</td></tr><tr><td align="center">分子系统学</td><td align="center">8</td></tr><tr><td align="center">水生动物学</td><td align="center">10</td></tr><tr><td align="center">微生物</td><td align="center">7</td></tr><tr><td align="center">基因组学</td><td align="center">8</td></tr><tr><td align="center">地质学</td><td align="center">5</td></tr><tr><td align="center">气候学</td><td align="center">6</td></tr><tr><td align="center">环境海洋学</td><td align="center">7</td></tr><tr><td align="center">进化生物学</td><td align="center">9</td></tr></tbody></table><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">setwd(<span class="string">"D:/Desktop/词云图"</span>) <span class="comment">#设置工作路径</span></span><br><span class="line">dir()</span><br><span class="line">install.packages(<span class="string">"wordcloud2"</span>) <span class="comment">#安装wordcloud2包</span></span><br><span class="line">library(wordcloud2)</span><br><span class="line">install.packages(<span class="string">"openxlsx"</span>)</span><br><span class="line">library(openxlsx)</span><br><span class="line">wordmap<-read.xlsx(<span class="string">"wordcloud.xlsx"</span>)</span><br><span class="line">wordcloud2(wordmap,size=<span 
class="number">0.3</span>,shape=<span class="string">'cardioid'</span>,color=<span class="string">"random-light"</span>) <span class="comment">#可视化,关于更多参数,可由help("wordcloud2")命令查看</span></span><br></pre></td></tr></table></figure><p>可得到以下:</p><img src="/2022/03/10/wordcloud/0310-1.png" class title="0310-1"><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
<tags>
<tag> Visualization </tag>
</tags>
</entry>
<entry>
<title>A learning framework for molecular evolutionary biology</title>
<link href="/2022/03/08/need-learn/"/>
<url>/2022/03/08/need-learn/</url>
<content type="html"><![CDATA[<img src="/2022/03/08/need-learn/learn_141948.png" class title="learn_141948"><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
<tags>
<tag> Learning framework </tag>
</tags>
</entry>
<entry>
<title>Gene family identification and analysis</title>
<link href="/2022/01/18/%E5%9F%BA%E5%9B%A0%E5%AE%B6%E6%97%8F%E9%89%B4%E5%AE%9A%E5%8F%8A%E5%88%86%E6%9E%90/"/>
<url>/2022/01/18/%E5%9F%BA%E5%9B%A0%E5%AE%B6%E6%97%8F%E9%89%B4%E5%AE%9A%E5%8F%8A%E5%88%86%E6%9E%90/</url>
<content type="html"><![CDATA[<p><strong>Identifying and analyzing gene families in a single species</strong></p><span id="more"></span><div class="row"> <embed src="./gene.pdf" width="100%" height="550" type="application/pdf"></div><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
<tags>
<tag> Methods </tag>
</tags>
</entry>
<entry>
<title>Error when installing samtools with conda</title>
<link href="/2022/01/17/conda%E5%AE%89%E8%A3%85samtools%E6%97%B6%E6%8A%A5%E9%94%99/"/>
<url>/2022/01/17/conda%E5%AE%89%E8%A3%85samtools%E6%97%B6%E6%8A%A5%E9%94%99/</url>
<content type="html"><![CDATA[<p><strong>If samtools installed with conda fails with the following error:</strong> </p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file or directory</span><br></pre></td></tr></table></figure><p><strong>the fixes below may help.</strong></p><span id="more"></span><p>(1) In the samtools installation under the miniconda directory, go into the lib folder, find the libcrypto.so.* file, and create a symbolic link to it named libcrypto.so.1.0.0. See <a href="https://blog.csdn.net/ET_April/article/details/111405941">解决samtools: error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared object file_ET_April的博客-CSDN博客</a> and <a href="https://blog.csdn.net/weixin_43960055/article/details/114992790">samtools: error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared ……的解决方法_wyh0908的博客-CSDN博客</a></p><p>(2) Some users report that samtools is already past version 1.9 while conda still installs version 1.7, and recommend forcing an install of 1.9: <code>conda install -c bioconda samtools=1.9 --force-reinstall</code><br>See <a href="https://www.cnblogs.com/jessepeng/p/14766638.html">【samtools】运行报错: error while loading shared libraries:libcrypto.so.1.0.0或libncurses.so.5或libtinfow.so.5 - 小xuo生 - 博客园 (cnblogs.com)</a></p><p>(3) Neither of the two approaches above fixed the error for me. Further searching suggested that the likely cause is a conflict between the dependencies of samtools and the current environment. I therefore created a fresh environment and installed samtools there, and that copy runs normally. This does not address the root cause, but samtools is at least usable.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">conda create -n samtools</span><br><span class="line">conda activate samtools</span><br><span class="line">conda install samtools</span><br><span class="line">samtools</span><br></pre></td></tr></table></figure><img src="/2022/01/17/conda%E5%AE%89%E8%A3%85samtools%E6%97%B6%E6%8A%A5%E9%94%99/202.png" class width="202"><link rel="stylesheet" 
href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
<tags>
<tag> Troubleshooting </tag>
</tags>
</entry>
<entry>
<title>Uploading files to the China National Center for Bioinformation database</title>
<link href="/2022/01/14/%E5%88%A9%E7%94%A8ftp%E4%B8%8A%E4%BC%A0%E6%96%87%E4%BB%B6/"/>
<url>/2022/01/14/%E5%88%A9%E7%94%A8ftp%E4%B8%8A%E4%BC%A0%E6%96%87%E4%BB%B6/</url>
<content type="html"><![CDATA[<p><strong>Transferring large files to a remote server (the FTP server) over FTP, using either FileZilla or a shell terminal as the FTP client</strong></p><p><strong>Uploading files with the ascp tool on Linux</strong></p><span id="more"></span><h3 id="1、通过Filezilla软件直接拖拽"><a href="#1、通过Filezilla软件直接拖拽" class="headerlink" title="1、通过Filezilla软件直接拖拽"></a><strong>1. Drag and drop in FileZilla</strong></h3><p><strong>I uploaded from a shell terminal, so I have not worked through every detail in FileZilla, for example how to set binary mode. If you take this route, look up the relevant guides.</strong></p><p><strong>Log in: enter the host address, username, and password, then click Quickconnect</strong></p><img src="/2022/01/14/%E5%88%A9%E7%94%A8ftp%E4%B8%8A%E4%BC%A0%E6%96%87%E4%BB%B6/1.png" class width="1"><p><strong>Drag files from the local pane on the left to the remote FTP server pane on the right, or right-click a file and choose Upload</strong></p><img src="/2022/01/14/%E5%88%A9%E7%94%A8ftp%E4%B8%8A%E4%BC%A0%E6%96%87%E4%BB%B6/2.png" class width="2"><h3 id="2、通过Shell终端传输(从下载ftp工具到传输文件)"><a href="#2、通过Shell终端传输(从下载ftp工具到传输文件)" class="headerlink" title="2、通过Shell终端传输(从下载ftp工具到传输文件)"></a>2. Transfer from a shell terminal (from installing the ftp client to transferring files)</h3><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line">yum -y install ftp #if running ftp reports that it is not installed, install it with this command</span><br><span class="line"></span><br><span class="line">ftp SERVER_ADDRESS #connect to the remote server</span><br><span class="line">Name: USERNAME</span><br><span class="line">Password: PASSWORD #login succeeds</span><br><span class="line"><span class="meta"></span></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment">#Example: uploading sequencing data###</span></span></span><br><span class="line"><span class="meta">ftp></span><span class="bash"> <span class="built_in">cd</span> GSA <span 
class="comment">#as in a shell, use cd to change directory</span></span></span><br><span class="line">250 Directory changed to /GSA </span><br><span class="line"><span class="meta">ftp></span><span class="bash"> binary <span class="comment">#switch to binary transfer mode</span></span></span><br><span class="line">200 Command TYPE okay.</span><br><span class="line"><span class="meta">ftp></span><span class="bash"> prompt <span class="comment">#toggle interactive prompting (turned off here, so mput does not ask for confirmation per file)</span></span> </span><br><span class="line">Interactive mode off.</span><br><span class="line"><span class="meta">ftp></span><span class="bash"> mput * <span class="comment">#upload every file in the current local directory with mput; to download, use ftp> get filename</span></span></span><br></pre></td></tr></table></figure><p><strong>Possible improvement: ① upload the files and subdirectories of a directory recursively</strong></p><p><strong>When the upload finishes, the following message is shown</strong></p><img src="/2022/01/14/%E5%88%A9%E7%94%A8ftp%E4%B8%8A%E4%BC%A0%E6%96%87%E4%BB%B6/3.png" class width="3"><h3 id="3、利用ascp工具上传文件"><a href="#3、利用ascp工具上传文件" class="headerlink" title="3、利用ascp工具上传文件"></a>3. Uploading files with ascp</h3><p><strong>The files were too large and FTP was too slow, so I switched to the ascp command line</strong></p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">/root/data1/wutianzhen2021/.aspera/connect/bin/ascp -P33001 -i /root/data1/wutianzhen2021/temp/F14HTSECKF0151/aspsub_rsa -QT -l100m -k1 -d /root/data1/wutianzhen2021/temp/F14HTSECKF0151/upload/*/*.gz aspsub@submit.big.ac.cn:uploads/chaisiminendeavor@163.com_2bac8272</span><br></pre></td></tr></table></figure><p><strong>Notes:</strong></p><p><strong>① Install the Aspera Connect plugin (download, extract, run the .sh file)</strong></p><p><strong>② If ascp reports "ascp: Failed to open TCP connection for SSH, exiting.", try the following commands as root to open the port in the firewall</strong></p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">iptables -I INPUT -p tcp --dport 33001 -j ACCEPT</span><br><span class="line">iptables -I OUTPUT -p tcp --dport 33001 -j 
ACCEPT</span><br></pre></td></tr></table></figure><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
<tags>
<tag> Methods </tag>
</tags>
</entry>
<entry>
<title>Advances in transcriptome sequencing research</title>
<link href="/2022/01/05/%E5%85%B3%E4%BA%8E%E8%BD%AC%E5%BD%95%E6%9C%AC/"/>
<url>/2022/01/05/%E5%85%B3%E4%BA%8E%E8%BD%AC%E5%BD%95%E6%9C%AC/</url>
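<content type="html"><![CDATA[<p> <strong>The transcriptome is the complete set of transcripts in a given cell or tissue, including messenger RNA (mRNA), ribosomal RNA (rRNA), transfer RNA (tRNA), and other non-coding RNAs.</strong></p><h4 id="1-转录组学是什么?"><a href="#1-转录组学是什么?" class="headerlink" title="1 转录组学是什么?"></a><strong>1 What is transcriptomics?</strong></h4><p> Transcriptomics studies gene transcription profiles systematically at the level of the whole transcriptome, and aims to reveal the molecular mechanisms behind complex biological pathways and trait regulatory networks.</p><h4 id="2-转录组测序是什么?"><a href="#2-转录组测序是什么?" class="headerlink" title="2 转录组测序是什么?"></a><strong>2 What is transcriptome sequencing?</strong></h4><p> Transcriptome sequencing (RNA-seq) uses high-throughput sequencing to sequence and analyze all or part of the mRNA, small RNA, and non-coding RNA in a cell or tissue.</p><h4 id="3-可以用来解决什么问题?"><a href="#3-可以用来解决什么问题?" class="headerlink" title="3 可以用来解决什么问题?"></a><strong>3 What questions can it answer?</strong></h4><p> ① Detect transcripts that correspond to an existing genome sequence; ② discover and quantify new transcripts; ③ characterize species-specific and spatiotemporal differences in gene transcription; ④ investigate the regulatory mechanisms of non-coding RNA; ⑤ with single-cell transcriptomes, resolve the gene expression profiles or spatial distribution of different cell types.</p><blockquote><p>How should the relationship between a transcript and a gene be understood?</p><p>When studying a gene, first decide which of its transcripts to study.</p><p>A transcript is one of the one or more mature mRNAs, each able to encode a protein, that a gene produces through transcription.</p></blockquote><img src="/2022/01/05/%E5%85%B3%E4%BA%8E%E8%BD%AC%E5%BD%95%E6%9C%AC/zhuanluben.png" class title="zhuanluben"><h4 id="4-转录组测序的一般流程和测序内容?"><a href="#4-转录组测序的一般流程和测序内容?" 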
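class="headerlink" title="4 转录组测序的一般流程和测序内容?"></a><strong>4 What are the general workflow and targets of transcriptome sequencing?</strong></h4><img src="/2022/01/05/%E5%85%B3%E4%BA%8E%E8%BD%AC%E5%BD%95%E6%9C%AC/liuchengtu.png" class title="liuchengtu"><h5 id="4-1-mRNA测序"><a href="#4-1-mRNA测序" class="headerlink" title="4.1 mRNA测序"></a><strong>4.1 mRNA sequencing</strong></h5><p> mRNA has a poly-A tail at its 3’ end. This feature is used to enrich the intron-free mRNA molecules transcribed by a given tissue or cell under given spatiotemporal conditions, which are then reverse-transcribed into cDNA for library construction and sequencing.</p><h5 id="4-2-small-RNA测序"><a href="#4-2-small-RNA测序" class="headerlink" title="4.2 small RNA测序"></a><strong>4.2 Small RNA sequencing</strong></h5><p> Small RNAs are RNA molecules 20-50 nt long, including miRNA, siRNA, snoRNA, and piRNA. They regulate biological processes through several routes, such as taking part in mRNA degradation, repressing translation, and promoting heterochromatin formation and epigenetic modification of DNA. Because small RNAs carry a 5’ phosphate and a 3’ hydroxyl group, sequencing adapters can be ligated to them and a small RNA library selected for sequencing. The biological functions of miRNAs are relatively conserved across species, which makes them the main focus of small RNA sequencing studies.</p><h5 id="4-3-lncRNA测序"><a href="#4-3-lncRNA测序" class="headerlink" title="4.3 lncRNA测序"></a><strong>4.3 lncRNA sequencing</strong></h5><p> Long non-coding RNAs (lncRNAs) are RNA molecules longer than 200 nt that do not encode proteins, and they are often strongly species- and tissue-specific. Some lncRNAs lie in the enhancer regions of genes and carry out enhancer function through their own transcription. lncRNAs regulate in diverse ways and are widespread in animal and plant cells. They can modulate the function of many biomolecules by contributing to chromosome structure and by binding transcription factors, proteins, RNA precursors, and miRNAs. Some lncRNAs carry a poly-A tail, so mRNA sequencing results often include some lncRNA sequences. Current lncRNA studies usually start by identifying differentially expressed lncRNAs, and then predict the regulatory relationship between an lncRNA and key coding genes mainly from their relative positions.</p><h5 id="4-4-circRNA测序"><a href="#4-4-circRNA测序" class="headerlink" title="4.4 circRNA测序"></a><strong>4.4 circRNA sequencing</strong></h5><p> Circular RNAs (circRNAs) have a distinctive, highly stable closed-loop structure that resists degradation by RNases, and they are thought to exert long-lasting transcriptional regulation in vivo. The same stretch of genomic sequence can give rise to several kinds of circRNA, because different splicing combinations of exons and introns allow a circRNA to contain multiple exon or intron sequences. circRNAs act as “sponges” that sequester miRNAs, and so take part in miRNA-mediated regulation of mRNA.</p><h5 id="4-5-全转录测序-Whole-transcriptome-sequencing"><a href="#4-5-全转录测序-Whole-transcriptome-sequencing" class="headerlink" title="4.5 全转录测序 Whole transcriptome sequencing"></a><strong>4.5 Whole transcriptome sequencing</strong></h5><p> Whole transcriptome sequencing can determine all complete transcripts in a sample, mainly mRNA and non-coding RNA (lncRNA, circRNA, and miRNA). It differs from conventional RNA-seq mainly in library construction: either 2 libraries (an mRNA+lncRNA+circRNA library and a miRNA library) or 3 libraries (an mRNA+lncRNA library, a circRNA library, and a miRNA library) are built. Whole transcriptome data provide expression profiles for every transcript type. On that basis, the different RNA molecules can be identified and annotated, their encoded proteins and regulatory functions analyzed, and the interaction and regulatory networks among them examined, giving a comprehensive, systematic view of the biology of a given cell at a given time and place.</p><h5 id="4-6-单细胞转录组测序-scRNA-seq"><a href="#4-6-单细胞转录组测序-scRNA-seq" class="headerlink" title="4.6 单细胞转录组测序 scRNA-seq"></a><strong>4.6 Single-cell transcriptome sequencing (scRNA-seq)</strong></h5><p> Single-cell transcriptome sequencing studies the whole transcriptome at the level of individual cells. It is used to assess differences in gene expression between single cells, avoids the false negatives introduced by mixed cell types, and may reveal rare cell populations that cannot be detected in bulk samples.</p><p> Isolating single cells is the key step in scRNA-seq. It is mainly done by serial dilution, micromanipulation, fluorescence-activated cell sorting (FACS), and microfluidic separation.</p><h4 id="5-构建文库的策略"><a href="#5-构建文库的策略" class="headerlink" title="5 构建文库的策略"></a>5 Library construction strategies</h4><h5 id="5-1-非链特异性文库(Non-strand-specific-library)"><a href="#5-1-非链特异性文库(Non-strand-specific-library)" class="headerlink" title="5.1 非链特异性文库(Non-strand-specific library)"></a><strong>5.1 Non-strand-specific library</strong></h5><p> RNA is reverse-transcribed into double-stranded cDNA and adapters are added at random, so the library keeps no information about which strand the RNA came from. Sequencing is performed on the double-stranded cDNA, and the direction of mRNA transcription cannot be determined.</p><img src="/2022/01/05/%E5%85%B3%E4%BA%8E%E8%BD%AC%E5%BD%95%E6%9C%AC/A.png" class title="A"><h5 id="5-2-链特异性文库(Strand-specific-library)"><a href="#5-2-链特异性文库(Strand-specific-library)" class="headerlink" title="5.2 链特异性文库(Strand-specific library)"></a><strong>5.2 Strand-specific library</strong></h5><p> One strand is marked by chemical modification, for example by treating the RNA with bisulfite, or by incorporating dUTP during second-strand cDNA synthesis and then degrading the U-containing strand;</p><img src="/2022/01/05/%E5%85%B3%E4%BA%8E%E8%BD%AC%E5%BD%95%E6%9C%AC/BC.png" class title="BC"><p> Different adapters are attached to the 5’ and 3’ ends of the RNA molecule or of the synthesized cDNA strand, to tell the sense and antisense strands apart;</p><img src="/2022/01/05/%E5%85%B3%E4%BA%8E%E8%BD%AC%E5%BD%95%E6%9C%AC/D.png" class title="D"><img src="/2022/01/05/%E5%85%B3%E4%BA%8E%E8%BD%AC%E5%BD%95%E6%9C%AC/EFGH.png" class title="EFGH"><img src="/2022/01/05/%E5%85%B3%E4%BA%8E%E8%BD%AC%E5%BD%95%E6%9C%AC/TUZHU.png" class title="TUZHU"><p> According to published evaluations, library construction methods C and E perform relatively well.</p><h4 id="6-数据处理流程"><a href="#6-数据处理流程" class="headerlink" title="6 数据处理流程"></a>6 Data processing workflow</h4><p> <strong>When the goal is to compare quantitative differences between groups at the gene or transcript level, the basic analysis workflow is as follows:</strong></p><h5 id="6-1-原始数据预处理"><a href="#6-1-原始数据预处理" class="headerlink" title="6.1 原始数据预处理"></a><strong>6.1 Raw data preprocessing</strong></h5><p> 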
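Commonly used quality-control tools include Trimmomatic, RSeQC, FASTX, and Trim Galore. The data obtained after QC are called clean data and are used in the downstream analysis.</p><h5 id="6-2-reads比对"><a href="#6-2-reads比对" class="headerlink" title="6.2 reads比对"></a><strong>6.2 Read alignment</strong></h5><p> Aligners commonly used for transcriptome data include bowtie, bowtie2, STAR, and HISAT/HISAT2. BWA's alignment algorithm is considered insensitive to split alignments, so it is not suited to aligning RNA sequences against genomic sequence that contains introns.</p><h5 id="6-3-转录本组装"><a href="#6-3-转录本组装" class="headerlink" title="6.3 转录本组装"></a><strong>6.3 Transcript assembly</strong></h5><p> Transcript assembly builds transcripts from the sequencing data. For species with a reference genome, the alignment results show how exons are joined, from which the transcript structures are reconstructed. Common tools include Cufflinks and Scripture.</p><p> For transcriptome data without a reference genome, the short reads from RNA sequencing must be assembled de novo to obtain complete transcript sequences. Common tools include Trinity, Trans-ABySS, and Velvet. When assembling mouse transcriptome data with Trinity, reads at 30× coverage or more are needed for a good assembly.</p><h5 id="6-4-转录本预测"><a href="#6-4-转录本预测" class="headerlink" title="6.4 转录本预测"></a><strong>6.4 Transcript prediction</strong></h5><p> Most genes have several splice forms and may produce several transcripts that encode different proteins, so a single gene may have multiple functions.</p><p> For species with a reference genome and reference transcript annotation, transcript structure is determined mainly by aligning the sequenced reads; where the reads cover the entire transcript sequence, the complete transcript is assembled with the help of the genome sequence.</p><p> For species without a reference genome, the transcript sequences of genes must be assembled by the analyst. The resulting gene or transcript sequences can be compared with unigene and EST databases from the same or closely related species to judge their reliability.</p><h5 id="6-5-转录本表达水平分析"><a href="#6-5-转录本表达水平分析" class="headerlink" title="6.5 转录本表达水平分析"></a><strong>6.5 Transcript expression analysis</strong></h5><p> FPKM is the expression measure used in paired-end RNA-seq analysis. Tools such as Cufflinks, DESeq/DESeq2, and edgeR are used for expression quantification and differential expression analysis. Multiple-testing corrections such as FDR are commonly applied to the significance of the comparisons.</p><h5 id="6-6-变异检测"><a href="#6-6-变异检测" class="headerlink" title="6.6 变异检测"></a><strong>6.6 Variant detection</strong></h5><p>This step detects all SNPs, indels, and other variant types in the transcripts. SAMtools, BCFtools, and GATK can be used to call variants in the transcriptome.</p><p>For the detailed workflow, see <a href="https://wu-tz.github.io/2021/11/11/Analysis-of-transcriptome/">Analysis_of_transcriptome | Wutianzhen (wu-tz.github.io)</a></p><h4 id="7-当前转录组热点"><a href="#7-当前转录组热点" class="headerlink" title="7 当前转录组热点"></a><strong>7 Current hot topics in transcriptomics</strong></h4><p>PacBio single-molecule real-time sequencing offers long reads, enables full-length transcriptome studies, and is particularly well suited to discovering new transcripts. With advances in single-cell isolation and single-molecule sequencing, single-cell transcriptome sequencing holds great promise for transcriptome studies of heterogeneous cells.</p><p><strong>Large parts of this post are excerpted from the review 《转录组测序技术的研究和应用进展》 by Cui Kai of the Chinese Academy of Agricultural Sciences.</strong></p><p><strong>崔凯, 吴伟伟, 刁其玉. 转录组测序技术的研究和应用进展. 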
生物技术通报, 2019, 35(7): 1-9</strong></p><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
<tags>
<tag> Transcriptome </tag>
</tags>
</entry>
<entry>
<title>test_pipeline-of-phylogeny</title>
<link href="/2021/12/26/pipeline-of-phylogeny/"/>
<url>/2021/12/26/pipeline-of-phylogeny/</url>
<content type="html"><![CDATA[<p><strong>Using the data from the 2019 MER paper “Transcriptome-based target-enrichment baits for stony corals (Cnidaria: Anthozoa: Scleractinia)” as a test set, this post walks through an exploratory pipeline for building a species divergence-time tree:</strong></p><span id="more"></span><h4 id="1、数据获取"><a href="#1、数据获取" class="headerlink" title="1、数据获取"></a>1. Data acquisition</h4><p>The paper provides 452 orthologous gene sequence files. Each file covers a different number of species, and none has been aligned. Software and scripts needed for the analysis: <strong>Mafft, Trimal, catfasta2phyml.pl, fasta2relaxedPhylip.pl, PartitionFinder, Iqtree, Raxml, astral.5.7.8.jar, Mcmctree</strong>, among others. Most can be installed with <strong>conda</strong>; it is best to create separate <strong>python2</strong> and <strong>python3</strong> environments for running them.</p><h4 id="2、利用mafft软件进行多序列比对;"><a href="#2、利用mafft软件进行多序列比对;" class="headerlink" title="2、利用mafft软件进行多序列比对;"></a>2. Multiple sequence alignment with mafft</h4><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">ls></span><span class="bash">temp.txt;sed -i <span class="string">'s/temp.txt//g'</span> temp.txt;<span class="keyword">for</span> i <span class="keyword">in</span> `cat temp.txt`;<span class="keyword">do</span> <span class="built_in">echo</span> <span class="string">"mafft --maxiterate 1000 --localpair <span class="variable">$i</span> > mafft-<span class="variable">$i</span>"</span> >> multiple_mafft.sh;<span class="keyword">done</span>;rm temp.txt <span class="comment">#write one mafft command per DNA file into multiple_mafft.sh</span></span></span><br><span class="line"></span><br><span class="line">ParaFly -c multiple_mafft.sh -CPU 50 #run the mafft alignments in parallel</span><br></pre></td></tr></table></figure><h4 id="3、利用trimal剪切;"><a href="#3、利用trimal剪切;" class="headerlink" title="3、利用trimal剪切;"></a>3. Trimming with trimal</h4><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">ls mafft-OG00* >temp.txt;sed -i 's/temp.txt//g' temp.txt;for i in `cat temp.txt`;do trimal -in $i -out out-$i -automated1;done 
#剪切,去除非保守区域</span><br></pre></td></tr></table></figure><h4 id="4、过滤;"><a href="#4、过滤;" class="headerlink" title="4、过滤;"></a>4、过滤;</h4><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">ls></span><span class="bash">temp;<span class="keyword">for</span> i <span class="keyword">in</span> `cat temp`;<span class="keyword">do</span> number=`grep -c <span class="string">'>'</span> <span class="variable">$i</span>`;<span class="keyword">if</span> [ <span class="variable">$number</span> -lt 6 ]; <span class="keyword">then</span> mv <span class="variable">$i</span> ./remove_file; <span class="keyword">fi</span>;<span class="keyword">done</span> <span class="comment">#将物种少于6个的文件移到remove_file文件中</span></span></span><br></pre></td></tr></table></figure><h4 id="5、利用串联法建树;"><a href="#5、利用串联法建树;" class="headerlink" title="5、利用串联法建树;"></a>5、利用串联法建树;</h4><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line">perl ../../../scripts/catfasta2phyml.pl -c -f *.fasta > super_sequences.fasta #利用perl脚本将基因进行串联</span><br><span class="line"></span><br><span class="line">perl ../../../scripts/fasta2relaxedPhylip.pl -f super_sequences.fasta -o super_sequences.phylip #利用perl脚本将fasta格式转换为phylip格式</span><br><span class="line"></span><br><span class="line">python PartitionFinder.py task20211220/ #利用PartitionFinder计算分区方案,cfg文件参数为branchlengths=linked;models=GTR,GTR+G,GTR+I+G;model_selection=aicc;search=greedy。此软件所依赖包是python2环境下的,因此要在python2环境下运行此步骤</span><br><span class="line"></span><br><span class="line">iqtree -s super_sequences.fasta -redo -pre 
outtree -p best_scheme_foriqtree.txt -b 1000 -nt AUTO #利用iqtree构树,-p best_scheme_foriqtree.txt指定上述方法中的最佳分区方案,但是在运行时发现14个物种的gap过多,这些数据导致未能通过卡方检验,运行失败,猜测是数据质量问题,于是使用不分区方案直接对串联序列进行构树</span><br><span class="line"></span><br><span class="line">iqtree -s super_sequences.fasta -redo -pre outtree -m MFP -b 1000 -nt AUTO #利用-m MFP参数自动检测最佳替换模型</span><br></pre></td></tr></table></figure><h4 id="6、溯祖法建树"><a href="#6、溯祖法建树" class="headerlink" title="6、溯祖法建树"></a>6、溯祖法建树</h4><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line">ls out-mafft*>temp;for i in `cat temp`;do echo "modeltest-ng -i ../$i -d nt" >>multiple_modeltest.sh;done;rm temp #为每个基因检测最佳碱基替换模型</span><br><span class="line"></span><br><span class="line">ls *.out>temp;for i in `cat temp`;do grep 'raxml-ng --msa' $i |tail -n 1 >> multiple_raxml.sh;done</span><br><span class="line">sed -i -e 's/raxml-ng/raxml-ng --all/g' -e 's/>//' -e 's/$/ --bs-trees 1000/g' multiple_raxml.sh #提取每个基因的最佳替换模型,并整理为批量运行raxml-ng软件的sh文件,其中--bs-trees参数为1000</span><br><span class="line"></span><br><span class="line">sh multiple_raxml.sh #利用raxml-ng建树</span><br><span class="line"></span><br><span class="line">for i in *.bestTree;do cat $i >>in.tree;done #将raxml-ng建树结果整理为astral输入格式</span><br><span class="line"></span><br><span class="line">java -jar ~/softwares/Astral/astral.5.7.8.jar -i in.tree -o out.tre 2>out.log 
#将多个基因树利用astral合并为一棵树。注:此处得到的tree只包含了内部节点的枝长,而不能得到末端枝枝长,某些美化树的软件不能显示(Figtree可以),因此我们也可以通过添加虚拟末端枝长以利用某些软件进行可视化(添加末端枝长的脚本链接为https://github.com/smirarab/global/blob/master/src/mirphyl/utils/add-bl.py)</span><br></pre></td></tr></table></figure><hr><p><strong>方案优化点:</strong><br><strong>①最初使用DNA直接比对后剪切会导致单个基因变不完整(非3整倍数),在检测分区时不能具体到每个密码子中的3个位置(第三位碱基往往比前两位碱基具有更高的突变率),因此,在比对时应采取利用氨基酸比对,再回译为DNA,从而保证基因是3的整倍数,进而得到更为精细的分区方案。</strong><br><strong>②未进行评估序列异质性、饱和程度步骤。</strong></p><hr><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
<tags>
<tag> 方法 </tag>
</tags>
</entry>
<entry>
<title>Batch sequence alignment with MAFFT</title>
<link href="/2021/12/17/paper-data/"/>
<url>/2021/12/17/paper-data/</url>
<content type="html"><![CDATA[<h3 id="利用Mafft软件分别对蛋白和DNA进行批量比对"><a href="#利用Mafft软件分别对蛋白和DNA进行批量比对" class="headerlink" title="利用Mafft软件分别对蛋白和DNA进行批量比对"></a>Batch alignment of protein and DNA sequences with MAFFT</h3><span id="more"></span><p><strong>First install the required software and scripts with conda:</strong></p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">conda install -c bioconda mafft</span><br><span class="line">conda install -c bioconda gblocks</span><br><span class="line">conda install -c bioconda trimal</span><br><span class="line">conda install -c bioconda pal2nal</span><br></pre></td></tr></table></figure>
<h3 id="1、将DNA翻译成氨基酸进行多序列比对,再剪切,最后回译为DNA。"><a href="#1、将DNA翻译成氨基酸进行多序列比对,再剪切,最后回译为DNA。" class="headerlink" title="1、将DNA翻译成氨基酸进行多序列比对,再剪切,最后回译为DNA。"></a>1. Translate the DNA to amino acids, align, trim, and back-translate to DNA</h3><p><strong>Place the sequence files in the current directory ./</strong> </p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">ls></span><span class="bash">temp.txt;sed -i <span class="string">'s/temp.txt//g'</span> temp.txt;<span class="keyword">for</span> i <span class="keyword">in</span> `cat temp.txt`;<span class="keyword">do</span> faTrans <span class="variable">$i</span> aa-<span class="variable">$i</span>;<span class="keyword">done</span>;rm temp.txt;mkdir pepfile;mv aa-* pepfile <span class="comment"># translate the DNA to protein and move the results into the pepfile directory</span></span></span><br></pre></td></tr></table></figure><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">ls></span><span class="bash">temp.txt;sed -i <span class="string">'s/temp.txt//g'</span> temp.txt;<span class="keyword">for</span> i <span class="keyword">in</span> `cat temp.txt`;<span class="keyword">do</span> <span class="built_in">echo</span> <span class="string">"mafft --maxiterate 1000 --localpair <span class="variable">$i</span> > mafft-<span class="variable">$i</span>"</span> >> multiple.sh;<span class="keyword">done</span>;rm temp.txt <span class="comment"># write the batch script of MAFFT commands</span></span></span><br></pre></td></tr></table></figure><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">ParaFly -c multiple.sh -CPU 50 # run MAFFT in parallel</span><br></pre></td></tr></table></figure><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">ls mafft*>temp.txt;for i in `cat temp.txt`;do Gblocks $i -b5=h;done # trim the amino acid alignments</span><br></pre></td></tr></table></figure><p><strong><!--Gblocks is used here because trimAl automatically removes species that consist entirely of gaps, which later makes the back-translation fail with a species mismatch.--></strong></p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">rename .fasta-gb .fasta *-gb # fix the file suffix</span><br></pre></td></tr></table></figure><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">ls *.fasta>temp.txt;for i in `cat temp.txt`;do pal2nal.pl ./pepfile/mafft-aa-$i $i -output fasta > out-$i;done # back-translate the amino acid alignments to DNA</span><br></pre></td></tr></table></figure><p><strong>Problem encountered at this point: pal2nal reports "inconsistency between the following pep and nuc seqs", which leaves some of the back-translated files empty. The results may also contain sequences that consist entirely of gaps; these interfere with downstream phylogenetic reconstruction, so such species should be removed from the alignments.</strong></p><!--**This part was trial and error; pick the steps you need.**--><p><strong>Difference between the two trimming tools:</strong></p><p><strong>Gblocks: keeps species with missing data after trimming;</strong></p><p><strong>trimAl: discards species with missing data after trimming.</strong></p>
<h3 id="2、将DNA直接比对后直接剪切,用于后续分析。"><a href="#2、将DNA直接比对后直接剪切,用于后续分析。" class="headerlink" title="2、将DNA直接比对后直接剪切,用于后续分析。"></a>2. Align the DNA directly, then trim it for downstream analysis</h3><p><strong><!--Note: trimming cuts the sequences into incomplete CDSs, so their length may no longer be a multiple of 3, but this does not affect phylogenetic reconstruction. It therefore avoids the problem described in section 1.--></strong></p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">ls></span><span class="bash">temp.txt;sed -i <span class="string">'s/temp.txt//g'</span> temp.txt;<span class="keyword">for</span> i <span class="keyword">in</span> `cat temp.txt`;<span class="keyword">do</span> <span class="built_in">echo</span> <span class="string">"mafft --maxiterate 1000 --localpair <span class="variable">$i</span> > mafft-<span class="variable">$i</span>"</span> >> multiple_mafft.sh;<span class="keyword">done</span>;rm temp.txt <span class="comment"># write one MAFFT command per DNA file into multiple_mafft.sh</span></span></span><br></pre></td></tr></table></figure><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">ParaFly -c multiple_mafft.sh -CPU 50 # run MAFFT in parallel</span><br></pre></td></tr></table></figure><p><strong>Next, trim the alignments directly with trimAl to remove poorly conserved regions.</strong></p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">ls mafft-OG00* >temp.txt;sed -i 's/temp.txt//g' temp.txt;for i in `cat temp.txt`;do trimal -in $i -out out-$i -automated1;done # trim</span><br></pre></td></tr></table></figure><p><strong><!--The resulting data go straight into the downstream phylogenetic analysis. Their length is not a multiple of 3, so be careful when using them for other analyses such as selection pressure analysis.--></strong></p>
<h3 id="参考:"><a href="#参考:" class="headerlink" title="参考:"></a><strong>Reference:</strong></h3><p><strong>Quek, R. Z., Jain, S. S., Neo, M. L., Rouse, G. W., & Huang, D. (2020). Transcriptome‐based target‐enrichment baits for stony corals (Cnidaria: Anthozoa: Scleractinia). <em>Molecular ecology resources</em>, <em>20</em>(3), 807-818.</strong></p><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
<tags>
<tag> 方法 </tag>
</tags>
</entry>
<entry>
<title>Processing FASTA sequences after converting them to key-value pairs</title>
<link href="/2021/12/08/split-gene/"/>
<url>/2021/12/08/split-gene/</url>
<content type="html"><![CDATA[<p>Purpose: split the sequences in a FASTA file at a given position into a front part and a back part, and save them as two new FASTA files.</p><span id="more"></span><p>Here f is the sequence file to split; f1 and f2 are the files produced by the split; m is the position of the cut; n is the full length of the aligned sequences.</p><p>The key idea is to store each FASTA record in a dictionary, with the species name as the key and the sequence as the value.</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br></pre></td><td class="code"><pre><span class="line">f = <span class="built_in">open</span>(<span class="string">"PRM2-61.fas"</span>)</span><br><span class="line">f1=<span class="built_in">open</span>(<span class="string">"61-qian.fas"</span>,<span class="string">"w"</span>)</span><br><span class="line">f2=<span class="built_in">open</span>(<span class="string">"61-hou.fas"</span>,<span class="string">"w"</span>)</span><br><span class="line">m=<span class="number">144</span></span><br><span class="line">n=<span class="number">306</span></span><br><span class="line">seq = {}</span><br><span class="line"><span class="keyword">for</span> line <span class="keyword">in</span> f:</span><br><span class="line">    <span class="keyword">if</span> line.startswith(<span class="string">'>'</span>):</span><br><span class="line">        name=line.replace(<span class="string">'>'</span>,<span class="string">''</span>).split()[<span class="number">0</span>]</span><br><span class="line">        seq[name]=<span class="string">''</span></span><br><span class="line">    <span class="keyword">else</span>:</span><br><span class="line">        seq[name]+=line.replace(<span class="string">'\n'</span>,<span class="string">''</span>).strip()</span><br><span class="line">f.close()</span><br><span class="line"><span class="comment"># seq now maps each species name (key) to its full sequence (value)</span></span><br><span class="line">species=seq.keys()</span><br><span class="line"><span class="comment"># write the two parts of every sequence to the two output files</span></span><br><span class="line"><span class="keyword">for</span> each_spe <span class="keyword">in</span> species:</span><br><span class="line">    line1=seq[each_spe][<span class="number">0</span>:m]</span><br><span class="line">    line2=seq[each_spe][m:n]</span><br><span class="line">    <span class="comment"># front part: positions 1 to m</span></span><br><span class="line">    f1.write(<span class="string">'>'</span>+each_spe+<span class="string">"\n"</span>+line1+<span class="string">"\n"</span>)</span><br><span class="line">    <span class="comment"># back part: positions m+1 to n</span></span><br><span class="line">    f2.write(<span class="string">'>'</span>+each_spe+<span class="string">"\n"</span>+line2+<span class="string">"\n"</span>)</span><br><span class="line">f1.close()</span><br><span class="line">f2.close()</span><br></pre></td></tr></table></figure><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
<tags>
<tag> 方法 </tag>
</tags>
</entry>
<entry>
<title>website links</title>
<link href="/2021/11/22/Database-link/"/>
<url>/2021/11/22/Database-link/</url>
<content type="html"><![CDATA[<p>26.</p><p>25. Transcriptome database for 14 coral species (also lists 4 other source databases for coral transcriptomes) <a href="https://www.comp.hkbu.edu.hk/~db/CoralTBase/index.php">ScleractiniaTBase (hkbu.edu.hk)</a></p><p>1. US National Center for Biotechnology Information <a href="https://www.ncbi.nlm.nih.gov/">National Center for Biotechnology Information (nih.gov)</a></p><p>2. Tsinghua University bioinformatics tutorial <a href="https://lulab2.gitbook.io/teaching/">Bioinformatics Tutorial - Basic - Bioinformatics Tutorial - Basic (gitbook.io)</a></p><p>3. Red List of Threatened Species <a href="https://www.iucnredlist.org/">IUCN Red List of Threatened Species</a></p><p>4. Online tool for drawing biological figures <a href="https://biorender.com/">BioRender</a></p><span id="more"></span>
<p>5. Online phylogenetic tree editor <a href="https://itol.embl.de/">iTOL: Interactive Tree Of Life (embl.de)</a></p><p>6. Online protein domain prediction; the software can also be downloaded to run batches locally <a href="http://pfam.xfam.org/">Pfam: Home page (xfam.org)</a></p><p>7. Offers search, news, maps and encyclopedia, e-mail, e-commerce, online advertising, and other services. Yandex's share of the Russian market is far larger than Google's <a href="https://yandex.com/">Yandex</a></p><p>8. ADW is an encyclopedia of animal natural history built from contributions by students, photographers, and many others; its ecological data, such as body mass and lifespan, can be used in analyses. <a href="https://animaldiversity.org/">ADW: Home (animaldiversity.org)</a></p><p>9. Official R-Bioconductor site, covering installation, learning, usage, and development updates. <a href="http://www.bioconductor.org/">Bioconductor - Home</a></p><p>10. Tutorial site covering many programming languages <a href="https://www.runoob.com/">菜鸟教程 - 学的不仅是技术,更是梦想! (runoob.com)</a></p>
<p>11. Bioinformatics lecture videos by Xiaole Liu, a leading bioinformatician at Harvard <a href="https://liulab-dfci.github.io/bioinfo-combio/">Introduction to Bioinformatics and Computational Biology (liulab-dfci.github.io)</a></p><p>12. Species phylogenies and divergence times <a href="http://www.timetree.org/">TimeTree :: The Timescale of Life</a></p><p>13. Chinese-language manual for Ziheng Yang's PAML software <a href="https://max.book118.com/html/2017/0323/96483278.shtm">PAML中文的说明.doc (book118.com)</a></p><p>14. Tsinghua University TUNA Association, including its open-source mirror site <a href="https://tuna.moe/">清华大学 TUNA 协会</a></p><p>15. Clustering of operational taxonomic units (OTUs) <a href="https://www.cnblogs.com/djx571/p/9098831.html">OTU(operational taxonomic units),即操作分类单元 - 发那个太丢人 - 博客园 (cnblogs.com)</a></p><p>16. Draws gene structure diagrams; only the position information is required <a href="http://gsds.gao-lab.org/">Gene Structure Display Server 2.0 (gao-lab.org)</a></p><p>17. Example scripts for ancestral state reconstruction with the phytools package <a href="http://www.phytools.org/eqg2015/asr.html">Ancestral state reconstruction & visualizing ancestral states on a phylogeny (phytools.org)</a></p><p>18. Quick lookup of gene function: enter a gene to list its associated functions and references <a href="https://omim.org/">OMIM - Online Mendelian Inheritance in Man</a></p>
<p>19. GitHub account of Kun Wang (Northwestern Polytechnical University) <a href="https://github.com/wk8910">wk8910 (Kun Wang) · GitHub</a></p><p>20. Searchable collection of 4000+ experimental methods with detailed protocols <a href="https://bio-protocol.org/cn/default.aspx">Bio-protocol - Improve Research Reproducibility1</a></p><p>21. Natural History Museum, London <a href="https://www.nhm.ac.uk/">Home | Natural History Museum (nhm.ac.uk)</a></p><p>22. China National GeneBank life sciences big data platform <a href="https://db.cngb.org/">CNGBdb-国家基因库生命大数据平台</a></p><p>23. Molecular informatics resource for gene expression analysis, transcription factor prediction, and more <a href="http://www.ifti.org/">Molecular Informatics Resource for the Analysis of Gene Expression (ifti.org)</a></p><p>24. Center for Bioinformatics, Peking University <a href="https://www.cbi.pku.edu.cn/">Center for Bioinformatics, Peking University (beta) (pku.edu.cn)</a></p><hr><p><strong>Updated continuously</strong></p><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
<tags>
<tag> Links </tag>
</tags>
</entry>
<entry>
<title>10 handy Linux tools</title>
<link href="/2021/11/22/20211122/"/>
<url>/2021/11/22/20211122/</url>
<content type="html"><![CDATA[<h2 id="Linux"><a href="#Linux" class="headerlink" title="Linux"></a><strong>Linux</strong></h2><h3 id="(1)axel"><a href="#(1)axel" class="headerlink" title="(1)axel"></a>(1) axel</h3><p>Multi-threaded download tool; can replace curl and wget for downloading files.</p><p><code>axel -n 20 http://centos.ustc.edu.cn/centos/7/isos/x86_64/CentOS-7-x86_64-Minimal-1511.iso</code></p><img src="/2021/11/22/20211122/20211122010714.jpg" class width="20211122010714">
<h3 id="(2)shellcheck"><a href="#(2)shellcheck" class="headerlink" title="(2)shellcheck"></a>(2) shellcheck</h3><p>Static analysis tool for shell scripts; it detects syntax errors as well as non-standard or risky constructs.</p>
<h3 id="(3)fzf"><a href="#(3)fzf" class="headerlink" title="(3)fzf"></a>(3) fzf</h3><p>Command-line fuzzy finder for interactively searching and selecting files or content. It pairs very well with the terminal's ctrl-r history search (ctrl + r looks up previously entered commands and is more convenient than the arrow keys or history).</p>
<h3 id="(4)htop"><a href="#(4)htop" class="headerlink" title="(4)htop"></a>(4) htop</h3><p><strong>htop:</strong> a nicer-looking, more convenient process monitor that replaces the top command.</p>
<h3 id="(5)ag"><a href="#(5)ag" class="headerlink" title="(5)ag"></a>(5) ag</h3><p>Recursively searches file contents, similar to grep and find, but faster than both.</p><p>Several options match grep's, such as -i, -A, -B, and -C.</p><p>ag --ignore-dir &lt;Dir name&gt;: skip the given directory when searching,</p><p>ag -w PATTERN: whole-word search; only matches text that is exactly the search term,</p><p>ag --java PATTERN: search for PATTERN in Java files,</p><p>ag --xml PATTERN: search for PATTERN in XML files.</p>
<h3 id="(6)multitail"><a href="#(6)multitail" class="headerlink" title="(6)multitail"></a>(6) multitail</h3><p>Multiple tail. There is usually more than one log file to monitor; opening several terminal tabs takes up too much space, so this tool is worth a try.</p>
<h3 id="(7)script-scriptreplay"><a href="#(7)script-scriptreplay" class="headerlink" title="(7)script/scriptreplay"></a>(7) script/scriptreplay</h3><p>Terminal session recording and replay.</p>
<h3 id="(8)tmux"><a href="#(8)tmux" class="headerlink" title="(8)tmux"></a>(8) tmux</h3><p>Terminal multiplexer; can replace screen and nohup.</p>
<h3 id="(9)tig"><a href="#(9)tig" class="headerlink" title="(9)tig"></a>(9) tig</h3><p>Text-mode interface for browsing git repositories interactively; an alternative to running git commands directly.</p>
<h3 id="(10)mycli"><a href="#(10)mycli" class="headerlink" title="(10)mycli"></a>(10) mycli</h3><p>MySQL client with syntax highlighting and command completion, similar in feel to ipython; can replace the mysql command.</p>
<h3 id="参考:"><a href="#参考:" class="headerlink" title="参考:"></a>References:</h3><p><a href="https://medium.com/starbugs/do-you-understand-htop-ffb72b3d5629">你一定用過 htop,但你有看懂每個欄位嗎?. 身為一個工程師,不管你寫的是前端、後端、全端還是什麼端,一定多少用過… | by Larry Lu | Starbugs Weekly 星巴哥技術專欄 | Medium</a></p><p><a href="https://blog.csdn.net/shisanmei911/article/details/89360353">https://blog.csdn.net/shisanmei911/article/details/89360353</a></p><p><a href="https://mp.weixin.qq.com/s/MhgE1dQzspWJZ6Ce5cK8WA">https://mp.weixin.qq.com/s/MhgE1dQzspWJZ6Ce5cK8WA</a></p><p><a href="https://blog.csdn.net/iamlaosong/article/details/52538599">Linux用ctrl + r 查找以前输入的命令_驽马十驾 才定不舍-CSDN博客</a></p><p><a href="https://baijiahao.baidu.com/s?id=1652587589132970199&wfr=spider&for=pc">Fzf一个由Golang开发的完美通用的Shell命令行模糊查询工具 (baidu.com)</a></p><p>……</p><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
<tags>
<tag> shell笔记 </tag>
</tags>
</entry>
<entry>
<title>The sorrow of marine fish</title>
<link href="/2021/11/21/%E4%BA%BA%E7%B1%BB%E7%9A%84%E5%91%BC%E5%A3%B0%E2%80%94%E2%80%94%E4%BF%9D%E6%8A%A4/"/>
<url>/2021/11/21/%E4%BA%BA%E7%B1%BB%E7%9A%84%E5%91%BC%E5%A3%B0%E2%80%94%E2%80%94%E4%BF%9D%E6%8A%A4/</url>
<content type="html"><![CDATA[<p>So what does a utilization rate "below 100%" actually mean? Do the fish in the ocean exist only to be used by humans? (Earle S A, 1995; Weber P, 1993)</p><img src="/2021/11/21/%E4%BA%BA%E7%B1%BB%E7%9A%84%E5%91%BC%E5%A3%B0%E2%80%94%E2%80%94%E4%BF%9D%E6%8A%A4/fish.jpg" class title="fish"><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
<tags>
<tag> 保护生物多样性 </tag>
</tags>
</entry>
<entry>
<title>Knowledge of Sequencing</title>
<link href="/2021/11/14/Sequencing/"/>
<url>/2021/11/14/Sequencing/</url>
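<content type="html"><![CDATA[<p>A brief introduction to sequencing</p><span id="more"></span><p>Genome assembly means using a sequencing method to generate sequence fragments (reads) from the genome of the target species and joining them based on the overlaps between reads. Reads are first assembled into longer contiguous sequences (contigs); contigs are then joined into longer scaffolds, which may contain gaps. After the errors and gaps in the scaffolds are resolved, the scaffolds are anchored to chromosomes, yielding a high-quality whole-genome sequence.</p><img src="/2021/11/14/Sequencing/20190511201315924.png" class width="20190511201315924"><p>Assembly milestones since the start of genome sequencing</p><img src="/2021/11/14/Sequencing/lcb.png" class title="lcb">
<p>Third-generation sequencing, also called single-molecule sequencing, mainly refers to the single molecule real time (SMRT) technology from Pacific Biosciences (bases are identified by fluorescence) and the nanopore technology from Oxford Nanopore Technologies (bases are identified by electrical current). No PCR amplification of the DNA is needed, so each DNA molecule can be sequenced individually.</p><p>The two modes of SMRT sequencing on the PacBio platform:</p><p>Continuous long reads (CLR), the standard mode for very long reads: contains random errors;</p><p>Circular consensus sequencing (CCS): errors are corrected automatically; also known as HiFi.</p><img src="/2021/11/14/Sequencing/compare.jpg" class title="compare">
<p>Hi-C (High-throughput chromosome conformation capture).</p><img src="/2021/11/14/Sequencing/com.jpg" class title="com"><p>The sample is cross-linked with formaldehyde and, once it passes quality control, digested with a restriction enzyme (such as MboI). The digested fragments are biotin-labeled, blunt-end ligated, and the DNA is purified and extracted; after sonication, the biotin-containing fragments are pulled down for library construction and sequencing. The raw data are then quality-controlled, and the resulting clean reads are aligned to the reference genome to obtain the valid reads used for interaction analysis.</p><img src="/2021/11/14/Sequencing/hicshiyan.jpg" class title="hicshiyan">
<p>A commonly used Hi-C data processing tool is HiC-Pro. It uses a two-step alignment strategy that makes better use of the data, and it also reports a set of quality-control metrics for assessing library quality.</p><p>Software currently used for Hi-C-assisted genome assembly includes LACHESIS, SALSA2, 3D-DNA, and ALLHiC, each with its own strengths and weaknesses (for the algorithms behind each method, see <a href="http://www.bioon.com.cn/news/showarticle.asp?newsid=87435">Hi-C辅助组装知多少,硬核知识点来了~ - 商家动态 - 资讯 - 生物在线 (bioon.com.cn)</a>)</p>
<p>References:</p><p>Wang Tong. 纳米孔测序数据分析手册.</p><p>Giani, A. M., Gallo, G. R., Gianfranceschi, L., & Formenti, G. (2020). Long walk to genomics: History and current approaches to genome sequencing and assembly. Computational and structural biotechnology journal, 18, 9-19.</p><p><a href="https://blog.csdn.net/u010608296/article/details/90110770">https://blog.csdn.net/u010608296/article/details/90110770</a></p><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>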
<tags>
<tag> 前沿了解 </tag>
</tags>
</entry>
<entry>
<title>Analysis_of_transcriptome</title>
<link href="/2021/11/11/Analysis-of-transcriptome/"/>
<url>/2021/11/11/Analysis-of-transcriptome/</url>
<content type="html"><![CDATA[<p><strong>A transcriptome analysis pipeline that runs end to end, starting from the raw sequencer output</strong></p><span id="more"></span><h3 id="一、部分软件安装"><a href="#一、部分软件安装" class="headerlink" title="一、部分软件安装"></a>I. Installing some of the software</h3><p>Install trimmomatic, fastqc, hisat2, samtools, and similar tools with conda.</p><p>HTSeq has to be installed in a Python 2.7 environment, as follows:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">wget https://pypi.python.org/packages/source/H/HTSeq/HTSeq-0.6.1p1.tar.gz</span><br><span class="line">tar -zxvf HTSeq-0.6.1p1.tar.gz</span><br><span class="line">cd HTSeq-0.6.1p1/</span><br><span class="line">python setup.py build</span><br><span class="line">python setup.py install</span><br><span class="line">vi ~/.bashrc  # add the export line below to this file</span><br><span class="line">export PATH="$PATH:/(omitted)/software/HTSeq-0.6.1p1/build/scripts-2.7"</span><br><span class="line">source ~/.bashrc </span><br></pre></td></tr></table></figure><p>Note: the installed htseq-count is in the HTSeq-0.6.1p1/build/scripts-2.7/ directory, which is why that directory is added to PATH.</p>
<p>Alternatively, install featureCounts:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">wget https://nchc.dl.sourceforge.net/project/subread/subread-1.6.3/subread-1.6.3-source.tar.gz</span><br><span class="line">tar -zxvf subread-1.6.3-source.tar.gz</span><br></pre></td></tr></table></figure><p>After building the source package and adding its binaries to PATH, featureCounts is ready to use.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">featureCounts -T 20 -t exon -g gene_id -a Danio_rerio.GRCz11.104.gtf -o count.txt align.bam</span><br></pre></td></tr></table></figure><p>Install DESeq2 in RStudio:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">install.packages("BiocManager")</span><br><span class="line">BiocManager::install("DESeq2")</span><br><span class="line">library(DESeq2)</span><br></pre></td></tr></table></figure>
<h3 id="二、数据处理流程"><a href="#二、数据处理流程" class="headerlink" title="二、数据处理流程"></a>II. Data processing workflow</h3><p>Remove adapter sequences, low-quality reads, and reads left without a mate [1]</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">trimmomatic PE -threads 20 clocka-clocka-mut_combined_R1.fastq.gz clocka-clocka-mut_combined_R2.fastq.gz -baseout clocka-clocka-mut_combined ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:8:true SLIDINGWINDOW:5:20 LEADING:3 TRAILING:3 MINLEN:36</span><br></pre></td></tr></table></figure><p>Quality control [2]</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">fastqc clocka-clocka-mut_combined_1P clocka-clocka-mut_combined_2P -o ./ -t 20 </span><br></pre></td></tr></table></figure><p><u>In the post-filtering QC report, Per base sequence content and Sequence Duplication Levels [3] were both flagged with a red cross; according to the material I consulted, neither has a negative effect on the downstream analysis.</u></p>
<figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line">extract_exons.py Danio_rerio.GRCz11.104.gtf > genome.exon</span><br><span class="line">extract_splice_sites.py Danio_rerio.GRCz11.104.gtf > genome.ss</span><br><span class="line">hisat2-build -p 20 --ss genome.ss --exon genome.exon GCF_000002035.6_GRCz11_genomic.fna genome_tran</span><br><span class="line"></span><br><span class="line">hisat2 -p 30 --dta -x genome_tran -1 clocka-clocka-mut_combined_1P -2 clocka-clocka-mut_combined_2P -S align.sam # align the reads</span><br><span class="line"># hisat2 printed during this run: Warning: Unsupported file format</span><br><span class="line"></span><br><span class="line">samtools view -S align.sam -b > align.bam # convert SAM to BAM, see https://blog.csdn.net/weixin_39790504/article/details/111376943</span><br><span class="line">samtools sort -l 4 -o align_sort.bam align.bam # sort by coordinate</span><br><span class="line">samtools index align_sort.bam align_sort.bam.bai # build the index</span><br><span class="line">htseq-count -f bam -r pos -i gene_id -s yes -t gene -m intersection-nonempty align_sort.bam Danio_rerio.GRCz11.104.gtf > count.txt # count reads; -r pos because the BAM is coordinate-sorted</span><br></pre></td></tr></table></figure><hr><p>To be continued</p><hr>
<h3 id="三、其他参考文章"><a href="#三、其他参考文章" class="headerlink" title="三、其他参考文章"></a>III. Further references</h3><p>[1]<a href="https://blog.csdn.net/sinat_32872729/article/details/93487342">https://blog.csdn.net/sinat_32872729/article/details/93487342</a><br>[2]<a href="https://www.jianshu.com/p/fe6af418a8bc">https://www.jianshu.com/p/fe6af418a8bc</a><br>[3]<a href="https://www.biostars.org/p/307361/#307372">https://www.biostars.org/p/307361/#307372</a></p><p>The detailed transcriptome workflow follows the posts of the user Dawn_WangTP: <a href="https://www.jianshu.com/u/a64003068454">https://www.jianshu.com/u/a64003068454</a><br>For an explanation of the file formats used in transcriptome analysis, see <a href="https://www.jianshu.com/p/03bc06c1e84a">https://www.jianshu.com/p/03bc06c1e84a</a></p><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
<tags>
<tag> 方法 </tag>
</tags>
</entry>
<entry>
<title>Quickly finding the sites that differ between two homologous sequences with Python</title>
<link href="/2021/09/04/%E5%88%A9%E7%94%A8python%E5%BF%AB%E9%80%9F%E5%BE%97%E5%88%B0%E4%B8%A4%E6%9D%A1%E5%90%8C%E6%BA%90%E5%BA%8F%E5%88%97%E7%9A%84%E5%B7%AE%E5%BC%82%E4%BD%8D%E7%82%B9/"/>
<url>/2021/09/04/%E5%88%A9%E7%94%A8python%E5%BF%AB%E9%80%9F%E5%BE%97%E5%88%B0%E4%B8%A4%E6%9D%A1%E5%90%8C%E6%BA%90%E5%BA%8F%E5%88%97%E7%9A%84%E5%B7%AE%E5%BC%82%E4%BD%8D%E7%82%B9/</url>
<content type="html"><![CDATA[<blockquote><p>Setup (Windows as the example): install Python and PyCharm</p><span id="more"></span></blockquote><blockquote><p>Input file: FASTA format with two aligned sequences, each written on a single line;</p><p>Output file: for each differing site, the base in the first sequence, the 1-based position, and the base in the second sequence</p></blockquote>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># -*- coding: utf-8 -*-</span></span><br><span class="line"><span class="comment"># @Time : 2021/5/28 23:22</span></span><br><span class="line"><span class="comment"># @Author : wutz</span></span><br><span class="line"><span class="comment"># @File : call_divergent_sites.py</span></span><br><span class="line"><span class="comment"># @Software : PyCharm</span></span><br><span class="line">fasfile = <span class="built_in">open</span>(<span class="string">"input.fas"</span>)</span><br><span class="line">outfile = <span class="built_in">open</span>(<span class="string">"divergent-sites.txt"</span>,<span class="string">"w"</span>)</span><br><span class="line">lines = fasfile.readlines()</span><br><span class="line">a = lines[<span class="number">1</span>].strip()</span><br><span class="line">b = lines[<span class="number">3</span>].strip()</span><br><span class="line">length = <span class="built_in">len</span>(a)</span><br><span class="line">i = <span class="number">0</span></span><br><span class="line"><span class="keyword">while</span> i < length:</span><br><span class="line">    <span class="keyword">if</span> a[i] != b[i] <span class="keyword">and</span> a[i] != <span class="string">"-"</span> <span class="keyword">and</span> b[i] != <span class="string">"-"</span> <span class="keyword">and</span> a[i] != <span class="string">"?"</span> <span class="keyword">and</span> b[i] != <span class="string">"?"</span>:</span><br><span class="line">        <span class="built_in">print</span>(a[i],i + <span class="number">1</span>,b[i], file=outfile)</span><br><span class="line">    i+=<span class="number">1</span></span><br><span class="line">fasfile.close()</span><br><span class="line">outfile.close()</span><br></pre></td></tr></table></figure><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
<tags>
<tag> Methods </tag>
</tags>
</entry>
<entry>
<title>How to match raw sequencing reads to a query sequence locally</title>
<link href="/2021/09/04/%E5%88%A9%E7%94%A8query%E6%9C%AC%E5%9C%B0%E5%8C%B9%E9%85%8D%E6%B5%8B%E5%BA%8F%E5%8E%9F%E5%A7%8Breads/"/>
<url>/2021/09/04/%E5%88%A9%E7%94%A8query%E6%9C%AC%E5%9C%B0%E5%8C%B9%E9%85%8D%E6%B5%8B%E5%BA%8F%E5%8E%9F%E5%A7%8Breads/</url>
<content type="html"><![CDATA[<p><strong>Map raw reads from transcriptome or resequencing data to a target sequence to verify that the sequence is accurate.</strong></p><blockquote><p><strong>Example: RNA-Seq of Hippopotamus amphibius: adult male skin (<a href="https://www.ncbi.nlm.nih.gov/sra/?term=SRR8270566">SRR8270566</a>)</strong></p></blockquote><span id="more"></span><blockquote><p>Install the required tools: bwa, samtools, and the SRA Toolkit (which provides prefetch and fastq-dump).</p></blockquote><figure class="highlight powershell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line">prefetch SRR8270566 <span class="comment">#download the reads from the SRA database; alternatively: wget -c https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-2/SRR8270566/SRR8270566.1 (the commands below use the SRR8270566.1 file name that wget produces)</span></span><br><span class="line"></span><br><span class="line">fastq-dump -<span class="operator">-split</span><span class="literal">-3</span> SRR8270566.<span class="number">1</span> <span class="comment">#convert to FASTQ format</span></span><br><span class="line"></span><br><span class="line">bwa index <span class="literal">-a</span> bwtsw hamp.fas <span class="comment">#index the target sequence</span></span><br><span class="line"></span><br><span class="line">bwa mem <span class="literal">-t</span> <span class="number">20</span> <span class="literal">-M</span> <span class="literal">-R</span> <span class="string">"@RG\tID:hamp"</span> hamp.fas SRR8270566.<span class="number">1</span>_1.fastq SRR8270566.<span class="number">1</span>_2.fastq > hamp.sam <span class="comment">#align the reads and write a SAM file</span></span><br><span class="line"></span><br><span class="line">samtools <span class="built_in">sort</span> -<span class="selector-tag">@</span> <span class="number">30</span> <span class="literal">-m</span> <span class="number">10</span>G <span class="literal">-O</span> bam <span class="literal">-o</span> hamp.bam hamp.sam <span class="comment">#sort and write a binary BAM file (-m is memory per thread, so lower -@ and -m to fit your machine)</span></span><br><span class="line"></span><br><span class="line">samtools index hamp.bam <span class="comment">#create the hamp.bam.bai index</span></span><br></pre></td></tr></table></figure><blockquote><p>Finally, load hamp.fas, hamp.bam, and hamp.bam.bai into a viewer such as IGV or Tablet to see the reads that match the query fragment.</p></blockquote><p><strong>There’s More Than One Way To Do It! For reference only.</strong></p><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
<tags>
<tag> Methods </tag>
</tags>
</entry>
<entry>
<title>The three swordsmen of the shell: grep, sed, and awk</title>
<link href="/2021/09/02/shell%E4%B8%89%E5%89%91%E5%AE%A2/"/>
<url>/2021/09/02/shell%E4%B8%89%E5%89%91%E5%AE%A2/</url>
<content type="html"><![CDATA[<h1 align="center">Fast file processing on the command line</h1><h3 id="1、Shell里的循环与判断结构"><a href="#1、Shell里的循环与判断结构" class="headerlink" title="1、Shell里的循环与判断结构"></a>1. Loops and conditionals in the shell</h3><p>for loop:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">for i in `cat 1.txt`;do command;done</span><br></pre></td></tr></table></figure><p>while loop:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">while [ expression ];do command;done</span><br></pre></td></tr></table></figure><p>if statement:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">if [ condition ];then</span><br><span class="line">command</span><br><span class="line">else</span><br><span class="line">command</span><br><span class="line">fi</span><br></pre></td></tr></table></figure><h3 id="2、shell三剑客"><a href="#2、shell三剑客" class="headerlink" title="2、shell三剑客"></a>2. The three swordsmen of the shell</h3><p>(The sed and awk parts draw on the online course material from 黑马程序员 (Wuhan center): <a href="https://www.bilibili.com/video/BV1st411N7WS?from=search&seid=1843626299734778610">2019全新Shell脚本从入门到精通教程_哔哩哔哩_bilibili</a>)</p><h4 id="grep"><a href="#grep" class="headerlink" title="grep"></a>grep</h4><p>Syntax:</p><figure class="highlight powershell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">grep [<span class="type">options</span>] <span class="string">"pattern"</span> filename1 filename2 filename3 <span class="comment">#print the lines that contain pattern</span></span><br></pre></td></tr></table></figure><p>Common options:</p><table><thead><tr><th align="center">Option</th><th align="center">Description</th></tr></thead><tbody><tr><td align="center">-v</td><td 
align="center">Invert the match: print the lines that do not match</td></tr><tr><td align="center">--color=auto</td><td align="center">Highlight matches in color</td></tr><tr><td align="center">-E</td><td align="center">Use extended regular expressions</td></tr><tr><td align="center">-o</td><td align="center">Print only the matched part</td></tr><tr><td align="center">-c</td><td align="center">Count the lines that contain a match</td></tr><tr><td align="center">-n</td><td align="center">Print matching lines with their line numbers</td></tr><tr><td align="center">-b</td><td align="center">Print the byte offset of each output line (with -o, the offset of the match itself)</td></tr><tr><td align="center">-l</td><td align="center">Search several files and print only the names of the files that contain a match</td></tr><tr><td align="center">-r</td><td align="center">Search directories recursively</td></tr><tr><td align="center">-i</td><td align="center">Ignore case</td></tr><tr><td align="center">-e</td><td align="center">Specify a pattern; repeat -e to match any of several patterns</td></tr><tr><td align="center">-q</td><td align="center">Print nothing; exit with status 0 if a match is found and non-zero otherwise</td></tr><tr><td align="center">-A n</td><td align="center">Print n lines after each match</td></tr><tr><td align="center">-B n</td><td align="center">Print n lines before each match</td></tr><tr><td align="center">-C n</td><td align="center">Print n lines before and after each match</td></tr></tbody></table><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"># Examples:</span><br><span class="line">echo gnu is not unix | grep -b -o "not" #prints 7:not</span><br><span class="line">grep -l "pattern" file1 file2 file3 #prints the matching file names</span><br><span class="line">grep "pattern" . 
-r -n #. is the current directory</span><br><span class="line">grep -e "pattern1" -e "pattern2" filename</span><br><span class="line">grep -q "pattern" filename</span><br></pre></td></tr></table></figure><p>Also: the Linux xargs command (used together with pipes):</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"># Example:</span><br><span class="line">cat url-list.txt | xargs wget -c #the file lists several URLs; xargs passes them to wget for download</span><br></pre></td></tr></table></figure><p>For detailed xargs usage, see <a href="https://www.runoob.com/linux/linux-comm-xargs.html">Linux xargs 命令 | 菜鸟教程 (runoob.com)</a></p><h4 id="sed"><a href="#sed" class="headerlink" title="sed"></a>sed</h4><p>Syntax:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">sed [options] 'address+action' filename</span><br></pre></td></tr></table></figure><p>Common options:</p><table><thead><tr><th align="center">Option</th><th align="center">Description</th><th align="center">Notes</th></tr></thead><tbody><tr><td align="center">-e</td><td align="center">Apply several edits in one command</td><td align="center"></td></tr><tr><td align="center">-n</td><td align="center">Suppress the default output</td><td align="center">The pattern space is not printed automatically</td></tr><tr><td align="center">-r</td><td align="center">Use <strong>extended regular expressions</strong></td><td align="center"></td></tr><tr><td align="center">-i</td><td align="center">Edit in place (modifies the source file)</td><td align="center"></td></tr><tr><td align="center">-f</td><td align="center">Read the sed script from the named file</td><td align="center"></td></tr></tbody></table><p>Common actions (inside <strong>single quotes</strong>):</p><table><thead><tr><th align="center">Action</th><th align="center">Description</th><th align="center">Notes</th></tr></thead><tbody><tr><td align="center">‘p’</td><td align="center">Print</td><td align="center"></td></tr><tr><td align="center">‘i’</td><td align="center">Insert text <strong>before</strong> the addressed line</td><td align="center">Like uppercase O in vim</td></tr><tr><td align="center">‘a’</td><td 
align="center">Insert text <strong>after</strong> the addressed line</td><td align="center">Like lowercase o in vim</td></tr><tr><td align="center">‘c’</td><td align="center">Replace the entire content of the addressed line</td><td align="center"></td></tr><tr><td align="center">‘d’</td><td align="center">Delete the addressed line</td><td align="center"></td></tr></tbody></table><p>Use these actions to <u>add, delete, modify, and query</u> file content</p><figure class="highlight powershell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">sed <span class="literal">-n</span> <span class="string">'1,5p'</span> a.txt <span class="comment">#print lines 1 to 5</span></span><br></pre></td></tr></table></figure><figure class="highlight powershell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">sed <span class="literal">-n</span> <span class="string">'$p'</span> a.txt <span class="comment">#print the last line</span></span><br></pre></td></tr></table></figure><figure class="highlight powershell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">sed <span class="string">'5a99999'</span> a.txt <span class="comment">#append a line after line 5 of the file</span></span><br></pre></td></tr></table></figure><figure class="highlight powershell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">sed <span class="string">'/^uucp/ihello'</span> a.txt <span class="comment">#insert a line before each line that starts with uucp</span></span><br></pre></td></tr></table></figure><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">sed '1,5chello world' a.txt #replace lines 1 to 5 of the file with hello world</span><br><span class="line">sed '/^user01/c888888' a.txt #replace each line that starts with user01</span><br></pre></td></tr></table></figure><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td 
class="code"><pre><span class="line">sed '$d' a.txt #delete the last line of the file</span><br></pre></td></tr></table></figure><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">sed -n '1,5s/^/#/p' a.txt #comment out lines 1-5 (with -n and p, only the changed lines are printed)</span><br></pre></td></tr></table></figure><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">sed -n 's/root/ROOT/gp' 1.txt #replace every root in 1.txt with ROOT (only the changed lines are printed)</span><br></pre></td></tr></table></figure><p>Using sed with regular expressions</p><table><thead><tr><th align="center">Address</th><th align="center">Description</th><th align="center">Notes</th></tr></thead><tbody><tr><td align="center">/key/</td><td align="center">Lines that contain the keyword</td><td align="center"><code>sed -n '/root/p' 1.txt</code></td></tr><tr><td align="center">/key1/,/key2/</td><td align="center">Lines from the one matching key1 through the one matching key2</td><td align="center"><code>sed -n '/^adm/,/^mysql/p' 1.txt</code></td></tr><tr><td align="center">/key/,x</td><td align="center">Lines from the one matching the keyword through line x of the file (the matching line is included)</td><td align="center"><code>sed -n '/^ftp/,7p'</code></td></tr><tr><td align="center">x,/key/</td><td align="center">Lines from line x of the file through the line matching the keyword</td><td align="center"></td></tr><tr><td align="center">x,y!</td><td align="center">All lines except lines x to y</td><td align="center"></td></tr><tr><td align="center">/key/!</td><td align="center">Lines that do not contain the keyword</td><td align="center"><code>sed -n '/bash$/!p' 1.txt</code></td></tr></tbody></table><h4 id="awk"><a href="#awk" class="headerlink" title="awk"></a>awk</h4><p>awk is both a Unix tool and a language. It supports conditionals and loops (such as for and while) and is a <strong>column-oriented tool</strong> for processing data files.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">awk [options] 'program' filename</span><br></pre></td></tr></table></figure><p>Common options:</p><table><thead><tr><th align="center">Option</th><th align="center">Description</th></tr></thead><tbody><tr><td 
align="center">-F</td><td align="center">Set the field separator (default: whitespace)</td></tr><tr><td align="center">-v</td><td align="center">Define a variable and assign it a value (inside awk, variables are referenced without the $ sign)</td></tr></tbody></table><p>The program part:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line">a) Regular expression or address </span><br><span class="line"> '/root/{awk statements}'</span><br><span class="line"> 'NR==1,NR==5{awk statements}' #in sed: '1,5p'</span><br><span class="line"> '/^root/,/^ftp/{awk statements}' #in sed: '/^root/,/^ftp/p'</span><br><span class="line">b)</span><br><span class="line">{awk statement 1;awk statement 2}</span><br><span class="line"> Note: awk statements are separated by semicolons</span><br><span class="line">c)</span><br><span class="line">BEGIN...END....</span><br><span class="line"> 'BEGIN{awk statements};{main body};END{awk statements}'</span><br><span class="line"> 'BEGIN{awk statements};{main body}'</span><br><span class="line"> '{main body};END{awk statements}'</span><br></pre></td></tr></table></figure><p>awk built-in variables</p><table><thead><tr><th align="center">Variable</th><th align="center">Description</th><th align="center">Notes</th></tr></thead><tbody><tr><td align="center">$0</td><td align="center">The entire current record (line)</td><td align="center"></td></tr><tr><td align="center">$1,$2,$3…$n</td><td align="center">The fields of each line, split by the field separator</td><td align="center"><code>awk -F: '{print $1,$3}'</code></td></tr><tr><td align="center">NF</td><td align="center">Number of fields (columns) in the current record</td><td align="center"><code>awk -F: '{print NF}'</code></td></tr><tr><td align="center">$NF</td><td align="center">The last column</td><td align="center"><code>$(NF-1)</code> is the second-to-last column</td></tr><tr><td align="center">FNR/NR</td><td align="center">Record (line) number: NR counts across all input files, FNR restarts at 1 for each file</td><td align="center"></td></tr><tr><td 
align="center">FS</td><td align="center">Input field separator</td><td align="center"><code>'BEGIN{FS=":"};{print $1,$3}'</code></td></tr><tr><td align="center">OFS</td><td align="center">Output field separator (default: a space)</td><td align="center"><code>'BEGIN{OFS="\t"};{print $1,$3}'</code></td></tr><tr><td align="center">RS</td><td align="center">Input record separator (default: newline)</td><td align="center"><code>'BEGIN{RS="\t"};{print $0}'</code></td></tr><tr><td align="center">ORS</td><td align="center">Output record separator (default: newline)</td><td align="center"><code>'BEGIN{ORS="\n\n"};{print $1,$3}'</code></td></tr><tr><td align="center">FILENAME</td><td align="center">Name of the current input file</td><td align="center"></td></tr></tbody></table><p>How awk works</p><p><code>awk -F: '{print $1,$3}' /etc/passwd</code></p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line">awk takes one line as input and assigns it to the built-in variable $0. Each line is also called a record and ends with a newline (RS).</span><br><span class="line"></span><br><span class="line">Each line is split into fields by the separator, here : (the default is spaces or tabs). Each field is stored in a numbered variable, starting from $1.</span><br><span class="line"></span><br><span class="line">Q: How does awk know to split fields on spaces?</span><br><span class="line"></span><br><span class="line">A: The built-in variable FS holds the field separator. Initially FS is a space.</span><br><span class="line"></span><br><span class="line">awk prints fields with print. The printed fields are separated by a space because of the comma between $1 and $3. The comma is special: it maps to another built-in variable, the output field separator OFS, which defaults to a space.</span><br><span class="line"></span><br><span class="line">After awk finishes a line, it reads the next line from the file into $0, overwriting the previous content, then splits the new string into fields and processes it. This repeats until every line has been processed.</span><br></pre></td></tr></table></figure><p>Formatted output with <code>print</code> and <code>printf</code></p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span 
class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><span class="line">print works like echo "hello world"</span><br><span class="line"><span class="meta">#</span><span class="bash"> date |awk <span class="string">'{print "Month: "$2 "\nYear: "$NF}'</span></span></span><br><span class="line"><span class="meta">#</span><span class="bash"> awk -F: <span class="string">'{print "username is: " $1 "\t uid is: "$3}'</span> /etc/passwd</span></span><br><span class="line"></span><br><span class="line"></span><br><span class="line">printf works like echo -n</span><br><span class="line"><span class="meta">#</span><span class="bash"> awk -F: <span class="string">'{printf "%-15s %-10s %-15s\n", $1,$2,$3}'</span> /etc/passwd</span></span><br><span class="line"><span class="meta">#</span><span class="bash"> awk -F: <span class="string">'{printf "|%15s| %10s| %15s|\n", $1,$2,$3}'</span> /etc/passwd</span></span><br><span class="line"><span class="meta">#</span><span class="bash"> awk -F: <span class="string">'{printf "|%-15s| %-10s| %-15s|\n", $1,$2,$3}'</span> /etc/passwd</span></span><br><span class="line"></span><br><span class="line">awk 'BEGIN{FS=":"};{printf "%-15s %-15s %-15s\n",$1,$6,$NF}' a.txt</span><br><span class="line"><span class="meta"></span></span><br><span class="line"><span class="meta">%</span><span class="bash">s string type, as in %-20s</span></span><br><span class="line"><span class="meta">%</span><span class="bash">d numeric type</span></span><br><span class="line">15 sets a field width of 15 characters</span><br><span class="line">- 
means left-aligned; the default is right-aligned</span><br><span class="line">printf does not add a newline at the end of the line; add \n yourself</span><br></pre></td></tr></table></figure><h3 id><a href="#" class="headerlink" title></a></h3><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
<tags>
<tag> shell notes </tag>
</tags>
</entry>
<entry>
<title>conda</title>
<link href="/2021/09/02/conda/"/>
<url>/2021/09/02/conda/</url>
<content type="html"><![CDATA[<h1 align="center">Installing software and managing environments with conda</h1><p>Download the installer from the official site (<a href="https://www.anaconda.com/products/individual">conda</a>)<br><code>wget https://repo.continuum.io/miniconda/Miniconda3-4.3.21-Linux-x86_64.sh</code><br>Install<br><code>bash Miniconda3-4.3.21-Linux-x86_64.sh</code><br>Answer yes to every prompt during installation, let the installer add conda to your PATH, then reload .bashrc<br><code>source ~/.bashrc</code><br>Add channels<br><code>conda config --add channels conda-forge</code><br><code>conda config --add channels r</code><br> <code>conda config --add channels bioconda</code></p><p>Install bioinformatics software<br><code>conda install softwarename</code></p><hr><p><strong>A selection of conda commands for managing environments:</strong></p><p>#Create a new environment named env_name</p><p>conda create -n env_name </p><p>#Clone the existing environment old_env_name as new_env_name</p><p>conda create -n new_env_name --clone old_env_name </p><p>#Activate the env_name environment</p><p>conda activate env_name</p><p>#Leave the env_name environment</p><p>conda deactivate</p><p>#Delete the env_name environment</p><p>conda remove -n env_name --all</p><p>#List the environments</p><p>conda info -e</p><p>#List the packages installed in the current environment</p><p>conda list</p><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
<tags>
<tag> Software installation </tag>
</tags>
</entry>
<entry>
<title>The source of life: the ocean</title>
<link href="/2021/08/26/STARTING/"/>
<url>/2021/08/26/STARTING/</url>
<content type="html"><![CDATA[<h1 id="Here-we-come"><a href="#Here-we-come" class="headerlink" title="Here we come ~"></a>Here we come ~</h1><h2 id><a href="#" class="headerlink" title></a></h2><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
<tags>
<tag> Essays </tag>
</tags>
</entry>
</search>