|
1 | 1 | # DELIVERY OF RESULTS FROM EXOME ANALYSIS WITH SAREK |
2 | 2 |
|
3 | 3 | ## Analysis |
4 | | -Samples were analysed with the Sarek pipeline release {{ release }}. In short, the pipeline does the following: |
5 | | -Reads from fastq-files were mapped to a reference genome using BWA. |
6 | | -Bam-files were de-duplicated with GATK MarkDuplicates. |
7 | | -Base quality score recalibration tables were created with GATK BaseRecalibrator. |
8 | | -The tables were then used in GATK ApplyBQSR to create recalibrated bam-files. |
9 | | -SNVs and small indels were called with GATK HaplotypeCaller. |
10 | | -Variants were annoted with SnpEff. |
| 4 | +Samples were analysed with the Sarek pipeline release {{ release }}. |
11 | 5 |
|
12 | | -For details on the pipeline, folder structure and how to interpret results, please refer to the Sarek documentation: |
13 | | -https://nf-co.re/sarek/{{ release }} |
| 6 | +The workflow processes raw data from FastQ inputs, aligns the reads, marks duplicates and performs base quality score recalibration.
| 7 | +SNVs and small indels are called with GATK HaplotypeCaller and DeepVariant. SnpEff-annotated calls are reported in
| 8 | +separate VCF files for each caller, as well as in concatenated VCF files with the combined result.
| 9 | +In addition to the Sarek pipeline analysis, target region coverage was evaluated with Picard CollectHsMetrics.
14 | 10 |
|
15 | | -After running the pipeline, Picard CollectHsMetrics was used to evaluate the coverage |
| 11 | +For information regarding the pipeline, folder structure and how to interpret results, please refer to the Sarek documentation: |
| 12 | +[https://nf-co.re/sarek/{{ release }}](https://nf-co.re/sarek/{{ release }}) |
16 | 13 |
|
17 | | -## Delivery structure, directories and files: |
| 14 | +Detailed information about standard outputs from the pipeline can be found [here](https://nf-co.re/sarek/{{ release }}/output). |
18 | 15 |
|
19 | | -``` |
| 16 | +The directory also contains the file checksums.md5, which should be used to verify the integrity of the files after transfer. |
20 | 17 |
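The integrity check against checksums.md5 mentioned above could be sketched as follows. This is a minimal illustration, not part of the delivery: it assumes that checksums.md5 uses the common `md5sum` line format (`<hex digest>  <relative path>`) and that paths are relative to the directory holding the file. The function names `md5_of_file` and `verify_checksums` are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(checksum_file):
    """Compare each listed file against its recorded MD5 digest.

    Assumption: md5sum-style lines, "<hex digest>  <relative path>".
    Returns a list of (path, status) tuples for files that are missing
    or whose digest differs; an empty list means everything matched.
    """
    checksum_file = Path(checksum_file)
    base_dir = checksum_file.parent
    problems = []
    for line in checksum_file.read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        expected, _, name = line.partition(" ")
        # md5sum writes a mode character (" " or "*") before the file name
        if name[:1] in (" ", "*"):
            name = name[1:]
        target = base_dir / name
        if not target.is_file():
            problems.append((name, "missing"))
        elif md5_of_file(target) != expected.lower():
            problems.append((name, "mismatch"))
    return problems
```

Calling `verify_checksums("checksums.md5")` from the delivery directory would then return an empty list if all files transferred intact. Where GNU coreutils is available, `md5sum -c checksums.md5` performs the same check under the same format assumption.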
|
21 | | -├── Annotation |
22 | | -│ ├── <sample1 name> |
23 | | -│ │ └── snpEff |
24 | | -│ └── <sample2 name> |
25 | | -│ └── snpEff |
26 | | -├── DELIVERY.README.SAREK.WES.md |
27 | | -├── pipeline_info |
28 | | -│ ├── results_description.html |
29 | | -│ └── software_versions.csv |
30 | | -├── Preprocessing |
31 | | -│ ├── TSV |
32 | | -│ │ ├── duplicates_marked_no_table.tsv |
33 | | -│ │ ├── duplicates_marked_no_table_<sample1 name>.tsv |
34 | | -│ │ ├── duplicates_marked_no_table_<sample2 name>.tsv |
35 | | -│ │ ├── duplicates_marked.tsv |
36 | | -│ │ ├── duplicates_marked_<sample1 name>.tsv |
37 | | -│ │ └── duplicates_marked_<sample2 name>.tsv |
38 | | -│ ├── <sample1 name> |
39 | | -│ │ └── DuplicatesMarked |
40 | | -│ │ ├── <sample1 name>.md.bam |
41 | | -│ │ ├── <sample1 name>.md.bam.bai |
42 | | -│ │ └── <sample1 name>.recal.table |
43 | | -│ └── <sample2 name> |
44 | | -│ └── DuplicatesMarked |
45 | | -│ ├── <sample2 name>.md.bam |
46 | | -│ ├── <sample2 name>.md.bam.bai |
47 | | -│ └── <sample2 name>.recal.table |
48 | | -├── Reports |
49 | | -│ ├── SequenceQC |
50 | | -│ │ ├── <runfolder 1> |
51 | | -│ │ │ ├── <runfolder 1>_<project>_multiqc_report_data.zip |
52 | | -│ │ │ └── <runfolder 1>_<project>_multiqc_report.html |
53 | | -│ │ └── <runfolder 2> |
54 | | -│ │ ├── <runfolder 2>_<project>_multiqc_report_data.zip |
55 | | -│ │ └── <runfolder 2>_<project>_multiqc_report.html |
56 | | -│ ├── MultiQC |
57 | | -│ │ ├── <project>_multiqc_report_data.zip |
58 | | -│ │ └── <project>_multiqc_report.html |
59 | | -│ ├── <sample1 name> |
60 | | -│ │ ├── bamQC |
61 | | -│ │ ├── BCFToolsStats |
62 | | -│ │ ├── FastQC |
63 | | -│ │ ├── HsMetrics |
64 | | -│ │ ├── MarkDuplicates |
65 | | -│ │ ├── SamToolsStats |
66 | | -│ │ ├── snpEff |
67 | | -│ │ └── VCFTools |
68 | | -│ └── <sample2 name> |
69 | | -│ ├── bamQC |
70 | | -│ ├── BCFToolsStats |
71 | | -│ ├── FastQC |
72 | | -│ ├── HsMetrics |
73 | | -│ ├── MarkDuplicates |
74 | | -│ ├── SamToolsStats |
75 | | -│ ├── snpEff |
76 | | -│ └── VCFTools |
77 | | -├── Resources |
78 | | -│ └── apply_recalibration.sh |
79 | | -├── <sample1 name>.lst |
80 | | -├── <sample1 name>.md5 |
81 | | -├── <sample2 name>.lst |
82 | | -├── <sample2 name>.md5 |
83 | | -└── VariantCalling |
84 | | - ├── <sample1 name> |
85 | | - │ ├── HaplotypeCaller |
86 | | - │ └── HaplotypeCallerGVCF |
87 | | - └── <sample2 name> |
88 | | - ├── HaplotypeCaller |
89 | | - └── HaplotypeCallerGVCF |
| 18 | + |
| 19 | +## Delivery structure |
90 | 20 |
|
91 | 21 | ``` |
| 22 | +├── checksums.md5 |
| 23 | +├── DELIVERY.README.SAREK.WES.md |
| 24 | +├── results |
| 25 | + ├── add |
| 26 | + ├── annotation |
| 27 | + │ ├── deepvariant |
| 28 | + │ └── haplotypecaller |
| 29 | + ├── csv |
| 30 | + ├── multiqc |
| 31 | + ├── pipeline_info |
| 32 | + ├── preprocessing |
| 33 | + │ ├── fastp |
| 34 | + │ ├── recalibrated |
| 35 | + │ └── recal_table |
| 36 | + ├── reference |
| 37 | + │ └── intervals |
| 38 | + ├── reports |
| 39 | + │ ├── bcftools |
| 40 | + │ ├── fastp |
| 41 | + │ ├── fastqc |
| 42 | + │ ├── HsMetrics |
| 43 | + │ ├── markduplicates |
| 44 | + │ ├── mosdepth |
| 45 | + │ ├── samtools |
| 46 | + │ ├── snpeff |
| 47 | + │ └── vcftools |
| 48 | + ├── tabix |
| 49 | + └── variant_calling |
| 50 | + ├── concat |
| 51 | + ├── deepvariant |
| 52 | + └── haplotypecaller |
92 | 53 |
|
93 | | -## FASTQ files |
| 54 | +``` |
94 | 55 |
|
95 | | -FASTQ files are not included in the delivery, but can be regenerated from the BAM files. |
96 | | -We recommend using https://github.com/qbic-pipelines/bamtofastq, refer to its documentation for usage. |
97 | 56 |
|
98 | 57 | ## Known issues |
99 | | - |
100 | 58 | - Twist bait intervals are not publicly available and therefore, when running CollectHsMetrics (Picard), the target intervals are used to specify both target and bait. |
101 | 59 | This will lead to some incorrect entries in the HsMetrics table in the MultiQC report; entries regarding baits should be ignored.
102 | 60 |
|
103 | | -## Additional information |
104 | 61 |
|
105 | | -- The original target file used for the analysis can be found here https://www.twistbioscience.com/resources/bed-file/twist-human-comprehensive-exome-panel-bed-files |
106 | | -Note that each region in this file was padded with 100 bp upstream and downstream before submitting it to the pipeline. |
| 62 | +## Additional information |
| 63 | +- The original target file used for the analysis can be found [here](https://www.twistbioscience.com/resources/data-files/comprehensive-exome-bed-files) |
| 64 | +Note that each region in this file was padded with 100 bp upstream and downstream before submitting it to the pipeline (available in results/reference/intervals). |
107 | 65 | - Note that samples that are sequenced on more than one flowcell/lane will be suffixed accordingly for some modules in the MultiQC report. |
108 | | -A sample that has been sequenced twice will for some metrics be presented as a joint vaule for <sample name>, and with one value per run, i.e. <sample name>_1 and <sample_name>_2. |
109 | | -- To apply the recalibrations table to the deduplicated .bam-files use the script Resources/apply_recalibration.sh |
| 66 | +A sample that has been sequenced twice will for some metrics be presented as a joint value for <sample name>, and with one value per run, i.e. <sample name>_1 and <sample name>_2.
| 67 | +- Output from GATK MarkDuplicates has been removed from the results folder.
| 68 | +Duplicate-marked CRAM files can be requested up to 60 days after delivery.
| 69 | + |