| 1 | +# Report guide |
| 2 | +This section of the user guide contains information on using `methbat report` to analyze pre-defined regions with known expected methylation patterns, such as imprinting regions. |
| 3 | +This approach extracts the CpGs overlapping the pre-defined regions and aggregates the signal for the region, allowing `methbat` to assign labels such as "Methylated" or "AlleleSpecificMethylation" for the regions. |
| 4 | +The `methbat report` command then compares the observed methylation patterns against the expected patterns, identifying regions with anomalous methylation states and generating quality control warnings. |
| 5 | + |
| 6 | +Table of contents: |
| 7 | + |
| 8 | +* [Quickstart](#quickstart) |
| 9 | +* [Input files](#input-files) |
| 10 | +* [Output files](#output-files) |
| 11 | + |
| 12 | +# Quickstart |
| 13 | +The following command will create a methylation report for a single dataset: |
| 14 | + |
| 15 | +```bash |
| 16 | +methbat report \ |
| 17 | + --input-prefix {IN_PREFIX} \ |
| 18 | + --input-regions {IN_REGIONS} \ |
| 19 | + --output-report {OUT_REPORT} |
| 20 | +``` |
| 21 | + |
| 22 | +Parameters: |
| 23 | +* `--input-prefix {IN_PREFIX}` - the prefix for the outputs from [pb-CpG-tools](https://github.com/PacificBiosciences/pb-CpG-tools); these outputs contain CpG metrics aggregated at each CpG locus |
| 24 | +* `--input-regions {IN_REGIONS}` - the genomic regions of interest with expected methylation categories; example region files are provided in the [report_regions data folder](../data/report_regions/) and the format is [specified below](#regions-file) |
| 25 | +* `--output-report {OUT_REPORT}` - the output report file (CSV/TSV) |
| 26 | + |
| 27 | +## Quickstart Example |
| 28 | +``` |
| 29 | +methbat report \ |
| 30 | + --input-prefix ./pipeline/cpg_5mc_model/HG001 \ |
| 31 | + --input-regions ./data/report_regions/imprinting_targets.tsv \ |
| 32 | + --output-report ./output/HG001.report.tsv |
| 33 | + |
| 34 | +[2025-11-21T19:56:58.295Z INFO methbat::cli::report] Input/Output: |
| 35 | +[2025-11-21T19:56:58.295Z INFO methbat::cli::report] Input prefix: "./pipeline/cpg_5mc_count/HG001" |
| 36 | +[2025-11-21T19:56:58.295Z INFO methbat::cli::report] Input profile regions: "./data/methbat_report/imprinting_targets.tsv" |
| 37 | +[2025-11-21T19:56:58.295Z INFO methbat::cli::report] Output report file: "./output/HG001.report.tsv" |
| 38 | +[2025-11-21T19:56:58.295Z INFO methbat::cli::report] Labeling heuristics: |
| 39 | +[2025-11-21T19:56:58.295Z INFO methbat::cli::report] Minimum haplotype coverage: 10 |
| 40 | +[2025-11-21T19:56:58.295Z INFO methbat::cli::report] Minimum ASM phased fraction: 0.75 |
| 41 | +[2025-11-21T19:56:58.295Z INFO methbat::cli::report] Minimum ASM absolute delta mean: 0.5 |
| 42 | +[2025-11-21T19:56:58.295Z INFO methbat::cli::report] Minimum Weak ASM absolute delta mean: 0.3 |
| 43 | +[2025-11-21T19:56:58.295Z INFO methbat::cli::report] Maximum ASM Fishers exact p-value: 0.01 |
| 44 | +[2025-11-21T19:56:58.295Z INFO methbat::cli::report] Maximum unmethylated combined fraction: 0.2 |
| 45 | +[2025-11-21T19:56:58.295Z INFO methbat::cli::report] Minimum methylated combined fraction: 0.8 |
| 46 | +[2025-11-21T19:56:58.295Z INFO methbat::cpg_parser] Loading "./pipeline/cpg_5mc_count/HG001.combined.bed"... |
| 47 | +[2025-11-21T19:57:29.203Z INFO methbat::cpg_parser] Loading "./pipeline/cpg_5mc_count/HG001.hap1.bed"... |
| 48 | +[2025-11-21T19:57:55.854Z INFO methbat::cpg_parser] Loading "./pipeline/cpg_5mc_count/HG001.hap2.bed"... |
| 49 | +[2025-11-21T19:58:20.003Z INFO methbat::reporting] Loading "./data/methbat_report/imprinting_targets.tsv"... |
| 50 | +[2025-11-21T19:58:20.035Z INFO methbat::writers::report_writer] Saving report results to "./output/HG001.report.tsv"... |
| 51 | +[2025-11-21T19:58:20.632Z INFO methbat] Process finished successfully. |
| 52 | +``` |
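
The log above shows `methbat report` loading three files derived from the input prefix: `.combined.bed`, `.hap1.bed`, and `.hap2.bed`. The following is a minimal Python sketch for checking those paths before starting a run. It assumes these three suffixes are the complete set of required inputs, which the log suggests but does not confirm. The helper names are hypothetical and are not part of `methbat`.

```python
from pathlib import Path

# Suffixes observed in the quickstart log; assumed to be the full set of inputs.
EXPECTED_SUFFIXES = (".combined.bed", ".hap1.bed", ".hap2.bed")


def expected_inputs(prefix):
    """Return the per-prefix input paths in the order the log loads them."""
    return [Path(f"{prefix}{suffix}") for suffix in EXPECTED_SUFFIXES]


def missing_inputs(prefix):
    """Return the expected input paths that do not exist on disk."""
    return [path for path in expected_inputs(prefix) if not path.is_file()]


if __name__ == "__main__":
    # Prefix copied from the log output above.
    for path in missing_inputs("./pipeline/cpg_5mc_count/HG001"):
        print(f"missing input: {path}")
```

Running the check first can avoid waiting through the file loading steps only to fail on a missing haplotype file.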
| 53 | + |
| 54 | +## Additional options |
| 55 | +The following parameters control how methylation categories and QC warnings are assigned to regions: |
| 56 | +* `--min-haplotype-coverage {COVERAGE}` - the minimum coverage of a haplotype to consider it "normal" for QC purposes (default: 10) |
| 57 | +* `--min-asm-phased-fraction {FRAC}` - the minimum fraction of CpGs in a region that must be phased to consider AlleleSpecificMethylation (ASM) (default: 0.75) |
| 58 | +* `--min-asm-abs-delta-mean {DELTA}` - the minimum absolute difference between mean haplotype methylation fractions to consider ASM (default: 0.5) |
| 59 | +* `--min-weakasm-abs-delta-mean {DELTA}` - the minimum absolute difference between mean haplotype methylation fractions to label a region with the QC flag WeakASM (default: 0.3) |
| 60 | +* `--max-asm-fishers-exact {P-VALUE}` - the maximum Fisher's exact test p-value to consider ASM (default: 0.01) |
| 61 | +* `--max-unmethylated-combined {FRAC}` - the maximum combined methylation fraction to consider unmethylated status (default: 0.2) |
| 62 | +* `--min-methylated-combined {FRAC}` - the minimum combined methylation fraction to consider methylated status (default: 0.8) |
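
The thresholds above combine into a region label roughly as sketched below in Python. This is a hypothetical re-implementation for illustration, not `methbat`'s actual code. It assumes that ASM is tested before the combined-fraction checks and that the phased fraction is supplied by the caller. It also omits the `NoData` label, which applies when a region has no CpGs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Default values as printed in the quickstart log."""
    min_asm_phased_fraction: float = 0.75
    min_asm_abs_delta_mean: float = 0.5
    max_asm_fishers_exact: float = 0.01
    max_unmethylated_combined: float = 0.2
    min_methylated_combined: float = 0.8


def assign_label(mean_combined, phased_fraction, mean_delta, fishers_pvalue,
                 thresholds=Thresholds()):
    """Assign a methylation category from region-level summary values.

    Assumption: ASM takes precedence over Methylated/Unmethylated; the
    real precedence inside methbat may differ.
    """
    t = thresholds
    is_asm = (
        phased_fraction >= t.min_asm_phased_fraction
        and abs(mean_delta) >= t.min_asm_abs_delta_mean
        and fishers_pvalue <= t.max_asm_fishers_exact
    )
    if is_asm:
        return "AlleleSpecificMethylation"
    if mean_combined >= t.min_methylated_combined:
        return "Methylated"
    if mean_combined <= t.max_unmethylated_combined:
        return "Unmethylated"
    return "Uncategorized"


if __name__ == "__main__":
    # Rounded values from the PLAGL1 row in the report example.
    print(assign_label(0.485, 1.0, -0.734, 0.0))
```

Keeping the thresholds in one immutable object mirrors the command-line options one to one, so a non-default run can be modeled by overriding individual fields.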
| 63 | + |
| 64 | +# Input files |
| 65 | +## Regions file |
| 66 | +The regions file is a CSV/TSV containing region coordinates along with expected methylation categories and anomalous categories. |
| 67 | +Example region files for imprinting regions are provided in the [report_regions data folder](../data/report_regions/). |
| 68 | + |
| 69 | +Fields: |
| 70 | +* `chrom` - the chromosome of the region |
| 71 | +* `start` - the 0-based start of the region, inclusive |
| 72 | +* `end` - the 0-based end of the region, exclusive |
| 73 | +* `cpg_label` - (optional column) a label assigned to the region |
| 74 | +* `expected_category` - the expected methylation category for the region; possible values are: `Uncategorized`, `Methylated`, `Unmethylated`, `AlleleSpecificMethylation` |
| 75 | +* `anomalous_categories` - a semicolon-separated list of methylation categories that are considered anomalous for this region; these categories indicate loss of the expected methylation pattern |
| 76 | + |
| 77 | +Example: |
| 78 | +``` |
| 79 | +chrom start end cpg_label expected_category anomalous_categories |
| 80 | +chr6 144006940 144008751 PLAGL1:alt-TSS-DMR AlleleSpecificMethylation Methylated;Unmethylated |
| 81 | +chr7 50781028 50783615 GRB10:alt-TSS-DMR AlleleSpecificMethylation Methylated;Unmethylated |
| 82 | +chr11 1997581 2003510 H19/IGF2:IG-DMR AlleleSpecificMethylation Methylated;Unmethylated |
| 83 | +... |
| 84 | +``` |
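
A custom regions file can be sanity-checked before it is passed to `--input-regions`. The Python sketch below validates the columns described above. It assumes a tab-delimited file and a hypothetical helper name, `load_regions`. It does not reproduce whatever validation `methbat` itself performs.

```python
import csv
import io

ALLOWED_EXPECTED = {
    "Uncategorized", "Methylated", "Unmethylated", "AlleleSpecificMethylation",
}
REQUIRED_COLUMNS = ("chrom", "start", "end", "expected_category",
                    "anomalous_categories")


def load_regions(handle, delimiter="\t"):
    """Parse a regions file and check the documented field constraints."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    regions = []
    for line_no, row in enumerate(reader, start=2):
        start, end = int(row["start"]), int(row["end"])
        # Coordinates are 0-based and half-open, so start must be below end.
        if not 0 <= start < end:
            raise ValueError(f"line {line_no}: invalid interval {start}-{end}")
        if row["expected_category"] not in ALLOWED_EXPECTED:
            raise ValueError(
                f"line {line_no}: unknown category {row['expected_category']!r}")
        regions.append({
            "chrom": row["chrom"],
            "start": start,
            "end": end,
            "cpg_label": row.get("cpg_label") or "",  # optional column
            "expected_category": row["expected_category"],
            "anomalous_categories":
                [c for c in row["anomalous_categories"].split(";") if c],
        })
    return regions


# First data row of the example above, written with explicit tabs.
EXAMPLE_REGIONS = (
    "chrom\tstart\tend\tcpg_label\texpected_category\tanomalous_categories\n"
    "chr6\t144006940\t144008751\tPLAGL1:alt-TSS-DMR\t"
    "AlleleSpecificMethylation\tMethylated;Unmethylated\n"
)

if __name__ == "__main__":
    for region in load_regions(io.StringIO(EXAMPLE_REGIONS)):
        print(region)
```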
| 85 | + |
| 86 | +# Output files |
| 87 | +## Report files |
| 88 | +The report file is a CSV/TSV containing CpG methylation metrics for the provided regions, along with comparisons against expected methylation patterns. |
| 89 | + |
| 90 | +Fields: |
| 91 | +* `chrom`, `start`, `end` - the region definition, copied from the input region file |
| 92 | +* `cpg_label` - a pass-through of the optional `cpg_label` field from the input file; empty string if not provided |
| 93 | +* `report_summary` - a high level summary label comparing the observed methylation pattern to the expected pattern; possible options are below: |
| 94 | + * `PASS` - the computed category matches the expected category |
| 95 | + * `Inconclusive` - the computed category does not match the expected category or any of the anomalous categories |
| 96 | + * `AnomalousQcWarning` - the computed category matches an anomalous category, but there is a QC warning that should be investigated |
| 97 | + * `Anomalous` - the computed category matches an anomalous category, with no QC warnings |
| 98 | +* `qc_warnings` - a semicolon-separated list of quality control warnings; possible values are below: |
| 99 | + * `PASS` - indicates there are no QC warnings |
| 100 | +  * `LowPhasedCpGs` - the region has a low fraction of phased CpGs (below the minimum ASM phased fraction threshold) |
| 101 | + * `LowHaplotypeCoverage` - one or both haplotypes have low coverage (below the minimum haplotype coverage threshold) |
| 102 | + * `WeakASM` - indicates that there is a weak allele-specific methylation signal (for regions where ASM is expected but not detected) |
| 103 | +* `expected_category` - the expected methylation category for the region, copied from the input file |
| 104 | +* `summary_label` - a summarization of the observed methylation status for this region; possible options are below: |
| 105 | + * `NoData` - indicates no CpGs were found inside the region |
| 106 | + * `Uncategorized` - indicates that CpGs were present, but there was not enough evidence to label this region with any of the following labels |
| 107 | + * `Methylated` - indicates that the combined CpG methylation had a high average methylation rate (by default, >=80%) |
| 108 | + * `Unmethylated` - indicates that the combined CpG methylation had a low average methylation rate (by default, <=20%) |
| 109 | +  * `AlleleSpecificMethylation` - indicates that a sufficient fraction of the CpGs were phased (by default, >=75%) _and_ that ASM was detected through _both_ a significant Fisher's exact test (by default, p <= 0.01) and a sufficient absolute difference in mean methylation between haplotypes 1 and 2 (by default, >= 50% methylation delta) |
| 110 | +* `mean_combined_methyl` - the mean (average) combined methylation ratio; "combined" here indicates that phasing (i.e. haplotypes) is not considered |
| 111 | +* `mean_meth_delta` - the difference in mean methylation ratios between the two haplotypes; `mean_meth_delta = mean_hap2_methyl - mean_hap1_methyl` |
| 112 | +* `mean_hap1_methyl` - the mean (average) methylation ratio for CpGs on haplotype 1 |
| 113 | +* `mean_hap2_methyl` - the mean (average) methylation ratio for CpGs on haplotype 2 |
| 114 | +* `asm_fishers_pvalue` - the raw p-value from a Fisher's exact test comparing the two haplotypes and the number of reads that are methylated/unmethylated |
| 115 | +* `num_phased_cpgs` - the number of CpGs in the region with haplotagged reads on both haplotypes |
| 116 | +* `num_partial_cpgs` - the number of CpGs in the region with haplotagged reads on only one haplotype |
| 117 | +* `num_unphased_cpgs` - the number of CpGs in the region with no haplotagged reads |
| 118 | +* `median_total_coverage` - the median coverage across all CpGs in the region |
| 119 | +* `median_hap1_coverage` - the median coverage for CpGs with haplotype 1 information |
| 120 | +* `median_hap2_coverage` - the median coverage for CpGs with haplotype 2 information |
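
The `report_summary` values described above can be read as a small decision procedure. The Python sketch below is one interpretation of those descriptions, not `methbat`'s implementation. In particular, it assumes that a match with the expected category yields `PASS` regardless of QC warnings. The function name is hypothetical.

```python
def report_summary(observed, expected, anomalous, qc_warnings):
    """Derive a report_summary value from the documented rules.

    observed    -- the computed summary_label for the region
    expected    -- the expected_category from the regions file
    anomalous   -- the list of anomalous categories for the region
    qc_warnings -- the list of QC warnings; ["PASS"] means none
    """
    has_warning = any(warning != "PASS" for warning in qc_warnings)
    if observed == expected:
        return "PASS"
    if observed in anomalous:
        return "AnomalousQcWarning" if has_warning else "Anomalous"
    # Matches neither the expected category nor any anomalous category.
    return "Inconclusive"


if __name__ == "__main__":
    anomalous = ["Methylated", "Unmethylated"]
    print(report_summary("Methylated", "AlleleSpecificMethylation",
                         anomalous, ["LowHaplotypeCoverage"]))
```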
| 121 | + |
| 122 | +Example: |
| 123 | +``` |
| 124 | +chrom start end cpg_label report_summary qc_warnings expected_category summary_label mean_combined_methyl mean_meth_delta mean_hap1_methyl mean_hap2_methyl asm_fishers_pvalue num_phased_cpgs num_partial_cpgs num_unphased_cpgs median_total_coverage median_hap1_coverage median_hap2_coverage |
| 125 | +chr6 144006940 144008751 PLAGL1:alt-TSS-DMR PASS PASS AlleleSpecificMethylation AlleleSpecificMethylation 0.4853123556191952 -0.734328774543706 0.8759114565675648 0.14158268202385849 0.0 143 0 0 32 15 16 |
| 126 | +... |
| 127 | +``` |
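
For downstream review, the report can be filtered to the regions that need attention. The Python sketch below assumes a tab-delimited report and uses hypothetical helper names. It also checks the documented relationship `mean_meth_delta = mean_hap2_methyl - mean_hap1_methyl` on the example row. Rows with empty numeric fields, which may occur for `NoData` regions, are not handled here.

```python
import csv
import io
import math

REPORT_HEADER = [
    "chrom", "start", "end", "cpg_label", "report_summary", "qc_warnings",
    "expected_category", "summary_label", "mean_combined_methyl",
    "mean_meth_delta", "mean_hap1_methyl", "mean_hap2_methyl",
    "asm_fishers_pvalue", "num_phased_cpgs", "num_partial_cpgs",
    "num_unphased_cpgs", "median_total_coverage", "median_hap1_coverage",
    "median_hap2_coverage",
]
# The PLAGL1 row from the example above.
EXAMPLE_ROW = [
    "chr6", "144006940", "144008751", "PLAGL1:alt-TSS-DMR", "PASS", "PASS",
    "AlleleSpecificMethylation", "AlleleSpecificMethylation",
    "0.4853123556191952", "-0.734328774543706", "0.8759114565675648",
    "0.14158268202385849", "0.0", "143", "0", "0", "32", "15", "16",
]
EXAMPLE_REPORT = "\t".join(REPORT_HEADER) + "\n" + "\t".join(EXAMPLE_ROW) + "\n"


def flagged_regions(handle, delimiter="\t"):
    """Yield report rows whose report_summary is anything other than PASS."""
    for row in csv.DictReader(handle, delimiter=delimiter):
        if row["report_summary"] != "PASS":
            yield row


def delta_is_consistent(row, tol=1e-9):
    """Check mean_meth_delta against mean_hap2_methyl - mean_hap1_methyl."""
    expected = float(row["mean_hap2_methyl"]) - float(row["mean_hap1_methyl"])
    return math.isclose(float(row["mean_meth_delta"]), expected, abs_tol=tol)


if __name__ == "__main__":
    for row in flagged_regions(io.StringIO(EXAMPLE_REPORT)):
        print(row["chrom"], row["start"], row["end"], row["report_summary"])
```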
| 128 | + |