Commit 98ffc1f

Merge pull request #38 from PacificBiosciences/features/imprinting_regions
documentation for report mode
2 parents d1a7788 + a3579c8 commit 98ffc1f

8 files changed

+222
-9
lines changed

CHANGELOG.md

Lines changed: 9 additions & 0 deletions
@@ -1,3 +1,12 @@
1+
# v0.17.0
2+
## Changes
3+
- Added new `methbat report` sub-command for analyzing pre-defined regions with known expected methylation patterns (e.g., imprinting regions). The command compares observed methylation patterns against expected patterns, identifying regions with anomalous methylation states and generating quality control warnings. See [report guide](./docs/report_guide.md) for details on usage.
4+
- Added example report region files for GRCh38 imprinting regions in [data/report_regions](./data/report_regions/)
5+
- Added an output header to most output CSV/TSV files that includes the MethBat version, command, and the datetime the command was run. These header lines are prefixed with the '#' character.
6+
7+
## Fixed
8+
- The `profile` output TSV files have been modified such that their column headers do not have the '#' prefix
9+
110
# v0.16.1
211
## Fixed
312
- Fixed an issue where compressed bed files were not recognized by `methbat signature` mode

data/README.md

Lines changed: 4 additions & 1 deletion
@@ -1,5 +1,8 @@
11
# Data
2-
This folder contains data resources for use with MethBat.
2+
This folder contains data resources for use with MethBat. Subfolders:
3+
4+
* [cell_atlas](./cell_atlas/) - Cell atlases that have been pre-configured to work with `methbat deconvolve`
5+
* [report_regions](./report_regions/) - Regions with known methylation patterns that have been pre-configured to work with `methbat report`
36

47
## CpG profiles
58
These files contain background / cohort CpG profiles that can be provided to MethBat to describe coordinates of regions of interest.

data/report_regions/README.md

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
1+
# Report regions
2+
This sub-folder contains region sets that are pre-configured to work with `methbat report`.
3+
We provide brief descriptions of each below, grouped by use case.
4+
5+
## Imprinting regions
6+
The files in this group contain imprinting regions, which are expected to have one allele methylated and the other unmethylated in healthy samples.
7+
Loss of imprinting within these regions is often associated with disease; in affected samples, the regions typically appear fully methylated or fully unmethylated.
8+
While "AlleleSpecificMethylation" is the expected state for healthy samples, local phasing information may not always be available to confirm the ASM.
9+
In these cases, "neutral" methylation states (e.g., ~50% combined methylation) may also be considered normal.
10+
11+
### `hg38_imprinting_targets.tsv`
12+
* Source: These regions are derived from [Table 2](https://clinicalepigeneticsjournal.biomedcentral.com/articles/10.1186/s13148-022-01358-9/tables/2) of [Mackay et al. 2022](http://doi.org/10.1186/s13148-022-01358-9).
13+
* Coordinates: GRCh38
14+
* Tested with:
15+
* >200 cell line samples
16+
* >300 blood samples
17+
* All samples were WGS with at least 25x mean coverage
18+
* Known problem regions:
19+
* `H19/IGF2:IG-DMR` - Some versions of GRCh38 include ALT contigs that pull reads away from this mapping location, leading to significant loss of alignments and increased classification errors.
20+
* `MEG3/DLK1:IG-DMRa` - This region is very short and contains only 3 CpGs in most datasets. While we did not observe issues in blood samples, we found that cell lines have elevated classification errors in this region.
21+
* Classification summary (blood only) excluding problem regions:
22+
* PASS - 4,190 (91.52%); detected as proper ASM
23+
* Inconclusive - 378 (8.26%); could neither confirm nor deny ASM
24+
* AnomalousQcWarning - 8 (0.17%); loss of ASM detected with QC warnings
25+
* Anomalous - 2 (0.04%); loss of ASM detected without QC warnings
26+
27+
### `hg38_imprinting_autoasm.tsv`
28+
* Source: These regions were created by applying `methbat joint-segment` to >300 WGS blood samples with at least 25x mean coverage. They were then overlapped and labeled with the same identifiers from `hg38_imprinting_targets.tsv`. Generally, the coordinates closely match, and we see only marginal differences in the output. However, these regions may be slightly more accurate since they have been derived from HiFi observations.
29+
* Coordinates: GRCh38
30+
* Tested with:
31+
* >200 cell line samples
32+
* >300 blood samples
33+
* All samples were WGS with at least 25x mean coverage
34+
* Known problem regions:
35+
* `H19/IGF2:IG-DMR` - A consistent ASM region was not identified due to ALT contigs pulling reads away from this mapping location. It has been removed from this set.
36+
* `MEG3/DLK1:IG-DMRa` - As with the regions in `hg38_imprinting_targets.tsv`, we found that cell lines have elevated classification errors in this region.
37+
* Classification summary (blood only) excluding problem regions:
38+
* PASS - 4,194 (91.61%); detected as proper ASM
39+
* Inconclusive - 374 (8.17%); could neither confirm nor deny ASM
40+
* AnomalousQcWarning - 8 (0.17%); loss of ASM detected with QC warnings
41+
* Anomalous - 2 (0.04%); loss of ASM detected without QC warnings

data/report_regions/hg38_imprinting_autoasm.tsv

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
1+
chrom start end cpg_label expected_category anomalous_categories
2+
chr6 144006984 144008751 PLAGL1:alt-TSS-DMR AlleleSpecificMethylation Methylated;Unmethylated
3+
chr7 50782012 50783615 GRB10:alt-TSS-DMR AlleleSpecificMethylation Methylated;Unmethylated
4+
chr7 130490280 130493269 MEST:alt-TSS-DMR AlleleSpecificMethylation Methylated;Unmethylated
5+
chr11 2698717 2701029 KCNQ1OT1:TSS-DMR AlleleSpecificMethylation Methylated;Unmethylated
6+
chr14 100811000 100811037 MEG3/DLK1:IG-DMRa AlleleSpecificMethylation Methylated;Unmethylated
7+
chr14 100824207 100827641 MEG3:TSS-DMRb AlleleSpecificMethylation Methylated;Unmethylated
8+
chr15 23647277 23648622 MAGEL2:TSS-DMRb AlleleSpecificMethylation Methylated;Unmethylated
9+
chr15 24954856 24956829 SNURF:TSS-DMR AlleleSpecificMethylation Methylated;Unmethylated
10+
chr16 3442827 3444463 ZNF597:TSS-DMR AlleleSpecificMethylation Methylated;Unmethylated
11+
chr19 56837124 56841903 PEG3:TSS-DMR AlleleSpecificMethylation Methylated;Unmethylated
12+
chr20 58838983 58843218 GNAS-NESP:TSS-DMR AlleleSpecificMethylation Methylated;Unmethylated
13+
chr20 58850593 58852977 GNAS-AS1:TSS-DMR AlleleSpecificMethylation Methylated;Unmethylated
14+
chr20 58854029 58856408 GNAS-XL:Ex1-DMR AlleleSpecificMethylation Methylated;Unmethylated
15+
chr20 58888341 58890145 GNAS A/B:TSS-DMR AlleleSpecificMethylation Methylated;Unmethylated

data/report_regions/hg38_imprinting_targets.tsv

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
1+
chrom start end cpg_label expected_category anomalous_categories
2+
chr6 144006940 144008751 PLAGL1:alt-TSS-DMR AlleleSpecificMethylation Methylated;Unmethylated
3+
chr7 50781028 50783615 GRB10:alt-TSS-DMR AlleleSpecificMethylation Methylated;Unmethylated
4+
chr7 130490280 130494547 MEST:alt-TSS-DMR AlleleSpecificMethylation Methylated;Unmethylated
5+
chr11 1997581 2003510 H19/IGF2:IG-DMR AlleleSpecificMethylation Methylated;Unmethylated
6+
chr11 2698717 2701029 KCNQ1OT1:TSS-DMR AlleleSpecificMethylation Methylated;Unmethylated
7+
chr14 100824186 100827641 MEG3:TSS-DMRb AlleleSpecificMethylation Methylated;Unmethylated
8+
chr14 100811000 100811037 MEG3/DLK1:IG-DMRa AlleleSpecificMethylation Methylated;Unmethylated
9+
chr15 23647277 23648882 MAGEL2:TSS-DMRb AlleleSpecificMethylation Methylated;Unmethylated
10+
chr15 24954856 24956829 SNURF:TSS-DMR AlleleSpecificMethylation Methylated;Unmethylated
11+
chr16 3442827 3444463 ZNF597:TSS-DMR AlleleSpecificMethylation Methylated;Unmethylated
12+
chr19 56837124 56841903 PEG3:TSS-DMR AlleleSpecificMethylation Methylated;Unmethylated
13+
chr20 58838983 58843557 GNAS-NESP:TSS-DMR AlleleSpecificMethylation Methylated;Unmethylated
14+
chr20 58850593 58852978 GNAS-AS1:TSS-DMR AlleleSpecificMethylation Methylated;Unmethylated
15+
chr20 58853849 58856408 GNAS-XL:Ex1-DMR AlleleSpecificMethylation Methylated;Unmethylated
16+
chr20 58888209 58890146 GNAS A/B:TSS-DMR AlleleSpecificMethylation Methylated;Unmethylated

docs/profile_guide.md

Lines changed: 1 addition & 1 deletion
@@ -253,7 +253,7 @@ Fields:
253253

254254
Example (this file has been spliced to show different results from FEMALE v. MALE outputs):
255255
```
256-
#chrom start end baseline_category compare_category summary_comparison zscore_avg_abs_meth_deltas delta_avg_abs_meth_deltas baseline_num_phased compare_num_phased zscore_avg_combined_methyls delta_avg_combined_methyls baseline_num_samples compare_num_samples
256+
chrom start end baseline_category compare_category summary_comparison zscore_avg_abs_meth_deltas delta_avg_abs_meth_deltas baseline_num_phased compare_num_phased zscore_avg_combined_methyls delta_avg_combined_methyls baseline_num_samples compare_num_samples
257257
chr1 28735 29737 FEMALE MALE Uncategorized 0.0783692468300548 0.0003903381431094449 44 29 -2.1384731927805856 -0.0022979451513482282 45 30
258258
chr1 491107 491546 FEMALE MALE InsufficientData 0.0 0 0 -1.299378424859551 -0.044848726779527226 16 9
259259
chr1 143326822 143327608 FEMALE MALE HyperMethylated 0.6439708246041966 0.09808535591965556 9 7 4.217908672989068 0.2262013390038578 45 30

docs/report_guide.md

Lines changed: 128 additions & 0 deletions
@@ -0,0 +1,128 @@
1+
# Report guide
2+
This section of the user guide contains information on using `methbat report` to analyze pre-defined regions with known expected methylation patterns, such as imprinting regions.
3+
This approach extracts the CpGs overlapping the pre-defined regions and aggregates the signal for the region, allowing `methbat` to assign labels such as "Methylated" or "AlleleSpecificMethylation" for the regions.
4+
The `methbat report` command then compares the observed methylation patterns against the expected patterns, identifying regions with anomalous methylation states and generating quality control warnings.
5+
6+
Table of contents:
7+
8+
* [Quickstart](#quickstart)
9+
* [Input files](#input-files)
10+
* [Output files](#output-files)
11+
12+
# Quickstart
13+
The following command will create a methylation report for a single dataset:
14+
15+
```bash
16+
methbat report \
17+
--input-prefix {IN_PREFIX} \
18+
--input-regions {IN_REGIONS} \
19+
--output-report {OUT_REPORT}
20+
```
21+
22+
Parameters:
23+
* `--input-prefix {IN_PREFIX}` - the prefix for the outputs from [pb-CpG-tools](https://github.com/PacificBiosciences/pb-CpG-tools); these outputs contain CpG metrics aggregated at each CpG locus
24+
* `--input-regions {IN_REGIONS}` - the genomic regions of interest with expected methylation categories; example region files are provided in the [report_regions data folder](../data/report_regions/) and the format is [specified below](#regions-file)
25+
* `--output-report {OUT_REPORT}` - the output report file (CSV/TSV)
26+
27+
## Quickstart Example
28+
```
29+
methbat report \
30+
--input-prefix ./pipeline/cpg_5mc_count/HG001 \
31+
--input-regions ./data/report_regions/imprinting_targets.tsv \
32+
--output-report ./output/HG001.report.tsv
33+
34+
[2025-11-21T19:56:58.295Z INFO methbat::cli::report] Input/Output:
35+
[2025-11-21T19:56:58.295Z INFO methbat::cli::report] Input prefix: "./pipeline/cpg_5mc_count/HG001"
36+
[2025-11-21T19:56:58.295Z INFO methbat::cli::report] Input profile regions: "./data/methbat_report/imprinting_targets.tsv"
37+
[2025-11-21T19:56:58.295Z INFO methbat::cli::report] Output report file: "./output/HG001.report.tsv"
38+
[2025-11-21T19:56:58.295Z INFO methbat::cli::report] Labeling heuristics:
39+
[2025-11-21T19:56:58.295Z INFO methbat::cli::report] Minimum haplotype coverage: 10
40+
[2025-11-21T19:56:58.295Z INFO methbat::cli::report] Minimum ASM phased fraction: 0.75
41+
[2025-11-21T19:56:58.295Z INFO methbat::cli::report] Minimum ASM absolute delta mean: 0.5
42+
[2025-11-21T19:56:58.295Z INFO methbat::cli::report] Minimum Weak ASM absolute delta mean: 0.3
43+
[2025-11-21T19:56:58.295Z INFO methbat::cli::report] Maximum ASM Fishers exact p-value: 0.01
44+
[2025-11-21T19:56:58.295Z INFO methbat::cli::report] Maximum unmethylated combined fraction: 0.2
45+
[2025-11-21T19:56:58.295Z INFO methbat::cli::report] Minimum methylated combined fraction: 0.8
46+
[2025-11-21T19:56:58.295Z INFO methbat::cpg_parser] Loading "./pipeline/cpg_5mc_count/HG001.combined.bed"...
47+
[2025-11-21T19:57:29.203Z INFO methbat::cpg_parser] Loading "./pipeline/cpg_5mc_count/HG001.hap1.bed"...
48+
[2025-11-21T19:57:55.854Z INFO methbat::cpg_parser] Loading "./pipeline/cpg_5mc_count/HG001.hap2.bed"...
49+
[2025-11-21T19:58:20.003Z INFO methbat::reporting] Loading "./data/methbat_report/imprinting_targets.tsv"...
50+
[2025-11-21T19:58:20.035Z INFO methbat::writers::report_writer] Saving report results to "./output/HG001.report.tsv"...
51+
[2025-11-21T19:58:20.632Z INFO methbat] Process finished successfully.
52+
```
53+
54+
## Additional options
55+
The following parameters control how methylation categories are assigned to regions:
56+
* `--min-haplotype-coverage {COVERAGE}` - the minimum coverage of a haplotype to consider it "normal" for QC purposes (default: 10)
57+
* `--min-asm-phased-fraction {FRAC}` - the minimum fraction of CpGs in a region that must be phased to consider AlleleSpecificMethylation (ASM) (default: 0.75)
58+
* `--min-asm-abs-delta-mean {DELTA}` - the minimum absolute difference between mean haplotype methylation fractions to consider ASM (default: 0.5)
59+
* `--min-weakasm-abs-delta-mean {DELTA}` - the minimum absolute difference between mean haplotype methylation fractions to label a region with the QC flag WeakASM (default: 0.3)
60+
* `--max-asm-fishers-exact {P-VALUE}` - the maximum Fisher's exact test p-value to consider ASM (default: 0.01)
61+
* `--max-unmethylated-combined {FRAC}` - the maximum combined methylation fraction to consider unmethylated status (default: 0.2)
62+
* `--min-methylated-combined {FRAC}` - the minimum combined methylation fraction to consider methylated status (default: 0.8)
63+
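As a rough illustration of how these thresholds might interact, the following Python sketch assigns a label from per-region aggregates. It is an assumption based on the option descriptions above and the `summary_label` definitions in this guide, not MethBat's actual implementation: the `Thresholds` class, the `summary_label` function, its argument names, and the order in which the checks run are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical container; default values mirror the defaults shown in this guide."""
    min_asm_phased_fraction: float = 0.75
    min_asm_abs_delta_mean: float = 0.5
    max_asm_fishers_exact: float = 0.01
    max_unmethylated_combined: float = 0.2
    min_methylated_combined: float = 0.8


def summary_label(
    num_cpgs: int,
    phased_fraction: float,
    mean_combined: float,
    mean_delta: Optional[float],
    fishers_p: Optional[float],
    t: Thresholds = Thresholds(),
) -> str:
    """Assign a label to one region (assumed check order: NoData, ASM, combined fraction)."""
    if num_cpgs == 0:
        return "NoData"

    # ASM requires enough phased CpGs AND both a large haplotype delta
    # and a significant Fisher's exact test.
    is_asm = (
        mean_delta is not None
        and fishers_p is not None
        and phased_fraction >= t.min_asm_phased_fraction
        and abs(mean_delta) >= t.min_asm_abs_delta_mean
        and fishers_p <= t.max_asm_fishers_exact
    )
    if is_asm:
        return "AlleleSpecificMethylation"

    # Otherwise fall back to the combined (unphased) methylation fraction.
    if mean_combined >= t.min_methylated_combined:
        return "Methylated"
    if mean_combined <= t.max_unmethylated_combined:
        return "Unmethylated"
    return "Uncategorized"
```

Under these assumptions, a fully phased region with a haplotype delta of about -0.73 and a p-value near zero would be labeled `AlleleSpecificMethylation`, while an unphased region with 95% combined methylation would be labeled `Methylated`.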
64+
# Input files
65+
## Regions file
66+
The regions file is a CSV/TSV containing region coordinates along with expected methylation categories and anomalous categories.
67+
Example region files for imprinting regions are provided in the [report_regions data folder](../data/report_regions/).
68+
69+
Fields:
70+
* `chrom` - the chromosome of the region
71+
* `start` - the 0-based start of the region, inclusive
72+
* `end` - the 0-based end of the region, exclusive
73+
* `cpg_label` - (optional column) a label assigned to the region
74+
* `expected_category` - the expected methylation category for the region; possible values are: `Uncategorized`, `Methylated`, `Unmethylated`, `AlleleSpecificMethylation`
75+
* `anomalous_categories` - a semicolon-separated list of methylation categories that are considered anomalous for this region; these categories indicate loss of the expected methylation pattern
76+
77+
Example:
78+
```
79+
chrom start end cpg_label expected_category anomalous_categories
80+
chr6 144006940 144008751 PLAGL1:alt-TSS-DMR AlleleSpecificMethylation Methylated;Unmethylated
81+
chr7 50781028 50783615 GRB10:alt-TSS-DMR AlleleSpecificMethylation Methylated;Unmethylated
82+
chr11 1997581 2003510 H19/IGF2:IG-DMR AlleleSpecificMethylation Methylated;Unmethylated
83+
...
84+
```
85+
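For readers who want to build or validate a regions file programmatically, the following Python sketch reads the fields described above. It assumes a tab-delimited file with a header row; the `parse_regions` function and the dictionary layout are hypothetical conveniences, not part of MethBat.

```python
import csv
import io


def parse_regions(handle, delimiter="\t"):
    """Parse a regions file into a list of dicts (hypothetical helper)."""
    regions = []
    for row in csv.DictReader(handle, delimiter=delimiter):
        regions.append({
            "chrom": row["chrom"],
            # Coordinates are 0-based and half-open: [start, end)
            "start": int(row["start"]),
            "end": int(row["end"]),
            # cpg_label is optional; use an empty string when absent
            "cpg_label": row.get("cpg_label") or "",
            "expected_category": row["expected_category"],
            # Semicolon-separated list, e.g. "Methylated;Unmethylated"
            "anomalous_categories": [
                c for c in row["anomalous_categories"].split(";") if c
            ],
        })
    return regions


example = (
    "chrom\tstart\tend\tcpg_label\texpected_category\tanomalous_categories\n"
    "chr6\t144006940\t144008751\tPLAGL1:alt-TSS-DMR\t"
    "AlleleSpecificMethylation\tMethylated;Unmethylated\n"
)
regions = parse_regions(io.StringIO(example))
# Because the interval is half-open, the region length is simply end - start.
region_length = regions[0]["end"] - regions[0]["start"]
```

Passing an open file handle instead of `io.StringIO` works the same way, since `csv.DictReader` accepts any iterable of lines.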
86+
# Output files
87+
## Report files
88+
The report is a CSV/TSV file containing CpG methylation metrics for the provided regions, along with comparisons against expected methylation patterns.
89+
90+
Fields:
91+
* `chrom`, `start`, `end` - the region definition, copied from the input region file
92+
* `cpg_label` - a pass-through of the optional `cpg_label` field from the input file; empty string if not provided
93+
* `report_summary` - a high level summary label comparing the observed methylation pattern to the expected pattern; possible options are below:
94+
* `PASS` - the computed category matches the expected category
95+
* `Inconclusive` - the computed category matches neither the expected category nor any of the anomalous categories
96+
* `AnomalousQcWarning` - the computed category matches an anomalous category, but there is a QC warning that should be investigated
97+
* `Anomalous` - the computed category matches an anomalous category, with no QC warnings
98+
* `qc_warnings` - a semicolon-separated list of quality control warnings; possible values are below:
99+
* `PASS` - indicates there are no QC warnings
100+
* `LowPhasedCpGs` - the region has a low fraction of phased CpGs (below the minimum phased fraction threshold)
101+
* `LowHaplotypeCoverage` - one or both haplotypes have low coverage (below the minimum haplotype coverage threshold)
102+
* `WeakASM` - indicates that there is a weak allele-specific methylation signal (for regions where ASM is expected but not detected)
103+
* `expected_category` - the expected methylation category for the region, copied from the input file
104+
* `summary_label` - a summary of the observed methylation status for this region; possible options are below:
105+
* `NoData` - indicates no CpGs were found inside the region
106+
* `Uncategorized` - indicates that CpGs were present, but there was not enough evidence to label this region with any of the following labels
107+
* `Methylated` - indicates that the combined CpG methylation had a high average methylation rate (by default, >=80%)
108+
* `Unmethylated` - indicates that the combined CpG methylation had a low average methylation rate (by default, <=20%)
109+
* `AlleleSpecificMethylation` - indicates that a sufficient fraction of the CpGs were phased (by default, >=75%) _and_ that ASM was detected through _both_ a significant Fisher's exact test (by default, p <= 0.01) and difference in mean methylation for haplotypes 1 and 2 (by default, >= 50% methylation delta)
110+
* `mean_combined_methyl` - the mean (average) combined methylation ratio; "combined" here indicates that phasing (i.e. haplotypes) is not considered
111+
* `mean_meth_delta` - the difference in mean methylation ratios between the two haplotypes; `mean_meth_delta = mean_hap2_methyl - mean_hap1_methyl`
112+
* `mean_hap1_methyl` - the mean (average) methylation ratio for CpGs on haplotype 1
113+
* `mean_hap2_methyl` - the mean (average) methylation ratio for CpGs on haplotype 2
114+
* `asm_fishers_pvalue` - the raw p-value from a Fisher's exact test comparing the two haplotypes and the number of reads that are methylated/unmethylated
115+
* `num_phased_cpgs` - the number of CpGs in the region with haplotagged reads on both haplotypes
116+
* `num_partial_cpgs` - the number of CpGs in the region with haplotagged reads on only one haplotype
117+
* `num_unphased_cpgs` - the number of CpGs in the region with no haplotagged reads
118+
* `median_total_coverage` - the median coverage across all CpGs in the region
119+
* `median_hap1_coverage` - the median coverage for CpGs with haplotype 1 information
120+
* `median_hap2_coverage` - the median coverage for CpGs with haplotype 2 information
121+
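The relationship between `summary_label`, `expected_category`, the anomalous categories from the input file, and `qc_warnings` can be sketched as follows. This is a minimal reading of the `report_summary` definitions above, not MethBat's actual code; the function name and signature are hypothetical.

```python
def report_summary(summary_label, expected_category, anomalous_categories, qc_warnings):
    """Derive the high-level report label for one region (hypothetical helper).

    anomalous_categories and qc_warnings are lists, i.e. the
    semicolon-separated fields already split on ';'.
    """
    if summary_label == expected_category:
        return "PASS"
    if summary_label in anomalous_categories:
        # A qc_warnings value of "PASS" means there are no QC warnings.
        has_warning = any(w != "PASS" for w in qc_warnings)
        return "AnomalousQcWarning" if has_warning else "Anomalous"
    # Neither the expected category nor a listed anomalous one.
    return "Inconclusive"
```

For an imprinting region that expects `AlleleSpecificMethylation` and lists `Methylated;Unmethylated` as anomalous, an observed `Uncategorized` label would therefore be reported as `Inconclusive` under this reading.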
122+
Example:
123+
```
124+
chrom start end cpg_label report_summary qc_warnings expected_category summary_label mean_combined_methyl mean_meth_delta mean_hap1_methyl mean_hap2_methyl asm_fishers_pvalue num_phased_cpgs num_partial_cpgs num_unphased_cpgs median_total_coverage median_hap1_coverage median_hap2_coverage
125+
chr6 144006940 144008751 PLAGL1:alt-TSS-DMR PASS PASS AlleleSpecificMethylation AlleleSpecificMethylation 0.4853123556191952 -0.734328774543706 0.8759114565675648 0.14158268202385849 0.0 143 0 0 32 15 16
126+
...
127+
```
128+
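When loading a report for downstream analysis, it may help to skip any `#`-prefixed header lines and to sanity-check the documented relationship `mean_meth_delta = mean_hap2_methyl - mean_hap1_methyl`. The Python sketch below does both. It assumes a tab-delimited report; the `read_report` helper is hypothetical, the `#` line in the sample text is a placeholder rather than real MethBat output, and only a subset of the report columns is shown.

```python
import csv
import io
import math


def read_report(handle, delimiter="\t"):
    """Read a report into a list of dicts, skipping '#'-prefixed lines (hypothetical helper)."""
    data_lines = (line for line in handle if not line.startswith("#"))
    return list(csv.DictReader(data_lines, delimiter=delimiter))


# Values for the data row are copied from the example above.
report_text = (
    "# placeholder header line; the real header content may differ\n"
    "chrom\tstart\tend\tcpg_label\treport_summary\t"
    "mean_meth_delta\tmean_hap1_methyl\tmean_hap2_methyl\n"
    "chr6\t144006940\t144008751\tPLAGL1:alt-TSS-DMR\tPASS\t"
    "-0.734328774543706\t0.8759114565675648\t0.14158268202385849\n"
)

rows = read_report(io.StringIO(report_text))
row = rows[0]

# Recompute the delta from the per-haplotype means and compare with the reported value.
delta = float(row["mean_hap2_methyl"]) - float(row["mean_hap1_methyl"])
consistent = math.isclose(delta, float(row["mean_meth_delta"]), abs_tol=1e-9)
```

A negative delta, as in this row, indicates that haplotype 1 is the more methylated allele.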
