|
| 1 | +MatchIds are used to tie base/comparison calls together in post-processing for debugging or other exploring. MatchIds have a structure of `{chunkid}.{callid}`. The chunkid is unique id per-chunk of calls. All calls sharing chunkid were within `--chunksize` distance and were compared. The callid is unique to a call in a chunk for each VCF. Because `bench` processes two VCFs (the base and comparison VCFs), the `MatchId` has two values: the first is the base variant's MatchId and the second the comparison variant's MatchId. |
| 2 | + |
| 3 | +For `--pick single`, the two MatchIds will be identical in the e.g. tp-base.vcf.gz and tp-comp.vcf.gz. However, for `--pick ac|multi`, it's possible to have cases such as one base variant matching to multiple comparison variants. That would give us MatchIds like: |
| 4 | + |
| 5 | +``` |
| 6 | +# tp-base.vcf |
| 7 | +MatchId=4.0,4.1 |
| 8 | +
|
| 9 | +# tp-comp.vcf |
| 10 | +MatchId=4.0,4.1 |
| 11 | +MatchId=4.0,4.2 |
| 12 | +``` |
| 13 | + |
| 14 | +This example tells us that the tp-comp variants are both pointing to `4.0` in tp-base. The tp-base variant has a higher match to the tp-comp `4.1` variant. |
| 15 | + |
| 16 | +One easy way to combine matched variants is to use `truvari vcf2df` to convert a benchmarking result to a pandas DataFrame and leverage pandas' merge operation. First, we convert the `truvari bench` result. |
| 17 | + |
| 18 | +```bash |
| 19 | +truvari vcf2df --info --bench-dir bench_result/ data.jl |
| 20 | +``` |
| 21 | + |
| 22 | +Next, we combine rows of matched variants: |
| 23 | +```python |
| 24 | +import joblib |
| 25 | +import pandas as pd |
| 26 | + |
| 27 | +# Load the data |
| 28 | +data = joblib.load("data.jl") |
| 29 | + |
| 30 | +# Separate out the variants from the base VCF and add new columns of the base/comp ids |
| 31 | +base = data[data['state'].isin(['tpbase', 'fn'])].copy() |
| 32 | +base['base_id'] = base['MatchId'].apply(lambda x: x[0]) |
| 33 | +base['comp_id'] = base['MatchId'].apply(lambda x: x[1]) |
| 34 | + |
| 35 | +# Separate out the variants from the comparison VCF and add new columns of the base/comp ids |
| 36 | +comp = data[data['state'].isin(['tp', 'fp'])].copy() |
| 37 | +comp['base_id'] = comp['MatchId'].apply(lambda x: x[0]) |
| 38 | +comp['comp_id'] = comp['MatchId'].apply(lambda x: x[1]) |
| 39 | + |
| 40 | +# Merge the base/comparison variants |
| 41 | +combined = pd.merge(base, comp, left_on='base_id', right_on='comp_id', suffixes=('_base', '_comp')) |
| 42 | + |
| 43 | +# How many comp variants matched to multiple base variants? |
| 44 | +counts1 = combined['base_id_comp'].value_counts() |
| 45 | +print('multi-matched comp count', (counts1 != 1).sum()) |
| 46 | + |
| 47 | +# How many base variants matched to multiple comp variants? |
| 48 | +counts2 = combined['comp_id_base'].value_counts() |
| 49 | +print('multi-matched base count', (counts2 != 1).sum()) |
| 50 | +``` |
| 51 | + |
| 52 | +The `MatchId` is also used by `truvari collapse`. However there are two differences. First, in the main `collapse` output, the relevant INFO field is named `CollapsedId`. Second, because collapse only has a single input VCF, it is much easier to merge DataFrames. To merge collapse results kept variants with those that were removed, we again need to convert the VCFs to DataFrames: |
| 53 | + |
| 54 | +```bash |
| 55 | +truvari vcf2df -i kept.vcf.gz kept.jl |
| 56 | +truvari vcf2df -i removed.vcf.gz remov.jl |
| 57 | +``` |
| 58 | + |
| 59 | +Then we combine them: |
| 60 | +```python |
| 61 | +import joblib |
| 62 | +import pandas as pd |
| 63 | + |
| 64 | +# Load the kept variants and set the index. |
| 65 | +kept = joblib.load("kept.jl").set_index('CollapseId') |
| 66 | + |
| 67 | +# Load the removed variants and set the index. |
| 68 | +remov = joblib.load("remov.jl") |
| 69 | +remov['CollapseId'] = remov['MatchId'].apply(lambda x: x[0]) |
| 70 | +remov.set_index('CollapseId', inplace=True) |
| 71 | + |
| 72 | +# Join the two sets of variants |
| 73 | +result_df = kept.join(remov, how='right', rsuffix='_removed') |
| 74 | +``` |
0 commit comments