Productionize LD code for gnomAD v4 SNVs/Indels#634
Conversation
|
okay I'm actually okay attaching my name to this now! Ready for review |
| pop_freq = pop_mt.freq[meta_index] | ||
| pop_mt = pop_mt.annotate_rows(pop_freq=pop_freq) | ||
|
|
||
| pop_mt = pop_mt.filter_rows((hl.len(pop_mt.filters) == 0)) |
There was a problem hiding this comment.
Minor comment: this line duplicates line 88
| pop_freq = pop_mt.freq[meta_index] | ||
| pop_mt = pop_mt.annotate_rows(pop_freq=pop_freq) | ||
|
|
||
| pop_mt = pop_mt.filter_rows((hl.len(pop_mt.filters) == 0)) |
There was a problem hiding this comment.
More of a note for future reference:
In my case this line of filter is replaced with a set of filters below,
- Entries: high quality variants (~VQSR)
- Entries: adj filters
- Rows: hl.agg.any(mt.GT.n_alt_alleles() > 0)
| return pop_mt | ||
|
|
||
|
|
||
| def generate_ld_pruned_set( |
There was a problem hiding this comment.
Just a note to confirm, this function is not needed for the purpose of computing LD scores.
| ), | ||
| overwrite, | ||
| ) | ||
| ld = hl.ld_matrix(pop_mt.GT.n_alt_alleles(), pop_mt.locus, radius) |
There was a problem hiding this comment.
In my old script, I ran BlockMatrix.write_from_entry_expr( mt.GT.n_alt_alleles(), tmp_bm_path, mean_impute=True, center=False, normalize=False, overwrite=args.overwrite ), wondering how much difference this will introduce
| ) | ||
| ld = hl.ld_matrix(pop_mt.GT.n_alt_alleles(), pop_mt.locus, radius) | ||
| if data_type != "genomes_snv_sv": | ||
| ld = ld.sparsify_triangle() |
There was a problem hiding this comment.
question: any thoughts in why this shouldn't be applied to all cases?
|
|
||
| l2row = r2_adj.sum(axis=0).T | ||
| l2col = r2_adj.sum(axis=1) | ||
| l2 = l2row + l2col + 1 |
There was a problem hiding this comment.
one more note, I had this line as
r2_diag = checkpoint_tmp(r2_adj.diagonal()).T
l2 = l2row + l2col - r2_diag
still pretty slow checkpointing chr22 before join/annotation , and does not work for chr1 (stalls out)
| mt = mt.select_cols(mt.meta) | ||
| mt = mt.select_rows() | ||
| logger.info(f"Filtering to {args.pop}...") | ||
| mt = mt.filter_cols(mt.meta.population_inference.pop == pop) |
There was a problem hiding this comment.
We did also improve things by adding in this option to the exome vds loading function filter_samples_ht=meta_ht
And we added the naive_coalesce to the loading function too
|
|
||
| # NOTE: QUESTION IF THIS SHOULD BE RAN OR NOT | ||
| logger.info(f"Checkpointing {pop} on {chrom if chrom else default_str}") | ||
| mt = mt.checkpoint(new_temp_file()) |
There was a problem hiding this comment.
Agree that this checkpoint is good since the MT is filtered, and this will improve the LD computation
Code for LD, pending VDS performance investigation
Ported code from gnomAD v2. Not running on SVs or for cross-pop analyses