Description
We should develop a protocol that better validates mapping output. Current mapping output looks generally good, but we do not validate it against sequences whose expected mapping is already known. Doing this sort of validation on arbitrary sequences would greatly improve internal confidence in mapping results. Some aspects of this might be manual, others could be programmatic, and others might involve additional visibility into the mapping process.
Manual work might involve spot checking mapping output and ensuring the target is mapped to the correct region of the reference and that mapped variant HGVS strings are in the correct position and have the correct ref/alt information.
Programmatic work might include the development of a 'reverse mapper'. This mapper would take mapped variants as its input and reconstruct the sequence they were derived from. The reconstructed sequence could then be compared to the original target sequence and validated. We might also consider validating mapping results by selecting arbitrary sequences from the reference and sending them through the mapper to ensure they map to the transcript/region we expect. We could additionally saturate such a sequence with variants to validate variant output. We might also consider modifying these sequences slightly, as target sequences are not guaranteed to fully match the reference sequence they are based on.
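As a rough illustration of the round trip described above, here is a minimal sketch, not the mapper's actual interface. It assumes a target taken verbatim from a toy reference, variants limited to single-base substitutions, and 0-based reference coordinates. `Substitution`, `saturate`, and `reverse_map` are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Substitution:
    """Hypothetical, simplified mapped variant on 0-based reference coordinates."""
    pos: int
    ref: str
    alt: str


def saturate(target: str, offset: int) -> list[tuple[str, Substitution]]:
    """Pair every single-base substitution of `target` with the mapped variant we expect.

    `offset` is the 0-based position in the reference where the target starts.
    """
    pairs = []
    for i, base in enumerate(target):
        for alt in "ACGT":
            if alt != base:
                mutated = target[:i] + alt + target[i + 1:]
                pairs.append((mutated, Substitution(offset + i, base, alt)))
    return pairs


def reverse_map(reference: str, offset: int, length: int, variant: Substitution) -> str:
    """Rebuild the variant target sequence from the reference and a mapped substitution."""
    i = variant.pos - offset
    if not 0 <= i < length:
        raise ValueError(f"position {variant.pos} falls outside the mapped region")
    if reference[variant.pos] != variant.ref:
        raise ValueError(f"ref allele {variant.ref!r} does not match the reference")
    window = reference[offset:offset + length]
    return window[:i] + variant.alt + window[i + 1:]


# Toy data: the target is a slice of the reference, so the expected mapping is known.
reference = "TTGACCATGGCA"
offset, length = 3, 6
target = reference[offset:offset + length]

round_trips = [
    reverse_map(reference, offset, length, variant) == mutated
    for mutated, variant in saturate(target, offset)
]
```

In a real check, the `Substitution` values would come from the mapper's output rather than from `saturate`, and a failed comparison or a raised `ValueError` would mark a mapping for manual review.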
Visibility and reliability work might include adding flags for known issues (e.g., del-ins variants) and rejecting mappings if they trip a flag. This would also include alerting so the team can investigate the mapping output. It might also be worth investing time in making it easier to run the mapper with test input, rather than only as a step of the score set processing routine. We should also add more fully featured logging to the mapper, along with canonical log records.
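A minimal sketch of what flagging plus a canonical log record could look like, assuming variants arrive as HGVS strings and that a del-ins variant can be recognised by the `delins` keyword HGVS uses for deletion-insertions. The flag registry, logger name, record fields, and the `review_mapping` function are all hypothetical and chosen only for illustration.

```python
import json
import logging

logger = logging.getLogger("mapper.validation")  # hypothetical logger name

# Hypothetical registry of known-issue flags: flag name -> predicate on an HGVS string.
KNOWN_ISSUE_FLAGS = {
    "delins": lambda hgvs: "delins" in hgvs,
}


def flag_variants(hgvs_strings: list[str]) -> dict[str, list[str]]:
    """Return the flags tripped by each variant, omitting variants with no flags."""
    flagged = {}
    for hgvs in hgvs_strings:
        flags = [name for name, check in KNOWN_ISSUE_FLAGS.items() if check(hgvs)]
        if flags:
            flagged[hgvs] = flags
    return flagged


def review_mapping(score_set_id: str, hgvs_strings: list[str]) -> dict:
    """Reject a mapping if any variant trips a flag, and emit one canonical log line."""
    flagged = flag_variants(hgvs_strings)
    record = {
        "event": "mapping_reviewed",
        "score_set": score_set_id,
        "variant_count": len(hgvs_strings),
        "flagged_count": len(flagged),
        "accepted": not flagged,
    }
    # One JSON line per mapping keeps the record easy to search and alert on.
    logger.info(json.dumps(record, sort_keys=True))
    return record


# Placeholder identifiers, not real score sets or variants.
record = review_mapping("example-score-set", ["c.5A>G", "c.7_8delinsTT"])
```

An alerting hook could key off `accepted` being false in the emitted record, so rejected mappings reach the team without anyone needing to read full mapper logs.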
As part of this work, we might consider integrating upstream mapper changes made by the Wagner lab as a way of reducing the work we must do and ensuring we are up to date on any improvements they have made.
Some things that might be worth looking into specifically:
- Sequences which have many alternatively spliced transcripts
- Intronic variants and regions
- Mappings (especially in cases where targets overlap pseudogenes)
- Transcript selection and the generation of variants with different HGVS structure than a pre-mapped variant
- Thresholds for toleration of target and reference differences. See "Genetic background information" (mavedb-api#83)
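
For the last item, one simple way a tolerance threshold could be expressed is as a minimum fraction of matching bases. This is a sketch only: it assumes the target and the reference window are already aligned without gaps and have equal length. The `0.95` cutoff is an arbitrary placeholder, not a recommended value, and `identity` and `within_tolerance` are hypothetical helpers.

```python
MIN_IDENTITY = 0.95  # placeholder cutoff; the real threshold is an open question


def identity(target: str, reference_window: str) -> float:
    """Fraction of positions at which two equal-length, ungapped sequences agree."""
    if not target or len(target) != len(reference_window):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(target, reference_window))
    return matches / len(target)


def within_tolerance(target: str, reference_window: str,
                     min_identity: float = MIN_IDENTITY) -> bool:
    """True if the target is close enough to the reference to accept the mapping."""
    return identity(target, reference_window) >= min_identity
```

Handling insertions and deletions between target and reference would need a real alignment step, which this sketch deliberately leaves out.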