-
Notifications
You must be signed in to change notification settings - Fork 0
Multi-target and accession-based mapping #38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Note that this is mostly structured to handle multi-target mapping, but does not contain all changes required for multi-target mapping (specifically final output / reference sequence structure). Such changes would require corresponding changes in mavedb-api which we are not prepared to deploy yet.
bencap
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the work on this Sally, I know it was a lot to fit these new features into how this was originally built. I left a decent amount of comments, but all of them are pretty minor and that mostly stems from these being a large set of changes. Although I didn't personally run mapping jobs while testing these, it seems to me like the capability to map all of our pillar project data sets is a really reasonable test set.
Lets try and translate the TODOs here into the dcd mapping backlog. I tried to comment on most of them but might have missed a few and/or skipped some duplicates. That backlog isn't super fleshed out yet so it'd be good to get some items in there.
I think you were planning on fixing the tests before merging, but feel free to merge these changes once you do.
| @router.post(path="/map/{urn}", status_code=200, response_model=ScoresetMapping) | ||
| @with_mavedb_score_set | ||
| async def map_scoreset(urn: str, store_path: Path | None = None) -> ScoresetMapping: | ||
| """Perform end-to-end mapping for a scoreset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed this function is a little inconsistent with its return types. Sometimes it returns a ScoresetMapping, and other times it returns a JSONResponse containing a ScoresetMapping. We should probably make that consistent and choose one or the other.
src/api/routers/map.py
Outdated
| # TODO this should instead check if all values in dict are none. or might not need this at all. | ||
| if vrs_results is None or len(vrs_results) == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, since vrs_results can never be None now (we assign an empty dict to it up front), this should probably be checking to make sure vrs_results has items in it and that at least one of those values is not None.
| # TODO this should instead check if all values in dict are none. or might not need this at all. | |
| if vrs_results is None or len(vrs_results) == 0: | |
| if not vrs_results or all(mapping_result is None for mapping_result in vrs_results.values()): |
src/api/routers/map.py
Outdated
| # TODO this should instead check if all values in dict are none. or might not need this at all. | ||
| if annotated_vrs_results is None or len(annotated_vrs_results) == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as prior comment
src/dcd_mapping/align.py
Outdated
| output = read_blat(out_file, "blat-psl") | ||
| output = parse_blat(out_file, "blat-psl") | ||
|
|
||
| # TODO reevaluate this code block - are there cases in mavedb where target sequence type is incorrectly supplied? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would hope not :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it's possible any of the weird mapping outputs we have seen are coming from cases where we enter into this block? Or do you think we never enter this block with existing data.
src/dcd_mapping/align.py
Outdated
| def _get_blat_output(metadata: ScoresetMetadata, silent: bool) -> Any: # noqa: ANN401 | ||
| """Run a BLAT query and returns a path to the output object. | ||
| If unable to produce a valid query the first time, then try a query using ``dnax`` | ||
| bases. | ||
| :param scoreset_metadata: object containing scoreset attributes | ||
| :param silent: suppress BLAT command output | ||
| :return: BLAT query result | ||
| :return: dict where keys are target gene identifiers and values are BLAT query result objects |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should still be able to type this return value with dict[str, QueryResult].
src/dcd_mapping/main.py
Outdated
| # TODO this should be if all values in dict are none. or might not need this at all. | ||
| if annotated_vrs_results is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same thing here
| # TODO this should be if all values in dict are none. or might not need this at all. | |
| if annotated_vrs_results is None: | |
| if not vrs_results: |
src/dcd_mapping/main.py
Outdated
| metadata.urn, | ||
| vrs_version, | ||
| ) | ||
| except Exception as e: # TODO create AnnotationError class and replace ValueErrors in annotation steps with AnnotationErrors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets add a backlog item for this
src/dcd_mapping/mavedb_data.py
Outdated
|
|
||
| for gene in metadata["targetGenes"]: | ||
| if not _metadata_response_is_human(metadata): | ||
| # TODO allow score sets with mix of human and non-human targets? This may not come up, but is doable with a little restructuring. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Backlog item
src/dcd_mapping/mavedb_data.py
Outdated
| # Should we quit the whole mapping job if this comes up, or just skip this row and only quit if none contain hgvs_nt or hgvs_pro? | ||
| msg = f"Each score row in {metadata.urn} must contain hgvs_nt or hgvs_pro variant description " | ||
| raise ScoresetNotSupportedError(msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question here. I think the answer to this is that we want a guarantee of at least a pre-mapped variant for every variant in a mapping query. If we can't provide that, we must return some sort of error like we do at present.
| if row["hgvs_nt"] != "NA": | ||
| # TODO check assumption of no colon in hgvs unless reference sequence identifier present | ||
| prefix = row["hgvs_nt"].split(":")[0] if ":" in row["hgvs_nt"] else None | ||
| elif row["hgvs_pro"] != "NA": | ||
| # TODO check assumption of no colon in hgvs unless reference sequence identifier present | ||
| prefix = ( | ||
| row["hgvs_pro"].split(":")[0] if ":" in row["hgvs_pro"] else None | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this assumption is correct.
|
Note to self - bump mapper version |
This change also checks that the number of mappings for a score set is greater than 0 and returns an error if not.
… and return an error if not. This change also removes a TODO for which there is now a backlog entry.
BLAT automatically removes certain characters from query names, including removing all characters after a space. If the BLAT result name does not match any target genes in the score set, attempt to match based on BLAT's query name shortening patterns. If multiple matches (could happen if labels are something like "Gene 1" and "Gene 2", in which case both would be shortened to "Gene"), raise an error.
Although cdot fetching is not currently directly used in any tests, the cdot REST data provider is imported into dcd_mapping.vrs_map. The cdot data provider uses a ChainedSeqFetcher, to which specific fasta file paths are passed in dcd_mapping.lookup. This results in a FileNotFoundError during tests, since these fasta files are not available outside of the mapper Docker container. This fix uses a fake fasta file to generate a ChainedSeqFetcher, so this change does not support actual cdot transcript fetches for future tests.
No description provided.