Skip to content

Conversation

@sallybg
Copy link
Collaborator

@sallybg sallybg commented May 22, 2025

No description provided.

@sallybg sallybg requested a review from bencap May 22, 2025 21:06
Copy link
Collaborator

@bencap bencap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the work on this Sally, I know it was a lot to fit these new features into how this was originally built. I left a decent amount of comments, but all of them are pretty minor and that mostly stems from these being a large set of changes. Although I didn't personally run mapping jobs while testing these, it seems to me like the capability to map all of our pillar project data sets is a really reasonable test set.

Lets try and translate the TODOs here into the dcd mapping backlog. I tried to comment on most of them but might have missed a few and/or skipped some duplicates. That backlog isn't super fleshed out yet so it'd be good to get some items in there.

I think you were planning on fixing the tests before merging, but feel free to merge these changes once you do.

@router.post(path="/map/{urn}", status_code=200, response_model=ScoresetMapping)
@with_mavedb_score_set
async def map_scoreset(urn: str, store_path: Path | None = None) -> ScoresetMapping:
"""Perform end-to-end mapping for a scoreset.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed this function is a little inconsistent with its return types. Sometimes it returns a ScoresetMapping, and other times it returns a JSONResponse containing a ScoresetMapping. We should probably make that consistent and choose one or the other.

Comment on lines 119 to 120
# TODO this should instead check if all values in dict are none. or might not need this at all.
if vrs_results is None or len(vrs_results) == 0:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, since vrs_results can never be None now (we assign an empty dict to it up front), this should probably be checking to make sure vrs_results has items in it and that at least one of those values is not None.

Suggested change
# TODO this should instead check if all values in dict are none. or might not need this at all.
if vrs_results is None or len(vrs_results) == 0:
if not vrs_results or all(mapping_result is None for mapping_result in vrs_results.values()):

Comment on lines 142 to 143
# TODO this should instead check if all values in dict are none. or might not need this at all.
if annotated_vrs_results is None or len(annotated_vrs_results) == 0:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as prior comment

output = read_blat(out_file, "blat-psl")
output = parse_blat(out_file, "blat-psl")

# TODO reevaluate this code block - are there cases in mavedb where target sequence type is incorrectly supplied?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would hope not :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it's possible any of the weird mapping outputs we have seen are coming from cases where we enter into this block? Or do you think we never enter this block with existing data.

Comment on lines 170 to 178
def _get_blat_output(metadata: ScoresetMetadata, silent: bool) -> Any: # noqa: ANN401
"""Run a BLAT query and returns a path to the output object.
If unable to produce a valid query the first time, then try a query using ``dnax``
bases.
:param scoreset_metadata: object containing scoreset attributes
:param silent: suppress BLAT command output
:return: BLAT query result
:return: dict where keys are target gene identifiers and values are BLAT query result objects
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should still be able to type this return value with dict[str, QueryResult].

Comment on lines 276 to 277
# TODO this should be if all values in dict are none. or might not need this at all.
if annotated_vrs_results is None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing here

Suggested change
# TODO this should be if all values in dict are none. or might not need this at all.
if annotated_vrs_results is None:
if not vrs_results:

metadata.urn,
vrs_version,
)
except Exception as e: # TODO create AnnotationError class and replace ValueErrors in annotation steps with AnnotationErrors
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets add a backlog item for this


for gene in metadata["targetGenes"]:
if not _metadata_response_is_human(metadata):
# TODO allow score sets with mix of human and non-human targets? This may not come up, but is doable with a little restructuring.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Backlog item

Comment on lines 262 to 264
# Should we quit the whole mapping job if this comes up, or just skip this row and only quit if none contain hgvs_nt or hgvs_pro?
msg = f"Each score row in {metadata.urn} must contain hgvs_nt or hgvs_pro variant description "
raise ScoresetNotSupportedError(msg)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question here. I think the answer to this is that we want a guarantee of at least a pre-mapped variant for every variant in a mapping query. If we can't provide that, we must return some sort of error like we do at present.

Comment on lines 253 to 260
if row["hgvs_nt"] != "NA":
# TODO check assumption of no colon in hgvs unless reference sequence identifier present
prefix = row["hgvs_nt"].split(":")[0] if ":" in row["hgvs_nt"] else None
elif row["hgvs_pro"] != "NA":
# TODO check assumption of no colon in hgvs unless reference sequence identifier present
prefix = (
row["hgvs_pro"].split(":")[0] if ":" in row["hgvs_pro"] else None
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this assumption is correct.

@sallybg
Copy link
Collaborator Author

sallybg commented Jun 3, 2025

Note to self - bump mapper version

sallybg added 8 commits June 3, 2025 16:37
This change also checks that the number of mappings for a score set is greater than 0 and returns an error if not.
… and return an error if not.

This change also removes a TODO for which there is now a backlog entry.
BLAT automatically removes certain characters from query names, including removing all characters after a space.
If the BLAT result name does not match any target genes in the score set, attempt to match based on BLAT's query name shortening patterns.
If multiple matches (could happen if labels are something like "Gene 1" and "Gene 2", in which case both would be shortened to "Gene"), raise an error.
Although cdot fetching is not currently directly used in any tests,
the cdot REST data provider is imported into dcd_mapping.vrs_map.
The cdot data provider uses a ChainedSeqFetcher, to which specific
fasta file paths are passed in dcd_mapping.lookup.
This results in a FileNotFoundError during tests, since these fasta
files are not available outside of the mapper Docker container.
This fix uses a fake fasta file to generate a ChainedSeqFetcher,
so this change does not support actual cdot transcript fetches for
future tests.
@bencap bencap merged commit 085d011 into mavedb-main Jun 11, 2025
6 checks passed
@bencap bencap deleted the accession-based branch June 11, 2025 18:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

VRS mapping for accession-based targets Mapping Score Sets with more than 1 Target

3 participants