Multi-target and accession-based mapping #38

sallybg · 2025-05-22T21:06:41Z

No description provided.

Note that this is mostly structured to handle multi-target mapping, but does not contain all changes required for multi-target mapping (specifically final output / reference sequence structure). Such changes would require corresponding changes in mavedb-api which we are not prepared to deploy yet.

bencap

Thanks for all the work on this Sally, I know it was a lot to fit these new features into how this was originally built. I left a decent number of comments, but all of them are pretty minor, and the volume mostly stems from this being a large set of changes. Although I didn't personally run mapping jobs while testing these, it seems to me that the capability to map all of our pillar project data sets is a really reasonable test set.

Let's try to translate the TODOs here into the dcd mapping backlog. I tried to comment on most of them but might have missed a few and/or skipped some duplicates. That backlog isn't super fleshed out yet, so it'd be good to get some items in there.

I think you were planning on fixing the tests before merging, but feel free to merge these changes once you do.

bencap · 2025-05-22T22:49:16Z

src/api/routers/map.py

 @router.post(path="/map/{urn}", status_code=200, response_model=ScoresetMapping)
 @with_mavedb_score_set
 async def map_scoreset(urn: str, store_path: Path | None = None) -> ScoresetMapping:
    """Perform end-to-end mapping for a scoreset.


I noticed this function is a little inconsistent with its return types. Sometimes it returns a ScoresetMapping, and other times it returns a JSONResponse containing a ScoresetMapping. We should probably make that consistent and choose one or the other.

bencap · 2025-05-22T22:53:56Z

src/api/routers/map.py

+    # TODO this should instead check if all values in dict are none. or might not need this at all.
+    if vrs_results is None or len(vrs_results) == 0:


Yeah, since vrs_results can never be None now (we assign an empty dict to it up front), this should probably be checking to make sure vrs_results has items in it and that at least one of those values is not None.

Suggested change

-    # TODO this should instead check if all values in dict are none. or might not need this at all.
-    if vrs_results is None or len(vrs_results) == 0:
+    if not vrs_results or all(mapping_result is None for mapping_result in vrs_results.values()):
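
To make the behavior of the suggested condition concrete, here is a minimal sketch. It assumes `vrs_results` is a dict keyed by target gene identifier whose values are either a mapping result or `None`; the helper name `has_no_mappings` and the example keys are hypothetical and not part of the PR.

```python
from typing import Any


def has_no_mappings(vrs_results: dict[str, Any]) -> bool:
    """Return True when the dict is empty or every value is None.

    Hypothetical helper; it mirrors the condition in the suggested change.
    """
    return not vrs_results or all(
        mapping_result is None for mapping_result in vrs_results.values()
    )


# An empty dict and a dict holding only None values both count as "no mappings".
print(has_no_mappings({}))
print(has_no_mappings({"target_a": None, "target_b": None}))
# A single non-None value is enough for the check to pass.
print(has_no_mappings({"target_a": None, "target_b": ["mapping"]}))
```

Note that an empty list is not `None`, so a target whose value is `[]` would not trigger this condition on its own.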

bencap · 2025-05-22T22:54:20Z

src/api/routers/map.py

+    # TODO this should instead check if all values in dict are none. or might not need this at all.
+    if annotated_vrs_results is None or len(annotated_vrs_results) == 0:


Same as prior comment

bencap · 2025-05-27T21:51:04Z

src/dcd_mapping/align.py

-            output = read_blat(out_file, "blat-psl")
+            output = parse_blat(out_file, "blat-psl")
+
+        # TODO reevaluate this code block - are there cases in mavedb where target sequence type is incorrectly supplied?


I would hope not :)

Do you think it's possible any of the weird mapping outputs we have seen are coming from cases where we enter this block? Or do you think we never enter this block with existing data?

bencap · 2025-05-27T21:59:25Z

src/dcd_mapping/align.py

+def _get_blat_output(metadata: ScoresetMetadata, silent: bool) -> Any:  # noqa: ANN401
    """Run a BLAT query and returns a path to the output object.

    If unable to produce a valid query the first time, then try a query using ``dnax``
    bases.

    :param scoreset_metadata: object containing scoreset attributes
    :param silent: suppress BLAT command output
-    :return: BLAT query result
+    :return: dict where keys are target gene identifiers and values are BLAT query result objects


We should still be able to type this return value with dict[str, QueryResult].
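
A rough sketch of what the typed signature could look like, so the `Any` annotation and its `noqa` marker are no longer needed. The `QueryResult` class below is only a stand-in defined for this example, and `get_blat_output` with a list of labels is a simplified, hypothetical signature; the real function takes scoreset metadata and runs BLAT.

```python
from dataclasses import dataclass


@dataclass
class QueryResult:
    """Stand-in for the real BLAT query result class (assumption)."""

    id: str


def get_blat_output(target_labels: list[str]) -> dict[str, QueryResult]:
    """Return one query result per target gene label.

    Hypothetical simplification: this only shows the shape of the typed
    return value and does not run BLAT.
    """
    return {label: QueryResult(id=label) for label in target_labels}
```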

bencap · 2025-05-27T23:10:19Z

src/dcd_mapping/main.py

+    # TODO this should be if all values in dict are none. or might not need this at all.
+    if annotated_vrs_results is None:


Same thing here

Suggested change

-    # TODO this should be if all values in dict are none. or might not need this at all.
-    if annotated_vrs_results is None:
+    if not vrs_results:

bencap · 2025-05-27T23:12:38Z

src/dcd_mapping/main.py

+                metadata.urn,
+                vrs_version,
+            )
+        except Exception as e:  # TODO create AnnotationError class and replace ValueErrors in annotation steps with AnnotationErrors


Let's add a backlog item for this.
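
For the backlog item, one possible shape of the change described in the TODO is sketched below. The `AnnotationError` name comes from the TODO itself; everything else (`annotate`, `annotate_all`, the error message) is hypothetical and only illustrates catching a dedicated exception instead of a bare `Exception`.

```python
class AnnotationError(Exception):
    """Raised when a mapped variant cannot be annotated (sketch of the proposed class)."""


def annotate(variant: str) -> str:
    """Toy annotation step that rejects empty input."""
    if not variant:
        # This kind of failure is where a ValueError would be replaced.
        msg = "Cannot annotate an empty variant"
        raise AnnotationError(msg)
    return f"annotated:{variant}"


def annotate_all(variants: list[str]) -> tuple[list[str], list[str]]:
    """Annotate each variant, collecting error messages instead of crashing."""
    annotated: list[str] = []
    errors: list[str] = []
    for variant in variants:
        try:
            annotated.append(annotate(variant))
        except AnnotationError as e:  # narrower than `except Exception`
            errors.append(str(e))
    return annotated, errors
```

Catching only `AnnotationError` lets unrelated bugs surface as normal tracebacks rather than being reported as annotation failures.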

bencap · 2025-05-27T23:16:45Z

src/dcd_mapping/mavedb_data.py

+
+    for gene in metadata["targetGenes"]:
+        if not _metadata_response_is_human(metadata):
+            # TODO allow score sets with mix of human and non-human targets? This may not come up, but is doable with a little restructuring.


Backlog item

bencap · 2025-05-27T23:33:40Z

src/dcd_mapping/mavedb_data.py

+                # Should we quit the whole mapping job if this comes up, or just skip this row and only quit if none contain hgvs_nt or hgvs_pro?
+                msg = f"Each score row in {metadata.urn} must contain hgvs_nt or hgvs_pro variant description "
+                raise ScoresetNotSupportedError(msg)


Good question here. I think the answer to this is that we want a guarantee of at least a pre-mapped variant for every variant in a mapping query. If we can't provide that, we must return some sort of error like we do at present.
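
As an illustration of that guarantee, the check could look roughly like the sketch below. It assumes missing values are encoded as the string "NA" (as in the other hunks of this file), uses a local stand-in for `ScoresetNotSupportedError`, and the function name `check_rows` is hypothetical.

```python
class ScoresetNotSupportedError(Exception):
    """Local stand-in for the mapper's exception of the same name."""


def check_rows(urn: str, rows: list[dict[str, str]]) -> None:
    """Raise if any score row lacks both hgvs_nt and hgvs_pro.

    Assumes missing values are encoded as the string "NA".
    """
    for row in rows:
        if row.get("hgvs_nt", "NA") == "NA" and row.get("hgvs_pro", "NA") == "NA":
            msg = f"Each score row in {urn} must contain hgvs_nt or hgvs_pro variant description"
            raise ScoresetNotSupportedError(msg)
```

Failing on the first incomplete row, rather than skipping it, is what preserves the one-pre-mapped-variant-per-input-variant guarantee.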

bencap · 2025-05-27T23:34:49Z

src/dcd_mapping/mavedb_data.py

+            if row["hgvs_nt"] != "NA":
+                # TODO check assumption of no colon in hgvs unless reference sequence identifier present
+                prefix = row["hgvs_nt"].split(":")[0] if ":" in row["hgvs_nt"] else None
+            elif row["hgvs_pro"] != "NA":
+                # TODO check assumption of no colon in hgvs unless reference sequence identifier present
+                prefix = (
+                    row["hgvs_pro"].split(":")[0] if ":" in row["hgvs_pro"] else None
+                )


I believe this assumption is correct.
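
For reference, the prefix logic in this hunk can be condensed into a small function like the sketch below. The name `hgvs_prefix` and the example values are hypothetical, and returning `None` when both columns are "NA" is an assumption made for the example.

```python
from typing import Optional


def hgvs_prefix(row: dict[str, str]) -> Optional[str]:
    """Return the reference sequence identifier preceding the colon, if any.

    hgvs_nt takes priority over hgvs_pro, matching the if/elif in the diff.
    """
    for column in ("hgvs_nt", "hgvs_pro"):
        value = row[column]
        if value != "NA":
            # Assumes a colon only appears when an identifier is present.
            return value.split(":")[0] if ":" in value else None
    return None


# Accession-prefixed nucleotide variant: the accession is the prefix.
print(hgvs_prefix({"hgvs_nt": "NM_000546.6:c.215C>G", "hgvs_pro": "NA"}))
# Protein variant with no reference identifier: no prefix.
print(hgvs_prefix({"hgvs_nt": "NA", "hgvs_pro": "p.Pro72Arg"}))
```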

sallybg · 2025-06-03T17:35:08Z

Note to self - bump mapper version

BLAT automatically removes certain characters from query names, including removing all characters after a space. If the BLAT result name does not match any target genes in the score set, attempt to match based on BLAT's query name shortening patterns. If multiple target genes match (this could happen if labels are something like "Gene 1" and "Gene 2", in which case both would be shortened to "Gene"), raise an error.
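
The matching fallback described above can be sketched as follows. This is only an approximation under stated assumptions: it models just the truncate-at-first-space rule, not the other characters BLAT removes, and the function names are hypothetical. Raising when no label matches is also an assumption; the commit message only specifies the multiple-match case.

```python
def shorten_like_blat(label: str) -> str:
    """Approximate BLAT's query-name shortening by cutting at the first space.

    Assumption: only the space rule is modeled here.
    """
    return label.split(" ", 1)[0]


def match_target(blat_name: str, target_labels: list[str]) -> str:
    """Resolve a BLAT result name to exactly one target gene label."""
    # Prefer an exact match when one exists.
    if blat_name in target_labels:
        return blat_name
    # Otherwise compare against the shortened form of each label.
    candidates = [t for t in target_labels if shorten_like_blat(t) == blat_name]
    if len(candidates) != 1:
        msg = f"Expected one target matching '{blat_name}', found {len(candidates)}"
        raise ValueError(msg)
    return candidates[0]
```

With labels "Gene 1" and "Gene 2", the BLAT name "Gene" yields two candidates, so the function raises instead of guessing.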

Although cdot fetching is not currently directly used in any tests, the cdot REST data provider is imported into dcd_mapping.vrs_map. The cdot data provider uses a ChainedSeqFetcher, to which specific fasta file paths are passed in dcd_mapping.lookup. This results in a FileNotFoundError during tests, since these fasta files are not available outside of the mapper Docker container. This fix uses a fake fasta file to generate a ChainedSeqFetcher, so this change does not support actual cdot transcript fetches for future tests.

sallybg added 15 commits February 6, 2025 13:49

Multi-target mapping partial draft: alignment and transcript selection

c8a31e7

Make it the default behavior to prefer genomic mappings when available

0b68f5a

Map score sets with multiple targets

4ed6cb9

fix: single-target blat result named incorrectly

d94d55c

Remove blank reference sequence entries from output

bfcbcba

Fix type annotations for target gene metadata param

20e3aa2

Accession-based mapping and add cdot to environment

8fb8e0b

Temporarily remove support for multi-target score sets

f8586ce

Corrections for accession-based mapping without multi-target mapping

d38f364

Bug fixes for temporary non-multi-target staging deployment

5869536

Update UTA DB version

4cdc7a3

Re-implement multi-target mapping

ccec5a5

Add transcript information to mapped metadata for genomic score sets

a44bd2e

Fix CLI prefer_genomic flag

5a825e3

sallybg requested a review from bencap May 22, 2025 21:06

bencap approved these changes May 27, 2025

sallybg added 8 commits June 3, 2025 16:37

Change tests and fixtures to reflect multi-target mapper changes

1b1ab48

Always return a custom JSON response type from mapping api

c3ade2e

This change also checks that the number of mappings for a score set is greater than 0 and returns an error if not.

CLI: Check that number of mappings for a score set is greater than 0,…

e4b41b2

… and return an error if not. This change also removes a TODO for which there is now a backlog entry.

Remove TODO comments for which backlog entries have been created

20e604d

Bump mapper version

a59c560

Use https for mock requests

bb04c51

This was linked to issues Jun 5, 2025

Mapping Score Sets with more than 1 Target #2

Closed

VRS mapping for accession-based targets #3

Closed

bencap merged commit 085d011 into mavedb-main Jun 11, 2025
6 checks passed

bencap deleted the accession-based branch June 11, 2025 18:46

3 participants