
Gene-specific charts include sample ids #11904

Open
i-am-leslie wants to merge 10 commits into cBioPortal:master from i-am-leslie:gene-chart-sampleids

i-am-leslie commented Jan 7, 2026

Fix #11771
Describe changes proposed in this pull request:

  • This change adds optional support for including sample IDs in the mutation data counts response. When the flag is enabled, the backend populates the sample IDs required by the comparison page; when it is disabled, existing behaviour is preserved.

Checks

  • Yes, the change has been tested with JUnit: the tests check that a list of strings is present when includeSampleId is true and that no list is present when includeSampleId is false.
  • Yes, the commit log is comprehensive.
  • No, it is not adding logic based on one or more clinical attributes.

  count(*) as count,
- count(distinct(sample_unique_id)) as uniqueCount
+ count(distinct(sample_unique_id)) as uniqueCount,
+ arrayStringConcat(groupArray(DISTINCT sample_unique_id), ',') AS sampleIdsStr

You're going to want to make this conditional using MyBatis, i.e. we will pass the flag all the way from the controller and only return the samples when we need to.

  mapper.getMutationCountsByType(
  StudyViewFilterFactory.make(studyViewFilter, null, studyViewFilter.getStudyIds(), null),
- List.of(genomicDataFilterMutation));
+ List.of(genomicDataFilterMutation), false);

Can you add a test to this spec where it's true?


Yes, will do

i-am-leslie marked this pull request as ready for review January 15, 2026 21:42
i-am-leslie force-pushed the gene-chart-sampleids branch 3 times, most recently from 7c046e6 to 9b48295 February 8, 2026 16:01
alisman left a comment

Code Review

Purpose: Add an includeSampleIds flag to the mutation-data-counts endpoint so the comparison page can retrieve which samples belong to each mutation type bucket.


1. Legacy StudyViewMapper.xml is not updated -- divergent code paths (Medium)

The legacy MyBatis mapper at src/main/resources/org/cbioportal/persistence/mybatisclickhouse/StudyViewMapper.xml has its own copy of the getMutationCountsByType query. It was not updated with the new includeSampleIds parameter or the sampleIdsStr column. Similarly, the legacy StudyViewMapper.java interface still has the old 2-parameter signature.

The legacy StudyViewController.java calls studyViewService.getMutationTypeCountsByGeneSpecific(studyIds, sampleIds, genomicDataFilters) (the legacy StudyViewService interface), which goes through StudyViewServiceImpl and ultimately calls the legacy StudyViewMyBatisRepository -- a completely separate code path from the domain layer that was updated.

The legacy endpoint (/api/mutation-data-counts/fetch in StudyViewController) still works because it was never changed -- it just will not support the new includeSampleIds feature. However, two divergent code paths now exist. If the legacy controller is still active, clients may get different behavior depending on which endpoint they hit.

2. GenomicDataCountItemResultMap now has sampleIds for ALL queries that use it (Low)

The shared GenomicDataCountItemResultMap in ClickhouseGenomicDataMapper.xml now maps sampleIdsStr to sampleIds unconditionally. This same result map is used by getCNACounts. The CNA query does not produce a sampleIdsStr column, so MyBatis will map it as null -- which is harmless. However, CNA count responses will now include a sampleIds: null field in the JSON output that was not there before. If any frontend code does a strict schema check or if OpenAPI validation is enabled, this could cause issues.

The legacy StudyViewMapper.xml result map was not updated, so the legacy path will not have this field -- further divergence.

3. setSampleIds(String) setter has a type mismatch (Low)

In GenomicDataCount.java, the setter accepts a String (for MyBatis mapping from comma-delimited sampleIdsStr) but the getter returns List<String>. This is a JavaBean convention violation -- setSampleIds(String) and getSampleIds() returning List<String> do not match types. While Jackson serialization uses the getter (so JSON output will be List<String>), this could break:

  • Any code calling setSampleIds with a List<String> at the Java level (there is no such overload)
  • Standard JavaBean introspection tools
  • The equals()/hashCode() methods do not include sampleIds, meaning two GenomicDataCount objects with different sample IDs would be considered equal
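One way the mismatch could be resolved is sketched below, under assumptions: the getter and setter for sampleIds agree on List<String>, the comma-delimited column gets its own String setter, and sampleIds takes part in equals()/hashCode(). The class name GenomicDataCountSketch and the setSampleIdsStr setter are hypothetical and not taken from the PR; the real GenomicDataCount carries more fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical, trimmed-down stand-in for GenomicDataCount (not the PR's class).
class GenomicDataCountSketch {
    private String label;
    private Integer count;
    private List<String> sampleIds; // stays null when sample IDs were not requested

    public String getLabel() { return label; }
    public void setLabel(String label) { this.label = label; }

    public Integer getCount() { return count; }
    public void setCount(Integer count) { this.count = count; }

    // Getter and setter agree on List<String>, as JavaBean introspection expects.
    public List<String> getSampleIds() { return sampleIds; }
    public void setSampleIds(List<String> sampleIds) { this.sampleIds = sampleIds; }

    // Assumed separate write-only property for the comma-delimited sampleIdsStr column.
    public void setSampleIdsStr(String sampleIdsStr) {
        if (sampleIdsStr == null) {
            this.sampleIds = null;
        } else if (sampleIdsStr.isEmpty()) {
            this.sampleIds = new ArrayList<>();
        } else {
            this.sampleIds = new ArrayList<>(Arrays.asList(sampleIdsStr.split(",")));
        }
    }

    // sampleIds is part of equality, so counts with different samples differ.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof GenomicDataCountSketch)) return false;
        GenomicDataCountSketch other = (GenomicDataCountSketch) o;
        return Objects.equals(label, other.label)
            && Objects.equals(count, other.count)
            && Objects.equals(sampleIds, other.sampleIds);
    }

    @Override
    public int hashCode() {
        return Objects.hash(label, count, sampleIds);
    }
}
```

With this shape the result map would target the sampleIdsStr property rather than sampleIds, so the public property keeps a single type.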

4. Hardcoded genomicDataFilters[0] -- wrong results for multi-gene queries (High)

In the new SQL in ClickhouseGenomicDataMapper.xml, the "Not Mutated" and "Not Profiled" UNION ALL blocks and the profiled_samples CTE use:

#{genomicDataFilters[0].hugoGeneSymbol}

This hardcodes the first gene filter only. If genomicDataFilters contains multiple genes, the "Not Mutated" and "Not Profiled" rows will only be computed for genomicDataFilters[0], silently ignoring the rest. The mutated_samples CTE correctly uses a <foreach> loop over all filters, creating an inconsistency.

Impact: When includeSampleIds=true and multiple genes are passed, the "Not Mutated" and "Not Profiled" counts and sample IDs will be wrong -- they will only reflect the profiling status of the first gene.
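Until the SQL handles several genes consistently, one defensive option would be to reject multi-gene input before it reaches the mapper. The sketch below is only an assumption about how such a guard could look; GeneFilter, SingleGeneGuard, and requireSingleHugoGeneSymbol are hypothetical names, not code from the PR.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical minimal filter type; the real filter class carries more fields.
class GeneFilter {
    private final String hugoGeneSymbol;

    GeneFilter(String hugoGeneSymbol) { this.hugoGeneSymbol = hugoGeneSymbol; }

    String getHugoGeneSymbol() { return hugoGeneSymbol; }
}

class SingleGeneGuard {
    // Returns the only distinct gene symbol, or throws if there is not exactly one.
    static String requireSingleHugoGeneSymbol(List<GeneFilter> filters) {
        if (filters == null || filters.isEmpty()) {
            throw new IllegalArgumentException("At least one gene filter is required");
        }
        List<String> symbols = filters.stream()
            .map(GeneFilter::getHugoGeneSymbol)
            .distinct()
            .collect(Collectors.toList());
        if (symbols.size() != 1) {
            throw new IllegalArgumentException(
                "Expected exactly one gene, got " + symbols.size() + ": " + symbols);
        }
        return symbols.get(0);
    }
}
```

Failing loudly here trades flexibility for correctness: a multi-gene request is refused instead of returning "Not Mutated" and "Not Profiled" rows that reflect only the first gene.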

5. Performance concern: unbounded groupArray (Medium)

When includeSampleIds=true, the query collects all distinct sample IDs into a comma-separated string via arrayStringConcat(groupArray(DISTINCT sample_unique_id), ','). For large studies with thousands of samples, this can produce enormous strings that:

  • Consume significant ClickHouse memory during aggregation
  • Create very large JSON responses
  • May hit ClickHouse memory limits such as max_memory_usage

There is no limit or pagination on the sample IDs list.

6. includeSampleIds silently ignored with SUMMARY projection (Low)

In ColumnarStoreStudyViewController.java:

projection == Projection.SUMMARY
    ? studyViewService.getMutationCountsByGeneSpecific(studyViewFilter, genomicDataFilters)
    : studyViewService.getMutationTypeCountsByGeneSpecific(
        studyViewFilter, genomicDataFilters, includeSampleIds);

The includeSampleIds parameter only takes effect when projection != SUMMARY. If someone passes ?includeSampleIds=true&projection=SUMMARY, the flag is silently ignored. This is not necessarily a bug, but it is undocumented behavior since the API accepts the parameter in both cases.
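If silently ignoring the flag is not wanted, the combination could instead be rejected up front. A minimal sketch, assuming a hypothetical validate helper and a simplified stand-in enum (the real Projection enum and the controller wiring may differ):

```java
// Simplified stand-in for the API's Projection enum (values assumed for illustration).
enum ProjectionSketch { ID, SUMMARY, DETAILED, META }

class IncludeSampleIdsValidator {
    // Throws when the caller asks for sample IDs on a projection that cannot return them.
    static void validate(ProjectionSketch projection, boolean includeSampleIds) {
        if (includeSampleIds && projection == ProjectionSketch.SUMMARY) {
            throw new IllegalArgumentException(
                "includeSampleIds is not supported with projection=SUMMARY");
        }
    }
}
```

Documenting the restriction in the endpoint description would be a lighter alternative to throwing.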

7. Test gap: no multi-gene test with includeSampleIds=true (Medium)

The new test getMutationCountsByTypeAddSampleId only tests with a single gene (AKT1). Given item 4 above, a multi-gene test would expose the genomicDataFilters[0] bug.


Summary

#  Severity  Issue
1  Medium    Legacy mapper/service not updated -- divergent code paths
2  Low       CNA results will now include sampleIds: null in JSON
3  Low       setSampleIds(String) type mismatch with getter; missing from equals/hashCode
4  High      Hardcoded genomicDataFilters[0] -- wrong results for multi-gene queries with includeSampleIds=true
5  Medium    Unbounded groupArray could cause performance issues on large studies
6  Low       includeSampleIds silently ignored with SUMMARY projection
7  Medium    No test coverage for multi-gene + includeSampleIds=true scenario

alisman commented Feb 10, 2026

@i-am-leslie I asked Claude Code to review the PR. Maybe we can meet tomorrow and go over it? Let me know if you have time. Some of the issues it raises are probably not a big deal.

…ase layer checks extracts the one hugo gene symbol to the argument, to remove edge cases of passing multiple genes.
Development

Successfully merging this pull request may close these issues.

gene specific charts do not have a 'compare groups' option
