Gene-specific charts include sample ids#11904
Gene-specific charts include sample ids#11904i-am-leslie wants to merge 10 commits intocBioPortal:masterfrom
Conversation
| count(*) as count, | ||
| count(distinct(sample_unique_id)) as uniqueCount | ||
| count(distinct(sample_unique_id)) as uniqueCount, | ||
| arrayStringConcat(groupArray(DISTINCT sample_unique_id), ',') AS sampleIdsStr |
There was a problem hiding this comment.
you're going to want to make this conditional using MyBatis. I.e. we will pass in the "flag" all the way from the controller and only return the samples when we need to.
| mapper.getMutationCountsByType( | ||
| StudyViewFilterFactory.make(studyViewFilter, null, studyViewFilter.getStudyIds(), null), | ||
| List.of(genomicDataFilterMutation)); | ||
| List.of(genomicDataFilterMutation), false); |
There was a problem hiding this comment.
can you add a test to this spec where it's true?
f87cfdd to
fa3c671
Compare
7c046e6 to
9b48295
Compare
9b48295 to
69cda0d
Compare
alisman
left a comment
There was a problem hiding this comment.
Code Review
Purpose: Add an includeSampleIds flag to the mutation-data-counts endpoint so the comparison page can retrieve which samples belong to each mutation type bucket.
1. Legacy StudyViewMapper.xml is not updated -- divergent code paths (Medium)
The legacy MyBatis mapper at src/main/resources/org/cbioportal/persistence/mybatisclickhouse/StudyViewMapper.xml has its own copy of the getMutationCountsByType query. It was not updated with the new includeSampleIds parameter or the sampleIdsStr column. Similarly, the legacy StudyViewMapper.java interface still has the old 2-parameter signature.
The legacy StudyViewController.java calls studyViewService.getMutationTypeCountsByGeneSpecific(studyIds, sampleIds, genomicDataFilters) (the legacy StudyViewService interface), which goes through StudyViewServiceImpl and ultimately calls the legacy StudyViewMyBatisRepository -- a completely separate code path from the domain layer that was updated.
The legacy endpoint (/api/mutation-data-counts/fetch in StudyViewController) still works because it was never changed -- it just will not support the new includeSampleIds feature. However, two divergent code paths now exist. If the legacy controller is still active, clients may get different behavior depending on which endpoint they hit.
2. GenomicDataCountItemResultMap now has sampleIds for ALL queries that use it (Low)
The shared GenomicDataCountItemResultMap in ClickhouseGenomicDataMapper.xml now maps sampleIdsStr to sampleIds unconditionally. This same result map is used by getCNACounts. The CNA query does not produce a sampleIdsStr column, so MyBatis will map it as null -- which is harmless. However, CNA count responses will now include a sampleIds: null field in the JSON output that was not there before. If any frontend code does a strict schema check or if OpenAPI validation is enabled, this could cause issues.
The legacy StudyViewMapper.xml result map was not updated, so the legacy path will not have this field -- further divergence.
3. setSampleIds(String) setter has a type mismatch (Low)
In GenomicDataCount.java, the setter accepts a String (for MyBatis mapping from comma-delimited sampleIdsStr) but the getter returns List<String>. This is a JavaBean convention violation -- setSampleIds(String) and getSampleIds() returning List<String> do not match types. While Jackson serialization uses the getter (so JSON output will be List<String>), this could break:
- Any code calling
setSampleIdswith aList<String>at the Java level (there is no such overload) - Standard JavaBean introspection tools
- The
equals()/hashCode()methods do not includesampleIds, meaning twoGenomicDataCountobjects with different sample IDs would be considered equal
4. Hardcoded genomicDataFilters[0] -- wrong results for multi-gene queries (High)
In the new SQL in ClickhouseGenomicDataMapper.xml, the "Not Mutated" and "Not Profiled" UNION ALL blocks and the profiled_samples CTE use:
#{genomicDataFilters[0].hugoGeneSymbol}This hardcodes the first gene filter only. If genomicDataFilters contains multiple genes, the "Not Mutated" and "Not Profiled" rows will only be computed for genomicDataFilters[0], silently ignoring the rest. The mutated_samples CTE correctly uses a <foreach> loop over all filters, creating an inconsistency.
Impact: When includeSampleIds=true and multiple genes are passed, the "Not Mutated" and "Not Profiled" counts and sample IDs will be wrong -- they will only reflect the profiling status of the first gene.
5. Performance concern: unbounded groupArray (Medium)
When includeSampleIds=true, the query collects all distinct sample IDs into a comma-separated string via arrayStringConcat(groupArray(DISTINCT sample_unique_id), ','). For large studies with thousands of samples, this can produce enormous strings that:
- Consume significant ClickHouse memory during aggregation
- Create very large JSON responses
- May hit ClickHouse max_query_size or max_memory_usage limits
There is no limit or pagination on the sample IDs list.
6. includeSampleIds silently ignored with SUMMARY projection (Low)
In ColumnarStoreStudyViewController.java:
projection == Projection.SUMMARY
? studyViewService.getMutationCountsByGeneSpecific(studyViewFilter, genomicDataFilters)
: studyViewService.getMutationTypeCountsByGeneSpecific(
studyViewFilter, genomicDataFilters, includeSampleIds);The includeSampleIds parameter only takes effect when projection != SUMMARY. If someone passes ?includeSampleIds=true&projection=SUMMARY, the flag is silently ignored. This is not necessarily a bug, but it is undocumented behavior since the API accepts the parameter in both cases.
7. Test gap: no multi-gene test with includeSampleIds=true (Medium)
The new test getMutationCountsByTypeAddSampleId only tests with a single gene (AKT1). Given issue #4 above, a multi-gene test would expose the genomicDataFilters[0] bug.
Summary
| # | Severity | Issue |
|---|---|---|
| 1 | Medium | Legacy mapper/service not updated -- divergent code paths |
| 2 | Low | CNA results will now include sampleIds: null in JSON |
| 3 | Low | setSampleIds(String) type mismatch with getter; missing from equals/hashCode |
| 4 | High | Hardcoded genomicDataFilters[0] -- wrong results for multi-gene queries with includeSampleIds=true |
| 5 | Medium | Unbounded groupArray could cause performance issues on large studies |
| 6 | Low | includeSampleIds silently ignored with SUMMARY projection |
| 7 | Medium | No test coverage for multi-gene + includeSampleIds=true scenario |
|
@i-am-leslie i asked Claude code to review the PR. Maybe we can meet tomorrow an go over it? Let me know if you have time. Some of the issues it raises are probably not a big deal. |
…ase layer checks extracts the one hugo gene symbol to the argument, to remove edge cases of passing multiple genes.
…ase layer checks extracts the one hugo gene symbol to the argument, to remove edge cases of passing multiple genes.
|



Fix #11771
Describe changes proposed in this pull request:
Checks