Fixes #11583 - Optimize molecular data multi-profile fetch for ClickHouse (reduce N+1 queries) #11840
Pull request overview
This PR optimizes molecular data fetching for ClickHouse by eliminating N+1 query patterns when retrieving data across multiple profiles. Instead of querying per gene, the implementation now fetches all requested genes in a single ClickHouse query and aggregates per-sample rows into the legacy CSV format expected by the service layer.
Key Changes
- Introduced ClickHouse-specific repository that queries `genetic_alteration_derived` and aggregates results into `GeneMolecularAlteration` objects
- Modified service layer to batch all entrez gene IDs into a single repository call
- Added comprehensive unit tests for both service and repository aggregation logic
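
The difference between the per-gene pattern and the batched call described above can be sketched as follows. This is only an illustration: `CountingRepo` and its `fetchForGenes` method are hypothetical stand-ins, not the cBioPortal repository API, and the counter simply records how many round trips each approach would issue.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchFetchSketch {
    // Hypothetical repository: each call stands for one database round trip.
    static class CountingRepo {
        int queryCount = 0;

        List<String> fetchForGenes(String profileId, List<Integer> entrezGeneIds) {
            queryCount++;
            List<String> rows = new ArrayList<>();
            for (Integer id : entrezGeneIds) {
                rows.add(profileId + ":" + id);
            }
            return rows;
        }
    }

    // Per-gene pattern: one query for every requested gene.
    static List<String> fetchPerGene(CountingRepo repo, String profileId, List<Integer> ids) {
        List<String> rows = new ArrayList<>();
        for (Integer id : ids) {
            rows.addAll(repo.fetchForGenes(profileId, List.of(id)));
        }
        return rows;
    }

    // Batched pattern: all gene IDs are passed in a single call.
    static List<String> fetchBatched(CountingRepo repo, String profileId, List<Integer> ids) {
        return repo.fetchForGenes(profileId, ids);
    }

    public static void main(String[] args) {
        List<Integer> ids = List.of(672, 675, 7157);
        CountingRepo perGene = new CountingRepo();
        CountingRepo batched = new CountingRepo();
        fetchPerGene(perGene, "example_profile", ids);
        fetchBatched(batched, "example_profile", ids);
        System.out.println("per-gene queries: " + perGene.queryCount);
        System.out.println("batched queries: " + batched.queryCount);
    }
}
```

Both paths return the same rows; only the number of round trips changes, which is where the saving is expected to come from.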
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| MolecularDataServiceImpl.java | Replaced per-gene streaming queries with single batched repository call |
| MolecularDataMyBatisClickhouseRepository.java | New repository implementation that aggregates per-sample ClickHouse rows into CSV format |
| MolecularDataMapper.java | New mapper interface for ClickHouse queries |
| MolecularDataMapper.xml | MyBatis XML query definition for fetching per-sample molecular data |
| MolecularDataRowPerSample.java | New model class representing individual sample-level molecular data rows |
| MolecularDataServiceImplTest.java | Added test verifying multi-profile molecular data fetch with single repository call |
| MolecularDataMyBatisClickhouseRepositoryTest.java | Added test verifying aggregation logic from per-sample rows to CSV format |
onursumer left a comment
Not sure if master is the best target for this PR. It might be better to optimize the rc-7.0-clickhouse-only branch, because eventually we will switch to a ClickHouse-only implementation.

@Repository
@ConditionalOnProperty(name = "clickhouse_mode", havingValue = "test")
public class MolecularDataMyBatisClickhouseRepository implements MolecularDataRepository {

We should probably just modify MolecularDataMyBatisRepository instead of introducing another legacy repository class.

import java.io.Serializable;

public class MolecularDataRowPerSample implements Serializable {

Do we really need to introduce a new legacy model? Can't we achieve the same thing by just using a map and the existing GeneMolecularAlteration model?
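
One way to read this suggestion is to group the per-sample rows in a map keyed by profile and gene, then join each group's values into the CSV string. The sketch below is an assumption-laden illustration: `Row` and `Alteration` are simplified, hypothetical stand-ins (the latter for the existing `GeneMolecularAlteration` model), their field names may not match the real classes, and rows are assumed to arrive in the sample order the CSV should follow.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MapAggregationSketch {
    // Hypothetical per-sample row as it might come back from the derived table.
    static class Row {
        final String profileId;
        final int entrezGeneId;
        final String value;

        Row(String profileId, int entrezGeneId, String value) {
            this.profileId = profileId;
            this.entrezGeneId = entrezGeneId;
            this.value = value;
        }
    }

    // Simplified stand-in for the existing GeneMolecularAlteration model.
    static class Alteration {
        final String profileId;
        final int entrezGeneId;
        final String values; // comma-separated, one entry per sample

        Alteration(String profileId, int entrezGeneId, String values) {
            this.profileId = profileId;
            this.entrezGeneId = entrezGeneId;
            this.values = values;
        }
    }

    static List<Alteration> aggregate(List<Row> rows) {
        // LinkedHashMap keeps groups in first-seen order.
        Map<String, List<String>> valuesByKey = new LinkedHashMap<>();
        Map<String, Row> firstRowByKey = new LinkedHashMap<>();
        for (Row row : rows) {
            String key = row.profileId + "|" + row.entrezGeneId;
            valuesByKey.computeIfAbsent(key, k -> new ArrayList<>()).add(row.value);
            firstRowByKey.putIfAbsent(key, row);
        }
        List<Alteration> result = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : valuesByKey.entrySet()) {
            Row first = firstRowByKey.get(entry.getKey());
            result.add(new Alteration(first.profileId, first.entrezGeneId,
                    String.join(",", entry.getValue())));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
                new Row("profile_a", 672, "1.5"),
                new Row("profile_a", 672, "2.0"),
                new Row("profile_a", 7157, "0.3"));
        for (Alteration a : aggregate(rows)) {
            System.out.println(a.profileId + " " + a.entrezGeneId + " " + a.values);
        }
    }
}
```

With this shape no additional model class is needed; the map only exists inside the aggregation method.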

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE mapper PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN" "http://mybatis.org/dtd/mybatis-3-mapper.dtd">

<mapper namespace="org.cbioportal.legacy.persistence.mybatisclickhouse.MolecularDataMapper">

We can probably just modify the existing mapper src/main/resources/org/cbioportal/legacy/persistence/mybatis/MolecularDataMapper.xml instead of introducing a new one.

import org.apache.ibatis.annotations.Param;
import org.cbioportal.legacy.model.MolecularDataRowPerSample;

public interface MolecularDataMapper {

I would just modify the existing legacy mapper instead of introducing another legacy mapper.

@onursumer
Refactoring Complete per Reviewer Feedback
I've updated this PR to address all the feedback from @onursumer.
Changes Made:
The optimization still eliminates N+1 queries by fetching all genes in a single ClickHouse query, but now integrates cleanly into the existing codebase without introducing parallel legacy classes.

@onursumer waiting for the review and further guidance!

…ioPortal#11761
- Created MolecularDataCountItem model to represent per-profile counts
- Added fetchMolecularDataCountsInMultipleMolecularProfiles method to service layer
- Implemented new /api/molecular-data/counts POST endpoint
  - Returns JSON array with count per molecular profile in single database query
- Leverages existing getMolecularDataInMultipleMolecularProfiles optimization from PR cBioPortal#11840
- Added unit tests for service and controller layers
- Includes implementation plan document for reference

@zainasir @inodb @onursumer @sheridancbio I've pushed a fix (commit 97a0773) that resolves the circular dependency issue causing the build failures. The fix:
Could you please approve the pending workflows so the new builds can run with the fixed code? Thanks!

Friendly ping on this PR. All feedback from @onursumer has been implemented, the circular dependency was fixed in 97a0773, and tests pass locally. When you have time, could you please take another look?

Hi @immortal71, we recently merged rc-7.0-clickhouse-only into master. Can you change the base branch back to master and rebase your PR? Thanks!

@onursumer done!!

@onursumer Can you review it?

@immortal71 can you also rebase your branch on master (

@onursumer done!!

@onursumer

@immortal71 your branch is still 22 commits behind the master branch. Can you rebase it on the latest

…gle query to reduce N+1 queries (ClickHouse perf)
…ry that aggregates per-sample rows into gene-profile values
…nto existing repository/mapper
- Remove separate ClickHouse classes
- Change conditional property to true

Per reviewer feedback from @onursumer:
- Removed separate ClickHouse-specific repository and mapper classes
- Moved optimization into existing MolecularDataMyBatisRepository
  - Updated existing MolecularDataMapper.xml with conditional ClickHouse query
- Changed @ConditionalOnProperty havingValue from 'test' to 'true'
- Reuses existing GeneMolecularAlteration model instead of new legacy classes

The ClickHouse path queries genetic_alteration_derived table and aggregates per-sample rows into CSV format in the repository layer.

…ization
- Replaced SampleService injection with SampleMapper to avoid circular dependency
- Added null safety check for optional SampleMapper dependency
- Added try-catch with fallback to standard method for database compatibility
- Added SLF4J logger for debugging and error tracking
- Ensures tests pass in both MySQL and ClickHouse environments

Fixes cBioPortal#11583
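
The null check and try-catch fallback listed in this commit could look roughly like the following. This is a minimal sketch, not the PR's code: `fetchWithFallback` and the two `Supplier` arguments are hypothetical names, and `java.util.logging` is used in place of SLF4J so the example runs without extra dependencies.

```java
import java.util.List;
import java.util.function.Supplier;
import java.util.logging.Logger;

public class FallbackSketch {
    private static final Logger LOG = Logger.getLogger(FallbackSketch.class.getName());

    // Uses the optimized path when it is available; if it is absent (null)
    // or throws, the standard path is used instead.
    static <T> T fetchWithFallback(Supplier<T> optimized, Supplier<T> standard) {
        if (optimized == null) {
            return standard.get();
        }
        try {
            return optimized.get();
        } catch (RuntimeException e) {
            LOG.warning("Optimized fetch failed, falling back: " + e.getMessage());
            return standard.get();
        }
    }

    public static void main(String[] args) {
        // Simulates an optimized query failing, e.g. because a table is missing.
        Supplier<List<String>> failing = () -> {
            throw new IllegalStateException("derived table not available");
        };
        Supplier<List<String>> standard = () -> List.of("rows from standard query");
        System.out.println(fetchWithFallback(failing, standard));
    }
}
```

A fallback like this keeps both database backends working, at the cost of hiding failures of the optimized path unless the log is monitored.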
97a0773 to d4c610f

@onursumer done!!

Fixes #11583
This PR addresses the performance bottleneck when using ClickHouse in multi-profile molecular data fetches. Instead of repeated per-gene queries (N+1), the ClickHouse repository now fetches per-sample rows from the `genetic_alteration_derived` table and aggregates them into the legacy `values` CSV format expected by the service layer. The service now requests all entrez gene IDs in a single call.

Key changes:
- … `genetic_alteration_derived`.
- … `GeneMolecularAlteration`.

Notes & next steps: Add `entrez_gene_id` to the derived table to avoid a join to `gene` during the ClickHouse query for better perf.