Summary
The /v3/api/histopheno/{id} endpoint currently returns only association counts per anatomical system bin. We want to enhance it to also return frequency statistics derived from two complementary data sources:
- Disease-level frequency: averaged from `frequency_computed_sortable_float` on `DiseaseToPhenotypicFeatureAssociation` records (normalized from `has_count`/`has_total`, `has_percentage`, or `frequency_qualifier`)
- Case-level frequency: computed as (distinct cases with a phenotype in the bin) / (total cases for the disease), using `CaseToPhenotypicFeatureAssociation` records linked via a `CaseToDiseaseAssociation` join
Motivation
The current histopheno chart shows how many phenotype associations fall into each anatomical system, but says nothing about how common those phenotypes are. A bin with 500 associations where each phenotype is "Very rare" tells a different story than one with 500 where each is "Very frequent." Adding frequency data makes the visualization much more informative.
Case-level frequency from phenopacket data provides a complementary, independent signal: what fraction of individual patients with this disease have phenotypes in each system.
Experimental findings
We validated that Solr's JSON Facet API supports all needed aggregations efficiently. A single Solr request can return both disease-level and case-level stats for all 20 bins, using `excludeTags` with `{!tag=...}` to run the case join query in a separate domain from the disease filter.
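As a rough illustration of the single-request idea, the sketch below builds a JSON Facet API request body in Python. It is not the actual query used in the experiments: the field names `subject`, `object`, `object_closure`, and `category`, the `biolink:` category values, the `_cases` suffix, and the function name `build_histopheno_request` are all assumptions made for the example. Only `excludeTags`, `{!tag=...}`, `{!join from=subject to=subject}`, and the aggregation functions come from the findings above.

```python
import json

FREQ_FIELD = "frequency_computed_sortable_float"


def build_histopheno_request(disease_id: str, bins: dict) -> dict:
    """Build a JSON Facet request body (sketch; field names are assumptions)."""
    # Disease-level filters are tagged so case facets can exclude them.
    disease_filters = [
        f'{{!tag=disease}}subject:"{disease_id}"',
        '{!tag=disease}category:"biolink:DiseaseToPhenotypicFeatureAssociation"',
    ]
    # Case facets run in their own domain: drop the tagged disease filters,
    # then keep case-to-phenotype docs whose subject (the case) is linked
    # to the disease through a case-to-disease association.
    case_domain = {
        "excludeTags": "disease",
        "filter": [
            'category:"biolink:CaseToPhenotypicFeatureAssociation"',
            "{!join from=subject to=subject}"
            f'category:"biolink:CaseToDiseaseAssociation" AND object:"{disease_id}"',
        ],
    }
    facets = {
        # Denominator: distinct cases over the full case result set.
        "total_cases": {
            "type": "query",
            "q": "*:*",
            "domain": case_domain,
            "facet": {"total_cases": "unique(subject)"},
        }
    }
    for name, closure_id in bins.items():
        bin_query = f'object_closure:"{closure_id}"'
        # Disease-level stats for this bin.
        facets[name] = {
            "type": "query",
            "q": bin_query,
            "facet": {
                "avg_freq": f"avg({FREQ_FIELD})",
                "num_with_freq": f"countvals({FREQ_FIELD})",
            },
        }
        # Case-level stats for the same bin, in the case domain.
        facets[f"{name}_cases"] = {
            "type": "query",
            "q": bin_query,
            "domain": case_domain,
            "facet": {"distinct_cases": "unique(subject)"},
        }
    return {"query": "*:*", "limit": 0, "filter": disease_filters, "facet": facets}


# "UPHENO:0000000" is a placeholder bin ID, not a real histopheno key.
request_body = build_histopheno_request("MONDO:0020121", {"musculature": "UPHENO:0000000"})
print(json.dumps(request_body, indent=2))
```

Keeping the request as a plain dict makes it easy to inspect or post to Solr with whatever client the service already uses.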
Per-bin stats available from Solr
| Stat | Source | Solr function |
|---|---|---|
| `count` | Disease associations | query facet count (existing) |
| `avg_freq` | Disease associations | `avg(frequency_computed_sortable_float)`, only over docs with data |
| `num_with_freq` | Disease associations | `countvals(frequency_computed_sortable_float)` |
| `distinct_cases` | Case associations | `unique(subject)` via join |
| `total_cases` | Case associations | `unique(subject)` on full case result set |
Example output (MONDO:0020121, Duchenne muscular dystrophy)
| Bin | Assoc | AvgFreq | Cases/136 | Case% |
|---|---|---|---|---|
| musculature | 2135 | 48.3% | 135/136 | 99.3% |
| nervous_system | 1042 | 43.4% | 95/136 | 69.9% |
| head_neck | 608 | 40.2% | 79/136 | 58.1% |
| skeletal_system | 524 | 39.2% | 91/136 | 66.9% |
| eye | 300 | 42.4% | 18/136 | 13.2% |
| metabolism_homeostasis | 229 | 60.0% | 77/136 | 56.6% |
| blood | 204 | 64.6% | 75/136 | 55.1% |
| respiratory | 161 | 32.9% | 50/136 | 36.8% |
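To make the relationship between the per-bin stats and the columns above concrete, here is a minimal Python sketch that turns a facet response into rows. The response layout (a `facets` dict with one bucket per bin, a sibling `<bin>_cases` bucket, and a `total_cases` bucket) and the function name `summarize_bins` are assumptions for illustration; the real layout depends on how the facets end up being named.

```python
def summarize_bins(facets: dict, bin_names: list) -> list:
    """Flatten per-bin facet buckets into rows (sketch; layout is assumed)."""
    total_cases = facets.get("total_cases", {}).get("total_cases", 0)
    rows = []
    for name in bin_names:
        disease = facets.get(name, {})
        cases = facets.get(f"{name}_cases", {})
        distinct = cases.get("distinct_cases", 0)
        rows.append({
            "bin": name,
            "count": disease.get("count", 0),
            # Stats may be absent when no docs in the bin carry the field.
            "avg_freq": disease.get("avg_freq"),
            "num_with_freq": disease.get("num_with_freq"),
            "distinct_cases": distinct,
            "total_cases": total_cases,
            # Avoid dividing by zero for diseases without case data.
            "case_fraction": distinct / total_cases if total_cases else None,
        })
    return rows


# Mock response using the musculature figures from the example output;
# num_with_freq is left out because the example does not list it.
mock_facets = {
    "total_cases": {"count": 0, "total_cases": 136},
    "musculature": {"count": 2135, "avg_freq": 0.483},
    "musculature_cases": {"count": 0, "distinct_cases": 135},
}
for row in summarize_bins(mock_facets, ["musculature"]):
    print(row["bin"], row["count"], f'{row["case_fraction"]:.1%}')
```

Returning `None` for `case_fraction` when there are no cases keeps "no data" distinguishable from a true 0%.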
Performance (warmed, local)
| Query type | QTime |
|---|---|
| Current (traditional facet) | ~2ms |
| JSON facet with avg + countvals | ~5ms |
| Combined disease + case (single request) | ~12ms |
| Adding percentile/median | +~20ms (t-digest computation) |
Percentile adds ~3x CPU cost. Median is arguably more informative than mean for this data (distribution is bimodal, clustered at qualifier midpoints like 0.05, 0.30, 0.80), but could be made optional if server load is a concern.
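A small worked example shows why the two statistics can diverge on clustered data. The values below are made up for illustration and are not taken from the index. In Solr the median would come from the `percentile()` aggregation with the 50th percentile; here the Python standard library stands in for it.

```python
from statistics import mean, median

# Hypothetical bin: six annotations at 0.05 and four at 0.80 (made-up values).
freqs = [0.05] * 6 + [0.80] * 4

bin_mean = mean(freqs)      # falls between the two clusters
bin_median = median(freqs)  # lands on the larger cluster

print(f"mean={bin_mean:.2f} median={bin_median:.2f}")
```

In this made-up bin the mean describes a frequency that no annotation actually has, while the median reports the value most annotations share.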
Data coverage
- ~327K of 15.3M association docs have `frequency_computed_sortable_float`
- For a well-annotated disease like Duchenne, ~65-85% of associations per bin have frequency data
- 168K `CaseToPhenotypicFeatureAssociation` records exist across ~8K cases
- Case availability is disease-dependent (136 cases for Duchenne, 0 for many diseases)
Implementation approach
- Switch from traditional Solr facet queries to the JSON Facet API in `build_histopheno_query`
- Use the `excludeTags` / `{!tag=...}` pattern to run disease and case facets in a single request
- Extend the `HistoBin` model with optional frequency fields (`frequency_mean`, `frequency_count`, `case_count`, `total_cases`, etc.)
- Case stats naturally return 0/absent for diseases without case data
- Consider whether frequency stats should be opt-in via query parameter or always returned
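One possible shape for the extended model is sketched below with a standard-library dataclass so it runs without dependencies. The existing fields (`id`, `label`, `count`) and the `case_frequency` helper are assumptions for illustration; the real `HistoBin` may be defined differently, and only the four optional field names come from the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HistoBin:
    # Assumed existing fields.
    id: str
    label: str
    count: int
    # Proposed optional frequency fields; None means "no data available".
    frequency_mean: Optional[float] = None
    frequency_count: Optional[int] = None
    case_count: Optional[int] = None
    total_cases: Optional[int] = None

    @property
    def case_frequency(self) -> Optional[float]:
        """Hypothetical helper: fraction of cases with a phenotype in this bin."""
        if not self.total_cases or self.case_count is None:
            return None
        return self.case_count / self.total_cases


# Existing callers that only pass id/label/count keep working unchanged.
# "UPHENO:0000000" is a placeholder ID.
plain = HistoBin(id="UPHENO:0000000", label="musculature", count=2135)
enriched = HistoBin(
    id="UPHENO:0000000", label="musculature", count=2135,
    frequency_mean=0.483, case_count=135, total_cases=136,
)
print(plain.case_frequency, enriched.case_frequency)
```

Defaulting every new field to `None` keeps the response backward compatible whether the stats end up opt-in or always returned.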
Technical notes
- Solr `avg()` automatically excludes documents without the field (averages only over docs with frequency data)
- `countvals()` is a lightweight alternative to nested `{type:query}` for counting docs with a field
- The join query `{!join from=subject to=subject}` links cases to diseases efficiently (~4ms warmed)
- Frequency values cluster at qualifier midpoints (0.05=Occasional, 0.30=Frequent, 0.80=Very frequent, 0.01=Very rare, 1.0=Obligate) mixed with precise fractions from `has_count`/`has_total`
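The mix of qualifier values and precise fractions can be pictured with a small normalization sketch. This is not the ingest code that produces `frequency_computed_sortable_float`; the precedence order, the treatment of `has_percentage` as a 0-100 value, keying qualifiers by label, and the function name `normalize_frequency` are all assumptions. The qualifier values are the ones listed above.

```python
from typing import Optional

# Qualifier values as listed in the technical notes.
QUALIFIER_VALUES = {
    "Very rare": 0.01,
    "Occasional": 0.05,
    "Frequent": 0.30,
    "Very frequent": 0.80,
    "Obligate": 1.0,
}


def normalize_frequency(
    has_count: Optional[int] = None,
    has_total: Optional[int] = None,
    has_percentage: Optional[float] = None,
    frequency_qualifier: Optional[str] = None,
) -> Optional[float]:
    """Map whichever frequency field is present to a float in [0, 1] (sketch)."""
    # Assumed precedence: the most precise source wins.
    if has_count is not None and has_total:
        return has_count / has_total
    if has_percentage is not None:
        return has_percentage / 100.0  # assumes a 0-100 scale
    if frequency_qualifier is not None:
        return QUALIFIER_VALUES.get(frequency_qualifier)
    return None  # no frequency data; the doc would simply lack the field


print(normalize_frequency(has_count=1, has_total=4))
print(normalize_frequency(frequency_qualifier="Frequent"))
```

Documents that end up with `None` here correspond to the associations that `avg()` and `countvals()` skip.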