Enhance histopheno endpoint with frequency statistics per bin #1266

@kevinschaper

Summary

The /v3/api/histopheno/{id} endpoint currently returns only association counts per anatomical system bin. We want to enhance it to also return frequency statistics derived from two complementary data sources:

  1. Disease-level frequency: the average of frequency_computed_sortable_float on DiseaseToPhenotypicFeatureAssociation records (a value normalized from has_count/has_total, has_percentage, or frequency_qualifier)
  2. Case-level frequency: computed as (distinct cases with a phenotype in the bin) / (total cases for the disease), using CaseToPhenotypicFeatureAssociation records linked to the disease through a join on CaseToDiseaseAssociation
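
To make the two definitions concrete, here is a minimal in-memory sketch in Python. It is not the proposed Solr implementation; the function names, the tuple layouts, and the placeholder IDs are all assumptions made for illustration.

```python
from collections import defaultdict


def disease_level_frequency(values):
    """Average the frequency values, skipping records that have none."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


def case_level_frequency(case_to_disease, case_to_bin, disease_id):
    """Return {bin: fraction of the disease's cases with a phenotype in that bin}."""
    # Cases linked to the disease (the CaseToDiseaseAssociation side of the join).
    cases = {case for case, disease in case_to_disease if disease == disease_id}
    if not cases:
        return {}
    # Collect distinct cases per bin so repeated phenotypes count once per case.
    cases_per_bin = defaultdict(set)
    for case, bin_name in case_to_bin:
        if case in cases:
            cases_per_bin[bin_name].add(case)
    return {b: len(members) / len(cases) for b, members in cases_per_bin.items()}


# Hypothetical records: (case, disease) and (case, bin of one phenotype).
case_to_disease = [("case:1", "MONDO:X"), ("case:2", "MONDO:X"), ("case:3", "MONDO:Y")]
case_to_bin = [
    ("case:1", "musculature"),
    ("case:1", "musculature"),
    ("case:2", "musculature"),
    ("case:2", "eye"),
    ("case:3", "eye"),
]

mean_freq = disease_level_frequency([0.5, None, 0.25])
case_freqs = case_level_frequency(case_to_disease, case_to_bin, "MONDO:X")
```

Here both cases of the hypothetical disease have a musculature phenotype, while only one of the two has an eye phenotype; the case belonging to the other disease is ignored.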

Motivation

The current histopheno chart shows how many phenotype associations fall into each anatomical system, but says nothing about how common those phenotypes are. A bin with 500 associations where each phenotype is "Very rare" tells a different story than one with 500 where each is "Very frequent." Adding frequency data makes the visualization much more informative.

Case-level frequency from phenopacket data provides a complementary, independent signal: what fraction of individual patients with this disease have phenotypes in each system.

Experimental findings

We validated that Solr's JSON Facet API supports all needed aggregations efficiently. A single Solr request can return both disease-level and case-level stats for all 20 bins using excludeTags/{!tag=...} to run the case join query in a separate domain from the disease filter.

Per-bin stats available from Solr

Stat            Source                Solr function
count           Disease associations  query facet count (existing)
avg_freq        Disease associations  avg(frequency_computed_sortable_float), only over docs with data
num_with_freq   Disease associations  countvals(frequency_computed_sortable_float)
distinct_cases  Case associations     unique(subject) via join
total_cases     Case associations     unique(subject) on full case result set
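
As a rough sketch of how these stats could be requested together, the Python below builds the parameters for a single JSON Facet request. It has not been run against the Monarch Solr index. The function name, the facet key names, the object_closure bin query, the tagged subject filter, and the HP term used in the usage line are assumptions for illustration; only the aggregation functions, the join, and the tag/excludeTags pattern come from the findings above.

```python
import json

FREQ_FIELD = "frequency_computed_sortable_float"


def build_histopheno_params(disease_id, bins):
    """Build Solr params with a disease facet and a case facet for every bin.

    `bins` maps a bin label to the phenotype term that defines the bin.
    """
    # Select the disease's cases, then join onto every document that shares
    # the same subject (the case).
    case_join = (
        "{!join from=subject to=subject}"
        f'category:"biolink:CaseToDiseaseAssociation" AND object:"{disease_id}"'
    )
    # Domain for case facets: drop the tagged disease filter, apply the join.
    case_domain = {
        "excludeTags": "disease",
        "filter": [
            case_join,
            'category:"biolink:CaseToPhenotypicFeatureAssociation"',
        ],
    }
    facets = {
        "total_cases": {
            "type": "query",
            "q": "*:*",
            "domain": case_domain,
            "facet": {"cases": "unique(subject)"},
        }
    }
    for label, term in bins.items():
        bin_query = f'object_closure:"{term}"'  # assumed bin definition
        facets[label] = {
            "type": "query",
            "q": bin_query,
            "facet": {
                "avg_freq": f"avg({FREQ_FIELD})",
                "num_with_freq": f"countvals({FREQ_FIELD})",
            },
        }
        facets[f"{label}_cases"] = {
            "type": "query",
            "q": bin_query,
            "domain": case_domain,
            "facet": {"distinct_cases": "unique(subject)"},
        }
    return {
        "q": "*:*",
        "rows": 0,
        # Tagged so that the case facets can exclude it (assumed filter field).
        "fq": [f'{{!tag=disease}}subject:"{disease_id}"'],
        "json.facet": json.dumps(facets),
    }


params = build_histopheno_params("MONDO:0020121", {"musculature": "HP:0003011"})
facets = json.loads(params["json.facet"])
```

The disease facets inherit the tagged filter from the main query, while each case facet swaps it for the join through its own domain, which is what lets one request serve both data sources.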

Example output (MONDO:0020121, Duchenne muscular dystrophy)

Bin                    Assoc  AvgFreq  Cases/136  Case%
musculature             2135   48.3%    135/136   99.3%
nervous_system          1042   43.4%     95/136   69.9%
head_neck                608   40.2%     79/136   58.1%
skeletal_system          524   39.2%     91/136   66.9%
eye                      300   42.4%     18/136   13.2%
metabolism_homeostasis   229   60.0%     77/136   56.6%
blood                    204   64.6%     75/136   55.1%
respiratory              161   32.9%     50/136   36.8%

Performance (warmed, local)

Query type                                QTime
Current (traditional facet)               ~2ms
JSON facet with avg + countvals           ~5ms
Combined disease + case (single request)  ~12ms
Adding percentile/median                  +~20ms (t-digest computation)

Percentile adds ~3x CPU cost. Median is arguably more informative than mean for this data, because the distribution is multimodal, clustered at the fixed values assigned to frequency qualifiers (such as 0.05, 0.30, and 0.80). It could be made optional if server load is a concern.
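
A small illustration of why the two statistics can disagree on clustered data, using made-up values rather than real association data. Note that this uses Python's exact median, whereas Solr's percentile is a t-digest approximation.

```python
import statistics

# Made-up bin: six associations at 0.05 and four at 0.80.
values = [0.05] * 6 + [0.80] * 4

mean_value = statistics.mean(values)
median_value = statistics.median(values)

# The mean lands between the two clusters, at a value no association has;
# the median reports the value of the larger cluster.
print(f"mean={mean_value:.2f} median={median_value:.2f}")
```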

Data coverage

  • ~327K of 15.3M association docs have frequency_computed_sortable_float
  • For a well-annotated disease like Duchenne, ~65-85% of associations per bin have frequency data
  • 168K CaseToPhenotypicFeatureAssociation records exist across ~8K cases
  • Case availability is disease-dependent (136 cases for Duchenne, 0 for many diseases)

Implementation approach

  • Switch from traditional Solr facet queries to the JSON Facet API in build_histopheno_query
  • Use excludeTags/{!tag=...} pattern to run disease and case facets in a single request
  • Extend HistoBin model with optional frequency fields (frequency_mean, frequency_count, case_count, total_cases, etc.)
  • Case stats naturally return 0/absent for diseases without case data
  • Consider whether frequency stats should be opt-in via query parameter or always returned
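
One possible shape for the extended model and the response parsing, sketched with a plain dataclass. The existing id, label, and count fields, the facet key names in the response, and the parse_bin helper are assumptions for illustration (the real HistoBin is defined in the project's own model layer), and the sample responses contain made-up numbers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HistoBin:
    id: str
    label: str
    count: int
    # New optional fields; None means "no data", which is distinct from zero.
    frequency_mean: Optional[float] = None
    frequency_count: Optional[int] = None
    case_count: Optional[int] = None
    total_cases: Optional[int] = None

    @property
    def case_frequency(self) -> Optional[float]:
        """Fraction of the disease's cases with a phenotype in this bin."""
        if not self.total_cases or self.case_count is None:
            return None
        return self.case_count / self.total_cases


def parse_bin(term_id, label, facets):
    """Build a HistoBin from the `facets` section of a Solr JSON facet response."""
    disease = facets.get(label, {})
    cases = facets.get(f"{label}_cases", {})
    # Stats missing from an empty bucket are read as None via .get().
    return HistoBin(
        id=term_id,
        label=label,
        count=disease.get("count", 0),
        frequency_mean=disease.get("avg_freq"),
        frequency_count=disease.get("num_with_freq"),
        case_count=cases.get("distinct_cases"),
        total_cases=facets.get("total_cases", {}).get("cases"),
    )


# Made-up responses: one disease with case data, one without.
with_cases = {
    "eye": {"count": 10, "avg_freq": 0.4, "num_with_freq": 7},
    "eye_cases": {"count": 5, "distinct_cases": 3},
    "total_cases": {"count": 9, "cases": 4},
}
without_cases = {
    "eye": {"count": 2, "avg_freq": 0.3, "num_with_freq": 1},
    "eye_cases": {"count": 0},
    "total_cases": {"count": 0},
}

bin_a = parse_bin("HP:0000478", "eye", with_cases)
bin_b = parse_bin("HP:0000478", "eye", without_cases)
```

Keeping the new fields optional means existing clients that only read count are unaffected, whichever way the opt-in question is decided.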

Technical notes

  • Solr avg() automatically excludes documents without the field (averages only over docs with frequency data)
  • countvals() is a lightweight alternative to nested {type:query} for counting docs with a field
  • The join query {!join from=subject to=subject} links cases to diseases efficiently (~4ms warmed)
  • Frequency values cluster at the fixed values assigned to frequency qualifiers (0.01=Very rare, 0.05=Occasional, 0.30=Frequent, 0.80=Very frequent, 1.0=Obligate), mixed with precise fractions from has_count/has_total
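
For reference, the normalization that produces this mix of values might look roughly like the sketch below. The function name, the precedence order of the three sources, the 0-100 scale for has_percentage, and keying the qualifier table by label are assumptions; only the qualifier values themselves come from the note above.

```python
# Qualifier values as listed in the note above, keyed by label for readability.
QUALIFIER_VALUES = {
    "Very rare": 0.01,
    "Occasional": 0.05,
    "Frequent": 0.30,
    "Very frequent": 0.80,
    "Obligate": 1.0,
}


def normalize_frequency(has_count=None, has_total=None, has_percentage=None,
                        frequency_qualifier=None):
    """Return a frequency in [0, 1], or None when no source is available."""
    # Assumed precedence: exact counts, then percentage, then qualifier.
    if has_count is not None and has_total:
        return has_count / has_total
    if has_percentage is not None:
        return has_percentage / 100.0  # assumes a 0-100 percentage scale
    if frequency_qualifier is not None:
        return QUALIFIER_VALUES.get(frequency_qualifier)
    return None


from_counts = normalize_frequency(has_count=3, has_total=4)
from_percentage = normalize_frequency(has_percentage=25)
from_qualifier = normalize_frequency(frequency_qualifier="Frequent")
missing = normalize_frequency()
```

Associations for which this returns None are the ones that avg() skips and countvals() does not count.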
