Skip to content

Comments

feat: add consent code eligibility logic for search API#199

Merged
NoopDog merged 7 commits intomainfrom
noopdog/198/consent-code-compatibility
Feb 22, 2026
Merged

feat: add consent code eligibility logic for search API#199
NoopDog merged 7 commits intomainfrom
noopdog/198/consent-code-compatibility

Conversation

@NoopDog
Copy link
Contributor

@NoopDog NoopDog commented Feb 22, 2026

Summary

  • Add deterministic consent code eligibility logic (consent_logic.py) that computes which consent codes a researcher is eligible to use based on their research purpose, disease of interest, and profit status
  • Add compute_consent_eligibility tool to the resolve agent so consent code mentions are expanded to the full set of eligible codes (e.g. "diabetes research" → GRU-* + HMB-* + DS-DIAB-* + DS-T1D-* etc.)
  • Update extract agent to recognize eligibility language ("what can I use", "consented for", "for-profit") and emit dual focus + consentCode mentions
  • Disease abbreviations sourced from the authoritative catalog-build/common/disease_abbrev_mapping.tsv (388 entries)

Before/After

Query Before After
"diabetes research" consent DS-DIAB only (~9 studies) GRU-* + HMB-* + DS-DIAB family (~2,300+ studies)
"GRU" consent Exact match (1,081 studies) All GRU-* variants (1,543 studies)
"for-profit cancer" No consent handling GRU-* + HMB-* + DS-CA-* minus NPU codes
"diabetes only" Same as above DS-DIAB-* only (excludes GRU/HMB)

Key design decisions

  • Deterministic Python, not LLM: Permission hierarchy (GRU ⊇ HMB ⊇ DS-X) and modifier semantics (NPU = non-profit only) are fixed GA4GH rules computed in pure Python. The LLM's only job is mapping natural language to tool parameters.
  • Single tool call: resolve_disease_name() accepts full names ("diabetes"), abbreviations ("DIAB"), or partial names ("cardiovascular") so the agent doesn't need a lookup step first.
  • disease_only flag: Distinguishes "eligible for diabetes" (GRU+HMB+DS-DIAB) from "specifically consented for diabetes only" (DS-DIAB only).

Closes #198

Test plan

  • 43 unit tests for consent_logic.py (parse, expand, eligibility, disease name resolution)
  • 48 resolve agent eval cases at 1.00 (including 16 new consent cases)
  • 41 extract agent eval cases (including 8 new consent/eligibility cases)
  • Manual E2E: start server, query "what diabetes datasets can I use?" — should return GRU + HMB + DS-DIAB family

🤖 Generated with Claude Code

NoopDog and others added 2 commits February 21, 2026 13:02
Add deterministic consent eligibility computation so the resolve agent
can expand consent code mentions into the full set of eligible codes.
A query like "what can I use for diabetes research?" now returns GRU,
HMB, and DS-DIAB family studies instead of just exact DS-DIAB matches.

- New consent_logic.py: parse codes, expand disease hierarchies,
  compute eligible code sets (pure Python, zero LLM)
- New compute_consent_eligibility tool in resolve agent with
  disease name resolution and disease_only/NPU filters
- Updated extract prompt with eligibility cue word recognition
- Updated resolve prompt with Pattern A (explicit) / B (eligibility)
- 43 unit tests, 7 extract eval cases, 12 resolve eval cases added

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…des #198

Replace hand-written disease_abbreviations in consent_codes.json with
the 388-entry TSV maintained in catalog-build. Fix substring matching
to prefer shortest name. Add dual-mention extract eval and strengthen
prompt for nonprofit/for-profit + disease patterns.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Raise FileNotFoundError if disease_abbrev_mapping.tsv is missing
  instead of silently falling back to empty dict
- Remove unused pytest import
- Add test for cardiovascular → CVD substring preference

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements deterministic consent code eligibility logic for the search API, allowing researchers to find datasets by describing their research use case rather than knowing exact consent code strings. The implementation adds a pure Python module for parsing GA4GH consent codes and computing eligible codes based on research purpose, disease of interest, and profit status, integrated into the resolve agent as a new tool.

Changes:

  • Adds consent_logic.py module with parsing, disease hierarchy expansion, and eligibility computation functions
  • Integrates compute_consent_eligibility tool into resolve agent for expanding consent code mentions
  • Updates extract agent prompt to recognize eligibility language and emit dual focus + consent code mentions
  • Adds comprehensive test coverage with 43 unit tests and updates to eval test suites
  • Sources disease abbreviations from authoritative TSV file in catalog-build

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
backend/concept_search/consent_logic.py New module implementing deterministic consent code parsing, disease expansion, name resolution, and eligibility computation
backend/tests/test_consent_logic.py Comprehensive unit tests for all consent_logic functions covering parsing, expansion, eligibility, and disease resolution
backend/concept_search/resolve_agent.py Adds compute_consent_eligibility tool to resolve agent for computing eligible consent codes from research profiles
backend/concept_search/consent_codes.json Replaces disease_abbreviations dict with disease_hierarchy structure defining parent-child disease relationships
backend/concept_search/RESOLVE_PROMPT.md Updates consent code resolution strategy with two patterns: explicit code and eligibility/use-case based resolution
backend/concept_search/EXTRACT_PROMPT.md Adds eligibility language detection rules for identifying when users ask about data usage permissions
backend/concept_search/eval_resolve.py Adds 16 new consent-related test cases covering explicit codes, eligibility queries, and GRU/HMB disambiguation
backend/concept_search/eval_extract.py Adds 8 new test cases for consent eligibility detection and dual mention extraction

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

),
Case(
name="consent-consented-alzheimers",
inputs=_mention("Alzheimer's", Facet.CONSENT_CODE),
Copy link

Copilot AI Feb 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test expects "Alzheimer's" (with possessive 's') to resolve to a disease abbreviation and be eligible for consent codes (GRU, HMB). However, the disease_abbrev_mapping.tsv has "Alzheimer Disease" (without the possessive). The substring matching in resolve_disease_name requires either the query to be in the disease name or vice versa. Since "alzheimer's" is not a substring of "alzheimer disease" and vice versa, this lookup will fail and resolve_disease_name will return None. This would cause the eligibility computation to not include DS-ALZ codes. Either the test case should use "Alzheimer" instead of "Alzheimer's", or the resolve_disease_name function needs to be enhanced to handle possessive forms and other variations.

Suggested change
inputs=_mention("Alzheimer's", Facet.CONSENT_CODE),
inputs=_mention("Alzheimer", Facet.CONSENT_CODE),

Copilot uses AI. Check for mistakes.
expected_output=ExtractResult(
intent="study",
mentions=[
_rm("Alzheimer's", Facet.CONSENT_CODE),
Copy link

Copilot AI Feb 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the issue in eval_resolve.py, this test uses "Alzheimer's" (with possessive) which won't match "Alzheimer Disease" in the disease_abbrev_mapping.tsv. The substring matching in resolve_disease_name won't find a match for "alzheimer's" in "alzheimer disease". Consider using "Alzheimer" or "Alzheimer Disease" instead, or enhance the resolve_disease_name function to handle possessive forms.

Suggested change
_rm("Alzheimer's", Facet.CONSENT_CODE),
_rm("Alzheimer Disease", Facet.CONSENT_CODE),

Copilot uses AI. Check for mistakes.
- "general research use" → `compute_consent_eligibility(explicit_code="GRU")`
- "open access no restrictions" → `compute_consent_eligibility(explicit_code="GRU")`
- "type 1 diabetes consent" → `compute_consent_eligibility(purpose="disease", disease="type 1 diabetes")`
- "consented for diabetes only" → `compute_consent_eligibility(purpose="disease", disease="diabetes", disease_only=True)` → returns only DS-DIAB\* codes, not GRU/HMB
Copy link

Copilot AI Feb 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent escaping of asterisks in documentation. Lines 105-106 use unescaped asterisks (e.g., "GRU*", "HMB*") while line 111 uses escaped asterisks ("DS-DIAB*"). For consistency, either all asterisks should be escaped or none should be. Since the earlier lines don't escape them, line 111 should match that style.

Suggested change
- "consented for diabetes only" → `compute_consent_eligibility(purpose="disease", disease="diabetes", disease_only=True)` → returns only DS-DIAB\* codes, not GRU/HMB
- "consented for diabetes only" → `compute_consent_eligibility(purpose="disease", disease="diabetes", disease_only=True)` → returns only DS-DIAB* codes, not GRU/HMB

Copilot uses AI. Check for mistakes.
- Strip possessives in resolve_disease_name ("Alzheimer's" → ALZ)
- Remove bidirectional substring match (only query-in-name direction)
- Fix class typo TestResolvDiseaseName → TestResolveDiseaseName
- Fix test using old "Cancer (all types)" name → "Cancer"
- Add possessive tests for Alzheimer's and Parkinson's

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


_data: dict = json.loads(_DATA_PATH.read_text())
_MODIFIERS: set[str] = set(_data.get("modifiers", {}))
_BASE_CODES: set[str] = set(_data.get("base_codes", {}))
Copy link

Copilot AI Feb 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable _BASE_CODES is defined but never used anywhere in the module. Consider removing it if it's not needed, or using it to validate base codes in parse_consent_code or _is_eligible_by_purpose if there's a need to ensure only valid base codes are processed.

Suggested change
_BASE_CODES: set[str] = set(_data.get("base_codes", {}))

Copilot uses AI. Check for mistakes.
if lower in _DISEASE_NAME_TO_ABBREV:
return _DISEASE_NAME_TO_ABBREV[lower]
# Strip possessives ("Alzheimer's" → "alzheimer")
normalized = lower.rstrip("'s").rstrip("'")
Copy link

Copilot AI Feb 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The possessive stripping logic uses rstrip which removes characters, not suffixes. This means rstrip("'s") removes any trailing ' or s characters, not the suffix "'s". For example, searching for a disease name ending in 's' (like "arthritis") would incorrectly strip the final 's'. Use removesuffix("'s").removesuffix("'") instead to correctly handle possessive forms like "Alzheimer's" and "Parkinson's" without affecting non-possessive words ending in 's'.

Suggested change
normalized = lower.rstrip("'s").rstrip("'")
normalized = lower.removesuffix("'s").removesuffix("'")

Copilot uses AI. Check for mistakes.
# OR if user's disease is a child of the code's disease
code_diseases = expand_disease(parsed.disease)
return bool(code_diseases & user_diseases)

Copy link

Copilot AI Feb 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function doesn't handle all base codes listed in consent_codes.json. The base codes NPU, CADM, and IRU are defined but not handled in the eligibility logic. If these codes can appear as standalone consent codes in the index, they would never be considered eligible for any purpose. Either add handling for these codes (with appropriate purpose mappings) or document why they are excluded.

Suggested change
# NPU / CADM / IRU: these are restriction/modifier codes, not primary
# consent categories. As standalone bases they are not considered to
# grant any research use, so they are always ineligible here.
if base in ("NPU", "CADM", "IRU"):
return False

Copilot uses AI. Check for mistakes.
_DISEASE_ABBREVIATIONS: dict[str, str] = {}
if _DISEASE_TSV_PATH.exists():
with _DISEASE_TSV_PATH.open() as f:
reader = csv.DictReader(f, delimiter="\t")
Copy link

Copilot AI Feb 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the TSV file has incorrect column names (not "Disease abbrev" or "Disease name"), this will raise a KeyError when accessing row["Disease abbrev"] on line 38. Consider wrapping the CSV reading in a try-except block to provide a more helpful error message if the TSV format is unexpected.

Suggested change
reader = csv.DictReader(f, delimiter="\t")
reader = csv.DictReader(f, delimiter="\t")
required_columns = {"Disease abbrev", "Disease name"}
fieldnames = set(reader.fieldnames or [])
missing = required_columns - fieldnames
if missing:
raise ValueError(
f"Disease abbreviation TSV '{_DISEASE_TSV_PATH}' is missing "
f"required column(s): {', '.join(sorted(missing))}. "
f"Expected columns at least: {', '.join(sorted(required_columns))}."
)

Copilot uses AI. Check for mistakes.
NoopDog and others added 3 commits February 21, 2026 21:03
Rewrite to be higher-level — describes the algorithm and data flow
without duplicating code. Removes outdated signatures, test tables,
and phase breakdowns that diverged from what was built.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix rstrip("'s") → removesuffix("'s") to avoid stripping trailing
  's' from non-possessive names like "arthritis"
- Remove unused _BASE_CODES variable
- Add comment explaining NPU/CADM/IRU fallthrough in eligibility
- Add test verifying trailing 's' is preserved

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The LLM normalizes disease names before passing them to the tool,
so defensive possessive handling is unnecessary complexity.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +127 to +155
def resolve_disease_name(name: str) -> str | None:
"""Map a disease name or abbreviation to a consent code abbreviation.

Accepts either an abbreviation (``"DIAB"``) or a full/partial disease
name (``"diabetes"``, ``"Breast Cancer"``). Returns the abbreviation
or ``None`` if no match is found.

Args:
name: A disease name or abbreviation.

Returns:
The disease abbreviation, or ``None``.
"""
upper = name.upper()
# Direct abbreviation match
if upper in _DISEASE_ABBREVIATIONS:
return upper
# Full name match (case-insensitive)
lower = name.lower()
if lower in _DISEASE_NAME_TO_ABBREV:
return _DISEASE_NAME_TO_ABBREV[lower]
# Substring match: find abbreviation whose full name contains the query.
# Prefer the shortest matching name (most specific match).
best: tuple[str, int] | None = None
for full_name, abbrev in _DISEASE_NAME_TO_ABBREV.items():
if lower in full_name:
if best is None or len(full_name) < best[1]:
best = (abbrev, len(full_name))
return best[0] if best else None
Copy link

Copilot AI Feb 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PRD states that "Possessive forms ("Alzheimer's") are stripped before matching" but this is not implemented in the resolve_disease_name function. The disease_abbrev_mapping.tsv file has inconsistent use of possessives (e.g., "Alzheimer Disease" without apostrophe, but "Parkinson's Disease" with apostrophe). To ensure reliable matching regardless of user input format, the function should normalize both the input query and the disease names by removing possessive endings (e.g., replacing "'s" and "s'" with empty string) before comparison. This would ensure "Alzheimer's" matches "Alzheimer Disease", "Parkinsons" matches "Parkinson's Disease", etc. Consider adding a helper function like normalize_possessive(name) that strips these patterns, and apply it to both the input name and when building the _DISEASE_NAME_TO_ABBREV dict.

Copilot uses AI. Check for mistakes.
| `concept_search/resolve_agent.py` | `compute_consent_eligibility` tool |
| `concept_search/RESOLVE_PROMPT.md` | Pattern A / Pattern B instructions |
| `concept_search/EXTRACT_PROMPT.md` | Eligibility language recognition |
| `tests/test_consent_logic.py` | 46 unit tests |
Copy link

Copilot AI Feb 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a discrepancy between the PR description (which claims 43 unit tests) and the PRD document (which claims 46 unit tests at line 115). After searching the test file, there are 43 test methods defined. Please verify the correct count and update either the PR description or the PRD to be consistent.

Copilot uses AI. Check for mistakes.
@NoopDog NoopDog merged commit bc8d7df into main Feb 22, 2026
5 checks passed
@NoopDog NoopDog deleted the noopdog/198/consent-code-compatibility branch February 22, 2026 06:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: consent code compatibility model for search API

1 participant