feat: add consent code eligibility logic for search API#199
Conversation
Add deterministic consent eligibility computation so the resolve agent can expand consent code mentions into the full set of eligible codes. A query like "what can I use for diabetes research?" now returns GRU, HMB, and DS-DIAB family studies instead of just exact DS-DIAB matches. - New consent_logic.py: parse codes, expand disease hierarchies, compute eligible code sets (pure Python, zero LLM) - New compute_consent_eligibility tool in resolve agent with disease name resolution and disease_only/NPU filters - Updated extract prompt with eligibility cue word recognition - Updated resolve prompt with Pattern A (explicit) / B (eligibility) - 43 unit tests, 7 extract eval cases, 12 resolve eval cases added Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…des #198 Replace hand-written disease_abbreviations in consent_codes.json with the 388-entry TSV maintained in catalog-build. Fix substring matching to prefer shortest name. Add dual-mention extract eval and strengthen prompt for nonprofit/for-profit + disease patterns. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Raise FileNotFoundError if disease_abbrev_mapping.tsv is missing instead of silently falling back to empty dict - Remove unused pytest import - Add test for cardiovascular → CVD substring preference Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR implements deterministic consent code eligibility logic for the search API, allowing researchers to find datasets by describing their research use case rather than knowing exact consent code strings. The implementation adds a pure Python module for parsing GA4GH consent codes and computing eligible codes based on research purpose, disease of interest, and profit status, integrated into the resolve agent as a new tool.
Changes:
- Adds
consent_logic.pymodule with parsing, disease hierarchy expansion, and eligibility computation functions - Integrates
compute_consent_eligibilitytool into resolve agent for expanding consent code mentions - Updates extract agent prompt to recognize eligibility language and emit dual focus + consent code mentions
- Adds comprehensive test coverage with 43 unit tests and updates to eval test suites
- Sources disease abbreviations from authoritative TSV file in catalog-build
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| backend/concept_search/consent_logic.py | New module implementing deterministic consent code parsing, disease expansion, name resolution, and eligibility computation |
| backend/tests/test_consent_logic.py | Comprehensive unit tests for all consent_logic functions covering parsing, expansion, eligibility, and disease resolution |
| backend/concept_search/resolve_agent.py | Adds compute_consent_eligibility tool to resolve agent for computing eligible consent codes from research profiles |
| backend/concept_search/consent_codes.json | Replaces disease_abbreviations dict with disease_hierarchy structure defining parent-child disease relationships |
| backend/concept_search/RESOLVE_PROMPT.md | Updates consent code resolution strategy with two patterns: explicit code and eligibility/use-case based resolution |
| backend/concept_search/EXTRACT_PROMPT.md | Adds eligibility language detection rules for identifying when users ask about data usage permissions |
| backend/concept_search/eval_resolve.py | Adds 16 new consent-related test cases covering explicit codes, eligibility queries, and GRU/HMB disambiguation |
| backend/concept_search/eval_extract.py | Adds 8 new test cases for consent eligibility detection and dual mention extraction |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ), | ||
| Case( | ||
| name="consent-consented-alzheimers", | ||
| inputs=_mention("Alzheimer's", Facet.CONSENT_CODE), |
There was a problem hiding this comment.
The test expects "Alzheimer's" (with possessive 's') to resolve to a disease abbreviation and be eligible for consent codes (GRU, HMB). However, the disease_abbrev_mapping.tsv has "Alzheimer Disease" (without the possessive). The substring matching in resolve_disease_name requires either the query to be in the disease name or vice versa. Since "alzheimer's" is not a substring of "alzheimer disease" and vice versa, this lookup will fail and resolve_disease_name will return None. This would cause the eligibility computation to not include DS-ALZ codes. Either the test case should use "Alzheimer" instead of "Alzheimer's", or the resolve_disease_name function needs to be enhanced to handle possessive forms and other variations.
| inputs=_mention("Alzheimer's", Facet.CONSENT_CODE), | |
| inputs=_mention("Alzheimer", Facet.CONSENT_CODE), |
| expected_output=ExtractResult( | ||
| intent="study", | ||
| mentions=[ | ||
| _rm("Alzheimer's", Facet.CONSENT_CODE), |
There was a problem hiding this comment.
Similar to the issue in eval_resolve.py, this test uses "Alzheimer's" (with possessive) which won't match "Alzheimer Disease" in the disease_abbrev_mapping.tsv. The substring matching in resolve_disease_name won't find a match for "alzheimer's" in "alzheimer disease". Consider using "Alzheimer" or "Alzheimer Disease" instead, or enhance the resolve_disease_name function to handle possessive forms.
| _rm("Alzheimer's", Facet.CONSENT_CODE), | |
| _rm("Alzheimer Disease", Facet.CONSENT_CODE), |
| - "general research use" → `compute_consent_eligibility(explicit_code="GRU")` | ||
| - "open access no restrictions" → `compute_consent_eligibility(explicit_code="GRU")` | ||
| - "type 1 diabetes consent" → `compute_consent_eligibility(purpose="disease", disease="type 1 diabetes")` | ||
| - "consented for diabetes only" → `compute_consent_eligibility(purpose="disease", disease="diabetes", disease_only=True)` → returns only DS-DIAB\* codes, not GRU/HMB |
There was a problem hiding this comment.
Inconsistent escaping of asterisks in documentation. Lines 105-106 use unescaped asterisks (e.g., "GRU*", "HMB*") while line 111 uses escaped asterisks ("DS-DIAB*"). For consistency, either all asterisks should be escaped or none should be. Since the earlier lines don't escape them, line 111 should match that style.
| - "consented for diabetes only" → `compute_consent_eligibility(purpose="disease", disease="diabetes", disease_only=True)` → returns only DS-DIAB\* codes, not GRU/HMB | |
| - "consented for diabetes only" → `compute_consent_eligibility(purpose="disease", disease="diabetes", disease_only=True)` → returns only DS-DIAB* codes, not GRU/HMB |
- Strip possessives in resolve_disease_name ("Alzheimer's" → ALZ)
- Remove bidirectional substring match (only query-in-name direction)
- Fix class typo TestResolvDiseaseName → TestResolveDiseaseName
- Fix test using old "Cancer (all types)" name → "Cancer"
- Add possessive tests for Alzheimer's and Parkinson's
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| _data: dict = json.loads(_DATA_PATH.read_text()) | ||
| _MODIFIERS: set[str] = set(_data.get("modifiers", {})) | ||
| _BASE_CODES: set[str] = set(_data.get("base_codes", {})) |
There was a problem hiding this comment.
The variable _BASE_CODES is defined but never used anywhere in the module. Consider removing it if it's not needed, or using it to validate base codes in parse_consent_code or _is_eligible_by_purpose if there's a need to ensure only valid base codes are processed.
| _BASE_CODES: set[str] = set(_data.get("base_codes", {})) |
| if lower in _DISEASE_NAME_TO_ABBREV: | ||
| return _DISEASE_NAME_TO_ABBREV[lower] | ||
| # Strip possessives ("Alzheimer's" → "alzheimer") | ||
| normalized = lower.rstrip("'s").rstrip("'") |
There was a problem hiding this comment.
The possessive stripping logic uses rstrip which removes characters, not suffixes. This means rstrip("'s") removes any trailing ' or s characters, not the suffix "'s". For example, searching for a disease name ending in 's' (like "arthritis") would incorrectly strip the final 's'. Use removesuffix("'s").removesuffix("'") instead to correctly handle possessive forms like "Alzheimer's" and "Parkinson's" without affecting non-possessive words ending in 's'.
| normalized = lower.rstrip("'s").rstrip("'") | |
| normalized = lower.removesuffix("'s").removesuffix("'") |
| # OR if user's disease is a child of the code's disease | ||
| code_diseases = expand_disease(parsed.disease) | ||
| return bool(code_diseases & user_diseases) | ||
|
|
There was a problem hiding this comment.
The function doesn't handle all base codes listed in consent_codes.json. The base codes NPU, CADM, and IRU are defined but not handled in the eligibility logic. If these codes can appear as standalone consent codes in the index, they would never be considered eligible for any purpose. Either add handling for these codes (with appropriate purpose mappings) or document why they are excluded.
| # NPU / CADM / IRU: these are restriction/modifier codes, not primary | |
| # consent categories. As standalone bases they are not considered to | |
| # grant any research use, so they are always ineligible here. | |
| if base in ("NPU", "CADM", "IRU"): | |
| return False |
| _DISEASE_ABBREVIATIONS: dict[str, str] = {} | ||
| if _DISEASE_TSV_PATH.exists(): | ||
| with _DISEASE_TSV_PATH.open() as f: | ||
| reader = csv.DictReader(f, delimiter="\t") |
There was a problem hiding this comment.
If the TSV file has incorrect column names (not "Disease abbrev" or "Disease name"), this will raise a KeyError when accessing row["Disease abbrev"] on line 38. Consider wrapping the CSV reading in a try-except block to provide a more helpful error message if the TSV format is unexpected.
| reader = csv.DictReader(f, delimiter="\t") | |
| reader = csv.DictReader(f, delimiter="\t") | |
| required_columns = {"Disease abbrev", "Disease name"} | |
| fieldnames = set(reader.fieldnames or []) | |
| missing = required_columns - fieldnames | |
| if missing: | |
| raise ValueError( | |
| f"Disease abbreviation TSV '{_DISEASE_TSV_PATH}' is missing " | |
| f"required column(s): {', '.join(sorted(missing))}. " | |
| f"Expected columns at least: {', '.join(sorted(required_columns))}." | |
| ) |
Rewrite to be higher-level — describes the algorithm and data flow without duplicating code. Removes outdated signatures, test tables, and phase breakdowns that diverged from what was built. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix rstrip("'s") → removesuffix("'s") to avoid stripping trailing
's' from non-possessive names like "arthritis"
- Remove unused _BASE_CODES variable
- Add comment explaining NPU/CADM/IRU fallthrough in eligibility
- Add test verifying trailing 's' is preserved
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The LLM normalizes disease names before passing them to the tool, so defensive possessive handling is unnecessary complexity. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def resolve_disease_name(name: str) -> str | None: | ||
| """Map a disease name or abbreviation to a consent code abbreviation. | ||
|
|
||
| Accepts either an abbreviation (``"DIAB"``) or a full/partial disease | ||
| name (``"diabetes"``, ``"Breast Cancer"``). Returns the abbreviation | ||
| or ``None`` if no match is found. | ||
|
|
||
| Args: | ||
| name: A disease name or abbreviation. | ||
|
|
||
| Returns: | ||
| The disease abbreviation, or ``None``. | ||
| """ | ||
| upper = name.upper() | ||
| # Direct abbreviation match | ||
| if upper in _DISEASE_ABBREVIATIONS: | ||
| return upper | ||
| # Full name match (case-insensitive) | ||
| lower = name.lower() | ||
| if lower in _DISEASE_NAME_TO_ABBREV: | ||
| return _DISEASE_NAME_TO_ABBREV[lower] | ||
| # Substring match: find abbreviation whose full name contains the query. | ||
| # Prefer the shortest matching name (most specific match). | ||
| best: tuple[str, int] | None = None | ||
| for full_name, abbrev in _DISEASE_NAME_TO_ABBREV.items(): | ||
| if lower in full_name: | ||
| if best is None or len(full_name) < best[1]: | ||
| best = (abbrev, len(full_name)) | ||
| return best[0] if best else None |
There was a problem hiding this comment.
The PRD states that "Possessive forms ("Alzheimer's") are stripped before matching" but this is not implemented in the resolve_disease_name function. The disease_abbrev_mapping.tsv file has inconsistent use of possessives (e.g., "Alzheimer Disease" without apostrophe, but "Parkinson's Disease" with apostrophe). To ensure reliable matching regardless of user input format, the function should normalize both the input query and the disease names by removing possessive endings (e.g., replacing "'s" and "s'" with empty string) before comparison. This would ensure "Alzheimer's" matches "Alzheimer Disease", "Parkinsons" matches "Parkinson's Disease", etc. Consider adding a helper function like normalize_possessive(name) that strips these patterns, and apply it to both the input name and when building the _DISEASE_NAME_TO_ABBREV dict.
| | `concept_search/resolve_agent.py` | `compute_consent_eligibility` tool | | ||
| | `concept_search/RESOLVE_PROMPT.md` | Pattern A / Pattern B instructions | | ||
| | `concept_search/EXTRACT_PROMPT.md` | Eligibility language recognition | | ||
| | `tests/test_consent_logic.py` | 46 unit tests | |
There was a problem hiding this comment.
There is a discrepancy between the PR description (which claims 43 unit tests) and the PRD document (which claims 46 unit tests at line 115). After searching the test file, there are 43 test methods defined. Please verify the correct count and update either the PR description or the PRD to be consistent.
Summary
consent_logic.py) that computes which consent codes a researcher is eligible to use based on their research purpose, disease of interest, and profit statuscompute_consent_eligibilitytool to the resolve agent so consent code mentions are expanded to the full set of eligible codes (e.g. "diabetes research" → GRU-* + HMB-* + DS-DIAB-* + DS-T1D-* etc.)catalog-build/common/disease_abbrev_mapping.tsv(388 entries)Before/After
Key design decisions
resolve_disease_name()accepts full names ("diabetes"), abbreviations ("DIAB"), or partial names ("cardiovascular") so the agent doesn't need a lookup step first.Closes #198
Test plan
consent_logic.py(parse, expand, eligibility, disease name resolution)🤖 Generated with Claude Code