feat: add consent code eligibility logic for search API by NoopDog · Pull Request #199 · NIH-NCPI/ncpi-dataset-catalog

NoopDog · 2026-02-22T04:37:15Z

Summary

Add deterministic consent code eligibility logic (consent_logic.py) that computes which consent codes a researcher is eligible to use based on their research purpose, disease of interest, and profit status
Add compute_consent_eligibility tool to the resolve agent so consent code mentions are expanded to the full set of eligible codes (e.g. "diabetes research" → GRU-* + HMB-* + DS-DIAB-* + DS-T1D-* etc.)
Update extract agent to recognize eligibility language ("what can I use", "consented for", "for-profit") and emit dual focus + consentCode mentions
Disease abbreviations sourced from the authoritative catalog-build/common/disease_abbrev_mapping.tsv (388 entries)

Before/After

| Query | Before | After |
| --- | --- | --- |
| "diabetes research" consent | DS-DIAB only (~9 studies) | GRU-* + HMB-* + DS-DIAB family (~2,300+ studies) |
| "GRU" consent | Exact match (1,081 studies) | All GRU-* variants (1,543 studies) |
| "for-profit cancer" | No consent handling | GRU-* + HMB-* + DS-CA-* minus NPU codes |
| "diabetes only" | Same as above | DS-DIAB-* only (excludes GRU/HMB) |

Key design decisions

Deterministic Python, not LLM: Permission hierarchy (GRU ⊇ HMB ⊇ DS-X) and modifier semantics (NPU = non-profit only) are fixed GA4GH rules computed in pure Python. The LLM's only job is mapping natural language to tool parameters.
Single tool call: resolve_disease_name() accepts full names ("diabetes"), abbreviations ("DIAB"), or partial names ("cardiovascular") so the agent doesn't need a lookup step first.
disease_only flag: Distinguishes "eligible for diabetes" (GRU+HMB+DS-DIAB) from "specifically consented for diabetes only" (DS-DIAB only).
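
The hierarchy and flags described above can be illustrated with a minimal sketch. This is not the PR's `consent_logic.py`: `ParsedCode`, `parse`, and `is_eligible` are hypothetical names, the sample codes are made up, and the sketch assumes a disease-research purpose, an exact disease match (no parent/child expansion such as DS-T1D under diabetes), and that NPU is the only modifier affecting eligibility.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParsedCode:
    """Hypothetical parsed form of a consent code such as 'DS-DIAB-NPU'."""

    base: str               # "GRU", "HMB", or "DS"
    disease: Optional[str]  # disease abbreviation, only set for DS codes
    modifiers: frozenset    # e.g. frozenset({"NPU"})


def parse(code: str) -> ParsedCode:
    """Split a code on '-' into base, optional disease, and modifiers."""
    parts = code.upper().split("-")
    if parts[0] == "DS" and len(parts) > 1:
        return ParsedCode("DS", parts[1], frozenset(parts[2:]))
    return ParsedCode(parts[0], None, frozenset(parts[1:]))


def is_eligible(code: str, disease: str, for_profit: bool = False,
                disease_only: bool = False) -> bool:
    """Decide whether a disease researcher may use data under `code`."""
    parsed = parse(code)
    # NPU limits data to non-profit use, so for-profit users lose it.
    if for_profit and "NPU" in parsed.modifiers:
        return False
    if parsed.base == "DS":
        return parsed.disease == disease.upper()
    # GRU and HMB are broader than any DS-X, unless the caller asked for
    # codes consented specifically for the disease.
    if parsed.base in ("GRU", "HMB"):
        return not disease_only
    return False


# Made-up sample of indexed codes, for illustration only.
codes = ["GRU", "GRU-NPU", "HMB-IRB", "DS-DIAB", "DS-DIAB-NPU", "DS-CA"]
print([c for c in codes if is_eligible(c, "DIAB")])
print([c for c in codes if is_eligible(c, "DIAB", for_profit=True)])
print([c for c in codes if is_eligible(c, "DIAB", disease_only=True)])
```

In this sketch the default call keeps every GRU, HMB, and DS-DIAB code, `for_profit=True` drops the NPU variants, and `disease_only=True` keeps only the DS-DIAB family.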

Closes #198

Test plan

43 unit tests for consent_logic.py (parse, expand, eligibility, disease name resolution)
48 resolve agent eval cases at 1.00 (including 16 new consent cases)
41 extract agent eval cases (including 8 new consent/eligibility cases)
Manual E2E: start server, query "what diabetes datasets can I use?" — should return GRU + HMB + DS-DIAB family

🤖 Generated with Claude Code

Add deterministic consent eligibility computation so the resolve agent can expand consent code mentions into the full set of eligible codes. A query like "what can I use for diabetes research?" now returns GRU, HMB, and DS-DIAB family studies instead of just exact DS-DIAB matches.

- New consent_logic.py: parse codes, expand disease hierarchies, compute eligible code sets (pure Python, zero LLM)
- New compute_consent_eligibility tool in resolve agent with disease name resolution and disease_only/NPU filters
- Updated extract prompt with eligibility cue word recognition
- Updated resolve prompt with Pattern A (explicit) / B (eligibility)
- 43 unit tests, 7 extract eval cases, 12 resolve eval cases added

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…des #198

Replace hand-written disease_abbreviations in consent_codes.json with the 388-entry TSV maintained in catalog-build. Fix substring matching to prefer shortest name. Add dual-mention extract eval and strengthen prompt for nonprofit/for-profit + disease patterns.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Raise FileNotFoundError if disease_abbrev_mapping.tsv is missing instead of silently falling back to empty dict
- Remove unused pytest import
- Add test for cardiovascular → CVD substring preference

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

This PR implements deterministic consent code eligibility logic for the search API, allowing researchers to find datasets by describing their research use case rather than knowing exact consent code strings. The implementation adds a pure Python module for parsing GA4GH consent codes and computing eligible codes based on research purpose, disease of interest, and profit status, integrated into the resolve agent as a new tool.

Changes:

Adds consent_logic.py module with parsing, disease hierarchy expansion, and eligibility computation functions
Integrates compute_consent_eligibility tool into resolve agent for expanding consent code mentions
Updates extract agent prompt to recognize eligibility language and emit dual focus + consent code mentions
Adds comprehensive test coverage with 43 unit tests and updates to eval test suites
Sources disease abbreviations from authoritative TSV file in catalog-build

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

Show a summary per file

| File | Description |
| --- | --- |
| backend/concept_search/consent_logic.py | New module implementing deterministic consent code parsing, disease expansion, name resolution, and eligibility computation |
| backend/tests/test_consent_logic.py | Comprehensive unit tests for all consent_logic functions covering parsing, expansion, eligibility, and disease resolution |
| backend/concept_search/resolve_agent.py | Adds compute_consent_eligibility tool to resolve agent for computing eligible consent codes from research profiles |
| backend/concept_search/consent_codes.json | Replaces disease_abbreviations dict with disease_hierarchy structure defining parent-child disease relationships |
| backend/concept_search/RESOLVE_PROMPT.md | Updates consent code resolution strategy with two patterns: explicit code and eligibility/use-case based resolution |
| backend/concept_search/EXTRACT_PROMPT.md | Adds eligibility language detection rules for identifying when users ask about data usage permissions |
| backend/concept_search/eval_resolve.py | Adds 16 new consent-related test cases covering explicit codes, eligibility queries, and GRU/HMB disambiguation |
| backend/concept_search/eval_extract.py | Adds 8 new test cases for consent eligibility detection and dual mention extraction |


backend/concept_search/consent_logic.py

Copilot · 2026-02-22T04:46:43Z

backend/concept_search/eval_resolve.py

+        ),
+        Case(
+            name="consent-consented-alzheimers",
+            inputs=_mention("Alzheimer's", Facet.CONSENT_CODE),


The test expects "Alzheimer's" (with possessive 's') to resolve to a disease abbreviation and be eligible for consent codes (GRU, HMB). However, the disease_abbrev_mapping.tsv has "Alzheimer Disease" (without the possessive). The substring matching in resolve_disease_name requires either the query to be in the disease name or vice versa. Since "alzheimer's" is not a substring of "alzheimer disease" and vice versa, this lookup will fail and resolve_disease_name will return None. This would cause the eligibility computation to not include DS-ALZ codes. Either the test case should use "Alzheimer" instead of "Alzheimer's", or the resolve_disease_name function needs to be enhanced to handle possessive forms and other variations.

Suggested change

-            inputs=_mention("Alzheimer's", Facet.CONSENT_CODE),
+            inputs=_mention("Alzheimer", Facet.CONSENT_CODE),
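
The substring claim in this comment is easy to check in isolation. The snippet below is a standalone check, not code from the PR; it only assumes the TSV name is "Alzheimer Disease", as the comment states.

```python
query = "Alzheimer's".lower()
name = "Alzheimer Disease".lower()

# Neither string contains the other, so a substring lookup finds nothing.
query_in_name = query in name
name_in_query = name in query

# Dropping the possessive makes the query a prefix of the catalog name.
bare_in_name = "alzheimer" in name

print(query_in_name, name_in_query, bare_in_name)
```

The apostrophe in the query lines up against the space in the catalog name, which is why the match fails in both directions.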

Copilot · 2026-02-22T04:46:44Z

backend/concept_search/eval_extract.py

+            expected_output=ExtractResult(
+                intent="study",
+                mentions=[
+                    _rm("Alzheimer's", Facet.CONSENT_CODE),


Similar to the issue in eval_resolve.py, this test uses "Alzheimer's" (with possessive) which won't match "Alzheimer Disease" in the disease_abbrev_mapping.tsv. The substring matching in resolve_disease_name won't find a match for "alzheimer's" in "alzheimer disease". Consider using "Alzheimer" or "Alzheimer Disease" instead, or enhance the resolve_disease_name function to handle possessive forms.

Suggested change

-                    _rm("Alzheimer's", Facet.CONSENT_CODE),
+                    _rm("Alzheimer Disease", Facet.CONSENT_CODE),

Copilot · 2026-02-22T04:46:44Z

backend/concept_search/RESOLVE_PROMPT.md

+- "general research use" → `compute_consent_eligibility(explicit_code="GRU")`
+- "open access no restrictions" → `compute_consent_eligibility(explicit_code="GRU")`
+- "type 1 diabetes consent" → `compute_consent_eligibility(purpose="disease", disease="type 1 diabetes")`
+- "consented for diabetes only" → `compute_consent_eligibility(purpose="disease", disease="diabetes", disease_only=True)` → returns only DS-DIAB\* codes, not GRU/HMB


Inconsistent escaping of asterisks in documentation. Lines 105-106 use unescaped asterisks (e.g., "GRU*", "HMB*") while line 111 uses an escaped asterisk ("DS-DIAB\*"). For consistency, either all asterisks should be escaped or none should be. Since the earlier lines don't escape them, line 111 should match that style.

Suggested change

- "consented for diabetes only" → `compute_consent_eligibility(purpose="disease", disease="diabetes", disease_only=True)` → returns only DS-DIAB\* codes, not GRU/HMB

- "consented for diabetes only" → `compute_consent_eligibility(purpose="disease", disease="diabetes", disease_only=True)` → returns only DS-DIAB* codes, not GRU/HMB

backend/tests/test_consent_logic.py

- Strip possessives in resolve_disease_name ("Alzheimer's" → ALZ)
- Remove bidirectional substring match (only query-in-name direction)
- Fix class typo TestResolvDiseaseName → TestResolveDiseaseName
- Fix test using old "Cancer (all types)" name → "Cancer"
- Add possessive tests for Alzheimer's and Parkinson's

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.


Copilot · 2026-02-22T05:02:02Z

backend/concept_search/consent_logic.py

+
+_data: dict = json.loads(_DATA_PATH.read_text())
+_MODIFIERS: set[str] = set(_data.get("modifiers", {}))
+_BASE_CODES: set[str] = set(_data.get("base_codes", {}))


The variable _BASE_CODES is defined but never used anywhere in the module. Consider removing it if it's not needed, or using it to validate base codes in parse_consent_code or _is_eligible_by_purpose if there's a need to ensure only valid base codes are processed.

Suggested change

-_BASE_CODES: set[str] = set(_data.get("base_codes", {}))

Copilot · 2026-02-22T05:02:02Z

backend/concept_search/consent_logic.py

+    if lower in _DISEASE_NAME_TO_ABBREV:
+        return _DISEASE_NAME_TO_ABBREV[lower]
+    # Strip possessives ("Alzheimer's" → "alzheimer")
+    normalized = lower.rstrip("'s").rstrip("'")


The possessive stripping logic uses rstrip which removes characters, not suffixes. This means rstrip("'s") removes any trailing ' or s characters, not the suffix "'s". For example, searching for a disease name ending in 's' (like "arthritis") would incorrectly strip the final 's'. Use removesuffix("'s").removesuffix("'") instead to correctly handle possessive forms like "Alzheimer's" and "Parkinson's" without affecting non-possessive words ending in 's'.

Suggested change

-    normalized = lower.rstrip("'s").rstrip("'")
+    normalized = lower.removesuffix("'s").removesuffix("'")
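
The difference between the two calls can be seen in a short standalone comparison. `str.rstrip` treats its argument as a set of characters to strip, while `str.removesuffix` (Python 3.9+) removes the exact suffix at most once. The helper names below are illustrative only.

```python
def strip_with_rstrip(name: str) -> str:
    # Removes ANY trailing run of "'" and "s" characters.
    return name.rstrip("'s").rstrip("'")


def strip_with_removesuffix(name: str) -> str:
    # Removes the literal suffix "'s" (or a lone trailing "'") at most once.
    return name.removesuffix("'s").removesuffix("'")


for word in ("alzheimer's", "arthritis", "stress"):
    print(word, "->", strip_with_rstrip(word), "|", strip_with_removesuffix(word))
```

With `rstrip`, "arthritis" becomes "arthriti" and "stress" becomes "stre"; with `removesuffix`, both are left unchanged and only the possessive form is shortened.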

Copilot · 2026-02-22T05:02:02Z

backend/concept_search/consent_logic.py

+        # OR if user's disease is a child of the code's disease
+        code_diseases = expand_disease(parsed.disease)
+        return bool(code_diseases & user_diseases)
+


The function doesn't handle all base codes listed in consent_codes.json. The base codes NPU, CADM, and IRU are defined but not handled in the eligibility logic. If these codes can appear as standalone consent codes in the index, they would never be considered eligible for any purpose. Either add handling for these codes (with appropriate purpose mappings) or document why they are excluded.

Suggested change

+    # NPU / CADM / IRU: these are restriction/modifier codes, not primary
+    # consent categories. As standalone bases they are not considered to
+    # grant any research use, so they are always ineligible here.
+    if base in ("NPU", "CADM", "IRU"):
+        return False

Copilot · 2026-02-22T05:02:02Z

backend/concept_search/consent_logic.py

+_DISEASE_ABBREVIATIONS: dict[str, str] = {}
+if _DISEASE_TSV_PATH.exists():
+    with _DISEASE_TSV_PATH.open() as f:
+        reader = csv.DictReader(f, delimiter="\t")


If the TSV file has incorrect column names (not "Disease abbrev" or "Disease name"), this will raise a KeyError when accessing row["Disease abbrev"] on line 38. Consider wrapping the CSV reading in a try-except block to provide a more helpful error message if the TSV format is unexpected.

Suggested change

-        reader = csv.DictReader(f, delimiter="\t")
+        reader = csv.DictReader(f, delimiter="\t")
+        required_columns = {"Disease abbrev", "Disease name"}
+        fieldnames = set(reader.fieldnames or [])
+        missing = required_columns - fieldnames
+        if missing:
+            raise ValueError(
+                f"Disease abbreviation TSV '{_DISEASE_TSV_PATH}' is missing "
+                f"required column(s): {', '.join(sorted(missing))}. "
+                f"Expected columns at least: {', '.join(sorted(required_columns))}."
+            )
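
The header check suggested here can be exercised without touching the real file. The sketch below is an assumption-laden stand-in: `load_disease_abbreviations` is a hypothetical function name, the in-memory rows are made up, and only the two column names come from the review comment.

```python
import csv
import io

REQUIRED_COLUMNS = {"Disease abbrev", "Disease name"}


def load_disease_abbreviations(f) -> dict[str, str]:
    """Read abbreviation -> name pairs, failing early on a bad header."""
    reader = csv.DictReader(f, delimiter="\t")
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(
            f"TSV is missing required column(s): {', '.join(sorted(missing))}"
        )
    return {row["Disease abbrev"]: row["Disease name"] for row in reader}


# Made-up sample data standing in for disease_abbrev_mapping.tsv.
good = io.StringIO("Disease abbrev\tDisease name\nDIAB\tDiabetes\n")
bad = io.StringIO("abbrev\tname\nDIAB\tDiabetes\n")

print(load_disease_abbreviations(good))
try:
    load_disease_abbreviations(bad)
except ValueError as exc:
    print("rejected:", exc)
```

Validating `reader.fieldnames` up front turns a late `KeyError` on the first row into a single error that names the missing columns.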

Rewrite to be higher-level: describes the algorithm and data flow without duplicating code. Removes outdated signatures, test tables, and phase breakdowns that diverged from what was built.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Fix rstrip("'s") → removesuffix("'s") to avoid stripping trailing 's' from non-possessive names like "arthritis"
- Remove unused _BASE_CODES variable
- Add comment explaining NPU/CADM/IRU fallthrough in eligibility
- Add test verifying trailing 's' is preserved

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The LLM normalizes disease names before passing them to the tool, so defensive possessive handling is unnecessary complexity.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.


Copilot · 2026-02-22T05:17:03Z

backend/concept_search/consent_logic.py

+def resolve_disease_name(name: str) -> str | None:
+    """Map a disease name or abbreviation to a consent code abbreviation.
+
+    Accepts either an abbreviation (``"DIAB"``) or a full/partial disease
+    name (``"diabetes"``, ``"Breast Cancer"``).  Returns the abbreviation
+    or ``None`` if no match is found.
+
+    Args:
+        name: A disease name or abbreviation.
+
+    Returns:
+        The disease abbreviation, or ``None``.
+    """
+    upper = name.upper()
+    # Direct abbreviation match
+    if upper in _DISEASE_ABBREVIATIONS:
+        return upper
+    # Full name match (case-insensitive)
+    lower = name.lower()
+    if lower in _DISEASE_NAME_TO_ABBREV:
+        return _DISEASE_NAME_TO_ABBREV[lower]
+    # Substring match: find abbreviation whose full name contains the query.
+    # Prefer the shortest matching name (most specific match).
+    best: tuple[str, int] | None = None
+    for full_name, abbrev in _DISEASE_NAME_TO_ABBREV.items():
+        if lower in full_name:
+            if best is None or len(full_name) < best[1]:
+                best = (abbrev, len(full_name))
+    return best[0] if best else None


The PRD states that "Possessive forms ("Alzheimer's") are stripped before matching" but this is not implemented in the resolve_disease_name function. The disease_abbrev_mapping.tsv file has inconsistent use of possessives (e.g., "Alzheimer Disease" without apostrophe, but "Parkinson's Disease" with apostrophe). To ensure reliable matching regardless of user input format, the function should normalize both the input query and the disease names by removing possessive endings (e.g., replacing "'s" and "s'" with empty string) before comparison. This would ensure "Alzheimer's" matches "Alzheimer Disease", "Parkinsons" matches "Parkinson's Disease", etc. Consider adding a helper function like normalize_possessive(name) that strips these patterns, and apply it to both the input name and when building the _DISEASE_NAME_TO_ABBREV dict.
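
The helper suggested in this comment could look roughly like the sketch below. It illustrates the reviewer's idea rather than the merged code (an earlier commit in this PR removed possessive stripping). `normalize_possessive` follows the name proposed in the comment, `resolve` and the two mapping rows are hypothetical, and the "PD" abbreviation is assumed purely for illustration.

```python
from typing import Optional

# Hypothetical name -> abbreviation rows; the real TSV has 388 entries.
NAME_TO_ABBREV = {
    "Alzheimer Disease": "ALZ",
    "Parkinson's Disease": "PD",  # abbreviation assumed for illustration
}


def normalize_possessive(name: str) -> str:
    """Lowercase and drop a trailing "'s" or "'" from every word."""
    words = (w.removesuffix("'s").removesuffix("'") for w in name.lower().split())
    return " ".join(words)


# Normalize the catalog side once, so both sides are compared in one form.
NORMALIZED = {normalize_possessive(n): a for n, a in NAME_TO_ABBREV.items()}


def resolve(query: str) -> Optional[str]:
    """Substring match on normalized names, preferring the shortest name."""
    q = normalize_possessive(query)
    if not q:
        return None
    matches = [(len(n), a) for n, a in NORMALIZED.items() if q in n]
    return min(matches)[1] if matches else None


print(resolve("Alzheimer's"))
print(resolve("Parkinson's Disease"))
print(resolve("Parkinsons"))
```

This sketch only handles apostrophe forms. A query such as "Parkinsons" (no apostrophe) still fails to match, so covering that case would need a further rule than the one shown.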

Copilot · 2026-02-22T05:17:03Z

docs/PRD-consent-code-compatibility.md

+| `concept_search/resolve_agent.py`                 | `compute_consent_eligibility` tool          |
+| `concept_search/RESOLVE_PROMPT.md`                | Pattern A / Pattern B instructions          |
+| `concept_search/EXTRACT_PROMPT.md`                | Eligibility language recognition            |
+| `tests/test_consent_logic.py`                     | 46 unit tests                               |


There is a discrepancy between the PR description (which claims 43 unit tests) and the PRD document (which claims 46 unit tests at line 115). After searching the test file, there are 43 test methods defined. Please verify the correct count and update either the PR description or the PRD to be consistent.

NoopDog and others added 2 commits February 21, 2026 13:02

NoopDog requested a review from Copilot February 22, 2026 04:38

Copilot started reviewing on behalf of NoopDog February 22, 2026 04:39 View session

Copilot AI reviewed Feb 22, 2026

View reviewed changes

NoopDog requested a review from Copilot February 22, 2026 04:51

Copilot started reviewing on behalf of NoopDog February 22, 2026 04:52 View session

Copilot AI reviewed Feb 22, 2026

View reviewed changes

NoopDog and others added 3 commits February 21, 2026 21:03

refactor: remove possessive stripping from resolve_disease_name #198

0d1ed7c


NoopDog requested a review from Copilot February 22, 2026 05:06

Copilot started reviewing on behalf of NoopDog February 22, 2026 05:07 View session

Copilot AI reviewed Feb 22, 2026

View reviewed changes

NoopDog merged commit bc8d7df into main Feb 22, 2026
5 checks passed

NoopDog deleted the noopdog/198/consent-code-compatibility branch February 22, 2026 06:46

github-actions bot mentioned this pull request Feb 22, 2026

chore(main): release 0.11.0 #200

Open
