feat: add variable-level search with intent detection #197
Support searching for measured variables (not just studies) via natural language. The extract agent now detects query intent (study/variable/auto), a new DuckDB variables table stores ~450k variable rows from llm-concepts, and the API branches on intent to return variable results with dbGaP links.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
force-pushed 8a9adaf to 9f47af8
Pull request overview
This PR implements variable-level search functionality, allowing researchers to search for specific measured variables (with dbGaP PHV IDs and descriptions) in addition to the existing study-level search. The feature adds intent detection to distinguish between study queries ("diabetes datasets on AnVIL") and variable queries ("what variables measure chocolate consumption?"), with an "auto" fallback for ambiguous queries that prompts user clarification.
Changes:
- Adds a new DuckDB variables table (~450k rows) populated from llm-concepts JSON with per-variable metadata (PHV ID, name, description, concept, dataset/table context)
- Implements query intent detection in the extract agent with three modes: "study", "variable", and "auto" (ambiguous)
- Extends the API to branch on intent: study results for study queries, variable results for variable queries
- Updates the frontend to render a variable results table with clickable dbGaP links for both variables and studies
- Adds 9 new intent detection test cases to eval_extract.py and Makefile commands for local development
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| docs/PRD-variable-search.md | Product requirements document detailing the variable search feature design and scope |
| backend/concept_search/store.py | Adds variables table schema, batch loading method, query method, and index creation |
| backend/concept_search/index.py | Extends _load_measurement_concepts to populate variable rows from llm-concepts JSON |
| backend/concept_search/models.py | Adds Intent type and intent field to ExtractResult and QueryModel |
| backend/concept_search/api_models.py | Adds VariableResult model and extends SearchResponse with intent and variable fields |
| backend/concept_search/api.py | Implements intent branching logic, variable URL construction, and variable result building |
| backend/concept_search/pipeline.py | Propagates intent from extract result through to final QueryModel |
| backend/concept_search/EXTRACT_PROMPT.md | Adds intent detection guidance with heuristics and examples for study/variable/auto classification |
| backend/concept_search/eval_extract.py | Adds intent scoring and 9 new test cases covering variable, study, and ambiguous intent queries |
| backend/Makefile | Adds local development commands: start, stop, restart, db-reload, and evals |
| app/components/Chat/chat.tsx | Adds Variable interface and conditional rendering of variable results table vs study table based on intent |
Comments suppressed due to low confidence (1)
backend/concept_search/api.py:264
- The error response for timeout and general exceptions doesn't include the new required fields added to SearchResponse in this PR. The response is missing the intent, totalVariables, and variables fields. While the frontend might handle this gracefully due to default values, it's inconsistent with the updated SearchResponse model. Consider adding these fields with appropriate defaults (e.g., intent="study", total_variables=0, variables=[]) to ensure consistency.
_log_json(event="rate_limited", ip=client_ip, query=request.query)
return JSONResponse(
    content={"detail": "Too many requests — please try again later."},
    status_code=429,
)
_log_json(event="search_request", query=request.query)
t_start = time.monotonic()
# Run the 3-agent LLM pipeline (semaphore + timeout)
try:
    async with _pipeline_semaphore:
        query_model = await asyncio.wait_for(
            run_pipeline(request.query), timeout=60.0
        )
except (TimeoutError, asyncio.CancelledError):
    _log_json(event="search_timeout", query=request.query)
    elapsed_ms = int((time.monotonic() - t_start) * 1000)
    return SearchResponse(
        message="Search timed out — please try a simpler query.",
        query=QueryModel(mentions=[]),
        studies=[],
        timing=SearchTiming(
            lookup_ms=0,
            pipeline_ms=elapsed_ms,
            total_ms=elapsed_ms,
        ),
        total_studies=0,
    )
except Exception as exc:
    "LEFT JOIN studies s ON v.study_id = s.db_gap_id "
    f"WHERE v.concept_lower IN ({placeholders}) "  # noqa: S608
    "ORDER BY v.concept, v.study_id, v.variable_name "
    f"LIMIT {limit}"
The LIMIT clause uses direct string interpolation of the limit parameter without validation. While the default is 100, if this method is ever called with user input for the limit parameter, it could lead to SQL injection. Consider using a parameterized query or at least validating that limit is a positive integer before interpolation.
Won't fix — limit is a Python int with default 100, never derived from user input. Adding validation for an internal parameter would be over-engineering.
concept: str
dataset_id: str
db_gap_url: str
description: str
The PRD specifies the field name should be variableDescription (line 106 of PRD), but the implementation uses description. This inconsistency means the API response field will be serialized as description instead of the expected variableDescription. Update the field name to match the PRD specification.
Suggested change:
-    description: str
+    description: str = Field(alias="variableDescription")
Won't fix — the PRD is a design document, not a spec. description is cleaner and consistent with how other models name this field. The frontend already consumes it as description.
        Full URL to the variable page on dbGaP.
    """
    # Extract the numeric portion from the phv ID (strip "phv" prefix and version)
    phv_num = phv_id.split(".")[0].replace("phv", "")
If phv_id is empty or malformed (e.g., doesn't contain a dot), phv_id.split(".")[0] could fail or produce unexpected results. The function should validate that phv_id is well-formed before attempting to extract the numeric portion. Consider adding a guard clause that returns an empty string or a fallback URL if the phv_id format is invalid.
Suggested change:
-        Full URL to the variable page on dbGaP.
-    """
-    # Extract the numeric portion from the phv ID (strip "phv" prefix and version)
-    phv_num = phv_id.split(".")[0].replace("phv", "")
+        Full URL to the variable page on dbGaP. If the provided ``phv_id`` is
+        empty or not in the expected format, an empty string is returned.
+    """
+    # Validate phv_id format before attempting to extract the numeric portion.
+    if not phv_id or "." not in phv_id:
+        return ""
+    # Extract the numeric portion from the phv ID (strip "phv" prefix and version)
+    phv_prefix = phv_id.split(".", 1)[0]
+    phv_num = phv_prefix.replace("phv", "")
+    # Ensure we have a non-empty numeric identifier; otherwise, bail out.
+    if not phv_num.isdigit():
+        return ""
        timing=SearchTiming(
            lookup_ms=0,
            pipeline_ms=elapsed_ms,
            total_ms=elapsed_ms,
        ),
        total_studies=0,
    )
t_pipeline = time.monotonic()

-# Deterministic study lookup
+# Deterministic lookup — branch on intent
index = get_index()
When intent == "variable" but there are no MEASUREMENT facet mentions (or all MEASUREMENT mentions are excluded), the concepts list will be empty and no variable results will be returned. This could happen if a user asks a variable-level question about non-measurement facets (e.g., "what variables are in WGS studies?"). Consider adding a message or handling this case explicitly to guide the user that variable search requires measurement concepts.
Won't fix — this correctly returns zero results, which is the expected behavior. The extract agent will always produce measurement mentions for variable-intent queries; this edge case only arises from malformed pipeline output.
# Deterministic lookup — branch on intent
index = get_index()
The variable query only executes when isinstance(index.store, DuckDBStore) is True. If a different store implementation is used in the future, variable queries will silently return no results even though the intent was "variable". Consider logging a warning or returning an informative message when variable search is requested but the store doesn't support it.
Won't fix — DuckDBStore is the only store implementation and there are no plans for alternatives. The isinstance check is a type-narrowing guard, not a feature gate.
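If alternative stores ever do appear, a typing.Protocol would make "this store supports variable search" explicit instead of an isinstance check against a concrete class. A hypothetical sketch, not part of the PR:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class SupportsVariableQuery(Protocol):
    """Structural type: any store exposing query_variables qualifies."""

    def query_variables(self, concepts: list[str], limit: int = 100) -> list[dict]: ...

class FakeStore:
    """Toy store used only to show the structural check passing."""

    def query_variables(self, concepts: list[str], limit: int = 100) -> list[dict]:
        return []

# isinstance works structurally thanks to @runtime_checkable.
print(isinstance(FakeStore(), SupportsVariableQuery))
```
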
def query_variables(
    self,
    concepts: list[str],
    limit: int = 100,
) -> list[dict]:
    """Return variables matching any of the given concept names.

    Args:
        concepts: Canonical concept names to match (OR-ed).
        limit: Maximum number of variable rows to return.

    Returns:
        Variable dicts with study title joined from the studies table.
    """
    if not concepts:
        return []
    placeholders = ", ".join("?" for _ in concepts)
    sql = (
        "SELECT v.concept, v.dataset_id, v.description, v.phv_id,"
        " v.study_id, v.table_name, v.variable_name,"
        " json_extract_string(s.raw_json, '$.title') AS study_title "
        "FROM variables v "
        "LEFT JOIN studies s ON v.study_id = s.db_gap_id "
        f"WHERE v.concept_lower IN ({placeholders}) "  # noqa: S608
        "ORDER BY v.concept, v.study_id, v.variable_name "
        f"LIMIT {limit}"
    )
    params = [c.lower() for c in concepts]
    rows = self._conn.execute(sql, params).fetchall()
    cols = [
        "concept", "datasetId", "description", "phvId",
        "studyId", "tableName", "variableName", "studyTitle",
    ]
    return [dict(zip(cols, row)) for row in rows]
The new query_variables method and variables table functionality lack test coverage. The existing test suite (backend/tests/test_index.py) has comprehensive tests for query_studies and the DuckDB store, but there are no tests for the new variable search functionality. Add tests covering: variable table creation, variable batch loading, query_variables method with various concept inputs, variable URL generation, and integration with the API layer for variable intent queries.
Agreed — will add variable store tests in a follow-up. The feature is covered by 32 extract evals and manual integration testing.
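For the promised follow-up, the expected semantics of query_variables can be pinned down against a toy in-memory stand-in (names and rows below are illustrative; real tests would run against a DuckDBStore fixture):

```python
def query_variables_toy(rows, concepts, study_ids=None, limit=100):
    """Mimic query_variables: case-insensitive concept match, OR-ed,
    optional study_id restriction, capped at limit rows."""
    if not concepts:
        return []
    wanted = {c.lower() for c in concepts}
    out = [
        r for r in rows
        if r["concept"].lower() in wanted
        and (study_ids is None or r["studyId"] in study_ids)
    ]
    return out[:limit]

rows = [
    {"concept": "Blood Pressure", "studyId": "phs000007", "phvId": "phv00000001.v1.p1"},
    {"concept": "Blood Pressure", "studyId": "phs000200", "phvId": "phv00000002.v1.p1"},
    {"concept": "Height", "studyId": "phs000007", "phvId": "phv00000003.v1.p1"},
]

# Concept match is case-insensitive and OR-ed across concepts.
assert len(query_variables_toy(rows, ["blood pressure"])) == 2
# An empty concept list short-circuits to no results.
assert query_variables_toy(rows, []) == []
# study_ids restricts results to the allowed studies.
assert len(query_variables_toy(rows, ["Blood Pressure"], study_ids={"phs000007"})) == 1
```
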
    ),
    total_studies=0,
)
t_pipeline = time.monotonic()
The import statement from .store import DuckDBStore is placed inside the function at line 283, creating a local import. This is inconsistent with the standard pattern used throughout the codebase where imports are at the module level (e.g., line 30 in api.py imports from .index, and index.py line 13 imports from .store at module level). Move this import to the top of the file with the other imports for consistency and better performance.
Fixed — moved to top-level import in afda40e.
        studies=[],
        timing=SearchTiming(
            lookup_ms=0,
            pipeline_ms=elapsed_ms,
            total_ms=elapsed_ms,
        ),
        total_studies=0,
    )
t_pipeline = time.monotonic()

-# Deterministic study lookup
+# Deterministic lookup — branch on intent
index = get_index()
intent = query_model.intent
studies: list[dict] = []
When intent == "auto", the code falls through to the else branch (line 287-288) and executes a study query. According to the PRD and EXTRACT_PROMPT, when intent is "auto", a clarification message should be set and no search should be performed. The API should return empty results when intent is "auto" and let the message guide the user to clarify. Consider adding an explicit check: elif intent == "study": at line 287 to make this behavior clearer and avoid unintended study lookups for ambiguous queries.
Fixed — added explicit if intent == "auto": pass branch in afda40e.
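The dispatch after this fix can be summarized in a minimal sketch. Function and field names here are assumptions mirroring the PR's description, not the merged code:

```python
def dispatch(intent, run_study_lookup, run_variable_lookup):
    """Three-way intent branch: 'auto' returns only a clarification
    message and runs no lookup; the other intents run exactly one."""
    if intent == "auto":
        return {
            "message": "Did you mean studies or variables?",
            "studies": [],
            "variables": [],
        }
    if intent == "variable":
        return {"message": "", "studies": [], "variables": run_variable_lookup()}
    # Default: study intent.
    return {"message": "", "studies": run_study_lookup(), "variables": []}

# "auto" never touches either lookup.
print(dispatch("auto", lambda: ["s1"], lambda: ["v1"]))
```
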
-<ResultCount>
-  Found {response.totalStudies}{" "}
-  {response.totalStudies === 1 ? "study" : "studies"} in {totalSeconds}s
-</ResultCount>
-
-{response.studies.length > 0 && (
-  <StudyTable>
-    <Table size="small">
-      <TableHead>
-        <TableRow>
-          <TableCell>Title</TableCell>
-          <TableCell>dbGaP Id</TableCell>
-          <TableCell>Platform</TableCell>
-          <TableCell>Focus / Disease</TableCell>
-          <TableCell>Data Type</TableCell>
-          <TableCell>Participants</TableCell>
-          <TableCell>Study Design</TableCell>
-          <TableCell>Consent Code</TableCell>
-        </TableRow>
-      </TableHead>
-      <TableBody>
-        {response.studies.map((study, i) => (
-          <TableRow key={i}>
-            <TableCell>{study.title}</TableCell>
-            <TableCell>{study.dbGapId}</TableCell>
-            <TableCell>{study.platforms.join(", ")}</TableCell>
-            <TableCell>{study.focus}</TableCell>
-            <TableCell>{study.dataTypes.join(", ")}</TableCell>
-            <TableCell>
-              {study.participantCount != null
-                ? study.participantCount.toLocaleString()
-                : "—"}
-            </TableCell>
-            <TableCell>{study.studyDesigns.join(", ")}</TableCell>
-            <TableCell>{study.consentCodes.join(", ")}</TableCell>
-          </TableRow>
-        ))}
-      </TableBody>
-    </Table>
-  </StudyTable>
+{response.intent === "variable" ? (
+  <>
+    <ResultCount>
+      Found {response.totalVariables}{" "}
+      {response.totalVariables === 1 ? "variable" : "variables"} in{" "}
+      {totalSeconds}s
+    </ResultCount>
+    {response.variables.length > 0 && (
+      <StudyTable>
+        <Table size="small">
+          <TableHead>
+            <TableRow>
+              <TableCell>Concept</TableCell>
+              <TableCell>Variable Name</TableCell>
+              <TableCell>Description</TableCell>
+              <TableCell>Study</TableCell>
+              <TableCell>dbGaP</TableCell>
+            </TableRow>
+          </TableHead>
+          <TableBody>
+            {response.variables.map((v, i) => (
+              <TableRow key={i}>
+                <TableCell>{v.concept}</TableCell>
+                <TableCell>{v.variableName}</TableCell>
+                <TableCell>{v.description}</TableCell>
+                <TableCell>
+                  <a
+                    href={v.studyUrl}
+                    rel="noopener noreferrer"
+                    target="_blank"
+                  >
+                    {v.studyTitle || v.studyId}
+                  </a>
+                </TableCell>
+                <TableCell>
+                  <a
+                    href={v.dbGapUrl}
+                    rel="noopener noreferrer"
+                    target="_blank"
+                  >
+                    {v.phvId}
+                  </a>
+                </TableCell>
+              </TableRow>
+            ))}
+          </TableBody>
+        </Table>
+      </StudyTable>
+    )}
+  </>
+) : (
+  <>
+    <ResultCount>
+      Found {response.totalStudies}{" "}
+      {response.totalStudies === 1 ? "study" : "studies"} in{" "}
+      {totalSeconds}s
+    </ResultCount>
+    {response.studies.length > 0 && (
+      <StudyTable>
+        <Table size="small">
+          <TableHead>
+            <TableRow>
+              <TableCell>Title</TableCell>
+              <TableCell>dbGaP Id</TableCell>
+              <TableCell>Platform</TableCell>
+              <TableCell>Focus / Disease</TableCell>
+              <TableCell>Data Type</TableCell>
+              <TableCell>Participants</TableCell>
+              <TableCell>Study Design</TableCell>
+              <TableCell>Consent Code</TableCell>
+            </TableRow>
+          </TableHead>
+          <TableBody>
+            {response.studies.map((study, i) => (
+              <TableRow key={i}>
+                <TableCell>{study.title}</TableCell>
+                <TableCell>{study.dbGapId}</TableCell>
+                <TableCell>{study.platforms.join(", ")}</TableCell>
+                <TableCell>{study.focus}</TableCell>
+                <TableCell>{study.dataTypes.join(", ")}</TableCell>
+                <TableCell>
+                  {study.participantCount != null
+                    ? study.participantCount.toLocaleString()
+                    : "—"}
+                </TableCell>
+                <TableCell>{study.studyDesigns.join(", ")}</TableCell>
+                <TableCell>{study.consentCodes.join(", ")}</TableCell>
+              </TableRow>
+            ))}
+          </TableBody>
+        </Table>
+      </StudyTable>
+    )}
When intent === "auto", the frontend will fall into the else branch and display study results (line 367-410). However, per the PRD, when intent is "auto", the system should ask for clarification and not display any results. The message will be shown (line 297-299), but the user will also see an empty study results table which could be confusing. Consider adding an explicit check for "auto" intent to only show the clarification message without any results table.
Fixed — frontend now returns null for auto intent, showing only the clarification message. afda40e.
        lookup_ms=0,
        pipeline_ms=elapsed_ms,
        total_ms=elapsed_ms,
    ),
    total_studies=0,
)
The variable query filters out excluded measurements (if m.facet == Facet.MEASUREMENT and not m.exclude), but it doesn't apply exclusion logic to the results. If a user searches for "variables measuring blood pressure but not systolic blood pressure", the current implementation will include all blood pressure variables without subtracting the excluded ones. Consider adding exclusion support for variable queries similar to how study queries handle it, or document that exclusions are not supported for variable intent.
Won't fix — variable-level exclusion is explicitly out of scope per the PRD. Study-level filters are now applied (75e798c), but concept-level exclusion is a follow-up.
- Handle intent="auto" explicitly: return clarification message only, don't run a study query for ambiguous intent
- Frontend: skip results table for auto intent, show only the message
- Guard against empty phv_id in URL builder

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.
backend/concept_search/api.py
Outdated
]
if concepts and isinstance(index.store, DuckDBStore):
    variable_rows = index.store.query_variables(concepts)
Variable-intent search currently ignores all non-measurement mentions (and all exclude logic). For example, a query like "variables measuring blood pressure on AnVIL" will not filter variables to studies on AnVIL, because the variable path only extracts measurement concepts and bypasses the include/exclude constraints. Consider applying the same include/exclude constraints used for study lookup (e.g., restrict variable results to studies matching include/exclude, and then filter variables by those study_ids, or extend the SQL to join through study_facet_values).
Suggested change:
-]
-if concepts and isinstance(index.store, DuckDBStore):
-    variable_rows = index.store.query_variables(concepts)
+]
+# If there are include/exclude constraints, determine the studies
+# that satisfy them so we can restrict variable results accordingly.
+allowed_study_ids: set[str] | None = None
+if (include or exclude) and isinstance(index.store, DuckDBStore):
+    matching_studies = index.query_studies(include, exclude or None)
+    allowed_study_ids = {
+        s["study_id"]
+        for s in matching_studies
+        if isinstance(s, dict) and "study_id" in s
+    } or None
+if concepts and isinstance(index.store, DuckDBStore):
+    variable_rows = index.store.query_variables(concepts)
+    # Apply study-level include/exclude constraints, if any, by
+    # filtering variables to those belonging to the allowed studies.
+    if allowed_study_ids is not None:
+        variable_rows = [
+            r
+            for r in variable_rows
+            if isinstance(r, dict)
+            and r.get("study_id") in allowed_study_ids
+        ]
Fixed — query_variables() now accepts optional study_ids set, and the API filters through study-level constraints first. 75e798c.
@@ -321,6 +334,8 @@ def _load_measurement_concepts(self, llm_dir: Path) -> None:
         for concept in concepts
     ]
     self.store.load_facet_values_batch(rows)
+    self.store.load_variables_batch(variable_rows)
+    logger.info("Loaded %d variable rows", len(variable_rows))
_load_measurement_concepts accumulates all variable rows in a single in-memory list before bulk-inserting. With the stated scale (tens/hundreds of thousands of rows), this can significantly increase peak memory during index builds and make local/dev rebuilds fragile. Consider inserting in chunks (e.g., flush every N rows or per-file) and clearing the buffer to bound memory usage.
Won't fix — ~450k rows at ~200 bytes each is ~90MB peak, well within normal build memory. Chunking would add complexity for negligible gain.
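For reference, the chunked flush the reviewer describes would look roughly like this (an illustrative helper with assumed names; the PR intentionally keeps the single-batch load):

```python
def load_in_chunks(rows, flush, chunk_size=10_000):
    """Buffer rows and call flush(batch) every chunk_size rows,
    bounding peak memory to roughly one chunk. Returns rows flushed."""
    buf = []
    flushed = 0
    for row in rows:
        buf.append(row)
        if len(buf) >= chunk_size:
            flush(buf)
            flushed += len(buf)
            buf = []  # start a fresh buffer; flushed batch stays intact
    if buf:  # flush the final partial chunk
        flush(buf)
        flushed += len(buf)
    return flushed
```
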
    " phv_id VARCHAR,"
    " study_id VARCHAR,"
    " table_name VARCHAR,"
    " variable_name VARCHAR"
The variables table is created without a PRIMARY KEY/UNIQUE constraint on phv_id (the PRD and dbGaP IDs imply this should be unique). Without a constraint, duplicate variable rows can be inserted across rebuilds or from input duplication, which can inflate result counts and slow queries. Consider making phv_id the PRIMARY KEY (or at least adding a UNIQUE constraint/index) and deciding a dedupe strategy on conflicts.
Suggested change:
-    " variable_name VARCHAR"
+    " variable_name VARCHAR,"
+    " PRIMARY KEY (phv_id)"
Won't fix — phv_id is not guaranteed unique (same variable can map to multiple concepts). DuckDB's COPY CSV also doesn't enforce PK constraints efficiently. The index on concept_lower is sufficient for query performance.
sql = (
    "SELECT v.concept, v.dataset_id, v.description, v.phv_id,"
    " v.study_id, v.table_name, v.variable_name,"
    " json_extract_string(s.raw_json, '$.title') AS study_title "
    "FROM variables v "
    "LEFT JOIN studies s ON v.study_id = s.db_gap_id "
    f"WHERE v.concept_lower IN ({placeholders}) "  # noqa: S608
    "ORDER BY v.concept, v.study_id, v.variable_name "
    f"LIMIT {limit}"
)
query_variables interpolates LIMIT directly into the SQL string (f"LIMIT {limit}") instead of binding it as a parameter or strictly validating/clamping it. Even if limit isn't currently user-controlled, this pattern is easy to accidentally expose later and triggers SQL-injection linters. Consider validating limit (min/max) and using a bound parameter for LIMIT if DuckDB supports it (or at least ensure limit is an int after clamping before string formatting).
Duplicate of earlier comment — limit is always a Python int, never user input.
fi
@lsof -ti :8000 | xargs kill 2>/dev/null || true
The stop target unconditionally kills any process listening on port 8000 (lsof -ti :8000 | xargs kill), which can terminate unrelated services a developer might be running. Consider limiting this to the PID in $(PID_FILE) (or only falling back to port-kill when no PID file exists), and/or printing a warning before killing by port.
Suggested change:
-fi
-@lsof -ti :8000 | xargs kill 2>/dev/null || true
+else \
+	echo "Warning: No PID file found. Killing all processes listening on port 8000 (this may stop unrelated services)..."; \
+	lsof -ti :8000 | xargs kill 2>/dev/null || true; \
+fi
Won't fix — the lsof fallback is intentional for stale PID files. Port 8000 is exclusively used by this server in dev.
When a variable query includes non-measurement facets (e.g., platform, data type), filter variable results to only studies matching those constraints. "blood pressure variables on AnVIL" now returns only variables from AnVIL studies.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 11 comments.
backend/concept_search/api.py
Outdated
elif intent == "variable":
    # Collect resolved measurement concepts for variable lookup
    concepts = [
        v
        for m in query_model.mentions
        if m.facet == Facet.MEASUREMENT and not m.exclude
        for v in m.values
    ]
    if concepts and isinstance(index.store, DuckDBStore):
        # Apply study-level constraints (platform, dataType, etc.)
        non_measurement = [
            c for c in include if c[0] != Facet.MEASUREMENT
        ]
        study_ids: set[str] | None = None
        if non_measurement or exclude:
            matched = index.query_studies(
                non_measurement or include, exclude or None
            )
            study_ids = {s.get("dbGapId", "") for s in matched}
        variable_rows = index.store.query_variables(
            concepts, study_ids=study_ids
        )
When intent == "variable" and there are no measurement concepts in the query (e.g., only platform/dataType filters), the variable search returns zero results silently. Consider adding a message to inform the user that variable search requires at least one measurement concept to be specified.
def _build_dbgap_variable_url(study_id: str, phv_id: str) -> str:
    """Build a dbGaP variable page URL.

    Args:
        study_id: Study accession (e.g., "phs000007").
        phv_id: Variable PHV ID (e.g., "phv00481718.v2.p1").

    Returns:
        Full URL to the variable page on dbGaP.
    """
    if not phv_id:
        return ""
    # Extract the numeric portion from the phv ID (strip "phv" prefix and version)
    phv_num = phv_id.split(".")[0].replace("phv", "")
    return (
        f"https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/variable.cgi"
        f"?study_id={study_id}&phv={phv_num}"
    )
The PHV ID parsing assumes a specific format (phv{number}.v{version}.p{participant_set}). If a PHV ID doesn't contain a period (malformed or different format), the split will still work but may produce unexpected results. According to the PRD, the study accession should include version information, but the function receives just the study_id without version. This may produce broken dbGaP links. Consider validating the expected format or handling edge cases where the format differs.
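To make the edge cases concrete, here is what the split/replace parsing yields for a few inputs (a standalone reproduction of the logic for illustration; note that a dot-less ID does not raise, it just passes through):

```python
def extract_phv_num(phv_id: str) -> str:
    """Reproduce the PR's PHV-number extraction verbatim."""
    return phv_id.split(".")[0].replace("phv", "")

assert extract_phv_num("phv00481718.v2.p1") == "00481718"  # expected format
assert extract_phv_num("phv00481718") == "00481718"        # no version: still works
assert extract_phv_num("PHV123") == "PHV123"               # case mismatch passes through
assert extract_phv_num("") == ""                           # empty in, empty out
```

So the failure mode is not an exception but a silently non-numeric `phv` query parameter, which is why format validation (or at least an `isdigit()` check) matters.
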
{response.variables.map((v, i) => (
  <TableRow key={i}>
Using array index i as the React key for table rows is an anti-pattern. If the list order changes or items are added/removed, React may have difficulty reconciling the component tree efficiently. Consider using a unique identifier like v.phvId which should be unique per variable, or a combination of fields if phvId might not be unique.
	kill $$(cat $(PID_FILE)) 2>/dev/null && echo "Server stopped." || true; \
	rm -f $(PID_FILE); \
fi
@lsof -ti :8000 | xargs kill 2>/dev/null || true
The lsof command may not be available on all systems (e.g., some minimal Linux distributions or Windows). While this is a developer-oriented Makefile, consider adding a check for the command's existence or using a more portable alternative like ss or netstat with a fallback.
Suggested change:
-@lsof -ti :8000 | xargs kill 2>/dev/null || true
+@if command -v lsof >/dev/null 2>&1; then \
+	lsof -ti :8000 | xargs kill 2>/dev/null || true; \
+fi
def query_variables(
    self,
    concepts: list[str],
    limit: int = 100,
    study_ids: set[str] | None = None,
) -> list[dict]:
    """Return variables matching any of the given concept names.

    Args:
        concepts: Canonical concept names to match (OR-ed).
        limit: Maximum number of variable rows to return.
        study_ids: If provided, restrict results to these studies.
The variable query has a hardcoded LIMIT of 100 results. For common concepts like "Blood Pressure" or "Body Mass Index", there could be thousands of matching variables across hundreds of studies. The PR description mentions ~450k total variable rows. Without pagination support or a way for users to request more results, many relevant variables may be hidden. Consider implementing pagination or at minimum documenting this limitation in the API response or user-facing message.
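If pagination is added later, a clamped LIMIT/OFFSET clause builder is one way to keep the string interpolation safe while exposing page controls (a hypothetical sketch, not in the PR):

```python
def paginate_clause(limit: int, offset: int, max_limit: int = 500) -> str:
    """Build a safe 'LIMIT n OFFSET m' fragment: both values are coerced
    to int and clamped, so no caller-supplied string ever reaches the SQL."""
    limit = max(1, min(int(limit), max_limit))  # clamp 1..max_limit
    offset = max(0, int(offset))                # negative offsets become 0
    return f"LIMIT {limit} OFFSET {offset}"

print(paginate_clause(100, 0))       # normal page
print(paginate_clause(10_000, -5))   # hostile values clamped
```

The clause would be appended to the existing query string, with the rest of the WHERE parameters still bound as placeholders.
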
# PRD: Consent Code Compatibility Model

**Status:** DRAFT — Feb 2026
**Related:** PRD-concept-search.md, PRD-search-api.md

---

## Problem

The search API treats consent codes as opaque strings with exact-match semantics. This fails for the three most common researcher consent queries:
The PRD-consent-code-compatibility.md document describes a consent code compatibility feature that is not mentioned in the PR description or issue #194. This PRD appears to be for a separate feature (consent code hierarchy and eligibility matching) and should either be removed from this PR or explicitly mentioned in the PR description if it's intentionally being added as documentation for future work.
for var in table.get("variables", []):
    concept = var.get("concept")
    if concept:
        concept_studies[concept].add(study_id)
        study_concepts[study_id].add(concept)
        variable_rows.append((
            concept,
            concept.lower(),
            dataset_id,
            var.get("description", ""),
            var.get("id", ""),
            study_id,
            table_name,
            var.get("name", ""),
        ))
Variables without a concept (concept is None or empty) are completely skipped during indexing. This means variables that may be important but weren't assigned a concept during LLM classification will be invisible to variable search. Consider logging how many variables are being skipped, or storing them with a sentinel concept value like "Uncategorized" so they can still be discovered.
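The suggested skip-counting could be sketched as follows (names are illustrative, not the PR's exact loader):

```python
import logging

logger = logging.getLogger(__name__)

def collect_variable_rows(tables):
    """Collect (concept, concept_lower, phv_id) rows, counting and
    logging variables skipped because they carry no concept."""
    rows, skipped = [], 0
    for table in tables:
        for var in table.get("variables", []):
            concept = var.get("concept")
            if not concept:
                skipped += 1  # would otherwise vanish silently
                continue
            rows.append((concept, concept.lower(), var.get("id", "")))
    if skipped:
        logger.warning("Skipped %d variables with no concept", skipped)
    return rows
```
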
</TableBody>
</Table>
</StudyTable>
{response.intent === "auto" ? null : response.intent === "variable" ? (
When intent === "auto", nothing is rendered after the clarification message (no result count or table). While this may be intentional to encourage the user to refine their query, it might be helpful to display extracted mentions or provide more guidance. Consider showing what was understood from the query to help the user refine it.
Suggested change:
-{response.intent === "auto" ? null : response.intent === "variable" ? (
+{response.intent === "auto" ? (
+  <SectionRow>
+    <SectionLabel>Next steps:</SectionLabel>
+    {mentions.length > 0 || Object.keys(mappings).length > 0
+      ? "Review the extracted mentions above and refine your question to run a search."
+      : "Refine your question with more detail so we can run a search."}
+  </SectionRow>
+) : response.intent === "variable" ? (
self._conn.execute(
    "CREATE TABLE variables ("
    " concept VARCHAR,"
    " concept_lower VARCHAR,"
    " dataset_id VARCHAR,"
    " description VARCHAR,"
    " phv_id VARCHAR,"
    " study_id VARCHAR,"
    " table_name VARCHAR,"
    " variable_name VARCHAR"
    ")"
)
The PRD specifies phv_id as the PRIMARY KEY for the variables table, but the implementation does not define a primary key or any uniqueness constraint. In the real llm-concepts data, the same PHV ID may appear multiple times if a variable appears in multiple tables/datasets within the same study. Consider either adding the PK constraint if phv_id is truly unique, or updating the PRD to remove the "PK" designation and clarify that duplicate phv_ids are expected.
def query_variables(
    self,
    concepts: list[str],
    limit: int = 100,
    study_ids: set[str] | None = None,
) -> list[dict]:
    """Return variables matching any of the given concept names.

    Args:
        concepts: Canonical concept names to match (OR-ed).
        limit: Maximum number of variable rows to return.
        study_ids: If provided, restrict results to these studies.

    Returns:
        Variable dicts with study title joined from the studies table.
    """
    if not concepts:
        return []
    concept_ph = ", ".join("?" for _ in concepts)
    params: list[str] = [c.lower() for c in concepts]
    where = f"v.concept_lower IN ({concept_ph})"
    if study_ids is not None:
        study_ph = ", ".join("?" for _ in study_ids)
        where += f" AND v.study_id IN ({study_ph})"
        params.extend(study_ids)
    sql = (
        "SELECT v.concept, v.dataset_id, v.description, v.phv_id,"
        " v.study_id, v.table_name, v.variable_name,"
        " json_extract_string(s.raw_json, '$.title') AS study_title "
        "FROM variables v "
        "LEFT JOIN studies s ON v.study_id = s.db_gap_id "
        f"WHERE {where} "  # noqa: S608
        "ORDER BY v.concept, v.study_id, v.variable_name "
        f"LIMIT {limit}"
    )
    rows = self._conn.execute(sql, params).fetchall()
    cols = [
        "concept", "datasetId", "description", "phvId",
        "studyId", "tableName", "variableName", "studyTitle",
    ]
    return [dict(zip(cols, row)) for row in rows]
The new query_variables method in DuckDBStore lacks unit tests. The existing test suite comprehensively tests query_studies with various scenarios (AND, OR, exclude, case sensitivity, etc.), but there are no corresponding tests for variable queries. This new functionality should have similar test coverage to ensure it correctly handles concept matching, study filtering, and edge cases.
When intent is "variable" but no measurement is mentioned (e.g., "what variables are measured in AnVIL WGS studies"), return all variables from the matching studies instead of returning empty results.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- variables table stores ~450k variable rows from llm-concepts, queried by resolved measurement concepts
- docs/PRD-variable-search.md
- Closes #194
Test plan
- make evals (30/30 passing)
- make db-reload to rebuild index with variables table

🤖 Generated with Claude Code