feat: Add NCBI Datasets API Integration (53 Tools) #40
benjibromberg wants to merge 1 commit into mims-harvard:dev_mar6
Conversation
|
Tried to do my best here, but let me know if I missed anything that I can fix! |
|
Looks good to me, thank you! I will test these tools on my side and merge them ASAP! |
|
Hi @benjibromberg, sorry for the delay in merging. Could you please remove the tools that fail in testing for now? We can open another pull request to figure out why some tools are not working and how to fix them. Thank you! |
Integrates pubmed-search-mcp v0.1.29 as local tools, following PR mims-harvard#40 pattern.

Streamlined from 35+ to 25 core tools by removing duplicate functionality:
- unified_search replaces multiple source-specific search tools
- Removed redundant merge/expand tools (auto-executed internally)

Categories (25 tools total):
- Core Search (1): unified_search - main entry with auto multi-source
- Query Intelligence (2): parse_pico, generate_queries
- Article Exploration (5): fetch, related, citing, references, metrics
- Fulltext (2): get_fulltext, get_text_mined_terms
- NCBI Extended (7): gene/compound/clinvar search and details
- Citation Network (2): build_tree, suggest_tree
- Export (1): export_citations
- Vision Search (2): analyze_figure, reverse_image_search
- OpenURL/Institutional (3): configure, get_link, list_presets

Files added:
- src/tooluniverse/pubmed_search_tool.py (25 tool classes)
- src/tooluniverse/data/pubmed_search_tools.json (tool configs)
- src/tooluniverse/tools/pubmed_*.py (25 SDK files)
- examples/pubmed_search_example.py
- tests/unit/test_pubmed_search_tool.py

Install: pip install tooluniverse[pubmed]
|
Hi @gasvn, thanks for the review. I've addressed the test failures in this updated push.
Before: 39 test failures
The 39 failures were actual bugs in our code, now fixed:
The remaining 11 xfail tests are upstream NCBI server issues, not our code:
These are marked A few options for how to handle these:
Happy to go whichever direction you prefer. The SARS2 protein tools would be easy to re-add later if NCBI fixes their endpoint. Also in this update:
|
|
Thank you @benjibromberg for the updates! I would suggest we go with option 3 and remove any tools that fail, because agents expect all tools to work. We can extend the tool set once the server issues have been fixed. Can you please retarget the pull request to the dev_mar6 branch, so I can do some testing and revision before merging to the main branch? Thank you! |
Spec-driven integration with NCBI Datasets API v2 providing 53 tools for gene, genome, taxonomy, and virus data retrieval.

Key components:
- 53 tool classes in ncbi_datasets_tool.py (auto-generated from OpenAPI spec)
- JSON configs in ncbi_datasets_tools.json
- OpenAPI-driven maintenance scripts (update, generate, validate, sync)
- Parametrized test suite (all tests pass, 0 xfail)
- Known test failures documented in KNOWN_TEST_FAILURES.md

3 tools removed due to upstream NCBI API server errors (500/504):
- NCBIDatasetsVirusTaxonSars2ProteinTool
- NCBIDatasetsVirusTaxonSars2ProteinTableTool
- NCBIDatasetsVirusTaxonGenomeTableTool

Test fixes in this commit:
- Empty response handling for all tools (200 OK with no body)
- Required parameter validation for all path-parameter tools
- Coverage test prefix mapping (filter_ -> filter.)
- Validator edge case for 0-param endpoints
|
@gasvn sounds good! I've made both changes:
Pushing the updated commit now. |
|
Also noting that the ruff lint failures from the earlier CI run are fixed in this push:
|
Summary
This PR adds comprehensive integration with the NCBI Datasets API v2, providing
53 tools for accessing gene data, genome assemblies, taxonomy information,
virus genomes, organelle data, and biosample records. The integration uses an
OpenAPI-driven approach where the OpenAPI specification serves as the single
source of truth for all parameters, endpoints, and validation.
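To make the "single source of truth" idea concrete, here is a minimal sketch of deriving a tool configuration from one OpenAPI operation and checking parameter coverage against it. The inline `SPEC` dictionary, the `tool_config` and `missing_parameters` helpers, and the config layout are all illustrative assumptions, not the PR's actual generator or JSON schema.

```python
from typing import Any

# Tiny inline stand-in for a parsed OpenAPI document (illustrative only).
SPEC: dict[str, Any] = {
    "paths": {
        "/gene/id/{gene_ids}": {
            "get": {
                "operationId": "gene_by_id",
                "parameters": [
                    {"name": "gene_ids", "in": "path", "required": True,
                     "schema": {"type": "array"}},
                    {"name": "page_size", "in": "query",
                     "schema": {"type": "integer"}},
                ],
            }
        }
    }
}


def tool_config(spec: dict[str, Any], path: str, method: str = "get") -> dict[str, Any]:
    """Build a JSON-style tool config from a single OpenAPI operation."""
    op = spec["paths"][path][method]
    params = op.get("parameters", [])
    return {
        "name": op["operationId"],
        "endpoint": path,
        # Parameter name -> declared type, defaulting to "string" if unspecified.
        "parameters": {p["name"]: p.get("schema", {}).get("type", "string")
                       for p in params},
        "required": [p["name"] for p in params if p.get("required", False)],
    }


def missing_parameters(spec: dict[str, Any], path: str,
                       config: dict[str, Any], method: str = "get") -> list[str]:
    """Return spec parameters that the config does not expose (coverage gap)."""
    declared = {p["name"] for p in spec["paths"][path][method].get("parameters", [])}
    return sorted(declared - set(config["parameters"]))


config = tool_config(SPEC, "/gene/id/{gene_ids}")
print(config["required"], missing_parameters(SPEC, "/gene/id/{gene_ids}", config))
```

Because the config is computed from the spec rather than written by hand, a coverage check like `missing_parameters` only reports gaps when the two drift apart, for example after a spec update that has not been regenerated.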
Features
- 53 Tool Classes: Comprehensive coverage of NCBI Datasets API endpoints
- 100% OpenAPI Parameter Coverage: All parameters from the OpenAPI specification are implemented in each tool
- Automated Generation System: Configuration files and test definitions are auto-generated from the OpenAPI specification, ensuring easy updates when NCBI releases new API versions
- Clean Test Suite: All 53 tools pass with 0 failures and 0 xfail
  for NCBI tests) due to rate limiting to respect NCBI API limits
- 3 Tools Removed (upstream NCBI server errors): The following tools were removed because their NCBI endpoints return 500/504 errors. They are documented in KNOWN_TEST_FAILURES.md for easy re-addition when NCBI fixes them:
  - NCBIDatasetsVirusTaxonSars2ProteinTool
  - NCBIDatasetsVirusTaxonSars2ProteinTableTool
  - NCBIDatasetsVirusTaxonGenomeTableTool
- Complete Documentation:
  - docs/tools/ncbi_datasets_tools.rst (774 lines)
  - src/tooluniverse/data/specs/ncbi/README.md
  - examples/ncbi_datasets_tool_example.py
Technical Implementation
OpenAPI-Driven Architecture
The integration follows a specification-driven approach:
- OpenAPI Specification: src/tooluniverse/data/specs/ncbi/openapi3.docs.yaml
- Auto-Generation Scripts:
  - scripts/discover_and_generate.py: Discovers endpoints and generates tool classes
  - scripts/update_ncbi_json_from_openapi.py: Updates JSON configurations from spec
  - scripts/sync_openapi_spec.py: Downloads latest spec from NCBI (--check mode for CI to detect when upstream spec changes)
- Tool Classes: All 53 tools in src/tooluniverse/ncbi_datasets_tool.py
  - @register_tool decorators (auto-discovered by dynamic registry)
  - required parameter validation)
- Function Wrappers: 53 wrapper functions in src/tooluniverse/tools/build_tools.py from JSON config
Test Results
Test Fixes Applied:
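The fix categories named in the commit message (empty 200 responses, required-parameter validation, and the `filter_` to `filter.` prefix mapping) can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `to_api_name`, `check_required`, and `parse_body` are hypothetical helper names.

```python
import json


def to_api_name(python_name: str) -> str:
    """Map a Python-safe argument name to the dotted API name (filter_ -> filter.)."""
    prefix = "filter_"
    if python_name.startswith(prefix):
        return "filter." + python_name[len(prefix):]
    return python_name


def check_required(arguments: dict, required: list[str]) -> list[str]:
    """Return the required parameters that are absent or set to None."""
    return [name for name in required if arguments.get(name) is None]


def parse_body(status: int, body: str) -> dict:
    """Treat a 200 response with an empty body as an empty result, not an error."""
    if status == 200 and not body.strip():
        return {}
    return json.loads(body)


# A tool call with one path parameter missing and an empty 200 response.
print(check_required({"taxon": "9606"}, ["taxon", "accession"]))
print(parse_body(200, ""))
```

Validating required parameters before the request is sent keeps a missing path parameter from producing a malformed URL, and the empty-body guard avoids a JSON decode error on a successful but empty response.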
Upstream Compatibility
- upstream/main
- @register_tool/auto_discover_tools dynamic registry
Files Changed
Core Implementation
- src/tooluniverse/ncbi_datasets_tool.py: 53 tool classes
- src/tooluniverse/data/ncbi_datasets_tools.json: Tool configurations
- src/tooluniverse/tools/ncbi_datasets_*.py: 53 wrapper functions
- src/tooluniverse/default_config.py: Added ncbi_datasets config entry
- src/tooluniverse/scripts/openapi_validator.py: OpenAPI spec parser and validator
Specifications and Maintenance
- src/tooluniverse/data/specs/ncbi/: Complete directory
  - openapi3.docs.yaml: Official OpenAPI specification
  - README.md: Maintenance guide for contributors
  - KNOWN_TEST_FAILURES.md: Documentation of removed tools and upstream issues
  - maintain_ncbi_tools.py: Master maintenance orchestrator
  - scripts/discover_and_generate.py: Auto-generation script
  - scripts/update_ncbi_json_from_openapi.py: JSON config updater
  - scripts/sync_openapi_spec.py: Spec sync tool with CI check mode
Tests
- tests/tools/test_ncbi_datasets_tool.py: Comprehensive test suite
Documentation
- docs/tools/ncbi_datasets_tools.rst: Complete user documentation (774 lines)
- examples/ncbi_datasets_tool_example.py: 13 working examples
API Key Support
Tools support optional API key authentication via the NCBI_API_KEY environment variable for enhanced rate limits (10 rps vs 5 rps default). See docs/tools/ncbi_datasets_tools.rst for setup instructions.
Usage Example
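A minimal sketch of the API key behavior, not the PR's own example: it assumes the key is read from NCBI_API_KEY and sent in an `api-key` request header (the header name is an assumption here), and it derives a per-request delay from the 5 rps and 10 rps limits quoted above. `build_headers` and `min_interval` are hypothetical helper names.

```python
import os
from typing import Mapping, Optional


def build_headers(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return request headers, adding the API key only when one is configured."""
    env = os.environ if env is None else env
    headers = {"Accept": "application/json"}
    key = env.get("NCBI_API_KEY")
    if key:
        headers["api-key"] = key  # header name assumed for this sketch
    return headers


def min_interval(env: Optional[Mapping[str, str]] = None) -> float:
    """Seconds to wait between requests: 10 rps with a key, 5 rps without."""
    env = os.environ if env is None else env
    requests_per_second = 10 if env.get("NCBI_API_KEY") else 5
    return 1.0 / requests_per_second


# Passing an explicit mapping keeps the example independent of the real environment.
print(build_headers({}), min_interval({}))
print(min_interval({"NCBI_API_KEY": "example-key"}))
```

Taking the environment as an optional argument keeps the helpers easy to test without modifying `os.environ`.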
Maintenance
Future updates to the NCBI Datasets API can be easily integrated:
- python src/tooluniverse/data/specs/ncbi/scripts/sync_openapi_spec.py (fetch latest spec)
- python src/tooluniverse/data/specs/ncbi/maintain_ncbi_tools.py (regenerate and validate)
- git diff, run tests, then commit
The --check flag on sync_openapi_spec.py can be used in CI to alert when the upstream spec has changed without failing the build.
See src/tooluniverse/data/specs/ncbi/README.md for detailed maintenance instructions.
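The "alert without failing the build" behavior of a check mode can be sketched as below. This is a hedged illustration of the idea only: the real sync_openapi_spec.py is not reproduced here, and `spec_changed` and `main` (which takes the two spec contents as bytes instead of downloading anything) are hypothetical.

```python
import argparse
import hashlib


def spec_changed(local: bytes, upstream: bytes) -> bool:
    """Compare two spec files by SHA-256 digest."""
    return hashlib.sha256(local).hexdigest() != hashlib.sha256(upstream).hexdigest()


def main(argv: list[str], local: bytes, upstream: bytes) -> int:
    """Return a process exit code; --check reports drift but never fails."""
    parser = argparse.ArgumentParser(description="Sketch of a spec sync check")
    parser.add_argument("--check", action="store_true",
                        help="only report whether the upstream spec differs")
    args = parser.parse_args(argv)

    changed = spec_changed(local, upstream)
    if args.check:
        # Exit code stays 0 so CI surfaces the warning without going red.
        print("WARNING: upstream spec differs from local copy" if changed
              else "OK: local spec matches upstream")
        return 0

    # Without --check a real script would write the new spec; this sketch only reports.
    print("would update local spec" if changed else "nothing to do")
    return 0


exit_code = main(["--check"], b"openapi: 3.0.0", b"openapi: 3.0.1")
print(exit_code)
```

Returning 0 in check mode is the design choice that lets a scheduled CI job flag spec drift as a notice while leaving the decision to regenerate tools to a maintainer.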
Checklist