
feat: Add NCBI Datasets API Integration (53 Tools)#40

Open
benjibromberg wants to merge 1 commit into mims-harvard:dev_mar6 from benjibromberg:main


@benjibromberg
Contributor

@benjibromberg benjibromberg commented Nov 12, 2025

Summary

This PR adds comprehensive integration with the NCBI Datasets API v2, providing
53 tools for accessing gene data, genome assemblies, taxonomy information,
virus genomes, organelle data, and biosample records. The integration uses an
OpenAPI-driven approach where the OpenAPI specification serves as the single
source of truth for all parameters, endpoints, and validation.

Features

  • 53 Tool Classes: Comprehensive coverage of NCBI Datasets API endpoints

    • 18 Gene tools (by ID, symbol, accession, taxon, locus tag)
    • 15 Genome tools (assembly reports, annotations, sequences)
    • 8 Taxonomy tools (metadata, lineage, related IDs)
    • 6 Virus tools (genome summaries, annotations, dataset reports)
    • 2 Organelle tools
    • 1 Biosample tool
    • 3 Download summary tools (preview before download)
    • 1 Utility tool (version information)
  • 100% OpenAPI Parameter Coverage: Each tool implements every parameter
    that the OpenAPI specification defines for its endpoint

  • Automated Generation System: Configuration files and test definitions
    are auto-generated from the OpenAPI specification, ensuring easy updates
    when NCBI releases new API versions

  • Clean Test Suite: All 53 tools pass with 0 failures and 0 xfail

    • All test data dynamically generated from OpenAPI specification
    • No hardcoded test values
    • Rate-limited at 0.25s between tests to respect NCBI API limits
    • Note: rate limiting extends the test suite runtime (about 4 minutes
      for the NCBI tests)
  • 3 Tools Removed (upstream NCBI server errors): The following tools were
    removed because their NCBI endpoints return 500/504 errors. They are
    documented in KNOWN_TEST_FAILURES.md for easy re-addition when NCBI
    fixes them:

    • NCBIDatasetsVirusTaxonSars2ProteinTool
    • NCBIDatasetsVirusTaxonSars2ProteinTableTool
    • NCBIDatasetsVirusTaxonGenomeTableTool
  • Complete Documentation:

    • User guide: docs/tools/ncbi_datasets_tools.rst (774 lines)
    • Maintenance guide: src/tooluniverse/data/specs/ncbi/README.md
    • 13 working examples in examples/ncbi_datasets_tool_example.py

Technical Implementation

OpenAPI-Driven Architecture

The integration follows a specification-driven approach:

  1. OpenAPI Specification: src/tooluniverse/data/specs/ncbi/openapi3.docs.yaml

    • Official NCBI Datasets API v2 specification
    • Single source of truth for all endpoints and parameters
    • Bundled locally (never fetched at runtime)
  2. Auto-Generation Scripts:

    • scripts/discover_and_generate.py: Discovers endpoints and generates
      tool classes
    • scripts/update_ncbi_json_from_openapi.py: Updates JSON configurations
      from spec
    • scripts/sync_openapi_spec.py: Downloads latest spec from NCBI
      (--check mode for CI to detect when upstream spec changes)
  3. Tool Classes: All 53 tools in src/tooluniverse/ncbi_datasets_tool.py

    • Registered via @register_tool decorators (auto-discovered by dynamic registry)
    • Support flexible parameters (single value or array)
    • Include comprehensive error handling (empty response handling,
      required parameter validation)
    • Support API key authentication for enhanced rate limits
  4. Function Wrappers: 53 wrapper functions in src/tooluniverse/tools/

    • Auto-generated by build_tools.py from JSON config
    • Minimal docstrings linking to official NCBI documentation
    • Full type hints and validation
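
To illustrate the "single value or array" parameter support mentioned above, here is a minimal sketch of how one value or a list of values might be normalized into a single path segment. The function name `join_path_values` is hypothetical, and joining with commas is an assumption about how the endpoints accept multiple values; neither is taken from ncbi_datasets_tool.py.

```python
from typing import Iterable, Union

Scalar = Union[int, str]


def join_path_values(value: Union[Scalar, Iterable[Scalar]]) -> str:
    """Return a comma-separated path segment from one value or many.

    Hypothetical helper: the comma separator is an assumption.
    """
    # A bare string or int is treated as a single value, not iterated.
    if isinstance(value, (str, int)):
        values = [value]
    else:
        values = list(value)
    if not values:
        raise ValueError("at least one value is required")
    return ",".join(str(v) for v in values)
```

With a helper like this, a caller could pass either `59067` or `[59067, 1956]` for the same argument and get a usable path segment in both cases.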

Test Results

All 53 tools pass
- 0 failures
- 0 xfail

Test Fixes Applied:

  • Empty response handling for all tools (200 OK with no body)
  • Required parameter validation for all path-parameter tools
  • Coverage test prefix mapping (filter_ to filter.)
  • Validator edge case for 0-param endpoints
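
The prefix mapping fix can be pictured with a short sketch. It assumes that tool arguments use Python-safe names such as `filter_x` while the specification names the same parameter `filter.x`. The function name `to_spec_name` and the example parameter names in the usage note are hypothetical.

```python
def to_spec_name(python_name: str, prefix: str = "filter") -> str:
    """Map a Python-safe name like 'filter_x' to the spec name 'filter.x'.

    Hypothetical helper: only the first underscore after the prefix is
    replaced, so underscores inside the field name are preserved.
    """
    head = prefix + "_"
    if python_name.startswith(head):
        return prefix + "." + python_name[len(head):]
    return python_name
```

For example, `to_spec_name("filter_example_field")` yields `"filter.example_field"`, while a name without the prefix is returned unchanged.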

Upstream Compatibility

  • Rebased onto latest upstream/main
  • Uses upstream's @register_tool / auto_discover_tools dynamic registry
  • No conflicts with existing tools

Files Changed

Core Implementation

  • src/tooluniverse/ncbi_datasets_tool.py: 53 tool classes
  • src/tooluniverse/data/ncbi_datasets_tools.json: Tool configurations
  • src/tooluniverse/tools/ncbi_datasets_*.py: 53 wrapper functions
  • src/tooluniverse/default_config.py: Added ncbi_datasets config entry
  • src/tooluniverse/scripts/openapi_validator.py: OpenAPI spec parser and validator

Specifications and Maintenance

  • src/tooluniverse/data/specs/ncbi/: Complete directory
    • openapi3.docs.yaml: Official OpenAPI specification
    • README.md: Maintenance guide for contributors
    • KNOWN_TEST_FAILURES.md: Documentation of removed tools and upstream issues
    • maintain_ncbi_tools.py: Master maintenance orchestrator
    • scripts/discover_and_generate.py: Auto-generation script
    • scripts/update_ncbi_json_from_openapi.py: JSON config updater
    • scripts/sync_openapi_spec.py: Spec sync tool with CI check mode

Tests

  • tests/tools/test_ncbi_datasets_tool.py: Comprehensive test suite
    • All tests passing (0 failures, 0 xfail)
    • All test data from OpenAPI specification
    • Rate limiting to respect NCBI API limits
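
As a rough sketch of the test-suite rate limiting, the class below enforces a minimum gap between successive calls. The 0.25s default matches the interval stated in this PR; the class name `Throttle` and the injectable clock and sleep functions are assumptions made so the example can be exercised without real delays.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(
        self,
        min_interval: float = 0.25,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time the last call was released

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay
```

In a pytest suite, one shared instance could be called from an autouse fixture so that every test waits before issuing its request.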

Documentation

  • docs/tools/ncbi_datasets_tools.rst: Complete user documentation (774 lines)
  • examples/ncbi_datasets_tool_example.py: 13 working examples

API Key Support

Tools support optional API key authentication via the NCBI_API_KEY
environment variable, which raises the rate limit from the default 5
requests per second to 10. See docs/tools/ncbi_datasets_tools.rst for
setup instructions.
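
A minimal sketch of how the optional key might be picked up and applied. The function names are hypothetical, and sending the key in an `api-key` request header is an assumption that should be checked against the NCBI documentation; the two rates are the ones quoted above.

```python
import os
from typing import Dict, Mapping, Optional


def build_headers(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return request headers, adding the API key only when one is set."""
    env = os.environ if env is None else env
    headers = {"Accept": "application/json"}
    key = env.get("NCBI_API_KEY", "").strip()
    if key:
        headers["api-key"] = key  # assumed header name
    return headers


def min_interval(headers: Mapping[str, str]) -> float:
    """Seconds between requests: 10 rps with a key, 5 rps without."""
    return 1 / 10 if "api-key" in headers else 1 / 5
```

Passing the environment as a parameter keeps the lookup easy to test without modifying `os.environ`.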

Usage Example

from tooluniverse import ToolUniverse

tu = ToolUniverse()
tu.load_tools()

# Get gene metadata by ID
result = tu.run({
    "name": "ncbi_datasets_gene_by_id",
    "arguments": {"gene_ids": 59067}
})

# Get taxonomy metadata
result = tu.run({
    "name": "ncbi_datasets_taxonomy_metadata",
    "arguments": {"taxons": "9606"}  # Human
})

Maintenance

Future updates to the NCBI Datasets API can be easily integrated:

  1. python src/tooluniverse/data/specs/ncbi/scripts/sync_openapi_spec.py (fetch latest spec)
  2. python src/tooluniverse/data/specs/ncbi/maintain_ncbi_tools.py (regenerate and validate)
  3. Review changes with git diff, run tests, then commit

The --check flag on sync_openapi_spec.py can be used in CI to alert
when the upstream spec has changed without failing the build.
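
The check mode could work along these lines. This is a sketch only: it assumes the comparison is a content hash of the bundled spec against a freshly downloaded copy, it omits the download step, and the function names are hypothetical rather than taken from sync_openapi_spec.py.

```python
import hashlib
from typing import Tuple


def spec_digest(text: str) -> str:
    """Return the SHA-256 hex digest of the spec text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def check_spec(local_text: str, upstream_text: str) -> Tuple[int, str]:
    """Compare two spec texts and return (exit_code, message).

    The exit code is 0 in both cases, so a changed upstream spec produces
    a warning in the CI log without failing the build.
    """
    if spec_digest(local_text) == spec_digest(upstream_text):
        return 0, "spec is up to date"
    return 0, "WARNING: upstream spec differs from bundled copy"
```

A CI job could print the message and surface the warning as an annotation, leaving the decision to regenerate the tools to a maintainer.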

See src/tooluniverse/data/specs/ncbi/README.md for detailed maintenance
instructions.

Checklist

  • All 53 tools implemented and tested
  • 100% OpenAPI parameter coverage
  • Clean test suite (0 failures, 0 xfail)
  • User documentation complete
  • Examples provided and tested
  • Maintenance guide included
  • Upstream compatibility verified (rebased onto latest main)
  • Code passes ruff linting
  • Removed tools with upstream API failures (documented for re-addition)

@benjibromberg
Contributor Author

Tried to do my best here, but let me know if I missed anything that I can fix!

@gasvn
Member

gasvn commented Nov 18, 2025

Looks good to me, thank you! I will test these tools on my side and merge them ASAP!

@gasvn
Member

gasvn commented Dec 8, 2025

Hi @benjibromberg, sorry for the delay in merging. Could you please remove the tools that fail in testing for now? We can open another pull request to figure out why some tools are not working and how to fix them. Thank you!

u9401066 added a commit to u9401066/ToolUniverse that referenced this pull request Jan 22, 2026
Integrates pubmed-search-mcp v0.1.29 as local tools, following PR mims-harvard#40 pattern.

Streamlined from 35+ to 25 core tools by removing duplicate functionality:
- unified_search replaces multiple source-specific search tools
- Removed redundant merge/expand tools (auto-executed internally)

Categories (25 tools total):
- Core Search (1): unified_search - main entry with auto multi-source
- Query Intelligence (2): parse_pico, generate_queries
- Article Exploration (5): fetch, related, citing, references, metrics
- Fulltext (2): get_fulltext, get_text_mined_terms
- NCBI Extended (7): gene/compound/clinvar search and details
- Citation Network (2): build_tree, suggest_tree
- Export (1): export_citations
- Vision Search (2): analyze_figure, reverse_image_search
- OpenURL/Institutional (3): configure, get_link, list_presets

Files added:
- src/tooluniverse/pubmed_search_tool.py (25 tool classes)
- src/tooluniverse/data/pubmed_search_tools.json (tool configs)
- src/tooluniverse/tools/pubmed_*.py (25 SDK files)
- examples/pubmed_search_example.py
- tests/unit/test_pubmed_search_tool.py

Install: pip install tooluniverse[pubmed]
@benjibromberg
Contributor Author

Hi @gasvn, thanks for the review. I've addressed the test failures in this updated push.

Before: 39 test failures
After: 723 passed, 0 failures (11 xfail for upstream NCBI API issues)

The 39 failures were actual bugs in our code, now fixed:

  1. Empty response handling: All tools now gracefully handle 200 OK with empty body. This is defensive coding, not broken endpoints; they all return real data for valid inputs.
  2. Required parameter validation: All path-parameter tools now raise clear ValueError messages instead of crashing on None.
  3. Test infrastructure fixes: Coverage test prefix mapping and a validator edge case.
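
The first two fixes can be sketched as follows. The helper names and the error message wording are hypothetical, and the real tools may structure this differently; the sketch only shows the two behaviors described: a ValueError for a missing required parameter, and an empty result for 200 OK with no body.

```python
import json
from typing import Any, Dict, Optional


def require_param(arguments: Dict[str, Any], name: str) -> Any:
    """Return arguments[name], raising ValueError if missing or None."""
    value = arguments.get(name)
    if value is None:
        raise ValueError(f"Missing required parameter: {name}")
    return value


def parse_body(status: int, body: Optional[str]) -> Dict[str, Any]:
    """Parse a JSON body, treating 200 OK with no content as an empty result."""
    if status != 200:
        raise RuntimeError(f"HTTP {status}")
    if not (body or "").strip():
        return {}  # valid response, nothing to parse
    return json.loads(body)
```

Validating before the URL is built means a `None` never reaches string formatting, which is where a crash would otherwise occur.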

The remaining 11 xfail tests are upstream NCBI server issues, not our code:

  • 9 SARS2 protein filter tests: NCBI returns 500/504 when using most optional filter params on /virus/taxon/sars2/protein/{proteins}. Basic queries work fine; it's the filter combinations that trigger server errors.
  • 2 table_fields tests: The table_fields query param is rejected by virus table endpoints despite being in the OpenAPI spec.

These are marked xfail(strict=False) so they'll auto-detect when NCBI fixes them.

A few options for how to handle these:

  1. Keep all 56 tools as-is. The SARS2 protein tools work for core lookups; only certain filter combos hit NCBI bugs. The xfail markers keep CI green.
  2. Remove the 2 SARS2 protein tools (NCBIDatasetsVirusTaxonSars2ProteinTool + its table variant). These account for all 11 xfails. The other 54 tools pass 100% with zero xfails.
  3. Remove all tools with any xfail. Drops to 54 tools, completely clean test suite.

Happy to go whichever direction you prefer. The SARS2 protein tools would be easy to re-add later if NCBI fixes their endpoint.

Also in this update:

  • Rebased onto latest upstream/main
  • Added sync_openapi_spec.py with --check mode for CI spec-freshness alerts
  • Documented known upstream failures in KNOWN_TEST_FAILURES.md

@gasvn
Member

gasvn commented Mar 6, 2026

Thank you @benjibromberg for the updates! I would suggest we go with option 3 and remove any tools that fail, because agents expect all tools to work well. We can extend the tool set once the server issues have been fixed. Could you please retarget the pull request to the dev_mar6 branch so I can do some testing and revision before merging into main? Thank you!

@benjibromberg benjibromberg changed the base branch from main to dev_mar6 March 6, 2026 21:43
Spec-driven integration with NCBI Datasets API v2 providing 53 tools
for gene, genome, taxonomy, and virus data retrieval.

Key components:
- 53 tool classes in ncbi_datasets_tool.py (auto-generated from OpenAPI spec)
- JSON configs in ncbi_datasets_tools.json
- OpenAPI-driven maintenance scripts (update, generate, validate, sync)
- Parametrized test suite (all tests pass, 0 xfail)
- Known test failures documented in KNOWN_TEST_FAILURES.md

3 tools removed due to upstream NCBI API server errors (500/504):
- NCBIDatasetsVirusTaxonSars2ProteinTool
- NCBIDatasetsVirusTaxonSars2ProteinTableTool
- NCBIDatasetsVirusTaxonGenomeTableTool

Test fixes in this commit:
- Empty response handling for all tools (200 OK with no body)
- Required parameter validation for all path-parameter tools
- Coverage test prefix mapping (filter_ -> filter.)
- Validator edge case for 0-param endpoints
@benjibromberg
Contributor Author

@gasvn sounds good! I've made both changes:

  1. Removed the 3 tools with upstream failures, bringing us to 53 tools with a completely clean test suite (0 xfail, 0 failures). The removed tools are documented in KNOWN_TEST_FAILURES.md for easy re-addition once NCBI fixes their endpoints:

    • NCBIDatasetsVirusTaxonSars2ProteinTool
    • NCBIDatasetsVirusTaxonSars2ProteinTableTool
    • NCBIDatasetsVirusTaxonGenomeTableTool
  2. Retargeted the PR to dev_mar6 for your testing and revision.

Pushing the updated commit now.

@benjibromberg benjibromberg changed the title from "feat: Add NCBI Datasets API Integration (56 Tools)" to "feat: Add NCBI Datasets API Integration (53 Tools)" Mar 6, 2026
@benjibromberg
Contributor Author

Also noting that the ruff lint failures from the earlier CI run are fixed in this push:

  • Fixed unused variable in update_ncbi_json_from_openapi.py
  • Removed duplicate "ncbi_datasets" key in default_config.py
  • Regenerated wrapper functions via build_tools.py to fix parameter ordering (required params before optional)
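
The ordering matters because Python rejects a function signature in which a parameter without a default follows one with a default. A minimal sketch of ordering parameters before rendering a signature is shown below; the function name and the dictionary shape (`name`, `required`) are assumptions, not the actual build_tools.py logic.

```python
from typing import Any, Dict, List


def ordered_signature(params: List[Dict[str, Any]]) -> str:
    """Render a parameter list with required parameters first."""
    # sorted() is stable, so relative order within each group is kept.
    ordered = sorted(params, key=lambda p: not p.get("required", False))
    parts = [
        p["name"] if p.get("required") else f"{p['name']}=None"
        for p in ordered
    ]
    return ", ".join(parts)
```

Given an optional parameter listed ahead of a required one in the config, the rendered signature still places the required parameter first.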


2 participants