@lillythomas
Collaborator

Summary

Add an MCP-compatible tool for semantic search over STAC collections using LanceDB and Ollama embeddings, enabling RAG-powered dataset discovery for the EIE agent.

What it does

  • Performs vector similarity search over STAC collection metadata using natural language queries (e.g., "NO2 air quality", "sea surface temperature")
  • Checks spatial and temporal overlap between the user's requested extent and each collection's coverage
  • Returns matched collection IDs with spatial_overlap and temporal_overlap flags to help filter relevant datasets
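
To make the similarity step concrete, here is a minimal sketch of cosine-similarity ranking in plain Python. It is not the tool's implementation (the PR delegates this search to LanceDB); the function names, the toy two-dimensional vectors, and the collection IDs are all hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_collections(
    query_vec: Sequence[float],
    index: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k (collection_id, score) pairs, best match first."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy index: real embeddings would have hundreds of dimensions.
toy_index = {
    "toy-air-quality": [1.0, 0.0],
    "toy-sea-surface": [0.0, 1.0],
    "toy-mixed": [1.0, 1.0],
}
top_matches = rank_collections([1.0, 0.0], toy_index, top_k=2)
```

In a real vector database the ranking is done by an index rather than a full scan, but the score being ordered on is the same quantity.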

How it works

  1. Added CollectionsRagTool class inheriting from BaseTool[CollectionsRagInputSchema, CollectionsRagOutputSchema]
  2. Implemented async _arun() method
    • Generates query embeddings via Ollama API (nomic-embed-text model)
    • Searches LanceDB vector index using cosine similarity
    • Computes spatial overlap between user bbox and collection extents
    • Computes temporal overlap between user datetime range and collection intervals
  3. Added configuration via CollectionsRagToolConfig
    • db_path: Path to LanceDB database (from COLLECTIONS_RAG_DB_PATH env var)
    • ollama_url: Ollama API URL (from OLLAMA_URL env var)
    • embedding_model: Model name (default: nomic-embed-text)
    • timeout: HTTP timeout (default: 60s)
  4. Added helper functions for bbox and temporal interval overlap checking
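
The overlap helpers mentioned in step 4 could look roughly like the sketch below. This is an assumption about their shape, not the code in the PR: the function names are hypothetical, bounding boxes are assumed to be `[west, south, east, north]`, open-ended interval endpoints are represented as `None` (or `".."` in the string form), and boxes that cross the antimeridian are not handled.

```python
from datetime import datetime, timezone
from typing import Optional, Sequence, Tuple

Interval = Tuple[Optional[datetime], Optional[datetime]]


def bbox_overlaps(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if two [west, south, east, north] boxes intersect (edges count)."""
    a_w, a_s, a_e, a_n = a
    b_w, b_s, b_e, b_n = b
    return a_w <= b_e and b_w <= a_e and a_s <= b_n and b_s <= a_n


def parse_endpoint(value: Optional[str]) -> Optional[datetime]:
    """Parse one ISO 8601 endpoint; None, "" and ".." mean open-ended."""
    if value is None or value in ("", ".."):
        return None
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assume UTC when unspecified
    return parsed


def parse_range(value: str) -> Interval:
    """Parse "start/end"; a single instant becomes a zero-length interval."""
    start, sep, end = value.partition("/")
    if not sep:
        instant = parse_endpoint(start)
        return instant, instant
    return parse_endpoint(start), parse_endpoint(end)


def interval_overlaps(user: Interval, collection: Interval) -> bool:
    """True if two closed intervals intersect; None endpoints are unbounded."""
    u_start, u_end = user
    c_start, c_end = collection
    user_starts_in_time = u_start is None or c_end is None or u_start <= c_end
    coll_starts_in_time = c_start is None or u_end is None or c_start <= u_end
    return user_starts_in_time and coll_starts_in_time
```

Both checks use the same idea: two ranges intersect exactly when each one starts no later than the other ends, applied per axis for the bbox and once for time.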

Files changed

  • akd_ext/tools/collections_rag.py: New tool implementation
  • akd_ext/tools/__init__.py: Added exports
  • pyproject.toml: Added lancedb dependency

Testing

import asyncio
from akd_ext.tools import CollectionsRagTool
from akd_ext.tools.collections_rag import CollectionsRagInputSchema

async def test():
    tool = CollectionsRagTool()
    result = await tool.arun(CollectionsRagInputSchema(
        query="NO2 air quality",
        bbox=[-124.41, 32.53, -114.13, 42.01],
        datetime="2021-10-01/2021-12-31"
    ))
    print("Collections:", result.collections)
    for m in result.matches:
        print(f"  - {m.id}: spatial={m.spatial_overlap}, temporal={m.temporal_overlap}")

asyncio.run(test())

Output:

Collections: ['omi-no2-2d', 'no2-monthly-diff', 'no2-monthly']
  - omi-no2-2d: spatial=True, temporal=False
  - no2-monthly-diff: spatial=True, temporal=True
  - no2-monthly: spatial=True, temporal=True

Environment variables required:

  • COLLECTIONS_RAG_DB_PATH — Path to LanceDB with collection embeddings
  • OLLAMA_URL — Ollama embeddings API URL (default: http://localhost:11434)
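
As an illustration of how these variables might map onto the configuration fields listed earlier, here is a small sketch using a standard-library dataclass. The class name `RagConfigSketch` and its `validate()` method are hypothetical; the actual `CollectionsRagToolConfig` in the PR may be defined differently (for example, on top of a settings library).

```python
import os
from dataclasses import dataclass, field


@dataclass
class RagConfigSketch:
    """Hypothetical stand-in for CollectionsRagToolConfig (illustration only)."""

    # No default is documented for the database path, so fall back to "".
    db_path: str = field(
        default_factory=lambda: os.environ.get("COLLECTIONS_RAG_DB_PATH", "")
    )
    # Defaults to the local Ollama URL given in the PR description.
    ollama_url: str = field(
        default_factory=lambda: os.environ.get("OLLAMA_URL", "http://localhost:11434")
    )
    embedding_model: str = "nomic-embed-text"
    timeout: float = 60.0  # seconds

    def validate(self) -> None:
        """Fail early if the database path was never configured."""
        if not self.db_path:
            raise ValueError("COLLECTIONS_RAG_DB_PATH is not set")
```

Reading the environment in `default_factory` rather than at class-definition time means values set after import are still picked up when the config object is created.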

Note: The LanceDB index must be pre-built with collection embeddings before using this tool.

@lillythomas
Collaborator Author

Thanks @NISH1001 for your review. I've addressed your current comments. Please let me know if you have further feedback and suggested changes.

@NISH1001
Collaborator

@lillythomas thanks for the changes. I resolved the comments. Looks good. I will do another detailed pass in case I missed anything, and then let you know. Thanks.
