Commit afe4e07

Authored by: Thuraabtech, h3xxit, Salman Mohammed, Copilot, cubic-dev-ai[bot]

Implement Embedding Search Plugin (#60)
* Added embedding search feature for utcp 1.0
* Update plugins/tool_search/embedding/pyproject.toml
* Update plugins/tool_search/embedding/README.md
* To be resolved
* Folder structure to be resolved
* Correct folder placement done
* Description for values accepted by model_name
* Resolved cubic suggestions
* Update plugins/tool_search/in_mem_embeddings/tests/test_in_mem_embeddings_search.py
* No change in core for implementing a plugin

Co-authored-by: Razvan Radulescu <43811028+h3xxit@users.noreply.github.com>
Co-authored-by: Salman Mohammed <thuraabtec@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
1 parent 76b07db commit afe4e07

File tree: 9 files changed, +988 −0 lines changed

---

**Lines changed: 19 additions & 0 deletions** (new file)

# UTCP Embedding Search Plugin

This plugin registers the embedding-based semantic search strategy with UTCP 1.0 via entry points.

## Installation

```bash
pip install utcp-embedding-search
```

Optionally, for high-quality embeddings:

```bash
pip install "utcp-in-mem-embeddings[embedding]"
```

## How it works

When installed, this package exposes an entry point under `utcp.plugins` so the UTCP core can auto-discover and register the `in_mem_embeddings` strategy.
---

**Lines changed: 39 additions & 0 deletions** (new file)

# UTCP In-Memory Embeddings Search Plugin

This plugin registers the in-memory embedding-based semantic search strategy with UTCP 1.0 via entry points.

## Installation

```bash
pip install utcp-in-mem-embeddings
```

Optionally, for high-quality embeddings:

```bash
pip install "utcp-in-mem-embeddings[embedding]"
```

Or install the required dependencies directly:

```bash
pip install "sentence-transformers>=2.2.0" "torch>=1.9.0"
```

## Why are sentence-transformers and torch needed?

While the plugin works without these packages (using a simple character frequency-based fallback), installing them provides significant benefits:

- **Enhanced Semantic Understanding**: The `sentence-transformers` package provides pre-trained models that convert text into high-quality vector embeddings, capturing the semantic meaning of text rather than just keywords.

- **Better Search Results**: With these packages installed, the search can understand conceptual similarity between queries and tools, even when they don't share exact keywords.

- **Performance**: The default model (all-MiniLM-L6-v2) offers a good balance between quality and performance for semantic search applications.

- **Fallback Mechanism**: Without these packages, the plugin automatically falls back to a simpler text similarity method, which works but with reduced accuracy.

## How it works

When installed, this package exposes an entry point under `utcp.plugins` so the UTCP core can auto-discover and register the `in_mem_embeddings` strategy.

The embeddings are cached in memory for improved performance during repeated searches.
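To make the fallback mechanism concrete, here is a standalone NumPy sketch of a character-frequency embedding in the spirit of the plugin's fallback. It illustrates why accuracy drops without real sentence embeddings: any similarity between two vectors comes from shared letters, not shared meaning.

```python
import numpy as np


def char_frequency_embedding(text: str, dim: int = 384) -> np.ndarray:
    """Crude fallback embedding: fold character codes into a fixed-size vector."""
    vec = np.zeros(dim)
    for i, ch in enumerate(text.lower()):
        vec[i % dim] += ord(ch) / 1000.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec  # unit-normalize non-empty text


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return 0.0 if na == 0 or nb == 0 else float(np.dot(a, b) / (na * nb))


# Identical strings score 1.0, but the encoding knows nothing about meaning:
# "fetch weather" vs "get forecast" would overlap only through shared letters.
a = char_frequency_embedding("fetch weather")
b = char_frequency_embedding("fetch weather")
```

A transformer model maps paraphrases near each other in embedding space; this fallback cannot, which is exactly the accuracy gap the optional dependencies close.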
---

**Lines changed: 38 additions & 0 deletions** (new file)

```toml
[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"

[project]
name = "utcp-in-mem-embeddings"
version = "1.0.0"
authors = [
    { name = "UTCP Contributors" },
]
description = "UTCP plugin providing in-memory embedding-based semantic tool search."
readme = "README.md"
requires-python = ">=3.10"
dependencies = [
    "utcp>=1.0",
]
classifiers = [
    "Development Status :: 4 - Beta",
    "Intended Audience :: Developers",
    "Programming Language :: Python :: 3",
    "Operating System :: OS Independent",
]
license = "MPL-2.0"

[project.optional-dependencies]
embedding = [
    "sentence-transformers>=2.2.0",
    "torch>=1.9.0",
]

[project.urls]
Homepage = "https://utcp.io"
Source = "https://github.com/universal-tool-calling-protocol/python-utcp"
Issues = "https://github.com/universal-tool-calling-protocol/python-utcp/issues"

[project.entry-points."utcp.plugins"]
in_mem_embeddings = "utcp_in_mem_embeddings:register"
```
---

**Lines changed: 7 additions & 0 deletions** (new file)

```python
from utcp.plugins.discovery import register_tool_search_strategy
from utcp_in_mem_embeddings.in_mem_embeddings_search import InMemEmbeddingsSearchStrategyConfigSerializer


def register():
    """Entry point function to register the in-memory embeddings search strategy."""
    register_tool_search_strategy("in_mem_embeddings", InMemEmbeddingsSearchStrategyConfigSerializer())
```
---

**Lines changed: 241 additions & 0 deletions** (new file)

```python
"""In-memory embedding-based semantic search strategy for UTCP tools.

This module provides a semantic search implementation that uses sentence embeddings
to find tools based on meaning similarity rather than just keyword matching.
Embeddings are cached in memory for improved performance.
"""

import asyncio
import logging
from typing import List, Tuple, Optional, Literal, Dict, Any
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from pydantic import Field, PrivateAttr

from utcp.interfaces.tool_search_strategy import ToolSearchStrategy
from utcp.data.tool import Tool
from utcp.interfaces.concurrent_tool_repository import ConcurrentToolRepository
from utcp.interfaces.serializer import Serializer

logger = logging.getLogger(__name__)


class InMemEmbeddingsSearchStrategy(ToolSearchStrategy):
    """In-memory semantic search strategy using sentence embeddings.

    This strategy converts tool descriptions and search queries into numerical
    embeddings and finds the most semantically similar tools using cosine similarity.
    Embeddings are cached in memory for improved performance during repeated searches.
    """

    tool_search_strategy_type: Literal["in_mem_embeddings"] = "in_mem_embeddings"

    # Configuration parameters
    model_name: str = Field(
        default="all-MiniLM-L6-v2",
        description="Sentence transformer model name to use for embeddings. "
                    "Accepts any model from the Hugging Face sentence-transformers library. "
                    "Popular options: 'all-MiniLM-L6-v2' (fast, good quality), "
                    "'all-mpnet-base-v2' (slower, higher quality), "
                    "'paraphrase-MiniLM-L6-v2' (paraphrase detection). "
                    "See https://huggingface.co/sentence-transformers for the full list."
    )
    similarity_threshold: float = Field(default=0.3, description="Minimum similarity score to consider a match")
    max_workers: int = Field(default=4, description="Maximum number of worker threads for embedding generation")
    cache_embeddings: bool = Field(default=True, description="Whether to cache tool embeddings for performance")

    # Private attributes
    _embedding_model: Optional[Any] = PrivateAttr(default=None)
    _tool_embeddings_cache: Dict[str, np.ndarray] = PrivateAttr(default_factory=dict)
    _executor: Optional[ThreadPoolExecutor] = PrivateAttr(default=None)
    _model_loaded: bool = PrivateAttr(default=False)

    def __init__(self, **data):
        super().__init__(**data)
        self._executor = ThreadPoolExecutor(max_workers=self.max_workers)

    async def _ensure_model_loaded(self):
        """Ensure the embedding model is loaded."""
        if self._model_loaded:
            return

        try:
            # Import sentence-transformers here so the plugin works without it installed
            from sentence_transformers import SentenceTransformer

            # Load the model in a worker thread to avoid blocking the event loop
            loop = asyncio.get_running_loop()
            self._embedding_model = await loop.run_in_executor(
                self._executor,
                SentenceTransformer,
                self.model_name
            )
            self._model_loaded = True
            logger.info(f"Loaded embedding model: {self.model_name}")

        except ImportError:
            logger.warning("sentence-transformers not available, falling back to simple text similarity")
            self._embedding_model = None
            self._model_loaded = True
        except Exception as e:
            logger.error(f"Failed to load embedding model: {e}")
            self._embedding_model = None
            self._model_loaded = True

    async def _get_text_embedding(self, text: str) -> np.ndarray:
        """Generate an embedding for the given text."""
        if not text:
            return np.zeros(384)  # Default dimension for all-MiniLM-L6-v2

        if self._embedding_model is None:
            # Fall back to simple text similarity
            return self._simple_text_embedding(text)

        try:
            loop = asyncio.get_running_loop()
            embedding = await loop.run_in_executor(
                self._executor,
                self._embedding_model.encode,
                text
            )
            return embedding
        except Exception as e:
            logger.warning(f"Failed to generate embedding for text: {e}")
            return self._simple_text_embedding(text)

    def _simple_text_embedding(self, text: str) -> np.ndarray:
        """Simple fallback embedding using character frequency.

        Used when sentence-transformers is not available.
        """
        embedding = np.zeros(384)
        text_lower = text.lower()

        # Fold character codes into a fixed-size vector
        for i, char in enumerate(text_lower):
            embedding[i % 384] += ord(char) / 1000.0

        # Normalize to unit length
        norm = np.linalg.norm(embedding)
        if norm > 0:
            embedding = embedding / norm

        return embedding

    async def _get_tool_embedding(self, tool: Tool) -> np.ndarray:
        """Get a cached embedding for a tool, generating it if necessary."""
        if not self.cache_embeddings or tool.name not in self._tool_embeddings_cache:
            # Create a text representation of the tool
            tool_text = f"{tool.name} {tool.description} {' '.join(tool.tags)}"
            embedding = await self._get_text_embedding(tool_text)

            if self.cache_embeddings:
                self._tool_embeddings_cache[tool.name] = embedding

            return embedding

        return self._tool_embeddings_cache[tool.name]

    def _cosine_similarity(self, a: np.ndarray, b: np.ndarray) -> float:
        """Calculate the cosine similarity between two vectors."""
        try:
            dot_product = np.dot(a, b)
            norm_a = np.linalg.norm(a)
            norm_b = np.linalg.norm(b)

            if norm_a == 0 or norm_b == 0:
                return 0.0

            return dot_product / (norm_a * norm_b)
        except Exception as e:
            logger.warning(f"Error calculating cosine similarity: {e}")
            return 0.0

    async def search_tools(
        self,
        tool_repository: ConcurrentToolRepository,
        query: str,
        limit: int = 10,
        any_of_tags_required: Optional[List[str]] = None
    ) -> List[Tool]:
        """Search for tools using semantic similarity.

        Args:
            tool_repository: The tool repository to search within.
            query: The search query string.
            limit: Maximum number of tools to return (0 means no limit).
            any_of_tags_required: Optional list of tags; a tool must carry at least one of them.

        Returns:
            List of Tool objects ranked by semantic similarity.
        """
        if limit < 0:
            raise ValueError("limit must be non-negative")

        # Ensure the embedding model is loaded
        await self._ensure_model_loaded()

        # Get all tools
        tools: List[Tool] = await tool_repository.get_tools()

        # Filter by required tags if specified
        if any_of_tags_required:
            any_of_tags_required = [tag.lower() for tag in any_of_tags_required]
            tools = [
                tool for tool in tools
                if any(tag.lower() in any_of_tags_required for tag in tool.tags)
            ]

        if not tools:
            return []

        # Generate the query embedding
        query_embedding = await self._get_text_embedding(query)

        # Calculate similarity scores for all tools
        tool_scores: List[Tuple[Tool, float]] = []

        for tool in tools:
            try:
                tool_embedding = await self._get_tool_embedding(tool)
                similarity = self._cosine_similarity(query_embedding, tool_embedding)

                if similarity >= self.similarity_threshold:
                    tool_scores.append((tool, similarity))

            except Exception as e:
                logger.warning(f"Error processing tool {tool.name}: {e}")
                continue

        # Sort by similarity score (descending)
        sorted_tools = [
            tool for tool, score in sorted(
                tool_scores,
                key=lambda x: x[1],
                reverse=True
            )
        ]

        # Return up to 'limit' tools
        return sorted_tools[:limit] if limit > 0 else sorted_tools

    async def __aenter__(self):
        """Async context manager entry."""
        await self._ensure_model_loaded()
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        """Async context manager exit."""
        if self._executor:
            self._executor.shutdown(wait=False)


class InMemEmbeddingsSearchStrategyConfigSerializer(Serializer[InMemEmbeddingsSearchStrategy]):
    """Serializer for InMemEmbeddingsSearchStrategy configuration."""

    def to_dict(self, obj: InMemEmbeddingsSearchStrategy) -> dict:
        return obj.model_dump()

    def validate_dict(self, data: dict) -> InMemEmbeddingsSearchStrategy:
        try:
            return InMemEmbeddingsSearchStrategy.model_validate(data)
        except Exception as e:
            raise ValueError(f"Invalid configuration: {e}") from e
```
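To exercise the search flow end to end without installing `utcp` or `sentence-transformers`, the ranking loop of `search_tools` can be reproduced standalone with the same character-frequency fallback. The `Tool` dataclass below is a stand-in for utcp's `Tool` model, and `rank` is a simplified, synchronous sketch of the score-filter-sort pipeline:

```python
import numpy as np
from dataclasses import dataclass, field
from typing import List


@dataclass
class Tool:  # stand-in for utcp.data.tool.Tool (illustration only)
    name: str
    description: str
    tags: List[str] = field(default_factory=list)


def embed(text: str, dim: int = 384) -> np.ndarray:
    """Character-frequency fallback embedding, mirroring _simple_text_embedding."""
    vec = np.zeros(dim)
    for i, ch in enumerate(text.lower()):
        vec[i % dim] += ord(ch) / 1000.0
    n = np.linalg.norm(vec)
    return vec / n if n > 0 else vec


def rank(tools: List[Tool], query: str, limit: int = 10, threshold: float = 0.0) -> List[Tool]:
    """Score every tool against the query, filter by threshold, return best first."""
    q = embed(query)
    scored = []
    for t in tools:
        v = embed(f"{t.name} {t.description} {' '.join(t.tags)}")
        s = float(np.dot(q, v))  # vectors are unit-normalized, so dot == cosine
        if s >= threshold:
            scored.append((t, s))
    scored.sort(key=lambda x: x[1], reverse=True)
    ranked = [t for t, _ in scored]
    return ranked[:limit] if limit > 0 else ranked


tools = [
    Tool("get_weather", "Fetch the current weather forecast", ["weather"]),
    Tool("send_email", "Send an email message", ["email"]),
]
results = rank(tools, "weather forecast", limit=1)
```

The real strategy adds what this sketch omits: async execution in a thread pool, the tag pre-filter, per-tool embedding caching, and the swap-in of a transformer model when one is available.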
