Skip to content

Latest commit

 

History

History
550 lines (417 loc) · 15.4 KB

File metadata and controls

550 lines (417 loc) · 15.4 KB

Embedding Server Deep Dive 🧠

🎯 What is the Embedding Server?

The Embedding Server is a critical component of your LangGraph cluster that converts text into numerical vectors (embeddings). It runs on your rp-node (192.168.1.178:8081) and enables sophisticated AI capabilities like semantic search, document retrieval, and intelligent routing.

Why Do We Need Embeddings?

Embeddings solve a fundamental problem: computers understand numbers, not words. By converting text into vectors, we enable:

  • Semantic similarity: "car" and "automobile" have similar vectors even if the words are different
  • Contextual understanding: "bank" (financial) vs "bank" (river) have different vectors based on context
  • Intelligent routing: Route user questions to appropriate workflows based on meaning, not keywords
  • Memory retrieval: Find relevant past conversations based on semantic similarity

🏗️ Technical Architecture

Server Components

rp-node (192.168.1.178:8081)
├── FastAPI Web Server
├── Sentence-Transformers Models
│   ├── all-MiniLM-L6-v2 (default, 384 dimensions)
│   ├── all-mpnet-base-v2 (optional, 768 dimensions) 
│   └── paraphrase-multilingual-MiniLM-L12-v2 (optional)
├── ARM Optimization Layer
├── Batch Processing Engine
└── Health Monitoring

Hardware Optimization

Why rp-node for Embeddings?

  • ARM Cortex-A76: Efficient for embedding computations
  • 8GB RAM: Sufficient for multiple embedding models
  • Low Power: ARM chips are power-efficient for continuous operation
  • Dedicated: Separates embedding workload from LLM inference

Model Details

# Primary Model (Default)
all-MiniLM-L6-v2:
  - Size: ~23MB
  - Dimensions: 384
  - Speed: ~1000 texts/second on ARM
  - Use: General-purpose embeddings

# Optional High-Quality Model  
all-mpnet-base-v2:
  - Size: ~420MB
  - Dimensions: 768
  - Speed: ~200 texts/second on ARM
  - Use: High-accuracy semantic search

# Optional Multilingual Model
paraphrase-multilingual-MiniLM-L12-v2:
  - Size: ~420MB
  - Languages: 50+ languages
  - Use: Non-English text processing

🔌 API Endpoints

POST /embeddings

Purpose: Generate embeddings for text(s)

# Request
{
  "texts": ["Hello world", "This is a test"],
  "model": "default"
}

# Response
{
  "embeddings": [[0.1, -0.3, 0.8, ...], [0.2, -0.1, 0.9, ...]],
  "model": "default",
  "dimensions": 384,
  "processing_time": 0.045
}

GET /health

Purpose: Check server status

# Response
{
  "status": "healthy",
  "memory_usage_percent": 45.2,
  "cpu_usage_percent": 12.8,
  "available_models": ["default", "multilingual"],
  "architecture": "aarch64"
}

GET /models

Purpose: List available models

# Response
{
  "available_models": ["default", "multilingual"],
  "default_model": "default"
}

🔗 LangGraph Integration

LocalEmbeddings Class

class LocalEmbeddings(Embeddings):
    """Local embeddings from rp-node"""
    
    def __init__(self, embeddings_url: str = "http://192.168.1.178:8081"):
        self.embeddings_url = embeddings_url
    
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Convert multiple documents to embeddings"""
        response = requests.post(
            f"{self.embeddings_url}/embeddings",
            json={"texts": texts, "model": "default"},
            timeout=30
        )
        return response.json()["embeddings"]
    
    def embed_query(self, text: str) -> List[float]:
        """Convert single query to embedding"""
        return self.embed_documents([text])[0]

Usage in LangGraph Workflows

# Initialize embedding service
embeddings = LocalEmbeddings()

# In your workflow nodes
def enhanced_search_node(state: AgentState) -> AgentState:
    user_query = state["messages"][-1].content
    
    # Convert query to embedding
    query_embedding = embeddings.embed_query(user_query)
    
    # Find similar documents in your knowledge base
    similar_docs = vector_search(query_embedding, knowledge_base)
    
    # Enhance LLM prompt with relevant context
    enhanced_prompt = f"""
    User Question: {user_query}
    
    Relevant Context:
    {similar_docs}
    
    Please provide a detailed answer based on the context above.
    """
    
    response = llm.invoke(enhanced_prompt)
    
    return {
        **state,
        "messages": state["messages"] + [AIMessage(content=response)],
        "context": similar_docs,
        "tools_used": state.get("tools_used", []) + ["semantic_search"]
    }

🎯 Use Cases & Examples

1. Document RAG (Retrieval Augmented Generation)

Scenario: User asks about Jetson optimization, system finds relevant docs

# Knowledge Base
docs = [
    "Jetson Orin Nano setup requires nvpmodel -m 0 for maximum performance",
    "HAProxy load balancer runs on cpu-node port 9000 for LLM requests",
    "Redis cache is configured on cpu-node with password langgraph_redis_pass"
]

# User Question
user_question = "How do I optimize Jetson performance?"

# Process
1. Convert all docs to embeddingsStore in vector DB
2. Convert user question to embedding
3. Find most similar doc using cosine similarity
4. Send question + relevant doc to LLM
5. LLM provides contextually accurate answer

Result: Instead of generic advice, LLM gives specific answer about nvpmodel -m 0

2. Semantic Routing in LangGraph

Scenario: Route user requests to appropriate workflow based on meaning

# Workflow Types (with embeddings)
workflows = {
    "coding_help": "Programming questions, code generation, debugging",
    "system_admin": "Server configuration, deployment, monitoring", 
    "research": "Information gathering, web search, data analysis",
    "general_chat": "Casual conversation, greetings, simple questions"
}

# User Input
user_input = "How do I fix this Python error?"

# Process
1. Create embeddings for all workflow descriptions
2. Create embedding for user input
3. Calculate similarity scores
4. Route to highest-scoring workflow (coding_help)

# LangGraph Routing
def route_request(state: AgentState) -> str:
    user_message = state["messages"][-1].content
    
    # Get embedding for user message
    user_embedding = embeddings.embed_query(user_message)
    
    # Calculate similarity to each workflow type
    best_workflow = semantic_router.find_best_match(user_embedding)
    
    return best_workflow  # Returns "coding_help", "research", etc.

3. Conversation Memory Retrieval

Scenario: Remember relevant context from past conversation

# Conversation History (with embeddings stored)
history = [
    "I'm setting up a LangGraph cluster on my local network",
    "My Jetson Orin Nano is running at 192.168.1.177 with Ollama",
    "The cpu-node has 32GB RAM and runs the load balancer"
]

# Current Question  
current_question = "What was the IP address of my Jetson again?"

# Process
1. Convert current question to embedding
2. Compare with embeddings of all past messages
3. Find most similar past messages
4. Include relevant context in LLM prompt

# Result: LLM answers "192.168.1.177" based on conversation history

4. Smart Tool Selection

Scenario: Choose best tool for user request based on semantic understanding

# Available Tools (with embeddings)
tools = {
    "web_search": "Search internet for information, news, current events",
    "web_scrape": "Extract content from specific websites and web pages", 
    "execute_command": "Run shell commands, system operations, file operations",
    "llm_query": "Generate text, answer questions, creative writing"
}

# User Request
request = "Find the latest news about artificial intelligence"

# Process
1. Create embeddings for all tool descriptions
2. Create embedding for user request  
3. Find best matching tool (web_search)
4. Route request to web search workflow

# LangGraph Implementation
def select_tool_node(state: AgentState) -> str:
    user_request = state["messages"][-1].content
    request_embedding = embeddings.embed_query(user_request)
    
    best_tool = tool_selector.find_best_tool(request_embedding)
    
    if best_tool == "web_search":
        return "search_and_respond_node"
    elif best_tool == "web_scrape":
        return "scrape_and_respond_node" 
    else:
        return "standard_response_node"

⚡ Performance & Optimization

ARM Optimization Techniques

class EmbeddingsService:
    def get_embeddings(self, texts: List[str], model_name: str = "default"):
        model = self.models[model_name]
        
        # ARM optimization: Process in smaller batches
        batch_size = 16 if len(texts) > 16 else len(texts)
        
        if len(texts) <= batch_size:
            return model.encode(texts, batch_size=batch_size, show_progress_bar=False)
        
        # Process large requests in chunks for memory efficiency
        all_embeddings = []
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            batch_embeddings = model.encode(batch, batch_size=len(batch), show_progress_bar=False)
            all_embeddings.extend(batch_embeddings)
        
        return np.array(all_embeddings)

Performance Expectations

Model Batch Size Processing Time Memory Usage
all-MiniLM-L6-v2 1 text ~5ms ~50MB
all-MiniLM-L6-v2 16 texts ~25ms ~100MB
all-mpnet-base-v2 1 text ~15ms ~200MB
all-mpnet-base-v2 16 texts ~80ms ~400MB

🔄 HAProxy Integration

Load Balancing Embeddings

# HAProxy Configuration
frontend embeddings_frontend
    bind *:9002
    mode http
    default_backend embeddings_servers

backend embeddings_servers
    mode http
    balance roundrobin
    option httpchk GET /health
    
    server embeddings_primary 192.168.1.178:8081 check inter 30s fall 3 rise 2

Access Methods:

  • Direct: http://192.168.1.178:8081/embeddings
  • Load Balanced: http://192.168.1.81:9002/embeddings

Failover Strategy

If rp-node goes down:

  • HAProxy marks it unhealthy after 3 failed checks (90 seconds)
  • You could add backup embedding servers on other nodes
  • LangGraph falls back to simpler keyword-based routing

🛡️ Service Management

Systemd Service

# Service Configuration
/etc/systemd/system/embeddings-server.service:

[Unit]
Description=LangGraph Embeddings Server
After=network.target

[Service]
Type=simple
User=sanzad
WorkingDirectory=/home/sanzad/embeddings-server
Environment=PATH=/home/sanzad/embeddings-env/bin
ExecStart=/home/sanzad/embeddings-env/bin/python embeddings_server.py
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

Management Commands

# Service Control
sudo systemctl start embeddings-server
sudo systemctl stop embeddings-server
sudo systemctl restart embeddings-server
sudo systemctl status embeddings-server

# Logs
sudo journalctl -u embeddings-server -f
sudo journalctl -u embeddings-server --since "1 hour ago"

# Health Check
curl http://192.168.1.178:8081/health

# Test Embedding
curl -X POST http://192.168.1.178:8081/embeddings \
  -H "Content-Type: application/json" \
  -d '{"texts": ["Hello world"], "model": "default"}'

🔧 Troubleshooting

Common Issues

1. Server Won't Start

# Check logs
sudo journalctl -u embeddings-server -n 50

# Common causes:
- Python environment not activated
- Missing dependencies (pip install -r requirements.txt)
- Port 8081 already in use (netstat -ln | grep 8081)
- Insufficient memory (free -h)

2. Slow Response Times

# Check system resources
htop  # Look for high CPU/memory usage
free -h  # Check available memory

# Optimize:
- Reduce batch sizes in code
- Use lighter model (all-MiniLM-L6-v2)
- Add more RAM to rp-node

3. Connection Errors

# Test connectivity
ping 192.168.1.178
telnet 192.168.1.178 8081

# Check firewall
sudo ufw status
sudo ufw allow 8081/tcp

🚀 Advanced Features

Custom Models

Add your own embedding models:

def load_custom_model(self):
    """Load custom embedding model"""
    try:
        self.models["custom"] = SentenceTransformer('/path/to/custom/model')
        logger.info("✅ Loaded custom model")
    except Exception as e:
        logger.error(f"Failed to load custom model: {e}")

Caching Layer

Add Redis caching for frequent embeddings:

import redis

class CachedEmbeddingsService:
    def __init__(self):
        self.redis_client = redis.Redis(host='192.168.1.81', port=6379)
        
    def get_embeddings_cached(self, texts: List[str]):
        # Check cache first
        cache_key = hash(tuple(texts))
        cached = self.redis_client.get(f"embed:{cache_key}")
        
        if cached:
            return json.loads(cached)
            
        # Generate new embeddings
        embeddings = self.generate_embeddings(texts)
        
        # Cache for 1 hour
        self.redis_client.setex(f"embed:{cache_key}", 3600, json.dumps(embeddings))
        
        return embeddings

Vector Database Integration

Connect to ChromaDB for persistent vector storage:

import chromadb

class VectorStore:
    def __init__(self):
        self.client = chromadb.Client()
        self.collection = self.client.create_collection("langgraph_docs")
        
    def add_documents(self, texts: List[str], embeddings: List[List[float]]):
        self.collection.add(
            embeddings=embeddings,
            documents=texts,
            ids=[f"doc_{i}" for i in range(len(texts))]
        )
        
    def search(self, query_embedding: List[float], n_results: int = 5):
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=n_results
        )
        return results

🎯 Why This Architecture Works

Benefits of Local Embeddings

  1. Privacy: Your text never leaves your network
  2. Zero Cost: No API fees for embedding generation
  3. Low Latency: Direct network access (sub-10ms)
  4. Reliability: No external dependencies
  5. Customization: Add domain-specific models
  6. Scalability: Add more embedding servers as needed

Integration with LangGraph Workflows

The embedding server enables sophisticated AI patterns:

  • RAG Workflows: Retrieve relevant documents for accurate answers
  • Semantic Routing: Intelligent workflow selection based on meaning
  • Memory Systems: Long-term conversation context retrieval
  • Tool Selection: Choose appropriate tools based on semantic understanding
  • Content Organization: Automatically categorize and organize information

🏆 Conclusion

The embedding server is not just "another service" - it's the semantic intelligence layer of your LangGraph cluster. It transforms your AI system from basic pattern matching to true understanding of meaning and context.

Key Takeaways:

  • Runs on ARM-optimized hardware (rp-node) for efficiency
  • Provides multiple embedding models for different use cases
  • Integrates seamlessly with LangGraph through LocalEmbeddings class
  • Enables advanced AI patterns like RAG, semantic routing, and memory
  • Maintains privacy and zero-cost operation
  • Scales with your needs

Without embeddings, your LangGraph would be limited to keyword matching and simple routing. With embeddings, it becomes a truly intelligent system that understands meaning, context, and relationships - just like human cognition! 🧠✨