The Embedding Server is a critical component of your LangGraph cluster that converts text into numerical vectors (embeddings). It runs on your rp-node (192.168.1.178:8081) and enables sophisticated AI capabilities like semantic search, document retrieval, and intelligent routing.
Embeddings solve a fundamental problem: computers understand numbers, not words. By converting text into vectors, we enable:
- Semantic similarity: "car" and "automobile" have similar vectors even if the words are different
- Contextual understanding: "bank" (financial) vs "bank" (river) have different vectors based on context
- Intelligent routing: Route user questions to appropriate workflows based on meaning, not keywords
- Memory retrieval: Find relevant past conversations based on semantic similarity
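Similarity between two embedding vectors is typically measured with cosine similarity. A minimal, self-contained illustration (the three-dimensional vectors below are made up for readability; real embeddings from this server have 384 or 768 dimensions):

import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 = similar meaning."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for real embeddings
car        = [0.90, 0.10, 0.30]
automobile = [0.85, 0.15, 0.35]
banana     = [0.10, 0.90, 0.20]

print(cosine_similarity(car, automobile))  # high (~0.99): similar meaning
print(cosine_similarity(car, banana))      # low (~0.27): unrelated meaning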
rp-node (192.168.1.178:8081)
├── FastAPI Web Server
├── Sentence-Transformers Models
│   ├── all-MiniLM-L6-v2 (default, 384 dimensions)
│   ├── all-mpnet-base-v2 (optional, 768 dimensions)
│   └── paraphrase-multilingual-MiniLM-L12-v2 (optional)
├── ARM Optimization Layer
├── Batch Processing Engine
└── Health Monitoring
Why rp-node for Embeddings?
- ARM Cortex-A76: Efficient for embedding computations
- 8GB RAM: Sufficient for multiple embedding models
- Low Power: ARM chips are power-efficient for continuous operation
- Dedicated: Separates embedding workload from LLM inference
# Primary Model (Default)
all-MiniLM-L6-v2:
- Size: ~23MB
- Dimensions: 384
- Speed: ~1000 texts/second on ARM
- Use: General-purpose embeddings
# Optional High-Quality Model
all-mpnet-base-v2:
- Size: ~420MB
- Dimensions: 768
- Speed: ~200 texts/second on ARM
- Use: High-accuracy semantic search
# Optional Multilingual Model
paraphrase-multilingual-MiniLM-L12-v2:
- Size: ~420MB
- Languages: 50+ languages
- Use: Non-English text processing

Purpose: Generate embeddings for text(s)
# Request
{
  "texts": ["Hello world", "This is a test"],
  "model": "default"
}
# Response
{
  "embeddings": [[0.1, -0.3, 0.8, ...], [0.2, -0.1, 0.9, ...]],
  "model": "default",
  "dimensions": 384,
  "processing_time": 0.045
}

Purpose: Check server status
# Response
{
  "status": "healthy",
  "memory_usage_percent": 45.2,
  "cpu_usage_percent": 12.8,
  "available_models": ["default", "multilingual"],
  "architecture": "aarch64"
}

Purpose: List available models
# Response
{
  "available_models": ["default", "multilingual"],
  "default_model": "default"
}

from typing import List

import requests
from langchain_core.embeddings import Embeddings  # Embeddings base class (langchain_core in recent LangChain releases)

class LocalEmbeddings(Embeddings):
    """Local embeddings from rp-node"""

    def __init__(self, embeddings_url: str = "http://192.168.1.178:8081"):
        self.embeddings_url = embeddings_url

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Convert multiple documents to embeddings"""
        response = requests.post(
            f"{self.embeddings_url}/embeddings",
            json={"texts": texts, "model": "default"},
            timeout=30
        )
        response.raise_for_status()
        return response.json()["embeddings"]

    def embed_query(self, text: str) -> List[float]:
        """Convert single query to embedding"""
        return self.embed_documents([text])[0]

# Initialize embedding service
embeddings = LocalEmbeddings()
# In your workflow nodes
def enhanced_search_node(state: AgentState) -> AgentState:
    user_query = state["messages"][-1].content

    # Convert query to embedding
    query_embedding = embeddings.embed_query(user_query)

    # Find similar documents in your knowledge base
    similar_docs = vector_search(query_embedding, knowledge_base)

    # Enhance LLM prompt with relevant context
    enhanced_prompt = f"""
User Question: {user_query}

Relevant Context:
{similar_docs}

Please provide a detailed answer based on the context above.
"""
    response = llm.invoke(enhanced_prompt)

    return {
        **state,
        "messages": state["messages"] + [AIMessage(content=response)],
        "context": similar_docs,
        "tools_used": state.get("tools_used", []) + ["semantic_search"]
    }

Scenario: User asks about Jetson optimization, system finds relevant docs
# Knowledge Base
docs = [
    "Jetson Orin Nano setup requires nvpmodel -m 0 for maximum performance",
    "HAProxy load balancer runs on cpu-node port 9000 for LLM requests",
    "Redis cache is configured on cpu-node with password langgraph_redis_pass"
]
# User Question
user_question = "How do I optimize Jetson performance?"
# Process
1. Convert all docs to embeddings → Store in vector DB
2. Convert user question to embedding
3. Find most similar doc using cosine similarity
4. Send question + relevant doc to LLM
5. LLM provides contextually accurate answer

Result: Instead of generic advice, LLM gives specific answer about nvpmodel -m 0
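A minimal sketch of steps 1-4, using the LocalEmbeddings client defined earlier and plain NumPy in place of a vector database. It reuses the docs and user_question values above; the helper name rank_by_similarity is illustrative, not part of the server API:

import numpy as np

def rank_by_similarity(query: str, documents: list[str]):
    """Embed the query and documents, then sort documents by cosine similarity."""
    doc_vectors = np.array(embeddings.embed_documents(documents))
    query_vector = np.array(embeddings.embed_query(query))

    # Cosine similarity = dot product of L2-normalized vectors
    doc_vectors /= np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    query_vector /= np.linalg.norm(query_vector)

    scores = doc_vectors @ query_vector
    return sorted(zip(scores.tolist(), documents), reverse=True)

best_score, best_doc = rank_by_similarity(user_question, docs)[0]
# Expected top match: the nvpmodel -m 0 setup document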
Scenario: Route user requests to appropriate workflow based on meaning
# Workflow Types (with embeddings)
workflows = {
    "coding_help": "Programming questions, code generation, debugging",
    "system_admin": "Server configuration, deployment, monitoring",
    "research": "Information gathering, web search, data analysis",
    "general_chat": "Casual conversation, greetings, simple questions"
}
# User Input
user_input = "How do I fix this Python error?"
# Process
1. Create embeddings for all workflow descriptions
2. Create embedding for user input
3. Calculate similarity scores
4. Route to highest-scoring workflow (coding_help)
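The route_request example below relies on a semantic_router helper that is not provided by the embedding server itself. A minimal sketch of such a router (class and method names are illustrative) is shown here; the same embed-and-compare pattern also backs the memory-retrieval and tool-selection scenarios later in this section:

import numpy as np

class SemanticRouter:
    def __init__(self, embeddings_client, routes: dict[str, str]):
        """routes maps a workflow name to its natural-language description."""
        self.names = list(routes.keys())
        vectors = np.array(embeddings_client.embed_documents(list(routes.values())))
        # Pre-normalize so matching is a single dot product per query
        self.vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

    def find_best_match(self, query_embedding) -> str:
        query = np.array(query_embedding)
        scores = self.vectors @ (query / np.linalg.norm(query))
        return self.names[int(np.argmax(scores))]

semantic_router = SemanticRouter(embeddings, workflows)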
# LangGraph Routing
def route_request(state: AgentState) -> str:
    user_message = state["messages"][-1].content

    # Get embedding for user message
    user_embedding = embeddings.embed_query(user_message)

    # Calculate similarity to each workflow type
    best_workflow = semantic_router.find_best_match(user_embedding)

    return best_workflow  # Returns "coding_help", "research", etc.

Scenario: Remember relevant context from past conversation
# Conversation History (with embeddings stored)
history = [
    "I'm setting up a LangGraph cluster on my local network",
    "My Jetson Orin Nano is running at 192.168.1.177 with Ollama",
    "The cpu-node has 32GB RAM and runs the load balancer"
]
# Current Question
current_question = "What was the IP address of my Jetson again?"
# Process
1. Convert current question to embedding
2. Compare with embeddings of all past messages
3. Find most similar past messages
4. Include relevant context in LLM prompt
# Result: LLM answers "192.168.1.177" based on conversation history

Scenario: Choose best tool for user request based on semantic understanding
# Available Tools (with embeddings)
tools = {
    "web_search": "Search internet for information, news, current events",
    "web_scrape": "Extract content from specific websites and web pages",
    "execute_command": "Run shell commands, system operations, file operations",
    "llm_query": "Generate text, answer questions, creative writing"
}
# User Request
request = "Find the latest news about artificial intelligence"
# Process
1. Create embeddings for all tool descriptions
2. Create embedding for user request
3. Find best matching tool (web_search)
4. Route request to web search workflow
# LangGraph Implementation
def select_tool_node(state: AgentState) -> str:
    user_request = state["messages"][-1].content
    request_embedding = embeddings.embed_query(user_request)

    best_tool = tool_selector.find_best_tool(request_embedding)

    if best_tool == "web_search":
        return "search_and_respond_node"
    elif best_tool == "web_scrape":
        return "scrape_and_respond_node"
    else:
        return "standard_response_node"

class EmbeddingsService:
    def get_embeddings(self, texts: List[str], model_name: str = "default"):
        model = self.models[model_name]

        # ARM optimization: Process in smaller batches
        batch_size = 16 if len(texts) > 16 else len(texts)

        if len(texts) <= batch_size:
            return model.encode(texts, batch_size=batch_size, show_progress_bar=False)

        # Process large requests in chunks for memory efficiency
        all_embeddings = []
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            batch_embeddings = model.encode(batch, batch_size=len(batch), show_progress_bar=False)
            all_embeddings.extend(batch_embeddings)

        return np.array(all_embeddings)

| Model | Batch Size | Processing Time | Memory Usage |
|---|---|---|---|
| all-MiniLM-L6-v2 | 1 text | ~5ms | ~50MB |
| all-MiniLM-L6-v2 | 16 texts | ~25ms | ~100MB |
| all-mpnet-base-v2 | 1 text | ~15ms | ~200MB |
| all-mpnet-base-v2 | 16 texts | ~80ms | ~400MB |
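These numbers are indicative for the ARM Cortex-A76; to measure your own hardware, you can time a request and compare the round trip against the processing_time field the server reports. A quick sketch, assuming the server is reachable at its default address:

import time

import requests

texts = ["sample text"] * 16  # one 16-text batch

start = time.perf_counter()
resp = requests.post(
    "http://192.168.1.178:8081/embeddings",
    json={"texts": texts, "model": "default"},
    timeout=30,
)
elapsed = time.perf_counter() - start

data = resp.json()
print(f"round trip: {elapsed * 1000:.1f} ms, server processing_time: {data['processing_time']:.3f} s")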
# HAProxy Configuration
frontend embeddings_frontend
    bind *:9002
    mode http
    default_backend embeddings_servers

backend embeddings_servers
    mode http
    balance roundrobin
    option httpchk GET /health
    server embeddings_primary 192.168.1.178:8081 check inter 30s fall 3 rise 2

Access Methods:
- Direct: http://192.168.1.178:8081/embeddings
- Load Balanced: http://192.168.1.81:9002/embeddings
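Since both paths expose the same API, switching a client to the load-balanced route only changes the base URL, for example with the LocalEmbeddings class from earlier:

# Direct to rp-node
embeddings = LocalEmbeddings("http://192.168.1.178:8081")

# Through the HAProxy frontend on port 9002
embeddings = LocalEmbeddings("http://192.168.1.81:9002")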
If rp-node goes down:
- HAProxy marks it unhealthy after 3 failed checks (90 seconds)
- You could add backup embedding servers on other nodes
- LangGraph falls back to simpler keyword-based routing
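One possible shape for that keyword fallback, sketched as a routing helper (the keyword lists are illustrative, and embeddings and semantic_router are the objects defined earlier in this guide):

import requests

def route_with_fallback(user_message: str) -> str:
    try:
        user_embedding = embeddings.embed_query(user_message)
        return semantic_router.find_best_match(user_embedding)
    except requests.RequestException:
        # Embedding server unreachable: keyword-based routing as a last resort
        text = user_message.lower()
        if any(word in text for word in ("error", "python", "code", "bug")):
            return "coding_help"
        if any(word in text for word in ("server", "deploy", "restart", "config")):
            return "system_admin"
        return "general_chat"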
# Service Configuration
/etc/systemd/system/embeddings-server.service:
[Unit]
Description=LangGraph Embeddings Server
After=network.target
[Service]
Type=simple
User=sanzad
WorkingDirectory=/home/sanzad/embeddings-server
Environment=PATH=/home/sanzad/embeddings-env/bin
ExecStart=/home/sanzad/embeddings-env/bin/python embeddings_server.py
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target

# Service Control
sudo systemctl start embeddings-server
sudo systemctl stop embeddings-server
sudo systemctl restart embeddings-server
sudo systemctl status embeddings-server
# Logs
sudo journalctl -u embeddings-server -f
sudo journalctl -u embeddings-server --since "1 hour ago"
# Health Check
curl http://192.168.1.178:8081/health
# Test Embedding
curl -X POST http://192.168.1.178:8081/embeddings \
-H "Content-Type: application/json" \
  -d '{"texts": ["Hello world"], "model": "default"}'

1. Server Won't Start
# Check logs
sudo journalctl -u embeddings-server -n 50
# Common causes:
- Python environment not activated
- Missing dependencies (pip install -r requirements.txt)
- Port 8081 already in use (netstat -ln | grep 8081)
- Insufficient memory (free -h)

2. Slow Response Times
# Check system resources
htop # Look for high CPU/memory usage
free -h # Check available memory
# Optimize:
- Reduce batch sizes in code
- Use lighter model (all-MiniLM-L6-v2)
- Add more RAM to rp-node

3. Connection Errors
# Test connectivity
ping 192.168.1.178
telnet 192.168.1.178 8081
# Check firewall
sudo ufw status
sudo ufw allow 8081/tcp

Add your own embedding models:
def load_custom_model(self):
    """Load custom embedding model"""
    try:
        self.models["custom"] = SentenceTransformer('/path/to/custom/model')
        logger.info("✅ Loaded custom model")
    except Exception as e:
        logger.error(f"Failed to load custom model: {e}")

Add Redis caching for frequent embeddings:
import hashlib
import json

import redis

class CachedEmbeddingsService:
    def __init__(self):
        # Redis runs on cpu-node; password per the cluster configuration
        self.redis_client = redis.Redis(host='192.168.1.81', port=6379, password='langgraph_redis_pass')

    def get_embeddings_cached(self, texts: List[str]):
        # Check cache first (use a stable digest; Python's built-in hash() changes between runs)
        cache_key = hashlib.sha256("\x1f".join(texts).encode()).hexdigest()
        cached = self.redis_client.get(f"embed:{cache_key}")
        if cached:
            return json.loads(cached)

        # Generate new embeddings
        embeddings = self.generate_embeddings(texts)

        # Cache for 1 hour
        self.redis_client.setex(f"embed:{cache_key}", 3600, json.dumps(embeddings))
        return embeddings

Connect to ChromaDB for persistent vector storage:
import chromadb

class VectorStore:
    def __init__(self):
        self.client = chromadb.Client()
        # get_or_create avoids an error if the collection already exists
        self.collection = self.client.get_or_create_collection("langgraph_docs")

    def add_documents(self, texts: List[str], embeddings: List[List[float]]):
        self.collection.add(
            embeddings=embeddings,
            documents=texts,
            ids=[f"doc_{i}" for i in range(len(texts))]
        )

    def search(self, query_embedding: List[float], n_results: int = 5):
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=n_results
        )
        return results

- Privacy: Your text never leaves your network
- Zero Cost: No API fees for embedding generation
- Low Latency: Direct network access (sub-10ms)
- Reliability: No external dependencies
- Customization: Add domain-specific models
- Scalability: Add more embedding servers as needed
The embedding server enables sophisticated AI patterns:
- RAG Workflows: Retrieve relevant documents for accurate answers
- Semantic Routing: Intelligent workflow selection based on meaning
- Memory Systems: Long-term conversation context retrieval
- Tool Selection: Choose appropriate tools based on semantic understanding
- Content Organization: Automatically categorize and organize information
The embedding server is not just "another service" - it's the semantic intelligence layer of your LangGraph cluster. It transforms your AI system from basic pattern matching to true understanding of meaning and context.
Key Takeaways:
- Runs on ARM-optimized hardware (rp-node) for efficiency
- Provides multiple embedding models for different use cases
- Integrates seamlessly with LangGraph through LocalEmbeddings class
- Enables advanced AI patterns like RAG, semantic routing, and memory
- Maintains privacy and zero-cost operation
- Scales with your needs
Without embeddings, your LangGraph would be limited to keyword matching and simple routing. With embeddings, it becomes a truly intelligent system that understands meaning, context, and relationships - just like human cognition! 🧠✨