This project includes a fully integrated, self-hosted LangFuse deployment on Amazon EKS for comprehensive agent observability. LangFuse provides distributed tracing, performance metrics, and cost tracking for all LLM-powered agents in the system.
The self-hosted LangFuse deployment includes:
| Component | Purpose | Status |
|---|---|---|
| LangFuse Web | Main application server and UI | ✅ Enabled |
| LangFuse Worker | Background job processor | ✅ Enabled |
| PostgreSQL | Primary data storage for traces, metrics, and configuration | ✅ Enabled |
| ClickHouse Cluster | Time-series analytics for high-volume trace data (3-node sharded cluster) | ✅ Enabled |
| Redis | Caching layer and queue management | ✅ Enabled |
| S3 (MinIO) | Object storage for media and large payloads | ✅ Enabled |
| ZooKeeper | Distributed coordination for ClickHouse cluster | ✅ Enabled |
Note: The full production-ready stack is deployed by default, providing scalability and high performance for trace analytics.
LangFuse is deployed using:
- **Helm Chart**: Official LangFuse Helm chart from https://langfuse.github.io/langfuse-helm
- **Terraform Module**: Custom `langfuse.tf` module that manages the Helm release
- **Kubernetes Namespace**: Deployed in a dedicated `langfuse` namespace
Update your `terraform.tfvars`:

```hcl
# Core LangFuse enablement
enable_langfuse = true

# Enable persistent storage (recommended for production)
enable_langfuse_persistence = true

# Optional: Configure LangFuse API keys after initial setup
# langfuse_public_key = "pk-lf-xxxxxxxx"
# langfuse_secret_key = "sk-lf-xxxxxxxx"
```

Then initialize and apply Terraform:

```bash
cd infra

# Initialize and apply Terraform
terraform init
terraform apply
```

This will deploy:
- LangFuse web and worker pods
- PostgreSQL with persistent storage
- ClickHouse 3-node sharded cluster
- Redis for caching
- MinIO for S3-compatible storage
- ZooKeeper for cluster coordination
- Kubernetes secrets for agent integration
Check that all components are running:

```bash
kubectl get pods -n langfuse
```

Expected output:

```
NAME                              READY   STATUS    RESTARTS   AGE
langfuse-clickhouse-shard0-0      1/1     Running   0          4h
langfuse-clickhouse-shard0-1      1/1     Running   0          4h
langfuse-clickhouse-shard0-2      1/1     Running   0          4h
langfuse-postgresql-0             1/1     Running   0          4h
langfuse-redis-primary-0          1/1     Running   0          4h
langfuse-s3-xxxxxxxxx-xxxxx       1/1     Running   0          4h
langfuse-web-xxxxxxxxx-xxxxx      1/1     Running   0          4h
langfuse-worker-xxxxxxxxx-xxxxx   1/1     Running   0          4h
langfuse-zookeeper-0              1/1     Running   0          4h
```

Port-forward to access the LangFuse UI:

```bash
kubectl port-forward -n langfuse svc/langfuse 3000:3000
```
Open your browser to http://localhost:3000. First-time setup:
- Create your admin account
- Navigate to Settings → API Keys
- Create a new API key pair
- Save the public and secret keys
Update `terraform.tfvars` with your API keys:

```hcl
langfuse_public_key = "pk-lf-xxxxxxxx"
langfuse_secret_key = "sk-lf-xxxxxxxx"
```

Apply the configuration:

```bash
terraform apply
```

This creates a `langfuse-credentials` secret that agents automatically use.
```bash
# Rebuild agent images with LangFuse integration
./build-images.sh admin hr finance

# Deploy agents - they'll automatically detect LangFuse
./deploy-helm.sh -m demo
```

Each agent includes the `langfuse_config.py` utility that:
- Reads credentials from environment variables (injected via Kubernetes secrets)
- Connects to LangFuse at http://langfuse.langfuse.svc.cluster.local:3000
- Automatically instruments all LLM calls and agent interactions
| Agent | Framework | LangFuse Integration |
|---|---|---|
| Admin Agent | Strands | ✅ Full trace instrumentation |
| HR Agent | CrewAI | ✅ Task and tool tracking |
| Finance Agent | LangGraph | ✅ Graph execution tracing |
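The instrumentation hooks differ per framework, but the underlying pattern is the same: wrap each call and record its name, latency, and outcome. A framework-agnostic sketch of that pattern (illustrative only; the real agents use the LangFuse SDK, not this code):

```python
import functools
import time

# Stand-in for the LangFuse backend -- purely illustrative.
TRACES: list[dict] = []


def traced(agent_name: str):
    """Record name, latency, and status for each wrapped call."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = "success"
            try:
                return fn(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                TRACES.append({
                    "agent_name": agent_name,
                    "name": fn.__name__,
                    "latency_s": time.perf_counter() - start,
                    "status": status,
                })
        return wrapper
    return decorator


@traced("hr")
def answer_query(query: str) -> str:
    # Placeholder for a real LLM-backed handler.
    return f"Handled: {query}"
```

The `finally` clause is what makes error traces possible: the record is emitted whether the call succeeds or raises.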
```bash
# Check all pods are running
kubectl get pods -n langfuse

# Check services
kubectl get svc -n langfuse
```

Send test queries through the UI:

```bash
# Port-forward the UI
kubectl port-forward svc/agents-ui-app-service 8501:80
```

Open http://localhost:8501 and send queries like:
- "What is the name of employee EMP0002?"
- "How many vacation days does EMP0001 have?"
- "What is the salary of EMP0003?"

Access the LangFuse dashboard:

```bash
kubectl port-forward -n langfuse svc/langfuse 3000:3000
```

Open http://localhost:3000 and navigate to view traces:
- **Traces Tab**: See all agent interactions
  - Click on any trace to see the full conversation flow
  - View the Admin → HR/Finance agent routing
  - See LLM calls with token counts
- **Dashboard Tab**: View aggregated metrics
  - Request volume over time
  - Latency percentiles (P50, P90, P99)
  - Error rates and success rates
  - Token usage and costs
- **Sessions Tab**: Track complete user conversations
  - See how queries flow through multiple agents
  - Understand the full context of multi-turn conversations
To analyze specific agents:

- **Filter by Agent Name**:
  - In the Traces view, use the filter dropdown
  - Select the `metadata.agent_name` or `name` field
  - Choose a specific agent (admin, hr, finance)
- **Filter by Time Range**:
  - Use the time selector in the top-right
  - View the last hour, day, week, or a custom range
- **Filter by Status**:
  - Success vs. error traces
  - High-latency traces (>2s)
- **Search Capabilities**:
  - Search by user input text
  - Search by agent response content
  - Search by error messages
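The same criteria can be applied client-side when post-processing exported trace data. A sketch over plain trace records; the record shape is an assumption, and only the `metadata.agent_name` field name comes from the filters described above:

```python
# Client-side filtering of exported trace records (illustrative shape).
def filter_traces(traces, agent_name=None, min_latency_s=None, status=None):
    """Return traces matching all of the given criteria; None means 'any'."""
    result = []
    for t in traces:
        if agent_name and t.get("metadata", {}).get("agent_name") != agent_name:
            continue
        if min_latency_s is not None and t.get("latency_s", 0) < min_latency_s:
            continue
        if status and t.get("status") != status:
            continue
        result.append(t)
    return result
```

For example, `filter_traces(traces, agent_name="finance", min_latency_s=2.0)` isolates the slow finance-agent traces flagged by the >2s filter above.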
| Metric | Description | Where to Find |
|---|---|---|
| Latency | Response time distribution | Dashboard → Latency chart |
| Throughput | Requests per minute/hour | Dashboard → Request volume |
| Token Usage | Input/output tokens per request | Traces → Individual trace details |
| Cost | Estimated LLM costs | Dashboard → Cost tracking |
| Error Rate | Failed requests percentage | Dashboard → Success rate |
| Agent Utilization | Which agents handle most queries | Dashboard → Group by metadata.agent_name |
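Cost figures are derived from token counts multiplied by per-model prices. The arithmetic, with placeholder prices (not real provider rates):

```python
# Placeholder per-1K-token prices in USD -- real rates depend on the
# model and provider, and LangFuse looks them up per model.
PRICES_PER_1K = {
    "input": 0.003,
    "output": 0.015,
}


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one LLM call from its token counts."""
    return (
        input_tokens / 1000 * PRICES_PER_1K["input"]
        + output_tokens / 1000 * PRICES_PER_1K["output"]
    )
```

A request with 1,000 input and 1,000 output tokens would cost about $0.018 at these placeholder rates; the dashboard sums this per trace and over time.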
Monitor these indicators:
- P99 Latency > 5s: Consider optimizing agent logic
- High Error Rate: Check agent logs for issues
- Token Spike: Review prompts for efficiency
- Uneven Distribution: Admin agent routing may need tuning
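These checks can also be scripted against exported latency and status data. A sketch using the thresholds above (nearest-rank P99; the 5% error-rate cutoff is an assumed example, since the section gives no specific number):

```python
import math


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered)) - 1
    return ordered[max(rank, 0)]


def check_indicators(traces, p99_threshold_s=5.0, max_error_rate=0.05):
    """Flag the P99-latency and error-rate conditions listed above.

    Assumes a non-empty list of records with latency_s and status fields.
    """
    latencies = [t["latency_s"] for t in traces]
    errors = sum(1 for t in traces if t["status"] == "error")
    alerts = []
    if percentile(latencies, 99) > p99_threshold_s:
        alerts.append("P99 latency above threshold")
    if errors / len(traces) > max_error_rate:
        alerts.append("error rate above threshold")
    return alerts
```

In practice you would feed this from a periodic export or the LangFuse API rather than hand-built records.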
ClickHouse Cluster:
- 3-node sharded cluster for horizontal scaling
- Handles high-volume trace ingestion
- Provides fast analytics queries
- Managed by ZooKeeper for coordination

Redis:
- Caches frequently accessed data
- Manages background job queues
- Improves dashboard performance

S3 (MinIO):
- Stores large trace payloads
- Archives historical data
- Provides S3-compatible API
- Check the secret is created:

```bash
kubectl get secret langfuse-credentials -n default

# Inspect it (data values are base64-encoded; decode individual
# fields with `base64 -d`, not the whole YAML)
kubectl get secret langfuse-credentials -n default -o yaml
```

- Verify agent environment variables:

```bash
kubectl describe pod <agent-pod-name> | grep LANGFUSE
```

- Check agent logs for the LangFuse connection:

```bash
kubectl logs <agent-pod-name> | grep -i langfuse
```

Check database health:

```bash
# Check PostgreSQL status
kubectl logs -n langfuse langfuse-postgresql-0

# Check ClickHouse cluster
kubectl logs -n langfuse langfuse-clickhouse-shard0-0

# Check disk usage if using persistence
kubectl exec -n langfuse langfuse-postgresql-0 -- df -h /bitnami/postgresql
```

If the UI port-forward fails:

```bash
# Check if port is already in use
lsof -i :3000

# Use an alternative local port
kubectl port-forward -n langfuse svc/langfuse 3001:3000
```

For production access without port-forwarding:
```hcl
# Option 1: LoadBalancer
langfuse_service_type = "LoadBalancer"

# Option 2: Ingress (requires an ingress controller)
langfuse_ingress_enabled = true
langfuse_ingress_hostname = "langfuse.yourdomain.com"
```

- Rotate API Keys Regularly: Generate new keys quarterly
- Use AWS Secrets Manager: Store LangFuse keys securely
- Enable RBAC: Restrict namespace access
- Network Policies: Limit traffic to LangFuse namespace
- Backup Strategy:
  - Regular PostgreSQL backups
  - ClickHouse data replication
  - Persistent volume snapshots
Monitor resource usage and scale as needed:
```bash
# Check resource usage
kubectl top pods -n langfuse

# Scale web replicas if needed
kubectl scale deployment langfuse-web -n langfuse --replicas=3
```