A distributed consensus system that simulates AWS Lambda functions running locally using Docker containers with AWS Runtime Interface Emulator (RIE). The system implements a simplified Raft-like consensus algorithm for maintaining a consistent global count across five Java-based Lambda functions that communicate exclusively through SQS queues.
- Overview
- Architecture
- Prerequisites
- Quick Start
- Configuration
- Usage
- Monitoring
- Testing
- Troubleshooting
- Performance
- Development
The Lambda Consensus Federation demonstrates:
- Distributed Consensus: Five Lambda functions maintain consensus on a global count value
- Fault Tolerance: Automatic recovery when individual Lambda containers are restarted
- Message-Based Communication: All communication happens through local SQS queues
- Quorum-Based Recovery: Minimum 3 out of 5 nodes required for recovery operations
- Performance Optimization: Connection pooling, retry logic, and graceful shutdown handling
- ✅ 5-Node Consensus: Distributed agreement across multiple Lambda instances
- ✅ SQS Communication: Local SQS queues for inter-Lambda messaging
- ✅ Automatic Recovery: Failed nodes recover state when restarted
- ✅ Comprehensive Logging: Structured JSON logging for all operations
- ✅ Performance Monitoring: Built-in metrics and performance tracking
- ✅ CLI Tools: Command-line utilities for testing and monitoring
- ✅ Docker Integration: Complete containerized environment
┌─────────────────────────────────────────────────────────────────┐
│ Docker Environment │
├─────────────┬─────────────┬─────────────┬─────────────┬─────────┤
│ Lambda 1 │ Lambda 2 │ Lambda 3 │ Lambda 4 │Lambda 5 │
│ Java + RIE │ Java + RIE │ Java + RIE │ Java + RIE │Java+RIE │
└─────────────┴─────────────┴─────────────┴─────────────┴─────────┘
│ │ │ │ │
└─────────────┼─────────────┼─────────────┼─────────────┘
│ │ │
┌─────────────────────────────────────────────────────────────────┐
│ Local SQS (ElasticMQ) │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │Queue-1 │ │Queue-2 │ │Queue-3 │ │Queue-4 │ │Queue-5 │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
└─────────────────────────────────────────────────────────────────┘
- Increment Request: External trigger initiates count increment
- Proposal Phase: Receiving Lambda proposes new count to all peers
- Voting Phase: Each Lambda validates and votes on the proposal
- Commit Phase: If majority accepts, all Lambdas update their count
- Recovery Phase: Restarted Lambdas request current state from peers
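In the commit phase, the decision rule is a plain majority across the five nodes. A minimal, self-contained sketch of that rule, with hypothetical names (VoteTally, shouldCommit); the project's actual logic lives in ConsensusManagerImpl and may be structured differently:

```java
import java.util.Map;

// Sketch only: counts accept votes and commits once a simple majority is reached.
public class VoteTally {
    private static final int TOTAL_NODES = 5;
    private static final int QUORUM = TOTAL_NODES / 2 + 1; // 3 of 5

    /** Returns true once a majority of the known nodes has accepted the proposal. */
    public boolean shouldCommit(Map<String, Boolean> votesByNode) {
        long accepts = votesByNode.values().stream().filter(Boolean::booleanValue).count();
        return accepts >= QUORUM;
    }
}
```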
- Docker: Version 20.0 or higher
- Docker Compose: Version 2.0 or higher
- Java: JDK 21 (for local development)
- Maven: Version 3.8 or higher (for building)
- Memory: Minimum 4GB RAM (8GB recommended)
- CPU: 2+ cores recommended
- Disk: 2GB free space
- Network: Docker networking enabled
# Clone the repository
git clone <repository-url>
cd lambda-consensus-federation
# Build the project
mvn clean package
# Verify the build
ls -la target/lambda-consensus-federation-1.0.0.jar
# Start all services
docker-compose up -d
# Verify all containers are running
docker-compose ps
# Check logs
docker-compose logs -f
# Send increment request to any Lambda
./scripts/consensus-cli.sh increment lambda-node-1
# Check current count across all nodes
./scripts/consensus-cli.sh status
# Monitor logs in real-time
./scripts/view-logs.sh
# Restart a single node
./scripts/restart-node.sh lambda-node-3
# Verify recovery
./scripts/consensus-cli.sh status
Each Lambda container supports the following environment variables:
Variable | Description | Default |
---|---|---|
NODE_ID | Unique identifier for the Lambda instance | lambda-node-{timestamp} |
SQS_ENDPOINT | Local SQS endpoint URL | http://localhost:9324 |
KNOWN_NODES | Comma-separated list of peer node IDs | lambda-node-1,lambda-node-2,... |
LOG_LEVEL | Logging verbosity level | INFO |
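As a rough illustration, a handler could resolve these variables with the defaults listed above along the following lines (a sketch with a hypothetical NodeConfig class; the project's actual configuration loading may differ):

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: reads the documented environment variables, falling back to the defaults above.
public final class NodeConfig {
    public final String nodeId;
    public final String sqsEndpoint;
    public final List<String> knownNodes;
    public final String logLevel;

    private NodeConfig(String nodeId, String sqsEndpoint, List<String> knownNodes, String logLevel) {
        this.nodeId = nodeId;
        this.sqsEndpoint = sqsEndpoint;
        this.knownNodes = knownNodes;
        this.logLevel = logLevel;
    }

    public static NodeConfig fromEnvironment() {
        String known = getOrDefault("KNOWN_NODES",
                "lambda-node-1,lambda-node-2,lambda-node-3,lambda-node-4,lambda-node-5");
        return new NodeConfig(
                getOrDefault("NODE_ID", "lambda-node-" + System.currentTimeMillis()),
                getOrDefault("SQS_ENDPOINT", "http://localhost:9324"),
                Arrays.asList(known.split(",")),
                getOrDefault("LOG_LEVEL", "INFO"));
    }

    private static String getOrDefault(String key, String fallback) {
        String value = System.getenv(key);
        return (value == null || value.isBlank()) ? fallback : value;
    }
}
```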
The docker-compose.yml file defines:
- 5 Lambda containers: Each with unique NODE_ID
- ElasticMQ service: Local SQS implementation
- Trigger service: Sends random increment requests
- Shared network: For inter-container communication
- Volume mounts: For log aggregation
ElasticMQ configuration in elasticmq.conf:
include classpath("application.conf")

node-address {
  protocol = http
  host = "*"
  port = 9324
  context-path = ""
}

rest-sqs {
  enabled = true
  bind-port = 9324
  bind-hostname = "0.0.0.0"
  sqs-limits = strict
}
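On the Lambda side, the SQS client only needs to be pointed at this endpoint; ElasticMQ does not validate credentials. A sketch using the AWS SDK for Java v2, assuming the endpoint comes from SQS_ENDPOINT (the project's real client setup lives in the messaging package and may differ):

```java
import java.net.URI;
import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.sqs.SqsClient;

// Sketch: build an SQS client against the local ElasticMQ endpoint.
SqsClient sqs = SqsClient.builder()
        .endpointOverride(URI.create(
                System.getenv().getOrDefault("SQS_ENDPOINT", "http://localhost:9324")))
        .region(Region.US_EAST_1) // arbitrary; ElasticMQ ignores the region
        .credentialsProvider(StaticCredentialsProvider.create(
                AwsBasicCredentials.create("local", "local"))) // placeholder credentials
        .build();
```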
The system includes several CLI utilities:
# Send increment request
./scripts/consensus-cli.sh increment <node-id>
# Check status of all nodes
./scripts/consensus-cli.sh status
# Send batch requests for load testing
./scripts/consensus-cli.sh batch 10
# Monitor specific node
./scripts/consensus-cli.sh monitor lambda-node-1
# Start the consensus system
./scripts/start-consensus.sh
# Stop the consensus system
./scripts/stop-consensus.sh
# Restart specific node
./scripts/restart-node.sh <node-id>
# View aggregated logs
./scripts/view-logs.sh
# Run integration tests
./scripts/run-integration-tests.sh
// Create a consensus request
ConsensusRequest request = new ConsensusRequest(
    MessageType.INCREMENT_REQUEST,   // message type
    "source-node",                   // ID of the node issuing the request
    "target-node",                   // ID of the node the request is addressed to
    42L,                             // proposed count value
    "proposal-123",                  // proposal identifier
    Map.of("metadata", "value")      // optional metadata attached to the proposal
);
// Send via SQS handler
SQSMessageHandler handler = new SQSMessageHandlerImpl("node-1");
boolean success = handler.sendMessage("lambda-node-2", request);
# Increment count
curl -X POST http://localhost:8080/increment \
-H "Content-Type: application/json" \
-d '{"targetNode": "lambda-node-1"}'
# Get status
curl http://localhost:8080/status
All operations are logged in structured JSON format:
{
"timestamp": "2024-01-15T10:30:45.123Z",
"level": "INFO",
"logger": "ConsensusLambdaHandler",
"nodeId": "lambda-node-1",
"operation": "CONSENSUS_OPERATION",
"proposalId": "prop-123",
"proposedValue": 42,
"phase": "commit",
"metadata": {
"success": true,
"duration": 1250
}
}
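As an illustration, such an entry could be assembled and serialized roughly as follows, assuming Jackson is on the classpath; the field names mirror the example above and the StructuredLog class is hypothetical:

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;
import com.fasterxml.jackson.databind.ObjectMapper;

// Sketch only: builds a log entry shaped like the example above.
// The project's real logger lives in the logging package.
public class StructuredLog {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static String consensusEntry(String nodeId, String proposalId, long proposedValue,
                                         String phase, boolean success, long durationMs) throws Exception {
        Map<String, Object> entry = new LinkedHashMap<>();
        entry.put("timestamp", Instant.now().toString());
        entry.put("level", "INFO");
        entry.put("logger", "ConsensusLambdaHandler");
        entry.put("nodeId", nodeId);
        entry.put("operation", "CONSENSUS_OPERATION");
        entry.put("proposalId", proposalId);
        entry.put("proposedValue", proposedValue);
        entry.put("phase", phase);
        entry.put("metadata", Map.of("success", success, "duration", durationMs));
        return MAPPER.writeValueAsString(entry);
    }
}
```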
Built-in performance tracking includes:
- Consensus Duration: Time to complete consensus operations
- Message Counts: Sent/received message statistics
- Success Rates: Operation success/failure ratios
- Recovery Times: Node recovery duration metrics
Use the built-in log analyzer:
# Analyze consensus performance
java -cp target/lambda-consensus-federation-1.0.0.jar \
com.example.consensus.logging.LogAnalyzerCLI \
--analyze-consensus logs/
# Generate performance report
java -cp target/lambda-consensus-federation-1.0.0.jar \
com.example.consensus.logging.LogAnalyzerCLI \
--performance-report logs/
# Monitor all nodes
./scripts/view-logs.sh | grep "CONSENSUS_OPERATION"
# Monitor specific operations
docker-compose logs -f lambda-node-1 | jq '.operation'
# Watch consensus success rate
watch -n 5 './scripts/consensus-cli.sh status'
# Run all unit tests
mvn test
# Run specific test class
mvn test -Dtest=ConsensusLambdaHandlerTest
# Run with coverage
mvn test jacoco:report
# Run integration tests
mvn test -Dtest="*IntegrationTest"
# Run specific integration test
mvn test -Dtest=MultiNodeConsensusIntegrationTest
# Run with Docker containers
./scripts/run-integration-tests.sh
# Full system test
mvn test -Dtest=EndToEndSystemTest
# Chaos testing
mvn test -Dtest=AdvancedChaosTest
# Performance testing
mvn test -Dtest=ConsensusPerformanceIntegrationTest
# Generate load with CLI
./scripts/consensus-cli.sh batch 100
# Concurrent load testing
for i in {1..5}; do
./scripts/consensus-cli.sh batch 20 &
done
wait
Symptoms: Docker containers exit immediately or fail to start
Solutions:
# Check Docker daemon
docker info
# Verify image build
docker-compose build --no-cache
# Check resource limits
docker system df
docker system prune
Symptoms: "Connection refused" or SQS timeout errors
Solutions:
# Verify ElasticMQ is running
docker-compose ps elasticmq
# Check SQS endpoint
curl http://localhost:9324/
# Restart SQS service
docker-compose restart elasticmq
Symptoms: Nodes don't reach consensus or have different count values
Solutions:
# Check node status
./scripts/consensus-cli.sh status
# Verify all nodes are reachable
for i in {1..5}; do
docker-compose exec lambda-node-$i echo "Node $i OK"
done
# Check for network partitions
docker network ls
docker network inspect lambda-consensus-federation_consensus-network
Symptoms: Restarted nodes don't recover or get stuck in recovery state
Solutions:
# Check quorum availability
./scripts/consensus-cli.sh status | grep -c "IDLE"
# Manually trigger recovery
./scripts/restart-node.sh lambda-node-1
# Verify recovery logs
docker-compose logs lambda-node-1 | grep "RECOVERY"
Enable debug logging:
# Set debug level
export LOG_LEVEL=DEBUG
# Restart with debug logging
docker-compose down
docker-compose up -d
# View debug logs
docker-compose logs -f | grep "DEBUG"
# Check system resources
docker stats
# Monitor message queue lengths
curl http://localhost:9324/ | grep "ApproximateNumberOfMessages"
# Analyze consensus timing
grep "consensus_duration" logs/*.log | sort -n
# Check memory usage
docker stats --format "table {{.Container}}\t{{.MemUsage}}\t{{.MemPerc}}"
# Increase container memory limits
# Edit docker-compose.yml:
# mem_limit: 1g
# Find consensus failures
grep "consensus.*failed" logs/*.log
# Analyze recovery patterns
grep "RECOVERY" logs/*.log | cut -d' ' -f1-3 | sort | uniq -c
# Check message flow
grep "MESSAGE.*SENT\|MESSAGE.*RECEIVED" logs/*.log | head -20
Typical performance characteristics:
Metric | Value | Notes |
---|---|---|
Consensus Latency | < 5 seconds | 95th percentile |
Throughput | 1 op/10 seconds | Sequential operations |
Recovery Time | < 30 seconds | With 3+ nodes available |
Memory Usage | ~512MB/node | Including JVM overhead |
CPU Usage | < 10% idle | Burst during consensus |
# Increase message batch size
export SQS_MAX_MESSAGES=10
# Reduce polling wait time for faster response
export SQS_WAIT_TIME=5
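These variables are assumed to feed the SDK's receive call roughly as sketched below (SQS_MAX_MESSAGES and SQS_WAIT_TIME as exported above; the actual wiring lives in the SQS message handler):

```java
import software.amazon.awssdk.services.sqs.model.ReceiveMessageRequest;

// Sketch: apply the tuning variables above to a long-polling receive request (AWS SDK for Java v2).
static ReceiveMessageRequest tunedReceive(String queueUrl) {
    int maxMessages = Integer.parseInt(System.getenv().getOrDefault("SQS_MAX_MESSAGES", "10"));
    int waitSeconds = Integer.parseInt(System.getenv().getOrDefault("SQS_WAIT_TIME", "5"));
    return ReceiveMessageRequest.builder()
            .queueUrl(queueUrl)               // queue URL resolved elsewhere (e.g. via the queue-URL cache)
            .maxNumberOfMessages(maxMessages) // SQS caps this at 10 messages per call
            .waitTimeSeconds(waitSeconds)     // long polling reduces empty receives
            .build();
}
```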
# In Dockerfile, add JVM options:
ENV JAVA_OPTS="-Xmx512m -Xms256m -XX:+UseG1GC -XX:MaxGCPauseMillis=200"
The system applies several connection and resource optimizations out of the box:
- SQS Client: Reused connections with timeout management
- Thread Pool: Configurable pool size for message processing
- Queue Caching: URL caching to reduce SQS API calls
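The queue-URL caching idea fits in a few lines (hypothetical QueueUrlCache class; the real cache sits inside the SQS message handler):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.GetQueueUrlRequest;

// Sketch: resolve each queue URL once and reuse it for every subsequent send/receive.
public class QueueUrlCache {
    private final SqsClient sqs;
    private final Map<String, String> urlsByQueueName = new ConcurrentHashMap<>();

    public QueueUrlCache(SqsClient sqs) {
        this.sqs = sqs;
    }

    public String urlFor(String queueName) {
        return urlsByQueueName.computeIfAbsent(queueName, name ->
                sqs.getQueueUrl(GetQueueUrlRequest.builder().queueName(name).build()).queueUrl());
    }
}
```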
The current implementation supports exactly 5 nodes. To scale:
- Update the KNOWN_NODES environment variable for every container
- Modify the quorum calculation in ConsensusManagerImpl (see the sketch below)
- Add additional containers to docker-compose.yml
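The quorum change can be as small as deriving the majority threshold from the configured node list instead of hard-coding 3-of-5, for example (a sketch; QuorumConfig is a hypothetical name):

```java
import java.util.Arrays;
import java.util.List;

// Sketch: quorum follows the size of KNOWN_NODES, so adding containers only requires config changes.
public class QuorumConfig {
    public static int quorumSize() {
        String known = System.getenv().getOrDefault("KNOWN_NODES",
                "lambda-node-1,lambda-node-2,lambda-node-3,lambda-node-4,lambda-node-5");
        List<String> nodes = Arrays.asList(known.split(","));
        return nodes.size() / 2 + 1; // simple majority: 3 of 5, 4 of 7, ...
    }
}
```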
# In docker-compose.yml
services:
  lambda-node-1:
    mem_limit: 1g
    cpus: '0.5'
# Full build with tests
mvn clean package
# Skip tests for faster build
mvn clean package -DskipTests
# Build Docker images
docker-compose build
src/
├── main/java/com/example/consensus/
│ ├── handler/ # Lambda function handlers
│ ├── manager/ # Consensus algorithm implementation
│ ├── messaging/ # SQS message handling
│ ├── model/ # Data models and serialization
│ ├── state/ # Node state management
│ ├── logging/ # Structured logging and metrics
│ ├── cli/ # Command-line utilities
│ └── trigger/ # Trigger service for testing
└── test/java/com/example/consensus/
├── integration/ # Integration tests
├── handler/ # Handler unit tests
├── manager/ # Manager unit tests
└── ... # Other test packages
- Fork the repository
- Create a feature branch: git checkout -b feature/amazing-feature
- Commit changes: git commit -m 'Add amazing feature'
- Push to the branch: git push origin feature/amazing-feature
- Open a Pull Request
- Java: Follow Google Java Style Guide
- Logging: Use structured logging with appropriate levels
- Testing: Maintain >80% code coverage
- Documentation: Update README for new features
# Update version
mvn versions:set -DnewVersion=1.1.0
# Build and test
mvn clean package
# Create release
git tag -a v1.1.0 -m "Release version 1.1.0"
git push origin v1.1.0
This project is licensed under the MIT License - see the LICENSE file for details.
For questions, issues, or contributions:
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: Wiki
Happy Consensus Building! 🚀