A production-ready Apache Spark cluster configuration for ARM64 architecture, featuring Docker containers and Kubernetes deployment.
- 🐳 Multi-stage Docker build optimized for ARM64
- 🔄 Automated builds with GitHub Actions for both Spark and Hive images
- 🔒 Traefik ingress with TLS support
- 📊 Enhanced logging with rotation
- 🏥 Health checks and monitoring
- 🔄 Stateful master node
- 📦 Helm chart for easy deployment
- 🔧 Makefile for common tasks
- ✅ Stable release with verified master-worker connectivity
- 🐝 Integrated Hive Metastore and HiveServer2 support
- 🔄 Automated Hive image builds and deployments
The project is now in a stable state with:
- Verified master-worker connectivity
- Proper service discovery
- Resource cleanup
- Health monitoring
- Persistent logging
- Functional Hive Metastore and HiveServer2 services
- Integrated Spark-Hive connectivity
- Docker with ARM64 support
- Kubernetes cluster
- Helm 3.x
- kubectl
- make
To run the local testing environment, you need:
- Required JARs:
# Create directories for JARs
mkdir -p $HOME/spark-hive-jars $HOME/spark-extra-jars

# Download Hive JARs (version 2.3.9)
curl -L https://repo1.maven.org/maven2/org/apache/hive/hive-common/2.3.9/hive-common-2.3.9.jar -o $HOME/spark-hive-jars/hive-common-2.3.9.jar
curl -L https://repo1.maven.org/maven2/org/apache/hive/hive-cli/2.3.9/hive-cli-2.3.9.jar -o $HOME/spark-hive-jars/hive-cli-2.3.9.jar
curl -L https://repo1.maven.org/maven2/org/apache/hive/hive-metastore/2.3.9/hive-metastore-2.3.9.jar -o $HOME/spark-hive-jars/hive-metastore-2.3.9.jar
curl -L https://repo1.maven.org/maven2/org/apache/hive/hive-exec/2.3.9/hive-exec-2.3.9-core.jar -o $HOME/spark-hive-jars/hive-exec-2.3.9-core.jar
curl -L https://repo1.maven.org/maven2/org/apache/hive/hive-serde/2.3.9/hive-serde-2.3.9.jar -o $HOME/spark-hive-jars/hive-serde-2.3.9.jar
curl -L https://repo1.maven.org/maven2/org/apache/hive/hive-jdbc/2.3.9/hive-jdbc-2.3.9.jar -o $HOME/spark-hive-jars/hive-jdbc-2.3.9.jar

# Download AWS JARs
curl -L https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar -o $HOME/spark-extra-jars/hadoop-aws-3.3.4.jar
curl -L https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar -o $HOME/spark-extra-jars/aws-java-sdk-bundle-1.12.262.jar
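The metastore in the spark-shell example further below is backed by PostgreSQL, so the PostgreSQL JDBC driver also has to be on the classpath; it is not part of the JARs above and is not bundled with Spark by default. A hedged example (the 42.6.0 version is an assumption; any recent 42.x release should work, and the jar should then be appended to the --jars list):

# Download the PostgreSQL JDBC driver (version is an assumption)
curl -L https://repo1.maven.org/maven2/org/postgresql/postgresql/42.6.0/postgresql-42.6.0.jar -o $HOME/spark-extra-jars/postgresql-42.6.0.jar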
- Required Services:
- PostgreSQL server (for Hive metastore)
- MinIO or S3-compatible storage
- Apache Spark (for spark-shell)
- Beeline client (for Hive Server2 testing)
- Network Access:
- PostgreSQL port (default: 5432)
- MinIO/S3 port (default: 9000)
- Hive Server2 port (default: 10000)
- Environment Variables:
# PostgreSQL
export POSTGRES_HOST=your-postgres-host
export POSTGRES_PORT=5432
export POSTGRES_USER=hive
export POSTGRES_PASSWORD=hive

# MinIO/S3
export AWS_ENDPOINT_URL=http://your-minio-host:9000
export AWS_ACCESS_KEY_ID=your-access-key
export AWS_SECRET_ACCESS_KEY=your-secret-key
export MINIO_BUCKET=your-bucket-name
- Testing Tools:
- netcat (for port testing)
- curl (for downloading JARs)
- beeline (for Hive Server2 testing)
- sbt (for Scala testing)
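A minimal sketch to confirm these tools are on your PATH before starting:

# Check that the local testing tools are installed
for tool in nc curl beeline sbt spark-shell; do
  command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
done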
- Clone the repository:
git clone https://github.com/openbiocure/spark-arm.git
cd spark-arm
- Use the stable release:
git checkout stable
- Build and deploy:
make all
Or deploy step by step:
make build # Build Docker image
make push # Push to registry
make deploy # Deploy to Kubernetes
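After make deploy finishes, a quick kubectl check confirms the pods came up; the pod name below is an assumption based on the master StatefulSet, so adjust it to whatever kubectl get pods reports:

# Watch the Spark pods start
kubectl get pods -w

# Tail the master logs once its pod is Running (pod name is an assumption)
kubectl logs -f spark-master-0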
The following values can be configured in spark-arm/values.yaml:
# Master configuration
master:
  resources:
    limits:
      cpu: "1"
      memory: "1Gi"
    requests:
      cpu: "500m"
      memory: "512Mi"

# Worker configuration
worker:
  replicaCount: 2
  cores: "2"
  memory: "2048m"
  resources:
    limits:
      cpu: "2"
      memory: "2Gi"
    requests:
      cpu: "1"
      memory: "1Gi"
The cluster uses persistent storage for logs:
storage:
  className: local-path
  size: 10Gi
  accessMode: ReadWriteOnce
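Once deployed, you can confirm the log volumes were actually provisioned; claim names depend on the chart's templates, so this is only a sketch:

# List the persistent volume claims and confirm class, size, and access mode
kubectl get pvc
kubectl describe pvc | grep -E 'StorageClass|Capacity|Access Modes'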
Traefik ingress is configurable with TLS:
ingress:
  enabled: true
  className: "traefik"
  annotations:
    traefik.ingress.kubernetes.io/router.entrypoints: "websecure"
    traefik.ingress.kubernetes.io/router.tls: "true"
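TLS is terminated by Traefik. If you want the ingress to reference a specific certificate, one option is to create a TLS secret and point the ingress at it; the secret name and certificate paths below are assumptions:

# Create a TLS secret for the ingress from an existing certificate and key
kubectl create secret tls spark-arm-tls \
  --cert=path/to/tls.crt \
  --key=path/to/tls.key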
make build # Build Spark Docker image
make build-hive # Build Hive Docker image
make push # Push Spark image to registry
make push-hive # Push Hive image to registry
make deploy # Deploy to Kubernetes
make undeploy # Remove deployment
make logs # View logs
make test # Test cluster readiness
make clean # Clean up artifacts
make help # Show all commands
# Build Spark for ARM64
docker build --platform linux/arm64 -t spark-arm:latest -f docker/Dockerfile .
# Build Hive for ARM64
docker build --platform linux/arm64 -t hive-arm:latest -f hive/Dockerfile hive
# Test cluster readiness
make test
# View logs
make logs
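To inspect the cluster by eye, you can also port-forward the master's web UI; the service name and port are assumptions based on Spark standalone defaults:

# Forward the Spark master web UI (check the actual service name with kubectl get svc)
kubectl port-forward svc/spark-master 8080:8080
# Then open http://localhost:8080 to see registered workers and running applications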
To test the Spark and Hive integration locally, you can use the following spark-shell command:
spark-shell \
--master "local[*]" \
--jars \
$HOME/spark-hive-jars/hive-common-2.3.9.jar,\
$HOME/spark-hive-jars/hive-cli-2.3.9.jar,\
$HOME/spark-hive-jars/hive-metastore-2.3.9.jar,\
$HOME/spark-hive-jars/hive-exec-2.3.9-core.jar,\
$HOME/spark-hive-jars/hive-serde-2.3.9.jar,\
$HOME/spark-hive-jars/hive-jdbc-2.3.9.jar,\
$HOME/spark-extra-jars/hadoop-aws-3.3.4.jar,\
$HOME/spark-extra-jars/aws-java-sdk-bundle-1.12.262.jar \
--conf spark.sql.catalogImplementation=hive \
--conf spark.hadoop.javax.jdo.option.ConnectionURL=jdbc:postgresql://$POSTGRES_HOST:$POSTGRES_PORT/hive \
--conf spark.hadoop.javax.jdo.option.ConnectionDriverName=org.postgresql.Driver \
--conf spark.hadoop.javax.jdo.option.ConnectionUserName=$POSTGRES_USER \
--conf spark.hadoop.javax.jdo.option.ConnectionPassword=$POSTGRES_PASSWORD \
--conf spark.sql.warehouse.dir=s3a://test/warehouse \
--conf spark.hadoop.fs.s3a.endpoint=$AWS_ENDPOINT_URL \
--conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID \
--conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY \
--conf spark.hadoop.fs.s3a.path.style.access=true \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
Once the spark-shell is running, you can test the connectivity with the following commands:
- Test Hive Metastore Connection:
// List all databases
spark.sql("SHOW DATABASES").show()
// Create a test database
spark.sql("CREATE DATABASE IF NOT EXISTS test_db")
// Use the test database
spark.sql("USE test_db")
// Create a test table
spark.sql("""
CREATE TABLE IF NOT EXISTS test_table (
id INT,
name STRING,
value DOUBLE
)
""")
// Insert some test data
spark.sql("""
INSERT INTO test_table VALUES
(1, 'test1', 1.1),
(2, 'test2', 2.2)
""")
// Query the table
spark.sql("SELECT * FROM test_table").show()
- Test S3/MinIO Connection:
// Show the table definition, including its S3 location
spark.sql("SHOW CREATE TABLE test_table").show(false)
// Check if we can write to S3
spark.sql("""
CREATE TABLE IF NOT EXISTS s3_test (
id INT,
data STRING
) LOCATION 's3a://test/warehouse/s3_test'
""")
// Insert data to S3
spark.sql("""
INSERT INTO s3_test VALUES
(1, 's3_test1'),
(2, 's3_test2')
""")
// Verify data in S3
spark.sql("SELECT * FROM s3_test").show()
- Test Hive Server2 Connection (from another terminal):
# Using beeline client
beeline -u jdbc:hive2://localhost:10000
# Or using the Hive CLI
hive --service cli
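Beeline can also run statements non-interactively with -e, which makes a quick smoke test easy (assuming HiveServer2 points at the same metastore used in the spark-shell session above):

# Smoke-test HiveServer2 without an interactive session
beeline -u jdbc:hive2://localhost:10000 -e "SHOW DATABASES;"
beeline -u jdbc:hive2://localhost:10000 -e "SELECT * FROM test_db.test_table;"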
If you encounter any connection issues:
- Verify PostgreSQL is running and accessible:
nc -zv $POSTGRES_HOST $POSTGRES_PORT
- Verify MinIO is running and accessible:
nc -zv your-minio-host 9000
- Check the logs:
tail -f /opt/spark/logs/spark-*.log
- Verify all required JARs are present in the specified directories
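A simple listing is usually enough to spot a missing or zero-byte download:

# Confirm the Hive and AWS JARs downloaded correctly
ls -lh $HOME/spark-hive-jars/ $HOME/spark-extra-jars/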
- Master Node: StatefulSet with single replica
- Worker Nodes: Deployment with configurable replicas
- Storage: PersistentVolume for logs
- Networking: Traefik ingress with TLS
- Monitoring: Health checks and logging
The Spark cluster is deployed with a StatefulSet for the master and a Deployment for the workers, with the following characteristics:
- Master pod runs as a single replica
- Worker pods run with configurable replicas (worker.replicaCount, default: 2)
- Pods are scheduled on Linux nodes using a basic node selector
- No affinity rules or topology constraints are enforced, allowing flexible pod placement
- Each pod has its own persistent volume for logs
The pods use the following probe configurations for health monitoring:
- Startup Probe:
- Master: 5 seconds initial delay
- Worker: 10 seconds initial delay
- 3-second period
- 2-second timeout
- Liveness/Readiness Probes:
- 10 seconds initial delay
- 5-second period
- 3-second timeout
- 3 failure threshold
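To confirm the probes were applied as described, inspect a running pod; the pod name below is an assumption based on the master StatefulSet:

# Show the probe settings Kubernetes applied to the master pod
kubectl describe pod spark-master-0 | grep -E 'Liveness|Readiness|Startup'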
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Apache Spark
- Apache Hadoop
- Kubernetes
- Traefik
- GitHub Actions