Commit daaa778

feat: Add Kubernetes deployment support (#14)
* feat: Add Kubernetes deployment support with persistent state

  - Add production-ready K8s manifests for deployment
  - Implement persistent volumes for state preservation
  - Create ConfigMap for non-sensitive configuration
  - Add secrets template for sensitive credentials
  - Include comprehensive deployment documentation
  - Update .gitignore to exclude k8s/secrets.yaml

  Enables deployment to Kubernetes clusters while maintaining circuit breaker state, last run tracking, and BigQuery cache across pod restarts.

* chore: Add comprehensive Kubernetes patterns to .gitignore

  - Add kubeconfig and .kube/ directory patterns
  - Add Helm chart artifacts (*.tgz, Chart.lock)
  - Add Kustomize build outputs
  - Add temporary Kubernetes manifest patterns
  - Future-proof for additional K8s tooling
1 parent 8ad0436 commit daaa778

6 files changed: +450 -0 lines changed

.gitignore

Lines changed: 23 additions & 0 deletions
@@ -11,6 +11,29 @@ credentials.json
 **/private/
 .secrets
 
+# Kubernetes secrets (never commit actual secrets!)
+k8s/secrets.yaml
+
+# Kubernetes configuration files
+*.kubeconfig
+kubeconfig
+.kube/
+
+# Helm charts
+**/charts/*.tgz
+**/requirements.lock
+**/Chart.lock
+
+# Kustomize builds
+k8s/build/
+kustomization-build.yaml
+
+# Temporary Kubernetes manifests
+*.tmp.yaml
+*.backup.yaml
+k8s/*.tmp
+k8s/*.bak
+
 # Ignore virtual environments
 venv/

k8s/README.md

Lines changed: 177 additions & 0 deletions
@@ -0,0 +1,177 @@
# Service Quality Oracle - Kubernetes Deployment

This directory contains Kubernetes manifests for deploying the Service Quality Oracle with persistent state management.

## Prerequisites

- Kubernetes cluster (version 1.19+)
- `kubectl` configured to access your cluster
- Docker image published to `ghcr.io/graphprotocol/service-quality-oracle`
- **Storage class configured** (see "Configure Storage" below)

## Quick Start

### 1. Create Secrets (Required)

```bash
# Copy the example secrets file
cp k8s/secrets.yaml.example k8s/secrets.yaml

# Edit with your actual credentials
# IMPORTANT: Never commit secrets.yaml to version control
nano k8s/secrets.yaml
```

**Required secrets:**
- **`google-credentials`**: Service account JSON for BigQuery access
- **`blockchain-private-key`**: Private key for Arbitrum Sepolia transactions
- **`arbitrum-api-key`**: API key for Arbiscan contract verification
- **`slack-webhook-url`**: Webhook URL for operational notifications
- **`etherscan-api-key`**, **`studio-api-key`**, **`studio-deploy-key`**: Also referenced by `deployment.yaml`; the container cannot start if any referenced key is missing from the `service-quality-oracle-secrets` Secret
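
Values placed under a Secret's `data:` field must be base64-encoded, while values under `stringData:` are written as plain text. The layout of `secrets.yaml.example` is not shown in this commit, so the following is only a minimal sketch for the `data:` case; the function names and the placeholder URL are illustrative assumptions, not part of the repository.

```python
import base64


def encode_secret_value(plaintext: str) -> str:
    """Return the base64 form expected under a Secret's `data:` field."""
    return base64.b64encode(plaintext.encode("utf-8")).decode("ascii")


def decode_secret_value(encoded: str) -> str:
    """Reverse the encoding. Base64 is reversible by anyone, so it is not encryption."""
    return base64.b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    # Placeholder value for illustration only, not a real webhook URL.
    encoded = encode_secret_value("https://hooks.example.invalid/PLACEHOLDER")
    print(encoded)
    print(decode_secret_value(encoded))
```

Because decoding is this easy, access to the Secret object itself should be treated as access to the credentials.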

### 2. Configure Storage (Required)

```bash
# Check available storage classes
kubectl get storageclass

# If one storage class shows "(default)" next to its name, skip to step 3
# Otherwise, edit persistent-volume-claim.yaml and uncomment the appropriate storageClassName
```

**Common storage classes by platform** (names vary by cluster, so confirm them with `kubectl get storageclass`):
- **AWS EKS**: `gp2`, `gp3`, `ebs-csi`
- **Google GKE**: `standard`, `ssd`
- **Azure AKS**: `managed-premium`, `managed`
- **Local/Development**: `hostpath`, `local-path`
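
The default-class check in step 2 can also be scripted. Kubernetes marks a default StorageClass with the annotation `storageclass.kubernetes.io/is-default-class: "true"`. The sketch below assumes the input is the parsed output of `kubectl get storageclass -o json`; the function name and the sample data are hypothetical.

```python
DEFAULT_ANNOTATION = "storageclass.kubernetes.io/is-default-class"


def default_storage_classes(storageclass_list: dict) -> list:
    """Return the names of StorageClasses annotated as the cluster default."""
    names = []
    for item in storageclass_list.get("items", []):
        metadata = item.get("metadata", {})
        annotations = metadata.get("annotations") or {}
        if annotations.get(DEFAULT_ANNOTATION) == "true":
            names.append(metadata.get("name", ""))
    return names


if __name__ == "__main__":
    # Hand-written sample shaped like `kubectl get storageclass -o json` output.
    sample = {
        "items": [
            {"metadata": {"name": "gp3", "annotations": {DEFAULT_ANNOTATION: "true"}}},
            {"metadata": {"name": "gp2"}},
        ]
    }
    defaults = default_storage_classes(sample)
    print(defaults or "No default class: set storageClassName explicitly")
```

An empty result means the PVCs need an explicit `storageClassName` before they can bind.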

### 3. Deploy to Kubernetes

```bash
# Apply all manifests
kubectl apply -f k8s/

# Verify deployment
kubectl get pods -l app=service-quality-oracle
kubectl get pvc -l app=service-quality-oracle
```

### 4. Monitor Deployment

```bash
# Check pod status
kubectl describe pod -l app=service-quality-oracle

# View logs
kubectl logs -l app=service-quality-oracle -f

# Check persistent volumes
kubectl get pv
```

## Architecture

### Persistent Storage

The service uses **two persistent volumes** to maintain state across pod restarts:

- **`service-quality-oracle-data` (5GB)**: Circuit breaker state, last run tracking, BigQuery cache, CSV outputs
- **`service-quality-oracle-logs` (2GB)**: Application logs

**Mount points:**
- `/app/data` → Critical state files (circuit breaker, cache, outputs)
- `/app/logs` → Application logs

### Configuration Management

- **Non-sensitive configuration** → `ConfigMap` (`configmap.yaml`)
- **Sensitive credentials** → `Secret` (`secrets.yaml`)

This separation provides:
- ✅ Easy configuration updates without rebuilding images (pods must be restarted to pick up changes, because values are injected as environment variables)
- ✅ Credentials kept out of images and the ConfigMap (Secret values are base64-encoded, which is an encoding and not encryption)
- ✅ Clear separation of concerns

### Resource Allocation

**Requests (guaranteed):**
- CPU: 250m (0.25 cores)
- Memory: 512M

**Limits (maximum):**
- CPU: 1000m (1.0 core)
- Memory: 1G
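
The memory values above use decimal suffixes (`M`, `G`), which Kubernetes treats as powers of ten, whereas `Mi` and `Gi` are powers of two. The sketch below illustrates the difference; it is a simplified, hypothetical helper that handles only whole-number quantities with the suffixes listed, not the full Kubernetes quantity grammar.

```python
# Decimal (SI) and binary (IEC) suffixes used in Kubernetes memory quantities.
_SUFFIXES = {
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
}


def memory_quantity_to_bytes(quantity: str) -> int:
    """Convert a simple memory quantity such as "512M" or "1Gi" to bytes."""
    # Check the two-character suffixes before the one-character ones.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * _SUFFIXES[suffix]
    return int(quantity)  # No suffix: the value is already a byte count.


if __name__ == "__main__":
    for q in ("512M", "512Mi", "1G", "1Gi"):
        print(q, memory_quantity_to_bytes(q))
```

With these definitions `512M` is 512,000,000 bytes and `512Mi` is 536,870,912 bytes, so the request configured here is about 4.6% smaller than a `512Mi` request would be.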

## State Persistence Benefits

With persistent volumes, the service maintains:

1. **Circuit breaker state** → Prevents infinite restart loops
2. **Last run tracking** → Enables proper catch-up logic
3. **BigQuery cache** → Faster restarts (about 30s with a warm cache vs. about 5min without)
4. **CSV audit artifacts** → Regulatory compliance and debugging

## Health Checks

The deployment uses **file-based health checks** (same as docker-compose):

- **Liveness probe:** Checks that `/app/healthcheck` exists and was modified within the last 300 seconds
- **Readiness probe:** Verifies that the `/app/healthcheck` file exists
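
The probe logic can be expressed as ordinary functions, which makes it easier to test than the one-line `python -c` commands in `deployment.yaml`. This is a sketch under assumptions: the path and the 300-second threshold come from the liveness probe in this commit, but the function names and the idea that the application refreshes the file through a helper like `touch_healthcheck` are hypothetical.

```python
import os
import time

HEALTHCHECK_PATH = "/app/healthcheck"  # path used by both probes
MAX_AGE_SECONDS = 300  # liveness threshold from deployment.yaml


def touch_healthcheck(path: str = HEALTHCHECK_PATH) -> None:
    """Create the heartbeat file if needed and refresh its modification time."""
    with open(path, "a", encoding="utf-8"):
        pass
    os.utime(path, None)


def is_ready(path: str = HEALTHCHECK_PATH) -> bool:
    """Readiness: the heartbeat file exists."""
    return os.path.exists(path)


def is_alive(path: str = HEALTHCHECK_PATH, max_age: float = MAX_AGE_SECONDS) -> bool:
    """Liveness: the heartbeat file exists and was modified recently enough."""
    if not os.path.exists(path):
        return False
    return time.time() - os.path.getmtime(path) < max_age
```

A consequence of this design is that the pod only becomes ready after the first heartbeat is written, and it is restarted if the heartbeat stays stale across the configured number of failed liveness checks.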

## Troubleshooting

### Pod Won't Start

```bash
# Check events
kubectl describe pod -l app=service-quality-oracle

# Common issues:
# - Missing secrets
# - PVC provisioning failures
# - Image pull errors
```

### Check Persistent Storage

```bash
# Verify PVCs are bound
kubectl get pvc

# Check if volumes are mounted correctly
kubectl exec -it deployment/service-quality-oracle -- ls -la /app/data
```

### Debug Configuration

```bash
# Check environment variables (the private key is filtered out so it is never printed)
kubectl exec deployment/service-quality-oracle -- env | grep -E "(BIGQUERY|BLOCKCHAIN)" | grep -v PRIVATE_KEY

# Verify secrets are injected. They are environment variables here, not mounted files,
# so list variable names only and never their values
kubectl exec deployment/service-quality-oracle -- env | cut -d= -f1 | grep -E "(GOOGLE_APPLICATION_CREDENTIALS|BLOCKCHAIN_PRIVATE_KEY|_API_KEY|STUDIO_DEPLOY_KEY|SLACK_WEBHOOK_URL)"
```

## Security Best Practices

- **Secrets never committed** to version control
- **Service account** with minimal BigQuery permissions
- **Private key** stored in a Kubernetes Secret (base64-encoded, not encrypted, so restrict access to the Secret)
- **Resource limits** prevent resource exhaustion
- **Read-only filesystem** where possible

## Production Considerations

- **Backup strategy** for persistent volumes
- **Monitoring** and alerting setup
- **Log aggregation** (ELK stack, etc.)
- **Network policies** for additional security
- **Pod disruption budgets** for maintenance
- **Horizontal Pod Autoscaler** (only after state is moved off the pod's own volumes; the Deployment is pinned to a single replica because of its state management)

## Next Steps

1. **Test deployment** in staging environment
2. **Verify state persistence** across pod restarts
3. **Set up monitoring** and alerting
4. **Configure backup** for persistent volumes
5. **Enable quality checking** after successful validation

k8s/configmap.yaml

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
apiVersion: v1
kind: ConfigMap
metadata:
  name: service-quality-oracle-config
  labels:
    app: service-quality-oracle
data:
  # BigQuery Configuration
  BIGQUERY_LOCATION_ID: "US"
  BIGQUERY_PROJECT_ID: "graph-mainnet"
  BIGQUERY_DATASET_ID: "internal_metrics"
  BIGQUERY_TABLE_ID: "metrics_indexer_attempts"
  BIGQUERY_CURATION_TABLE_ID: "metrics_curator_signals"
  BIGQUERY_CURATOR_MAINNET_TABLE_ID: "curator_name_signal_dimensions_daily"
  BIGQUERY_CURATOR_ARBITRUM_TABLE_ID: "curator_name_signal_dimensions_arbitrum_daily"
  BIGQUERY_SUBGRAPH_LOOKUP_TABLE_ID: "subgraph_version_id_lookup"
  BIGQUERY_ANALYSIS_PERIOD_DAYS: "28"

  # Blockchain Configuration (Arbitrum Sepolia)
  BLOCKCHAIN_CONTRACT_ADDRESS: "0x6d5550698F930210c3f50efe744bF51C55D791f6"
  BLOCKCHAIN_FUNCTION_NAME: "allowIndexers"
  BLOCKCHAIN_CHAIN_ID: "421614"
  BLOCK_EXPLORER_URL: "https://sepolia.arbiscan.io"
  TX_TIMEOUT_SECONDS: "30"

  # RPC Provider URLs (Arbitrum Sepolia)
  BLOCKCHAIN_RPC_URL_1: "https://arbitrum-sepolia.drpc.org"
  BLOCKCHAIN_RPC_URL_2: "https://sepolia-rollup.arbitrum.io/rpc"
  BLOCKCHAIN_RPC_URL_3: "https://api.zan.top/arb-sepolia"
  BLOCKCHAIN_RPC_URL_4: "https://arbitrum-sepolia.gateway.tenderly.co"

  # Scheduling Configuration
  SCHEDULED_RUN_TIME: "10:00"

  # Subgraph URLs
  SUBGRAPH_URL_PRE_PRODUCTION: "https://api.studio.thegraph.com/query/110664/issuance-eligibility-oracle/v0.1.4"
  SUBGRAPH_URL_PRODUCTION: "https://gateway.thegraph.com/api/subgraphs/id/"

  # Processing Configuration
  BATCH_SIZE: "125"
  MAX_AGE_BEFORE_DELETION: "120"

  # Caching Configuration
  CACHE_MAX_AGE_MINUTES: "30"
  FORCE_BIGQUERY_REFRESH: "false"

  # Eligibility Criteria
  MIN_ONLINE_DAYS: "5"
  MIN_SUBGRAPHS: "10"
  MAX_LATENCY_MS: "5000"
  MAX_BLOCKS_BEHIND: "50000"
  MIN_CURATION_SIGNAL: "500"

  # Runtime Configuration
  RUN_ON_STARTUP: "true"
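
Every ConfigMap value is a string, so numbers and booleans such as `"125"` and `"false"` reach the container as text and must be parsed by the application. How the oracle actually does this is not shown in this commit; the sketch below is one possible approach, and `OracleSettings`, `load_settings`, and the accepted boolean spellings are assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _parse_bool(raw: str) -> bool:
    """Treat common truthy spellings as True; everything else is False."""
    return raw.strip().lower() in {"1", "true", "yes"}


@dataclass(frozen=True)
class OracleSettings:
    batch_size: int
    cache_max_age_minutes: int
    force_bigquery_refresh: bool
    run_on_startup: bool
    scheduled_run_time: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> OracleSettings:
    """Build typed settings from environment variables injected via envFrom."""
    env = os.environ if env is None else env
    return OracleSettings(
        batch_size=int(env["BATCH_SIZE"]),
        cache_max_age_minutes=int(env["CACHE_MAX_AGE_MINUTES"]),
        force_bigquery_refresh=_parse_bool(env["FORCE_BIGQUERY_REFRESH"]),
        run_on_startup=_parse_bool(env["RUN_ON_STARTUP"]),
        scheduled_run_time=env["SCHEDULED_RUN_TIME"],
    )
```

Parsing once at startup means a malformed value fails fast with a `KeyError` or `ValueError` instead of surfacing later in the middle of a run.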

k8s/deployment.yaml

Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  name: service-quality-oracle
  labels:
    app: service-quality-oracle
spec:
  replicas: 1  # Single instance due to state management
  selector:
    matchLabels:
      app: service-quality-oracle
  template:
    metadata:
      labels:
        app: service-quality-oracle
    spec:
      containers:
        - name: service-quality-oracle
          image: ghcr.io/graphprotocol/service-quality-oracle:latest
          envFrom:
            # Load all non-sensitive configuration from ConfigMap
            - configMapRef:
                name: service-quality-oracle-config
          env:
            # Secrets from Kubernetes Secret
            - name: GOOGLE_APPLICATION_CREDENTIALS
              valueFrom:
                secretKeyRef:
                  name: service-quality-oracle-secrets
                  key: google-credentials
            - name: BLOCKCHAIN_PRIVATE_KEY
              valueFrom:
                secretKeyRef:
                  name: service-quality-oracle-secrets
                  key: blockchain-private-key
            - name: ETHERSCAN_API_KEY
              valueFrom:
                secretKeyRef:
                  name: service-quality-oracle-secrets
                  key: etherscan-api-key
            - name: ARBITRUM_API_KEY
              valueFrom:
                secretKeyRef:
                  name: service-quality-oracle-secrets
                  key: arbitrum-api-key
            - name: STUDIO_API_KEY
              valueFrom:
                secretKeyRef:
                  name: service-quality-oracle-secrets
                  key: studio-api-key
            - name: STUDIO_DEPLOY_KEY
              valueFrom:
                secretKeyRef:
                  name: service-quality-oracle-secrets
                  key: studio-deploy-key
            - name: SLACK_WEBHOOK_URL
              valueFrom:
                secretKeyRef:
                  name: service-quality-oracle-secrets
                  key: slack-webhook-url
          volumeMounts:
            - name: data-volume
              mountPath: /app/data
            - name: logs-volume
              mountPath: /app/logs
          resources:
            requests:
              memory: "512M"  # Match docker-compose reservations
              cpu: "250m"
            limits:
              memory: "1G"  # Match docker-compose limits
              cpu: "1000m"  # Match docker-compose '1.0' cpus
          # Use file-based healthcheck like docker-compose (not HTTP)
          livenessProbe:
            exec:
              command:
                - python
                - -c
                - "import os, time; assert os.path.exists('/app/healthcheck') and time.time() - os.path.getmtime('/app/healthcheck') < 300, 'Healthcheck failed'"
            initialDelaySeconds: 60  # Match docker-compose start_period
            periodSeconds: 120  # docker-compose interval is 5m (300s); 2m is used here for faster detection
            timeoutSeconds: 30  # Match docker-compose timeout
            failureThreshold: 3  # Match docker-compose retries
          readinessProbe:
            exec:
              command:
                - python
                - -c
                - "import os; assert os.path.exists('/app/healthcheck'), 'Healthcheck file missing'"
            initialDelaySeconds: 10
            periodSeconds: 30
      volumes:
        - name: data-volume
          persistentVolumeClaim:
            claimName: service-quality-oracle-data
        - name: logs-volume
          persistentVolumeClaim:
            claimName: service-quality-oracle-logs
      restartPolicy: Always
