This guide documents the complete installation of a 4-node Kubernetes cluster on Turing RK1 boards using Talos Linux.
Use the deploy-talos-cluster.sh script for automated deployment:
```shell
# From repository root
./scripts/deploy-talos-cluster.sh prereq   # Check prerequisites
./scripts/deploy-talos-cluster.sh deploy   # Full automated deployment
```

| Command | Description |
|---|---|
| `prereq` | Check prerequisites (tools, BMC access, image) |
| `deploy` | Full deployment (all phases) |
| `download-image` | Download Talos image from factory |
| `flash` | Flash nodes with Talos image |
| `boot` | Power on and boot all nodes |
| `generate` | Generate cluster configurations |
| `apply` | Apply configurations to nodes |
| `bootstrap` | Bootstrap Kubernetes cluster |
| `kubeconfig` | Get kubeconfig for kubectl access |
| `longhorn` | Install Longhorn storage |
| `status` | Show cluster status |
| `reset` | Reset cluster (DESTRUCTIVE!) |
```shell
./scripts/deploy-talos-cluster.sh power-status   # Check BMC power status
./scripts/deploy-talos-cluster.sh power-on       # Power on all nodes
./scripts/deploy-talos-cluster.sh power-off      # Power off all nodes
./scripts/deploy-talos-cluster.sh uart 1         # View node 1 UART output
```

After deployment, check cluster health with:
```shell
./scripts/talos-cluster-status.sh
```

This script auto-detects the cluster type and displays:
- Node reachability and health
- Kubernetes nodes and conditions
- Pod status summary by namespace
- LoadBalancer and Ingress resources
- Longhorn storage status
- Recent warning events
For step-by-step manual installation, continue reading below.
- Prerequisites
- Hardware Overview
- Network Configuration
- BMC Access
- Talos Image Preparation
- Flashing Nodes
- Boot Order Fix (NVMe vs eMMC)
- NVMe Filesystem Mismatch
- Cluster Bootstrap
- Adding Worker Nodes
- Storage Setup
- Ingress Configuration
- Monitoring Setup
- Management Tools
- Verification
- Troubleshooting
Install the following on your workstation:
```shell
# Talos CLI
curl -sL https://talos.dev/install | sh
# or
brew install siderolabs/tap/talosctl

# Kubernetes CLI
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl && sudo mv kubectl /usr/local/bin/

# Helm
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

# Turing Pi CLI
# Download from: https://github.com/turing-machines/tpi/releases
```

| Node | Role | Hostname | IP Address | Storage |
|---|---|---|---|---|
| Node 1 | Control Plane | turing-cp1 | 10.10.88.73 | 31GB eMMC + 500GB NVMe |
| Node 2 | Worker | turing-w1 | 10.10.88.74 | 31GB eMMC + 500GB NVMe |
| Node 3 | Worker | turing-w2 | 10.10.88.75 | 31GB eMMC + 500GB NVMe |
| Node 4 | Worker | turing-w3 | 10.10.88.76 | 31GB eMMC + 500GB NVMe |
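When scripting against the boards, the sequential layout above means a node's IP can be derived from its slot number. A small convenience sketch (not part of the repo's tooling), valid only for the `.73`-`.76` assignment shown in the table:

```shell
# Node N (1-4) sits at 10.10.88.(72+N) in the layout above
node_ip() { echo "10.10.88.$(( 72 + $1 ))"; }

for n in 1 2 3 4; do
  echo "node $n -> $(node_ip "$n")"
done
```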
Hardware Specifications (per RK1 node):
- SoC: Rockchip RK3588 (8-core ARM64)
- RAM: 16GB or 32GB
- eMMC: 32GB (system disk - /dev/mmcblk0)
- NVMe: 500GB Crucial P3 (data disk - /dev/nvme0n1)
- NPU: 6 TOPS (not currently supported in Talos)
| Purpose | IP Range |
|---|---|
| BMC | 10.10.88.70 |
| Cluster Nodes | 10.10.88.73-76 |
| MetalLB Pool | 10.10.88.80-99 |
| Kubernetes API | 10.10.88.73:6443 |
| Service | IP |
|---|---|
| Ingress Controller | 10.10.88.80 |
| Portainer Agent | 10.10.88.81 |
Store BMC credentials in environment variables (do not commit to git):
```shell
# Add to ~/.bashrc or ~/.zshrc (not tracked by git)
export TPI_USERNAME=root
export TPI_PASSWORD="<your-bmc-password>"
export TPI_HOSTNAME=10.10.88.70
```

IMPORTANT: Always use environment variables when running TPI commands remotely:

```shell
# With env vars set, run commands normally
tpi info
tpi power status
tpi flash -n 1 --image-url "https://example.com/image.raw.xz"
```

Or source from a local env file (gitignored):

```shell
# Create .env.local (add to .gitignore)
source .env.local
tpi info
```

```shell
# System info
tpi info

# Power operations
tpi power status          # Check all nodes
tpi power on -n 1         # Power on node 1
tpi power on -n 1,2,3,4   # Power on all nodes
tpi power off -n 1        # Power off node 1

# Flash firmware (via URL - recommended)
tpi flash -n 1 --image-url "https://example.com/image.raw.xz"

# Flash firmware (local file on BMC)
tpi flash -n 1 -i /mnt/sdcard/image.raw

# UART access
tpi uart -n 1 get         # Get UART buffer
```

```shell
ssh root@10.10.88.70
# Use password from your credentials store
```

| Node | Serial Device | Baud Rate |
|---|---|---|
| Node 1 | /dev/ttyS2 | 115200 |
| Node 2 | /dev/ttyS3 | 115200 |
| Node 3 | /dev/ttyS4 | 115200 |
| Node 4 | /dev/ttyS5 | 115200 |
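The serial devices also follow the slot number: node N maps to `/dev/ttyS(N+1)`, per the table above. A small helper sketch (not part of the repo's tooling) that builds the console command for a given node:

```shell
# Node N (1-4) -> /dev/ttyS(N+1), per the UART table above
uart_dev() { echo "/dev/ttyS$(( $1 + 1 ))"; }
uart_cmd() { echo "picocom $(uart_dev "$1") -b 115200"; }

uart_cmd 1   # -> picocom /dev/ttyS2 -b 115200
```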
Create a custom Talos schematic with required extensions:

```yaml
# talos-schematic.yaml
overlay:
  name: turingrk1
  image: siderolabs/sbc-rockchip
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/iscsi-tools
      - siderolabs/util-linux-tools
```

```shell
curl -s -X POST --data-binary @talos-schematic.yaml \
  https://factory.talos.dev/schematics | jq -r '.id'
# Current schematic ID:
# 85f683902139269fbc5a7f64ea94a694d31e0b3d94347a225223fcbd042083ae
```

Current Image (Talos v1.11.6):

```
https://factory.talos.dev/image/85f683902139269fbc5a7f64ea94a694d31e0b3d94347a225223fcbd042083ae/v1.11.6/metal-arm64.raw.xz
```
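The factory URL follows the fixed pattern shown above (schematic ID, Talos version, platform image name). A helper sketch for assembling it, useful when bumping versions:

```shell
# Build a factory image URL from a schematic ID and Talos version,
# following the pattern of the URL above
image_url() {  # image_url <schematic-id> <talos-version>
  echo "https://factory.talos.dev/image/$1/$2/metal-arm64.raw.xz"
}

image_url 85f683902139269fbc5a7f64ea94a694d31e0b3d94347a225223fcbd042083ae v1.11.6
```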
To download locally:

```shell
mkdir -p images/latest
wget -O images/latest/metal-arm64.raw.xz \
  "https://factory.talos.dev/image/85f683902139269fbc5a7f64ea94a694d31e0b3d94347a225223fcbd042083ae/v1.11.6/metal-arm64.raw.xz"
```

Download the image locally first, then flash using `--image-path`:
```shell
# Ensure TPI_USERNAME, TPI_PASSWORD, TPI_HOSTNAME env vars are set

# Download image locally
wget -O /tmp/talos-rk1-v1.11.6.raw.xz \
  "https://factory.talos.dev/image/85f683902139269fbc5a7f64ea94a694d31e0b3d94347a225223fcbd042083ae/v1.11.6/metal-arm64.raw.xz"

# Flash control plane (node 1)
tpi flash -n 1 --image-path /tmp/talos-rk1-v1.11.6.raw.xz

# Flash worker nodes
for node in 2 3 4; do
  echo "Flashing node $node..."
  tpi flash -n $node --image-path /tmp/talos-rk1-v1.11.6.raw.xz
done

# Power on all nodes after flashing
for node in 1 2 3 4; do tpi power on -n $node; done
```

If the node is running Ubuntu or another OS with SSH access:
```shell
# SSH to node
ssh ubuntu@<node-ip>

# Download Talos image
wget https://factory.talos.dev/image/85f683902139269fbc5a7f64ea94a694d31e0b3d94347a225223fcbd042083ae/v1.11.6/metal-arm64.raw.xz

# Decompress
xz -d metal-arm64.raw.xz

# Flash to eMMC (DESTROYS CURRENT OS!)
sudo dd if=metal-arm64.raw of=/dev/mmcblk0 bs=4M status=progress

# Sync and shutdown
sudo sync
sudo shutdown -h now
```

```shell
# Copy image to BMC SD card
scp images/latest/metal-arm64.raw root@10.10.88.70:/mnt/sdcard/

# SSH to BMC and flash
ssh root@10.10.88.70
tpi flash -n 1 -i /mnt/sdcard/metal-arm64.raw
```
```shell
# Ensure TPI_USERNAME, TPI_PASSWORD, TPI_HOSTNAME env vars are set
tpi power on -n 1,2,3,4

# Wait for nodes to boot (2-3 minutes)
sleep 180

# Check if Talos maintenance port is open
for ip in 10.10.88.73 10.10.88.74 10.10.88.75 10.10.88.76; do
  nc -zv $ip 50000 2>&1 | grep -q succeeded && echo "$ip: Maintenance mode OK"
done
```

If a node boots from NVMe instead of eMMC, the NVMe likely has bootable content that U-Boot prioritizes. This results in the node running an old OS instead of the freshly flashed Talos.
- Node boots into Ubuntu instead of Talos after flashing
- SSH port 22 is open instead of Talos port 50000
- `lsblk` shows the root filesystem on mmcblk0, but the node runs the wrong OS
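These symptoms can be checked in one pass by probing which port each node answers on: Talos serves its maintenance/API port on 50000, while a leftover Ubuntu install answers SSH on 22. A diagnostic sketch (assumes `bash` with `/dev/tcp` support and coreutils `timeout`):

```shell
# Succeeds if <ip>:<port> accepts a TCP connection within 2 seconds
probe() { timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; }

# Classify a node by which port responds
classify() {
  if probe "$1" 50000; then echo talos
  elif probe "$1" 22; then echo ubuntu-or-other
  else echo unreachable
  fi
}

# Example sweep over the cluster nodes:
# for ip in 10.10.88.73 10.10.88.74 10.10.88.75 10.10.88.76; do
#   echo "$ip: $(classify "$ip")"
# done
```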
1. Power off the node:

   ```shell
   # Ensure TPI_USERNAME, TPI_PASSWORD, TPI_HOSTNAME env vars are set
   tpi power off -n 1
   ```

2. Flash Ubuntu temporarily (to get SSH access):

   ```shell
   # Ensure TPI_USERNAME, TPI_PASSWORD, TPI_HOSTNAME env vars are set
   tpi flash -n 1 --image-url "https://firmware.turingpi.com/turing-rk1/ubuntu_22.04_rockchip_linux/v2.1.0/ubuntu-22.04.5-v2.1.0.img"
   ```

3. SSH to the BMC and open a serial console:

   ```shell
   ssh root@10.10.88.70
   picocom /dev/ttyS2 -b 115200   # For Node 1
   ```

4. From another terminal, power on the node:

   ```shell
   # Ensure TPI_USERNAME, TPI_PASSWORD, TPI_HOSTNAME env vars are set
   tpi power on -n 1
   ```

5. In the picocom session, interrupt U-Boot: press the spacebar when you see "Hit any key to stop autoboot".

6. Set the boot order to eMMC first:

   ```
   => setenv boot_targets "mmc0 nvme0"
   => saveenv
   => boot
   ```

7. Log in to Ubuntu and wipe the NVMe:

   ```shell
   # Login: ubuntu / ubuntu (will force a password change)
   # Set a new password when prompted
   sudo wipefs -a /dev/nvme0n1
   sudo shutdown -h now
   ```

8. Flash Talos:

   ```shell
   # Ensure TPI_USERNAME, TPI_PASSWORD, TPI_HOSTNAME env vars are set
   tpi flash -n 1 --image-url "https://factory.talos.dev/image/85f683902139269fbc5a7f64ea94a694d31e0b3d94347a225223fcbd042083ae/v1.11.6/metal-arm64.raw.xz"
   ```

9. Power on; the node should now boot Talos from eMMC:

   ```shell
   # Ensure TPI_USERNAME, TPI_PASSWORD, TPI_HOSTNAME env vars are set
   tpi power on -n 1
   ```
If nodes were previously running Ubuntu or another OS, the NVMe drives may have ext4 partitions. Talos expects XFS filesystem for its disk mounts, causing boot failures.
```
[talos] volume status ... "error": "filesystem type mismatch: ext4 != xfs"
[talos] controller failed ... "error": "error writing kubelet PKI: read-only file system"
```
The node boots but kubelet fails to start, and the node won't join the cluster.
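Before wiping anything, it can help to list which nodes actually show the mismatch. A sketch that scans captured log output for the error signature quoted above (the `talosctl dmesg` invocation is the usual way to capture it):

```shell
# Succeeds if stdin contains the ext4-vs-xfs mismatch signature
needs_wipe() { grep -q 'filesystem type mismatch: ext4 != xfs'; }

# Typical use (requires talosctl access to the node):
#   talosctl -n 10.10.88.74 dmesg | needs_wipe && echo "10.10.88.74 needs NVMe wipe"

# Self-contained demonstration against a captured log line:
echo '"error": "filesystem type mismatch: ext4 != xfs"' | needs_wipe && echo "match"
```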
Option A: From a working cluster (recommended)
If you have at least one working node (e.g., control plane), use it to wipe the NVMe on problem nodes:
```shell
export TALOSCONFIG=/path/to/talosconfig

# Wipe NVMe on each worker node
talosctl --endpoints 10.10.88.73 --nodes 10.10.88.74 wipe disk nvme0n1
talosctl --endpoints 10.10.88.73 --nodes 10.10.88.75 wipe disk nvme0n1
talosctl --endpoints 10.10.88.73 --nodes 10.10.88.76 wipe disk nvme0n1

# Apply config (will reboot to create new XFS partitions)
talosctl --endpoints 10.10.88.73 apply-config --nodes 10.10.88.74 --file worker2-final.yaml
```

Option B: Remove NVMe config temporarily
If all nodes are failing, temporarily remove the NVMe disk config from worker configs:
1. Edit worker configs to remove `machine.disks` and `machine.kubelet.extraMounts`
2. Apply the configs; nodes will boot without the NVMe mounts
3. Once nodes are up, wipe the NVMe using talosctl
4. Re-add the disk config and apply again
```shell
# After nodes are running without disk config:
talosctl --endpoints 10.10.88.73 --nodes 10.10.88.74 wipe disk nvme0n1

# Then apply full config with disk mounts
talosctl --endpoints 10.10.88.73 apply-config --nodes 10.10.88.74 --file worker-with-nvme.yaml
```

```shell
# Check volume status
talosctl --endpoints 10.10.88.73 --nodes 10.10.88.74 get discoveredvolumes | grep nvme
# Should show:
#   nvme0n1     disk        500 GB   gpt
#   nvme0n1p1   partition   500 GB   xfs   <-- Must be XFS, not ext4

# Check mount
talosctl --endpoints 10.10.88.73 --nodes 10.10.88.74 mounts | grep longhorn
# Should show /var/lib/longhorn mounted from nvme0n1p1
```

Generate cluster secrets once and keep them secure:
```shell
mkdir -p cluster-config
cd cluster-config
talosctl gen secrets -o secrets.yaml
```

```shell
# Generate control plane config
talosctl gen config turing-cluster https://10.10.88.73:6443 \
  --with-secrets secrets.yaml \
  --output-types controlplane \
  --output controlplane.yaml

# Generate worker config
talosctl gen config turing-cluster https://10.10.88.73:6443 \
  --with-secrets secrets.yaml \
  --output-types worker \
  --output worker.yaml
```

`controlplane-patch.yaml`:
```yaml
machine:
  network:
    hostname: turing-cp1
    interfaces:
      - interface: eth0
        dhcp: true
  install:
    disk: /dev/mmcblk0
  disks:
    - device: /dev/nvme0n1
      partitions:
        - mountpoint: /var/lib/longhorn
  kubelet:
    extraMounts:
      - destination: /var/lib/longhorn
        type: bind
        source: /var/lib/longhorn
        options:
          - bind
          - rshared
          - rw
  nodeLabels:
    node.kubernetes.io/exclude-from-external-load-balancers: ""
cluster:
  allowSchedulingOnControlPlanes: true
```

`worker-patch.yaml` (template - adjust hostname per node):
```yaml
machine:
  network:
    hostname: turing-w1   # Change for each worker: w1, w2, w3
    interfaces:
      - interface: eth0
        dhcp: true
  install:
    disk: /dev/mmcblk0
  disks:
    - device: /dev/nvme0n1
      partitions:
        - mountpoint: /var/lib/longhorn
  kubelet:
    extraMounts:
      - destination: /var/lib/longhorn
        type: bind
        source: /var/lib/longhorn
        options:
          - bind
          - rshared
          - rw
```

```shell
# Control plane
talosctl machineconfig patch controlplane.yaml --patch @controlplane-patch.yaml \
  --output controlplane-node1.yaml

# Workers (create separate patches for each with unique hostnames)
talosctl machineconfig patch worker.yaml --patch @worker2-patch.yaml --output worker2-final.yaml
talosctl machineconfig patch worker.yaml --patch @worker3-patch.yaml --output worker3-final.yaml
talosctl machineconfig patch worker.yaml --patch @worker4-patch.yaml --output worker4-final.yaml
```
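Rather than hand-maintaining three nearly identical worker patch files, they can be generated from the template by substituting the hostname. A sketch; the inline template here is a minimal stand-in so the example is self-contained (a real run would use the full `worker-patch.yaml` shown above):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Minimal stand-in for worker-patch.yaml (hostname only); the real
# template carries the full machine patch
cat > worker-patch.yaml <<'EOF'
machine:
  network:
    hostname: turing-w1
EOF

# Workers w1-w3 live on board nodes 2-4; emit one patch per worker
for i in 1 2 3; do
  node=$((i + 1))
  sed "s/hostname: turing-w1/hostname: turing-w${i}/" worker-patch.yaml \
    > "worker${node}-patch.yaml"
done

grep hostname worker4-patch.yaml   # prints the turing-w3 hostname line
```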
```shell
# Verify node is in maintenance mode
nc -zv 10.10.88.73 50000

# Apply config
talosctl apply-config --insecure --nodes 10.10.88.73 --file controlplane-node1.yaml
```

```shell
# Set endpoints and node
talosctl config endpoint 10.10.88.73
talosctl config node 10.10.88.73

# Or specify talosconfig location
export TALOSCONFIG=$(pwd)/talosconfig
```

Wait for the node to finish applying config (~2 minutes), then:
```shell
talosctl bootstrap --nodes 10.10.88.73
```

```shell
talosctl kubeconfig --force
```

```shell
# Check Talos health
talosctl health --wait-timeout 5m

# Check Kubernetes
kubectl get nodes -o wide
kubectl get pods -A
```

For nodes in maintenance mode (port 50000 open, no TLS required):
```shell
# Node 2
talosctl apply-config --insecure --nodes 10.10.88.74 --file worker2-final.yaml

# Node 3
talosctl apply-config --insecure --nodes 10.10.88.75 --file worker3-final.yaml

# Node 4
talosctl apply-config --insecure --nodes 10.10.88.76 --file worker4-final.yaml
```

```shell
# Watch nodes join
kubectl get nodes -w

# Expected output:
# NAME         STATUS   ROLES           AGE   VERSION
# turing-cp1   Ready    control-plane   10m   v1.34.1
# turing-w1    Ready    <none>          2m    v1.34.1
# turing-w2    Ready    <none>          2m    v1.34.1
# turing-w3    Ready    <none>          2m    v1.34.1
```

If a worker was configured with a different cluster's secrets:
```shell
# Reflash the node
# Ensure TPI_USERNAME, TPI_PASSWORD, TPI_HOSTNAME env vars are set
tpi flash -n 2 --image-url "https://factory.talos.dev/image/85f683902139269fbc5a7f64ea94a694d31e0b3d94347a225223fcbd042083ae/v1.11.6/metal-arm64.raw.xz"

# Power on
tpi power on -n 2

# Wait and apply config
sleep 120
talosctl apply-config --insecure --nodes 10.10.88.74 --file worker2-final.yaml
```

See STORAGE.md for detailed storage configuration.
```shell
# Create namespace
kubectl create namespace longhorn-system
kubectl label namespace longhorn-system pod-security.kubernetes.io/enforce=privileged

# Install Longhorn
kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/v1.7.2/deploy/longhorn.yaml

# Wait for deployment
kubectl -n longhorn-system rollout status deploy/longhorn-driver-deployer

# Create NVMe storage class
cat <<EOF | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-nvme
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "2"
  staleReplicaTimeout: "2880"
  diskSelector: "nvme"
  dataLocality: "best-effort"
EOF
```

See NETWORKING.md for detailed networking setup.
```shell
# Install MetalLB
kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.14.9/config/manifests/metallb-native.yaml

# Wait for pods
kubectl wait --namespace metallb-system \
  --for=condition=ready pod \
  --selector=app=metallb \
  --timeout=90s

# Configure IP pool
cat <<EOF | kubectl apply -f -
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool
  namespace: metallb-system
spec:
  addresses:
    - 10.10.88.80-10.10.88.99
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default
  namespace: metallb-system
spec:
  ipAddressPools:
    - default-pool
EOF
```
```shell
# Create and label namespace
kubectl create namespace ingress-nginx
kubectl label namespace ingress-nginx pod-security.kubernetes.io/enforce=privileged

# Install NGINX Ingress Controller (cloud provider version for LoadBalancer support)
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.12.0-beta.0/deploy/static/provider/cloud/deploy.yaml

# Wait for controller to be ready
kubectl wait --namespace ingress-nginx \
  --for=condition=ready pod \
  --selector=app.kubernetes.io/component=controller \
  --timeout=120s

# Verify LoadBalancer IP assigned (should be 10.10.88.80)
kubectl get svc -n ingress-nginx ingress-nginx-controller
```

See MONITORING.md for detailed monitoring configuration.
```shell
# Add Prometheus Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Create namespace
kubectl create namespace monitoring
kubectl label namespace monitoring pod-security.kubernetes.io/enforce=privileged

# Install kube-prometheus-stack with values file
helm install prometheus prometheus-community/kube-prometheus-stack \
  -n monitoring \
  -f cluster-config/prometheus-values.yaml \
  --wait --timeout 10m

# Verify deployment
kubectl get pods -n monitoring
kubectl get ingress -n monitoring
```

Save as `cluster-config/prometheus-values.yaml`:
```yaml
grafana:
  enabled: true
  adminPassword: admin
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - grafana.local
  persistence:
    enabled: true
    storageClassName: longhorn
    size: 5Gi

prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: longhorn
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 20Gi
    resources:
      requests:
        cpu: 200m
        memory: 512Mi
      limits:
        cpu: 1000m
        memory: 2Gi
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - prometheus.local

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: longhorn
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 2Gi
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - alertmanager.local

# Disable components not accessible on Talos
kubeEtcd:
  enabled: false
kubeScheduler:
  enabled: false
kubeControllerManager:
  enabled: false
```

Add to /etc/hosts:
```
10.10.88.80 grafana.local prometheus.local alertmanager.local longhorn.local
```
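Since this guide may be re-run, the entry is worth adding idempotently. A sketch; it operates on a temp file so the example is self-contained, whereas a real run would target `/etc/hosts` with sudo:

```shell
# Append the hosts entry only if one of its names is not already present
HOSTS_LINE='10.10.88.80 grafana.local prometheus.local alertmanager.local longhorn.local'
add_hosts_entry() {  # add_hosts_entry <file>
  grep -qF 'grafana.local' "$1" || echo "$HOSTS_LINE" >> "$1"
}

f=$(mktemp)
add_hosts_entry "$f"
add_hosts_entry "$f"          # second call is a no-op
grep -c grafana.local "$f"    # -> 1
```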
| Service | URL | Credentials |
|---|---|---|
| Grafana | http://grafana.local | admin / admin |
| Prometheus | http://prometheus.local | - |
| Alertmanager | http://alertmanager.local | - |
```shell
kubectl apply -f https://downloads.portainer.io/ce2-22/portainer-agent-k8s-nodeport.yaml
kubectl label namespace portainer pod-security.kubernetes.io/enforce=privileged
kubectl patch svc portainer-agent -n portainer -p '{"spec":{"type":"LoadBalancer"}}'
```

Connection URL: `10.10.88.81:9001`
```shell
# Nodes
kubectl get nodes -o wide

# System pods
kubectl get pods -A

# Storage
kubectl get nodes.longhorn.io -n longhorn-system

# Services with external IPs
kubectl get svc -A --field-selector spec.type=LoadBalancer

# Ingress
kubectl get ingress -A
```

Expected nodes:

```
NAME         STATUS   ROLES           AGE   VERSION
turing-cp1   Ready    control-plane   1h    v1.34.1
turing-w1    Ready    <none>          1h    v1.34.1
turing-w2    Ready    <none>          1h    v1.34.1
turing-w3    Ready    <none>          1h    v1.34.1
```
If too many authentication attempts lock out the BMC:
```
Exceeded allowed authentication attempts. Access blocked for Xm Ys
```
Solution: Wait for the lockout period to expire, then retry with correct credentials.
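For scripted workflows, the wait-and-retry can be automated with a generic helper (a sketch, not part of the repo's tooling; the commented `tpi info` line is just an example target):

```shell
# Retry a command with a pause between attempts, to ride out short
# lockout windows
retry_with_wait() {  # retry_with_wait <attempts> <pause-seconds> <cmd...>
  local attempts=$1 pause=$2; shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    echo "attempt $i failed; waiting ${pause}s" >&2
    sleep "$pause"
  done
  return 1
}

# retry_with_wait 5 60 tpi info
```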
```shell
# Check port 50000
nc -zv <node-ip> 50000
# If "connection refused" - Talos not running or wrong IP
# If "tls: certificate required" - node already configured
```

```shell
# Must reflash the node
# Ensure TPI_USERNAME, TPI_PASSWORD, TPI_HOSTNAME env vars are set
tpi flash -n <node> --image-url "https://factory.talos.dev/image/85f683902139269fbc5a7f64ea94a694d31e0b3d94347a225223fcbd042083ae/v1.11.6/metal-arm64.raw.xz"
```

See the Boot Order Fix section.
```shell
# System logs
talosctl -n <node-ip> dmesg

# Service logs
talosctl -n <node-ip> logs kubelet
talosctl -n <node-ip> logs etcd

# All services status
talosctl -n <node-ip> services
```

```shell
# Check for PodSecurity issues
kubectl describe pod <pod-name> -n <namespace>

# Label namespace as privileged if needed
kubectl label namespace <ns> pod-security.kubernetes.io/enforce=privileged
```

```shell
# Check Longhorn status
kubectl get volumes.longhorn.io -n longhorn-system

# Check NVMe mounts on node
talosctl -n <node-ip> mounts | grep nvme
```

| Component | Version |
|---|---|
| Talos | v1.11.6 |
| Kubernetes | v1.34.1 |
| Longhorn | v1.7.2 |
| MetalLB | v0.14.9 |
| Ingress NGINX | v1.12.0-beta.0 |
| kube-prometheus-stack | latest (Helm) |
| File | Purpose |
|---|---|
| `cluster-config/secrets.yaml` | Cluster secrets (KEEP SECURE!) |
| `cluster-config/talosconfig` | Talos CLI configuration |
| `cluster-config/controlplane-node1.yaml` | Control plane config |
| `cluster-config/worker*-final.yaml` | Worker configs |
| `talos-schematic.yaml` | Image customization |