# Turing RK1 Kubernetes Cluster


A 4-node bare-metal Kubernetes cluster built on Turing RK1 compute modules, running either Talos Linux or K3s on Armbian. Designed for edge computing, AI/ML workloads with NPU acceleration, and distributed storage.

## Choose Your Distribution

| Distribution | Best For | NPU/GPU | Shell Access |
| --- | --- | --- | --- |
| Talos Linux | Production, security | No | API only |
| K3s on Armbian | Development, AI/ML | Yes | SSH |

See [docs/COMPARISON.md](docs/COMPARISON.md) for a detailed feature comparison.

## Quick Start

```shell
# Talos Linux (automated deployment)
./scripts/deploy-talos-cluster.sh prereq    # Check prerequisites
./scripts/deploy-talos-cluster.sh deploy    # Full deployment

# K3s on Armbian
./scripts/setup-k3s-node.sh      # Run on each node
./scripts/deploy-k3s-cluster.sh  # Deploy from workstation

# Check cluster status (works with both distributions)
./scripts/talos-cluster-status.sh  # Auto-detects and shows health summary
```

**Note:** This project is under active development. See CONTRIBUTING.md for how to get involved.

## Hardware Summary

### Turing Pi 2 Board

| Component | Specification |
| --- | --- |
| Form Factor | Mini-ITX |
| Node Slots | 4x CM4/RK1 compatible |
| BMC | Integrated management controller |
| Networking | Gigabit Ethernet per node |
| Storage | NVMe slot per node |

### Turing RK1 Compute Modules (x4)

| Component | Specification |
| --- | --- |
| SoC | Rockchip RK3588 |
| CPU | 4x Cortex-A76 @ 2.4GHz + 4x Cortex-A55 @ 1.8GHz |
| RAM | 16GB / 32GB LPDDR4X |
| GPU | Mali-G610 MP4 |
| NPU | 6 TOPS (INT8) - see limitations |
| eMMC | 32GB (system disk) |
| NVMe | 500GB Crucial P3 (worker nodes) |

## Cluster Topology

```text
┌─────────────────────────────────────────────────────────────┐
│                    Turing Pi 2 BMC                          │
│                     10.10.88.70                             │
├─────────────┬─────────────┬─────────────┬───────────────────┤
│   Node 1    │   Node 2    │   Node 3    │      Node 4       │
│ Control Pl. │   Worker    │   Worker    │      Worker       │
│ 10.10.88.73 │ 10.10.88.74 │ 10.10.88.75 │   10.10.88.76     │
│   32GB eMMC │ 32GB + 500GB│ 32GB + 500GB│  32GB + 500GB     │
└─────────────┴─────────────┴─────────────┴───────────────────┘
```

### Total Resources

| Resource | Amount |
| --- | --- |
| CPU Cores | 32 (8 per node) |
| RAM | 64-128GB |
| Storage (eMMC) | 128GB |
| Storage (NVMe) | 1.5TB |
| Network | 4x 1Gbps |

## Software Stack

### Operating System

| Component | Version | Notes |
| --- | --- | --- |
| Talos Linux | v1.11.6 | Immutable, API-driven Kubernetes OS |
| Linux Kernel | 6.12.62 | Mainline kernel (ARM64) |

### Kubernetes Components

| Component | Version | Purpose |
| --- | --- | --- |
| Kubernetes | v1.34.1 | Container orchestration |
| containerd | v2.1.5 | Container runtime |
| etcd | Bundled | Distributed key-value store |

### Storage

| Component | Version | Purpose |
| --- | --- | --- |
| Longhorn | Latest | Distributed block storage |
| CSI Driver | Longhorn | Persistent volume provisioning |

### Networking

| Component | Version | Purpose |
| --- | --- | --- |
| Flannel | Bundled | Pod networking (CNI) |
| MetalLB | Latest | LoadBalancer for bare-metal |
| NGINX Ingress | Latest | HTTP/HTTPS ingress controller |

### Monitoring

| Component | Version | Purpose |
| --- | --- | --- |
| Prometheus | Latest | Metrics collection & alerting |
| Grafana | Latest | Visualization & dashboards |
| Alertmanager | Latest | Alert routing & management |
| Node Exporter | Latest | Host-level metrics |
| kube-state-metrics | Latest | Kubernetes state metrics |

### Management

| Component | Version | Purpose |
| --- | --- | --- |
| Portainer Agent | v2.33.6 | Remote cluster management |
| talosctl | v1.11.6 | Talos node management |
| kubectl | v1.34.x | Kubernetes CLI |
| Helm | v3.x | Package manager |

## Cluster Capabilities

### What This Cluster Can Do

#### Container Orchestration

- Run containerized workloads across 4 nodes
- Automatic pod scheduling and load balancing
- Rolling updates and rollbacks
- Health monitoring and self-healing

#### Distributed Storage

- ~1.5TB distributed storage via Longhorn
- Volume replication across nodes (configurable 1-3 replicas)
- Snapshots and backups
- Dynamic volume provisioning
- High-performance NVMe-backed storage class
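The replica count can be pinned per StorageClass. A minimal sketch, assuming a custom 2-replica class (the class and PVC names here are illustrative, not the ones the repo ships; `driver.longhorn.io` is Longhorn's standard CSI provisioner):

```yaml
# Illustrative StorageClass pinning Longhorn to 2 replicas per volume
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-2-replica        # hypothetical class name
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "2"
  staleReplicaTimeout: "30"
---
# A PVC that provisions a volume from that class
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-data                  # hypothetical claim name
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn-2-replica
  resources:
    requests:
      storage: 10Gi
```

As the limitations section notes, 2 replicas trade redundancy for capacity: a single node failure leaves only one copy.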

#### Networking

- LoadBalancer services via MetalLB (10.10.88.80-89)
- HTTP/HTTPS ingress with NGINX
- TLS termination
- Path- and host-based routing
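A Service can request a specific address from the MetalLB pool via annotation. A hedged sketch (the service name, selector, and target port are hypothetical; the actual pool is defined in `cluster-config/metallb-config.yaml`):

```yaml
# Illustrative LoadBalancer Service pinned to an address from the pool
apiVersion: v1
kind: Service
metadata:
  name: demo-web                   # hypothetical service
  annotations:
    metallb.universe.tf/loadBalancerIPs: 10.10.88.82  # from the available pool
spec:
  type: LoadBalancer
  selector:
    app: demo-web
  ports:
    - port: 80
      targetPort: 8080
```

Omitting the annotation lets MetalLB assign the next free address from the pool automatically.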

#### Edge Computing

- Low-power ARM64 architecture (~10W per node)
- Compact form factor (Mini-ITX)
- Suitable for remote/edge deployments

#### Development & Testing

- Full Kubernetes API compatibility
- Helm chart deployment
- GitOps-ready
- Multi-architecture image support (arm64)

#### AI/ML Workloads (CPU)

- ARM64-optimized inference
- NumPy, ONNX Runtime, PyTorch (CPU)
- ~12 GFLOPS matrix operations per node
- Distributed training/inference across nodes

#### Monitoring & Observability

- Full cluster metrics via Prometheus
- Pre-configured Grafana dashboards
- Node-, pod-, and container-level monitoring
- Alerting with Alertmanager
- External Docker host monitoring support
- Longhorn storage metrics integration

## Limitations & Known Issues

### NPU Not Available (Talos Only)

| Issue | Status | Details |
| --- | --- | --- |
| RK3588 NPU inaccessible | Talos: Not supported | Talos uses the mainline Linux kernel, which lacks Rockchip's proprietary RKNPU driver |
| | K3s/Armbian: Supported | BSP kernel includes full NPU support |

**Impact:** On Talos, the 6 TOPS NPU in each RK3588 cannot be used for hardware-accelerated AI inference.

**Solutions:**

1. Use K3s on Armbian - full NPU support with the RKNN SDK (see docs/INSTALLATION-K3S.md)
2. Use CPU-based inference on Talos (ONNX Runtime, TensorFlow Lite)
3. Wait for the mainline NPU driver (in kernel review)

### GPU Not Available (Talos Only)

| Issue | Status | Details |
| --- | --- | --- |
| Mali-G610 GPU inaccessible | Talos: Not supported | No GPU driver/passthrough in Talos |
| | K3s/Armbian: Supported | OpenCL and Vulkan available |

**Impact:** On Talos, no GPU acceleration for graphics or compute workloads. K3s on Armbian provides full GPU support.

### Storage Limitations

| Issue | Status | Details |
| --- | --- | --- |
| Control plane has no NVMe | By design | Only workers have NVMe; the control plane uses eMMC only |
| Single-replica risk | Configurable | Default is 3 replicas; 2-replica mode loses redundancy if a node fails |

### Network Limitations

| Issue | Status | Details |
| --- | --- | --- |
| No native LoadBalancer | Mitigated | MetalLB provides L2 LoadBalancer functionality |
| Single network interface | Hardware | Each node has only 1x 1Gbps NIC |

### Talos-Specific Considerations

| Issue | Details |
| --- | --- |
| Immutable filesystem | Cannot install packages; must use extensions or containers |
| No SSH access | Nodes managed via the `talosctl` API only |
| Privileged namespaces | Many add-ons require the `pod-security.kubernetes.io/enforce=privileged` label |

### Known Bugs

| Issue | Status | Workaround |
| --- | --- | --- |
| PodSecurity warnings on deploy | Expected | Label namespaces as privileged |
| MetalLB speaker pods require privileges | Expected | Namespace is pre-labeled |
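The privileged-namespace workaround amounts to setting the standard Pod Security labels at namespace creation. An illustrative manifest (the namespace name is just an example; the same labels apply to any add-on namespace that needs them):

```yaml
# Example: a namespace pre-labeled for privileged workloads
apiVersion: v1
kind: Namespace
metadata:
  name: longhorn-system            # example namespace
  labels:
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/warn: privileged
```

Setting `audit` and `warn` alongside `enforce` also silences the PodSecurity warnings noted above.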

## Network Configuration

### IP Allocation

| Resource | IP Address | Port(s) |
| --- | --- | --- |
| BMC | 10.10.88.70 | 22 (SSH) |
| Control Plane | 10.10.88.73 | 6443 (API) |
| Worker 1 | 10.10.88.74 | - |
| Worker 2 | 10.10.88.75 | - |
| Worker 3 | 10.10.88.76 | - |
| Ingress Controller | 10.10.88.80 | 80, 443 |
| Portainer Agent | 10.10.88.81 | 9001 |
| Available Pool | 10.10.88.82-89 | - |

### Internal Networks

| Network | CIDR | Purpose |
| --- | --- | --- |
| Pod Network | 10.244.0.0/16 | Container IPs |
| Service Network | 10.96.0.0/12 | ClusterIP services |

## Quick Access

### Management URLs

| Service | URL | Notes |
| --- | --- | --- |
| Kubernetes API | https://10.10.88.73:6443 | Use kubeconfig |
| Grafana | http://grafana.local | Default: admin/admin |
| Prometheus | http://prometheus.local | Metrics & queries |
| Alertmanager | http://alertmanager.local | Alert management |
| Longhorn UI | http://longhorn.local | Storage management |
| Portainer | Your Portainer instance | Connect agent: 10.10.88.81:9001 |

Add to `/etc/hosts`:

```text
10.10.88.80  grafana.local prometheus.local alertmanager.local longhorn.local
```
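All of the `*.local` hosts resolve to the single ingress IP, and NGINX then routes by `Host` header. A sketch of what one such rule might look like (resource and backend service names are assumptions; the actual rules live in `cluster-config/ingress-config.yaml`):

```yaml
# Illustrative host-based routing rule for the ingress controller
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana                    # hypothetical resource name
spec:
  ingressClassName: nginx
  rules:
    - host: grafana.local
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: grafana      # assumed backend service name
                port:
                  number: 80
```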

### CLI Access

```shell
# Set environment variables
export TALOSCONFIG=/path/to/cluster-config/talosconfig
export KUBECONFIG=/path/to/cluster-config/kubeconfig

# Verify cluster
kubectl get nodes
talosctl health
```

## BMC Access Setup

The deployment scripts require access to the Turing Pi BMC. Configure credentials by copying the example file:

```shell
cp .env.example .env
# Edit .env with your BMC credentials
```

Required variables in `.env`:

| Variable | Description | Default |
| --- | --- | --- |
| TPI_HOSTNAME | BMC IP address | 10.10.88.70 |
| TPI_USERNAME | BMC login username | - |
| TPI_PASSWORD | BMC login password | - |
| USE_LOCAL_TPI | Use local `tpi` CLI (1) or SSH to BMC (0) | 1 |
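A filled-in `.env` might look like this (the username and password are placeholders; only the hostname default comes from the table above):

```shell
# .env - Turing Pi BMC access (placeholder credentials)
TPI_HOSTNAME=10.10.88.70   # BMC IP address (default)
TPI_USERNAME=root          # placeholder - use your BMC username
TPI_PASSWORD=changeme      # placeholder - use your BMC password
USE_LOCAL_TPI=1            # 1 = local tpi CLI, 0 = SSH to the BMC
```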

Test BMC connectivity:

```shell
./scripts/wipe-cluster.sh status
```

## Documentation Map

### Primary Documentation

| Document | Path | Description |
| --- | --- | --- |
| Docs Index | docs/README.md | Documentation overview |
| Talos Installation | docs/INSTALLATION.md | Talos Linux setup guide |
| K3s Installation | docs/INSTALLATION-K3S.md | K3s on Armbian setup guide |
| Distribution Comparison | docs/COMPARISON.md | Talos vs K3s feature matrix |
| Architecture Diagrams | docs/ARCHITECTURE.md | Visual cluster architecture (Mermaid) |
| Storage Guide | docs/STORAGE.md | Longhorn and NVMe configuration |
| Networking Guide | docs/NETWORKING.md | MetalLB and Ingress setup |
| Monitoring Guide | docs/MONITORING.md | Prometheus, Grafana & external monitoring |
| Quick Reference | docs/QUICKREF.md | Command cheatsheet |

### Configuration Files

| File | Path | Description |
| --- | --- | --- |
| Talos Config | cluster-config/talosconfig | Talos CLI configuration |
| Kubeconfig | cluster-config/kubeconfig | Kubernetes access |
| Cluster Secrets | cluster-config/secrets.yaml | Keep secure! |
| MetalLB Config | cluster-config/metallb-config.yaml | IP pool configuration |
| Ingress Config | cluster-config/ingress-config.yaml | Ingress rules |
| Portainer Agent | cluster-config/portainer-agent.yaml | Agent deployment |
| Prometheus Values | cluster-config/prometheus-values.yaml | Monitoring stack config |
| External Scrape | cluster-config/external-scrape-config.yaml | Docker host monitoring |

### Reference Documentation

| Document | Path | Description |
| --- | --- | --- |
| Cluster Plan | CLUSTER_PLAN.md | Original deployment plan |
| Talos Schematic | talos-schematic.yaml | Custom image configuration |

### External Resources

| Resource | URL |
| --- | --- |
| Talos Documentation | https://www.talos.dev/docs/ |
| K3s Documentation | https://docs.k3s.io/ |
| Longhorn Documentation | https://longhorn.io/docs/ |
| Turing Pi Documentation | https://docs.turingpi.com/ |
| MetalLB Documentation | https://metallb.io/ |
| NGINX Ingress | https://kubernetes.github.io/ingress-nginx/ |
| Prometheus Documentation | https://prometheus.io/docs/ |
| Grafana Documentation | https://grafana.com/docs/ |
| RKNN SDK (NPU) | https://github.com/airockchip/rknn-toolkit2 |
| RKLLM (LLM inference) | https://github.com/airockchip/rknn-llm |

## Directory Structure

```text
turing-rk1-cluster/
├── README.md                 # This file
├── CLUSTER_PLAN.md           # Deployment planning document
├── .env.example              # Environment variables template
├── talos-schematic.yaml      # Talos image customization
├── cluster-config/           # Cluster configurations
│   ├── talosconfig           # Talos CLI config
│   ├── kubeconfig            # Kubernetes access
│   ├── secrets.yaml          # Cluster secrets (sensitive!)
│   ├── controlplane.yaml     # Control plane config (Talos)
│   ├── worker.yaml           # Worker config (Talos)
│   ├── metallb-config.yaml   # MetalLB IP pool
│   ├── ingress-config.yaml   # Ingress rules
│   ├── prometheus-values.yaml # Monitoring stack config
│   ├── external-scrape-config.yaml # External targets
│   └── *.yaml                # Other configurations
├── scripts/                  # Automation scripts
│   ├── deploy-talos-cluster.sh # Automated Talos deployment
│   ├── talos-cluster-status.sh # Cluster health and status checker
│   ├── setup-k3s-node.sh     # Armbian node preparation
│   ├── deploy-k3s-cluster.sh # K3s cluster deployment
│   └── wipe-cluster.sh       # Cluster reset/migration tool
├── docs/                     # Documentation
│   ├── README.md             # Docs index
│   ├── INSTALLATION.md       # Talos setup guide
│   ├── INSTALLATION-K3S.md   # K3s on Armbian setup guide
│   ├── COMPARISON.md         # Talos vs K3s comparison
│   ├── ARCHITECTURE.md       # Cluster architecture diagrams
│   ├── STORAGE.md            # Storage guide
│   ├── NETWORKING.md         # Network guide
│   ├── MONITORING.md         # Monitoring guide
│   └── QUICKREF.md           # Quick reference
├── images/                   # Talos images
│   └── latest/
│       └── metal-arm64.raw   # Current Talos image
└── repo/                     # Submodules/repos
    ├── sbc-rockchip/         # Talos Rockchip overlay
    ├── rknn-toolkit2/        # RKNN SDK v2.3.2 (for K3s)
    ├── rknn-llm/             # RKLLM v1.2.3 (for K3s)
    └── rknn_model_zoo/       # Pre-built models (for K3s)
```

## Security Notes

1. **Secrets Protection:** `cluster-config/secrets.yaml` contains cluster credentials. Keep it secure and never commit it to public repositories.
2. **BMC Access:** The BMC (10.10.88.70) has full control over all nodes. Restrict network access appropriately.
3. **Privileged Workloads:** Many add-ons require privileged namespace labels. Review the security implications before deploying untrusted workloads.
4. **Network Segmentation:** Consider isolating the cluster network (10.10.88.x) from untrusted networks.


## Contributing

This is a personal homelab cluster. Configuration files and documentation are provided as-is for reference.

## License

Configuration files and documentation are provided under MIT license. Third-party components retain their original licenses.