This Ansible-based project provisions Erhhung's high-availability Kubernetes cluster at home, named `homelab`, and deploys services for monitoring various IoT appliances as well as for hosting other personal projects, including self-hosted LLMs and RAG pipelines that enable multi-source hybrid search and agentic automation using a local knowledge base containing vast amounts of personal and sensor data.

The approach taken with all service deployments is to treat the clusters as a production environment (to the extent possible with limited resources and scaling capacity across a few mini PCs). That means TLS everywhere, authenticated user access, metrics scraping, and dashboards and alerts configured throughout.
The top-level Ansible playbook `main.yml`, run by `play.sh`, will provision 5 VM hosts (`rancher` and `k8s1`..`k8s4`) in the existing XCP-ng `Home` pool, all running Ubuntu Server 24.04 Minimal without customizations besides basic networking and an authorized SSH key for the user `erhhung`.
A single-node K3s Kubernetes cluster will be installed on host `rancher`, along with Rancher Server on that cluster, and a 4-node RKE2 Kubernetes cluster with a high-availability control plane using virtual IPs will be installed on hosts `k8s1`..`k8s4`. Longhorn and NFS storage provisioners will be installed in each cluster to manage a pool of LVM logical volumes on each node and to expand overall storage capacity using NFS shares on the QNAP NAS.
All cluster services will be provisioned with TLS certificates from Erhhung's private CA server at `pki.fourteeners.local` or its faster mirror at `cosmos.fourteeners.local`.
- K3s Kubernetes Cluster — lightweight Kubernetes distro for resource-constrained environments
  - Install on the `rancher` host using the official install script
- Rancher Cluster Manager — provision (or import), manage, and monitor Kubernetes clusters
  - Install on K3s cluster using the `rancher` Helm chart
- RKE2 Kubernetes Cluster — Kubernetes distribution with focus on security and compliance
  - Install on hosts `k8s1`-`k8s4` using the RKE2 Ansible Role with HA mode enabled
- Certificate Manager — X.509 certificate management for Kubernetes
  - Install on K3s and RKE clusters using the `cert-manager` Helm chart
  - Connect to Step CA `pki.fourteeners.local` using the `step-issuer` Helm chart
  - Connect to Step CA `pki.fourteeners.local` as an ACME `ClusterIssuer`
- Wave Config Monitor — ensure pods run with up-to-date `ConfigMaps` and `Secrets`
  - Install on K3s and RKE clusters using the `wave` Helm chart
- Longhorn Block Storage — distributed block storage for Kubernetes
  - Install on main RKE cluster using the `longhorn` Helm chart
- NFS Dynamic Provisioner — create persistent volumes on NFS shares
  - Install on K3s and RKE clusters using the `nfs-subdir-external-provisioner` Helm chart
- Harbor Container Registry — private OCI container and Helm chart registry
  - Install on K3s cluster using the `harbor` Helm chart
- MinIO Object Storage — S3-compatible object storage with console
  - Install on main RKE cluster using the MinIO Operator and MinIO Tenant Helm charts
- Velero Backup & Restore — back up and restore persistent volumes
  - Install on main RKE cluster using the `velero` Helm chart
  - Install Velero Dashboard using the `velero-ui` Helm chart
- Node Feature Discovery — label nodes with available hardware features, like GPUs
  - Install on K3s and RKE clusters using the `node-feature-discovery` Helm chart
  - Install Intel Device Plugins using the `intel-device-plugins-operator` Helm chart
  - Install NVIDIA GPU Operator on RKE cluster ... when I procure an NVIDIA card :(
- OpenSearch Logging Stack — aggregate and filter logs using OpenSearch and Fluent Bit
  - Install on main RKE cluster using the `opensearch` and `opensearch-dashboards` Helm charts
  - Install Fluent Bit using the `fluent-operator` Helm chart and `FluentBit` CR
- PostgreSQL Database — SQL database used by Keycloak and other applications
  - Install on main RKE cluster using Bitnami's `postgresql-ha` Helm chart
- Keycloak IAM & OIDC Provider — identity and access management and OpenID Connect provider
  - Install on main RKE cluster using the `keycloakx` Helm chart
- Valkey Key/Value Store — Redis-compatible key/value store
  - Install on main RKE cluster using the `valkey-cluster` Helm chart
- Prometheus Monitoring Stack — Prometheus (via Operator), Thanos sidecar, and Grafana
  - Install on main RKE cluster using the `kube-prometheus-stack` Helm chart
  - Add authentication to Prometheus and Alertmanager UIs using `oauth2-proxy` sidecar
  - Install other Thanos components using Bitnami's `thanos` Helm chart for global querying
  - Enable the OTLP receiver endpoint for metrics (when needed)
- Istio Service Mesh with Kiali Console — secure, observe, trace, and route traffic between workloads
  - Install on main RKE cluster using the `istioctl` CLI
  - Install Kiali using the `kiali-operator` Helm chart and `Kiali` CR
- Meshery Visual GitOps Platform — manage infrastructure visually and collaboratively
  - Install on K3s cluster using the `meshery` Helm chart, along with `meshery-istio` and `meshery-nighthawk` adapters
  - Connect to main RKE cluster, along with Prometheus and Grafana
- Argo CD Declarative GitOps — manage deployment of other applications in the main RKE cluster
  - Install on main RKE cluster using the `argo-cd` Helm chart
- Kubernetes Metacontroller — enable easy creation of custom controllers
  - Install on main RKE cluster using the `metacontroller` Helm chart
- Ollama LLM Server with Ollama CLI — run LLMs on Kubernetes cluster
  - Install on an Intel GPU node using the `ollama` Helm chart and IPEX-LLM Ollama portable zip
- Open WebUI AI Platform — extensible AI platform with Ollama integration and local RAG support
  - Install on main RKE cluster using the `open-webui` Helm chart
  - Replace the default Chroma vector DB with Qdrant — install using the `qdrant` Helm chart
- Flowise Agentic Workflows — build AI agents using visual workflows
  - Install on main RKE cluster using the `flowise` Helm chart
- OpenTelemetry Collector with Jaeger UI — telemetry collector agent and distributed tracing backend
  - Install on main RKE cluster using the OpenTelemetry Collector Helm chart
  - Install Jaeger using the Jaeger Helm chart
- Backstage Developer Portal — software catalog and developer portal
- NATS — high-performance message queues (Kafka alternative) with JetStream for persistence
- Migrate manually provisioned certificates and secrets to ones issued by `cert-manager` with auto-rotation
- Identify and upload additional sources of personal documents into Open WebUI knowledge base collections
- Automate creation of DNS records in pfSense via custom Ansible module that invokes pfSense REST APIs
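
Most of the components above are deployed as Helm releases. From Ansible, such an installation typically boils down to a task like the sketch below, shown using the `kubernetes.core.helm` module with a placeholder chart and values; the repo's actual tasks, chart versions, and values files may differ:

```yaml
# Illustrative Helm-release task -- not the repo's actual playbook code
- name: Install cert-manager Helm chart
  kubernetes.core.helm:
    name: cert-manager
    chart_ref: jetstack/cert-manager   # assumes the "jetstack" chart repo has been added
    release_namespace: cert-manager
    create_namespace: true
    wait: true
    values:
      installCRDs: true                # newer chart versions use crds.enabled instead
```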
The Ansible Vault password is stored in macOS Keychain under item `Home-K8s` for account `ansible-vault`.
```bash
export ANSIBLE_CONFIG="./ansible.cfg"
VAULTFILE="group_vars/all/vault.yml"

ansible-vault create $VAULTFILE
ansible-vault edit   $VAULTFILE
ansible-vault view   $VAULTFILE
```
Some variables stored in Ansible Vault (there are many more):

| Infrastructure Secrets             | User Passwords          |
|------------------------------------|-------------------------|
| `ansible_become_pass`              | `rancher_admin_pass`    |
| `github_access_token`              | `harbor_admin_pass`     |
| `age_secret_key`                   | `minio_root_pass`       |
| `icloud_smtp.*`                    | `minio_admin_pass`      |
| `k3s_token`                        | `velero_admin_pass`     |
| `rke2_token`                       | `opensearch_admin_pass` |
| `stepca_provisioner_pass`          | `keycloak_admin_pass`   |
| `harbor_secret`                    | `thanos_admin_pass`     |
| `minio_client_pass`                | `grafana_admin_pass`    |
| `velero_repo_pass`                 | `argocd_admin_pass`     |
| `velero_passphrase`                | `openwebui_admin_pass`  |
| `dashboards_os_pass`               |                         |
| `fluent_os_pass`                   |                         |
| `valkey_pass`                      |                         |
| `postgresql_pass`                  |                         |
| `keycloak_db_pass`                 |                         |
| `keycloak_smtp_pass`               |                         |
| `monitoring_pass`                  |                         |
| `monitoring_oidc_client_secret.*`  |                         |
| `alertmanager_smtp_pass`           |                         |
| `slack_webhook_url`                |                         |
| `oauth2_proxy_cookie_secret`       |                         |
| `kiali_oidc_client_secret`         |                         |
| `argocd_signing_key`               |                         |
| `hass_access_token`                |                         |
| `qdrant_api_key.*`                 |                         |
| `openwebui_secret_key`             |                         |
| `pipelines_api_key`                |                         |
| `openai_api_key`                   |                         |
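
The decrypted `vault.yml` is mostly a flat YAML mapping of these variable names to values; a minimal sketch with placeholder values (the real file contains many more entries, and some, like `icloud_smtp.*`, are nested):

```yaml
# group_vars/all/vault.yml (shown decrypted; all values are placeholders)
ansible_become_pass: "<sudo password for user erhhung>"
k3s_token: "<shared secret for K3s node registration>"
rke2_token: "<shared secret for RKE2 node registration>"
rancher_admin_pass: "<Rancher UI admin password>"
grafana_admin_pass: "<Grafana UI admin password>"
```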
All managed hosts are running Ubuntu 24.04 with the SSH key from https://github.com/erhhung.keys already authorized.
Ansible will authenticate as user `erhhung` using the private key `~/.ssh/erhhung.pem`; however, all privileged operations via `sudo` will require the password stored in Ansible Vault.
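
In practice, those connection settings boil down to standard Ansible variables like the sketch below (illustrative only; the repo's actual inventory layout and file names may differ):

```yaml
# e.g. inventory/group_vars/all.yml -- hypothetical file, not the repo's actual layout
ansible_user: erhhung
ansible_ssh_private_key_file: ~/.ssh/erhhung.pem
ansible_become: true
# ansible_become_pass is defined in group_vars/all/vault.yml (encrypted by Ansible Vault)
```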
1. Install required packages

    1.1. Tools — `emacs`, `jq`, `yq`, `git`, and `helm`  
    1.2. Python — Pip packages in user virtualenv  
    1.3. Helm — Helm plugins: e.g. `helm-diff`

    ```bash
    ./play.sh packages
    ```

2. Configure system settings

    2.1. Host — host name, time zone, and locale  
    2.2. Kernel — `sysctl` params and `pam_limits`  
    2.3. Network — DNS servers and search domains  
    2.4. Login — customize login MOTD messages  
    2.5. Certs — add CA certificates to trust store

    ```bash
    ./play.sh basics
    ```

3. Set up admin user's home directory

    3.1. Dot files: `.bash_aliases`, etc.  
    3.2. Config files: `htop`, `fastfetch`

    ```bash
    ./play.sh files
    ```

4. Install Rancher Server on single-node K3s cluster

    ```bash
    ./play.sh rancher
    ```
5. Provision Kubernetes cluster with RKE on 4 nodes

    Install RKE2 with a single control plane node and 3 worker nodes, all permitting workloads,
    or RKE2 in HA mode with 3 control plane nodes and 1 worker node, all permitting workloads
    (in HA mode, the cluster will be accessible thru a virtual IP address courtesy of `kube-vip`).

    ```bash
    ./play.sh cluster
    ```
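
    In HA mode, each control plane node must serve the Kubernetes API with a certificate valid for the virtual IP, which RKE2 handles via its `tls-san` setting. A minimal sketch of the server config involved (the VIP name and address are placeholders; the RKE2 Ansible Role generates the actual file):

    ```yaml
    # /etc/rancher/rke2/config.yaml on a control plane node (illustrative sketch only)
    token: "<rke2_token from Ansible Vault>"
    tls-san:
      - <cluster-vip-hostname>    # DNS name resolving to the kube-vip virtual IP
      - <cluster-vip-address>     # the virtual IP itself
    # additional servers also join through the VIP, e.g.:
    # server: https://<cluster-vip-hostname>:9345
    ```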
6. Install `cert-manager` to automate certificate issuing

    6.1. Connect to Step CA `pki.fourteeners.local` as a `StepClusterIssuer`

    ```bash
    ./play.sh certmanager
    ```
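
    For reference, a `StepClusterIssuer` looks roughly like the sketch below. Field names follow the `step-issuer` project's CRD, so verify them against the installed chart version; the provisioner name, key ID, and Secret name are placeholders:

    ```yaml
    # Illustrative StepClusterIssuer sketch -- not the exact resource created by the playbook
    apiVersion: certmanager.step.sm/v1beta1
    kind: StepClusterIssuer
    metadata:
      name: step-ca
    spec:
      url: https://pki.fourteeners.local              # Step CA endpoint
      caBundle: <base64-encoded root CA certificate>
      provisioner:
        name: <JWK provisioner name>                  # placeholder
        kid: <provisioner key ID>                     # placeholder
        passwordRef:
          name: <secret holding stepca_provisioner_pass>   # placeholder Secret name
          key: password
    ```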
7. Install Wave to monitor `ConfigMaps` and `Secrets`

    ```bash
    ./play.sh wave
    ```

8. Install Longhorn dynamic PV provisioner  
   Install MinIO object storage in HA mode  
   Install Velero backup and restore tools

    8.1. Create a pool of LVM logical volumes  
    8.2. Install Longhorn storage components  
    8.3. Install NFS dynamic PV provisioner  
    8.4. Install MinIO tenant using NFS PVs  
    8.5. Install Velero using MinIO as target  
    8.6. Install Velero Dashboard

    ```bash
    ./play.sh storage minio velero
    ```
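
    Longhorn's chart registers a `longhorn` StorageClass that provisions replicated volumes on the nodes' LVM-backed disks; a representative sketch (parameter values are illustrative, not this cluster's actual settings):

    ```yaml
    # Illustrative Longhorn StorageClass sketch
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: longhorn
    provisioner: driver.longhorn.io
    allowVolumeExpansion: true
    reclaimPolicy: Delete
    parameters:
      numberOfReplicas: "2"        # example replica count
      staleReplicaTimeout: "30"    # minutes before a stale replica is discarded
    ```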
9. Install Harbor OCI & Helm registry

    ```bash
    ./play.sh harbor
    ```

10. Create resources from manifest files

    IMPORTANT: Resource manifests must specify the namespaces they wish to be installed
    into because the playbook simply applies each one without targeting a specific namespace.

    ```bash
    ./play.sh manifests
    ```
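
    For example, every manifest applied by this playbook should set `metadata.namespace` explicitly (the name and namespace below are hypothetical):

    ```yaml
    # Illustrative manifest -- the namespace must be declared in the resource itself
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: example-config       # hypothetical
      namespace: example-apps    # hypothetical; if omitted, the resource lands in whatever
                                 # namespace the playbook's kubeconfig context points at
    data:
      greeting: hello
    ```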
11. Install Node Feature Discovery to identify GPU nodes

    11.1. Install Intel Device Plugins and `GpuDevicePlugin`

    ```bash
    ./play.sh nodefeatures
    ```

12. Install OpenSearch cluster in HA mode

    12.1. Configure the OpenSearch security plugin (users and roles) for downstream applications  
    12.2. Install OpenSearch Dashboards UI

    ```bash
    ./play.sh opensearch
    ```
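
    The security plugin's users and roles are plain YAML documents; a minimal `internal_users.yml` sketch for one downstream application (the username is hypothetical and the hash is a placeholder):

    ```yaml
    # Illustrative internal_users.yml entry for the OpenSearch security plugin
    _meta:
      type: "internalusers"
      config_version: 2

    fluentbit:                                  # hypothetical internal user for the log shipper
      hash: "<bcrypt hash of fluent_os_pass>"   # placeholder
      reserved: false
      backend_roles: []
      description: "Fluent Bit log shipper"
    ```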
13. Install Fluent Bit to ingest logs into OpenSearch

    ```bash
    ./play.sh logging
    ```

14. Install PostgreSQL database in HA mode

    14.1. Run initialization SQL script to create roles and databases for downstream applications  
    14.2. Create users in both PostgreSQL and Pgpool

    ```bash
    ./play.sh postgresql
    ```

15. Install Keycloak IAM & OIDC provider

    15.1. Bootstrap PostgreSQL database with realm `homelab`, user `erhhung`, and OIDC clients

    ```bash
    ./play.sh keycloak
    ```

16. Install Valkey key-value store in HA mode

    16.1. Deploy 6 nodes in total: 3 primaries and 3 replicas

    ```bash
    ./play.sh valkey
    ```

17. Install Prometheus, Thanos, and Grafana in HA mode

    17.1. Expose Prometheus & Alertmanager UIs via `oauth2-proxy` integration with Keycloak  
    17.2. Connect Thanos sidecars to MinIO to store scraped metrics in the `metrics` bucket  
    17.3. Deploy and integrate other Thanos components with Prometheus and Alertmanager

    ```bash
    ./play.sh monitoring thanos
    ```
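
    The Thanos sidecar and the standalone Thanos components read their MinIO connection details from an object-storage config in Thanos's standard format, roughly like this sketch, where the endpoint and credentials are placeholders:

    ```yaml
    # Illustrative Thanos objstore.yml for the MinIO "metrics" bucket
    type: S3
    config:
      bucket: metrics
      endpoint: <minio-s3-endpoint>   # placeholder, e.g. the MinIO tenant's S3 service
      access_key: <minio access key>  # placeholder
      secret_key: <minio secret key>  # placeholder
      insecure: false                 # TLS everywhere, per this project's approach
    ```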
18. Install Istio service mesh in ambient mode

    ```bash
    ./play.sh istio
    ```

19. Install Argo CD GitOps delivery in HA mode

    19.1. Configure Argo CD components to use the Valkey cluster for their caching needs

    ```bash
    ./play.sh argocd
    ```

20. Install Metacontroller to create Operators

    ```bash
    ./play.sh metacontroller
    ```

21. Install Qdrant vector database in HA mode

    ```bash
    ./play.sh qdrant
    ```

22. Install Ollama LLM server with common models  
    Install Open WebUI AI platform with Pipelines

    22.1. Create `Accounts` knowledge base, and then `Accounts` custom model that embeds that KB  
    22.2. NOTE: Populate `Accounts` KB by running `./play.sh openwebui -t knowledge` separately

    ```bash
    ./play.sh ollama openwebui
    ```

23. Create virtual Kubernetes clusters in RKE

    ```bash
    ./play.sh vclusters
    ```
Alternatively, run all playbooks automatically in order:

```bash
# pass options like -v and --step
./play.sh [ansible-playbook-opts]

# run all playbooks starting from "storage"
# ("storage" is a playbook tag in main.yml)
./play.sh storage
```

Output from `play.sh` will be logged in `ansible.log`.
The default Bash shell for the VS Code terminal has been configured to load a custom `.bash_profile` containing aliases for common Ansible commands, as well as the `play` function with completions for playbook tags.
Due to the dependency chain of the Prometheus monitoring stack (Keycloak and Valkey), the `monitoring.yml` playbook must be run after most other playbooks. At the same time, those other services also want to create `ServiceMonitor` resources, which require the Prometheus Operator CRDs. Therefore, a second pass through all playbooks, starting with `certmanager.yml`, is required to enable metrics collection on those services.
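
As an illustration of that CRD dependency, a typical `ServiceMonitor` created by one of those charts looks like the sketch below; applying it fails unless the Prometheus Operator CRDs already exist (the name, namespace, label, and port are placeholders):

```yaml
# Illustrative ServiceMonitor sketch
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: valkey                           # hypothetical
  namespace: valkey                      # hypothetical
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: valkey     # placeholder label on the target Service
  endpoints:
    - port: metrics                      # placeholder named port on the Service
      interval: 30s
```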
1. Shut down all/specific VMs

    ```bash
    ansible-playbook shutdownvms.yml [-e targets={group|host|,...}]
    ```

2. Create/revert/delete VM snapshots

    2.1. Create new snapshots

    ```bash
    ansible-playbook snapshotvms.yml [-e targets={group|host|,...}] \
      -e '{"desc":"text description"}'
    ```

    2.2. Revert to snapshots

    ```bash
    ansible-playbook snapshotvms.yml -e do=revert \
      [-e targets={group|host|,...}] \
      -e '{"desc":"text to search"}' \
      [-e '{"date":"YYYY-mm-dd prefix"}']
    ```

    2.3. Delete old snapshots

    ```bash
    ansible-playbook snapshotvms.yml -e do=delete \
      [-e targets={group|host|,...}] \
      -e '{"desc":"text to search"}' \
      -e '{"date":"YYYY-mm-dd prefix"}'
    ```

3. Start all/specific VMs

    ```bash
    ansible-playbook startvms.yml [-e targets={group|host|,...}]
    ```
To expand the VM disk on a cluster node, the VM must be shut down
(attempting to resize the disk from Xen Orchestra will fail with error `VDI in use`).
Once the VM disk has been expanded, restart the VM and SSH into the node to resize the partition and LV.
```console
$ sudo su

# verify new size
$ lsblk /dev/xvda

# resize partition
$ parted /dev/xvda
) print
Warning: Not all of the space available to /dev/xvda appears to be used...
Fix/Ignore? Fix
) resizepart 3 100%
# confirm new size
) print
) quit

# sync with kernel
$ partprobe
# confirm new size
$ lsblk /dev/xvda3

# resize VG volume
$ pvresize /dev/xvda3
  Physical volume "/dev/xvda3" changed
  1 physical volume(s) resized...
# confirm new size
$ pvdisplay

# show LV volumes
$ lvdisplay
# set exact LV size (G=GiB)
$ lvextend -vrL 50G /dev/ubuntu-vg/ubuntu-lv
# or grow LV by percentage
$ lvextend -vrl +90%FREE /dev/ubuntu-vg/ubuntu-lv
  Extending logical volume ubuntu-vg/ubuntu-lv to up to...
  fsadm: Executing resize2fs /dev/mapper/ubuntu--vg-ubuntu--lv
  The filesystem on /dev/mapper/ubuntu--vg-ubuntu--lv is now...
```
After expanding all desired disks, run `./diskfree.sh` to confirm available disk space on all cluster nodes.
```
rancher
-------
Filesystem                         Size  Used Avail Use% Mounted on
/dev/xvda2                          32G   18G   13G  60% /

k8s1
----
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv   50G   21G   27G  44% /
/dev/mapper/ubuntu--vg-data--lv     30G  781M   30G   3% /data

k8s2
----
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv   50G   22G   26G  47% /
/dev/mapper/ubuntu--vg-data--lv     30G  781M   30G   3% /data

k8s3
----
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv   50G   23G   25G  48% /
/dev/mapper/ubuntu--vg-data--lv     30G  1.2G   29G   4% /data

k8s4
----
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv   50G   27G   21G  57% /
/dev/mapper/ubuntu--vg-data--lv     30G  1.2G   29G   4% /data
```
Ansible's ad-hoc commands are useful in these scenarios.
- Restart Kubernetes cluster services on all nodes

  ```bash
  ansible rancher          -m ansible.builtin.service -b -a "name=k3s state=restarted"
  ansible control_plane_ha -m ansible.builtin.service -b -a "name=rke2-server state=restarted"
  ansible workers_ha       -m ansible.builtin.service -b -a "name=rke2-agent state=restarted"
  ```

  NOTE: remove the `_ha` suffix from the target hosts if the RKE cluster was deployed in non-HA mode.
- All `kube-proxy` static pods stuck in continuous `CrashLoopBackOff`

  This turns out to be a Linux kernel bug in `linux-image-6.8.0-56-generic` and above
  (discovered on upgrade to `linux-image-6.8.0-57-generic`), causing this error in the container logs:

  ```
  ip6tables-restore v1.8.9 (nf_tables): unknown option "--xor-mark"
  ```

  The current workaround is to downgrade to an earlier kernel.

  ```bash
  # list installed kernel images
  ansible -v k8s_all -a 'bash -c "dpkg -l | grep linux-image"'

  # install working kernel image
  ansible -v k8s_all -b -a 'apt-get install -y linux-image-6.8.0-55-generic'

  # GRUB: use working kernel image
  ansible -v rancher -m ansible.builtin.shell -b -a '
    kernel="6.8.0-55-generic"
    dvuuid=$(blkid -s UUID -o value /dev/xvda2)
    menuid="gnulinux-advanced-$dvuuid>gnulinux-$kernel-advanced-$dvuuid"
    sed -Ei "s/^(GRUB_DEFAULT=).+$/\\1\"$menuid\"/" /etc/default/grub
    grep GRUB_DEFAULT /etc/default/grub
  '
  ansible -v cluster -m ansible.builtin.shell -b -a '
    kernel="6.8.0-55-generic"
    dvuuid=$(blkid -s UUID -o value /dev/mapper/ubuntu--vg-ubuntu--lv)
    menuid="gnulinux-advanced-$dvuuid>gnulinux-$kernel-advanced-$dvuuid"
    sed -Ei "s/^(GRUB_DEFAULT=).+$/\\1\"$menuid\"/" /etc/default/grub
    grep GRUB_DEFAULT /etc/default/grub
  '

  # update /boot/grub/grub.cfg
  ansible -v k8s_all -b -a 'update-grub'

  # reboot nodes, one at a time
  ansible -v k8s_all -m ansible.builtin.reboot -b -a "post_reboot_delay=120" -f 1

  # confirm working kernel image
  ansible -v k8s_all -a 'uname -r'

  # remove old backup kernels only
  # (keep latest non-working kernel
  # so upgrade won't install again)
  ansible -v k8s_all -b -a 'apt-get autoremove -y --purge'
  ```
- StatefulSet pod stuck on `ContainerCreating` due to `MountDevice failed`

  Pod lifecycle events show an error like:

  ```
  MountVolume.MountDevice failed for volume "pvc-4151d201-437b-4ceb-bbf6-c227ea49e285" : kubernetes.io/csi: attacher.MountDevice failed to create dir "/var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/0bb8a8bc36ca16f14a425e5eaf35ed51af6096bf0302129a05394ce51393cecd/globalmount": mkdir /var/lib/kubelet/.../globalmount: file exists
  ```

  The problem is described by this GitHub issue, and may be caused by restarting the node while a Longhorn volume backup is in progress.
  An effective workaround is to unmount that volume.

  ```console
  $ ssh k8s1

  $ mount | grep pvc-4151d201-437b-4ceb-bbf6-c227ea49e285
  /dev/longhorn/pvc-4151d201-437b-4ceb-bbf6-c227ea49e285 on /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/0bb8a8bc36ca16f14a425e5eaf35ed51af6096bf0302129a05394ce51393cecd/globalmount type xfs (rw,relatime,nouuid,attr2,inode64,logbufs=8,logbsize=32k,noquota)
  /dev/longhorn/pvc-4151d201-437b-4ceb-bbf6-c227ea49e285 on /var/lib/kubelet/pods/06fc67d7-833f-4ecd-810f-77787fd703e6/volumes/kubernetes.io~csi/pvc-4151d201-437b-4ceb-bbf6-c227ea49e285/mount type xfs (rw,relatime,nouuid,attr2,inode64,logbufs=8,logbsize=32k,noquota)

  $ sudo umount /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/0bb8a8bc36ca16f14a425e5eaf35ed51af6096bf0302129a05394ce51393cecd/globalmount
  ```

  Then restart the pod, and it should run successfully.