Erhhung's Home Kubernetes Cluster

This Ansible-based project provisions Erhhung's high-availability Kubernetes cluster at home, named homelab, and deploys services for monitoring various IoT appliances as well as for hosting other personal projects, including self-hosted LLMs and RAG pipelines. Those pipelines enable multi-source hybrid searches and agentic automations using a local knowledge base containing vast amounts of personal and sensor data.

The approach taken on all service deployments is to treat the clusters as production environments (to the extent possible with limited resources and scaling capacity across a few mini PCs). That means enforcing TLS everywhere, requiring authenticated user access, scraping metrics, and configuring dashboards and alerts.

Overview

The top-level Ansible playbook main.yml, run by play.sh, will provision 5 VM hosts (rancher and k8s1..k8s4) in the existing XCP-ng Home pool, all running Ubuntu Server 24.04 Minimal without customizations besides basic networking and an authorized SSH key for the user erhhung.

A single-node K3s Kubernetes cluster will be installed on host rancher, along with Rancher Server on that cluster, and a 4-node RKE2 Kubernetes cluster with a high-availability control plane using virtual IPs will be installed on hosts k8s1..k8s4. Longhorn and NFS storage provisioners will be installed in each cluster to manage a pool of LVM logical volumes on each node and to extend the overall storage capacity with the QNAP NAS.

All cluster services will be provisioned with TLS certificates from Erhhung's private CA server at pki.fourteeners.local or its faster mirror at cosmos.fourteeners.local.

Cluster Topology

Cluster Services

Service Endpoints

Service Endpoint	Description
https://rancher.fourteeners.local	Rancher Server console
https://harbor.fourteeners.local	Harbor OCI registry
https://velero.fourteeners.local	Velero Dashboard
https://minio.fourteeners.local	MinIO console
https://s3.fourteeners.local	MinIO S3 API
opensearch.fourteeners.local:9200	OpenSearch (HTTPS only)
https://kibana.fourteeners.local	OpenSearch Dashboards
postgres.fourteeners.local:5432	PostgreSQL via Pgpool (mTLS only)
https://sso.fourteeners.local	Keycloak IAM console
valkey.fourteeners.local:6379 valkey{1..6}.fourteeners.local:6379	Valkey cluster (mTLS only)
https://grafana.fourteeners.local	Grafana dashboards
https://metrics.fourteeners.local	Prometheus UI (Keycloak SSO)
https://alerts.fourteeners.local	Alertmanager UI (Keycloak SSO)
https://thanos.fourteeners.local	Thanos Query UI
https://rule.thanos.fourteeners.local https://store.thanos.fourteeners.local https://bucket.thanos.fourteeners.local https://compact.thanos.fourteeners.local	Thanos component status UIs
https://kiali.fourteeners.local	Kiali console (Keycloak SSO)
https://argocd.fourteeners.local	Argo CD console
https://qdrant.fourteeners.local	Qdrant dashboard
https://ollama.fourteeners.local	Ollama API server
https://openwebui.fourteeners.local	Open WebUI portal
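
To spot-check that one of the HTTPS endpoints above serves a certificate that chains to the private CA, a sketch along these lines may help. The function name check_tls and the CA_FILE location are assumptions made for illustration and are not part of this repo; point CA_FILE at wherever the root CA certificate is saved. Endpoints marked "mTLS only" also expect a client certificate, which this sketch does not supply.

```shell
#!/usr/bin/env bash
# Hypothetical location of the private root CA certificate (assumption)
CA_FILE="${CA_FILE:-$HOME/.step/certs/root_ca.crt}"

# check_tls HOST [PORT]
# Succeeds only if the TLS handshake completes and the served
# certificate chain verifies against CA_FILE.
check_tls() {
  local host="$1" port="${2:-443}"
  openssl s_client -connect "$host:$port" -servername "$host" \
    -CAfile "$CA_FILE" -verify_return_error </dev/null >/dev/null 2>&1
}

# Usage:
#   check_tls grafana.fourteeners.local         && echo "grafana OK"
#   check_tls opensearch.fourteeners.local 9200 && echo "opensearch OK"
```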

Installation Sources

To-Do Tasks

Migrate manually provisioned certificates and secrets to ones issued by cert-manager with auto-rotation
Identify and upload additional sources of personal documents into Open WebUI knowledge base collections
Automate creation of DNS records in pfSense via custom Ansible module that invokes pfSense REST APIs

Ansible Vault

The Ansible Vault password is stored in macOS Keychain under item "Home-K8s" for account "ansible-vault".

export ANSIBLE_CONFIG="./ansible.cfg"
VAULTFILE="group_vars/all/vault.yml"

ansible-vault create $VAULTFILE
ansible-vault edit   $VAULTFILE
ansible-vault view   $VAULTFILE
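
The contents of vaultpass.sh are not reproduced in this README. As a minimal sketch of how a password script could read the Keychain item described above, assuming that "Home-K8s" is the item's service name and "ansible-vault" its account (the function name vault_password is made up for this example):

```shell
#!/usr/bin/env bash
# Print the Ansible Vault password stored in the macOS login Keychain.
# Assumption: "Home-K8s" is the service name (-s) and "ansible-vault"
# the account (-a); -w prints only the password.
vault_password() {
  security find-generic-password -s "Home-K8s" -a "ansible-vault" -w
}

# Only attempt the lookup where the macOS "security" tool exists
if command -v security >/dev/null 2>&1; then
  vault_password
fi
```

Saved as an executable file, a script like this can be referenced by the vault_password_file setting in ansible.cfg so that ansible-vault and ansible-playbook do not prompt for the password.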

Some variables stored in Ansible Vault (there are many more)

Infrastructure Secrets	User Passwords
`ansible_become_pass`	`rancher_admin_pass`
`github_access_token`	`harbor_admin_pass`
`age_secret_key`	`minio_root_pass`
`icloud_smtp.*`	`minio_admin_pass`
`k3s_token`	`velero_admin_pass`
`rke2_token`	`opensearch_admin_pass`
`stepca_provisioner_pass`	`keycloak_admin_pass`
`harbor_secret`	`thanos_admin_pass`
`minio_client_pass`	`grafana_admin_pass`
`velero_repo_pass`	`argocd_admin_pass`
`velero_passphrase`	`openwebui_admin_pass`
`dashboards_os_pass`
`fluent_os_pass`
`valkey_pass`
`postgresql_pass`
`keycloak_db_pass`
`keycloak_smtp_pass`
`monitoring_pass`
`monitoring_oidc_client_secret.*`
`alertmanager_smtp_pass`
`slack_webhook_url`
`oauth2_proxy_cookie_secret`
`kiali_oidc_client_secret`
`argocd_signing_key`
`hass_access_token`
`qdrant_api_key.*`
`openwebui_secret_key`
`pipelines_api_key`
`openai_api_key`

Connections

All managed hosts run Ubuntu 24.04 with the SSH key from https://github.com/erhhung.keys already authorized.

Ansible will authenticate as user erhhung using private key "~/.ssh/erhhung.pem";
however, all privileged operations using sudo will require the password stored in Vault.

Playbooks

1. Install required packages

1.1. Tools — emacs, jq, yq, git, and helm
1.2. Python — Pip packages in user virtualenv
1.3. Helm — Helm plugins: e.g. helm-diff
```
./play.sh packages
```

2. Configure system settings

2.1. Host — host name, time zone, and locale
2.2. Kernel — sysctl params and pam_limits
2.3. Network — DNS servers and search domains
2.4. Login — customize login MOTD messages
2.5. Certs — add CA certificates to trust store
```
./play.sh basics
```

3. Set up admin user's home directory

3.1. Dot files: .bash_aliases, etc.
3.2. Config files: htop, fastfetch
```
./play.sh files
```

4. Install Rancher Server on single-node K3s cluster
```
./play.sh rancher
```

5. Provision Kubernetes cluster with RKE2 on 4 nodes

Install RKE2 with a single control plane node and 3 worker nodes, all permitting workloads,
or RKE2 in HA mode with 3 control plane nodes and 1 worker node, all permitting workloads.
In HA mode, the cluster will be accessible through a virtual IP address courtesy of kube-vip.
```
./play.sh cluster
```

6. Install cert-manager to automate certificate issuing

6.1. Connect to Step CA pki.fourteeners.local as a StepClusterIssuer
```
./play.sh certmanager
```

7. Install Wave to monitor ConfigMaps and Secrets
```
./play.sh wave
```

8. Install Longhorn dynamic PV provisioner
   Install MinIO object storage in HA mode
   Install Velero backup and restore tools

8.1. Create a pool of LVM logical volumes
8.2. Install Longhorn storage components
8.3. Install NFS dynamic PV provisioner
8.4. Install MinIO tenant using NFS PVs
8.5. Install Velero using MinIO as target
8.6. Install Velero Dashboard
```
./play.sh storage minio velero
```

9. Install Harbor OCI & Helm registry
```
./play.sh harbor
```

10. Create resources from manifest files

IMPORTANT: Resource manifests must specify the namespaces they should be installed
into, because the playbook simply applies each one without targeting a specific namespace.
```
./play.sh manifests
```

11. Install Node Feature Discovery to identify GPU nodes

11.1. Install Intel Device Plugins and GpuDevicePlugin
```
./play.sh nodefeatures
```

12. Install OpenSearch cluster in HA mode

12.1. Configure the OpenSearch security plugin (users and roles) for downstream applications
12.2. Install OpenSearch Dashboards UI
```
./play.sh opensearch
```

13. Install Fluent Bit to ingest logs into OpenSearch
```
./play.sh logging
```

14. Install PostgreSQL database in HA mode

14.1. Run initialization SQL script to create roles and databases for downstream applications
14.2. Create users in both PostgreSQL and Pgpool
```
./play.sh postgresql
```

15. Install Keycloak IAM & OIDC provider

15.1. Bootstrap PostgreSQL database with realm homelab, user erhhung, and OIDC clients
```
./play.sh keycloak
```

16. Install Valkey key-value store in HA mode

16.1. Deploy 6 nodes in total: 3 primaries and 3 replicas
```
./play.sh valkey
```

17. Install Prometheus, Thanos, and Grafana in HA mode

17.1. Expose Prometheus & Alertmanager UIs via oauth2-proxy integration with Keycloak
17.2. Connect Thanos sidecars to MinIO to store scraped metrics in the metrics bucket
17.3. Deploy and integrate other Thanos components with Prometheus and Alertmanager
```
./play.sh monitoring thanos
```

18. Install Istio service mesh in ambient mode
```
./play.sh istio
```

19. Install Argo CD GitOps delivery in HA mode

19.1. Configure Argo CD components to use the Valkey cluster for their caching needs
```
./play.sh argocd
```

20. Install Metacontroller to create Operators
```
./play.sh metacontroller
```

21. Install Qdrant vector database in HA mode
```
./play.sh qdrant
```

22. Install Ollama LLM server with common models
    Install Open WebUI AI platform with Pipelines

22.1. Create Accounts knowledge base, and then Accounts custom model that embeds that KB
22.2. NOTE: Populate Accounts KB by running ./play.sh openwebui -t knowledge separately
```
./play.sh ollama openwebui
```

23. Create virtual Kubernetes clusters in RKE2
```
./play.sh vclusters
```

Alternatively, run all playbooks automatically in order:

# pass options like -v and --step
./play.sh [ansible-playbook-opts]

# run all playbooks starting from "storage"
# ("storage" is a playbook tag in main.yml)
./play.sh storage-

Output from play.sh will be logged in "ansible.log".
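
How play.sh interprets the trailing dash is not shown in this README. The following is only a sketch of one way a "tag-" argument could be expanded, assuming the tags in main.yml appear in the same order as the playbooks listed above; the TAGS array and the expand_tags function are hypothetical names.

```shell
#!/usr/bin/env bash
# Playbook tags in the order listed in this README
# (assumption: main.yml uses the same order)
TAGS=(packages basics files rancher cluster certmanager wave
      storage minio velero harbor manifests nodefeatures
      opensearch logging postgresql keycloak valkey monitoring
      thanos istio argocd metacontroller qdrant ollama
      openwebui vclusters)

# expand_tags ARG
# Prints "storage-" as every tag from "storage" onward, one per line;
# any argument without a trailing dash is printed unchanged.
expand_tags() {
  local arg="$1" start found=0 tag
  if [[ "$arg" != *- ]]; then
    echo "$arg"
    return
  fi
  start="${arg%-}"
  for tag in "${TAGS[@]}"; do
    [[ "$tag" == "$start" ]] && found=1
    (( found )) && echo "$tag"
  done
  (( found ))  # non-zero exit status if the starting tag is unknown
}
```

For example, `expand_tags qdrant-` prints qdrant, ollama, openwebui, and vclusters, while `expand_tags harbor` prints just harbor.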

VS Code Shortcuts

The default Bash shell for VS Code terminal has been configured to load a custom .bash_profile containing aliases for common Ansible commands as well as the play function with completions for playbook tags.

Multipass Required

Because the Prometheus monitoring stack depends on other services (Keycloak and Valkey), the monitoring.yml playbook must be run after most other playbooks. At the same time, those dependencies also want to create ServiceMonitor resources, which require the Prometheus Operator CRDs that monitoring.yml installs. Therefore, a second pass through all playbooks, starting with certmanager.yml, is required to enable metrics collection on those services.
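
The two passes can be scripted with the trailing-dash syntax from the Playbooks section. This is a sketch only: the two_pass function and the PLAY override variable are hypothetical conveniences, not part of play.sh.

```shell
#!/usr/bin/env bash
# PLAY may be overridden (e.g. PLAY=echo) for a dry run
PLAY="${PLAY:-./play.sh}"

two_pass() {
  # Pass 1: run every playbook in order, which ends up installing
  # the monitoring stack and its Prometheus Operator CRDs
  "$PLAY" || return
  # Pass 2: rerun from certmanager onward so that services
  # installed before monitoring can create their ServiceMonitors
  "$PLAY" certmanager-
}
```

Running `PLAY=echo two_pass` prints the arguments of each pass instead of executing any playbooks.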

Optional Playbooks

1. Shut down all/specific VMs

ansible-playbook shutdownvms.yml [-e targets={group|host|,...}]

2. Create/revert/delete VM snapshots

2.1. Create new snapshots

ansible-playbook snapshotvms.yml [-e targets={group|host|,...}] \
                                  -e '{"desc":"text description"}'

2.2. Revert to snapshots

ansible-playbook snapshotvms.yml  -e do=revert \
                                 [-e targets={group|host|,...}]  \
                                  -e '{"desc":"text to search"}' \
                                 [-e '{"date":"YYYY-mm-dd prefix"}']

2.3. Delete old snapshots

ansible-playbook snapshotvms.yml  -e do=delete \
                                 [-e targets={group|host|,...}]  \
                                  -e '{"desc":"text to search"}' \
                                  -e '{"date":"YYYY-mm-dd prefix"}'

3. Start all/specific VMs

ansible-playbook startvms.yml [-e targets={group|host|,...}]
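
The snapshot commands above pass desc and date as JSON objects. When a description contains double quotes or backslashes, hand-writing that JSON gets error-prone, so a small helper can build it. The json_extra_var function below is a hypothetical sketch that escapes only backslashes and double quotes; it does not handle control characters such as newlines.

```shell
#!/usr/bin/env bash
# json_extra_var KEY VALUE
# Prints {"KEY":"VALUE"} with backslashes and double quotes in VALUE
# escaped, suitable for passing to ansible-playbook -e.
json_extra_var() {
  local key="$1" val="$2"
  val="${val//\\/\\\\}"  # escape backslashes first
  val="${val//\"/\\\"}"  # then escape double quotes
  printf '{"%s":"%s"}\n' "$key" "$val"
}

# Usage:
#   ansible-playbook snapshotvms.yml \
#     -e "$(json_extra_var desc 'before "k8s" upgrade')"
```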

VM Storage

To expand the VM disk on a cluster node, the VM must be shut down (attempting to resize the disk from Xen Orchestra will fail with error: VDI in use).

Once the VM disk has been expanded, restart the VM and SSH into the node to resize the partition and LV.

$ sudo su

# verify new size
$ lsblk /dev/xvda

# resize partition
$ parted /dev/xvda
(parted) print
Warning: Not all of the space available to /dev/xvda appears to be used...
Fix/Ignore? Fix

(parted) resizepart 3 100%
# confirm new size
(parted) print
(parted) quit

# sync with kernel
$ partprobe

# confirm new size
$ lsblk /dev/xvda3

# resize physical volume (PV) to fill the partition
$ pvresize /dev/xvda3
Physical volume "/dev/xvda3" changed
1 physical volume(s) resized...

# confirm new size
$ pvdisplay

# show LV volumes
$ lvdisplay

# set exact LV size (G=GiB)
$ lvextend -vrL 50G /dev/ubuntu-vg/ubuntu-lv
# or grow LV by percentage
$ lvextend -vrl +90%FREE /dev/ubuntu-vg/ubuntu-lv
Extending logical volume ubuntu-vg/ubuntu-lv to up to...
fsadm: Executing resize2fs /dev/mapper/ubuntu--vg-ubuntu--lv
The filesystem on /dev/mapper/ubuntu--vg-ubuntu--lv is now...

After expanding all desired disks, run ./diskfree.sh to confirm available disk space on all cluster nodes.

rancher
-------
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda2       32G   18G   13G  60% /

k8s1
----
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv   50G   21G   27G  44% /
/dev/mapper/ubuntu--vg-data--lv     30G  781M   30G   3% /data

k8s2
----
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv   50G   22G   26G  47% /
/dev/mapper/ubuntu--vg-data--lv     30G  781M   30G   3% /data

k8s3
----
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv   50G   23G   25G  48% /
/dev/mapper/ubuntu--vg-data--lv     30G  1.2G   29G   4% /data

k8s4
----
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv   50G   27G   21G  57% /
/dev/mapper/ubuntu--vg-data--lv     30G  1.2G   29G   4% /data
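
To pick out only the filesystems that are getting full, the output can be piped through a small filter. This sketch assumes the output layout shown above (a host name line, a dashed underline, then df-style rows); over_threshold is a hypothetical helper and not part of diskfree.sh.

```shell
#!/usr/bin/env bash
# over_threshold [LIMIT]
# Reads diskfree-style output on stdin and prints "host mountpoint use%"
# for every filesystem at or above LIMIT percent used (default 80).
over_threshold() {
  awk -v limit="${1:-80}" '
    # a single bare word that is not a dashed underline is a host heading
    NF == 1 && $1 !~ /^-+$/ { host = $1; next }
    # data rows carry the usage percentage in the fifth column
    $5 ~ /^[0-9]+%$/ {
      use = $5
      sub(/%/, "", use)
      if (use + 0 >= limit + 0) print host, $6, $5
    }
  '
}

# Usage:
#   ./diskfree.sh | over_threshold 55
```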

Troubleshooting

Ansible's ad-hoc commands are useful in these scenarios.

Restart Kubernetes cluster services on all nodes

ansible rancher          -m ansible.builtin.service -b -a "name=k3s         state=restarted"
ansible control_plane_ha -m ansible.builtin.service -b -a "name=rke2-server state=restarted"
ansible workers_ha       -m ansible.builtin.service -b -a "name=rke2-agent  state=restarted"

NOTE: remove the _ha suffix from the target host groups if the RKE2 cluster was deployed in non-HA mode.

All kube-proxy static pods in continuous CrashLoopBackOff

This turns out to be a Linux kernel bug in linux-image-6.8.0-56-generic and above (discovered on upgrade to linux-image-6.8.0-57-generic), causing this error in the container logs:

ip6tables-restore v1.8.9 (nf_tables): unknown option "--xor-mark"

Current workaround is to downgrade to an earlier kernel.

# list installed kernel images
ansible -v k8s_all -a 'bash -c "dpkg -l | grep linux-image"'

# install working kernel image
ansible -v k8s_all -b -a 'apt-get install -y linux-image-6.8.0-55-generic'

# make GRUB boot the working kernel image by default
ansible -v rancher -m ansible.builtin.shell -b -a '
    kernel="6.8.0-55-generic"
    dvuuid=$(blkid -s UUID -o value /dev/xvda2)
    menuid="gnulinux-advanced-$dvuuid>gnulinux-$kernel-advanced-$dvuuid"
    sed -Ei "s/^(GRUB_DEFAULT=).+$/\\1\"$menuid\"/" /etc/default/grub
    grep GRUB_DEFAULT /etc/default/grub
'
ansible -v cluster -m ansible.builtin.shell -b -a '
    kernel="6.8.0-55-generic"
    dvuuid=$(blkid -s UUID -o value /dev/mapper/ubuntu--vg-ubuntu--lv)
    menuid="gnulinux-advanced-$dvuuid>gnulinux-$kernel-advanced-$dvuuid"
    sed -Ei "s/^(GRUB_DEFAULT=).+$/\\1\"$menuid\"/" /etc/default/grub
    grep GRUB_DEFAULT /etc/default/grub
'
# update /boot/grub/grub.cfg
ansible -v k8s_all -b -a 'update-grub'

# reboot nodes, one at a time
ansible -v k8s_all -m ansible.builtin.reboot -b -a "post_reboot_delay=120" -f 1

# confirm working kernel image
ansible -v k8s_all -a 'uname -r'

# remove old backup kernels only
# (keep latest non-working kernel
# so upgrade won't install again)
ansible -v k8s_all -b -a 'apt-get autoremove -y --purge'
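
To flag nodes that are still running an affected kernel, the release strings reported by `uname -r` can be compared against the first known-bad version. This is a sketch under the assumption stated above that 6.8.0-56 and everything after it is affected, so it will over-report once a fixed kernel is released. The kernel_affected function is a hypothetical helper, and it relies on a sort that supports -V (version sort), such as GNU sort.

```shell
#!/usr/bin/env bash
# kernel_affected [RELEASE]
# Succeeds if RELEASE (default: this machine's `uname -r`) sorts at or
# after 6.8.0-56-generic in version order.
kernel_affected() {
  local rel="${1:-$(uname -r)}" first_bad="6.8.0-56-generic"
  # If first_bad sorts first (or ties), rel is at or beyond it
  [[ "$(printf '%s\n' "$first_bad" "$rel" | sort -V | head -n1)" == "$first_bad" ]]
}

# Usage:
#   kernel_affected 6.8.0-57-generic && echo "affected"
#   kernel_affected 6.8.0-55-generic || echo "not affected"
```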

StatefulSet pod stuck in ContainerCreating due to MountDevice failed

Pod lifecycle events show an error like:

MountVolume.MountDevice failed for volume "pvc-4151d201-437b-4ceb-bbf6-c227ea49e285" : kubernetes.io/csi: attacher.MountDevice failed to create dir "/var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/0bb8a8bc36ca16f14a425e5eaf35ed51af6096bf0302129a05394ce51393cecd/globalmount": mkdir /var/lib/kubelet/.../globalmount: file exists

The problem is described in this GitHub issue and may be caused by restarting the node while a Longhorn volume backup is in progress.

An effective workaround is to unmount that volume.

$ ssh k8s1

$ mount | grep pvc-4151d201-437b-4ceb-bbf6-c227ea49e285
/dev/longhorn/pvc-4151d201-437b-4ceb-bbf6-c227ea49e285 on /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/0bb8a8bc36ca16f14a425e5eaf35ed51af6096bf0302129a05394ce51393cecd/globalmount type xfs (rw,relatime,nouuid,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/longhorn/pvc-4151d201-437b-4ceb-bbf6-c227ea49e285 on /var/lib/kubelet/pods/06fc67d7-833f-4ecd-810f-77787fd703e6/volumes/kubernetes.io~csi/pvc-4151d201-437b-4ceb-bbf6-c227ea49e285/mount type xfs (rw,relatime,nouuid,attr2,inode64,logbufs=8,logbsize=32k,noquota)

$ sudo umount /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/0bb8a8bc36ca16f14a425e5eaf35ed51af6096bf0302129a05394ce51393cecd/globalmount

Then restart the pod, and it should run successfully.
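
Because the globalmount path is long and differs for every volume, it can be extracted from the `mount` output instead of being copied by hand. The globalmount_path function below is a hypothetical helper sketched for this purpose; it assumes the `mount` output format shown above, where the third field is the mount point.

```shell
#!/usr/bin/env bash
# globalmount_path PVC
# Reads `mount` output on stdin and prints the CSI globalmount directory
# on which the given Longhorn volume is mounted.
globalmount_path() {
  awk -v dev="/dev/longhorn/$1" \
    '$1 == dev && $3 ~ /\/globalmount$/ { print $3 }'
}

# Usage (on the affected node):
#   pvc=pvc-4151d201-437b-4ceb-bbf6-c227ea49e285
#   sudo umount "$(mount | globalmount_path "$pvc")"
```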
