Kubernetes docs (#188)

eliaoggian · msimberg · bcumming · web-flow · commit f8a0227e5e53 · 2025-07-29T09:01:09.000Z
Added kubernetes cluster documentation. This is the migration of the doc
that was on KB. Updated some outdated details.

---------

Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>
Co-authored-by: Ben Cumming <bcumming@cscs.ch>
diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
@@ -1,6 +1,7 @@
 * @bcumming @msimberg @RMeli
 docs/access/jupyterlab.md @rsarm
 docs/services/firecrest @jpdorsch @ekouts
+docs/services/kubernetes @eliaoggian
 docs/software/communication @Madeeks @msimberg
 docs/software/devtools/linaro @jgphpc
 docs/software/devtools/vihps @jgphpc
diff --git a/.github/actions/spelling/allow.txt b/.github/actions/spelling/allow.txt
@@ -230,7 +230,19 @@ pme
 pmi
 pmix
 podman
+prgenv
 preinstalled
+rke
+vms
+alpernetes
+kubeconfig
+ceph
+rwx
+rwo
+subdomain
+tls
+kured
+KUbernetes
 prerelease
 prereleases
 prgenv
diff --git a/docs/services/index.md b/docs/services/index.md
@@ -12,5 +12,11 @@
     FirecREST is a RESTful API for programmatically accessing High-Performance Computing resources.
 
     [:octicons-arrow-right-24: FirecREST][ref-firecrest]
+
+-   :fontawesome-solid-dharmachakra: __Kubernetes__
+
+    Kubernetes platform for automating deployment, scaling, and management of containerized applications.
+
+    [:octicons-arrow-right-24: Kubernetes][ref-kubernetes]
 </div>
 
diff --git a/docs/services/kubernetes/clusters.md b/docs/services/kubernetes/clusters.md
@@ -0,0 +1,221 @@
+[](){#ref-kubernetes-clusters}
+# CSCS Kubernetes clusters
+
+This document provides an overview of the Kubernetes clusters maintained by CSCS and offers step-by-step instructions for accessing and interacting with them.
+
+## Architecture
+
+All Kubernetes clusters at CSCS are:
+
+- Managed using **[Rancher](https://www.rancher.com)**
+- Running **[RKE2 (Rancher Kubernetes Engine 2)](https://github.com/rancher/rke2)**
+
+CSCS offers two types of Kubernetes clusters for partners:
+
+- **Harvester-only clusters**: These clusters run exclusively on virtual machines provisioned by Harvester (SUSE Virtualization), providing a flexible and isolated environment suitable for most workloads.
+- **Alpernetes clusters**: These clusters combine Harvester VMs with compute nodes from the Alps supercomputer. This hybrid setup, called *Alpernetes*, enables workloads to leverage both virtualized infrastructure and high-performance computing resources within the same Kubernetes environment.
+
+## Cluster Environments
+
+Clusters are grouped into two main environments:
+
+- **TDS** – Test and Development Systems  
+- **PROD** – Production
+
+See [Kubernetes upgrades][ref-kubernetes-clusters-upgrades] for the detailed upgrade policy.
+
+## Kubernetes API Access
+
+You can access the Kubernetes API in two main ways:
+
+### Direct Internet Access
+
+- A virtual IP is exposed for the API server.
+- Access is restricted to the partner's source IP addresses.
+
+### Access via CSCS Jump Host
+
+- Connect through a jump host (e.g., `ela.cscs.ch`).
+- API calls are securely proxied through Rancher.
+
+To check which method you are using, examine the `current-context` in your `kubeconfig` file.
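+
+As an illustrative sketch (not an actual CSCS kubeconfig), the relevant fields of an abridged kubeconfig look roughly like this. All names and the `server` value are placeholders; the `server` field of the cluster referenced by `current-context` shows which endpoint is in use:
+
+```yaml
+apiVersion: v1
+kind: Config
+current-context: <context-name>      # the context kubectl uses by default
+contexts:
+  - name: <context-name>
+    context:
+      cluster: <cluster-name>        # points at an entry under "clusters"
+      user: <user-name>
+clusters:
+  - name: <cluster-name>
+    cluster:
+      server: https://<api-endpoint> # direct virtual IP or Rancher proxy URL
+```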
+
+## Cluster Access
+
+To interact with the cluster, you need the `kubectl` CLI:  
+🔗 [Install kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl)  
+??? Note "`kubectl` is pre-installed on the CSCS jump host."
+
+
+### Retrieve your kubeconfig file
+
+#### Internal CSCS Users
+Access [Rancher](https://rancher.cscs.ch) and download the kubeconfig for your cluster. 
+   
+#### External Users
+A specific Rancher user and password should have been provided to the partner.
+
+Use the `kcscs` tool installed on `ela.cscs.ch` to obtain the kubeconfig by following the steps below.
+
+Download your SSH keys from [SSH Service](https://sshservice.cscs.ch) (and add them to the SSH agent).
+
+SSH to the jump host using the downloaded SSH keys:
+```bash
+ssh ela.cscs.ch
+```
+
+Log in with `kcscs` using the provided Rancher credentials:
+```bash
+kcscs login
+```
+
+List the accessible clusters:
+```bash
+kcscs list
+```
+
+Retrieve the kubeconfig file for a specific cluster:
+```bash
+kcscs get
+```
+
+
+### Store the kubeconfig file
+
+```bash
+mv mykubeconfig.yaml ~/.kube/config
+```
+or
+```bash
+export KUBECONFIG=/home/user/kubeconfig.yaml
+```
+
+### Test connectivity
+```bash
+kubectl get nodes
+```
+
+!!! warning
+    The kubeconfig file contains credentials. Keep it secure.
+
+## Pre-installed Applications
+
+All CSCS-provided clusters include a set of pre-installed tools and components, described below:
+
+### `ceph-csi`
+
+Provides dynamic persistent volume provisioning via the Ceph Container Storage Interface (Ceph CSI).
+
+#### Storage Classes
+
+- `cephfs` – ReadWriteMany (RWX), backed by HDD (large data volumes)
+- `rbd-hdd` – ReadWriteOnce (RWO), backed by HDD
+- `rbd-nvme` – RWO, backed by NVMe (high-performance workloads like databases)
+- `*-retain` – Same classes, but retain the volume after PVC deletion
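+
+A minimal sketch of a claim against one of these classes, assuming a workload that needs a single-writer volume on NVMe. The claim name and requested size are illustrative only:
+
+```yaml
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: db-data              # hypothetical name
+spec:
+  accessModes:
+    - ReadWriteOnce          # the rbd-* classes provide RWO volumes
+  storageClassName: rbd-nvme # use cephfs with ReadWriteMany for shared volumes
+  resources:
+    requests:
+      storage: 10Gi          # illustrative size
+```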
+
+### `external-dns`
+
+Automatically manages DNS entries for:
+
+- Ingress resources
+- Services of type `LoadBalancer` (when annotated)
+
+#### Example
+```bash
+kubectl annotate service nginx "external-dns.alpha.kubernetes.io/hostname=nginx.mycluster.tds.cscs.ch."
+```
+
+!!! Note "Use a valid name under the configured subdomain"
+    
+🔗 [external-dns documentation](https://github.com/kubernetes-sigs/external-dns)
+
+### `cert-manager`
+
+Handles automatic issuance of TLS certificates from Let's Encrypt.
+
+#### Example
+```yaml
+apiVersion: cert-manager.io/v1
+kind: Certificate
+metadata:
+  name: echo
+spec:
+  secretName: echo
+  commonName: echo.mycluster.tds.cscs.ch
+  dnsNames:
+    - echo.mycluster.tds.cscs.ch
+  issuerRef:
+    kind: ClusterIssuer
+    name: letsencrypt
+```
+
+You can also issue certificates automatically via Ingress annotations (see `ingress-nginx` section).
+
+🔗 [cert-manager documentation](https://cert-manager.io)
+
+### `metallb`
+
+Enables Services of type `LoadBalancer` by assigning them public IPs.
+
+!!! Warning "The public IP pool is limited. Prefer using `Ingress` unless you specifically need a `LoadBalancer` Service for TCP traffic."
+
+🔗 [MetalLB documentation](https://metallb.universe.tf)
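+
+For the case where a `LoadBalancer` Service is really needed, a manifest could look like the sketch below. The Service name, labels, port, and hostname are hypothetical; the annotation is the same `external-dns` one used in the `external-dns` example:
+
+```yaml
+apiVersion: v1
+kind: Service
+metadata:
+  name: tcp-service          # hypothetical name
+  annotations:
+    # Optional: ask external-dns to publish a DNS record for the assigned IP
+    external-dns.alpha.kubernetes.io/hostname: tcp-service.mycluster.tds.cscs.ch.
+spec:
+  type: LoadBalancer         # MetalLB assigns an IP from the public pool
+  selector:
+    app: tcp-app             # hypothetical pod label
+  ports:
+    - name: tcp
+      protocol: TCP
+      port: 5432             # illustrative port
+      targetPort: 5432
+```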
+
+### `ingress-nginx`
+
+Default Ingress controller with class `nginx`.  
+Supports automatic TLS via cert-manager annotations.
+
+Example:
+
+```yaml
+apiVersion: networking.k8s.io/v1
+kind: Ingress
+metadata:
+  name: myingress
+  namespace: myingress
+  annotations:
+    cert-manager.io/cluster-issuer: letsencrypt
+spec:
+  rules:
+    - host: example.tds.cscs.ch
+      http:
+        paths:
+          - pathType: Prefix
+            path: /
+            backend:
+              service:
+                name: myservice
+                port:
+                  number: 80
+  tls:
+    - hosts:
+        - example.tds.cscs.ch
+      secretName: myingress-cert
+```
+
+🔗 [ingress-nginx documentation](https://kubernetes.github.io/ingress-nginx/)  
+🔗 [cert-manager Ingress Usage](https://cert-manager.io/docs/usage/ingress/)
+
+### `external-secrets`
+
+Integrates with secret management tools like **HashiCorp Vault**.
+
+Enables the use of `ExternalSecret` resources, which fetch secrets from the backend defined in a `SecretStore` or `ClusterSecretStore` resource and store them as `Secret` objects inside the cluster.
+
+It helps to avoid storing secrets in the deployment manifests, especially in GitOps environments.
+
+🔗 [external-secrets documentation](https://external-secrets.io/)
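+
+A minimal sketch of an `ExternalSecret`, assuming a `ClusterSecretStore` has already been configured for the cluster. The store name, remote key, and property are hypothetical, and the served API version (`v1beta1` here) depends on the installed operator release:
+
+```yaml
+apiVersion: external-secrets.io/v1beta1  # assumption: check the version served by your cluster
+kind: ExternalSecret
+metadata:
+  name: app-credentials       # hypothetical name
+spec:
+  refreshInterval: 1h         # how often the secret is re-synced
+  secretStoreRef:
+    kind: ClusterSecretStore
+    name: vault-backend       # hypothetical store name
+  target:
+    name: app-credentials     # name of the Kubernetes Secret to create
+  data:
+    - secretKey: password     # key inside the resulting Secret
+      remoteRef:
+        key: myapp/config     # hypothetical path in the secret backend
+        property: password
+```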
+
+### `kured`
+
+Responsible for automatic node reboots (e.g., after kernel updates).
+
+🔗 [kured documentation](https://kured.dev/)
+
+### Observability
+
+Includes:
+
+- **Beats agents** – Export logs and metrics to CSCS's central log system
+- **Prometheus** – Scrapes metrics and exports them to CSCS's central monitoring cluster
diff --git a/docs/services/kubernetes/index.md b/docs/services/kubernetes/index.md
@@ -0,0 +1,35 @@
+[](){#ref-kubernetes}
+# Kubernetes
+
+Kubernetes is only available for specific partners. 
+
+!!! Note
+    Kubernetes is not available for normal users on Alps.
+
+This documentation is designed to help partners who have been granted access to a Kubernetes cluster. 
+
+It explains how clusters are provisioned and maintained, and describes the policies in place for upgrades and updates.
+
+
+
+<div class="grid cards" markdown>
+-   :fontawesome-solid-layer-group: __Cluster Architecture__
+
+    Overview of the CSCS Kubernetes clusters: their main components and how to interact with them.
+
+    [:octicons-arrow-right-24: Clusters][ref-kubernetes-clusters]
+
+-   :fontawesome-solid-arrow-up-from-bracket: __Kubernetes Upgrades__
+
+    Kubernetes Cluster upgrade policy (Kubernetes version upgrades)
+
+    [:octicons-arrow-right-24: Kubernetes Upgrades][ref-kubernetes-clusters-upgrades]
+
+-   :fontawesome-solid-shield-halved: __Node Updates__
+
+    Cluster Nodes OS update policy (Regular Node Security Updates)
+
+    [:octicons-arrow-right-24: Node OS Updates][ref-kubernetes-node-updates]
+
+</div>
+
diff --git a/docs/services/kubernetes/kubernetes-upgrades.md b/docs/services/kubernetes/kubernetes-upgrades.md
@@ -0,0 +1,40 @@
+[](){#ref-kubernetes-clusters-upgrades}
+# Kubernetes Cluster Upgrade Policy
+
+To maintain a secure, stable, and supported platform, we regularly upgrade our Kubernetes clusters. We use **[RKE2](https://docs.rke2.io/)** as our Kubernetes distribution.
+
+## Upgrade Flow
+
+**Phased Rollout**
+
+  - Upgrades are first applied to **TDS clusters** (Test and Development Systems).
+  - After a **minimum of 2 weeks**, if no critical issues are observed, the same upgrade will be applied to **PROD clusters**.
+
+**No Fixed Schedule**
+
+  - Upgrades are not done on a strict calendar basis.
+  - Timing may depend on compatibility with **other infrastructure components** (e.g., storage, CNI plugins, monitoring tools).
+  - However, all clusters will be upgraded **before the current Kubernetes version reaches End of Life (EOL)**.
+
+## Upgrade Impact
+
+The **impact of a Kubernetes upgrade can vary**, depending on the nature of the changes involved:
+
+**Minimal Impact**
+
+  - For example, upgrades that affect only the `kubelet` may be **transparent to workloads**.
+  - Rolling restarts may occur, but no downtime is expected for well-configured applications.
+
+**Potentially Disruptive**
+
+  - Upgrades involving components such as the **CNI (Container Network Interface)** may cause **temporary network interruptions**.
+  - Other control plane or critical component updates might cause short-lived disruption to scheduling or connectivity.
+
+??? Note "Applications that follow cloud-native best practices (e.g., readiness probes, multiple replicas, graceful shutdown handling) are **less likely to be impacted** by upgrades."
+
+## What You Can Expect
+
+- Upgrades are performed using safe, tested procedures with minimal risk to production workloads.
+- TDS clusters serve as a **canary environment**, allowing us to identify issues early.
+- All clusters are kept **aligned with supported Kubernetes versions**.
+
diff --git a/docs/services/kubernetes/node-updates.md b/docs/services/kubernetes/node-updates.md
@@ -0,0 +1,46 @@
+[](){#ref-kubernetes-node-updates}
+# Kubernetes Nodes OS Update Policy
+
+To ensure the **security** and **stability** of our infrastructure, CSCS will perform **monthly OS updates** on all nodes of our Kubernetes clusters.
+
+## Maintenance Schedule
+
+- **Frequency**: The **first week of every month**  
+- **Reboot Window**: **Monday to Friday**, between **09:00 and 15:00**  
+- **Time Zone**: Europe/Zurich
+
+These updates include important security patches and system updates for the operating systems of cluster nodes.
+
+??? Note "Nodes will be rebooted only if required by the updates."
+
+## Urgent Security Patches
+
+In the event of a **critical zero-day vulnerability**, we will apply patches and perform reboots (if required) **as soon as possible**, outside of the regular update schedule if needed.  
+
+- Affected nodes will be updated **immediately** to protect the platform.
+- Users will be notified ahead of time **when possible**.
+- Standard safety and rolling reboot practices will still be followed.
+
+## Reboot Management with Kured
+
+We use [**Kured** (KUbernetes REboot Daemon)](https://github.com/kubereboot/kured) to safely automate the reboot process. Kured ensures that:
+
+- Reboots are triggered **only when necessary** (e.g., after kernel updates).
+- Nodes are rebooted **one at a time** to avoid service disruption.
+- Reboots occur **only during the defined window**.
+- Nodes are **cordoned**, **drained**, and **gracefully reintegrated** after reboot.
+
+## Application Requirements
+
+To avoid service disruption during node maintenance, applications **must be designed for high availability**. Specifically:
+
+- Use **multiple replicas** spread across nodes.
+- Follow **cloud-native best practices**, including:
+  - Proper **readiness** and **liveness probes**
+  - **Graceful shutdown** support
+  - **Stateless design** or resilient handling of state
+  - Appropriate **resource requests and limits**
+
+!!! Warning
+    Applications that do not meet these requirements **may experience temporary disruption** during node reboots.
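+
+The requirements above can be sketched as a Deployment with a matching PodDisruptionBudget. This is an illustration under stated assumptions, not a CSCS-mandated template: the names, image, probe paths, and resource values are hypothetical and should be adapted to the application.
+
+```yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: web                            # hypothetical name
+spec:
+  replicas: 3                          # multiple replicas
+  selector:
+    matchLabels:
+      app: web
+  template:
+    metadata:
+      labels:
+        app: web
+    spec:
+      terminationGracePeriodSeconds: 30  # time allowed for graceful shutdown
+      topologySpreadConstraints:         # spread replicas across nodes
+        - maxSkew: 1
+          topologyKey: kubernetes.io/hostname
+          whenUnsatisfiable: ScheduleAnyway
+          labelSelector:
+            matchLabels:
+              app: web
+      containers:
+        - name: web
+          image: nginx:1.27            # illustrative image
+          ports:
+            - containerPort: 80
+          readinessProbe:              # traffic is routed only to ready pods
+            httpGet:
+              path: /
+              port: 80
+            periodSeconds: 5
+          livenessProbe:               # unhealthy containers are restarted
+            httpGet:
+              path: /
+              port: 80
+            initialDelaySeconds: 10
+          resources:
+            requests:
+              cpu: 100m
+              memory: 128Mi
+            limits:
+              cpu: 500m
+              memory: 256Mi
+---
+apiVersion: policy/v1
+kind: PodDisruptionBudget
+metadata:
+  name: web
+spec:
+  minAvailable: 2                      # node drains keep at least two pods running
+  selector:
+    matchLabels:
+      app: web
+```
+
+Because node drains use the eviction API, the PodDisruptionBudget limits how many replicas can be evicted at once while a node is being rebooted.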
+
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -103,6 +103,11 @@ nav:
     - services/index.md
     - 'FirecREST': services/firecrest.md
     - 'CI/CD': services/cicd.md
+    - 'Kubernetes':
+      - services/kubernetes/index.md
+      - 'Clusters': services/kubernetes/clusters.md
+      - 'Kubernetes Upgrades': services/kubernetes/kubernetes-upgrades.md
+      - 'Node OS Updates': services/kubernetes/node-updates.md
   - 'Running Jobs':
     - running/index.md
     - 'Slurm': running/slurm.md