213 changes: 213 additions & 0 deletions docs/kubernetes/clusters.md
@@ -0,0 +1,213 @@

# CSCS Kubernetes Clusters

This document provides an overview of the Kubernetes clusters maintained by CSCS and offers step-by-step instructions for accessing and interacting with them.

---

## Architecture

All Kubernetes clusters at CSCS are:

- Managed using **[Rancher](https://www.rancher.com)**
- Running **[RKE2 (Rancher Kubernetes Engine 2)](https://github.com/rancher/rke2)**

---

## Cluster Environments

Clusters are grouped into two main environments:

- **TDS** – Test and Development Systems
- **PROD** – Production

TDS clusters receive updates first. If no issues arise, the same updates are then applied to PROD clusters.

---

## Kubernetes API Access

You can access the Kubernetes API in two main ways:

### Direct Internet Access

- A Virtual IP is exposed for the API server.
- Access can be restricted by source IP addresses.

### Access via CSCS Jump Host

- Connect through a bastion host (e.g., `ela.cscs.ch`).
- API calls are securely proxied through Rancher.

To check which method you are using, examine the `current-context` in your `kubeconfig` file.
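
For example (read-only, makes no changes):

```bash
kubectl config current-context   # name of the active context
kubectl config view --minify     # full config for the active context, credentials redacted
```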

---

## Cluster Access

To interact with the cluster, you need the `kubectl` CLI:
🔗 [Install kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl)

> `kubectl` is pre-installed on the CSCS jump host.

### Step-by-Step Access Guide

#### Retrieve your kubeconfig file
- If you have a CSCS account and can access [Rancher](https://rancher.cscs.ch), download the kubeconfig for your cluster.

- If you have a CSCS account but can't access [Rancher](https://rancher.cscs.ch), request a local Rancher user and use the **kcscs** tool installed on **ela.cscs.ch** to obtain the kubeconfig (see the example session after this list):
    - Download your SSH keys from [SSH Service](https://sshservice.cscs.ch)
    - SSH to `ela.cscs.ch` using the downloaded SSH keys
    - Run `kcscs login` and enter your Rancher local user credentials (supplied by CSCS)
    - Run `kcscs list` to list the clusters you have access to
    - Run `kcscs get` to download the kubeconfig file for a specific cluster

- If you don't have a CSCS account, open a Service Desk ticket to request support.
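
For the local Rancher user path above, a typical session looks roughly like this (the key file name and username are illustrative; see the tool's built-in help for exact `kcscs` arguments):

```bash
# From your workstation, using the keys downloaded from https://sshservice.cscs.ch
ssh -i ~/.ssh/cscs-key username@ela.cscs.ch

# On ela.cscs.ch
kcscs login   # enter the Rancher local user credentials supplied by CSCS
kcscs list    # show the clusters you have access to
kcscs get     # retrieve the kubeconfig for one of the listed clusters
```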

#### Store the kubeconfig file
```bash
mv mykubeconfig.yaml ~/.kube/config
# or
export KUBECONFIG=/home/user/kubeconfig.yaml
```

#### Test connectivity
```bash
kubectl get nodes
```

> ⚠️ The kubeconfig file contains credentials. Keep it secure.
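
For example, to restrict access to the default location (path assumes the first option above):

```bash
chmod 600 ~/.kube/config
```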
---

## Pre-installed Applications

All CSCS-provided clusters include a set of pre-installed tools and components, described below:

---

### 📦 `ceph-csi`

Provides **dynamic persistent volume provisioning** via the Ceph Container Storage Interface.

#### Storage Classes

- `cephfs` – ReadWriteMany (RWX), backed by HDD (large data volumes)
- `rbd-hdd` – ReadWriteOnce (RWO), backed by HDD
- `rbd-nvme` – RWO, backed by NVMe (high-performance workloads like databases)
- `*-retain` – Same classes, but retain the volume after PVC deletion
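
A minimal PersistentVolumeClaim using one of these classes (name and size are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  accessModes:
    - ReadWriteOnce        # rbd-* classes are RWO; use cephfs for RWX
  storageClassName: rbd-nvme
  resources:
    requests:
      storage: 10Gi
```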

---

### 🌐 `external-dns`

Automatically manages DNS entries for:

- Ingress resources
- Services of type `LoadBalancer` (when annotated)

#### Example
```bash
kubectl annotate service nginx "external-dns.alpha.kubernetes.io/hostname=nginx.mycluster.tds.cscs.ch."
```

> ✅ Use a valid name under the configured subdomain.

📄 [external-dns documentation](https://github.com/kubernetes-sigs/external-dns)
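
Equivalently, the annotation can be set declaratively in the Service manifest (hostname, selector, and ports are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx
  annotations:
    external-dns.alpha.kubernetes.io/hostname: nginx.mycluster.tds.cscs.ch.
spec:
  type: LoadBalancer
  selector:
    app: nginx
  ports:
    - port: 80
      targetPort: 80
```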

---

### 🔐 `cert-manager`

Handles automatic issuance of TLS certificates from Let's Encrypt.

#### Example
```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: echo
spec:
  secretName: echo
  commonName: echo.mycluster.tds.cscs.ch
  dnsNames:
    - echo.mycluster.tds.cscs.ch
  issuerRef:
    kind: ClusterIssuer
    name: letsencrypt
```
You can also issue certificates automatically via Ingress annotations (see the `ingress-nginx` section).
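
To follow the issuance after applying the manifest (resource names from the example above):

```bash
kubectl describe certificate echo   # shows progress and events from cert-manager
kubectl get secret echo             # created once the certificate has been issued
```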

📄 [cert-manager documentation](https://cert-manager.io)

---

### 📡 `metallb`

Enables `LoadBalancer` service types by assigning public IPs.

> ⚠️ The public IP pool is limited.
> Prefer using `Ingress` unless you specifically need a `LoadBalancer`.

📄 [metallb documentation](https://metallb.universe.tf)
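
Once an address has been assigned, it can be read from the Service status (service name illustrative):

```bash
kubectl get service nginx -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
```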

---

### 🌍 `ingress-nginx`

Default Ingress controller with class `nginx`.
Supports automatic TLS via cert-manager annotations.

#### Example
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
  namespace: my-ingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt
spec:
  ingressClassName: nginx   # default class on CSCS clusters
  rules:
    - host: example.tds.cscs.ch
      http:
        paths:
          - pathType: Prefix
            path: /
            backend:
              service:
                name: myservice
                port:
                  number: 80
  tls:
    - hosts:
        - example.tds.cscs.ch
      secretName: myingress-cert
```

📄 [ingress-nginx documentation](https://kubernetes.github.io/ingress-nginx/)
📄 [cert-manager Ingress Usage](https://cert-manager.io/docs/usage/ingress/)

---

### 🔑 `external-secrets`

Integrates with secret management tools like **HashiCorp Vault**.
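
A minimal `ExternalSecret` sketch (the `ClusterSecretStore` name, Vault path, and exact API version depend on the cluster's configuration and ESO release, and are assumptions here):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault                 # assumed store name
    kind: ClusterSecretStore
  target:
    name: db-credentials        # Kubernetes Secret to create
  data:
    - secretKey: password
      remoteRef:
        key: myapp/db           # assumed Vault path
        property: password
```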

📄 [external-secrets documentation](https://external-secrets.io/)

---

### 🔁 `kured`

Responsible for automatic node reboots (e.g., after kernel updates).

📄 [kured documentation](https://kured.dev/)

---

### 📊 Observability

Includes:

- **ECK Operator** – Elastic Cloud on Kubernetes operator, which manages Elastic Stack components
- **Beats agents** – Export logs and metrics to CSCS’s central log system
- **Prometheus** – Scrapes metrics and exports them to CSCS's central monitoring cluster
48 changes: 48 additions & 0 deletions docs/kubernetes/kubernetes-upgrades.md
@@ -0,0 +1,48 @@
# Kubernetes Cluster Upgrade Policy

To maintain a secure, stable, and supported platform, we regularly upgrade our Kubernetes clusters. We use **[RKE2](https://docs.rke2.io/)** as our Kubernetes distribution.

---

## 🔄 Upgrade Flow

- **Phased Rollout**:
- Upgrades are first applied to **TDS clusters** (Test and Development Systems).
- After a **minimum of 2 weeks**, if no critical issues are observed, the same upgrade will be applied to **PROD clusters**.

- **No Fixed Schedule**:
- Upgrades are not done on a strict calendar basis.
- Timing may depend on compatibility with **other infrastructure components** (e.g., storage, CNI plugins, monitoring tools).
- However, all clusters will be upgraded **before the current Kubernetes version reaches End of Life (EOL)**.

---

## ⚠️ Upgrade Impact

The **impact of a Kubernetes upgrade can vary**, depending on the nature of the changes involved:

- **Minimal Impact**:
- For example, upgrades that affect only the `kubelet` may be **transparent to workloads**.
- Rolling restarts may occur, but no downtime is expected for well-configured applications.

- **Potentially Disruptive**:
- Upgrades involving components such as the **CNI (Container Network Interface)** may cause **temporary network interruptions**.
- Other control plane or critical component updates might cause short-lived disruption to scheduling or connectivity.

> 💡 Applications that follow cloud-native best practices (e.g., readiness probes, multiple replicas, graceful shutdown handling) are **less likely to be impacted** by upgrades.
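
As an illustration, a minimal Deployment following these practices might look like this (name, image, and probe endpoint are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 2                             # survive the loss of a single pod
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      terminationGracePeriodSeconds: 30   # allow graceful shutdown
      containers:
        - name: myapp
          image: registry.example.com/myapp:1.0   # placeholder image
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
```
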
---

## ✅ What You Can Expect

- Upgrades are performed using safe, tested procedures with minimal risk to production workloads.
- TDS clusters serve as a **canary environment**, allowing us to identify issues early.
- All clusters are kept **aligned with supported Kubernetes versions**.

---

## 💬 Questions?

If you have any questions about upcoming Kubernetes upgrades or want help verifying your application’s readiness, please contact the Network and Cloud team via Service Desk ticket.

Thank you for your support and collaboration in keeping our platform secure and reliable.
59 changes: 59 additions & 0 deletions docs/kubernetes/node-upgrades.md
@@ -0,0 +1,59 @@
# Kubernetes Nodes OS Update Policy

To ensure the **security** and **stability** of our infrastructure, CSCS will perform **monthly OS updates** on all nodes of our Kubernetes clusters.

---

## 🔄 Maintenance Schedule

- **Frequency**: Every **first week of the month**
- **Reboot Window**: **Monday to Friday**, between **09:00 and 15:00**
- **Time Zone**: Europe/Zurich

These updates include important security patches and system updates for the operating systems of cluster nodes.

> ⚠️ **Note:** Nodes will be **rebooted only if required** by the updates. If no reboot is necessary, nodes will remain in service without disruption.

---

## 🚨 Urgent Security Patches

In the event of a **critical zero-day vulnerability**, we will apply patches and perform reboots (if required) **as soon as possible**, outside of the regular update schedule if needed.

- Affected nodes will be updated **immediately** to protect the platform.
- Users will be notified ahead of time **when possible**.
- Standard safety and rolling reboot practices will still be followed.

---

## 🛠️ Reboot Management with Kured

We use [**Kured** (KUbernetes REboot Daemon)](https://github.com/kubereboot/kured) to safely automate the reboot process. Kured ensures that:

- Reboots are triggered **only when necessary** (e.g., after kernel updates).
- Nodes are rebooted **one at a time** to avoid service disruption.
- Reboots occur **only during the defined window**.
- Nodes are **cordoned**, **drained**, and **gracefully reintegrated** after reboot.
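
During a maintenance window, the rolling reboots can be observed with a read-only watch:

```bash
kubectl get nodes -w   # cordoned nodes show SchedulingDisabled, then return to Ready
```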

---

## ✅ Application Requirements

To avoid service disruption during node maintenance, applications **must be designed for high availability**. Specifically:

- Use **multiple replicas** spread across nodes.
- Follow **cloud-native best practices**, including:
- Proper **readiness** and **liveness probes**
- **Graceful shutdown** support
- **Stateless design** or resilient handling of state
- Appropriate **resource requests and limits**
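
A `PodDisruptionBudget` complements the points above by capping how many replicas a node drain may evict at once (names illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: myapp
```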

> ❗ Applications that do not meet these requirements **may experience temporary disruption** during node reboots.

---

## 👩‍💻 Need Help?

If you have questions or need help preparing your applications for rolling node maintenance, please contact the Network and Cloud team via Service Desk ticket.

Thank you for your cooperation and commitment to building robust, cloud-native services.
4 changes: 4 additions & 0 deletions mkdocs.yml
@@ -118,6 +118,10 @@ nav:
- 'LLM Inference': guides/mlp_tutorials/llm-inference.md
- 'LLM Finetuning': guides/mlp_tutorials/llm-finetuning.md
- 'LLM Training': guides/mlp_tutorials/llm-nanotron-training.md
- 'Kubernetes':
- 'Clusters': kubernetes/clusters.md
- 'Kubernetes Upgrades': kubernetes/kubernetes-upgrades.md
- 'Node OS Upgrades': kubernetes/node-upgrades.md
- 'Policies':
- policies/index.md
- 'User Regulations': policies/regulations.md