213 changes: 213 additions & 0 deletions docs/kubernetes/clusters.md
@@ -0,0 +1,213 @@

# CSCS Kubernetes Clusters

This document provides an overview of the Kubernetes clusters maintained by CSCS and offers step-by-step instructions for accessing and interacting with them.

---

## Architecture

All Kubernetes clusters at CSCS are:

- Managed using **[Rancher](https://www.rancher.com)**
- Running **[RKE2 (Rancher Kubernetes Engine 2)](https://github.com/rancher/rke2)**

---

## Cluster Environments

Clusters are grouped into two main environments:

- **TDS** – Test and Development Systems
- **PROD** – Production

TDS clusters receive updates first. If no issues arise, the same updates are then applied to PROD clusters.

---

## Kubernetes API Access

You can access the Kubernetes API in two main ways:

### Direct Internet Access

- A Virtual IP is exposed for the API server.
- Access can be restricted by source IP addresses.

### Access via CSCS Jump Host

- Connect through a bastion host (e.g., `ela.cscs.ch`).
- API calls are securely proxied through Rancher.

To check which method you are using, examine the `current-context` in your `kubeconfig` file.
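
For example (read-only, makes no changes):

```bash
kubectl config current-context   # name of the active context
kubectl config view --minify     # full config for the active context, credentials redacted
```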

---

## Cluster Access

To interact with the cluster, you need the `kubectl` CLI:
🔗 [Install kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl)

> `kubectl` is pre-installed on the CSCS jump host.

### Step-by-Step Access Guide

#### Retrieve your kubeconfig file
- If you have a CSCS account and can access [Rancher](https://rancher.cscs.ch), download the kubeconfig for your cluster.

- If you have a CSCS account but can't access [Rancher](https://rancher.cscs.ch), request a local Rancher user and use the **kcscs** tool installed on **ela.cscs.ch** to obtain the kubeconfig (see the example session after this list):
    - Download your SSH keys from [SSH Service](https://sshservice.cscs.ch)
    - SSH to `ela.cscs.ch` using the downloaded SSH keys
    - Run `kcscs login` and enter your Rancher local user credentials (supplied by CSCS)
    - Run `kcscs list` to list the clusters you have access to
    - Run `kcscs get` to download the kubeconfig file for a specific cluster

- If you don't have a CSCS account, open a Service Desk ticket to request support.
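
For the local Rancher user path above, a typical session looks roughly like this (the key file name and username are illustrative; see the tool's built-in help for exact `kcscs` arguments):

```bash
# From your workstation, using the keys downloaded from https://sshservice.cscs.ch
ssh -i ~/.ssh/cscs-key username@ela.cscs.ch

# On ela.cscs.ch
kcscs login   # enter the Rancher local user credentials supplied by CSCS
kcscs list    # show the clusters you have access to
kcscs get     # retrieve the kubeconfig for one of the listed clusters
```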

#### Store the kubeconfig file
```bash
mv mykubeconfig.yaml ~/.kube/config
# or
export KUBECONFIG=/home/user/kubeconfig.yaml
```

#### Test connectivity
```bash
kubectl get nodes
```

> ⚠️ The kubeconfig file contains credentials. Keep it secure.
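
For example, to restrict access to the default location (path assumes the first option above):

```bash
chmod 600 ~/.kube/config
```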
---

## Pre-installed Applications

All CSCS-provided clusters include a set of pre-installed tools and components, described below:

---

### 📦 `ceph-csi`

Provides **dynamic persistent volume provisioning** via the Ceph Container Storage Interface.

#### Storage Classes

- `cephfs` – ReadWriteMany (RWX), backed by HDD (large data volumes)
- `rbd-hdd` – ReadWriteOnce (RWO), backed by HDD
- `rbd-nvme` – RWO, backed by NVMe (high-performance workloads like databases)
- `*-retain` – Same classes, but retain the volume after PVC deletion
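
A minimal PersistentVolumeClaim using one of these classes (name and size are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  accessModes:
    - ReadWriteOnce        # rbd-* classes are RWO; use cephfs for RWX
  storageClassName: rbd-nvme
  resources:
    requests:
      storage: 10Gi
```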

---

### 🌐 `external-dns`

Automatically manages DNS entries for:

- Ingress resources
- Services of type `LoadBalancer` (when annotated)

#### Example
```bash
kubectl annotate service nginx "external-dns.alpha.kubernetes.io/hostname=nginx.mycluster.tds.cscs.ch."
```

> ✅ Use a valid name under the configured subdomain.

📄 [external-dns documentation](https://github.com/kubernetes-sigs/external-dns)
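
Equivalently, the annotation can be set declaratively in the Service manifest (hostname, selector, and ports are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx
  annotations:
    external-dns.alpha.kubernetes.io/hostname: nginx.mycluster.tds.cscs.ch.
spec:
  type: LoadBalancer
  selector:
    app: nginx
  ports:
    - port: 80
      targetPort: 80
```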

---

### 🔐 `cert-manager`

Handles automatic issuance of TLS certificates from Let's Encrypt.

#### Example
```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: echo
spec:
  secretName: echo
  commonName: echo.mycluster.tds.cscs.ch
  dnsNames:
    - echo.mycluster.tds.cscs.ch
  issuerRef:
    kind: ClusterIssuer
    name: letsencrypt
```
You can also issue certificates automatically via Ingress annotations (see the `ingress-nginx` section).
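
To follow the issuance after applying the manifest (resource names from the example above):

```bash
kubectl describe certificate echo   # shows progress and events from cert-manager
kubectl get secret echo             # created once the certificate has been issued
```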

📄 [cert-manager documentation](https://cert-manager.io)

---

### 📡 `metallb`

Enables `LoadBalancer` service types by assigning public IPs.

> ⚠️ The public IP pool is limited.
> Prefer using `Ingress` unless you specifically need a `LoadBalancer`.

📄 [metallb documentation](https://metallb.universe.tf)
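
Once an address has been assigned, it can be read from the Service status (service name illustrative):

```bash
kubectl get service nginx -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
```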

---

### 🌍 `ingress-nginx`

Default Ingress controller with class `nginx`.
Supports automatic TLS via cert-manager annotations.

#### Example
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
  namespace: my-ingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt
spec:
  ingressClassName: nginx   # default class on CSCS clusters
  rules:
    - host: example.tds.cscs.ch
      http:
        paths:
          - pathType: Prefix
            path: /
            backend:
              service:
                name: myservice
                port:
                  number: 80
  tls:
    - hosts:
        - example.tds.cscs.ch
      secretName: myingress-cert
```

📄 [ingress-nginx documentation](https://kubernetes.github.io/ingress-nginx/)
📄 [cert-manager Ingress Usage](https://cert-manager.io/docs/usage/ingress/)

---

### 🔑 `external-secrets`

Integrates with secret management tools like **HashiCorp Vault**.
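
A minimal `ExternalSecret` sketch (the `ClusterSecretStore` name, Vault path, and exact API version depend on the cluster's configuration and ESO release, and are assumptions here):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault                 # assumed store name
    kind: ClusterSecretStore
  target:
    name: db-credentials        # Kubernetes Secret to create
  data:
    - secretKey: password
      remoteRef:
        key: myapp/db           # assumed Vault path
        property: password
```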

📄 [external-secrets documentation](https://external-secrets.io/)

---

### 🔁 `kured`

Responsible for automatic node reboots (e.g., after kernel updates).

📄 [kured documentation](https://kured.dev/)

---

### 📊 Observability

Includes:

- **ECK Operator** – Elastic Cloud on Kubernetes operator, which manages Elastic Stack components
- **Beats agents** – Export logs and metrics to CSCS’s central log system
- **Prometheus** – Scrapes metrics and exports them to CSCS's central monitoring cluster
48 changes: 48 additions & 0 deletions docs/kubernetes/kubernetes-upgrades.md
@@ -0,0 +1,48 @@
# Kubernetes Cluster Upgrade Policy

To maintain a secure, stable, and supported platform, we regularly upgrade our Kubernetes clusters. We use **[RKE2](https://docs.rke2.io/)** as our Kubernetes distribution.

---

## 🔄 Upgrade Flow

- **Phased Rollout**:
- Upgrades are first applied to **TDS clusters** (Test and Development Systems).
- After a **minimum of 2 weeks**, if no critical issues are observed, the same upgrade will be applied to **PROD clusters**.

- **No Fixed Schedule**:
- Upgrades are not done on a strict calendar basis.
- Timing may depend on compatibility with **other infrastructure components** (e.g., storage, CNI plugins, monitoring tools).
- However, all clusters will be upgraded **before the current Kubernetes version reaches End of Life (EOL)**.

---

## ⚠️ Upgrade Impact

The **impact of a Kubernetes upgrade can vary**, depending on the nature of the changes involved:

- **Minimal Impact**:
- For example, upgrades that affect only the `kubelet` may be **transparent to workloads**.
- Rolling restarts may occur, but no downtime is expected for well-configured applications.

- **Potentially Disruptive**:
- Upgrades involving components such as the **CNI (Container Network Interface)** may cause **temporary network interruptions**.
- Other control plane or critical component updates might cause short-lived disruption to scheduling or connectivity.

> 💡 Applications that follow cloud-native best practices (e.g., readiness probes, multiple replicas, graceful shutdown handling) are **less likely to be impacted** by upgrades.
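
As an illustration, a minimal Deployment following these practices might look like this (name, image, and probe endpoint are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 2                             # survive the loss of a single pod
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      terminationGracePeriodSeconds: 30   # allow graceful shutdown
      containers:
        - name: myapp
          image: registry.example.com/myapp:1.0   # placeholder image
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
```
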
---

## ✅ What You Can Expect

- Upgrades are performed using safe, tested procedures with minimal risk to production workloads.
- TDS clusters serve as a **canary environment**, allowing us to identify issues early.
- All clusters are kept **aligned with supported Kubernetes versions**.

---

## 💬 Questions?

If you have any questions about upcoming Kubernetes upgrades or want help verifying your application’s readiness, please contact the Network and Cloud team via Service Desk ticket.

Thank you for your support and collaboration in keeping our platform secure and reliable.
59 changes: 59 additions & 0 deletions docs/kubernetes/node-upgrades.md
@@ -0,0 +1,59 @@
# Kubernetes Nodes OS Update Policy

To ensure the **security** and **stability** of our infrastructure, CSCS will perform **monthly OS updates** on all nodes of our Kubernetes clusters.

---

## 🔄 Maintenance Schedule

- **Frequency**: Every **first week of the month**
- **Reboot Window**: **Monday to Friday**, between **09:00 and 15:00**
- **Time Zone**: Europe/Zurich

These updates include important security patches and system updates for the operating systems of cluster nodes.

> ⚠️ **Note:** Nodes will be **rebooted only if required** by the updates. If no reboot is necessary, nodes will remain in service without disruption.

---

## 🚨 Urgent Security Patches

In the event of a **critical zero-day vulnerability**, we will apply patches and perform reboots (if required) **as soon as possible**, outside of the regular update schedule if needed.

- Affected nodes will be updated **immediately** to protect the platform.
- Users will be notified ahead of time **when possible**.
- Standard safety and rolling reboot practices will still be followed.

---

## 🛠️ Reboot Management with Kured

We use [**Kured** (KUbernetes REboot Daemon)](https://github.com/kubereboot/kured) to safely automate the reboot process. Kured ensures that:

- Reboots are triggered **only when necessary** (e.g., after kernel updates).
- Nodes are rebooted **one at a time** to avoid service disruption.
- Reboots occur **only during the defined window**.
- Nodes are **cordoned**, **drained**, and **gracefully reintegrated** after reboot.
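
During a maintenance window, the rolling reboots can be observed with a read-only watch:

```bash
kubectl get nodes -w   # cordoned nodes show SchedulingDisabled, then return to Ready
```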

---

## ✅ Application Requirements

To avoid service disruption during node maintenance, applications **must be designed for high availability**. Specifically:

- Use **multiple replicas** spread across nodes.
- Follow **cloud-native best practices**, including:
- Proper **readiness** and **liveness probes**
- **Graceful shutdown** support
- **Stateless design** or resilient handling of state
- Appropriate **resource requests and limits**
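
A `PodDisruptionBudget` complements the points above by capping how many replicas a node drain may evict at once (names illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: myapp
```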

> ❗ Applications that do not meet these requirements **may experience temporary disruption** during node reboots.

---

## 👩‍💻 Need Help?

If you have questions or need help preparing your applications for rolling node maintenance, please contact the Network and Cloud team via Service Desk ticket.

Thank you for your cooperation and commitment to building robust, cloud-native services.
4 changes: 4 additions & 0 deletions mkdocs.yml
@@ -118,6 +118,10 @@ nav:
- 'LLM Inference': guides/mlp_tutorials/llm-inference.md
- 'LLM Finetuning': guides/mlp_tutorials/llm-finetuning.md
- 'LLM Training': guides/mlp_tutorials/llm-nanotron-training.md
- 'Kubernetes':
- 'Clusters': kubernetes/clusters.md
- 'Kubernetes Upgrades': kubernetes/kubernetes-upgrades.md
- 'Node OS Upgrades': kubernetes/node-upgrades.md
- 'Policies':
- policies/index.md
- 'User Regulations': policies/regulations.md