Commit f8a0227

eliaoggian, msimberg, bcumming
authored
Kubernetes docs (#188)
Added kubernetes cluster documentation. This is the migration of the doc that was on KB. Updated some outdated details. --------- Co-authored-by: Mikael Simberg <[email protected]> Co-authored-by: Ben Cumming <[email protected]>
1 parent d54335f commit f8a0227

File tree

8 files changed

+366
-0
lines changed

.github/CODEOWNERS

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,7 @@
11
* @bcumming @msimberg @RMeli
22
docs/access/jupyterlab.md @rsarm
33
docs/services/firecrest @jpdorsch @ekouts
4+
docs/services/kubernetes @eliaoggian
45
docs/software/communication @Madeeks @msimberg
56
docs/software/devtools/linaro @jgphpc
67
docs/software/devtools/vihps @jgphpc

.github/actions/spelling/allow.txt

Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -230,7 +230,19 @@ pme
230230
pmi
231231
pmix
232232
podman
233+
prgenv
233234
preinstalled
235+
rke
236+
vms
237+
alpernetes
238+
kubeconfig
239+
ceph
240+
rwx
241+
rwo
242+
subdomain
243+
tls
244+
kured
245+
KUbernetes
234246
prerelease
235247
prereleases
236248
prgenv

docs/services/index.md

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -12,5 +12,11 @@
1212
FirecREST is a RESTful API for programmatically accessing High-Performance Computing resources.
1313

1414
[:octicons-arrow-right-24: FirecREST][ref-firecrest]
15+
16+
- :fontawesome-solid-dharmachakra: __Kubernetes__
17+
18+
Kubernetes platform for automating deployment, scaling, and management of containerized applications.
19+
20+
[:octicons-arrow-right-24: Kubernetes][ref-kubernetes]
1521
</div>
1622

docs/services/kubernetes/clusters.md

Lines changed: 221 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,221 @@
1+
[](){#ref-kubernetes-clusters}
2+
# CSCS Kubernetes clusters
3+
4+
This document provides an overview of the Kubernetes clusters maintained by CSCS and offers step-by-step instructions for accessing and interacting with them.
5+
6+
## Architecture
7+
8+
All Kubernetes clusters at CSCS are:
9+
10+
- Managed using **[Rancher](https://www.rancher.com)**
11+
- Running **[RKE2 (Rancher Kubernetes Engine 2)](https://github.com/rancher/rke2)**
12+
13+
CSCS offers two types of Kubernetes clusters for partners:
14+
15+
- **Harvester-only clusters**: These clusters run exclusively on virtual machines provisioned by Harvester (SUSE Virtualization), providing a flexible and isolated environment suitable for most workloads.
16+
- **Alpernetes clusters**: These clusters combine Harvester VMs with compute nodes from the Alps supercomputer. This hybrid setup, called *Alpernetes*, enables workloads to leverage both virtualized infrastructure and high-performance computing resources within the same Kubernetes environment.
17+
18+
## Cluster Environments
19+
20+
Clusters are grouped into two main environments:
21+
22+
- **TDS** – Test and Development Systems
23+
- **PROD** – Production
24+
25+
See [Kubernetes upgrades][ref-kubernetes-clusters-upgrades] for the detailed upgrade policy.
26+
27+
## Kubernetes API Access
28+
29+
You can access the Kubernetes API in two main ways:
30+
31+
### Direct Internet Access
32+
33+
- A Virtual IP is exposed for the API server.
34+
- Access is restricted to the partner's source IP addresses.
35+
36+
### Access via CSCS Jump Host
37+
38+
- Connect through a jump host (e.g., `ela.cscs.ch`).
39+
- API calls are securely proxied through Rancher.
40+
41+
To check which method you are using, examine the `current-context` in your `kubeconfig` file.
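As a minimal sketch of that check: with `kubectl` installed, `kubectl config current-context` prints the active context. The helper below reads the same field straight from the file, which can be useful on a machine without `kubectl`. The function name `current_context_of` is illustrative and not part of any CSCS tooling, and the helper assumes `KUBECONFIG`, if set, names a single file.

```bash
# Print the current-context recorded in a kubeconfig file.
# Hypothetical helper; `kubectl config current-context` reports the same value.
current_context_of() {
  # $1: path to a kubeconfig file (defaults to $KUBECONFIG, then ~/.kube/config)
  file=${1:-${KUBECONFIG:-$HOME/.kube/config}}
  # current-context is a top-level key, so it starts at column 0.
  awk '/^current-context:/ { gsub(/"/, "", $2); print $2; exit }' "$file"
}
```

Calling `current_context_of ~/.kube/config` prints the context name. To see which API endpoint that context points at, `kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'` prints the server URL of the active context.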
42+
43+
## Cluster Access
44+
45+
To interact with the cluster, you need the `kubectl` CLI:
46+
🔗 [Install kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl)
47+
??? Note "`kubectl` is pre-installed on the CSCS jump host."
48+
49+
50+
### Retrieve your kubeconfig file
51+
52+
#### Internal CSCS Users
53+
Access [Rancher](https://rancher.cscs.ch) and download the kubeconfig for your cluster.
54+
55+
#### External Users
56+
A specific Rancher user and password should have been provided to the partner.
57+
58+
Use the `kcscs` tool installed on `ela.cscs.ch` to obtain the kubeconfig by following the steps below.
59+
60+
Download your SSH keys from [SSH Service](https://sshservice.cscs.ch) (and add them to the SSH agent).
61+
62+
SSH to the jump host using the downloaded SSH keys
63+
```bash
ssh ela.cscs.ch
```
66+
67+
Log in with `kcscs` using the provided Rancher credentials
68+
```bash
kcscs login
```
71+
72+
List the accessible clusters
73+
```bash
kcscs list
```
76+
77+
Retrieve the kubeconfig file for a specific cluster
78+
```bash
kcscs get
```
81+
82+
83+
### Store the kubeconfig file
84+
85+
```bash
mkdir -p ~/.kube
mv mykubeconfig.yaml ~/.kube/config
```
88+
or
89+
```bash
export KUBECONFIG=/home/user/kubeconfig.yaml
```
92+
93+
### Test connectivity
94+
```bash
kubectl get nodes
```
97+
98+
!!! warning
99+
The kubeconfig file contains credentials. Keep it secure.
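One simple precaution, sketched here as a suggestion rather than a CSCS requirement, is to make the file readable and writable by its owner only. The function name `secure_kubeconfig` is illustrative.

```bash
# Restrict a kubeconfig file so that only its owner can read or write it.
secure_kubeconfig() {
  # $1: path to the kubeconfig file (defaults to ~/.kube/config)
  file=${1:-$HOME/.kube/config}
  chmod 600 "$file"
}
```

For example, `secure_kubeconfig /home/user/kubeconfig.yaml` tightens the permissions on a kubeconfig stored outside `~/.kube`.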
100+
101+
## Pre-installed Applications
102+
103+
All CSCS-provided clusters include a set of pre-installed tools and components, described below:
104+
105+
### `ceph-csi`
106+
107+
Provides dynamic persistent volume provisioning via the Ceph Container Storage Interface (Ceph CSI).
108+
109+
#### Storage Classes
110+
111+
- `cephfs` – ReadWriteMany (RWX), backed by HDD (large data volumes)
112+
- `rbd-hdd` – ReadWriteOnce (RWO), backed by HDD
113+
- `rbd-nvme` – RWO, backed by NVMe (high-performance workloads like databases)
114+
- `*-retain` – Same classes, but retain the volume after PVC deletion
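To request a volume from one of these classes, you create a `PersistentVolumeClaim` that names the class. The sketch below generates such a manifest. The function name `pvc_manifest`, the claim name, and the size are placeholders chosen for illustration.

```bash
# Emit a PersistentVolumeClaim manifest on stdout.
# $1: claim name (placeholder)   $2: storage class   $3: size
# $4: access mode (defaults to ReadWriteOnce; use ReadWriteMany with cephfs)
pvc_manifest() {
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $1
spec:
  accessModes:
    - ${4:-ReadWriteOnce}
  storageClassName: $2
  resources:
    requests:
      storage: $3
EOF
}
```

For example, `pvc_manifest db-data rbd-nvme 20Gi | kubectl apply -f -` would request a 20 GiB NVMe-backed volume, and `pvc_manifest shared cephfs 100Gi ReadWriteMany` produces a claim that several pods can mount at once.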
115+
116+
### `external-dns`
117+
118+
Automatically manages DNS entries for:
119+
120+
- Ingress resources
121+
- Services of type `LoadBalancer` (when annotated)
122+
123+
#### Example
124+
```bash
kubectl annotate service nginx "external-dns.alpha.kubernetes.io/hostname=nginx.mycluster.tds.cscs.ch."
```
127+
128+
!!! Note "Use a valid name under the configured subdomain"
129+
130+
🔗 [external-dns documentation](https://github.com/kubernetes-sigs/external-dns)
131+
132+
### `cert-manager`
133+
134+
Handles automatic issuance of TLS certificates from Let's Encrypt.
135+
136+
#### Example
137+
```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: echo
spec:
  secretName: echo
  commonName: echo.mycluster.tds.cscs.ch
  dnsNames:
    - echo.mycluster.tds.cscs.ch
  issuerRef:
    kind: ClusterIssuer
    name: letsencrypt
```
151+
152+
You can also issue certificates automatically via Ingress annotations (see `ingress-nginx` section).
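Issuance is asynchronous, so it can help to wait until cert-manager marks the `Certificate` as ready before using the secret. The following is a minimal sketch using standard `kubectl` commands. The function name `wait_for_certificate` and the default timeout are illustrative choices.

```bash
# Wait until cert-manager reports the Certificate as Ready, then show it.
# $1: Certificate name   $2: namespace (default: default)   $3: timeout (default: 120s)
wait_for_certificate() {
  kubectl wait --for=condition=Ready "certificate/$1" \
    --namespace="${2:-default}" --timeout="${3:-120s}" &&
    kubectl get certificate "$1" --namespace="${2:-default}"
}
```

For the example above, `wait_for_certificate echo` blocks until the certificate is issued or the timeout expires. If it times out, `kubectl describe certificate echo` usually shows why.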
153+
154+
🔗 [cert-manager documentation](https://cert-manager.io)
155+
156+
### `metallb`
157+
158+
Enables `LoadBalancer` service types by assigning public IPs.
159+
160+
!!! Warning "The public IP pool is limited. Prefer using `Ingress` unless you specifically need a `LoadBalancer` Service for TCP traffic."
161+
162+
🔗 [MetalLB documentation](https://metallb.universe.tf)
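For the TCP case mentioned in the warning, a `LoadBalancer` Service can be combined with the `external-dns` annotation so that the assigned IP also gets a DNS name. This is a sketch only: the function name `lb_service_manifest`, the `app` label convention, and the example hostname are assumptions made for illustration.

```bash
# Emit a LoadBalancer Service manifest with an external-dns hostname annotation.
# $1: service name, also used as the app label (placeholder)
# $2: DNS name under the cluster's configured subdomain
# $3: TCP port
lb_service_manifest() {
  cat <<EOF
apiVersion: v1
kind: Service
metadata:
  name: $1
  annotations:
    external-dns.alpha.kubernetes.io/hostname: $2
spec:
  type: LoadBalancer
  selector:
    app: $1
  ports:
    - protocol: TCP
      port: $3
      targetPort: $3
EOF
}
```

For example, `lb_service_manifest postgres postgres.mycluster.tds.cscs.ch 5432 | kubectl apply -f -` would expose a database on its own IP.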
163+
164+
### `ingress-nginx`
165+
166+
Default Ingress controller with class `nginx`.
167+
Supports automatic TLS via cert-manager annotations.
168+
169+
Example:
170+
171+
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myingress
  namespace: myingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt
spec:
  rules:
    - host: example.tds.cscs.ch
      http:
        paths:
          - pathType: Prefix
            path: /
            backend:
              service:
                name: myservice
                port:
                  number: 80
  tls:
    - hosts:
        - example.tds.cscs.ch
      secretName: myingress-cert
```
196+
197+
🔗 [ingress-nginx documentation](https://kubernetes.github.io/ingress-nginx/)
198+
🔗 [cert-manager Ingress Usage](https://cert-manager.io/docs/usage/ingress/)
199+
200+
### `external-secrets`
201+
202+
Integrates with secret management tools like **HashiCorp Vault**.
203+
204+
Enables the use of `ExternalSecret` resources, which fetch secrets through a `SecretStore` or `ClusterSecretStore` and store them as `Secrets` inside the cluster.
205+
206+
It helps to avoid storing secrets in the deployment manifests, especially in GitOps environments.
207+
208+
🔗 [external-secrets documentation](https://external-secrets.io/)
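As a rough sketch of what an `ExternalSecret` looks like, the helper below generates one that copies a single `password` property into a Kubernetes `Secret`. Several details are assumptions: the `external-secrets.io/v1beta1` API version depends on the operator release installed on your cluster, and the store name and remote key are placeholders for whatever has been configured for you.

```bash
# Emit an ExternalSecret manifest on stdout.
# $1: name of the ExternalSecret and of the resulting Secret (placeholder)
# $2: name of an existing ClusterSecretStore (placeholder)
# $3: key (path) of the secret in the external backend (placeholder)
external_secret_manifest() {
  cat <<EOF
apiVersion: external-secrets.io/v1beta1  # assumption: check the version served by your cluster
kind: ExternalSecret
metadata:
  name: $1
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: $2
  target:
    name: $1
  data:
    - secretKey: password
      remoteRef:
        key: $3
        property: password
EOF
}
```

For example, `external_secret_manifest db-credentials my-store apps/db | kubectl apply -f -`. The manifest contains only references, so it can be committed to a Git repository without exposing the secret value.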
209+
210+
### `kured`
211+
212+
Responsible for automatic node reboots (e.g., after kernel updates).
213+
214+
🔗 [kured documentation](https://kured.dev/)
215+
216+
### Observability
217+
218+
Includes:
219+
220+
- **Beats agents** – Export logs and metrics to CSCS's central log system
221+
- **Prometheus** – Scrapes metrics and exports them to CSCS's central monitoring cluster

docs/services/kubernetes/index.md

Lines changed: 35 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,35 @@
1+
[](){#ref-kubernetes}
2+
# Kubernetes
3+
4+
Kubernetes is only available for specific partners.
5+
6+
!!! Note
7+
Kubernetes is not available for normal users on Alps.
8+
9+
This documentation is designed to help partners who have been granted access to a Kubernetes cluster.
10+
11+
It explains how clusters are provisioned and maintained, and describes the policies in place for upgrades and updates.
12+
13+
14+
15+
<div class="grid cards" markdown>
16+
- :fontawesome-solid-layer-group: __Cluster Architecture__
17+
18+
Overview of the CSCS Kubernetes clusters: their main components and how to interact with them.
19+
20+
[:octicons-arrow-right-24: Clusters][ref-kubernetes-clusters]
21+
22+
- :fontawesome-solid-arrow-up-from-bracket: __Kubernetes Upgrades__
23+
24+
Kubernetes Cluster upgrade policy (Kubernetes version upgrades)
25+
26+
[:octicons-arrow-right-24: Kubernetes Upgrades][ref-kubernetes-clusters-upgrades]
27+
28+
- :fontawesome-solid-shield-halved: __Node Updates__
29+
30+
Cluster Nodes OS update policy (Regular Node Security Updates)
31+
32+
[:octicons-arrow-right-24: Node OS Updates][ref-kubernetes-node-updates]
33+
34+
</div>
35+

docs/services/kubernetes/kubernetes-upgrades.md

Lines changed: 40 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,40 @@
1+
[](){#ref-kubernetes-clusters-upgrades}
2+
# Kubernetes Cluster Upgrade Policy
3+
4+
To maintain a secure, stable, and supported platform, we regularly upgrade our Kubernetes clusters. We use **[RKE2](https://docs.rke2.io/)** as our Kubernetes distribution.
5+
6+
## Upgrade Flow
7+
8+
**Phased Rollout**
9+
10+
- Upgrades are first applied to **TDS clusters** (Test and Development Systems).
11+
- After a **minimum of 2 weeks**, if no critical issues are observed, the same upgrade will be applied to **PROD clusters**.
12+
13+
**No Fixed Schedule**
14+
15+
- Upgrades are not done on a strict calendar basis.
16+
- Timing may depend on compatibility with **other infrastructure components** (e.g., storage, CNI plugins, monitoring tools).
17+
- However, all clusters will be upgraded **before the current Kubernetes version reaches End of Life (EOL)**.
18+
19+
## Upgrade Impact
20+
21+
The **impact of a Kubernetes upgrade can vary**, depending on the nature of the changes involved:
22+
23+
**Minimal Impact**
24+
25+
- For example, upgrades that affect only the `kubelet` may be **transparent to workloads**.
26+
- Rolling restarts may occur, but no downtime is expected for well-configured applications.
27+
28+
**Potentially Disruptive**
29+
30+
- Upgrades involving components such as the **CNI (Container Network Interface)** may cause **temporary network interruptions**.
31+
- Other control plane or critical component updates might cause short-lived disruption to scheduling or connectivity.
32+
33+
??? Note "Applications that follow cloud-native best practices (e.g., readiness probes, multiple replicas, graceful shutdown handling) are **less likely to be impacted** by upgrades."
34+
35+
## What You Can Expect
36+
37+
- Upgrades are performed using safe, tested procedures with minimal risk to production workloads.
38+
- TDS clusters serve as a **canary environment**, allowing us to identify issues early.
39+
- All clusters are kept **aligned with supported Kubernetes versions**.
40+

docs/services/kubernetes/node-updates.md

Lines changed: 46 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,46 @@
1+
[](){#ref-kubernetes-node-updates}
2+
# Kubernetes Nodes OS Update Policy
3+
4+
To ensure the **security** and **stability** of our infrastructure, CSCS will perform **monthly OS updates** on all nodes of our Kubernetes clusters.
5+
6+
## Maintenance Schedule
7+
8+
- **Frequency**: The **first week of every month**
9+
- **Reboot Window**: **Monday to Friday**, between **09:00 and 15:00**
10+
- **Time Zone**: Europe/Zurich
11+
12+
These updates include important security patches and system updates for the operating systems of cluster nodes.
13+
14+
??? Note "Nodes will be rebooted only if required by the updates."
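If you want your own tooling to know whether a reboot may currently happen (for example, to postpone a long-running batch job), the reboot window above can be checked with a few lines of shell. This is a sketch under the assumption that the window is exactly Monday to Friday, from 09:00 up to but not including 15:00. The function name `in_reboot_window` is illustrative, and the sketch does not check for the first week of the month.

```bash
# Succeed if the given weekday and hour fall inside the reboot window.
# $1: ISO weekday, 1 (Monday) to 7 (Sunday), as printed by `date +%u`
# $2: hour of the day, 00 to 23, as printed by `date +%H`
in_reboot_window() {
  dow=$1
  hour=${2#0}      # drop a leading zero so "09" is not read as octal
  hour=${hour:-0}  # "0" or "00" both end up as 0
  [ "$dow" -ge 1 ] && [ "$dow" -le 5 ] && [ "$hour" -ge 9 ] && [ "$hour" -lt 15 ]
}
```

To test the current time in the documented time zone: `in_reboot_window "$(TZ=Europe/Zurich date +%u)" "$(TZ=Europe/Zurich date +%H)" && echo "inside reboot window"`.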
15+
16+
## Urgent Security Patches
17+
18+
In the event of a **critical zero-day vulnerability**, we will apply patches and perform reboots (if required) **as soon as possible**, outside of the regular update schedule if needed.
19+
20+
- Affected nodes will be updated **immediately** to protect the platform.
21+
- Users will be notified ahead of time **when possible**.
22+
- Standard safety and rolling reboot practices will still be followed.
23+
24+
## Reboot Management with Kured
25+
26+
We use [**Kured** (KUbernetes REboot Daemon)](https://github.com/kubereboot/kured) to safely automate the reboot process. Kured ensures that:
27+
28+
- Reboots are triggered **only when necessary** (e.g., after kernel updates).
29+
- Nodes are rebooted **one at a time** to avoid service disruption.
30+
- Reboots occur **only during the defined window**.
31+
- Nodes are **cordoned**, **drained**, and **gracefully reintegrated** after reboot.
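During a maintenance window you can observe this from the outside: a cordoned node is marked unschedulable and shows `SchedulingDisabled` in the `STATUS` column of `kubectl get nodes`. A minimal sketch for listing only those nodes follows. The function name `cordoned_nodes` is illustrative.

```bash
# List nodes that are currently cordoned (marked unschedulable).
cordoned_nodes() {
  kubectl get nodes --field-selector spec.unschedulable=true
}
```

Since nodes are rebooted one at a time, `cordoned_nodes` should normally list at most one node while a reboot is in progress.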
32+
33+
## Application Requirements
34+
35+
To avoid service disruption during node maintenance, applications **must be designed for high availability**. Specifically:
36+
37+
- Use **multiple replicas** spread across nodes.
38+
- Follow **cloud-native best practices**, including:
39+
- Proper **readiness** and **liveness probes**
40+
- **Graceful shutdown** support
41+
- **Stateless design** or resilient handling of state
42+
- Appropriate **resource requests and limits**
43+
44+
!!! Warning
45+
Applications that do not meet these requirements **may experience temporary disruption** during node reboots.
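The requirements above can be sketched as a single manifest. Everything specific in it is a placeholder: the function name `ha_app_manifest`, the port `8080`, the `/healthz` probe path, and the resource values are illustrative and must be adapted to your application. The `PodDisruptionBudget` is an optional addition not listed above, included because node drains respect it.

```bash
# Emit a Deployment and PodDisruptionBudget for a stateless HTTP service.
# $1: application name (placeholder)   $2: container image (placeholder)
ha_app_manifest() {
  cat <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: $1
spec:
  replicas: 3                      # multiple replicas
  selector:
    matchLabels:
      app: $1
  template:
    metadata:
      labels:
        app: $1
    spec:
      terminationGracePeriodSeconds: 30   # time allowed for graceful shutdown
      topologySpreadConstraints:          # spread replicas across nodes
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: $1
      containers:
        - name: $1
          image: $2
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              memory: 256Mi
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: $1
spec:
  minAvailable: 2                  # a drain never leaves fewer than 2 pods running
  selector:
    matchLabels:
      app: $1
EOF
}
```

For example, `ha_app_manifest web registry.example.org/web:1.0 | kubectl apply -f -`. With three replicas and `minAvailable: 2`, draining one node at a time can always proceed while at least two pods keep serving traffic.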
46+

mkdocs.yml

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -103,6 +103,11 @@ nav:
103103
- services/index.md
104104
- 'FirecREST': services/firecrest.md
105105
- 'CI/CD': services/cicd.md
106+
- 'Kubernetes':
107+
- services/kubernetes/index.md
108+
- 'Clusters': services/kubernetes/clusters.md
109+
- 'Kubernetes Upgrades': services/kubernetes/kubernetes-upgrades.md
110+
- 'Node OS Updates': services/kubernetes/node-updates.md
106111
- 'Running Jobs':
107112
- running/index.md
108113
- 'Slurm': running/slurm.md
