Commit 23e471e

Add Kubernetes cluster docs
1 parent cb39104 commit 23e471e

2 files changed: +214 -0 lines changed

docs/kubernetes/clusters.md

Lines changed: 213 additions & 0 deletions
@@ -0,0 +1,213 @@
# CSCS Kubernetes Clusters

This document provides an overview of the Kubernetes clusters maintained by CSCS and offers step-by-step instructions for accessing and interacting with them.

---

## Architecture

All Kubernetes clusters at CSCS are:

- Managed using **[Rancher](https://www.rancher.com)**
- Running **[RKE2 (Rancher Kubernetes Engine 2)](https://github.com/rancher/rke2)**

---

## Cluster Environments

Clusters are grouped into two main environments:

- **TDS** – Test and Development Systems
- **PROD** – Production

TDS clusters receive updates first. If no issues arise, the same updates are then applied to PROD clusters.

---

## Kubernetes API Access

You can access the Kubernetes API in two main ways:

### Direct Internet Access

- A Virtual IP is exposed for the API server.
- Access can be restricted by source IP addresses.

### Access via CSCS Jump Host

- Connect through a bastion host (e.g., `ela.cscs.ch`).
- API calls are securely proxied through Rancher.

To check which method you are using, examine the `current-context` in your `kubeconfig` file.
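
As a rough sketch of that check, the helper below prints the active context and the API server URL it points to. The function name `show_api_endpoint` is made up for illustration. How to read the URL is an assumption: Rancher-generated kubeconfigs typically use a server address on the Rancher host, while a direct-access context would point at the cluster's own API address.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of any CSCS tooling): show which context and
# API endpoint kubectl will use with the current kubeconfig.
show_api_endpoint() {
  # Name of the context kubectl uses by default.
  kubectl config current-context
  # API server URL for that context; --minify limits output to the current context.
  kubectl config view --minify --output 'jsonpath={.clusters[0].cluster.server}'
  # jsonpath output has no trailing newline, so add one.
  echo
}

# Usage (needs kubectl and a kubeconfig):
#   show_api_endpoint
```

The block only defines the function, so it is safe to source on a machine without cluster access.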

---

## Cluster Access

To interact with the cluster, you need the `kubectl` CLI:

🔗 [Install kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl)

> `kubectl` is pre-installed on the CSCS jump host.

### Step-by-Step Access Guide

#### Retrieve your kubeconfig file

- If you have a CSCS account and can access [Rancher](https://rancher.cscs.ch), download the kubeconfig for your cluster.

- If you have a CSCS account but can't access [Rancher](https://rancher.cscs.ch), request a local Rancher user and use the **kcscs** tool installed on **ela.cscs.ch** to obtain the kubeconfig:
    - Download your SSH keys from the [SSH Service](https://sshservice.cscs.ch)
    - SSH to `ela.cscs.ch` using the downloaded SSH keys
    - Run `kcscs login` and enter your Rancher local user credentials (supplied by CSCS)
    - Run `kcscs list` to list the clusters you have access to
    - Run `kcscs get` to get the kubeconfig file for a specific cluster

- If you don't have a CSCS account, open a Service Desk ticket to request support.

#### Store the kubeconfig file

```bash
# Option 1: install it as the default kubeconfig
mkdir -p ~/.kube
mv mykubeconfig.yaml ~/.kube/config
# Option 2: keep it elsewhere and point kubectl at it
export KUBECONFIG=/home/user/kubeconfig.yaml
```

#### Test connectivity

```bash
kubectl get nodes
```

> ⚠️ The kubeconfig file contains credentials. Keep it secure.

---

## Pre-installed Applications

All CSCS-provided clusters include a set of pre-installed tools and components, described below:

---

### 📦 `ceph-csi`

Provides **dynamic persistent volume provisioning** via the Ceph Container Storage Interface.

#### Storage Classes

- `cephfs` – ReadWriteMany (RWX), backed by HDD (large data volumes)
- `rbd-hdd` – ReadWriteOnce (RWO), backed by HDD
- `rbd-nvme` – RWO, backed by NVMe (high-performance workloads like databases)
- `*-retain` – Same classes, but retain the volume after PVC deletion
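
To illustrate how one of these classes might be requested, the sketch below writes a PersistentVolumeClaim manifest that asks for the `rbd-nvme` class. Only the storage class name comes from the list above; the claim name `demo-data` and the 10Gi size are placeholders.

```bash
#!/usr/bin/env bash
# Write an example PVC manifest into a scratch directory.
workdir="$(mktemp -d)"
manifest="${workdir}/demo-pvc.yaml"

cat > "${manifest}" <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-data            # hypothetical claim name
spec:
  accessModes:
    - ReadWriteOnce          # RWO, matching the rbd-* classes
  storageClassName: rbd-nvme
  resources:
    requests:
      storage: 10Gi          # placeholder size
EOF

echo "Manifest written to ${manifest}"
# To create the claim on a cluster:
#   kubectl apply -f "${manifest}"
```

For shared data that several pods must mount at once, the same manifest would presumably use `cephfs` with the `ReadWriteMany` access mode instead.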

---

### 🌐 `external-dns`

Automatically manages DNS entries for:

- Ingress resources
- Services of type `LoadBalancer` (when annotated)

#### Example

```bash
kubectl annotate service nginx "external-dns.alpha.kubernetes.io/hostname=nginx.mycluster.tds.cscs.ch."
```

> ✅ Use a valid name under the configured subdomain.

📄 [external-dns documentation](https://github.com/kubernetes-sigs/external-dns)
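
The same annotation can also be set declaratively in the Service manifest. The sketch below is one possible shape: the annotation key and hostname are taken from the example above, while the `app: nginx` selector and port 80 are assumptions about the workload.

```bash
#!/usr/bin/env bash
# Write an example LoadBalancer Service carrying the external-dns hostname annotation.
workdir="$(mktemp -d)"
manifest="${workdir}/nginx-service.yaml"

cat > "${manifest}" <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: nginx
  annotations:
    external-dns.alpha.kubernetes.io/hostname: nginx.mycluster.tds.cscs.ch.
spec:
  type: LoadBalancer
  selector:
    app: nginx               # assumed pod label
  ports:
    - port: 80               # assumed service port
      targetPort: 80
EOF

echo "Manifest written to ${manifest}"
# To create the Service on a cluster:
#   kubectl apply -f "${manifest}"
```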

---

### 🔐 `cert-manager`

Handles automatic issuance of TLS certificates from Let's Encrypt.

#### Example

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: echo
spec:
  secretName: echo
  commonName: echo.mycluster.tds.cscs.ch
  dnsNames:
    - echo.mycluster.tds.cscs.ch
  issuerRef:
    kind: ClusterIssuer
    name: letsencrypt
```

You can also issue certs automatically via Ingress annotations (see `ingress-nginx` section).

📄 [cert-manager documentation](https://cert-manager.io)
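
After applying a Certificate such as `echo`, it can be useful to wait until cert-manager reports it as ready before pointing traffic at it. The helper below is a minimal sketch built on `kubectl wait`; the function name, the default namespace, and the 120s timeout are illustrative choices.

```bash
#!/usr/bin/env bash
# Hypothetical helper: block until a cert-manager Certificate has the Ready condition.
# Arguments: certificate name, namespace (default: default), timeout (default: 120s).
wait_for_certificate() {
  local name="$1"
  local namespace="${2:-default}"
  local timeout="${3:-120s}"
  kubectl wait --for=condition=Ready "certificate/${name}" \
    --namespace "${namespace}" --timeout="${timeout}"
}

# Usage (needs kubectl and cluster access):
#   wait_for_certificate echo
```

If the wait times out, `kubectl describe certificate echo` is a reasonable next step to look at the events recorded for the issuance.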

---

### 📡 `metallb`

Enables `LoadBalancer` service types by assigning public IPs.

> ⚠️ The public IP pool is limited.
> Prefer using `Ingress` unless you specifically need a `LoadBalancer`.

📄 [metallb documentation](https://metallb.universe.tf)

---

### 🌍 `ingress-nginx`

Default Ingress controller with class `nginx`.
Supports automatic TLS via cert-manager annotations.

#### Example

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  # Kubernetes object and namespace names must be lowercase
  name: myingress
  namespace: myingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt
spec:
  rules:
    - host: example.tds.cscs.ch
      http:
        paths:
          - pathType: Prefix
            path: /
            backend:
              service:
                name: myservice
                port:
                  number: 80
  tls:
    - hosts:
        - example.tds.cscs.ch
      secretName: myingress-cert
```

📄 [ingress-nginx documentation](https://kubernetes.github.io/ingress-nginx/)

📄 [cert-manager Ingress Usage](https://cert-manager.io/docs/usage/ingress/)

---

### 🔑 `external-secrets`

Integrates with secret management tools like **HashiCorp Vault**.

📄 [external-secrets documentation](https://external-secrets.io/)

---

### 🔁 `kured`

Responsible for automatic node reboots (e.g., after kernel updates).

📄 [kured documentation](https://kured.dev/)

---

### 📊 Observability

Includes:

- **ECK Operator** (Elastic Cloud on Kubernetes)
- **Beats agents** – Export logs and metrics to CSCS's central log system
- **Prometheus** – Scrapes metrics and exports them to CSCS's central monitoring cluster

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -119,6 +119,7 @@ nav:
 - 'LLM Finetuning': guides/mlp_tutorials/llm-finetuning.md
 - 'LLM Training': guides/mlp_tutorials/llm-nanotron-training.md
 - 'Kubernetes':
+- 'Clusters': kubernetes/clusters.md
 - 'Kubernetes Upgrades': kubernetes/kubernetes-upgrades.md
 - 'Node OS Upgrades': kubernetes/node-upgrades.md
 - 'Policies':
