Commit 115c9b4

Fix docs based on review
1 parent 1c4fbd7 commit 115c9b4

3 files changed: 8 additions and 59 deletions

docs/kubernetes/clusters.md

Lines changed: 8 additions & 34 deletions
@@ -2,17 +2,13 @@

 This document provides an overview of the Kubernetes clusters maintained by CSCS and offers step-by-step instructions for accessing and interacting with them.

----
-
 ## Architecture

 All Kubernetes clusters at CSCS are:

 - Managed using **[Rancher](https://www.rancher.com)**
 - Running **[RKE2 (Rancher Kubernetes Engine 2)](https://github.com/rancher/rke2)**

----
-
 ## Cluster Environments

 Clusters are grouped into two main environments:
@@ -22,8 +18,6 @@ Clusters are grouped into two main environments:

 TDS clusters receive updates first. If no issues arise, the same updates are then applied to PROD clusters.

----
-
 ## Kubernetes API Access

 You can access the Kubernetes API in two main ways:
@@ -40,8 +34,6 @@ You can access the Kubernetes API in two main ways:

 To check which method you are using, examine the `current-context` in your `kubeconfig` file.

----
-
 ## Cluster Access

 To interact with the cluster, you need the `kubectl` CLI:
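
As a minimal sketch of the `current-context` check described in this hunk: when `kubectl` is installed, `kubectl config current-context` prints the active context. The helper below reads the same key straight from the file. The function name `current_context` is hypothetical, and the sketch assumes `KUBECONFIG` names a single file rather than a colon-separated list.

```shell
#!/bin/sh
# Sketch: print the current-context stored in a kubeconfig file.
# With kubectl installed, `kubectl config current-context` reports the same value.

current_context() {
  # $1: path to a kubeconfig; falls back to $KUBECONFIG, then ~/.kube/config.
  # Assumption: the fallback value is a single path, not a colon-separated list.
  file="${1:-${KUBECONFIG:-$HOME/.kube/config}}"
  # current-context is a top-level key, so it starts in column one.
  awk -F': *' '/^current-context:/ { print $2; exit }' "$file"
}
```

For example, `current_context /home/user/kubeconfig.yaml` prints the context name recorded in that file, which can then be compared against the access methods described in the documentation.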
@@ -79,15 +71,11 @@ export KUBECONFIG=/home/user/kubeconfig.yaml

 > ⚠️ The kubeconfig file contains credentials. Keep it secure.

----
-
 ## Pre-installed Applications

 All CSCS-provided clusters include a set of pre-installed tools and components, described below:

----
-
-### 📦 `ceph-csi`
+### `ceph-csi`

 Provides **dynamic persistent volume provisioning** via the Ceph Container Storage Interface.

@@ -98,9 +86,7 @@ Provides **dynamic persistent volume provisioning** via the Ceph Container Stora
 - `rbd-nvme` – RWO, backed by NVMe (high-performance workloads like databases)
 - `*-retain` – Same classes, but retain the volume after PVC deletion

----
-
-### 🌐 `external-dns`
+### `external-dns`

 Automatically manages DNS entries for:

@@ -115,9 +101,7 @@ kubectl annotate service nginx "external-dns.alpha.kubernetes.io/hostname=nginx.
 !!! info "Use a valid name under the configured subdomain"
 [external-dns documentation](https://github.com/kubernetes-sigs/external-dns)

----
-
-### 🔐 `cert-manager`
+### `cert-manager`

 Handles automatic issuance of TLS certificates from Let's Encrypt.

@@ -141,19 +125,15 @@ You can also issue certs automatically via Ingress annotations (see `ingress-ngi

 📄 [cert-manager documentation](https://cert-manager.io)

----
-
-### 📡 `metallb`
+### `metallb`

 Enables `LoadBalancer` service types by assigning public IPs.

 > ⚠️ The public IP pool is limited.
 Prefer using `Ingress` unless you specifically need a `LoadBalancer`.
 📄 [metallb documentation](https://metallb.universe.tf)

----
-
-### 🌍 `ingress-nginx`
+### `ingress-nginx`

 Default Ingress controller with class `nginx`.
 Supports automatic TLS via cert-manager annotations.
@@ -188,25 +168,19 @@ spec:
 📄 [NGINX Ingress Docs](https://docs.nginx.com/nginx-ingress-controller)
 📄 [cert-manager Ingress Usage](https://cert-manager.io/docs/usage/ingress/)

----
-
-### 🔑 `external-secrets`
+### `external-secrets`

 Integrates with secret management tools like **HashiCorp Vault**.

 📄 [external-secrets documentation](https://external-secrets.io/)

----
-
-### 🔁 `kured`
+### `kured`

 Responsible for automatic node reboots (e.g., after kernel updates).

 📄 [kured documentation](https://kured.dev/)

----
-
-### 📊 Observability
+### Observability

 Includes:

docs/kubernetes/kubernetes-upgrades.md

Lines changed: 0 additions & 14 deletions
@@ -2,8 +2,6 @@

 To maintain a secure, stable, and supported platform, we regularly upgrade our Kubernetes clusters. We use **[RKE2](https://docs.rke2.io/)** as our Kubernetes distribution.

----
-
 ## 🔄 Upgrade Flow

 - **Phased Rollout**:
@@ -15,8 +13,6 @@ To maintain a secure, stable, and supported platform, we regularly upgrade our K
 - Timing may depend on compatibility with **other infrastructure components** (e.g., storage, CNI plugins, monitoring tools).
 - However, all clusters will be upgraded **before the current Kubernetes version reaches End of Life (EOL)**.

----
-
 ## ⚠️ Upgrade Impact

 The **impact of a Kubernetes upgrade can vary**, depending on the nature of the changes involved:
@@ -31,18 +27,8 @@ The **impact of a Kubernetes upgrade can vary**, depending on the nature of the

 > 💡 Applications that follow cloud-native best practices (e.g., readiness probes, multiple replicas, graceful shutdown handling) are **less likely to be impacted** by upgrades.

----
-
 ## ✅ What You Can Expect

 - Upgrades are performed using safe, tested procedures with minimal risk to production workloads.
 - TDS clusters serve as a **canary environment**, allowing us to identify issues early.
 - All clusters are kept **aligned with supported Kubernetes versions**.
-
----
-
-## 💬 Questions?
-
-If you have any questions about upcoming Kubernetes upgrades or want help verifying your application’s readiness, please contact the Network and Cloud team via Service Desk ticket.
-
-Thank you for your support and collaboration in keeping our platform secure and reliable.

docs/kubernetes/node-upgrades.md

Lines changed: 0 additions & 11 deletions
@@ -2,8 +2,6 @@

 To ensure the **security** and **stability** of our infrastructure, CSCS will perform **monthly OS updates** on all nodes of our Kubernetes clusters.

----
-
 ## 🔄 Maintenance Schedule

 - **Frequency**: Every **first week of the month**
@@ -14,8 +12,6 @@ These updates include important security patches and system updates for the oper

 > ⚠️ **Note:** Nodes will be **rebooted only if required** by the updates. If no reboot is necessary, nodes will remain in service without disruption.

----
-
 ## 🚨 Urgent Security Patches

 In the event of a **critical zero-day vulnerability**, we will apply patches and perform reboots (if required) **as soon as possible**, outside of the regular update schedule if needed.
@@ -24,8 +20,6 @@ In the event of a **critical zero-day vulnerability**, we will apply patches and
 - Users will be notified ahead of time **when possible**.
 - Standard safety and rolling reboot practices will still be followed.

----
-
 ## 🛠️ Reboot Management with Kured

 We use [**Kured** (KUbernetes REboot Daemon)](https://github.com/kubereboot/kured) to safely automate the reboot process. Kured ensures that:
@@ -35,8 +29,6 @@ We use [**Kured** (KUbernetes REboot Daemon)](https://github.com/kubereboot/kure
 - Reboots occur **only during the defined window**
 - Nodes are **cordoned**, **drained**, and **gracefully reintegrated** after reboot.

----
-
 ## ✅ Application Requirements

 To avoid service disruption during node maintenance, applications **must be designed for high availability**. Specifically:
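
For orientation, the cordon, drain, and reintegrate cycle that Kured automates in this hunk corresponds roughly to the manual `kubectl` steps sketched below. This is an illustration, not Kured's implementation. The function name `print_reboot_cycle` and the node name `worker-1` are hypothetical, the drain flags are an assumption about a typical setup, and the commands are only printed, never run.

```shell
#!/bin/sh
# Sketch: the manual kubectl equivalent of one cordon/drain/reboot/uncordon cycle.
# Kured automates this; the commands are only printed here, not executed.

print_reboot_cycle() {
  node="$1"   # hypothetical node name supplied by the caller
  # Stop new pods from being scheduled on the node.
  echo "kubectl cordon $node"
  # Evict running pods; the flags are an assumed, commonly used combination.
  echo "kubectl drain $node --ignore-daemonsets --delete-emptydir-data"
  echo "# reboot the node and wait for it to report Ready"
  # Allow scheduling again once the node is back.
  echo "kubectl uncordon $node"
}

print_reboot_cycle "worker-1"
```

Workloads with a single replica are evicted at the drain step, which is why the application requirements in this file matter during node maintenance.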
@@ -50,10 +42,7 @@ To avoid service disruption during node maintenance, applications **must be desi

 > ❗ Applications that do not meet these requirements **may experience temporary disruption** during node reboots.

----
-
 ## 👩‍💻 Need Help?

 If you have questions or need help preparing your applications for rolling node maintenance, please contact the Network and Cloud team via Service Desk ticket.

-Thank you for your cooperation and commitment to building robust, cloud-native services.
