Commit cb39104

Add Kubernetes Updates docs
1 parent fa72290 commit cb39104

3 files changed: +110 -0 lines changed

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
# Kubernetes Cluster Upgrade Policy

To maintain a secure, stable, and supported platform, we regularly upgrade our Kubernetes clusters. We use **[RKE2](https://docs.rke2.io/)** as our Kubernetes distribution.

---

## 🔄 Upgrade Flow

- **Phased Rollout**:
    - Upgrades are first applied to **TDS clusters** (Test and Development Systems).
    - After a **minimum of 2 weeks**, if no critical issues are observed, the same upgrade will be applied to **PROD clusters**.

- **No Fixed Schedule**:
    - Upgrades are not done on a strict calendar basis.
    - Timing may depend on compatibility with **other infrastructure components** (e.g., storage, CNI plugins, monitoring tools).
    - However, all clusters will be upgraded **before the current Kubernetes version reaches End of Life (EOL)**.

---

## ⚠️ Upgrade Impact

The **impact of a Kubernetes upgrade can vary**, depending on the nature of the changes involved:

- **Minimal Impact**:
    - For example, upgrades that affect only the `kubelet` may be **transparent to workloads**.
    - Rolling restarts may occur, but no downtime is expected for well-configured applications.

- **Potentially Disruptive**:
    - Upgrades involving components such as the **CNI (Container Network Interface)** may cause **temporary network interruptions**.
    - Other control plane or critical component updates might cause short-lived disruption to scheduling or connectivity.

> 💡 Applications that follow cloud-native best practices (e.g., readiness probes, multiple replicas, graceful shutdown handling) are **less likely to be impacted** by upgrades.
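
As a rough illustration of the practices named in the note above, the following Deployment sketch combines multiple replicas, a readiness probe, and graceful shutdown handling. Every name and value in it (`example-app`, the image, the `/healthz` path, port 8080, the timings) is a hypothetical placeholder rather than a platform requirement, and the `preStop` hook assumes the container image ships a `sleep` binary.

```yaml
# Illustrative only: names, image, paths, and timings are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 3                      # more than one replica, so one pod can restart at a time
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1            # replace pods gradually
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      # Time the pod gets between SIGTERM and SIGKILL to finish in-flight work.
      terminationGracePeriodSeconds: 30
      containers:
        - name: app
          image: registry.example.com/example-app:1.0.0
          ports:
            - containerPort: 8080
          # Traffic is routed to the pod only while this probe succeeds.
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 5
            failureThreshold: 3
          lifecycle:
            preStop:
              exec:
                # Short pause so endpoints are updated before the process stops.
                command: ["sleep", "5"]
```
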

---

## ✅ What You Can Expect

- Upgrades are performed using safe, tested procedures with minimal risk to production workloads.
- TDS clusters serve as a **canary environment**, allowing us to identify issues early.
- All clusters are kept **aligned with supported Kubernetes versions**.

---

## 💬 Questions?

If you have any questions about upcoming Kubernetes upgrades or want help verifying your application’s readiness, please contact the Network and Cloud team via a Service Desk ticket.

Thank you for your support and collaboration in keeping our platform secure and reliable.

docs/kubernetes/node-upgrades.md

Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
# Kubernetes Node OS Update Policy

To ensure the **security** and **stability** of our infrastructure, CSCS will perform **monthly OS updates** on all nodes of our Kubernetes clusters.

---

## 🔄 Maintenance Schedule

- **Frequency**: The **first week of every month**
- **Reboot Window**: **Monday to Friday**, between **09:00 and 15:00**
- **Time Zone**: Europe/Zurich

These updates include important security patches and system updates for the operating systems of the cluster nodes.

> ⚠️ **Note:** Nodes will be **rebooted only if required** by the updates. If no reboot is necessary, nodes will remain in service without disruption.

---

## 🚨 Urgent Security Patches

In the event of a **critical zero-day vulnerability**, we will apply patches and perform reboots (if required) **as soon as possible**, outside of the regular update schedule if needed.

- Affected nodes will be updated **immediately** to protect the platform.
- Users will be notified ahead of time **when possible**.
- Standard safety and rolling reboot practices will still be followed.

---

## 🛠️ Reboot Management with Kured

We use [**Kured** (KUbernetes REboot Daemon)](https://github.com/kubereboot/kured) to safely automate the reboot process. Kured ensures that:

- Reboots are triggered **only when necessary** (e.g., after kernel updates).
- Nodes are rebooted **one at a time** to avoid service disruption.
- Reboots occur **only during the defined window**.
- Nodes are **cordoned** and **drained** before the reboot, then **gracefully reintegrated** afterwards.
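
The reboot window from this policy could be expressed with Kured's command-line flags roughly as shown below. This is a sketch of a DaemonSet container excerpt, not the configuration actually deployed on the clusters: the weekday, time, and time zone values mirror the schedule above, while the check period and the unpinned image reference are assumptions. Kured's window flags select weekdays and hours only, so on their own they do not restrict reboots to the first week of the month.

```yaml
# Excerpt of a Kured DaemonSet pod spec (illustrative, not the deployed configuration).
containers:
  - name: kured
    image: ghcr.io/kubereboot/kured   # pin a specific version tag in practice
    command:
      - /usr/bin/kured
    args:
      # Reboot only on working days ...
      - --reboot-days=mon,tue,wed,thu,fri
      # ... and only between 09:00 and 15:00 ...
      - --start-time=09:00
      - --end-time=15:00
      # ... interpreted in the policy's time zone.
      - --time-zone=Europe/Zurich
      # File whose presence signals that the node needs a reboot.
      - --reboot-sentinel=/var/run/reboot-required
      # How often the sentinel is checked (assumed value).
      - --period=1h
```
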

---

## ✅ Application Requirements

To avoid service disruption during node maintenance, applications **must be designed for high availability**. Specifically:

- Use **multiple replicas** spread across nodes.
- Follow **cloud-native best practices**, including:
    - Proper **readiness** and **liveness probes**
    - **Graceful shutdown** support
    - **Stateless design** or resilient handling of state
    - Appropriate **resource requests and limits**

> ❗ Applications that do not meet these requirements **may experience temporary disruption** during node reboots.
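
The requirements above might translate into manifests along the following lines. This is a minimal sketch under stated assumptions: the name `example-api`, the image, the probe paths, the port, and all resource figures are hypothetical. The PodDisruptionBudget is an addition not mentioned in this policy; it is included because node drains that go through the eviction API respect it, so with three replicas and `minAvailable: 2` a drain can evict only one pod at a time.

```yaml
# Illustrative only: names, image, probe paths, and resource figures are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
    spec:
      # Prefer placing replicas on different nodes so one reboot affects one replica.
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: example-api
      terminationGracePeriodSeconds: 30   # time allowed for graceful shutdown
      containers:
        - name: api
          image: registry.example.com/example-api:1.0.0
          ports:
            - containerPort: 8080
          readinessProbe:                 # gates traffic to the pod
            httpGet:
              path: /ready
              port: 8080
          livenessProbe:                  # restarts a container that stops responding
            httpGet:
              path: /live
              port: 8080
            initialDelaySeconds: 10
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-api
spec:
  minAvailable: 2                         # keep at least two replicas running during drains
  selector:
    matchLabels:
      app: example-api
```
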

---

## 👩‍💻 Need Help?

If you have questions or need help preparing your applications for rolling node maintenance, please contact the Network and Cloud team via a Service Desk ticket.

Thank you for your cooperation and commitment to building robust, cloud-native services.

mkdocs.yml

Lines changed: 3 additions & 0 deletions
@@ -118,6 +118,9 @@ nav:
       - 'LLM Inference': guides/mlp_tutorials/llm-inference.md
       - 'LLM Finetuning': guides/mlp_tutorials/llm-finetuning.md
       - 'LLM Training': guides/mlp_tutorials/llm-nanotron-training.md
+  - 'Kubernetes':
+    - 'Kubernetes Upgrades': kubernetes/kubernetes-upgrades.md
+    - 'Node OS Upgrades': kubernetes/node-upgrades.md
   - 'Policies':
     - policies/index.md
     - 'User Regulations': policies/regulations.md
