| 1 | +# Kubernetes Nodes OS Update Policy |
| 2 | + |
| 3 | +To ensure the **security** and **stability** of our infrastructure, CSCS will perform **monthly OS updates** on all nodes of our Kubernetes clusters. |
| 4 | + |
| 5 | +--- |
| 6 | + |
| 7 | +## 🔄 Maintenance Schedule |
| 8 | + |
| 9 | +- **Frequency**: The **first week of every month** |
| 10 | +- **Reboot Window**: **Monday to Friday**, between **09:00 and 15:00** |
| 11 | +- **Time Zone**: Europe/Zurich |
| 12 | + |
| 13 | +These updates deliver security patches and other package updates for the operating system of the cluster nodes. |
| 14 | + |
| 15 | +> ⚠️ **Note:** Nodes will be **rebooted only if required** by the updates. If no reboot is necessary, nodes will remain in service without disruption. |
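The schedule above can be turned into a small check, for example to decide whether a long-running job is likely to overlap with node reboots. This is only a sketch: the function name `in_reboot_window` is hypothetical, and it assumes that "first week of the month" means calendar days 1 to 7, which the policy does not spell out.

```python
from datetime import datetime, time
from zoneinfo import ZoneInfo

ZURICH = ZoneInfo("Europe/Zurich")
WINDOW_START = time(9, 0)   # 09:00 local time
WINDOW_END = time(15, 0)    # 15:00 local time


def in_reboot_window(moment: datetime) -> bool:
    """Return True if `moment` falls inside the regular reboot window."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    # Convert to Europe/Zurich so that DST changes are handled for us.
    local = moment.astimezone(ZURICH)
    # Assumption: "first week of the month" means calendar days 1 to 7.
    in_first_week = local.day <= 7
    is_weekday = local.weekday() < 5  # Monday=0 ... Friday=4
    in_hours = WINDOW_START <= local.time() < WINDOW_END
    return in_first_week and is_weekday and in_hours
```

For example, `in_reboot_window(datetime.now(ZURICH))` reports whether a reboot could happen right now under the regular schedule. Urgent security patches are not covered by this check, since they may be applied outside the window.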
| 16 | + |
| 17 | +--- |
| 18 | + |
| 19 | +## 🚨 Urgent Security Patches |
| 20 | + |
| 21 | +In the event of a **critical zero-day vulnerability**, we will apply patches and perform reboots (if required) **as soon as possible**, outside of the regular update schedule if needed. |
| 22 | + |
| 23 | +- Affected nodes will be updated **immediately** to protect the platform. |
| 24 | +- Users will be notified ahead of time **when possible**. |
| 25 | +- Standard safety and rolling reboot practices will still be followed. |
| 26 | + |
| 27 | +--- |
| 28 | + |
| 29 | +## 🛠️ Reboot Management with Kured |
| 30 | + |
| 31 | +We use [**Kured** (KUbernetes REboot Daemon)](https://github.com/kubereboot/kured) to safely automate the reboot process. Kured ensures that: |
| 32 | + |
| 33 | +- Reboots are triggered **only when necessary** (e.g., after kernel updates). |
| 34 | +- Nodes are rebooted **one at a time** to avoid service disruption. |
| 35 | +- Reboots occur **only during the defined window**. |
| 36 | +- Nodes are **cordoned** and **drained** before the reboot, then **gracefully reintegrated** (uncordoned) afterwards. |
| 37 | + |
| 38 | +--- |
| 39 | + |
| 40 | +## ✅ Application Requirements |
| 41 | + |
| 42 | +To avoid service disruption during node maintenance, applications **must be designed for high availability**. Specifically: |
| 43 | + |
| 44 | +- Use **multiple replicas** spread across nodes. |
| 45 | +- Follow **cloud-native best practices**, including: |
| 46 | + - Proper **readiness** and **liveness probes** |
| 47 | + - **Graceful shutdown** support |
| 48 | + - **Stateless design** or resilient handling of state |
| 49 | + - Appropriate **resource requests and limits** |
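As an illustration of the requirements above, the following sketch builds a Deployment manifest that combines several of them: multiple replicas, spreading across nodes, probes, and resource requests and limits. All concrete values are assumptions made for the example (the helper `build_deployment`, the app name, the image, the `/healthz` path, port 8080, and the resource figures); adapt them to your workload.

```python
import json


def build_deployment(name: str, image: str, replicas: int = 3) -> dict:
    """Build an apps/v1 Deployment manifest as a plain dict (illustrative values)."""
    labels = {"app": name}
    # Hypothetical health endpoint; your application must actually serve it.
    probe = {"httpGet": {"path": "/healthz", "port": 8080}}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,  # more than one replica survives a node drain
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Time allowed between SIGTERM and SIGKILL.
                    "terminationGracePeriodSeconds": 30,
                    # Prefer placing replicas on different nodes.
                    "topologySpreadConstraints": [{
                        "maxSkew": 1,
                        "topologyKey": "kubernetes.io/hostname",
                        "whenUnsatisfiable": "ScheduleAnyway",
                        "labelSelector": {"matchLabels": labels},
                    }],
                    "containers": [{
                        "name": name,
                        "image": image,
                        "ports": [{"containerPort": 8080}],
                        "readinessProbe": {**probe, "periodSeconds": 5},
                        "livenessProbe": {**probe, "initialDelaySeconds": 10,
                                          "periodSeconds": 10},
                        "resources": {
                            "requests": {"cpu": "100m", "memory": "128Mi"},
                            "limits": {"cpu": "500m", "memory": "256Mi"},
                        },
                    }],
                },
            },
        },
    }


if __name__ == "__main__":
    manifest = build_deployment("my-app", "registry.example.com/my-app:1.0")
    print(json.dumps(manifest, indent=2))
```

Since `kubectl` accepts JSON as well as YAML, the printed output could be piped to `kubectl apply -f -`. The same fields can of course be written directly in a YAML manifest.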
| 50 | + |
| 51 | +> ❗ Applications that do not meet these requirements **may experience temporary disruption** during node reboots. |
| 52 | + |
| 53 | +--- |
| 54 | + |
| 55 | +## 👩‍💻 Need Help? |
| 56 | + |
| 57 | +If you have questions or need help preparing your applications for rolling node maintenance, please contact the Network and Cloud team via a Service Desk ticket. |
| 58 | + |
| 59 | +Thank you for your cooperation and commitment to building robust, cloud-native services. |