|
| 1 | +## v1.0.0-rc1 |
| 2 | + |
| 3 | +### Added |
| 4 | + |
| 5 | +- The Slurm Helm Chart can now be configured with `PrologSlurmctld` and |
| 6 | + `EpilogSlurmctld`. |
| 7 | +- Add arm64 support and multiarch manifest. |
| 8 | +- Added NodePort to `v1alpha1.ServiceSpec`. |
| 9 | +- Added pod hostname resolution of NodeSet pods. |
| 10 | +- Adds hostname label to pods for Slurm node mapping. |
| 11 | +- Synchronize Kubernetes node [un]cordon state to NodeSet pods and their Slurm |
| 12 | + nodes. When Kubernetes nodes are cordoned, NodeSet pods running on those nodes |
| 13 | + are also cordoned and their Slurm nodes drained. Those NodeSet pods remain |
| 14 | + cordoned until the Kubernetes node becomes uncordoned. |
| 15 | +- Implements graceful NodeSet pod disruption handling. |
| 16 | +- Added metrics-server-bind-address command-line option for the slurm-operator |
| 17 | + controller. |
| 18 | +- Added liveness probe to slurmrestd container, which will restart its pod if it |
| 19 | + becomes unresponsive long enough. |
| 20 | +- Custom Slurm node drain message for kubectl cordon. |
| 21 | +- Adds dynamic node tainting. |
| 22 | +- Can now support hybrid clusters, where one or more Slurm components exist |
| 23 | + externally to Kubernetes but are joined to the same Slurm cluster. |
| 24 | + |
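
The cordon synchronization entry above describes a mapping from Kubernetes node state to Slurm node state. A minimal sketch of that mapping follows; it is not the operator's implementation. The `Action` type and `desiredAction` function are hypothetical names, `kubeNodeCordoned` is assumed to mirror the node's `spec.unschedulable` field, and the sketch deliberately ignores nodes that an admin drained for unrelated reasons.

```go
package main

import "fmt"

// Action is a hypothetical name for the next step a reconciler would take.
type Action string

const (
	ActionNone    Action = "none"
	ActionDrain   Action = "drain"
	ActionUndrain Action = "undrain"
)

// desiredAction maps the Kubernetes node's cordon state onto the Slurm node.
// kubeNodeCordoned is assumed to mirror the node's spec.unschedulable field;
// slurmNodeDrained is the Slurm node's drain state as last observed.
func desiredAction(kubeNodeCordoned, slurmNodeDrained bool) Action {
	switch {
	case kubeNodeCordoned && !slurmNodeDrained:
		return ActionDrain
	case !kubeNodeCordoned && slurmNodeDrained:
		return ActionUndrain
	default:
		// The two states already agree, so no request is needed.
		return ActionNone
	}
}

func main() {
	fmt.Println(desiredAction(true, false))
	fmt.Println(desiredAction(true, true))
	fmt.Println(desiredAction(false, true))
}
```

Returning `ActionNone` when the states already agree keeps the loop idempotent, which matches the spirit of the Fixed entry about checking the current [un]drain state before requesting the opposite.
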
| 25 | +### Fixed |
| 26 | + |
| 27 | +- Fixes parsing of `ServiceSpec` via `ServiceSpecWrapper`. |
| 28 | +- Correctly use global imagePullPolicy as the default value for all containers. |
| 29 | +- Determine cluster domain instead of assuming the default (`cluster.local`). |
| 30 | +- Update kubeVersion parsing to handle provider suffixes (e.g., GKE |
| 31 | + `x.y.z-gke.a`). |
| 32 | +- Fixed odd number of arguments logger error when updating pod conditions. |
| 33 | +- Avoid needless NotFound errors when patching pod conditions. |
| 34 | +- Fixed regression where nodeset `partition.enabled` was not being respected. |
| 35 | +- Initial NodeSet no longer accidentally owns the worker service. |
| 36 | +- Fixed issue where changes to slurmd and/or logfile subobjects were not |
| 37 | + causing a rolling update. |
| 38 | +- Fixed notation used to refer to LoginSets in installation docs. |
| 39 | +- Fixed documentation for uninstalling slurm-operator-crds. |
| 40 | +- When checking if a Slurm node is fully drained, the logic now closely follows |
| 41 | + how Slurm represents the drained state. Previously, certain edge cases could |
| 42 | + report the node as not drained when it actually was. |
| 43 | +- Check whether the Slurm node is already [un]drained before requesting the |
| 44 | + opposite. This avoids a race condition where an admin or script has applied |
| 45 | + [un]drain to the Slurm node but the operator is not aware of it. |
| 46 | +- When Slurm nodes are put into drain state, the provided reason is no longer |
| 47 | + overwritten by subsequent drain requests. |
| 48 | +- Fixed installation instruction for cert-manager chart. |
| 49 | +- Fixes bug whereby slurm-controller hostname was set incorrectly. |
| 50 | +- Fixes per-nodeset partition creation. |
| 51 | +- Fixed chart installation failure where NOTES.txt failed to fetch value from |
| 52 | + nested object where the parent was null. |
| 53 | +- Fixed imagePullPolicy in slurm-operator Helm chart. |
| 54 | +- Fixes edge case where Slurm node state is not reset when a worker pod migrates |
| 55 | + between Kubernetes nodes. |
| 56 | +- Reduce checksum collision during file change detection by using SHA256 instead |
| 57 | + of MD5. |
| 58 | +- When `CgroupPlugin=disabled`, do not configure `PrologFlags=Contain` and other |
| 59 | + parameters that depend on it. |
| 60 | +- Added liveness probe to slurmd container to restart the pod if slurmd crashes |
| 61 | + after starting. |
| 62 | +- Prevent Slurm node undrain when node is down or notresponding. |
| 63 | +- Fixed reason prefixing behavior in MakeNodeUndrain. |
| 64 | +- Default webhook timeout is now consistent across all endpoints, respecting the |
| 65 | + user input, otherwise using the Kubernetes default. |
| 66 | +- Fixed case where multiple env variables in LoginSet would cause the operator |
| 67 | + to keep updating the LoginSet Deployment causing the underlying ReplicaSet to |
| 68 | + endlessly thrash. |
| 69 | +- Fixed case where NodeSets being added or removed from the Slurm cluster was |
| 70 | + not triggering a reconfigure. |
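
The kubeVersion entry in this list concerns version strings that carry a provider suffix. One way to tolerate such suffixes is to read only the leading `major.minor.patch` and ignore the rest. The sketch below assumes that approach; it is not taken from the chart. `parseKubeVersion` is a hypothetical helper, and the version strings are illustrative.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// coreVersion matches an optional "v", then major.minor.patch, and ignores
// anything after the patch number (for example "-gke.1043000").
var coreVersion = regexp.MustCompile(`^v?(\d+)\.(\d+)\.(\d+)`)

// parseKubeVersion is a hypothetical helper that returns the numeric
// major, minor, and patch components of a Kubernetes version string.
func parseKubeVersion(s string) (major, minor, patch int, err error) {
	m := coreVersion.FindStringSubmatch(s)
	if m == nil {
		return 0, 0, 0, fmt.Errorf("unrecognized Kubernetes version %q", s)
	}
	// The pattern only admits digits, so Atoi can fail only on overflow.
	if major, err = strconv.Atoi(m[1]); err != nil {
		return 0, 0, 0, err
	}
	if minor, err = strconv.Atoi(m[2]); err != nil {
		return 0, 0, 0, err
	}
	if patch, err = strconv.Atoi(m[3]); err != nil {
		return 0, 0, 0, err
	}
	return major, minor, patch, nil
}

func main() {
	// Illustrative inputs: one with a provider suffix, one without.
	for _, v := range []string{"v1.29.4-gke.1043000", "1.30.0"} {
		major, minor, patch, err := parseKubeVersion(v)
		fmt.Println(major, minor, patch, err)
	}
}
```
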
| 71 | + |
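
The switch from MD5 to SHA256 for file change detection, listed under Fixed, can be illustrated with Go's standard library. This is a minimal sketch rather than the operator's code: `contentChecksum` and `hasChanged` are hypothetical names, and the sample configuration lines are made up for the example.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// contentChecksum returns the hex-encoded SHA-256 digest of data.
func contentChecksum(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// hasChanged reports whether data differs from the content that produced
// the previously recorded checksum.
func hasChanged(previous string, data []byte) bool {
	return contentChecksum(data) != previous
}

func main() {
	// Hypothetical file contents, before and after an edit.
	recorded := contentChecksum([]byte("SlurmctldPort=6817\n"))
	fmt.Println(hasChanged(recorded, []byte("SlurmctldPort=6817\n")))
	fmt.Println(hasChanged(recorded, []byte("SlurmctldPort=6818\n")))
}
```

A SHA-256 digest is 256 bits against MD5's 128, so two different file contents are far less likely to produce the same checksum and hide a change.
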
| 72 | +### Changed |
| 73 | + |
| 74 | +- Organized documentation into sub-directories. |
| 75 | +- Updates the paths used to refer to the user's home directory in installation |
| 76 | + instructions. |
| 77 | +- Slurm node [un]drain activity now includes more context. |
| 78 | +- Made the NodeSet updateStrategy configurable in the Slurm helm chart. The |
| 79 | + default minUnavailable was changed to 25%. |
| 80 | +- Shortened naming schema for health and metrics addresses. |
| 81 | +- Exposed addresses for health and metrics of the slurm-operator controller pod |
| 82 | + via the Helm chart. |
| 83 | +- slurmctld - The reconfigure container is now a sidecar instead of main |
| 84 | + container. |
| 85 | +- Reduced interval of the reconfigure check. After the kubelet updates mounted |
| 86 | + files in the pod, a reconfigure will be issued more quickly. |
| 87 | +- All supplemental containers are now `corev1.Container`, allowing full |
| 88 | + configuration. |
| 89 | +- Chart metadata is no longer applied to the pod template. |
| 90 | +- Updated NodeSet pod preStop to better indicate why the Slurm node was set to |
| 91 | + DOWN before deletion. |
| 92 | +- Service metadata is now configurable separately from the pod template |
| 93 | + metadata. |
| 94 | +- Webhooks avoid kube-system namespace. |
| 95 | +- Replaced slurm-exporter with a serviceMonitor that scrapes slurmctld directly. |
| 96 | +- Move to Slurm v44 API (from v43). |
| 97 | + |
| 98 | +### Removed |
| 99 | + |
| 100 | +- Removed defaulting webhooks. |
| 101 | +- Removed v1alpha1 CRDs to cleanly delineate v1 from v0 releases. Going forward, |
| 102 | + old versions of CRDs in v1 releases will linger in a deprecated state and be |
| 103 | + removed in future releases as needed. |