Commit f48fb94

chore: prepare changelog for v1.0.0-rc1 release
1 parent a9e3a25 commit f48fb94

2 files changed: +104 -0 lines changed

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -3,6 +3,7 @@
All notable changes to this project will be documented in version specific
files.

+- [CHANGELOG-1.0.md](./CHANGELOG/CHANGELOG-1.0.md)
- [CHANGELOG-0.4.md](./CHANGELOG/CHANGELOG-0.4.md)
- [CHANGELOG-0.3.md](./CHANGELOG/CHANGELOG-0.3.md)
- [CHANGELOG-0.2.md](./CHANGELOG/CHANGELOG-0.2.md)

CHANGELOG/CHANGELOG-1.0.md

Lines changed: 103 additions & 0 deletions
@@ -0,0 +1,103 @@
## v1.0.0-rc1

### Added

- The Slurm Helm Chart can now be configured with `PrologSlurmctld` and
  `EpilogSlurmctld`.
- Add arm64 support and multiarch manifest.
- Added NodePort to `v1alpha1.ServiceSpec`.
- Added pod hostname resolution of NodeSet pods.
- Adds hostname label to pods for Slurm node mapping.
- Synchronize Kubernetes node [un]cordon state to NodeSet pods and their Slurm
  nodes. When Kubernetes nodes are cordoned, NodeSet pods running on those nodes
  are also cordoned and their Slurm nodes drained. Those NodeSet pods remain
  cordoned until the Kubernetes node becomes uncordoned.
- Implements graceful NodeSet pod disruption handling.
- Added metrics-server-bind-address command-line option for the slurm-operator
  controller.
- Added liveness probe to slurmrestd container, which will restart its pod if it
  becomes unresponsive long enough.
- Custom Slurm node drain message for kubectl cordon.
- Adds dynamic node tainting.
- Can now support hybrid clusters, where one or more Slurm components exist
  externally to Kubernetes but are joined to the same Slurm cluster.
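
The [un]cordon synchronization entry above can be pictured as a small decision function. The following is only an illustrative sketch in Python, not the operator's actual code: `PodState`, `plan_cordon_sync`, and the action names are hypothetical, and it assumes that a pod's cordon state was set only by this synchronization.

```python
from dataclasses import dataclass


@dataclass
class PodState:
    """Hypothetical view of one NodeSet pod and the Kubernetes node it runs on."""
    name: str
    kube_node_unschedulable: bool  # True when the Kubernetes node is cordoned
    pod_cordoned: bool             # True when the NodeSet pod is marked cordoned


def plan_cordon_sync(pod: PodState) -> list:
    """Return the (hypothetical) actions that align the pod with its node.

    Assumption: the pod was only ever cordoned by this sync, so an uncordoned
    Kubernetes node always implies the pod should be uncordoned too.
    """
    if pod.kube_node_unschedulable and not pod.pod_cordoned:
        # Node was cordoned: cordon the pod and drain its Slurm node.
        return ["cordon_pod", "drain_slurm_node"]
    if not pod.kube_node_unschedulable and pod.pod_cordoned:
        # Node was uncordoned: release the pod and undrain its Slurm node.
        return ["uncordon_pod", "undrain_slurm_node"]
    # States already agree: nothing to do.
    return []
```

Returning an empty plan when the two states already agree keeps the sketch idempotent, so repeated reconcile passes do not issue redundant requests.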

### Fixed

- Fixes parsing of `ServiceSpec` via `ServiceSpecWrapper`.
- Correctly use global imagePullPolicy as the default value for all containers.
- Determine cluster domain instead of assuming the default (`cluster.local`).
- Update kubeVersion parsing to handle provider suffixes (e.g., GKE
  `x.y.z-gke.a`).
- Fixed odd number of arguments logger error when updating pod conditions.
- Avoid needless NotFound errors when patching pod conditions.
- Fixed regression where nodeset `partition.enabled` was not being respected.
- Initial NodeSet no longer accidentally owns the worker service.
- Fixed issue where changes to slurmd and/or logfile subobjects were not
  causing a rolling update.
- Fixed notation used to refer to LoginSets in installation docs.
- Fixed documentation for uninstalling slurm-operator-crds.
- When checking if a Slurm node is fully drained, the logic now closely follows
  how Slurm represents the drained state. There were certain edge cases that
  could report that the node was not drained when it actually was.
- Check if the Slurm node is already [un]drained before requesting the opposite.
  This avoids a race condition where an admin or script has applied [un]drain
  to the Slurm node but the operator is not aware of it.
- When Slurm nodes are put into drain state, the provided reason should not be
  thrashed by subsequent drain requests.
- Fixed installation instruction for cert-manager chart.
- Fixes bug whereby slurm-controller hostname was set incorrectly.
- Fixes per-nodeset partition creation.
- Fixed chart installation failure where NOTES.txt failed to fetch value from
  nested object where the parent was null.
- Fixed imagePullPolicy in slurm-operator Helm chart.
- Fixes edge case where Slurm node state is not reset when a worker pod migrates
  kube nodes.
- Reduce checksum collision during file change detection by using SHA256 instead
  of MD5.
- When `CgroupPlugin=disabled`, do not configure `PrologFlags=Contain` and other
  parameters that depend on it.
- Added liveness probe to slurmd container to restart the pod if slurmd crashes
  after starting.
- Prevent Slurm node undrain when node is down or not responding.
- Fixed reason prefixing behavior in MakeNodeUndrain.
- Default webhook timeout is now consistent across all endpoints, respecting the
  user input, otherwise using the Kubernetes default.
- Fixed case where multiple env variables in LoginSet would cause the operator
  to keep updating the LoginSet Deployment causing the underlying ReplicaSet to
  endlessly thrash.
- Fixed case where adding NodeSets to or removing them from the Slurm cluster
  did not trigger a reconfigure.
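
As a rough illustration of the SHA256 entry above, checksum-based change detection can be sketched as follows. This is a minimal Python example rather than the operator's implementation: `digest_files` and `has_changed` are hypothetical names, and representing the watched files as a name-to-bytes mapping is an assumption made to keep the example self-contained.

```python
import hashlib


def digest_files(files):
    """Return one SHA-256 hex digest over a {filename: content bytes} mapping.

    Names are visited in sorted order so the digest does not depend on
    insertion order. Each name and content is length-prefixed so that
    different file sets cannot produce the same byte stream.
    """
    h = hashlib.sha256()
    for name in sorted(files):
        encoded_name = name.encode("utf-8")
        data = files[name]
        h.update(len(encoded_name).to_bytes(8, "big"))
        h.update(encoded_name)
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


def has_changed(files, previous_digest):
    """Report whether the files differ from the previously recorded digest."""
    return digest_files(files) != previous_digest
```

A caller would record the digest once, then compare on each check and act (for example, trigger a reconfigure) only when `has_changed` returns `True`.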

### Changed

- Organized documentation into sub-directories.
- Updates the paths used to refer to the user's home directory in installation
  instructions.
- Slurm node [un]drain activity now includes more context.
- Made the NodeSet updateStrategy configurable in the Slurm Helm chart. The
  default minUnavailable was changed to 25%.
- Shortened naming schema for health and metrics addresses.
- Exposed addresses for health and metrics of the slurm-operator controller pod
  via the Helm chart.
- slurmctld - The reconfigure container is now a sidecar instead of main
  container.
- Reduced interval of the reconfigure check. After the kubelet updates mounted
  files in the pod, a reconfigure will be issued more quickly.
- All supplemental containers are now `corev1.Container`, allowing full
  configuration.
- Chart metadata is no longer applied to the pod template.
- Updated NodeSet pod preStop to better indicate why the Slurm node was set to
  DOWN before deletion.
- Service metadata is now configurable separately from the pod template
  metadata.
- Webhooks avoid kube-system namespace.
- Replaced slurm-exporter with a serviceMonitor that scrapes slurmctld directly.
- Move to Slurm v44 API (from v43).

### Removed

- Removed defaulting webhooks.
- Removed v1alpha1 CRDs to cleanly delineate v1 from v0 releases. Going forward,
  old versions of CRDs in v1 releases will linger in a deprecated state and be
  removed in future releases as needed.
