1.23.0
Changes since version 1.22.3, up to version 1.23.0:
Features
- Slurm 25.05.3
- PR: #1587
- Configurable values for ActiveChecks chart
- PR: #1562
- deprecate worker field and add nodeSetRefs validation
- PR: #1627
- SCHED-124: add structured partitions
- PR: #1632
- SCHED-138: add SOPERATOR_NODE_SETS_ON=true for static worker configuration
- PR: #1638
- Ensure healthy nodes check
- PR: #1645
- SCHED-137: add --instance-id and --extra to both static and dynamic config
- PR: #1651
- Tailscale support
- PR: #1657
- SCHED-162: add all_reduce_perf in docker
- PR: #1659
- SCHED-155: automatic block topology
- PR: #1658
- Build Slurm 25.05.4
- PR: #1661
- add retrigger check
- PR: #1663
- SCHED-180: Refactor ActiveCheck CRD additional printer columns
- PR: #1673
- SCHED-249: add podMonitor
- PR: #1681
- SCHED-243: Make Nebius Mk8s conditions configurable
- PR: #1696
- Add ib-gpu-perf check
- PR: #1665
- Change order of docker image installation for better cache and simplify structure
- PR: #1715
- SCHED-250: Alpha version of NodeSet reconciliation
- PR: #1699
- SCHED-208 SCHED-209 SCHED-210 Refactoring Active Check helm charts and related controller behavior
- PR: #1726
- Ansible for managing Jail state
- PR: #1742
- Update dcgmi version to 1:4.4.2-1
- PR: #1780
- SCHED-380: Skip maintenance handling based on node labels
- PR: #1773
- SCHED-248 Set-unhealthy on extensive check failure with compute instance id and check run id
- PR: #1757
- SCHED-421 Move enable-node-replacement to separate param in values
- PR: #1820
- Exporter: add reservation name as a label to slurm_node_info
- PR: #1838
- SCHED-413 Enable ib perf gpu
- PR: #1843
Fixes
- Add optional HostUsers for all components; for workers the default is false
- PR: #1512
- Update slurm active check status when submission failed + Renaming
- PR: #1571
- add logs format for leader elections
- PR: #1619
- Rename createuser to soperator-createuser to avoid PostgreSQL conflict
- PR: #1650
- SCHED-247: Rename K8s node condition MaintenanceScheduled->NebiusMaintenanceScheduled
- PR: #1675
- Helm: allow setting customSlurmConfig
- PR: #1731
- Fix: set versions annotation for AdvancedStatefulSet
- PR: #1730
- Allow setting tolerations for controllerManager
- PR: #1733
- Fix: Reconciler error in jailedconfig
- PR: #1738
- SCHED-165 Update soperator notifier helm to tag job owner in Slack
- PR: #1725
- Fix: Check if driver installed in /run/nvidia/driver
- PR: #1744
- Run hc_program passive checks more often
- PR: #1759
- Fix ActiveCheck additional printer columns
- PR: #1760
- Fix bug in maintenanceIgnoreNodeLabels
- PR: #1783
- Fix syslog parsing on Ubuntu 24.04
- PR: #1788
- fix active checks output
- PR: #1805
- SCHED-417, SCHED-430, SCHED-437, SCHED-439, SCHED-441, SCHED-429: Pre-release 1.23 fixes related to extensive checks
- PR: #1822
- Don't use `:` in extensive check reservation names
- PR: #1833
- NOTIC: Don't use sudo in soperator-outputs-logs-cleaner
- PR: #1835
- SCHED-376 Update health-checker version (with new nccl-with-ib and ib-gpu-perf limits)
- PR: #1840
- NOTIC: Don't use sudo in all-reduce-perf-nccl-in-docker
- PR: #1846
- SCHED-485: First slurmJob active checks may fail in soperator
- PR: #1855
- Preinstall sudo in active check images
- PR: #1859
- NOTIC: Use sudo only in active checks that don't use containers
- PR: #1862
Dependencies
- Bump sigs.k8s.io/controller-runtime from 0.21.0 to 0.22.1
- PR: #1616
- Bump golang.org/x/crypto from 0.41.0 to 0.42.0
- PR: #1568
- Bump github.com/prometheus/client_golang from 1.23.0 to 1.23.2
- PR: #1622
- Bump github.com/onsi/ginkgo/v2 from 2.25.2 to 2.25.3
- PR: #1625
- Bump github.com/zclconf/go-cty from 1.16.4 to 1.17.0
- PR: #1624
- Bump github.com/gruntwork-io/terratest from 0.50.0 to 0.51.0
- PR: #1626
- Bump docker/login-action from 3.5.0 to 3.6.0
- PR: #1631
- Bump github.com/onsi/ginkgo/v2 from 2.25.3 to 2.26.0
- PR: #1634
- Bump softprops/action-gh-release from 2.3.3 to 2.3.4
- PR: #1635
- Bump sigs.k8s.io/controller-runtime from 0.22.1 to 0.22.2
- PR: #1644
- Bump golang.org/x/sys from 0.36.0 to 0.37.0
- PR: #1643
- Bump github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring from 0.85.0 to 0.86.0
- PR: #1642
- Bump softprops/action-gh-release from 2.3.4 to 2.4.0
- PR: #1640
- Bump golang.org/x/crypto from 0.42.0 to 0.43.0
- PR: #1646
- Bump sigs.k8s.io/controller-runtime from 0.22.2 to 0.22.3
- PR: #1653
- Bump github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring from 0.86.0 to 0.86.1
- PR: #1652
- Bump github.com/gruntwork-io/terratest from 0.51.0 to 0.52.0
- PR: #1690
- Bump github.com/onsi/ginkgo/v2 from 2.26.0 to 2.27.2
- PR: #1698
- Bump actions/download-artifact from 5 to 6
- PR: #1693
- Bump actions/upload-artifact from 4 to 5
- PR: #1689
- Bump softprops/action-gh-release from 2.4.0 to 2.4.1
- PR: #1654
- Bump sigs.k8s.io/controller-runtime from 0.22.3 to 0.22.4
- PR: #1719
- Bump mikepenz/release-changelog-builder-action from 5.4.1 to 6.0.1
- PR: #1720
- Bump github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring from 0.86.1 to 0.86.2
- PR: #1752
- Bump softprops/action-gh-release from 2.4.1 to 2.4.2
- PR: #1753
- Bump golang.org/x/crypto from 0.43.0 to 0.44.0
- PR: #1767
- Bump k8s.io/api from 0.34.1 to 0.34.2
- PR: #1770
- Bump k8s.io/component-base from 0.34.1 to 0.34.2
- PR: #1771
Other
- NFS Server helm chart fixes
- PR: #1569
- Bump Go to 1.25
- PR: #1613
- Metrics for SlurmCluster CR via KubeStateMetrics config
- PR: #1633
- Set `driftDetection.mode: warn` by default for helm releases
- PR: #1669
- Separate versioning for NFS server image and chart
- PR: #1758
- Allow volume size increase for filestore PVCs and PVs
- PR: #1786
- Update dcgm-exporter
- PR: #1811
Contributors:
@Uburro, @ChessProfessor, @ali-sattari, @github-actions[bot], @asteny, @itechdima, @theyoprst, @dependabot[bot], @dstaroff, @rdjjke, @andriishestakov
| Categorized PRs | Uncategorized PRs | Commits | Lines added | Lines deleted |
|---|---|---|---|---|
| 5958 | 403 | 464 | 79600 | 24660 |