Releases: nebius/soperator
1.23.2
Changes made since version 1.23.1 prior to version 1.23.2:
🚀 Features
- SCHED-706: Adding node metrics to track node status in Slurm
- PR: #1995
- Upgrade docker version
- PR: #2049
🐛 Fixes
- SCHED-658: remove reaction on the comment in ensure healthy
- PR: #2011
- SCHED-658: fix validation commentPrefix: null and drainReasonPrefix…
- PR: #2018
- SCHED-487 Do not wait for cancelled jobs in wait-for-checks
- PR: #2037
- Disable periodic JobAcctGather stats collection by default
- PR: #2038
- SCHED-789: nvtop 3.2.0.2-1+noble is no longer available
- PR: #2053
📦 Dependencies
- SCHED-656 Upgrade health-checker
- PR: #2040
- SCHED-394 Build and populate jail for both cuda 12 and 13
- PR: #2010
- Bump soperator version to 1.23.2
- PR: #2056
Contributors:
@theyoprst, @Uburro, @github-actions[bot], @ChessProfessor, @rdjjke, @asteny
| 📁 Categorized PRs | 📂 Uncategorized PRs | 📥 Commits | ➕ Lines added | ➖ Lines deleted |
|---|---|---|---|---|
| 737 | 0 | 57 | 1152 | 214 |
1.23.1
Changes made since version 1.23.0 prior to version 1.23.1:
🚀 Features
- Add cleanup_scratch_data optional passive check
- PR: #1868
- SCHED-490 Add a flag to disable extensive check
- PR: #1913
- Adjust rmem and wmem network sysctls by default
- PR: #1928
- Support container customEnv + enable video capability by default
- PR: #1900
- SCHED-619 Creating symlink to the slurm configs for login containers
- PR: #1960
🐛 Fixes
- SCHED-507: (e2e) Install yq in github action
- PR: #1871
- SCHED-565 Change node replacement drain prefix to [compute_maintenance]
- PR: #1904
- SCHED-563 Bump health-checker to 1.0.0-171.251205
- PR: #1906
- SCHED-542: Convert wait-for-soperatorchecks-srun-ready to a k8s check job
- PR: #1889
- bump python3-apt version
- PR: #1921
- NOTIC: fix bugs with duplicate customVolumeMount
- PR: #1925
- Fix nfs_in_k8s TF variable set in E2E
- PR: #1939
- Use unstable NFS version in E2E TF
- PR: #1942
- Bind libslurm.so.* from container to jail
- PR: #1938
- SCHED-609: Fix Enroot containers
- PR: #1948
- SCHED-570 Use storage-driver vfs by default
- PR: #1949
- Make scontrol reboot work by fixing RebootProgram script permissions
- PR: #1958
- Allocate all available memory by default
- PR: #1968
- Run extensive checks on reservations more often
- PR: #1970
- Fix node auto-replacement after maintenance events
- PR: #1973
📦 Dependencies
- SCHED-492 Upgrade cuda version and get rid of dcgmi in ansible
- PR: #1901
Other
- Fix storage class in 1.23
- PR: #1908
Contributors:
@theyoprst, @rdjjke, @github-actions[bot], @ChessProfessor, @ali-sattari, @asteny, @Uburro, @itechdima
| 📁 Categorized PRs | 📂 Uncategorized PRs | 📥 Commits | ➕ Lines added | ➖ Lines deleted |
|---|---|---|---|---|
| 1520 | 43 | 70 | 23722 | 9967 |
1.23.0
Changes made since version 1.22.3 prior to version 1.23.0:
🚀 Features
- Slurm 25.05.3
- PR: #1587
- Configurable values for ActiveChecks chart
- PR: #1562
- deprecate worker field and add nodeSetRefs validation
- PR: #1627
- SCHED-124: add structured partitions
- PR: #1632
- SCHED-138: add SOPERATOR_NODE_SETS_ON=true for static worker configuration
- PR: #1638
- Ensure healthy nodes check
- PR: #1645
- SCHED-137: add --instance-id and --extra both static and dynamic config
- PR: #1651
- Tailscale support
- PR: #1657
- SCHED-162: add all_reduce_perf in docker
- PR: #1659
- SCHED-155: automatic block topology
- PR: #1658
- Build Slurm 25.05.4
- PR: #1661
- add retrigger check
- PR: #1663
- SCHED-180: Refactor ActiveCheck CRD additional printer columns
- PR: #1673
- SCHED-249: add podMonitor
- PR: #1681
- SCHED-243: Make Nebius Mk8s conditions configurable
- PR: #1696
- Add ib-gpu-perf check
- PR: #1665
- Change order of docker image installation for better cache and simplify structure
- PR: #1715
- SCHED-250: Alpha version of NodeSet reconciliation
- PR: #1699
- SCHED-208 SCHED-209 SCHED-210 Refactoring Active Check helm charts and related controller behavior
- PR: #1726
- Ansible for managing Jail state
- PR: #1742
- Update dcgmi version to 1:4.4.2-1
- PR: #1780
- SCHED-380: Skip maintenance handling based on node labels
- PR: #1773
- SCHED-248 Set-unhealthy on extensive check failure with compute instance id and check run id
- PR: #1757
- SCHED-421 Move enable-node-replacement to separate param in values
- PR: #1820
- Exporter: add reservation name as a label to slurm_node_info
- PR: #1838
- SCHED-413 Enable ib perf gpu
- PR: #1843
🐛 Fixes
- add optional HostUsers for all components, but default false for workers
- PR: #1512
- Update slurm active check status when submission failed + Renaming
- PR: #1571
- add logs format for leader elections
- PR: #1619
- Rename createuser to soperator-createuser to avoid PostgreSQL conflict
- PR: #1650
- SCHED-247: Rename K8s node condition MaintenanceScheduled->NebiusMaintenanceScheduled
- PR: #1675
- Helm: allow set customSlurmConfig
- PR: #1731
- Fix: set versions annotation for AdvancedStatefulSet
- PR: #1730
- Allow set tolerations for controllerManager
- PR: #1733
- Fix: Reconciler error in jailedconfig
- PR: #1738
- SCHED-165 Update soperator notifier helm to tag job owner in Slack
- PR: #1725
- Fix: Check if driver installed in /run/nvidia/driver
- PR: #1744
- Run hc_program passive checks more often
- PR: #1759
- Fix ActiveCheck additional printer columns
- PR: #1760
- Fix bug in maintenanceIgnoreNodeLabels
- PR: #1783
- fix syslog parsing on ubuntu24:04
- PR: #1788
- fix active checks output
- PR: #1805
- SCHED-417, SCHED-430, SCHED-437, SCHED-439, SCHED-441, SCHED-429: Pre-release 1.23 fixes related to extensive checks
- PR: #1822
- Don't use `:` in extensive check reservation names
- PR: #1833
- NOTIC: Don't use sudo in soperator-outputs-logs-cleaner
- PR: #1835
- SCHED-376 Update health-checker version (with new nccl-with-ib and ib-gpu-perf limits)
- PR: #1840
- NOTIC: Don't use sudo in all-reduce-perf-nccl-in-docker
- PR: #1846
- SCHED-485: First slurmJob active checks may fail in soperator
- PR: #1855
- Preinstall sudo in active check images
- PR: #1859
- NOTIC: Use sudo only in active checks that don't use containers
- PR: #1862
📦 Dependencies
- Bump sigs.k8s.io/controller-runtime from 0.21.0 to 0.22.1
- PR: #1616
- Bump golang.org/x/crypto from 0.41.0 to 0.42.0
- PR: #1568
- Bump github.com/prometheus/client_golang from 1.23.0 to 1.23.2
- PR: #1622
- Bump github.com/onsi/ginkgo/v2 from 2.25.2 to 2.25.3
- PR: #1625
- Bump github.com/zclconf/go-cty from 1.16.4 to 1.17.0
- PR: #1624
- Bump github.com/gruntwork-io/terratest from 0.50.0 to 0.51.0
- PR: #1626
- Bump docker/login-action from 3.5.0 to 3.6.0
- PR: #1631
- Bump github.com/onsi/ginkgo/v2 from 2.25.3 to 2.26.0
- PR: #1634
- Bump softprops/action-gh-release from 2.3.3 to 2.3.4
- PR: #1635
- Bump sigs.k8s.io/controller-runtime from 0.22.1 to 0.22.2
- PR: #1644
- Bump golang.org/x/sys from 0.36.0 to 0.37.0
- PR: #1643
- Bump github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring from 0.85.0 to 0.86.0
- PR: #1642
- Bump softprops/action-gh-release from 2.3.4 to 2.4.0
- PR: #1640
- Bump golang.org/x/crypto from 0.42.0 to 0.43.0
- PR: #1646
- Bump sigs.k8s.io/controller-runtime from 0.22.2 to 0.22.3
- PR: #1653
- Bump github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring from 0.86.0 to 0.86.1
- PR: #1652
- Bump github.com/gruntwork-io/terratest from 0.51.0 to 0.52.0
- PR: #1690
- Bump github.com/onsi/ginkgo/v2 from 2.26.0 to 2.27.2
- PR: #1698
- Bump actions/download-artifact from 5 to 6
- PR: #1693
- Bump actions/upload-artifact from 4 to 5
- PR: #1689
- Bump softprops/action-gh-release from 2.4.0 to 2.4.1
- PR: #1654
- Bump sigs.k8s.io/controller-runtime from 0.22.3 to 0.22.4
- PR: #1719
- Bump mikepenz/release-changelog-builder-action from 5.4.1 to 6.0.1
- PR: #1720
- Bump github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring from 0.86.1 to 0.86.2
- PR: #1752
- Bump softprops/action-gh-release from 2.4.1 to 2.4.2
- PR: #1753
- Bump golang.org/x/crypto from 0.43.0 to 0.44.0
- PR: #1767
- Bump k8s.io/api from 0.34.1 to 0.34.2
- PR: #1770
- Bump k8s.io/component-base from 0.34.1 to 0.34.2
- PR: #1771
Other
- NFS Server helm chart fixes
- PR: #1569
- bump go 1.25
- PR: #1613
- Metrics for SlurmCluster CR via KubeStateMetrics config
- PR: #1633
- Set `driftDetection.mode: warn` by default for helm releases
- PR: #1669
- Separate versioning for NFS server image and chart
- PR: #1758
- Allow volume size increase for filestore PVCs and PVs
- PR: #1786
- Update dcgm-exporter
- PR: #1811
Contributors:
@Uburro, @ChessProfessor, @ali-sattari, @github-actions[bot], @asteny, @itechdima, @theyoprst, @dependabot[bot], @dstaroff, @rdjjke, @andriishestakov
| 📁 Categorized PRs | 📂 Uncategorized PRs | 📥 Commits | ➕ Lines added | ➖ Lines deleted |
|---|---|---|---|---|
| 5958 | 403 | 464 | 79600 | 24660 |
1.22.4
Changes made since version 1.22.3 prior to version 1.22.4:
🚀 Features
- Bump enroot version 4.0.1
- PR: #1781
🐛 Fixes
- Run hc_program passive checks more often
- PR: #1759
Contributors:
@rdjjke, @itechdima, @asteny
| 📁 Categorized PRs | 📂 Uncategorized PRs | 📥 Commits | ➕ Lines added | ➖ Lines deleted |
|---|---|---|---|---|
| 132 | 0 | 19 | 135 | 116 |
1.22.3
Changes made since version 1.22.2 prior to version 1.22.3:
🚀 Features
- Support B300 in checks
- PR: #1722
🐛 Fixes
- Ignore NOT_RESPONDING nodes in sconfigcontroller
- PR: #1735
Contributors:
@itechdima, @rdjjke
| 📁 Categorized PRs | 📂 Uncategorized PRs | 📥 Commits | ➕ Lines added | ➖ Lines deleted |
|---|---|---|---|---|
| 137 | 0 | 8 | 72 | 51 |
1.22.2
Changes made since version 1.22.1 prior to version 1.22.2:
🐛 Fixes
- SCHED-303: update health-checker
- PR: #1704
- SCHED-310: Explicitly set $HOME in slurmJob ActiveChecks
- PR: #1706
- SCHED-300: Delete wait-for-checks-job if helm release failed
- PR: #1708
- SCHED-303, SCHED-304: Fix health checks: ib_link on Supermicro B200, nvidia_smi output format
- PR: #1711
Contributors:
@itechdima, @rdjjke, @github-actions[bot], @theyoprst, @ChessProfessor
| 📁 Categorized PRs | 📂 Uncategorized PRs | 📥 Commits | ➕ Lines added | ➖ Lines deleted |
|---|---|---|---|---|
| 327 | 0 | 15 | 61 | 135 |
1.22.1
Changes made since version 1.22.0 prior to version 1.22.1:
🚀 Features
- add collecting logs from network-operator and gpu-operator
- PR: #1605
- SCHED-174: Add kube_node_labels metric (#1639)
- PR: #1647
🐛 Fixes
- SCHED-173: Fix the jobs limit increase
- PR: #1670
- Skip sbatch submission failures
- PR: #1668
- SCHED-251: ignore down nodes
- PR: #1683
- SCHED-286 Add sudo to enroot cleanup Active Check
- PR: #1686
Other
- add delete-not-ready-nodes=true by default in helm
- PR: #1615
Contributors:
@Uburro, @theyoprst, @github-actions[bot], @rdjjke, @ChessProfessor, @itechdima
| 📁 Categorized PRs | 📂 Uncategorized PRs | 📥 Commits | ➕ Lines added | ➖ Lines deleted |
|---|---|---|---|---|
| 468 | 68 | 23 | 221 | 81 |
1.22.0
Changes made since version 1.21.9 prior to version 1.22.0:
🚀 Features
- [ha controller] add placeholder and single replicas
- PR: #1344
- Bump container toolkit and do not install it into the jail
- PR: #1347
- Bump slurm 24.11.6
- PR: #1352
- Improve job metrics handling in exporter
- PR: #1358
- Speedup populate_jail + don't overwrite existing data
- PR: #1278
- Implement async metrics collection for SLURM exporter
- PR: #1376
- add scrape node conditions and pod termination reasons metrics from ksm
- PR: #1379
- add gpu-fryer check
- PR: #1375
- Add memory latency and bandwidth Active Checks
- PR: #1368
- Add self-monitoring metrics on separate port 8081
- PR: #1382
- add cuda-samples test
- PR: #1381
- add health-checker upgrade
- PR: #1384
- [slurmcluster] maxUnavailable for workers and support pre install images
- PR: #1391
- add comment reaction on the check
- PR: #1396
- [CD] Support for VMAgent external labels
- PR: #1403
- Add JailedConfig CR
- PR: #1287
- Undrain nodes after user problems + rewrite passive checks on Python
- PR: #1392
- Collecting health_checker_cmd_stdout logs
- PR: #1410
- Standalone Slurm exporter - No Kubernetes needed
- PR: #1415
- [SlurmCluster CRD] rewrite logic of printcolumn
- PR: #1434
- Clean old logs from /opt/soperator-outputs
- PR: #1431
- Add slurm scripts for managing per-job tmpfs directories
- PR: #1432
- move IB checks to health-checker
- PR: #1438
- add aggregation to the JailedConfig
- PR: #1435
- Log one-line JSON outputs for health-checker + rewrite in Python
- PR: #1447
- Chessprofessor/each worker jobs
- PR: #1443
- #1312 Add ib-write-bw/lat cpu checks
- PR: #1323
- add activechecks PrinterColumns
- PR: #1497
- add priorityclassname for components of slurm cluster
- PR: #1519
- Bump health checker 1.0.0-150
- PR: #1526
- Slurm scripts drop cache shmem
- PR: #1554
- remove apparmor deny for libEGL_ for running docker active check image
- PR: #1574
- bump nc-health-checker_1.0.0-151.250904
- PR: #1573
- [slurm login] add SshdServiceLoadBalancerSourceRanges to login node
- PR: #1581
- Fix passive check filtering
- PR: #1582
🐛 Fixes
- [nccl-debug] Use chmod instead of umask
- PR: #1345
- remove size from controller spec
- PR: #1356
- fix cache-sync-timeout to k8s default
- PR: #1367
- Don't plan eachWorkerJobArray active checks on bad nodes
- PR: #1378
- remove validation tool init container #1361
- PR: #1372
- [nodeTopology] re-generate CM node topology if cm deleted.
- PR: #1383
- SLURM exporter: use env configuration and improve docs
- PR: #1394
- [worker topology controller] initial cm topology until asts not found
- PR: #1401
- removing deprecated fields
- PR: #1407
- fix priorityclass name for controller
- PR: #1411
- fix issue with metadata.resourceVersion: Invalid value: 0x0: must be specified for an Patch in ASTS
- PR: #1420
- set default values as a default in CRD #1421
- PR: #1422
- Pre release fixes 1.22.0
- PR: #1423
- Fix pagination issue in cache
- PR: #1433
- [soperatorchecks] change drain reason for maintenance
- PR: #1430
- Disable acctg by default
- PR: #1442
- add aggregation to the JailedConfig
- PR: #1435
- Fix error with read-only workdir in dcgmi_diag_r1 health-checker
- PR: #1450
- [sconfigcontroller] fix reconcile jailedconfig
- PR: #1455
- Fallback on `unix.renameat2` to `os.rename` when `renameat2` is not supported
- PR: #1452
- fix getting error when [user_problem] reason #1468
- PR: #1474
- enable nodeLogs by default
- PR: #1477
- fix preemptionPolicy for controller
- PR: #1495
- fix patch cm with empty labels
- PR: #1508
- add resources values for spo
- PR: #1513
- Move Healthchecker parts to optimize for build time
- PR: #1532
- Cherry-pick active checks in Helm
- PR: #1529
- [nccld-plugin] Make user responsible for correct rights of the output directory
- PR: #1541
- Add fixes for activechecks in 1.22
- PR: #1542
- Slurm scripts drop cache shmem
- PR: #1554
- DCGM Exporter fix for toolkit validation
- PR: #1583
- Add wait after users creation + Split all-reduce-perf + Fix dcgmi_diag_r1
- PR: #1572
- [TopologyController] add EnsureWorkerTopologyConfigMap to check existence of JailedConfig
- PR: #1580
- Fix passive check filtering
- PR: #1582
- change k8up-cleanup image
- PR: #1597
- Extract Enroot's config paths to config dir
- PR: #1599
📦 Dependencies
- Bump docker/login-action from 3.4.0 to 3.5.0
- PR: #1370
- Bump github.com/getkin/kin-openapi from 0.122.0 to 0.131.0
- PR: #1387
- Bump golang.org/x/oauth2 from 0.24.0 to 0.27.0
- PR: #1386
- Bump golang.org/x/net from 0.37.0 to 0.38.0
- PR: #1388
- Bump sigs.k8s.io/yaml from 1.4.0 to 1.6.0
- PR: #1338
- Bump actions/checkout from 4.2.2 to 5.0.0
- PR: #1408
- Bump actions/download-artifact from 4 to 5
- PR: #1389
📔Docs
- SLURM exporter: use env configuration and improve docs
- PR: #1394
Other
- Merge dev to main
- PR: #1332
- Set sconfigcontroller UID and GID from Helm values
- PR: #1327
- DCGM exporter Helm chart: support metricRelabelings
- PR: #1425
- [nccl-debug] Don't try chmodding files and directories without a need
- PR: #1522
- Support passing extraArgs for node-exporter
- PR: #1534
- Add reason label to slurm_node_info metric for better observability
- PR: #1538
- [cherry-pick] Add node unavailability and draining duration metrics
- PR: #1561
- Support DCGM Exporter on driverful
- PR: #1578
- Change DCGM exporter image version
- PR: #1595
Contributors:
@theyoprst, @mcheshkov, @dstaroff, @Uburro, @asteny, @itechdima, @rdjjke, @ChessProfessor, @dependabot[bot], @ali-sattari, @mateusclira-nv, @dnugmanov, @github-actions[bot]
| 📁 Categorized PRs | 📂 Uncategorized PRs | 📥 Commits | ➕ Lines added | ➖ Lines deleted |
|---|---|---|---|---|
| 5613 | 593 | 250 | 9661 | 5211 |
1.21.14
Changes made since version 1.21.13 prior to version 1.21.14:
🐛 Fixes
- remove-hc-host-service-check
- PR: #1585
- change k8up-cleanup image
- PR: #1602
- Extract Enroot's config paths to config dir
- PR: #1604
Contributors:
@itechdima, @ChessProfessor, @Uburro, @dstaroff
| 📁 Categorized PRs | 📂 Uncategorized PRs | 📥 Commits | ➕ Lines added | ➖ Lines deleted |
|---|---|---|---|---|
| 164 | 0 | 10 | 73 | 79 |
1.21.13
Changes made since version 1.21.12 prior to version 1.21.13:
🐛 Fixes
- fix patch cm with empty labels
- PR: #1507
📔Docs
- fix doc Soperator Helm chart
- PR: #1493
Other
- release 1.21.13
- PR: #1509
Contributors:
@Uburro
| 📁 Categorized PRs | 📂 Uncategorized PRs | 📥 Commits | ➕ Lines added | ➖ Lines deleted |
|---|---|---|---|---|
| 164 | 33 | 3 | 49 | 65 |