Releases: nebius/soperator
3.0.2
Changes made since version 3.0.1 prior to version 3.0.2:
Fixes
- SCHED-1015: add retries values for all Helm Releases
- PR: #2215
- SCHED-1039: Add CUDA-to-NCCL tests version mapping into the helm chart.
- PR: #2240
- SCHED-1033: Upgrade Slurm to version 25.11.3
- PR: #2229
- SCHED-1074: Replace deprecated gcr.io registry
- PR: #2271
- fix wrong cleanup_enroot execution
- PR: #2278
- SLURMSUPPORT-320: fix spo version kube rbac proxy
- PR: #2282
- SCHED-1049: rollback topology controller and fix bug with dynamic topology
- PR: #2263
- SCHED-1135 Ignore comment checks in wait-for-checks-job
- PR: #2306
Contributors:
@theyoprst, @Uburro, @github-actions[bot], @itechdima, @ChessProfessor
| Categorized PRs | Uncategorized PRs | Commits | Lines added | Lines deleted |
|---|---|---|---|---|
| 582 | 0 | 73 | 1744 | 258 |
2.0.5
Changes made since version 2.0.4 prior to version 2.0.5:
Fixes
- fix wrong cleanup_enroot execution
- PR: #2278
- SLURMSUPPORT-320: fix spo version kube rbac proxy
- PR: #2282
Contributors:
@itechdima, @Uburro, @theyoprst
| Categorized PRs | Uncategorized PRs | Commits | Lines added | Lines deleted |
|---|---|---|---|---|
| 133 | 0 | 19 | 732 | 162 |
2.0.4
Changes made since version 2.0.3 prior to version 2.0.4:
Fixes
- SCHED-1074: Replace deprecated gcr.io registry
- PR: #2271
Contributors:
@theyoprst
| Categorized PRs | Uncategorized PRs | Commits | Lines added | Lines deleted |
|---|---|---|---|---|
| 78 | 0 | 3 | 79 | 79 |
2.0.3
Changes made since version 2.0.2 prior to version 2.0.3:
Fixes
- SCHED-1039: Add CUDA-to-NCCL tests version mapping into the helm chart.
- PR: #2240
Contributors:
@theyoprst
| Categorized PRs | Uncategorized PRs | Commits | Lines added | Lines deleted |
|---|---|---|---|---|
| 103 | 0 | 9 | 581 | 130 |
3.0.1
Changes made since version 3.0.0 prior to version 3.0.1:
Fixes
- SCHED-1008: Fix IB topology for GPU nodes
- PR: #2201
- Make it possible to add tier-2 topology switches to extra constraints
- PR: #2206
- SCHED-1007: Ignore CPU-only nodes in IB topology
- PR: #2203
- SCHED-1015: add retries values for all Helm Releases (#2215)
- PR: #2218
Contributors:
@theyoprst, @Uburro, @rdjjke
| Categorized PRs | Uncategorized PRs | Commits | Lines added | Lines deleted |
|---|---|---|---|---|
| 304 | 0 | 18 | 1236 | 456 |
2.0.2
Changes made since version 2.0.1 prior to version 2.0.2:
Fixes
- SCHED-1015: add retries values for all Helm Releases
- PR: #2215
Contributors:
@theyoprst, @Uburro
| Categorized PRs | Uncategorized PRs | Commits | Lines added | Lines deleted |
|---|---|---|---|---|
| 84 | 0 | 4 | 363 | 319 |
3.0.0
Changes made since version 2.0.1 prior to version 3.0.0:
Features
- Use Slurm version 25.11.2
- PR: #2130
- Build jail images in parallel on github runners
- PR: #2158
- SCHED-945 Build controllers and slurm images in parallel
- PR: #2163
- SCHED-953 Run lint, build-helm-charts and pre-build on github runners
- PR: #2167
- SCHED-958 Build images with docker registry cache
- PR: #2170
- Use output variables for builds
- PR: #2172
- SCHED-848: add ephemeral nodes
- PR: #2101
Fixes
- Fix name/namespace parameters for render secret
- PR: #2117
- SCHED-941: upgrade versions of munge
- PR: #2174
- SCHED-987: Delete dynamic workers from code
- PR: #2192
- SCHED-986: Ignore POWERED_DOWN nodes in active checks
- PR: #2198
Dependencies
- Bump docker/login-action from 3.6.0 to 3.7.0
- PR: #2102
- Bump mikepenz/release-changelog-builder-action from 6.0.1 to 6.1.0
- PR: #2122
- Bump github.com/cert-manager/cert-manager from 1.18.2 to 1.18.5 in the go_modules group across 1 directory
- PR: #2149
- Bump filelock from 3.20.1 to 3.20.3 in /ansible in the pip group across 1 directory
- PR: #2030
Docs
- Update Active Checks doc to 2.0
- PR: #2160
Other
- Add affinity and nodeSelector support to soperator manager
- PR: #2127
- Better default params for NFS storageclass
- PR: #2169
Contributors:
@theyoprst, @github-actions[bot], @dependabot[bot], @andriishestakov, @janekmichalik, @asteny, @ChessProfessor, @ali-sattari, @Uburro
| Categorized PRs | Uncategorized PRs | Commits | Lines added | Lines deleted |
|---|---|---|---|---|
| 1315 | 136 | 97 | 6649 | 14293 |
2.0.1
Changes made since version 2.0.0 prior to version 2.0.1:
Fixes
- fix long termination of worker pods
- PR: #2164
Dependencies
- SCHED-931 Upgrade nvidia toolkit version to 1.18.2-1
- PR: #2161
Contributors:
@itechdima, @theyoprst, @ChessProfessor, @asteny
| Categorized PRs | Uncategorized PRs | Commits | Lines added | Lines deleted |
|---|---|---|---|---|
| 158 | 0 | 9 | 592 | 537 |
2.0.0
Changes made since version 1.23.2 prior to version 2.0.0:
Features
- Change base cuda image and use ansible for nccl-tests
- PR: #1967
- Change base Neubuntu image for all images
- PR: #1975
- feat: Adding node metrics to track node status in Slurm
- PR: #1927
- Support disabling controllers via flag or env
- PR: #1978
- Use base neubuntu image with ansible and return pushing stable release images to the Github docker registry
- PR: #1987
- Allow setting procMount
- PR: #1980
- SCHED-696: update umbrella chart for logging
- PR: #1991
- Move common-packages, repos and python roles to the base layers of images (repo ml-containers)
- PR: #1993
- SCHED-696: customize log headers
- PR: #2007
- SCHED-626 Removing slurm installation from images (use base image)
- PR: #2008
- SCHED-696: configure logs endpoint
- PR: #2028
- Use base images with Nebius apt snapshots
- PR: #2031
- SCHED-696: add attribute for service provider application
- PR: #2033
- SCHED-761 move openmpi role to the ml-containers repo
- PR: #2035
- SCHED-773 Move dcgmi, cuda and nccl-tests roles to the ml-containers
- PR: #2047
- Bump docker and nvtop
- PR: #2051
- Use base image for jail and active_checks with ansible roles for downloading binaries: nccl-tests, cuda-samples and mlc
- PR: #2062
- Use slurm_training_diag as base image for jail
- PR: #2064
- SCHED-864 Create sansible docker image for handling the jail state
- PR: #2108
- SCHED-906 remove outdated scripts
- PR: #2142
- bump nvtop
- PR: #2156
Fixes
- SCHED-567: Ensure deterministic startup order between DB, accounting and controller
- PR: #1918
- Fix k8up backup image repo and tag
- PR: #1966
- SLURMSUPPORT-75: add more unavailable node states to slurm exporter
- PR: #1988
- SCHED-690: move exporter rb, sa, role from soperator to helm chart
- PR: #1986
- Fix pod monitor bug in renderer
- PR: #1992
- SCHED-785: Plug-in SPANK plugins properly in slurm job active checks
- PR: #2058
- SCHED-807: fix autohealing for nodesets
- PR: #2072
- Get correct environment for passive checks
- PR: #2070
- turn off dcgmi diag active checks
- PR: #2082
- increase default reconfiguration period
- PR: #2087
- Use base images without workdir /opt/ansible
- PR: #2084
- SCHED-855 add WorkingDir for activecheck container images
- PR: #2100
- do not always undrain node
- PR: #2113
- Use CLOUD nodes and make gres.conf configurable for NodeSets
- PR: #2119
- SCHED-898 Ignore non-draining checks in wait-for-active-checks job
- PR: #2132
- SCHED-885: change init containers order
- PR: #2133
- Add slurm script that does chmod a+rw for enroot image layers
- PR: #2138
- [SCHED-804] Deprecate and make optional slurmNodes.worker field in SlurmCluster CRD
- PR: #2141
- Use default nfs-in-k8s for e2e
- PR: #2152
Dependencies
- Bump golang.org/x/crypto from 0.45.0 to 0.46.0
- PR: #1919
- Bump github.com/onsi/gomega from 1.38.2 to 1.38.3
- PR: #1920
- Bump k8s.io/client-go from 0.34.2 to 0.34.3
- PR: #1930
- Bump k8s.io/component-base from 0.34.2 to 0.34.3
- PR: #1929
- Bump github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring from 0.86.2 to 0.87.1
- PR: #1931
- Bump sigs.k8s.io/controller-runtime from 0.22.3 to 0.22.4
- PR: #1935
- Bump actions/upload-artifact from 5 to 6
- PR: #1934
- Bump actions/download-artifact from 6 to 7
- PR: #1933
- Bump filelock from 3.20.0 to 3.20.1 in /ansible in the pip group across 1 directory
- PR: #1946
- Bump actions/checkout from 4 to 6
- PR: #1994
- Bump actions/checkout from 4 to 6
- PR: #2023
Other
- Support customizing built-in Slurm scripts
- PR: #1915
- Make Helm chart soperator-activechecks customizable
- PR: #1954
- Fixes for issues with metrics and dashboards
- PR: #2061
- Add fsGroupChangePolicy: "OnRootMismatch" to NFS server StatefulSet
- PR: #2066
- NOTIC: Move status and resolution fields to log labels from body
- PR: #2125
Contributors:
@github-actions[bot], @dependabot[bot], @dstaroff, @Uburro, @ali-sattari, @asteny, @aaroniscode, @mateusclira-nv, @theyoprst, @andriishestakov, @itechdima, @rdjjke, @ChessProfessor
| Categorized PRs | Uncategorized PRs | Commits | Lines added | Lines deleted |
|---|---|---|---|---|
| 3985 | 358 | 250 | 41893 | 5554 |
1.23.3
Changes made since version 1.23.2 prior to version 1.23.3:
Fixes
Contributors:
@itechdima, @ali-sattari
| Categorized PRs | Uncategorized PRs | Commits | Lines added | Lines deleted |
|---|---|---|---|---|
| 122 | 0 | 5 | 71 | 71 |