Releases: nebius/soperator

1.23.2

19 Jan 18:15
454f486

Changes made since version 1.23.1 prior to version 1.23.2:

🚀 Features

  • SCHED-706: Adding node metrics to track node status in Slurm
  • Upgrade docker version

🐛 Fixes

  • SCHED-658: remove reaction on the comment in ensure healthy
  • SCHED-658: fix validation commentPrefix: null and drainReasonPrefix…
  • SCHED-487 Do not wait for cancelled jobs in wait-for-checks
  • Disable periodic JobAcctGather stats collection by default
  • SCHED-789: nvtop 3.2.0.2-1+noble is no longer available

📦 Dependencies

  • SCHED-656 Upgrade health-checker
  • SCHED-394 Build and populate jail for both cuda 12 and 13
  • Bump soperator version to 1.23.2

Contributors:
@theyoprst, @Uburro, @github-actions[bot], @ChessProfessor, @rdjjke, @asteny

📁 Categorized PRs | 📂 Uncategorized PRs | 📥 Commits | Lines added | Lines deleted
737 | 0 | 57 | 1152 | 214

1.23.1

29 Dec 16:15
035d75a

Changes made since version 1.23.0 prior to version 1.23.1:

🚀 Features

  • Add cleanup_scratch_data optional passive check
  • SCHED-490 Add a flag to disable extensive check
  • Adjust rmem and wmem network sysctls by default
  • Support container customEnv + enable video capability by default
  • SCHED-619 Creating symlink to the slurm configs for login containers

🐛 Fixes

  • SCHED-507: (e2e) Install yq in github action
  • SCHED-565 Change node replacement drain prefix to [compute_maintenance]
  • SCHED-563 Bump health-checker to 1.0.0-171.251205
  • SCHED-542: Convert wait-for-soperatorchecks-srun-ready to a k8s check job
  • bump python3-apt version
  • NOTIC: fix bugs with duplicate customVolumeMount
  • Fix nfs_in_k8s TF variable set in E2E
  • Use unstable NFS version in E2E TF
  • Bind libslurm.so.* from container to jail
  • SCHED-609: Fix Enroot containers
  • SCHED-570 Use storage-driver vfs by default
  • Make scontrol reboot work by fixing RebootProgram script permissions
  • Allocate all available memory by default
  • Run extensive checks on reservations more often
  • Fix node auto-replacement after maintenance events

📦 Dependencies

  • SCHED-492 Upgrade cuda version and get rid of dcgmi in ansible

Other

  • Fix storage class in 1.23

Contributors:
@theyoprst, @rdjjke, @github-actions[bot], @ChessProfessor, @ali-sattari, @asteny, @Uburro, @itechdima

📁 Categorized PRs | 📂 Uncategorized PRs | 📥 Commits | Lines added | Lines deleted
1520 | 43 | 70 | 23722 | 9967

1.23.0

28 Nov 13:24
68bbca4

Changes made since version 1.22.3 prior to version 1.23.0:

🚀 Features

  • Slurm 25.05.3
  • Configurable values for ActiveChecks chart
  • deprecate worker field and add nodeSetRefs validation
  • SCHED-124: add structured partitions
  • SCHED-138: add SOPERATOR_NODE_SETS_ON=true for static worker configuration
  • Ensure healthy nodes check
  • SCHED-137: add --instance-id and --extra both static and dynamic config
  • Tailscale support
  • SCHED-162: add all_reduce_perf in docker
  • SCHED-155: automatic block topology
  • Build Slurm 25.05.4
  • add retrigger check
  • SCHED-180: Refactor ActiveCheck CRD additional printer columns
  • SCHED-249: add podMonitor
  • SCHED-243: Make Nebius Mk8s conditions configurable
  • Add ib-gpu-perf check
  • Change order of docker image installation for better cache and simplify structure
  • SCHED-250: Alpha version of NodeSet reconciliation
  • SCHED-208 SCHED-209 SCHED-210 Refactoring Active Check helm charts and related controller behavior
  • Ansible for managing Jail state
  • Update dcgmi version to 1:4.4.2-1
  • SCHED-380: Skip maintenance handling based on node labels
  • SCHED-248 Set-unhealthy on extensive check failure with compute instance id and check run id
  • SCHED-421 Move enable-node-replacement to separate param in values
  • Exporter: add reservation name as a label to slurm_node_info
  • SCHED-413 Enable ib perf gpu

🐛 Fixes

  • Add optional HostUsers for all components (default false for workers)
  • Update slurm active check status when submission failed + Renaming
  • add logs format for leader elections
  • Rename createuser to soperator-createuser to avoid PostgreSQL conflict
  • SCHED-247: Rename K8s node condition MaintenanceScheduled->NebiusMaintenanceScheduled
  • Helm: allow set customSlurmConfig
  • Fix: set versions annotation for AdvancedStatefulSet
  • Allow set tolerations for controllerManager
  • Fix: Reconciler error in jailedconfig
  • SCHED-165 Update soperator notifier helm to tag job owner in Slack
  • Fix: Check if driver installed in /run/nvidia/driver
  • Run hc_program passive checks more often
  • Fix ActiveCheck additional printer columns
  • Fix bug in maintenanceIgnoreNodeLabels
  • fix syslog parsing on Ubuntu 24.04
  • fix active checks output
  • SCHED-417, SCHED-430, SCHED-437, SCHED-439, SCHED-441, SCHED-429: Pre-release 1.23 fixes related to extensive checks
  • Don't use : in extensive check reservation names
  • NOTIC: Don't use sudo in soperator-outputs-logs-cleaner
  • SCHED-376 Update health-checker version (with new nccl-with-ib and ib-gpu-perf limits)
  • NOTIC: Don't use sudo in all-reduce-perf-nccl-in-docker
  • SCHED-485: First slurmJob active checks may fail in soperator
  • Preinstall sudo in active check images
  • NOTIC: Use sudo only in active checks that don't use containers

📦 Dependencies

  • Bump sigs.k8s.io/controller-runtime from 0.21.0 to 0.22.1
  • Bump golang.org/x/crypto from 0.41.0 to 0.42.0
  • Bump github.com/prometheus/client_golang from 1.23.0 to 1.23.2
  • Bump github.com/onsi/ginkgo/v2 from 2.25.2 to 2.25.3
  • Bump github.com/zclconf/go-cty from 1.16.4 to 1.17.0
  • Bump github.com/gruntwork-io/terratest from 0.50.0 to 0.51.0
  • Bump docker/login-action from 3.5.0 to 3.6.0
  • Bump github.com/onsi/ginkgo/v2 from 2.25.3 to 2.26.0
  • Bump softprops/action-gh-release from 2.3.3 to 2.3.4
  • Bump sigs.k8s.io/controller-runtime from 0.22.1 to 0.22.2
  • Bump golang.org/x/sys from 0.36.0 to 0.37.0
  • Bump github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring from 0.85.0 to 0.86.0
  • Bump softprops/action-gh-release from 2.3.4 to 2.4.0
  • Bump golang.org/x/crypto from 0.42.0 to 0.43.0
  • Bump sigs.k8s.io/controller-runtime from 0.22.2 to 0.22.3
  • Bump github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring from 0.86.0 to 0.86.1
  • Bump github.com/gruntwork-io/terratest from 0.51.0 to 0.52.0
  • Bump github.com/onsi/ginkgo/v2 from 2.26.0 to 2.27.2
  • Bump actions/download-artifact from 5 to 6
  • Bump actions/upload-artifact from 4 to 5
  • Bump softprops/action-gh-release from 2.4.0 to 2.4.1
  • Bump sigs.k8s.io/controller-runtime from 0.22.3 to 0.22.4
  • Bump mikepenz/release-changelog-builder-action from 5.4.1 to 6.0.1
  • Bump github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring from 0.86.1 to 0.86.2
  • Bump softprops/action-gh-release from 2.4.1 to 2.4.2
  • Bump golang.org/x/crypto from 0.43.0 to 0.44.0
  • Bump k8s.io/api from 0.34.1 to 0.34.2
  • Bump k8s.io/component-base from 0.34.1 to 0.34.2

Other

  • NFS Server helm chart fixes
  • bump go 1.25
  • Metrics for SlurmCluster CR via KubeStateMetrics config
  • Set driftDetection.mode: warn by default for helm releases
  • Separate versioning for NFS server image and chart
  • Allow volume size increase for filestore PVCs and PVs
  • Update dcgm-exporter

Contributors:
@Uburro, @ChessProfessor, @ali-sattari, @github-actions[bot], @asteny, @itechdima, @theyoprst, @dependabot[bot], @dstaroff, @rdjjke, @andriishestakov

📁 Categorized PRs | 📂 Uncategorized PRs | 📥 Commits | Lines added | Lines deleted
5958 | 403 | 464 | 79600 | 24660

1.22.4

18 Nov 19:16
abd146b

Changes made since version 1.22.3 prior to version 1.22.4:

🚀 Features

  • Bump enroot version 4.0.1

🐛 Fixes

  • Run hc_program passive checks more often

Contributors:
@rdjjke, @itechdima, @asteny

📁 Categorized PRs | 📂 Uncategorized PRs | 📥 Commits | Lines added | Lines deleted
132 | 0 | 19 | 135 | 116

1.22.3

06 Nov 18:11
eedbc53

Changes made since version 1.22.2 prior to version 1.22.3:

🚀 Features

  • Support B300 in checks

🐛 Fixes

  • Ignore NOT_RESPONDING nodes in sconfigcontroller

Contributors:
@itechdima, @rdjjke

📁 Categorized PRs | 📂 Uncategorized PRs | 📥 Commits | Lines added | Lines deleted
137 | 0 | 8 | 72 | 51

1.22.2

04 Nov 11:14
00b8800

Changes made since version 1.22.1 prior to version 1.22.2:

🐛 Fixes

  • SCHED-303: update health-checker
  • SCHED-310: Explicitly set $HOME in slurmJob ActiveChecks
  • SCHED-300: Delete wait-for-checks-job if helm release failed
  • SCHED-303, SCHED-304: Fix health checks: ib_link on Supermicro B200, nvidia_smi output format

Contributors:
@itechdima, @rdjjke, @github-actions[bot], @theyoprst, @ChessProfessor

📁 Categorized PRs | 📂 Uncategorized PRs | 📥 Commits | Lines added | Lines deleted
327 | 0 | 15 | 61 | 135

1.22.1

27 Oct 13:12
4a2d535

Changes made since version 1.22.0 prior to version 1.22.1:

🚀 Features

  • add collecting logs from network-operator and gpu-operator
  • SCHED-174: Add kube_node_labels metric (#1639)

🐛 Fixes

  • SCHED-173: Fix the jobs limit increase
  • Skip sbatch submission failures
  • SCHED-251: ignore down nodes
  • SCHED-286 Add sudo to enroot cleanup Active Check

Other

  • add delete-not-ready-nodes=true by default in helm

Contributors:
@Uburro, @theyoprst, @github-actions[bot], @rdjjke, @ChessProfessor, @itechdima

📁 Categorized PRs | 📂 Uncategorized PRs | 📥 Commits | Lines added | Lines deleted
468 | 68 | 23 | 221 | 81

1.22.0

24 Sep 11:53
7d94720

Changes made since version 1.21.9 prior to version 1.22.0:

🚀 Features

  • [ha controller] add placeholder and single replicas
  • Bump container toolkit and do not install it into the jail
  • Bump slurm 24.11.6
  • Improve job metrics handling in exporter
  • Speedup populate_jail + don't overwrite existing data
  • Implement async metrics collection for SLURM exporter
  • add scrape node conditions and pod termination reasons metrics from ksm
  • add gpu-fryer check
  • Add memory latency and bandwidth Active Checks
  • Add self-monitoring metrics on separate port 8081
  • add cuda-samples test
  • add health-checker upgrade
  • [slurmcluster] maxUnavailable for workers and support pre install images
  • add comment reaction on the check
  • [CD] Support for VMAgent external labels
  • Add JailedConfig CR
  • Undrain nodes after user problems + rewrite passive checks on Python
  • Collecting health_checker_cmd_stdout logs
  • Standalone Slurm exporter - No Kubernetes needed
  • [SlurmCluster CRD] rewrite logic of printcolumn
  • Clean old logs from /opt/soperator-outputs
  • Add slurm scripts for managing per-job tmpfs directories
  • move IB checks to health-checker
  • add aggregation to the JailedConfig
  • Log one-line JSON outputs for health-checker + rewrite in Python
  • Chessprofessor/each worker jobs
  • #1312 Add ib-write-bw/lat cpu checks
  • add activechecks PrinterColumns
  • add priorityclassname for components of slurm cluster
  • Bump health checker 1.0.0-150
  • Slurm scripts drop cache shmem
  • remove apparmor deny for libEGL_ for running docker active check image
  • bump nc-health-checker_1.0.0-151.250904
  • [slurm login] add SshdServiceLoadBalancerSourceRanges to login node
  • Fix passive check filtering

🐛 Fixes

  • [nccl-debug] Use chmod instead of umask
  • remove size from controller spec
  • fix cache-sync-timeout to k8s default
  • Don't plan eachWorkerJobArray active checks on bad nodes
  • remove validation tool init container #1361
  • [nodeTopology] re-generate CM node topology if cm deleted.
  • SLURM exporter: use env configuration and improve docs
  • [worker topology controller] initial cm topology until asts not found
  • removing deprecated fields
  • fix priorityclass name for controller
  • fix issue with metadata.resourceVersion: Invalid value: 0x0: must be specified for an Patch in ASTS
  • set default values as a default in CRD #1421
  • Pre release fixes 1.22/0
  • Fix pagination issue in cache
  • [soperatorchecks] change drain reason for maintenance
  • Disable acctg by default
  • add aggregation to the JailedConfig
  • Fix error with read-only workdir in dcgmi_diag_r1 health-checker
  • [sconfigcontroller] fix reconcile jailedconfig
  • Fallback on unix.renameat2 to os.rename when renameat2 is not supported
  • fix getting error when [user_problem] reason #1468
  • enable nodeLogs by default
  • fix preemptionPolicy for controller
  • fix patch cm with empty labels
  • add resources values for spo
  • Move Healthchecker parts to optimize for build time
  • Cherry-pick active checks in Helm
  • [nccld-plugin] Make user responsible for correct rights of the output directory
  • Add fixes for activechecks in 1.22
  • Slurm scripts drop cache shmem
  • DCGM Exporter fix for toolkit validation
  • Add wait after users creation + Split all-reduce-perf + Fix dcgmi_diag_r1
  • [TopologyController] add EnsureWorkerTopologyConfigMap to check existence of JailedConfig
  • Fix passive check filtering
  • change k8up-cleanup image
  • Extract Enroot's config paths to config dir

📦 Dependencies

  • Bump docker/login-action from 3.4.0 to 3.5.0
  • Bump github.com/getkin/kin-openapi from 0.122.0 to 0.131.0
  • Bump golang.org/x/oauth2 from 0.24.0 to 0.27.0
  • Bump golang.org/x/net from 0.37.0 to 0.38.0
  • Bump sigs.k8s.io/yaml from 1.4.0 to 1.6.0
  • Bump actions/checkout from 4.2.2 to 5.0.0
  • Bump actions/download-artifact from 4 to 5

📔Docs

  • SLURM exporter: use env configuration and improve docs

Other

  • Merge dev to main
  • Set sconfigcontroller UID and GID from Helm values
  • DCGM exporter Helm chart: support metricRelabelings
  • [nccl-debug] Don't try chmodding files and directories without a need
  • Support passing extraArgs for node-exporter
  • Add reason label to slurm_node_info metric for better observability
  • [cherry-pick] Add node unavailability and draining duration metrics
  • Support DCGM Exporter on driverful
  • Change DCGM exporter image version

Contributors:
@theyoprst, @mcheshkov, @dstaroff, @Uburro, @asteny, @itechdima, @rdjjke, @ChessProfessor, @dependabot[bot], @ali-sattari, @mateusclira-nv, @dnugmanov, @github-actions[bot]

📁 Categorized PRs | 📂 Uncategorized PRs | 📥 Commits | Lines added | Lines deleted
5613 | 593 | 250 | 9661 | 5211

1.21.14

19 Sep 15:59
45730bb

Changes made since version 1.21.13 prior to version 1.21.14:

🐛 Fixes

  • remove-hc-host-service-check
  • change k8up-cleanup image
  • Extract Enroot's config paths to config dir

Contributors:
@itechdima, @ChessProfessor, @Uburro, @dstaroff

📁 Categorized PRs | 📂 Uncategorized PRs | 📥 Commits | Lines added | Lines deleted
164 | 0 | 10 | 73 | 79

1.21.13

29 Aug 10:48
8450468

Changes made since version 1.21.12 prior to version 1.21.13:

🐛 Fixes

  • fix patch cm with empty labels

📔Docs

  • fix doc Soperator Helm chart

Other

  • release 1.21.13

Contributors:
@Uburro

📁 Categorized PRs | 📂 Uncategorized PRs | 📥 Commits | Lines added | Lines deleted
164 | 33 | 3 | 49 | 65