1.23.0

@github-actions released this 28 Nov 13:24 · 912 commits to main since this release · 68bbca4

Changes made since version 1.22.3 prior to version 1.23.0:

🚀 Features

  • Slurm 25.05.3
  • Configurable values for ActiveChecks chart
  • deprecate worker field and add nodeSetRefs validation
  • SCHED-124: add structured partitions
  • SCHED-138: add SOPERATOR_NODE_SETS_ON=true for static worker configuration
  • Ensure healthy nodes check
  • SCHED-137: add --instance-id and --extra both static and dynamic config
  • Tailscale support
  • SCHED-162: add all_reduce_perf in docker
  • SCHED-155: automatic block topology
  • Build Slurm 25.05.4
  • add retrigger check
  • SCHED-180: Refactor ActiveCheck CRD additional printer columns
  • SCHED-249: add podMonitor
  • SCHED-243: Make Nebius Mk8s conditions configurable
  • Add ib-gpu-perf check
  • Change order of docker image installation for better cache and simplify structure
  • SCHED-250: Alpha version of NodeSet reconciliation
  • SCHED-208 SCHED-209 SCHED-210 Refactoring Active Check helm charts and related controller behavior
  • Ansible for managing Jail state
  • Update dcgmi version to 1:4.4.2-1
  • SCHED-380: Skip maintenance handling based on node labels
  • SCHED-248 Set-unhealthy on extensive check failure with compute instance id and check run id
  • SCHED-421 Move enable-node-replacement to separate param in values
  • Exporter: add reservation name as a label to slurm_node_info
  • SCHED-413 Enable ib perf gpu

🐛 Fixes

  • add HostUsers as optional for all components, with default false for workers
  • Update slurm active check status when submission failed + Renaming
  • add logs format for leader elections
  • Rename createuser to soperator-createuser to avoid PostgreSQL conflict
  • SCHED-247: Rename K8s node condition MaintenanceScheduled->NebiusMaintenanceScheduled
  • Helm: allow set customSlurmConfig
  • Fix: set versions annotation for AdvancedStatefulSet
  • Allow set tolerations for controllerManager
  • Fix: Reconciler error in jailedconfig
  • SCHED-165 Update soperator notifier helm to tag job owner in Slack
  • Fix: Check if driver installed in /run/nvidia/driver
  • Run hc_program passive checks more often
  • Fix ActiveCheck additional printer columns
  • Fix bug in maintenanceIgnoreNodeLabels
  • fix syslog parsing on Ubuntu 24.04
  • fix active checks output
  • SCHED-417, SCHED-430, SCHED-437, SCHED-439, SCHED-441, SCHED-429: Pre-release 1.23 fixes related to extensive checks
  • Don't use : in extensive check reservation names
  • NOTIC: Don't use sudo in soperator-outputs-logs-cleaner
  • SCHED-376 Update health-checker version (with new nccl-with-ib and ib-gpu-perf limits)
  • NOTIC: Don't use sudo in all-reduce-perf-nccl-in-docker
  • SCHED-485: First slurmJob active checks may fail in soperator
  • Preinstall sudo in active check images
  • NOTIC: Use sudo only in active checks that don't use containers

📦 Dependencies

  • Bump sigs.k8s.io/controller-runtime from 0.21.0 to 0.22.1
  • Bump golang.org/x/crypto from 0.41.0 to 0.42.0
  • Bump github.com/prometheus/client_golang from 1.23.0 to 1.23.2
  • Bump github.com/onsi/ginkgo/v2 from 2.25.2 to 2.25.3
  • Bump github.com/zclconf/go-cty from 1.16.4 to 1.17.0
  • Bump github.com/gruntwork-io/terratest from 0.50.0 to 0.51.0
  • Bump docker/login-action from 3.5.0 to 3.6.0
  • Bump github.com/onsi/ginkgo/v2 from 2.25.3 to 2.26.0
  • Bump softprops/action-gh-release from 2.3.3 to 2.3.4
  • Bump sigs.k8s.io/controller-runtime from 0.22.1 to 0.22.2
  • Bump golang.org/x/sys from 0.36.0 to 0.37.0
  • Bump github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring from 0.85.0 to 0.86.0
  • Bump softprops/action-gh-release from 2.3.4 to 2.4.0
  • Bump golang.org/x/crypto from 0.42.0 to 0.43.0
  • Bump sigs.k8s.io/controller-runtime from 0.22.2 to 0.22.3
  • Bump github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring from 0.86.0 to 0.86.1
  • Bump github.com/gruntwork-io/terratest from 0.51.0 to 0.52.0
  • Bump github.com/onsi/ginkgo/v2 from 2.26.0 to 2.27.2
  • Bump actions/download-artifact from 5 to 6
  • Bump actions/upload-artifact from 4 to 5
  • Bump softprops/action-gh-release from 2.4.0 to 2.4.1
  • Bump sigs.k8s.io/controller-runtime from 0.22.3 to 0.22.4
  • Bump mikepenz/release-changelog-builder-action from 5.4.1 to 6.0.1
  • Bump github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring from 0.86.1 to 0.86.2
  • Bump softprops/action-gh-release from 2.4.1 to 2.4.2
  • Bump golang.org/x/crypto from 0.43.0 to 0.44.0
  • Bump k8s.io/api from 0.34.1 to 0.34.2
  • Bump k8s.io/component-base from 0.34.1 to 0.34.2

Other

  • NFS Server helm chart fixes
  • bump go 1.25
  • Metrics for SlurmCluster CR via KubeStateMetrics config
  • Set driftDetection.mode: warn by default for helm releases
  • Separate versioning for NFS server image and chart
  • Allow volume size increase for filestore PVCs and PVs
  • Update dcgm-exporter

Contributors:
@Uburro, @ChessProfessor, @ali-sattari, @github-actions[bot], @asteny, @itechdima, @theyoprst, @dependabot[bot], @dstaroff, @rdjjke, @andriishestakov

📁 Categorized PRs 📂 Uncategorized PRs 📥 Commits ➕ Lines added ➖ Lines deleted
5958 403 464 79600 24660