Releases: nebius/soperator

3.0.2

12 Mar 18:07
a25054d

Changes made since version 3.0.1 prior to version 3.0.2:

🐛 Fixes

  • SCHED-1015: add retries values for all Helm Releases
  • SCHED-1039: Add CUDA-to-NCCL tests version mapping into the helm chart.
  • SCHED-1033: Upgrade Slurm to version 25.11.3
  • SCHED-1074: Replace deprecated gcr.io registry
  • fix wrong cleanup_enroot execution
  • SLURMSUPPORT-320: fix spo version kube rbac proxy
  • SCHED-1049: roll back topology controller and fix bug with dynamic topology
  • SCHED-1135 Ignore comment checks in wait-for-checks-job

Contributors:
@theyoprst, @Uburro, @github-actions[bot], @itechdima, @ChessProfessor

📁 Categorized PRs: 582 | 📂 Uncategorized PRs: 0 | 📥 Commits: 73 | ➕ Lines added: 1744 | ➖ Lines deleted: 258

2.0.5

11 Mar 17:10
9f9000d

Changes made since version 2.0.4 prior to version 2.0.5:

🐛 Fixes

  • fix wrong cleanup_enroot execution
  • SLURMSUPPORT-320: fix spo version kube rbac proxy

Contributors:
@itechdima, @Uburro, @theyoprst

📁 Categorized PRs: 133 | 📂 Uncategorized PRs: 0 | 📥 Commits: 19 | ➕ Lines added: 732 | ➖ Lines deleted: 162

2.0.4

09 Mar 13:27
6209dce

Changes made since version 2.0.3 prior to version 2.0.4:

🐛 Fixes

  • SCHED-1074: Replace deprecated gcr.io registry

Contributors:
@theyoprst

📁 Categorized PRs: 78 | 📂 Uncategorized PRs: 0 | 📥 Commits: 3 | ➕ Lines added: 79 | ➖ Lines deleted: 79

2.0.3

03 Mar 13:40
9eb3e28

Changes made since version 2.0.2 prior to version 2.0.3:

🐛 Fixes

  • SCHED-1039: Add CUDA-to-NCCL tests version mapping into the helm chart.

Contributors:
@theyoprst

📁 Categorized PRs: 103 | 📂 Uncategorized PRs: 0 | 📥 Commits: 9 | ➕ Lines added: 581 | ➖ Lines deleted: 130

3.0.1

25 Feb 21:39
cc48130

Pre-release

Changes made since version 3.0.0 prior to version 3.0.1:

🐛 Fixes

  • SCHED-1008: Fix IB topology for GPU nodes
  • Make it possible to add tier-2 topology switches to extra constraints
  • SCHED-1007: Ignore CPU-only nodes in IB topology
  • SCHED-1015: add retries values for all Helm Releases (#2215)

Contributors:
@theyoprst, @Uburro, @rdjjke

📁 Categorized PRs: 304 | 📂 Uncategorized PRs: 0 | 📥 Commits: 18 | ➕ Lines added: 1236 | ➖ Lines deleted: 456

2.0.2

25 Feb 16:04
135d48d

Changes made since version 2.0.1 prior to version 2.0.2:

🐛 Fixes

  • SCHED-1015: add retries values for all Helm Releases

Contributors:
@theyoprst, @Uburro

📁 Categorized PRs: 84 | 📂 Uncategorized PRs: 0 | 📥 Commits: 4 | ➕ Lines added: 363 | ➖ Lines deleted: 319

3.0.0

23 Feb 20:17
b9227d9

Changes made since version 2.0.1 prior to version 3.0.0:

🚀 Features

  • Use Slurm version 25.11.2
  • Build jail images in parallel on github runners
  • SCHED-945 Build controllers and slurm images in parallel
  • SCHED-953 Run lint, build-helm-charts and pre-build on github runners
  • SCHED-958 Build images with docker registry cache
  • Use output variables for builds
  • SCHED-848: add ephemeral nodes

🐛 Fixes

  • Fix name/namespace parameters for render secret
  • SCHED-941: upgrade version of munge
  • SCHED-987: Delete dynamic workers from code
  • SCHED-986: Ignore POWERED_DOWN nodes in active checks

📦 Dependencies

  • Bump docker/login-action from 3.6.0 to 3.7.0
  • Bump mikepenz/release-changelog-builder-action from 6.0.1 to 6.1.0
  • Bump github.com/cert-manager/cert-manager from 1.18.2 to 1.18.5 in the go_modules group across 1 directory
  • Bump filelock from 3.20.1 to 3.20.3 in /ansible in the pip group across 1 directory

📔 Docs

  • Update Active Checks doc to 2.0

Other

  • Add affinity and nodeSelector support to soperator manager
  • Better default params for NFS storageclass

Contributors:
@theyoprst, @github-actions[bot], @dependabot[bot], @andriishestakov, @janekmichalik, @asteny, @ChessProfessor, @ali-sattari, @Uburro

📁 Categorized PRs: 1315 | 📂 Uncategorized PRs: 136 | 📥 Commits: 97 | ➕ Lines added: 6649 | ➖ Lines deleted: 14293

2.0.1

18 Feb 17:59
9226a38

Changes made since version 2.0.0 prior to version 2.0.1:

🐛 Fixes

  • fix long termination of worker pods

📦 Dependencies

  • SCHED-931 Upgrade nvidia toolkit version to 1.18.2-1

Contributors:
@itechdima, @theyoprst, @ChessProfessor, @asteny

📁 Categorized PRs: 158 | 📂 Uncategorized PRs: 0 | 📥 Commits: 9 | ➕ Lines added: 592 | ➖ Lines deleted: 537

2.0.0

10 Feb 19:05
d07f06e

Changes made since version 1.23.2 prior to version 2.0.0:

🚀 Features

  • Change base cuda image and use ansible for nccl-tests
  • Change base Neubuntu image for all images
  • feat: Adding node metrics to track node status in Slurm
  • Support disabling controllers via flag or env
  • Use base neubuntu image with ansible and resume pushing stable release images to the GitHub Docker registry
  • Allow setting procMount
  • SCHED-696: update umbrella chart for logging
  • Move common-packages, repos and python roles to the base layers of images (repo ml-containers)
  • SCHED-696: customize log headers
  • SCHED-626 Removing slurm installation from images (use base image)
  • SCHED-696: configure logs endpoint
  • Use base images with Nebius apt snapshots
  • SCHED-696: add attribute for service provider application
  • SCHED-761 move openmpi role to the ml-containers repo
  • SCHED-773 Move dcgmi, cuda and nccl-tests roles to the ml-containers
  • Bump docker and nvtop
  • Use base image for jail and active_checks with ansible roles for downloading binaries: nccl-tests, cuda-samples and mlc
  • Use slurm_training_diag as base image for jail
  • SCHED-864 Create sansible docker image for handling the jail state
  • SCHED-906 remove outdated scripts
  • bump nvtop

🐛 Fixes

  • SCHED-567: Ensure deterministic startup order between DB, accounting and controller
  • Fix k8up backup image repo and tag
  • SLURMSUPPORT-75: add more unavailable node states to slurm exporter
  • SCHED-690: removing exporter rb, sa, role from soperator to helm chart
  • Fix pod monitor bug in renderer
  • SCHED-785: Plug-in SPANK plugins properly in slurm job active checks
  • SCHED-807: fix autohealing for nodesets
  • Get correct environment for passive checks
  • turn off dcgmi diag active checks
  • increase default reconfiguration period
  • Use base images without workdir /opt/ansible
  • SCHED-855 add WorkingDir for activecheck container images
  • do not always undrain node
  • Use CLOUD nodes and make gres.conf configurable for NodeSets
  • SCHED-898 Ignore non-draining checks in wait-for-active-checks job
  • SCHED-885: change init containers order
  • Add slurm script that does chmod a+rw for enroot image layers
  • [SCHED-804] Deprecate and make optional slurmNodes.worker field in SlurmCluster CRD
  • Use default nfs-in-k8s for e2e

📦 Dependencies

  • Bump golang.org/x/crypto from 0.45.0 to 0.46.0
  • Bump github.com/onsi/gomega from 1.38.2 to 1.38.3
  • Bump k8s.io/client-go from 0.34.2 to 0.34.3
  • Bump k8s.io/component-base from 0.34.2 to 0.34.3
  • Bump github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring from 0.86.2 to 0.87.1
  • Bump sigs.k8s.io/controller-runtime from 0.22.3 to 0.22.4
  • Bump actions/upload-artifact from 5 to 6
  • Bump actions/download-artifact from 6 to 7
  • Bump filelock from 3.20.0 to 3.20.1 in /ansible in the pip group across 1 directory
  • Bump actions/checkout from 4 to 6
  • Bump actions/checkout from 4 to 6

Other

  • Support customizing built-in Slurm scripts
  • Make Helm chart soperator-activechecks customizable
  • Fixes for issues with metrics and dashboards
  • Add fsGroupChangePolicy: "OnRootMismatch" to NFS server StatefulSet
  • NOTIC: Move status and resolution fields to log labels from body

Contributors:
@github-actions[bot], @dependabot[bot], @dstaroff, @Uburro, @ali-sattari, @asteny, @aaroniscode, @mateusclira-nv, @theyoprst, @andriishestakov, @itechdima, @rdjjke, @ChessProfessor

📁 Categorized PRs: 3985 | 📂 Uncategorized PRs: 358 | 📥 Commits: 250 | ➕ Lines added: 41893 | ➖ Lines deleted: 5554

1.23.3

26 Jan 20:11
c00848c

Changes made since version 1.23.2 prior to version 1.23.3:

🐛 Fixes

  • turn off dcgmi diag active checks
  • increase default reconfiguration period

Contributors:
@itechdima, @ali-sattari

📁 Categorized PRs: 122 | 📂 Uncategorized PRs: 0 | 📥 Commits: 5 | ➕ Lines added: 71 | ➖ Lines deleted: 71