Releases: nebius/soperator

1.21.2

02 Jul 16:53
64a13e9

Changes made since version 1.21.1 prior to version 1.21.2:

🐛 Fixes

  • hotfix: fix customMounts issue with empty array
  • [FIX] Replace kebabcase with custom function
  • NOTIC: Fix cuda-pins versions

Contributors:
@Uburro, @dstaroff, @rdjjke

📁 Categorized PRs 📂 Uncategorized PRs 📥 Commits ➕ Lines added ➖ Lines deleted
193 0 11 282 71

1.21.1

01 Jul 17:57
7067e09

Changes made since version 1.21.0 prior to version 1.21.1:

🚀 Features

  • Accounting external DB SSL support
  • add B200 support in health checker
  • #1093: Update default CUDA 12.4->12.9, NCCL 2.21->2.26, and others
  • NOTIC: Adapt Slurm config for B200
  • Upgrade NCCL-tests v2.16.4
  • Reorganize log directory structure by worker node

🐛 Fixes

  • Add node affinity to jail collector to exclude non-worker nodes
  • #1037: Make health_checker.sh take only the first failed check name
  • hotfix: add pollInterval to otel logs jail
  • add region to o11y
  • add explicit home directory for soperator users
  • Disable OpenTelemetry collectors by default

📦 Dependencies

  • Bump step-security/harden-runner from 2.12.1 to 2.12.2

Contributors:
@webconn, @theyoprst, @itechdima, @Uburro, @rdjjke, @dependabot[bot], @asteny, @ChessProfessor

📁 Categorized PRs 📂 Uncategorized PRs 📥 Commits ➕ Lines added ➖ Lines deleted
863 0 50 864 299

1.21.0

27 Jun 10:51
de52738

Changes made since version 1.20.1 prior to version 1.21.0:

🚀 Features

  • [sconfigcontroller] add annotation for storing path #562
  • Add Soperator exporter app.
  • Add Soperator Exporter infra.
  • [topology-aware] add topologyconfcontroller #427
  • Build multiarchitecture images (amd64/arm64)
  • Soperator Exporter: add node metrics.
  • feat: Add headless service for login pod-to-pod communication
  • Automatically generate Slurm network topology using K8s node labels #427
  • Exporter: metrics for jobs.
  • Add slurm_job_alloc_gpu_seconds_total metric
  • #1000 Support eachWorkerJobArray in ActiveCheck spec
  • Use soperator exporter by default.
  • #710 Add reactions to active checks
  • install health check library in jail
  • Enhance node metrics with state labels and remove job GPU metric
  • #1008 Add clear_enroot_check ActiveCheck
  • bump slurm to the version 24.11.5
  • Speed up CI builds 7min -> 4min
  • Change default slurm config values
  • Enable metrics in Nebius o11y agent #773
  • Add controller RPC metrics export
  • [soperator] add option supporting TopologyPlugin=topology/tree #1048
  • Disable NCCL benchmark in soperator by default
  • #1044 Add all_reduce_perf nccl check
  • add mockery to the make file
  • NCCL Debug SPANK plugin
  • NCCL debug plugin deployment
  • Implement centralized logging scheme for OpenTelemetry collector
  • [EPIC] Replace K8s nodes by setting conditions #913
  • Preinstall Soperator utility scripts to jail
  • add config map with all slurm scripts (prolog, epilog, hc program, etc.)

🐛 Fixes

  • add rbac list node to soperator nodetopology #427
  • [docker] containers run outside parent cgroup of slurmd #563
  • Fix non-working backups
  • Do not lower case for slurm node state (for consistency with jobs).
  • Fix slurm failed states
  • Fix exporter disabling v2
  • fix logs collector cluster name
  • Create slurm job outputs dir using umask
  • Remove validation for TaskPluginParam

📦 Dependencies

  • Bump docker/setup-buildx-action from 3.10.0 to 3.11.0
  • Bump step-security/harden-runner from 2.12.0 to 2.12.1
  • Bump mikepenz/release-changelog-builder-action from 5.3.0 to 5.3.1
  • Bump docker/setup-buildx-action from 3.11.0 to 3.11.1
  • Bump softprops/action-gh-release from 2.2.2 to 2.3.2

Other

  • Change docker hub images for busybox and ubuntu on nebius images
  • Add documentation about the components of the Helm chart.
  • Migrate from pkg/errors to standard library and enable depguard linter
  • Add SLURM exporter documentation

Contributors:
@Uburro, @theyoprst, @itechdima, @asteny, @iamrajiv, @ChessProfessor, @dependabot[bot], @dstaroff, @rdjjke

📁 Categorized PRs 📂 Uncategorized PRs 📥 Commits ➕ Lines added ➖ Lines deleted
3203 293 353 20628 3004

1.20.1

11 Jun 15:33
22c49a6

Changes made since version 1.20.0 prior to version 1.20.1:

Fixes:

  • Wrong backup chart
  • Missing sizeLimit for in-memory volume

📁 Categorized PRs 📂 Uncategorized PRs 📥 Commits ➕ Lines added ➖ Lines deleted
4 55 48

1.20.0

02 Jun 13:43
0b73e60

Changes made since version 1.19.0 prior to version 1.20.0:

🚀 Features

  • Change soperatorconfig CRD for k8sJob
  • 508: SConfigController | Added controller logic
  • bump versions of go 1.24 and controller-runtime 0.20.3
  • Support custom init containers
  • #618: Speedup nvidia driver configuration on the worker start
  • add service monitor soperator and soperator checks
  • Soperatorchecks k8s job
  • Add worker features support.
  • Issue-654 Customize slurm healthcheck script and interval
  • #639 Change ActiveCheck CR status based on k8s jobs
  • #623 Advanced stateful sets
  • add manual job trigger for runAfterCreation
  • #720 [soperatorchecks] Use ScriptRefName if it exists for k8sjob type
  • #738 Remove CronJob on ActiveCheck deletion
  • #707 [soperatorchecks] Create slurmJob check type
  • #792 [soperatorchecks] Add Sbatchscript field for creating sbatch con…
  • Added templated PVC
  • #791 Helm for ActiveCheck
  • Basic Helm chart for DCGM exporter with HPC job mapping
  • Helm chart version sync for dcgm-exporter
  • Use Nebius public debian registry for package installation
  • Remove $GOARCH from docker images
  • Adding DCGM Exporter to helm chart and values
  • #711 Put SlurmJob state to ActiveCheck status
  • #874 Decouple ServiceAccountReconciler from ActiveCheckReconciler sync

🧪 Tests

  • terraform apply/destroy scenario
  • fix env vars in e2e tests
  • source envrc in test step and bypass all envs to terraform
  • fix filestore_jail override
  • add o11y secret creation in tests
  • fix k8s config path
  • reduce logs in e2e

🐛 Fixes

  • bump golang version to 1.24 and fix bug with parse of args
  • [BUG] do not reconcile on heartbeat
  • NOISSUE: reduction of DebugFlags and removal of the preStop hook
  • 453 controller for slurm clients
  • do not reconcile on updates
  • base image with ssh for k8s jobs
  • add remoteWrite to nebius #771
  • fix missing custom mounts validation
  • Fix tolerations in exporter
  • refactor(reconciler): simplify Service annotation merge with maps.Copy
  • [Bug] Fix rights on /etc/slurm directory #944

📦 Dependencies

  • build(deps): bump docker/login-action from 327cd5a69de6c009b9ce71bce8395f28e651bf99 to 74a5d142397b4f367a81961eba4e8cd7edddf772
  • build(deps): bump actions/setup-go from 5.3.0 to 5.4.0
  • build(deps): bump sigs.k8s.io/controller-runtime from 0.19.4 to 0.20.3
  • build(deps): bump github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring from 0.79.2 to 0.81.0
  • build(deps): bump golang.org/x/crypto from 0.33.0 to 0.36.0
  • build(deps): bump google.golang.org/grpc from 1.70.0 to 1.71.0 in /images/worker/gpubench
  • build(deps): bump go.opentelemetry.io/otel/metric from 1.34.0 to 1.35.0 in /images/worker/gpubench
  • build(deps): bump go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc from 1.34.0 to 1.35.0 in /images/worker/gpubench
  • build(deps): bump go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp from 1.34.0 to 1.35.0 in /images/worker/gpubench
  • build(deps): bump actions/setup-go from 5.3.0 to 5.4.0
  • build(deps): bump k8s.io/client-go from 0.32.2 to 0.32.3 in /images/worker/gpubench
  • build(deps): bump github.com/onsi/gomega from 1.36.2 to 1.36.3
  • build(deps): bump k8s.io/api from 0.32.2 to 0.32.3
  • build(deps): bump mikepenz/release-changelog-builder-action from 5.2.0 to 5.3.0
  • build(deps): bump google.golang.org/grpc from 1.71.0 to 1.71.1 in /images/worker/gpubench
  • build(deps): bump step-security/harden-runner from 2.11.0 to 2.11.1
  • build(deps): bump golang.org/x/net from 0.35.0 to 0.36.0 in /images/worker/gpubench
  • build(deps): bump github.com/golang-jwt/jwt/v5 from 5.2.1 to 5.2.2
  • build(deps): bump k8s.io/client-go from 0.32.2 to 0.32.3
  • build(deps): bump github.com/mariadb-operator/mariadb-operator from 0.37.2-0.20250322213015-28afeb2813ef to 0.38.1
  • #623 Advanced stateful sets
  • build(deps): bump softprops/action-gh-release from 2.2.0 to 2.2.2
  • build(deps): bump step-security/harden-runner from 2.11.1 to 2.12.0
  • build(deps): bump k8s.io/api from 0.32.3 to 0.32.4 in /images/worker/gpubench
  • build(deps): bump k8s.io/client-go from 0.32.3 to 0.32.4 in /images/worker/gpubench
  • build(deps): bump google.golang.org/grpc from 1.71.1 to 1.72.0 in /images/worker/gpubench
  • build(deps): bump actions/setup-go from 5.4.0 to 5.5.0
  • build(deps): bump google.golang.org/grpc from 1.72.0 to 1.72.1 in /images/worker/gpubench

Other

  • Branch after release 1-19-0/0
  • Add workerAnnotations to worker definition
  • add dummy github action for e2e
  • allow to run one_job for PR from forks
  • fix wf trigger to not duplicate runs
  • feat: add support for command and args in NodeContainer
  • use vars instead of secrets, fix syntax
  • fix default value for terraform checkout, fix path
  • fix wrong path to installation
  • fix terraform repo ref
  • #618: [Speedup] Disable ldconfig on nvidia driver configuration.
  • Add ChessProfessor to CODEOWNERS
  • [fluxcd] add gpu-operator #653
  • [fluxcd] add ns to Kustomization #653
  • [fluxcd] add nvidia-network-operator #653
  • Rename createuser to screateuser
  • add e2e schedule
  • always run post summary and artifact
  • Add k8s job base image for SlurmJob check
  • Release 1.20.0

Contributors:
@asteny, @dependabot[bot], @andrei-pokhila, @angelbejarano, @itechdima, @karaimin, @Uburro, @rdjjke, @theyoprst, @ChessProfessor, @dstaroff, @andreineustroev, @ali-sattari, @apten-fors, @iamrajiv

📁 Categorized PRs 📂 Uncategorized PRs 📥 Commits ➕ Lines added ➖ Lines deleted
6374 1059 433 181214 39034

1.19.0

13 Mar 12:25
0e64cb0

Changes made since version 1.18.3 prior to version 1.19.0:

🚀 Features

  • 456: add cel validation nodeConfigurator
  • Separate package installation and remove unused packages from containers
  • Use Nebius container mirrored images because of docker hub limits
  • 480: do not run NCCL test if some process is running on GPUs
  • Support customisable container mounts
  • 504: customisable slurm config
  • add Slurm topology config support
  • Release 1.19.0
  • Added new label for ConfigMaps with slurm configs

🐛 Fixes

  • fix IsMaintenanceActive
  • Increase the default Slurm MessageTimeout
  • #485 fix bug with Replication mariadb.spec
  • #485 remove immutable field from reconcile
  • fix AccountingStorageHost fqdn name
  • fix missing column cmd in jail
  • Add comment in the beginning of custom_slurm.conf file
  • #526 Fix bug for cannot stat file /etc/slurm/slurm_rest.conf
  • Release 1.19.0
  • Revert "Added new label for ConfigMaps with slurm configs"
  • fix autohealing

📦 Dependencies

  • build(deps): bump mikepenz/release-changelog-builder-action from 5.0.0 to 5.2.0
  • build(deps): bump github.com/containers/common from 0.59.0 to 0.60.4
  • build(deps): bump docker/setup-buildx-action from 3.9.0 to 3.10.0

Other

  • add image build and env var for soperatorchecks
  • NOTIC Fix mistake in license
  • images: do not bind-mount slurm configs when possible
  • Implement NodeSet CRD
  • fix false positive reboots

Contributors:
@Uburro, @rdjjke, @itechdima, @asteny, @dependabot[bot], @webconn, @dstaroff, @andrei-pokhila

📁 Categorized PRs 📂 Uncategorized PRs 📥 Commits ➕ Lines added ➖ Lines deleted
1738 260 152 36370 8158

1.18.3

20 Feb 13:18
dfe10cf

Changes made since version 1.18.2 prior to version 1.18.3:

  • no changes

📁 Categorized PRs 📂 Uncategorized PRs 📥 Commits ➕ Lines added ➖ Lines deleted
3 26 23

1.18.2

18 Feb 20:16
ff2fef7

Changes made since version 1.18.1 prior to version 1.18.2:

  • no changes

📁 Categorized PRs 📂 Uncategorized PRs 📥 Commits ➕ Lines added ➖ Lines deleted
4 56 24

1.18.1

18 Feb 12:40
8d2265e

Changes made since version 1.18.0 prior to version 1.18.1:

  • Hotfix populate jail job reconciliation (issue)

📁 Categorized PRs 📂 Uncategorized PRs 📥 Commits ➕ Lines added ➖ Lines deleted
2 23 23

1.18.0

13 Feb 20:34
b580718

Changes made since version 1.17.0 prior to version 1.18.0:

🚀 Features

  • add downscaleAndOverwritePopulateJail
  • add priority class
  • Update Slurm to 24.05.5, fix MPI/PMIx bug, update README, and remove AppArmor hack
  • MSP-3516: settings of accounting to scrape jobs stats
  • Print actual command before executing it in bash scripts
  • Move gpubench to worker image and bind mount it
  • Move chroot plugin inside containers and bind mount it
  • Move enroot inside images and bind mount it
  • NOTASK: add debug logs
  • Move Pyxis from jail to images and bind-mount it
  • MSP-4080: add simple rebooter
  • MSP-4080: add CheckNodeCondition to rebooter
  • MSP-4080: add rebooting node check
  • MSP-4080: add reboot node and build image
  • MSP-4080: add handleNodeReboot, handleNodeDrain, handleNodeUnDrain and fix patch condition
  • Preinstall Nvidia mock packages issues/384
  • Install nvtop as deb package from repo and bind mount it from container to the jail filesystem
  • Preinstall dcgmi tools to the jail
  • MSP-4080: add render, reconcile rebooter and rbac
  • Remove Nvidia CUDA from worker image and apt clean
  • Build jail image based on own CUDA packages installation
  • Add Epilog and Prolog options
  • Pre-create enroot credentials, exclude enroot bind-mounts from motd, make Docker mount /dev/infiniband, store /tmp on disk and add tmpfs /mnt/memory

🐛 Fixes

  • MSP-3918: Fix bug reconciliation logic for scenarios with maintenance=true and accounting=false
  • Update Slurm to 24.05.5, fix MPI/PMIx bug, update README, and remove AppArmor hack
  • NOTIC: Keep more failed NCCL benchmark jobs in the history instead of…
  • MSP-3515: fix mistake in values slurmdbdConfig and slurmConfig
  • [Fix] Install libpmix into nccl-benchmark image
  • Remove openmpi from controller
  • MSP-3992: fix bug with empty version of annotation
  • [FIX] Add patching for service annotations [MSP-3801]
  • fix: update AppArmor profile to allow creation of library links
  • NOTASK: fix bug invalid memory address or nil pointer when get role
  • Enable leader election for controller manager by default
  • Change watching ns mechanism
  • MSP-4080: fix bugs with stuck draining condition
  • Temporarily remove expose_enroot_logs flag
  • Fix ci for external contributors
  • Fix non-zero error handling in gpu_healthcheck.sh
  • Pre-create enroot credentials, exclude enroot bind-mounts from motd, make Docker mount /dev/infiniband, store /tmp on disk and add tmpfs /mnt/memory

📦 Dependencies

  • build(deps): bump alpine from b97e2a8 to 56fa17d
  • bump github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring from 0.78.2
  • build(deps): bump golang from 7ea4c9d to a6927f4
  • build(deps): bump golang from a6927f4 to 585103a
  • build(deps): bump k8s.io/apimachinery from 0.32.0 to 0.32.1
  • build(deps): bump k8s.io/api from 0.32.0 to 0.32.1
  • build(deps): bump golang from 585103a to 9820aca
  • build(deps): bump k8s.io/client-go from 0.32.0 to 0.32.1
  • build(deps): bump golang from 9820aca to 51a6466
  • bump golang.org/x/net to v0.33.0
  • build(deps): bump step-security/harden-runner from 2.10.2 to 2.10.4
  • build(deps): bump actions/setup-go from 5.2.0 to 5.3.0
  • build(deps): bump docker/login-action from 7ca345011ac4304463197fac0e56eab1bc7e6af0 to 327cd5a69de6c009b9ce71bce8395f28e651bf99
  • build(deps): bump google.golang.org/grpc from 1.69.2 to 1.69.4 in /images/worker/gpubench
  • build(deps): bump go.opentelemetry.io/otel/sdk from 1.33.0 to 1.34.0 in /images/worker/gpubench
  • build(deps): bump golang from 51a6466 to 8c10f21
  • build(deps): bump go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp from 1.33.0 to 1.34.0 in /images/worker/gpubench
  • build(deps): bump go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc from 1.33.0 to 1.34.0 in /images/worker/gpubench
  • build(deps): bump google.golang.org/grpc from 1.69.4 to 1.70.0 in /images/worker/gpubench
  • Bump kube-apiserver v0.32.1 in gpubench
  • Bump go version for gpubench
  • build(deps): bump golang from 8c10f21 to e213430
  • build(deps): bump golang from e213430 to 9271129
  • build(deps): bump docker/setup-buildx-action from 3.8.0 to 3.9.0
  • build(deps): bump golang.org/x/crypto from 0.32.0 to 0.33.0

Other

  • fix docs about GPUs are required #306
  • Revert "Print actual command before executing it in bash scripts"
  • Update pyxis version with container_image_save and expose_enroot_logs enabled

Contributors:
@Uburro, @dependabot[bot], @asteny, @rdjjke, @dstaroff, @itechdima, @nandexsp, @angelbejarano

📁 Categorized PRs 📂 Uncategorized PRs 📥 Commits ➕ Lines added ➖ Lines deleted
5301 235 196 4604 1434