Releases: nebius/soperator
1.21.2
Changes made since version 1.21.1 prior to version 1.21.2:
Fixes
- hotfix: change customMounts issue with empty array
- PR: #1132
- [FIX] Replace `kebabcase` with custom function
- PR: #1103
- NOTIC: Fix cuda-pins versions
- PR: #1135
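The fix listed above under PR #1103 replaces `kebabcase` with a custom function. The PR's code is not reproduced in these notes; the following is only a minimal sketch of what a kebab-case conversion can look like, written in Python for illustration. The function name `to_kebab_case` and its splitting rules are assumptions, not Soperator's implementation.

```python
import re


def to_kebab_case(name: str) -> str:
    """Convert camelCase, PascalCase, snake_case, or spaced names to kebab-case.

    Hypothetical helper for illustration; not taken from the Soperator code base.
    """
    # Split an acronym from a following capitalized word: "HTTPServer" -> "HTTP-Server".
    s = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1-\2", name)
    # Split a lowercase letter or digit from a following uppercase letter.
    s = re.sub(r"([a-z0-9])([A-Z])", r"\1-\2", s)
    # Collapse runs of whitespace, underscores, or hyphens into a single hyphen.
    s = re.sub(r"[\s_\-]+", "-", s)
    # Drop leading/trailing hyphens and lowercase the result.
    return s.strip("-").lower()
```

For example, `to_kebab_case("customMounts")` yields `custom-mounts`. Handling acronyms explicitly is one design choice among several; a real replacement would need to match whatever naming the consuming templates expect.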
Contributors:
@Uburro, @dstaroff, @rdjjke
| Categorized PRs | Uncategorized PRs | Commits | Lines added | Lines deleted |
|---|---|---|---|---|
| 193 | 0 | 11 | 282 | 71 |
1.21.1
Changes made since version 1.21.0 prior to version 1.21.1:
Features
- Accounting external DB SSL support
- PR: #1031
- add b200 support in health checker
- PR: #1092
- #1093: Update default CUDA 12.4->12.9, NCCL 2.21->2.26, and others
- PR: #1094
- NOTIC: Adapt Slurm config for B200
- PR: #1101
- Upgrade NCCL-tests v2.16.4
- PR: #1100
- Reorganize log directory structure by worker node
- PR: #1108
Fixes
- Add node affinity to jail collector to exclude non-worker nodes
- PR: #1085
- #1037: Make health_checker.sh take only the first failed check name
- PR: #1095
- hotfix: add pollInterval to otel logs jail
- PR: #1098
- add region to o11y
- PR: #1102
- add explicit home directory for soperator users
- PR: #1096
- Disable OpenTelemetry collectors by default
- PR: #1112
Dependencies
- Bump step-security/harden-runner from 2.12.1 to 2.12.2
- PR: #1104
Contributors:
@webconn, @theyoprst, @itechdima, @Uburro, @rdjjke, @dependabot[bot], @asteny, @ChessProfessor
| Categorized PRs | Uncategorized PRs | Commits | Lines added | Lines deleted |
|---|---|---|---|---|
| 863 | 0 | 50 | 864 | 299 |
1.21.0
Changes made since version 1.20.1 prior to version 1.21.0:
Features
- [sconfigcontroller] add annotation for storing path #562
- PR: #955
- Add Soperator exporter app.
- PR: #959
- Add Soperator Exporter infra.
- PR: #960
- [topology-aware] add topologyconfcontroller #427
- PR: #940
- Build multiarchitecture images (amd64/arm64)
- PR: #947
- Soperator Exporter: add node metrics.
- PR: #971
- feat: Add headless service for login `pod-to-pod` communication
- PR: #945
- Automatically generate Slurm network topology using K8s node labels #427
- PR: #967
- Exporter: metrics for jobs.
- PR: #978
- Add slurm_job_alloc_gpu_seconds_total metric
- PR: #983
- #1000 Support eachWorkerJobArray in ActiveCheck spec
- PR: #981
- Use soperator exporter by default.
- PR: #1003
- #710 Add reactions to active checks
- PR: #998
- install health check library in jail
- PR: #1009
- Enhance node metrics with state labels and remove job GPU metric
- PR: #1013
- #1008 Add clear_enroot_check ActiveCheck
- PR: #1012
- bump slurm to the version 24.11.5
- PR: #1014
- Speed up CI builds 7min -> 4min
- PR: #1042
- Change default slurm config values
- PR: #1030
- Enable metrics in Nebius o11y agent #773
- PR: #1050
- Add controller RPC metrics export
- PR: #1053
- [soperator] add option supporting TopologyPlugin=topology/tree #1048
- PR: #1054
- [soperator] add option supporting TopologyPlugin=topology/tree #1048
- PR: #1059
- Disable NCCL benchmark in soperator by default
- PR: #1062
- #1044 Add all_reduce_perf nccl check
- PR: #1047
- add mockery to the make file
- PR: #1065
- NCCL Debug SPANK plugin
- PR: #832
- NCCL debug plugin deployment
- PR: #1036
- Implement centralized logging scheme for OpenTelemetry collector
- PR: #1064
- [EPIC] Replace K8s nodes by setting conditions #913
- PR: #1066
- Preinstall Soperator utility scripts to jail
- PR: #1070
- add config map with all slurm scripts (prolog, epilog, hc program, etc.)
- PR: #1037
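The topology entries above (#427) describe generating Slurm's network topology from Kubernetes node labels. As a rough sketch of that idea, written in Python for illustration only, the snippet below groups nodes by a switch label and emits `topology.conf`-style lines. The label key `example.com/switch`, the function name `build_topology_conf`, and the single root switch are assumptions, not Soperator's actual labels or controller logic.

```python
from collections import defaultdict


def build_topology_conf(node_labels, switch_label="example.com/switch", root_switch="root"):
    """Render topology.conf-style lines from a {node_name: {label: value}} mapping.

    Hypothetical sketch: the label key and the flat two-level tree are assumptions.
    """
    # Group node names by the value of the (assumed) switch label.
    by_switch = defaultdict(list)
    for node, labels in node_labels.items():
        switch = labels.get(switch_label)
        if switch is not None:  # nodes without the label are left out of the topology
            by_switch[switch].append(node)

    lines = []
    # One leaf switch line per label value, listing its nodes.
    for switch in sorted(by_switch):
        nodes = ",".join(sorted(by_switch[switch]))
        lines.append(f"SwitchName={switch} Nodes={nodes}")
    # A single top-level switch connecting all leaf switches.
    if by_switch:
        lines.append(f"SwitchName={root_switch} Switches={','.join(sorted(by_switch))}")
    return "\n".join(lines)
```

A tree with one root is the simplest shape that `TopologyPlugin=topology/tree` can consume; deeper hierarchies would need additional label tiers.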
Fixes
- add rbac list node to soperator nodetopology #427
- PR: #963
- [docker] containers runs outside parent cgroup of slurmd #563
- PR: #970
- Fix not working backups
- PR: #974
- Do not lower case for slurm node state (for consistency with jobs).
- PR: #999
- Fix slurm failed states
- PR: #1011
- Fix exporter disabling v2
- PR: #1023
- fix logs collector cluster name
- PR: #1056
- Create slurm job outputs dir using umask
- PR: #1069
- Remove validation for TaskPluginParam
- PR: #1079
Dependencies
- Bump docker/setup-buildx-action from 3.10.0 to 3.11.0
- PR: #1004
- Bump step-security/harden-runner from 2.12.0 to 2.12.1
- PR: #988
- Bump mikepenz/release-changelog-builder-action from 5.3.0 to 5.3.1
- PR: #956
- Bump docker/setup-buildx-action from 3.11.0 to 3.11.1
- PR: #1015
- Bump softprops/action-gh-release from 2.2.2 to 2.3.2
- PR: #989
Other
- Change docker hub images for busybox and ubuntu on nebius images
- PR: #969
- Add documentation about the components of the Helm chart.
- PR: #972
- Migrate from pkg/errors to standard library and enable depguard linter
- PR: #1002
- Add SLURM exporter documentation
- PR: #1058
Contributors:
@Uburro, @theyoprst, @itechdima, @asteny, @iamrajiv, @ChessProfessor, @dependabot[bot], @dstaroff, @rdjjke
| Categorized PRs | Uncategorized PRs | Commits | Lines added | Lines deleted |
|---|---|---|---|---|
| 3203 | 293 | 353 | 20628 | 3004 |
1.20.1
Changes made since version 1.20.0 prior to version 1.20.1:
Fixes:
- Wrong backup chart
- Missing `sizeLimit` for `in-memory` volume
| Categorized PRs | Uncategorized PRs | Commits | Lines added | Lines deleted |
|---|---|---|---|---|
| 4 | 55 | 48 |
1.20.0
Changes made since version 1.19.0 prior to version 1.20.0:
Features
- Change soperatorconfig CRD for k8sJob
- PR: #559
- 508: SConfigController | Added controller logic
- PR: #543
- bump versions of go 1.24 and controller-runtime 0.20.3
- PR: #588
- Support custom init containers
- PR: #584
- #618: Speedup nvidia driver configuration on the worker start
- PR: #622
- add service monitor soperator and soperator checks
- PR: #627
- Soperatorchecks k8s job
- PR: #582
- Add worker features support.
- PR: #655
- Issue-654 Customize slurm healthcheck script and interval
- PR: #673
- #639 Change ActiveCheck CR status based on k8s jobs
- PR: #703
- #623 Advanced stateful sets
- PR: #672
- add manual job trigger for runAfterCreation
- PR: #733
- #720 [soperatorchecks] Use ScriptRefName if it exists for k8sjob type
- PR: #735
- #738 Remove CronJob on ActiveCheck deletion
- PR: #766
- #707 [soperatorchecks] Create slurmJob check type
- PR: #743
- #792 [soperatorchecks] Add Sbatchscript field for creating sbatch con…
- PR: #813
- Added templated PVC
- PR: #849
- #791 Helm for ActiveCheck
- PR: #856
- Basic Helm chart for DCGM exporter with HPC job mapping
- PR: #885
- Helm chart version sync for dcgm-exporter
- PR: #889
- Use Nebius public debian registry for package installation
- PR: #898
- Remove $GOARCH from docker images
- PR: #921
- Adding DCGM Exporter to helm chart and values
- PR: #873
- #711 Put SlurmJob state to ActiveCheck status
- PR: #942
- #874 Decouple `ServiceAccountReconciler` from `ActiveCheckReconciler` sync
- PR: #950
Tests
- terraform apply/destroy scenario
- PR: #591
- fix env vars in e2e tests
- PR: #607
- source envrc in test step and bypass all envs to terraform
- PR: #608
- fix filestore_jail override
- PR: #610
- add o11y secret creation in tests
- PR: #613
- fix k8s config path
- PR: #614
- reduce logs in e2e
- PR: #756
Fixes
- bump golang version to 1.24 and fix bug with parse of args
- PR: #590
- [BUG] do not reconcile on heartbeat
- PR: #625
- NOISSUE: reduction of DebugFlags and removal of the preStop hook
- PR: #757
- 453 controller for slurm clients
- PR: #841
- do not reconcile on updates
- PR: #844
- base image with ssh for k8s jobs
- PR: #848
- add remoteWrite to nebius #771
- PR: #855
- fix missing custom mounts validation
- PR: #875
- Fix tolerations in exporter
- PR: #916
- refactor(reconciler): simplify Service annotation merge with maps.Copy
- PR: #932
- [Bug] Fix rights on `/etc/slurm` directory #944
- PR: #946
Dependencies
- build(deps): bump docker/login-action from 327cd5a69de6c009b9ce71bce8395f28e651bf99 to 74a5d142397b4f367a81961eba4e8cd7edddf772
- PR: #555
- build(deps): bump actions/setup-go from 5.3.0 to 5.4.0
- PR: #564
- build(deps): bump sigs.k8s.io/controller-runtime from 0.19.4 to 0.20.3
- PR: #542
- build(deps): bump github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring from 0.79.2 to 0.81.0
- PR: #547
- build(deps): bump golang.org/x/crypto from 0.33.0 to 0.36.0
- PR: #532
- build(deps): bump google.golang.org/grpc from 1.70.0 to 1.71.0 in /images/worker/gpubench
- PR: #520
- build(deps): bump go.opentelemetry.io/otel/metric from 1.34.0 to 1.35.0 in /images/worker/gpubench
- PR: #531
- build(deps): bump go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc from 1.34.0 to 1.35.0 in /images/worker/gpubench
- PR: #529
- build(deps): bump go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp from 1.34.0 to 1.35.0 in /images/worker/gpubench
- PR: #528
- build(deps): bump actions/setup-go from 5.3.0 to 5.4.0
- PR: #579
- build(deps): bump k8s.io/client-go from 0.32.2 to 0.32.3 in /images/worker/gpubench
- PR: #595
- build(deps): bump github.com/onsi/gomega from 1.36.2 to 1.36.3
- PR: #594
- build(deps): bump k8s.io/api from 0.32.2 to 0.32.3
- PR: #592
- build(deps): bump mikepenz/release-changelog-builder-action from 5.2.0 to 5.3.0
- PR: #615
- build(deps): bump google.golang.org/grpc from 1.71.0 to 1.71.1 in /images/worker/gpubench
- PR: #626
- build(deps): bump step-security/harden-runner from 2.11.0 to 2.11.1
- PR: #637
- build(deps): bump golang.org/x/net from 0.35.0 to 0.36.0 in /images/worker/gpubench
- PR: #685
- build(deps): bump github.com/golang-jwt/jwt/v5 from 5.2.1 to 5.2.2
- PR: #686
- build(deps): bump k8s.io/client-go from 0.32.2 to 0.32.3
- PR: #693
- build(deps): bump github.com/mariadb-operator/mariadb-operator from 0.37.2-0.20250322213015-28afeb2813ef to 0.38.1
- PR: #718
- #623 Advanced stateful sets
- PR: #672
- build(deps): bump softprops/action-gh-release from 2.2.0 to 2.2.2
- PR: #729
- build(deps): bump step-security/harden-runner from 2.11.1 to 2.12.0
- PR: #737
- build(deps): bump k8s.io/api from 0.32.3 to 0.32.4 in /images/worker/gpubench
- PR: #741
- build(deps): bump k8s.io/client-go from 0.32.3 to 0.32.4 in /images/worker/gpubench
- PR: #740
- build(deps): bump google.golang.org/grpc from 1.71.1 to 1.72.0 in /images/worker/gpubench
- PR: #728
- build(deps): bump actions/setup-go from 5.4.0 to 5.5.0
- PR: #838
- build(deps): bump google.golang.org/grpc from 1.72.0 to 1.72.1 in /images/worker/gpubench
- PR: #853
Other
- Branch after release 1-19-0/0
- PR: #535
- Add workerAnnotations to worker definition
- PR: #540
- add dummy github action for e2e
- PR: #561
- allow to run one_job for PR from forks
- PR: #575
- fix wf trigger to not duplicate runs
- PR: #580
- feat: add support for command and args in NodeContainer
- PR: #574
- use vars instead of secrets, fix syntax
- PR: #598
- fix default value for terraform checkout, fix path
- PR: #599
- fix wrong path to installation
- PR: #601
- fix terraform repo ref
- PR: #602
- #618: [Speedup] Disable ldconfig on nvidia driver configuration.
- PR: #634
- Add ChessProfessor to CODEOWNERS
- PR: #656
- [fluxcd] add gpu-operator #653
- PR: #674
- [fluxcd] add ns to Kustomization #653
- PR: #675
- [fluxcd] add nvidia-network-operator #653
- PR: #676
- Rename `createuser` to `screateuser`
- PR: #732
- add e2e schedule
- PR: #748
- always run post summary and artifact
- PR: #765
- Add k8s job base image for SlurmJob check
- PR: #812
- Release 1.20.0
- PR: #936
Contributors:
@asteny, @dependabot[bot], @andrei-pokhila, @angelbejarano, @itechdima, @karaimin, @Uburro, @rdjjke, @theyoprst, @ChessProfessor, @dstaroff, @andreineustroev, @ali-sattari, @apten-fors, @iamrajiv
| Categorized PRs | Uncategorized PRs | Commits | Lines added | Lines deleted |
|---|---|---|---|---|
| 6374 | 1059 | 433 | 181214 | 39034 |
1.19.0
Changes made since version 1.18.3 prior to version 1.19.0:
Features
- 456: add cel validation nodeConfigurator
- PR: #468
- Separate package installation and remove unused packages from containers
- PR: #475
- Use Nebius container mirrored images because of docker hub limits
- PR: #482
- 480: do not run nccl test if some process is running on gpus
- PR: #496
- Support customisable container mounts
- PR: #498
- 504: customisable slurm config
- PR: #506
- add Slurm topology config support
- PR: #512
- Release 1.19.0
- PR: #523
- Added new label for ConfigMaps with slurm configs
- PR: #546
Fixes
- fix IsMaintenanceActive
- PR: #459
- Increase the default Slurm MessageTimeout
- PR: #460
- #485 fix bug with Replication mariadb.spec
- PR: #501
- #485 remove immutable field from reconcile
- PR: #502
- fix AccountingStorageHost fqdn name
- PR: #514
- fix missing column cmd in jail
- PR: #525
- Add comment in the beginning of custom_slurm.conf file
- PR: #527
- #526 Fix bug for cannot stat file /etc/slurm/slurm_rest.conf
- PR: #533
- Release 1.19.0
- PR: #523
- Revert "Added new label for ConfigMaps with slurm configs"
- PR: #549
- fix autohealing
- PR: #548
Dependencies
- build(deps): bump mikepenz/release-changelog-builder-action from 5.0.0 to 5.2.0
- PR: #473
- build(deps): bump github.com/containers/common from 0.59.0 to 0.60.4
- PR: #478
- build(deps): bump docker/setup-buildx-action from 3.9.0 to 3.10.0
- PR: #500
Other
- add image build and env var for soperatorchecks
- PR: #467
- NOTIC Fix mistake in license
- PR: #477
- images: do not bind-mount slurm configs when possible
- PR: #497
- Implement NodeSet CRD
- PR: #505
- fix false positive reboots
- PR: #550
Contributors:
@Uburro, @rdjjke, @itechdima, @asteny, @dependabot[bot], @webconn, @dstaroff, @andrei-pokhila
| Categorized PRs | Uncategorized PRs | Commits | Lines added | Lines deleted |
|---|---|---|---|---|
| 1738 | 260 | 152 | 36370 | 8158 |
1.18.3
Changes made since version 1.18.2 prior to version 1.18.3:
- no changes
| Categorized PRs | Uncategorized PRs | Commits | Lines added | Lines deleted |
|---|---|---|---|---|
| 3 | 26 | 23 |
1.18.2
Changes made since version 1.18.1 prior to version 1.18.2:
- no changes
| Categorized PRs | Uncategorized PRs | Commits | Lines added | Lines deleted |
|---|---|---|---|---|
| 4 | 56 | 24 |
1.18.1
1.18.0
Changes made since version 1.17.0 prior to version 1.18.0:
Features
- add downscaleAndOverwritePopulateJail
- PR: #311
- add priority class
- PR: #313
- Update Slurm to 24.05.5, fix MPI/PMIx bug, update README, and remove AppArmor hack
- PR: #316
- MSP-3516: settings of accounting to scrape jobs stats
- PR: #321
- Print actual command before executing it in bash scripts
- PR: #329
- Move gpubench to worker image and bind mount it
- PR: #333
- Move chroot plugin inside containers and bind mount it
- PR: #335
- Move enroot inside images and bind mount it
- PR: #339
- NOTASK: add debug logs
- PR: #357
- Move Pyxis from jail to images and bind-mount it
- PR: #361
- MSP-4080: add simple rebooter
- PR: #369
- MSP-4080: add CheckNodeCondition to rebooter
- PR: #372
- MSP-4080: add rebooting node check
- PR: #377
- MSP-4080: add reboot node and build image
- PR: #381
- MSP-4080: add handleNodeReboot, handleNodeDrain, handleNodeUnDrain and fix patch condition
- PR: #383
- Preinstall Nvidia mock packages issues/384
- PR: #387
- Install nvtop as deb package from repo and bind mount it from container to the jail filesystem
- PR: #390
- Preinstall dcgmi tools to the jail
- PR: #394
- MSP-4080: add render, reconcile rebooter and rbac
- PR: #391
- Remove Nvidia CUDA from worker image and apt clean
- PR: #397
- Build jail image based on own CUDA packages installation
- PR: #415
- Add Epilog and Prolog options
- PR: #411
- Pre-create enroot credentials, exclude enroot bind-mounts from motd, make Docker mount /dev/infiniband, store /tmp on disk and add tmpfs /mnt/memory
- PR: #389
Fixes
- MSP-3918: Fix bug reconciliation logic for scenarios with maintenance=true and accounting=false
- PR: #309
- Update Slurm to 24.05.5, fix MPI/PMIx bug, update README, and remove AppArmor hack
- PR: #316
- NOTIC: Keep more failed NCCL benchmark jobs in the history instead of…
- PR: #315
- MSP-3515: fix mistake in values slurmdbdConfig and slurmConfig
- PR: #318
- [Fix] Install libpmix into nccl-benchmark image
- PR: #319
- Remove openmpi from controller
- PR: #320
- MSP-3992: fix bug with empty version of annotation
- PR: #334
- [FIX] Add patching for service annotations [MSP-3801]
- PR: #354
- fix: update AppArmor profile to allow creation of library links
- PR: #356
- NOTASK: fix bug invalid memory address or nil pointer when get role
- PR: #359
- Enable leader election for controller manager by default
- PR: #365
- Change watching ns mechanism
- PR: #366
- MSP-4080: fix bugs with stuck draining condition
- PR: #399
- Temporarily remove `expose_enroot_logs` flag
- PR: #417
- Fix ci for external contributors
- PR: #419
- Fix non-zero error handling in gpu_healthcheck.sh
- PR: #418
- Pre-create enroot credentials, exclude enroot bind-mounts from motd, make Docker mount /dev/infiniband, store /tmp on disk and add tmpfs /mnt/memory
- PR: #389
Dependencies
- build(deps): bump alpine from `b97e2a8` to `56fa17d`
- PR: #310
- bump github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring from 0.78.2
- PR: #312
- build(deps): bump golang from `7ea4c9d` to `a6927f4`
- PR: #322
- build(deps): bump golang from `a6927f4` to `585103a`
- PR: #323
- build(deps): bump k8s.io/apimachinery from 0.32.0 to 0.32.1
- PR: #325
- build(deps): bump k8s.io/api from 0.32.0 to 0.32.1
- PR: #324
- build(deps): bump golang from `585103a` to `9820aca`
- PR: #328
- build(deps): bump k8s.io/client-go from 0.32.0 to 0.32.1
- PR: #327
- build(deps): bump golang from `9820aca` to `51a6466`
- PR: #331
- bump golang.org/x/net to v0.33.0
- PR: #340
- build(deps): bump step-security/harden-runner from 2.10.2 to 2.10.4
- PR: #341
- build(deps): bump actions/setup-go from 5.2.0 to 5.3.0
- PR: #342
- build(deps): bump docker/login-action from 7ca345011ac4304463197fac0e56eab1bc7e6af0 to 327cd5a69de6c009b9ce71bce8395f28e651bf99
- PR: #344
- build(deps): bump google.golang.org/grpc from 1.69.2 to 1.69.4 in /images/worker/gpubench
- PR: #345
- build(deps): bump go.opentelemetry.io/otel/sdk from 1.33.0 to 1.34.0 in /images/worker/gpubench
- PR: #346
- build(deps): bump golang from `51a6466` to `8c10f21`
- PR: #338
- build(deps): bump go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp from 1.33.0 to 1.34.0 in /images/worker/gpubench
- PR: #349
- build(deps): bump go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc from 1.33.0 to 1.34.0 in /images/worker/gpubench
- PR: #353
- build(deps): bump google.golang.org/grpc from 1.69.4 to 1.70.0 in /images/worker/gpubench
- PR: #358
- Bump kube-apiserver v0.32.1 in gpubench
- PR: #367
- Bump go version for gpubench
- PR: #368
- build(deps): bump golang from `8c10f21` to `e213430`
- PR: #386
- build(deps): bump golang from `e213430` to `9271129`
- PR: #392
- build(deps): bump docker/setup-buildx-action from 3.8.0 to 3.9.0
- PR: #402
- build(deps): bump golang.org/x/crypto from 0.32.0 to 0.33.0
- PR: #421
Other
- fix docs about GPUs are required #306
- PR: #317
- Revert "Print actual command before executing it in bash scripts"
- PR: #332
- Update pyxis version with `container_image_save` and `expose_enroot_logs` enabled
- PR: #376
Contributors:
@Uburro, @dependabot[bot], @asteny, @rdjjke, @dstaroff, @itechdima, @nandexsp, @angelbejarano
| Categorized PRs | Uncategorized PRs | Commits | Lines added | Lines deleted |
|---|---|---|---|---|
| 5301 | 235 | 196 | 4604 | 1434 |