1.22.0
Changes made since version 1.21.9 prior to version 1.22.0:
Features
- [ha controller] add placeholder and single replicas
- PR: #1344
- Bump container toolkit and do not install it into the jail
- PR: #1347
- Bump slurm 24.11.6
- PR: #1352
- Improve job metrics handling in exporter
- PR: #1358
- Speedup populate_jail + don't overwrite existing data
- PR: #1278
- Implement async metrics collection for SLURM exporter
- PR: #1376
- add scrape node conditions and pod termination reasons metrics from ksm
- PR: #1379
- add gpu-fryer check
- PR: #1375
- Add memory latency and bandwidth Active Checks
- PR: #1368
- Add self-monitoring metrics on separate port 8081
- PR: #1382
- add cuda-samples test
- PR: #1381
- add health-checker upgrade
- PR: #1384
- [slurmcluster] maxUnavailable for workers and support pre install images
- PR: #1391
- add comment reaction on the check
- PR: #1396
- [CD] Support for VMAgent external labels
- PR: #1403
- Add JailedConfig CR
- PR: #1287
- Undrain nodes after user problems + rewrite passive checks in Python
- PR: #1392
- Collecting health_checker_cmd_stdout logs
- PR: #1410
- Standalone Slurm exporter - No Kubernetes needed
- PR: #1415
- [SlurmCluster CRD] rewrite logic of printcolumn
- PR: #1434
- Clean old logs from /opt/soperator-outputs
- PR: #1431
- Add slurm scripts for managing per-job tmpfs directories
- PR: #1432
- move IB checks to health-checker
- PR: #1438
- add aggregation to the JailedConfig
- PR: #1435
- Log one-line JSON outputs for health-checker + rewrite in Python
- PR: #1447
- Chessprofessor/each worker jobs
- PR: #1443
- #1312 Add ib-write-bw/lat cpu checks
- PR: #1323
- add activechecks PrinterColumns
- PR: #1497
- add priorityclassname for components of slurm cluster
- PR: #1519
- Bump health checker 1.0.0-150
- PR: #1526
- Slurm scripts drop cache shmem
- PR: #1554
- remove apparmor deny for libEGL_ for running docker active check image
- PR: #1574
- bump nc-health-checker_1.0.0-151.250904
- PR: #1573
- [slurm login] add SshdServiceLoadBalancerSourceRanges to login node
- PR: #1581
- Fix passive check filtering
- PR: #1582
Fixes
- [nccl-debug] Use chmod instead of umask
- PR: #1345
- remove size from controller spec
- PR: #1356
- fix cache-sync-timeout to k8s default
- PR: #1367
- Don't plan eachWorkerJobArray active checks on bad nodes
- PR: #1378
- remove validation tool init container #1361
- PR: #1372
- [nodeTopology] re-generate CM node topology if cm deleted.
- PR: #1383
- SLURM exporter: use env configuration and improve docs
- PR: #1394
- [worker topology controller] initial cm topology until asts not found
- PR: #1401
- remove deprecated fields
- PR: #1407
- fix priorityclass name for controller
- PR: #1411
- fix issue with metadata.resourceVersion: Invalid value: 0x0: must be specified for an Patch in ASTS
- PR: #1420
- set default values as a default in CRD #1421
- PR: #1422
- Pre-release fixes 1.22.0
- PR: #1423
- Fix pagination issue in cache
- PR: #1433
- [soperatorchecks] change drain reason for maintenance
- PR: #1430
- Disable acctg by default
- PR: #1442
- add aggregation to the JailedConfig
- PR: #1435
- Fix error with read-only workdir in dcgmi_diag_r1 health-checker
- PR: #1450
- [sconfigcontroller] fix reconcile jailedconfig
- PR: #1455
- Fall back from `unix.renameat2` to `os.rename` when `renameat2` is not supported
- PR: #1452
- fix getting error when [user_problem] reason #1468
- PR: #1474
- enable nodeLogs by default
- PR: #1477
- fix preemptionPolicy for controller
- PR: #1495
- fix patch cm with empty labels
- PR: #1508
- add resources values for spo
- PR: #1513
- Move Healthchecker parts to optimize for build time
- PR: #1532
- Cherry-pick active checks in Helm
- PR: #1529
- [nccld-plugin] Make user responsible for correct rights of the output directory
- PR: #1541
- Add fixes for activechecks in 1.22
- PR: #1542
- Slurm scripts drop cache shmem
- PR: #1554
- DCGM Exporter fix for toolkit validation
- PR: #1583
- Add wait after users creation + Split all-reduce-perf + Fix dcgmi_diag_r1
- PR: #1572
- [TopologyController] add EnsureWorkerTopologyConfigMap to check existence of JailedConfig
- PR: #1580
- Fix passive check filtering
- PR: #1582
- change k8up-cleanup image
- PR: #1597
- Extract Enroot's config paths to config dir
- PR: #1599
Dependencies
- Bump docker/login-action from 3.4.0 to 3.5.0
- PR: #1370
- Bump github.com/getkin/kin-openapi from 0.122.0 to 0.131.0
- PR: #1387
- Bump golang.org/x/oauth2 from 0.24.0 to 0.27.0
- PR: #1386
- Bump golang.org/x/net from 0.37.0 to 0.38.0
- PR: #1388
- Bump sigs.k8s.io/yaml from 1.4.0 to 1.6.0
- PR: #1338
- Bump actions/checkout from 4.2.2 to 5.0.0
- PR: #1408
- Bump actions/download-artifact from 4 to 5
- PR: #1389
Docs
- SLURM exporter: use env configuration and improve docs
- PR: #1394
Other
- Merge dev to main
- PR: #1332
- Set sconfigcontroller UID and GID from Helm values
- PR: #1327
- DCGM exporter Helm chart: support metricRelabelings
- PR: #1425
- [nccl-debug] Don't try chmodding files and directories without a need
- PR: #1522
- Support passing extraArgs for node-exporter
- PR: #1534
- Add reason label to slurm_node_info metric for better observability
- PR: #1538
- [cherry-pick] Add node unavailability and draining duration metrics
- PR: #1561
- Support DCGM Exporter on driverful
- PR: #1578
- Change DCGM exporter image version
- PR: #1595
Contributors:
@theyoprst, @mcheshkov, @dstaroff, @Uburro, @asteny, @itechdima, @rdjjke, @ChessProfessor, @dependabot[bot], @ali-sattari, @mateusclira-nv, @dnugmanov, @github-actions[bot]
| Categorized PRs | Uncategorized PRs | Commits | Lines added | Lines deleted |
|---|---|---|---|---|
| 5613 | 593 | 250 | 9661 | 5211 |