1.22.0

@github-actions released this 24 Sep 11:53
· 1427 commits to main since this release
7d94720

Changes made since version 1.21.9 prior to version 1.22.0:

🚀 Features

  • [ha controller] add placeholder and single replicas
  • Bump container toolkit and do not install it into the jail
  • Bump slurm 24.11.6
  • Improve job metrics handling in exporter
  • Speed up populate_jail + don't overwrite existing data
  • Implement async metrics collection for SLURM exporter
  • add scrape node conditions and pod termination reasons metrics from ksm
  • add gpu-fryer check
  • Add memory latency and bandwidth Active Checks
  • Add self-monitoring metrics on separate port 8081
  • add cuda-samples test
  • add health-checker upgrade
  • [slurmcluster] maxUnavailable for workers and support pre-install images
  • add comment reaction on the check
  • [CD] Support for VMAgent external labels
  • Add JailedConfig CR
  • Undrain nodes after user problems + rewrite passive checks in Python
  • Collecting health_checker_cmd_stdout logs
  • Standalone Slurm exporter - No Kubernetes needed
  • [SlurmCluster CRD] rewrite logic of printcolumn
  • Clean old logs from /opt/soperator-outputs
  • Add slurm scripts for managing per-job tmpfs directories
  • move IB checks to health-checker
  • add aggregation to the JailedConfig
  • Log one-line JSON outputs for health-checker + rewrite in Python
  • Chessprofessor/each worker jobs
  • #1312 Add ib-write-bw/lat cpu checks
  • add activechecks PrinterColumns
  • add priorityclassname for components of Slurm cluster
  • Bump health checker 1.0.0-150
  • Slurm scripts drop cache shmem
  • remove AppArmor deny for libEGL_ for running docker active check image
  • bump nc-health-checker_1.0.0-151.250904
  • [slurm login] add SshdServiceLoadBalancerSourceRanges to login node
  • Fix passive check filtering

🐛 Fixes

  • [nccl-debug] Use chmod instead of umask
  • remove size from controller spec
  • fix cache-sync-timeout to k8s default
  • Don't plan eachWorkerJobArray active checks on bad nodes
  • remove validation tool init container #1361
  • [nodeTopology] re-generate CM node topology if cm deleted.
  • SLURM exporter: use env configuration and improve docs
  • [worker topology controller] initial cm topology until asts not found
  • removing deprecated fields
  • fix priorityclass name for controller
  • fix issue with metadata.resourceVersion: Invalid value: 0x0: must be specified for an Patch in ASTS
  • set default values as a default in CRD #1421
  • Pre-release fixes 1.22.0
  • Fix pagination issue in cache
  • [soperatorchecks] change drain reason for maintenance
  • Disable acctg by default
  • add aggregation to the JailedConfig
  • Fix error with read-only workdir in dcgmi_diag_r1 health-checker
  • [sconfigcontroller] fix reconcile jailedconfig
  • Fallback on unix.renameat2 to os.rename when renameat2 is not supported
  • fix getting error when [user_problem] reason #1468
  • enable nodeLogs by default
  • fix preemptionPolicy for controller
  • fix patch cm with empty labels
  • add resources values for spo
  • Move Healthchecker parts to optimize for build time
  • Cherry-pick active checks in Helm
  • [nccld-plugin] Make user responsible for correct rights of the output directory
  • Add fixes for activechecks in 1.22
  • Slurm scripts drop cache shmem
  • DCGM Exporter fix for toolkit validation
  • Add wait after users creation + Split all-reduce-perf + Fix dcgmi_diag_r1
  • [TopologyController] add EnsureWorkerTopologyConfigMap to check existence of JailedConfig
  • Fix passive check filtering
  • change k8up-cleanup image
  • Extract Enroot's config paths to config dir

📦 Dependencies

  • Bump docker/login-action from 3.4.0 to 3.5.0
  • Bump github.com/getkin/kin-openapi from 0.122.0 to 0.131.0
  • Bump golang.org/x/oauth2 from 0.24.0 to 0.27.0
  • Bump golang.org/x/net from 0.37.0 to 0.38.0
  • Bump sigs.k8s.io/yaml from 1.4.0 to 1.6.0
  • Bump actions/checkout from 4.2.2 to 5.0.0
  • Bump actions/download-artifact from 4 to 5

📔 Docs

  • SLURM exporter: use env configuration and improve docs

Other

  • Merge dev to main
  • Set sconfigcontroller UID and GID from Helm values
  • DCGM exporter Helm chart: support metricRelabelings
  • [nccl-debug] Don't try chmodding files and directories without a need
  • Support passing extraArgs for node-exporter
  • Add reason label to slurm_node_info metric for better observability
  • [cherry-pick] Add node unavailability and draining duration metrics
  • Support DCGM Exporter on driverful
  • Change DCGM exporter image version

Contributors:
@theyoprst, @mcheshkov, @dstaroff, @Uburro, @asteny, @itechdima, @rdjjke, @ChessProfessor, @dependabot[bot], @ali-sattari, @mateusclira-nv, @dnugmanov, @github-actions[bot]

📝 Categorized PRs 📂 Uncategorized PRs 📥 Commits ➕ Lines added ➖ Lines deleted
5613 593 250 9661 5211