
Active Checks – Health and system checks framework

This doc is based on soperator version 2.0 and describes all the features available in that release.

Overview

Active Checks is a framework for running benchmarks and system tests across Kubernetes and Slurm environments.

The framework helps to:

  • Verify node health and overall cluster stability.
  • Validate and accept new hardware in data centers.
  • Apply and confirm system changes (e.g., package updates).
  • Run GPU/CPU, network/IB, filesystem, and other targeted checks.

Active Checks supports two execution modes:

  • Kubernetes Jobs – checks running as scheduled or one-time jobs inside Kubernetes.
  • Slurm Jobs – checks running on some or all Slurm nodes to validate compute hardware and environment.

In addition, there is an extensive check pipeline that runs follow-up checks on suspicious Slurm nodes after a failed Active Check.

Deployment follows a GitOps model: Flux deploys ActiveCheck CRs to clusters. Starting from a specific version of soperator, all clusters include Active Checks automatically through Helm charts managed by Flux.

Slurm cluster customers may also apply custom Active Check CRs to the cluster, independently of Flux.

Architecture diagram

Components

This section explains the main building blocks of the Active Checks framework, following the architecture diagram.

  • Flux
    GitOps tool that deploys the soperator-activechecks Helm chart.
    Through this chart, Flux ensures that ActiveCheck CRs and controllers are consistently deployed across clusters.

  • Active Check CRs
    The core custom resources used to define checks.

    • A single CR with a checkType field that can be either k8sJob or slurmJob.
    • The extensive check pipeline uses the same ActiveCheck CRD, configured as a Slurm check that targets suspicious nodes. (See ActiveCheck CRD for details.)
  • Active Check Controller
Watches Active Check CRs and creates one Kubernetes CronJob for each of them (1:1 mapping).
    For details of reconciliation logic, see Controllers.

  • Active Check Jobs (K8s or Slurm)
    Standard Kubernetes Job resources created by CronJobs.

    • K8s Active Check Jobs: run directly inside Kubernetes.
    • Slurm Active Check Jobs: submit one Slurm Job batch per Active Check Job.
  • Active Check Jobs Controller
    Watches Active Check Jobs (K8s and Slurm).

  • Slurm Jobs
    Real Slurm jobs created by a Slurm Active Check K8s Job.

    • A batch may consist of one or many Slurm jobs.
    • Job output is written to /mnt/jail/opt/soperator-outputs/slurm_jobs/, which is also used by the observability stack.
  • Slurm Workers
    Compute nodes in the Slurm cluster where Slurm jobs execute.
    These are the targets for most hardware acceptance and performance tests.

  • Slurm Nodes Controller
    Watches Slurm nodes and drained workers.

    • Sets a suspicious reservation for nodes drained with the [node_problem] reason prefix.
    • Sets the unhealthy flag for nodes drained with the [hardware_problem] reason prefix.
  • Image Storage
    Stores container images that provide the environments needed for running checks.

    • K8s Check Job Image – environment for checks executed as Kubernetes Jobs.
    • Slurm Check Job Image – environment for submitting Slurm jobs.
    • Active Checks Image – image used inside Slurm jobs (typically via srun) for most checks (not all).

ActiveCheck CRD

The ActiveCheck resource (slurm.nebius.ai/v1alpha1) defines one health check.
The Active Check Controller ensures each ActiveCheck CR corresponds to exactly one Kubernetes CronJob.
CronJobs create Kubernetes Jobs on schedule, which either run the check directly (K8s mode) or submit a Slurm batch (Slurm mode).

Top-level spec fields

  • spec.name (string) — Logical name used for generated CronJob/Jobs.
  • spec.slurmClusterRefName (string) — Name of the SlurmCluster to target.
  • spec.checkType (enum: k8sJob|slurmJob) — Selects execution mode.
  • spec.schedule (string) — Cron schedule in standard Kubernetes Cron format.
  • spec.suspend (bool) — If true, pauses scheduling without deleting resources.
  • spec.activeDeadlineSeconds (int64) — Timeout for each CronJob run.
  • spec.successfulJobsHistoryLimit (int32) — How many successful Job objects to retain.
  • spec.failedJobsHistoryLimit (int32) — How many failed Job objects to retain.
  • spec.runAfterCreation (bool) — Run once immediately after the CronJob is created.
  • spec.dependsOn (string[]) — Names of other ActiveChecks (same namespace) that must complete before this one runs.
    A check will not run until all its dependencies have reached Complete status.
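Taken together, the top-level fields above might be combined as in the following sketch of an ActiveCheck manifest. This is an illustration only: the names, namespace, schedule, and dependency are hypothetical, and the exact schema may vary between soperator releases.

```yaml
apiVersion: slurm.nebius.ai/v1alpha1
kind: ActiveCheck
metadata:
  name: gpu-smoke-check            # hypothetical check name
  namespace: soperator             # hypothetical namespace
spec:
  name: gpu-smoke-check
  slurmClusterRefName: my-slurm-cluster   # assumed SlurmCluster name
  checkType: k8sJob                # or slurmJob
  schedule: "0 */6 * * *"          # every 6 hours, standard Cron format
  suspend: false
  activeDeadlineSeconds: 1800
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  runAfterCreation: true           # run once immediately after creation
  dependsOn:
    - node-basic-check             # hypothetical dependency in the same namespace
```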

Mode-specific options

spec.k8sJobSpec (Kubernetes checks)

  • spec.k8sJobSpec.jobContainer (ContainerSpec) — Container image and command/args for the check.
  • spec.k8sJobSpec.mungeContainer (ContainerSpec) — Optional Munge sidecar for Slurm auth.
  • spec.k8sJobSpec.scriptRefName (string) — Name of a ConfigMap containing a custom script at key script.sh.
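A minimal k8sJobSpec sketch, under the same caveats (image names and the ConfigMap name are hypothetical):

```yaml
spec:
  checkType: k8sJob
  k8sJobSpec:
    jobContainer:
      image: registry.example.com/k8s-check-job:latest  # hypothetical image
      command: ["/bin/sh", "-c"]
      args: ["/opt/checks/script.sh"]
    scriptRefName: my-check-script   # ConfigMap that stores the script at key script.sh
```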

spec.slurmJobSpec (Slurm checks)

  • spec.slurmJobSpec.jobContainer (ContainerSpec) — Container image and command/args for Slurm submission.
  • spec.slurmJobSpec.mungeContainer (ContainerSpec) — Munge sidecar for Slurm auth.
  • spec.slurmJobSpec.sbatchScriptRefName (string) — Name of a ConfigMap containing an sbatch script at key sbatch.sh.
  • spec.slurmJobSpec.sbatchScript (string, multiline) — Inline sbatch script. May contain #SBATCH directives and shell logic; can invoke srun.
  • spec.slurmJobSpec.eachWorkerJobs (bool) — Run on each worker using separate Slurm jobs.
  • spec.slurmJobSpec.maxNumberOfJobs (int64) — Maximum number of simultaneous jobs. If less than the number of workers, only a subset runs. 0 = no limit.
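A slurmJobSpec sketch with an inline sbatch script, again with hypothetical image names and script content:

```yaml
spec:
  checkType: slurmJob
  slurmJobSpec:
    jobContainer:
      image: registry.example.com/slurm-check-job:latest  # hypothetical submitter image
    mungeContainer:
      image: registry.example.com/munge:latest            # hypothetical Munge sidecar
    sbatchScript: |
      #!/bin/bash
      #SBATCH --job-name=gpu-burn-check
      #SBATCH --time=00:10:00
      srun /opt/checks/run.sh    # hypothetical check entry point
```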

Reactions fields (spec)

  • spec.successReactions (Reactions) — Actions to take when a Slurm run succeeds.
  • spec.failureReactions (Reactions) — Actions to take when a Slurm run fails.

Reactions supports:

  • drainSlurmNode — Drain affected Slurm nodes.
  • commentSlurmNode — Add a failure comment to affected nodes.
  • addReservation — Create a Slurm reservation with name <prefix>-<nodeName>.
  • removeReservation — Remove a reservation with name <prefix>-<nodeName>.
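As an illustration, reactions could be declared roughly as follows. The field shapes shown here are assumptions, not a verified schema; only the reaction names and the `<prefix>-<nodeName>` reservation naming come from this doc.

```yaml
spec:
  failureReactions:
    drainSlurmNode: true        # drain affected nodes
    commentSlurmNode: true      # add a failure comment
    addReservation: suspicious  # creates reservation suspicious-<nodeName>
  successReactions:
    removeReservation: suspicious  # removes reservation suspicious-<nodeName>
```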

Reactions are evaluated by the Active Check Jobs Controller after Slurm runs.
Affected nodes are derived from the Slurm job’s node list (GetNodeList()).

Reservations are used to mark suspicious nodes after a failed Active Check and to drive the extensive-check pipeline.

status fields

status.k8sJobsStatus

  • lastTransitionTime (time) — Last status change.
  • lastJobScheduleTime (time) — Last CronJob schedule event.
  • lastJobSuccessfulTime (time) — Time of last successful Job.
  • lastJobName (string) — Name of the last Job.
  • lastJobStatus (enum) — One of:
    • Active — Job currently running.
    • Complete — Job finished successfully.
    • Failed — Job failed (reactions do not apply for K8s jobs).
    • Suspended — Job was suspended.
    • Pending — Job created but not yet started.
    • Unknown — State could not be determined.

status.slurmJobsStatus

  • lastTransitionTime (time) — Last status change.
  • lastRunId (string) — Identifier of the last Slurm batch run.
  • lastRunName (string) — Name of the last Slurm run.
  • lastRunStatus (enum) — One of:
    • InProgress — Run is still ongoing.
    • Complete — Run finished successfully (success reactions may apply).
    • Failed — Check failed (failure reactions may apply).
    • Error — Error in job submission or check implementation.
    • Cancelled — Check was cancelled.
  • lastRunFailJobsAndReasons (array) — List of { jobID, reason } for failed jobs in the last run.
  • lastRunErrorJobsAndReasons (array) — List of { jobID, reason } for error jobs in the last run.
  • lastRunCancelledJobs (array) — List of job IDs for cancelled jobs in the last run.
  • lastRunSubmitTime (time) — Submission time of the last run.
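For orientation, a slurmJobsStatus block as it might appear in `kubectl get activecheck -o yaml` output (all values are invented for illustration):

```yaml
status:
  slurmJobsStatus:
    lastTransitionTime: "2024-05-01T12:07:00Z"
    lastRunId: "a1b2c3"                       # hypothetical run identifier
    lastRunName: gpu-smoke-check-20240501
    lastRunStatus: Failed
    lastRunFailJobsAndReasons:
      - jobID: "12345"
        reason: "NodeFail"                    # hypothetical reason string
    lastRunSubmitTime: "2024-05-01T11:55:00Z"
```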

Execution Modes

Execution depends on spec.checkType.
In both cases, the Active Check Controller creates a CronJob (1:1 with CR) which then spawns Jobs.

Shared concepts

  • Kubernetes CronJob → schedules the runs.
  • Kubernetes Job → executes a single run.
  • Images:
    • K8s Check Job Image → used in k8sJob.
    • Slurm Check Job Image → used in slurmJob for submitting.
    • Active Checks Image → used inside the actual Slurm workload for most checks (not all).

Kubernetes Jobs (k8sJob)

  • The CronJob spawns a Kubernetes Job that runs the check inside the cluster.
  • Logs are available directly in the Job’s Pod.
  • No reactions are applied; only status is updated.

Slurm Jobs (slurmJob)

  • The CronJob spawns a Kubernetes Job that submits a Slurm batch job.
  • Batch may contain one or many Slurm jobs.
  • Logs are written under /mnt/jail/opt/soperator-outputs/slurm_jobs/.
  • After the run completes, the Jobs Controller may apply success/failure reactions (see Reactions fields (spec)); affected nodes come from Slurm’s GetNodeList().

Slurm job submission modes

  • Default — One Slurm batch per run.
  • eachWorkerJobs — Run once per worker using separate Slurm jobs. The maxNumberOfJobs parameter can be combined with it to limit the number of jobs (if it is less than the number of workers, only a subset executes).
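A snippet combining the two options (values are illustrative):

```yaml
spec:
  slurmJobSpec:
    eachWorkerJobs: true   # one Slurm job per worker
    maxNumberOfJobs: 16    # at most 16 simultaneous jobs; 0 = no limit
```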

Slurm partitions (defaults)

There are two partitions by default:

  • main — used for clients’ jobs.
  • hidden — hidden partition with the same priority as main.

Active checks always use the hidden partition for job submission.

If eachWorkerJobs is not specified, the job runs in the default submission mode.

Extensive check pipeline

The extensive check pipeline is a follow-up flow for suspicious Slurm nodes, implemented using the same ActiveCheck CRD with checkType: slurmJob.

Flow

  1. Active Check fails → affected nodes are drained with the [node_problem] prefix.
  2. Reservation added → nodes move into the suspicious pool (reservation name uses the configured prefix).
  3. Extensive check runs → an ActiveCheck CronJob submits a Slurm batch that targets suspicious nodes only.
    • Each suspicious worker gets its own Slurm job in the batch.
    • If there are no suspicious nodes, no extensive-check jobs are created.
  4. Success → reservations are removed and nodes return to the healthy pool.
  5. Failure → nodes are drained with the [hardware_problem] prefix and marked for replacement by the Slurm Nodes Controller.
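The flow above can be sketched as an ActiveCheck that targets the suspicious pool. This is a loose illustration of how the pipeline maps onto the CRD, not the actual CR shipped via Flux; the reaction field shapes are assumptions.

```yaml
# Hypothetical extensive-check configuration
spec:
  checkType: slurmJob
  slurmJobSpec:
    eachWorkerJobs: true            # one Slurm job per suspicious worker (step 3)
  successReactions:
    removeReservation: suspicious   # step 4: nodes return to the healthy pool
  failureReactions:
    drainSlurmNode: true            # step 5: drained with the [hardware_problem] prefix
```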

Observability

The Active Checks framework integrates with the cluster observability stack.

Logging

  • K8s jobs → logs are available from the Pods of the Kubernetes Jobs.
  • Slurm jobs → logs are written under /mnt/jail/opt/soperator-outputs/slurm_jobs/.
  • Other logs (e.g., passive checks) also exist under the broader /soperator-outputs/ path, but are out of scope for this doc.

Dashboards

Results and metrics are visualized in Grafana. Operators can quickly see:

  • Recent check runs and their outcomes.
  • Historical success/failure rates.
  • Node-level health across multiple checks.

Controllers

Active Checks are managed by three controllers. Together they implement a GitOps-friendly flow: ActiveCheck CR → CronJob (1:1) → Jobs → Status & (if Slurm) Reactions.

1. Active Check Controller

Purpose

  • Watches ActiveCheck CRs and reconciles each CR into exactly one Kubernetes CronJob.
  • Encodes CronJob settings from the CR (schedule, suspend, activeDeadlineSeconds, history limits, runAfterCreation).
  • Ensures optional script sources are wired (e.g., inline slurmJobSpec.sbatchScript via ConfigMap).
  • Adds/removes the slurm.nebius.ai/activecheck-finalizer to support safe teardown.

High-level flow

  1. On create/update:
    • Requeues reconciliation until the Slurm cluster is ready and all checks in the dependsOn list have finished successfully.
    • Renders and reconciles the CronJob (and a ConfigMap if needed).
    • Optionally triggers an immediate run if runAfterCreation is set and no prior transition exists.
  2. On delete:
    • Clean up the CronJob and inline-script ConfigMap (if any).
    • Remove the finalizer.

2. Active Check Jobs Controller

Purpose

  • Observes Jobs created by CronJobs (K8s and Slurm).
  • Aggregates results and updates status on the owning CRs (see Status fields).
  • Applies reactions for Slurm runs according to successReactions / failureReactions (see Reactions fields (spec)).

High-level flow

  1. Map each Kubernetes Job back to its owning ActiveCheck.
  2. Kubernetes mode
    • Compute lastJobStatus (Active, Complete, Failed, Suspended, Pending, Unknown) and update k8sJobsStatus.
    • No reactions applied.
  3. Slurm mode
    • Parse Slurm job IDs from Kubernetes Job annotations.
    • Query the Slurm API client for job states.
    • Aggregate results into slurmJobsStatus (run ID/name/status, failed/error jobs with reasons, submit time).
    • If terminal:
      • On Failed → apply failureReactions (e.g., drain/comment, add reservation).
      • On Complete → apply successReactions (e.g., remove reservation).
    • Requeue while jobs are in progress.
  4. Patch Job annotations with a “final state” timestamp to avoid reprocessing.

Error handling & GC (high level)

  • Requeues on transient errors (API reads/patches).
  • Job history is pruned by CronJob’s successfulJobsHistoryLimit / failedJobsHistoryLimit.

3. Slurm Nodes Controller

Purpose

  • Watches Slurm nodes, focusing on drained workers.
  • Sets suspicious reservations for workers drained with the [node_problem] prefix.
  • Sets the unhealthy flag on the corresponding Kubernetes nodes for workers drained with the [hardware_problem] prefix to mark them for replacement.

High-level flow (Active Check + extensive check pipeline)

  1. Periodically list Slurm nodes and filter drained nodes with well-known health check reasons.
  2. For [node_problem] health check failures:
    • If extensive checks are enabled, create a suspicious-node reservation and undrain the node so extensive checks can run.
    • If extensive checks are disabled, mark the node unhealthy immediately when node replacement is enabled; otherwise no-op.
  3. For [hardware_problem] failures, set the unhealthy flag on the Kubernetes node to trigger replacement.

Roadmap & Limitations

Current limitations

  • No retries for active checks: if a run fails immediately on creation, dependent checks won’t proceed until manual intervention.
  • Job history pruning: limited to CronJob’s successfulJobsHistoryLimit and failedJobsHistoryLimit; no long-term archival.

Planned improvements

  • Multi-node Slurm checks: support for running checks across multiple nodes per job (not yet available).