This document is based on soperator version 2.0 and describes all Active Checks features available in that release.
Active Checks is a framework for running benchmarks and system tests across Kubernetes and Slurm environments.
The framework helps to:
- Verify node health and overall cluster stability.
- Validate and accept new hardware in data centers.
- Apply and confirm system changes (e.g., package updates).
- Run GPU/CPU, network/IB, filesystem, and other targeted checks.
Active Checks supports two execution modes:
- Kubernetes Jobs – checks running as scheduled or one-time jobs inside Kubernetes.
- Slurm Jobs – checks running on some or all Slurm nodes to validate compute hardware and environment.
In addition, there is an extensive check pipeline that runs follow-up checks on suspicious Slurm nodes after a failed Active Check.
Deployment follows a GitOps model: Flux deploys ActiveCheck CRs to clusters. Starting from a specific version of soperator, all clusters include Active Checks automatically through Helm charts managed by Flux.
Slurm cluster customers may also apply custom Active Check CRs to the cluster, independently of Flux.
This section explains the main building blocks of the Active Checks framework, following the architecture diagram.
- **Flux**
  GitOps tool that deploys the soperator-activechecks Helm chart. Through this chart, Flux ensures that ActiveCheck CRs and controllers are consistently deployed across clusters.
- **Active Check CRs**
  The core custom resources used to define checks.
  - A single CR with a `checkType` field that can be either `k8sJob` or `slurmJob`.
  - The extensive check pipeline uses the same ActiveCheck CRD, configured as a Slurm check that targets suspicious nodes. (See ActiveCheck CRD for details.)
- **Active Check Controller**
  Watches Active Check CRs and creates one Kubernetes CronJob for each of them (1:1 mapping). For details of the reconciliation logic, see Controllers.
- **Active Check Jobs (K8s or Slurm)**
  Standard Kubernetes Job resources created by CronJobs.
  - K8s Active Check Jobs: run directly inside Kubernetes.
  - Slurm Active Check Jobs: submit one Slurm job batch per Active Check Job.
- **Active Check Jobs Controller**
  Watches Active Check Jobs (K8s and Slurm).
  - Collects job results (for Slurm-based jobs, via the Slurm API client).
  - Updates the statuses of the parent Active Check CRs (see Status fields).
  - Executes reactions based on the outcome of the checks (see Reactions fields (spec)).
- **Slurm Jobs**
  Real Slurm jobs created by a Slurm Active Check K8s Job.
  - A batch may consist of one or many Slurm jobs.
  - Job output is written to `/mnt/jail/opt/soperator-outputs/slurm_jobs/`, which is also used by the observability stack.
- **Slurm Workers**
  Compute nodes in the Slurm cluster where Slurm jobs execute. These are the targets for most hardware acceptance and performance tests.
- **Slurm Nodes Controller**
  Watches Slurm nodes and drained workers.
  - Sets a `suspicious` reservation for nodes drained with the `[node_problem]` reason prefix.
  - Sets the unhealthy flag for nodes drained with the `[hardware_problem]` reason prefix.
- **Image Storage**
  Stores container images that provide the environments needed for running checks.
  - K8s Check Job Image — environment for checks executed as Kubernetes Jobs.
  - Slurm Check Job Image — environment for submitting Slurm jobs.
  - Active Checks Image — image used inside Slurm jobs (typically via `srun`) for most checks (not all).
The ActiveCheck resource (slurm.nebius.ai/v1alpha1) defines one health check.
The Active Check Controller ensures each ActiveCheck CR corresponds to exactly one Kubernetes CronJob.
CronJobs create Kubernetes Jobs on schedule, which either run the check directly (K8s mode) or submit a Slurm batch (Slurm mode).
- `spec.name` (string) — Logical name used for generated CronJob/Jobs.
- `spec.slurmClusterRefName` (string) — Name of the SlurmCluster to target.
- `spec.checkType` (enum: `k8sJob` | `slurmJob`) — Selects execution mode.
- `spec.schedule` (string) — Cron schedule in standard Kubernetes Cron format.
- `spec.suspend` (bool) — If `true`, pauses scheduling without deleting resources.
- `spec.activeDeadlineSeconds` (int64) — Timeout for each CronJob run.
- `spec.successfulJobsHistoryLimit` (int32) — How many successful Job objects to retain.
- `spec.failedJobsHistoryLimit` (int32) — How many failed Job objects to retain.
- `spec.runAfterCreation` (bool) — Run once immediately after the CronJob is created.
- `spec.dependsOn` (string[]) — Names of other ActiveChecks (same namespace) that must complete before this one runs.
A check will not run until all its dependencies have reached Complete status.
`spec.k8sJobSpec` (Kubernetes checks)
- `spec.k8sJobSpec.jobContainer` (ContainerSpec) — Container image and command/args for the check.
- `spec.k8sJobSpec.mungeContainer` (ContainerSpec) — Optional Munge sidecar for Slurm auth.
- `spec.k8sJobSpec.scriptRefName` (string) — Name of a `ConfigMap` containing a custom script at key `script.sh`.
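As a sketch, a Kubernetes-mode check could look like the following. The `apiVersion` and field names come from this doc; the metadata, cluster name, image, and command are illustrative placeholders, and the assumption that ContainerSpec exposes `image` and `command` is mine:

```yaml
# Illustrative k8sJob ActiveCheck; all names, the image, and the command are placeholders.
apiVersion: slurm.nebius.ai/v1alpha1
kind: ActiveCheck
metadata:
  name: example-k8s-check
spec:
  name: example-k8s-check
  slurmClusterRefName: soperator        # hypothetical SlurmCluster name
  checkType: k8sJob
  schedule: "0 * * * *"                 # hourly
  activeDeadlineSeconds: 600
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  k8sJobSpec:
    jobContainer:
      image: registry.example.com/k8s-check-job:latest   # placeholder image
      command: ["/bin/sh", "-c", "echo ok"]
```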
`spec.slurmJobSpec` (Slurm checks)
- `spec.slurmJobSpec.jobContainer` (ContainerSpec) — Container image and command/args for Slurm submission.
- `spec.slurmJobSpec.mungeContainer` (ContainerSpec) — Munge sidecar for Slurm auth.
- `spec.slurmJobSpec.sbatchScriptRefName` (string) — Name of a `ConfigMap` containing an sbatch script at key `sbatch.sh`.
- `spec.slurmJobSpec.sbatchScript` (string, multiline) — Inline sbatch script. May contain `#SBATCH` directives and shell logic; can invoke `srun`.
- `spec.slurmJobSpec.eachWorkerJobs` (bool) — Run on each worker using separate Slurm jobs.
- `spec.slurmJobSpec.maxNumberOfJobs` (int64) — Maximum number of simultaneous jobs. If less than the number of workers, only a subset runs. `0` = no limit.
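A Slurm-mode check with an inline sbatch script might then look like this sketch; field names follow the list above, while all names, images, and the script body are illustrative:

```yaml
# Illustrative slurmJob ActiveCheck; names, images, and the script are placeholders.
apiVersion: slurm.nebius.ai/v1alpha1
kind: ActiveCheck
metadata:
  name: example-gpu-check
spec:
  name: example-gpu-check
  slurmClusterRefName: soperator        # hypothetical SlurmCluster name
  checkType: slurmJob
  schedule: "30 3 * * *"                # nightly
  slurmJobSpec:
    jobContainer:
      image: registry.example.com/slurm-check-job:latest   # placeholder
    mungeContainer:
      image: registry.example.com/munge:latest             # placeholder
    eachWorkerJobs: true                # one Slurm job per worker
    maxNumberOfJobs: 0                  # 0 = no limit
    sbatchScript: |
      #!/bin/bash
      #SBATCH --time=00:10:00
      srun nvidia-smi
```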
- `spec.successReactions` (Reactions) — Actions to take when a Slurm run succeeds.
- `spec.failureReactions` (Reactions) — Actions to take when a Slurm run fails.
Reactions supports:
- `drainSlurmNode` — Drain affected Slurm nodes.
- `commentSlurmNode` — Add a failure comment to affected nodes.
- `addReservation` — Create a Slurm reservation named `<prefix>-<nodeName>`.
- `removeReservation` — Remove the reservation named `<prefix>-<nodeName>`.
Reactions are evaluated by the Active Check Jobs Controller after Slurm runs.
Affected nodes are derived from the Slurm job’s node list (GetNodeList()).
Reservations are used to mark suspicious nodes after a failed Active Check and to drive the extensive-check pipeline.
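The reaction fields above could be combined as in this sketch. The exact value shapes (booleans for drain/comment, a prefix string for reservations) are my assumptions for illustration, not taken from this doc:

```yaml
# Illustrative Reactions fragment of an ActiveCheck spec; value shapes are assumed.
spec:
  failureReactions:
    drainSlurmNode: true
    commentSlurmNode: true
    addReservation: suspicious      # reservation becomes suspicious-<nodeName>
  successReactions:
    removeReservation: suspicious   # removes suspicious-<nodeName>
```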
`status.k8sJobsStatus` (Kubernetes checks)
- `lastTransitionTime` (time) — Last status change.
- `lastJobScheduleTime` (time) — Last CronJob schedule event.
- `lastJobSuccessfulTime` (time) — Time of the last successful Job.
- `lastJobName` (string) — Name of the last Job.
- `lastJobStatus` (enum) — One of:
  - `Active` — Job currently running.
  - `Complete` — Job finished successfully.
  - `Failed` — Job failed (reactions do not apply for K8s jobs).
  - `Suspended` — Job was suspended.
  - `Pending` — Job created but not yet started.
  - `Unknown` — State could not be determined.
`status.slurmJobsStatus` (Slurm checks)
- `lastTransitionTime` (time) — Last status change.
- `lastRunId` (string) — Identifier of the last Slurm batch run.
- `lastRunName` (string) — Name of the last Slurm run.
- `lastRunStatus` (enum) — One of:
  - `InProgress` — Run is still ongoing.
  - `Complete` — Run finished successfully (success reactions may apply).
  - `Failed` — Check failed (failure reactions may apply).
  - `Error` — Error in job submission or check implementation.
  - `Cancelled` — Check was cancelled.
- `lastRunFailJobsAndReasons` (array) — List of `{ jobID, reason }` for failed jobs in the last run.
- `lastRunErrorJobsAndReasons` (array) — List of `{ jobID, reason }` for error jobs in the last run.
- `lastRunCancelledJobs` (array) — List of job IDs for cancelled jobs in the last run.
- `lastRunSubmitTime` (time) — Submission time of the last run.
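Put together, the status of a Slurm-mode check after a failed run might read like this invented example (the nesting under `slurmJobsStatus` follows the controller description later in this doc; all values are made up):

```yaml
# Invented status fragment; all values are illustrative.
status:
  slurmJobsStatus:
    lastRunId: "run-0042"
    lastRunStatus: Failed
    lastRunFailJobsAndReasons:
      - jobID: "12345"
        reason: "job exited with non-zero status"
    lastRunSubmitTime: "2025-01-01T03:30:00Z"
```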
Execution depends on spec.checkType.
In both cases, the Active Check Controller creates a CronJob (1:1 with CR) which then spawns Jobs.
- Kubernetes CronJob → schedules the runs.
- Kubernetes Job → executes a single run.
- Images:
  - K8s Check Job Image → used in `k8sJob`.
  - Slurm Check Job Image → used in `slurmJob` for submitting.
  - Active Checks Image → used inside the actual Slurm workload for most checks (not all).
- The CronJob spawns a Kubernetes Job that runs the check inside the cluster.
- Logs are available directly in the Job’s Pod.
- No reactions are applied; only status is updated.
- The CronJob spawns a Kubernetes Job that submits a Slurm batch job.
- A batch may contain one or many Slurm jobs.
- Logs are written under `/mnt/jail/opt/soperator-outputs/slurm_jobs/`.
- After the run completes, the Jobs Controller may apply success/failure reactions (see Reactions fields (spec)); affected nodes come from Slurm’s `GetNodeList()`.
- Default — one Slurm batch per run.
- `eachWorkerJobs` — run once per worker using separate Slurm jobs. The `maxNumberOfJobs` parameter may be used with it to limit the number of jobs (if less than the number of workers, only a subset executes).
There are two partitions by default:
- main — used for clients’ jobs.
- hidden — hidden partition with the same priority as `main`.

Active checks always use the hidden partition for job submission.
If `eachWorkerJobs` is not specified, the job runs in the default submission mode.
The extensive check pipeline is a follow-up flow for suspicious Slurm nodes, implemented using the same ActiveCheck CRD with checkType: slurmJob.
Flow
- Active Check fails → affected nodes are drained with the `[node_problem]` prefix.
- Reservation added → nodes move into the suspicious pool (reservation name uses the configured prefix).
- Extensive check runs → an ActiveCheck CronJob submits a Slurm batch that targets suspicious nodes only.
  - Each suspicious worker gets its own Slurm job in the batch.
  - If there are no suspicious nodes, no extensive-check jobs are created.
- Success → reservations are removed and nodes return to the healthy pool.
- Failure → nodes are drained with the `[hardware_problem]` prefix and marked for replacement by the Slurm Nodes Controller.
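For intuition, the pipeline's node-state transitions correspond roughly to these Slurm CLI operations. This is illustrative only: the controllers act through the Slurm API rather than `scontrol`, and the node name, reason texts, and reservation name are made up:

```shell
# 1. A failed check drains the node with the [node_problem] reason prefix.
scontrol update NodeName=worker-3 State=DRAIN Reason="[node_problem] active check failed"

# 2. The node is reserved as suspicious and undrained so extensive checks can run.
scontrol create reservation ReservationName=suspicious-worker-3 \
  Nodes=worker-3 StartTime=now Duration=UNLIMITED Users=root Flags=maint
scontrol update NodeName=worker-3 State=UNDRAIN

# 3a. Extensive check passes: the reservation is removed.
scontrol delete ReservationName=suspicious-worker-3

# 3b. Extensive check fails: the node is drained for hardware replacement.
scontrol update NodeName=worker-3 State=DRAIN Reason="[hardware_problem] extensive check failed"
```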
The Active Checks framework integrates with the cluster observability stack.
- K8s jobs → logs are available from the Pods of the Kubernetes Jobs.
- Slurm jobs → logs are written under `/mnt/jail/opt/soperator-outputs/slurm_jobs/`.
- Other logs (e.g., passive checks) also exist under the broader `/soperator-outputs/` path, but are out of scope for this doc.
Results and metrics are visualized in Grafana. Operators can quickly see:
- Recent check runs and their outcomes.
- Historical success/failure rates.
- Node-level health across multiple checks.
Active Checks are managed by three controllers. Together they implement a GitOps-friendly flow: ActiveCheck CR → CronJob (1:1) → Jobs → Status & (if Slurm) Reactions.
Purpose
- Watches `ActiveCheck` CRs and reconciles each CR into exactly one Kubernetes CronJob.
- Encodes CronJob settings from the CR (`schedule`, `suspend`, `activeDeadlineSeconds`, history limits, `runAfterCreation`).
- Ensures optional script sources are wired (e.g., inline `slurmJobSpec.sbatchScript` via ConfigMap).
- Adds/removes the `slurm.nebius.ai/activecheck-finalizer` to support safe teardown.
High-level flow
- On create/update:
  - Requeue reconciliation until the Slurm cluster is ready and all checks in the `dependsOn` list have finished successfully.
  - Render and reconcile the CronJob (and ConfigMap if needed).
  - Optionally trigger an immediate run if `runAfterCreation` is set and no prior transition exists.
- On delete:
  - Clean up the CronJob and inline-script ConfigMap (if any).
  - Remove the finalizer.
Purpose
- Observes Jobs created by CronJobs (K8s and Slurm).
- Aggregates results and updates status on the owning CRs (see Status fields).
- Applies reactions for Slurm runs according to `successReactions` / `failureReactions` (see Reactions fields (spec)).
High-level flow
- Map each Kubernetes Job back to its owning `ActiveCheck`.
- Kubernetes mode
  - Compute `lastJobStatus` (`Active`, `Complete`, `Failed`, `Suspended`, `Pending`, `Unknown`) and update `k8sJobsStatus`.
  - No reactions applied.
- Slurm mode
  - Parse Slurm job IDs from Kubernetes Job annotations.
  - Query the Slurm API client for job states.
  - Aggregate results into `slurmJobsStatus` (run ID/name/status, failed/error jobs with reasons, submit time).
  - If terminal:
    - On `Failed` → apply `failureReactions` (e.g., drain/comment, add reservation).
    - On `Complete` → apply `successReactions` (e.g., remove reservation).
  - Requeue while jobs are in progress.
  - Patch Job annotations with a “final state” timestamp to avoid reprocessing.
Error handling & GC (high level)
- Requeues on transient errors (API reads/patches).
- Job history is pruned by the CronJob’s `successfulJobsHistoryLimit` / `failedJobsHistoryLimit`.
Purpose
- Watches Slurm nodes, focusing on drained workers.
- Sets `suspicious` reservations for workers drained with the `[node_problem]` prefix.
- Sets the unhealthy flag on the corresponding Kubernetes nodes for workers drained with the `[hardware_problem]` prefix, to mark them for replacement.
High-level flow (Active Check + extensive check pipeline)
- Periodically list Slurm nodes and filter drained nodes with well-known health check reasons.
- For `[node_problem]` health check failures:
  - If extensive checks are enabled, create a `suspicious-node` reservation and undrain the node so extensive checks can run.
  - If extensive checks are disabled, mark the node unhealthy immediately when node replacement is enabled; otherwise no-op.
- For `[hardware_problem]` failures, set the unhealthy flag on the Kubernetes node to trigger replacement.
- No retries for active checks: if a run fails immediately on creation, dependent checks won’t proceed until manual intervention.
- Job history pruning: limited to the CronJob’s `successfulJobsHistoryLimit` and `failedJobsHistoryLimit`; no long-term archival.
- Multi-node Slurm checks: running checks across multiple nodes per job is not yet supported.
