argonne-lcf/checkpoint_restart

Checkpoint / Restart tests on Exascale computing systems

For questions, please contact: Huihuo Zheng huihuo.zheng@anl.gov

Exascale computing systems often experience instabilities that can cause job terminations before completion.

To ensure large-scale simulations can continue efficiently, checkpoint/restart mechanisms are essential.

This repository provides:

  • Simple programs that simulate common job execution issues: (1) hanging, (2) mid-run failure, and (3) successful completion.
  • Example submission scripts that automatically detect failures and restart jobs on healthy nodes.

The key idea is to over-allocate nodes, allowing jobs to be restarted on a healthy subset of nodes if a failure occurs.
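The restart logic described above can be sketched in Python as follows; `launch_job` and `select_healthy_nodes` are hypothetical stand-ins for the actual mpiexec invocation and get_healthy_nodes.sh, not functions provided by this repository:

```python
def run_with_restart(launch_job, select_healthy_nodes, max_restarts=3):
    """Relaunch a job on a freshly selected healthy node subset until
    it succeeds.

    `launch_job(nodes)` runs the job on `nodes` and returns its exit
    code; `select_healthy_nodes()` returns the current healthy subset
    of the over-allocated nodes.  Both are illustrative placeholders.
    """
    for attempt in range(max_restarts + 1):
        nodes = select_healthy_nodes()   # drop unhealthy nodes each time
        if launch_job(nodes) == 0:
            return attempt               # number of restarts that were needed
    raise RuntimeError(f"job still failing after {max_restarts} restarts")
```

In a real submission script this loop would live in shell, with `get_healthy_nodes.sh` regenerating the nodefile before each relaunch.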


Install the package

git clone https://github.com/argonne-lcf/checkpoint_restart
cd checkpoint_restart
pip install -e .

This will install the check_hang.py, check_nan.py, and get_healthy_nodes.sh scripts into your environment.

Useful Scripts

This repository includes several scripts to help manage and monitor jobs.

  • check_hang.py: Monitors output files for updates and kills the job if they stop changing for longer than a specified timeout. This is useful for detecting hung processes.

    check_hang.py --timeout 600 --check 10 --outputs chkpt/latest --kill-command "pkill -u $USER mpiexec"

    Arguments:

    • --timeout: Seconds of inactivity after which the job will be killed (default: 300).
    • --check: Seconds between file-activity checks (default: 5).
    • --kill-command: Shell command to terminate the job (default: pkill -u $USER mpiexec).
    • --outputs: Colon-separated list of output files to watch (default: chkpt/latest).
    • --grace: Seconds to wait after sending the kill command before exiting (default: 10).
    • --dry-run: If set, do not actually run the kill command—only log the action.
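The core detection loop behind this kind of monitoring can be sketched as follows; `watch_for_hang` and its arguments are illustrative names mirroring the flags above, not the script's actual internals:

```python
import os
import time

def watch_for_hang(path, timeout, check, on_hang):
    """Invoke `on_hang` if `path` stops being modified for `timeout`
    seconds, polling every `check` seconds.  A simplified sketch of
    the check_hang.py idea, not its actual code.
    """
    last_change = time.time()
    last_mtime = None
    while True:
        try:
            mtime = os.path.getmtime(path)
        except OSError:            # file may not exist yet
            mtime = None
        if mtime != last_mtime:    # file changed: reset the inactivity clock
            last_mtime = mtime
            last_change = time.time()
        if time.time() - last_change > timeout:
            on_hang()              # e.g. run the --kill-command
            return
        time.sleep(check)
```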
  • check_nan.py: Monitors text output files for NaN or Inf values and terminates the job if they are found. This is useful for catching numerical stability issues.

    check_nan.py --outputs "logs/*.out" --check 15 --kill-command "scancel $SLURM_JOB_ID"

    Arguments:

    • --outputs: Glob pattern for files to watch.
    • --recursive: Enable recursive globbing.
    • --check: Polling interval in seconds (default: 15).
    • --timeout: Exit with code 0 if no NaN/Inf found after this many seconds (0 disables timeout).
    • --include-inf: Also treat 'inf' tokens as fatal.
    • --pid: If set, send a signal to this PID on detection.
    • --signal: Signal to send when using --pid (default: TERM).
    • --grace: Seconds to wait before escalating to SIGKILL if --pid is used (default: 15).
    • --kill-command: Arbitrary shell command to run on detection.
    • --dry-run: Detect and report but do not kill or run commands.
    • --verbose: Print verbose progress messages.
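The detection idea can be sketched as follows; `find_bad_values` is an illustrative simplification, not the script's actual implementation:

```python
import re

# match standalone nan/inf tokens, not substrings like "information"
BAD_TOKEN = re.compile(r"\b(nan|inf)\b", re.IGNORECASE)

def find_bad_values(path, include_inf=False):
    """Return True if the file contains a NaN token (or an Inf token
    when include_inf is set), scanning line by line."""
    with open(path) as f:
        for line in f:
            for tok in BAD_TOKEN.findall(line):
                tok = tok.lower()
                if tok == "nan" or (include_inf and tok == "inf"):
                    return True
    return False
```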
  • get_healthy_nodes.sh: Selects a subset of healthy nodes from a larger allocation, writing them to a new nodefile. This is key to the restart mechanism.

    get_healthy_nodes.sh NODEFILE NUM_NODES_TO_SELECT NEW_NODEFILE
  • utils/flush.sh: A utility to clean up processes on allocated nodes, excluding the head node. This script is not installed via pip.

    PBS_NODEFILE=NODEFILE ./utils/flush.sh

Simulation of job execution: hang, fail, success

The test_pyjob.py script allows you to simulate various job behaviors:

--hang N              # Hang for N seconds
--fail N              # Fail after N seconds
--compute T           # Compute time per iteration
--niters NITERS       # Total number of iterations
--checkpoint PATH     # Checkpoint file path
--checkpoint_time T   # Time to write a single checkpoint

For example, to simulate a failure after 120 seconds:

    python test_pyjob.py --fail 120 --checkpoint ./chkpt --niters 1000

Example submission scripts

System Monitoring

YAML-driven microkernel health checks

Typical usage:

# list active checks from YAML
run_health_checks.py --list

# configure/build microkernels, then run enabled checks
run_health_checks.py --build

# run a subset by group
run_health_checks.py --build --groups injection_bisection,memory

# include checks marked disabled in YAML
run_health_checks.py --build --include-disabled --checks triad,flops

The YAML controls:

  • which microkernels are enabled (enabled: true|false)
  • grouping (group) for selective execution
  • concrete launch command (command) and optional timeout/env
  • build commands (build.configure_command and build.build_command)
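A minimal configuration illustrating these fields might look like the following; the check name, paths, and values here are illustrative assumptions, and the authoritative schema is the repository's own YAML file:

```yaml
checks:
  triad:
    enabled: true                 # enabled: true|false
    group: memory                 # group, for --groups selection
    command: ./bin/triad --size 1G
    timeout: 120                  # optional timeout in seconds
    env:
      OMP_NUM_THREADS: "8"        # optional environment overrides
    build:
      configure_command: cmake -S . -B build
      build_command: cmake --build build --target triad
```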

By default, the YAML keeps MPI/PBS-sensitive checks disabled for local development. Enable them on cluster allocations with:

run_health_checks.py --build --include-disabled --checks simple_injection_bisection,full_injection_bisection,triad,flops,topology

Various simulation examples

  • fail/: job fails after 100 seconds, then restarts
  • hang/: job hangs, is killed, and restarts
  • success/: job runs successfully
  • nan/: NaN appears after a few iterations, then the job restarts

Checkpoint interval optimization utility

  • optimal_checkpointing.py: Determines the optimal computation time between checkpoints for a job with a given node count and checkpointed memory per node.
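For background, the widely used Young/Daly first-order approximation puts the optimal compute time between checkpoints at sqrt(2 × checkpoint cost × MTBF). The sketch below shows that standard formula; it is not necessarily the exact model optimal_checkpointing.py implements:

```python
import math

def optimal_checkpoint_interval(checkpoint_seconds, mtbf_seconds):
    """Young's approximation: tau* = sqrt(2 * C * MTBF), where C is the
    time to write one checkpoint and MTBF is the mean time between
    failures of the allocation."""
    return math.sqrt(2.0 * checkpoint_seconds * mtbf_seconds)

# e.g. a 60 s checkpoint on a system with a 6-hour MTBF:
# optimal_checkpoint_interval(60, 6 * 3600)  # -> roughly 1610 s (~27 min)
```

Note that the optimal interval grows only with the square root of the MTBF, so halving the failure rate does not double the useful compute window.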

About

This repo provides instructions for checkpoint/restart of large-scale simulations on exascale machines.
