ChReK: Checkpoint/Restore in Kubernetes

⚠️ Experimental Feature: ChReK is currently in beta/preview. It requires privileged mode for restore operations, which may not be suitable for all production environments. See Limitations for details.

ChReK (Checkpoint/Restore in Kubernetes) is experimental infrastructure for fast-starting GPU applications using CRIU (Checkpoint/Restore In Userspace). It reduces cold-start times for large models from minutes to seconds by capturing initialized application state and restoring it on demand.

What is ChReK?

ChReK provides:

Fast cold starts: Restore GPU-accelerated applications in seconds instead of minutes
CUDA state preservation: Checkpoint and restore GPU memory and CUDA contexts
Kubernetes-native: Integrates seamlessly with Kubernetes primitives
Storage flexibility: PVC-based storage (S3/OCI planned for future releases)
Namespace isolation: Each namespace gets its own checkpoint infrastructure

Use Cases

1. With NVIDIA Dynamo Platform (Recommended)

Use ChReK as part of the Dynamo platform for automatic checkpoint management:

Automatic checkpoint creation and lifecycle management
Seamless integration with DynamoGraphDeployment CRDs
Built-in autoscaling with fast restore

📖 Read the Dynamo Integration Guide →

2. Standalone (Without Dynamo)

Use ChReK independently in your own Kubernetes applications:

Manual checkpoint job creation
Build your own restore-enabled container images
Full control over checkpoint lifecycle

📖 Read the Standalone Usage Guide →

Architecture

ChReK consists of two main components:

1. ChReK Helm Chart

Deploys the checkpoint/restore infrastructure:

DaemonSet: Runs on GPU nodes to perform CRIU checkpoint operations
PVC: Stores checkpoint data (rootfs diffs, CUDA memory state)
RBAC: Namespace-scoped or cluster-wide permissions
Seccomp Profile: Security policies for CRIU syscalls

2. Smart Entrypoint

A wrapper script that chooses between two startup paths:

Cold start: Normal application startup (when no checkpoint exists)
Restore: CRIU restore from checkpoint (when a checkpoint is available)
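
The decision between the two paths can be sketched roughly as follows. This is a minimal illustration, not the actual ChReK entrypoint: the function names, the checkpoint directory argument, and the rule "a checkpoint exists if the directory contains CRIU image files (*.img)" are all assumptions made for the example. The criu flags used (--images-dir, --tcp-close) are standard CRIU restore options, but a real GPU restore is likely to need additional options that are not shown here.

```shell
#!/bin/sh
# Hypothetical sketch of a restore-or-cold-start wrapper (not ChReK's real script).

# Print "restore" if the given directory holds CRIU image files, else "cold-start".
# Assumption: a usable checkpoint is a directory containing at least one *.img file.
choose_start_mode() {
    checkpoint_dir="$1"
    if [ -d "$checkpoint_dir" ] && ls "$checkpoint_dir"/*.img >/dev/null 2>&1; then
        echo "restore"
    else
        echo "cold-start"
    fi
}

# Start the application: restore it with CRIU when a checkpoint is present,
# otherwise run the given command as a normal cold start.
# Usage: start_app <checkpoint-dir> <command> [args...]
start_app() {
    checkpoint_dir="$1"
    shift
    case "$(choose_start_mode "$checkpoint_dir")" in
        restore)
            # Established TCP connections are restored in the closed state.
            exec criu restore --images-dir "$checkpoint_dir" --tcp-close
            ;;
        *)
            exec "$@"
            ;;
    esac
}
```

A container image could then call something like start_app /checkpoints/my-model followed by its usual launch command as its entrypoint; the path is purely illustrative.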

Quick Start

Install ChReK Infrastructure

helm install chrek nvidia/chrek \
  --namespace my-team \
  --create-namespace \
  --set storage.pvc.size=100Gi

Choose Your Integration Path

Using Dynamo Platform? → Follow the Dynamo Integration Guide
Using standalone? → Follow the Standalone Usage Guide

Key Features

✅ Currently Supported

✅ vLLM backend only (SGLang and TensorRT-LLM planned)
✅ Single-node, single-GPU checkpoints
✅ PVC storage backend (RWX for multi-node)
✅ CUDA checkpoint/restore
✅ PyTorch distributed state (with GLOO_SOCKET_IFNAME=lo)
✅ Namespace-scoped and cluster-wide RBAC
✅ Idempotent checkpoint creation
✅ Automatic signal-based checkpoint coordination

🚧 Planned Features

🚧 SGLang backend support
🚧 TensorRT-LLM backend support
🚧 S3/MinIO storage backend
🚧 OCI registry storage backend
🚧 Multi-GPU checkpoints
🚧 Multi-node distributed checkpoints

Limitations

⚠️ Important: ChReK has significant limitations that may impact production readiness:

Security Considerations

🔴 Privileged mode required: Restore pods must run in privileged mode for CRIU to function. This grants containers elevated host access and may violate security policies in many production environments.
Security Impact: Privileged containers can:
- Access all host devices
- Bypass most security restrictions
- Potentially compromise node security if the container is exploited

Technical Limitations

vLLM backend only: Currently only the vLLM backend supports checkpoint/restore. SGLang and TensorRT-LLM support is planned.
Single-node only: Checkpoints must be created and restored on the same node
Single-GPU only: Multi-GPU configurations are not yet supported
Network state limitations: Active TCP connections are closed during restore (use the CRIU --tcp-close option)
Storage: Only PVC storage is currently implemented (S3/OCI planned)

Recommendation

ChReK is best suited for:

✅ Development and testing environments
✅ Research and experimentation
✅ Controlled production environments with appropriate security controls
❌ Security-sensitive production workloads without proper risk assessment

Documentation

Getting Started

Dynamo Integration Guide - Using ChReK with Dynamo Platform
Standalone Usage Guide - Using ChReK independently
ChReK Helm Chart README - Helm chart configuration

Prerequisites

Kubernetes 1.21+
GPU nodes with NVIDIA runtime (nvidia runtime class)
CRIU support in container runtime (containerd with CRIU plugin)
RWX storage class (for multi-node deployments)
Security clearance for privileged pods (required for restore operations)
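
As a rough aid for the first prerequisite, the helper below checks whether a Kubernetes version string meets the 1.21 minimum. It is only a sketch: the function name is hypothetical, and it assumes the version is passed in as a string such as v1.27.3 (for example, copied from the server version that kubectl reports).

```shell
#!/bin/sh
# Hypothetical helper: succeed if a Kubernetes version string is 1.21 or newer.
# Accepts forms such as "v1.27.3", "1.30", or "v1.27.3-gke.100".
k8s_version_ok() {
    version="${1#v}"            # drop a leading "v" if present
    major="${version%%.*}"      # text before the first dot
    rest="${version#*.}"        # text after the first dot
    minor="${rest%%.*}"         # text before the next dot
    minor="${minor%%[!0-9]*}"   # strip suffixes such as "+" or "-gke"
    [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 21 ]; }
}
```

For example, k8s_version_ok v1.20.15 returns a non-zero status, while k8s_version_ok v1.21.0 succeeds.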

Troubleshooting

Common Issues

DaemonSet not starting?

Check GPU node labels: kubectl get nodes -l nvidia.com/gpu.present=true
Verify NVIDIA runtime is available

Checkpoint fails?

Check DaemonSet logs: kubectl logs -l app.kubernetes.io/name=chrek -n <namespace>
Ensure application properly signals readiness
Verify CRIU is installed in the runtime

Restore fails?

Ensure restore pod uses the same volumes as checkpoint job
Verify hostIPC: true is set (required for CUDA)
Check for PSM3_DISABLED=1 and GLOO_SOCKET_IFNAME=lo environment variables
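
The environment variable check in the last item can be scripted. The function below is a small sketch with a hypothetical name; it only compares the two variables against the values listed above and does not inspect the pod spec, so hostIPC and volume mounts still need to be checked separately.

```shell
#!/bin/sh
# Hypothetical pre-restore check: print one line per environment variable that
# does not have the expected value, and return non-zero if any problem is found.
check_restore_env() {
    status=0
    if [ "${PSM3_DISABLED:-}" != "1" ]; then
        echo "PSM3_DISABLED should be 1 (got '${PSM3_DISABLED:-unset}')"
        status=1
    fi
    if [ "${GLOO_SOCKET_IFNAME:-}" != "lo" ]; then
        echo "GLOO_SOCKET_IFNAME should be lo (got '${GLOO_SOCKET_IFNAME:-unset}')"
        status=1
    fi
    return "$status"
}
```

Running check_restore_env inside the restore container, for example through kubectl exec, prints nothing and succeeds when both variables are set as expected.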

For detailed troubleshooting, see:

Contributing

ChReK is part of the NVIDIA Dynamo project. Contributions are welcome!

License

Apache License 2.0
