This project addresses the high cold-start latency of microservice-based Retrieval-Augmented Generation (RAG) systems by introducing two key components:
- CRS (Checkpoint/Restore Service): A CLI wrapper around CRIU to checkpoint and restore containerized processes, significantly reducing the cold start time of pods by skipping time-consuming startup tasks like dependency loading and model initialization.
- FUSE: A service-aware orchestrator that uses a DAG-based RAG pipeline to pre-warm downstream services just in time to handle requests. It also supports dynamic runtime input, making it adaptable to various RAG architectures.
- LRZ (Leibniz Supercomputing Centre) Compute Cloud VMs;
- NVIDIA Tesla V100, 16 GB (3 VM nodes, each with 1 GPU);
- Ubuntu 22.04 LTS (Jammy);
- NVIDIA Driver 570.133.20, CUDA 12.8;
- Helm v3.17.3;
- Kubernetes 1.32.1;
- KEDA 2.17
A CLI wrapper around CRIU to checkpoint and restore containerized processes.
Contains the CRSDeployment CRD as well as its associated controller. The controller automatically provisions a StatefulSet, mounts the necessary files, grants the elevated privileges CRIU requires, and creates the PersistentVolumeClaims needed for storing checkpoints. It also attaches a finalizer that ensures the `crs clear` command is executed inside the container before deletion.
This setup allows us to start the application normally when deploying new versions while still benefiting from checkpoint-based restores when scaling through KEDA.
Provides a helper function to send the `create_checkpoint` command to CRS, triggering a snapshot.
A service-aware orchestrator that uses a DAG-based RAG pipeline to pre-warm downstream services just in time to handle requests. It also supports dynamic runtime input, making it adaptable to various RAG architectures.
- counter: Simple counter that shows the usage of CRS.
- llm-applications: Full example of the usage of CRS and FUSE to reduce the cold start of RAG systems.
Prerequisite: Follow the steps outlined in infrastructure to create the Kubernetes cluster infrastructure, configure the cloud environment for NVIDIA GPUs, and set up the necessary dependencies.
- Build and push CRS with the following commands:

```shell
docker build -t liangleon/crs:latest -f ./crs/Dockerfile .
docker push liangleon/crs:latest
```

- It can now be copied into your Dockerfile and used to wrap your application as specified below:

```dockerfile
COPY --from=liangleon/crs:latest /usr/local/bin/crs /usr/local/bin/crs
RUN chmod +x /usr/local/bin/crs
ENTRYPOINT ["crs", "run", "--", "fastapi", "run", "main.py"]
```

- The application can send a `create_checkpoint` command to CRS, which will trigger a snapshot to be created. The Python library `crs-python` provides a helper function to send this command to CRS:

```python
from crs_python import Checkpoint

checkpoint = Checkpoint()
checkpoint.create_checkpoint()
```

- When deploying the application to K8s, the `CRSDeployment` custom resource needs to be installed first. From crs-deployment, run the following command to install the CRDs into the cluster:

```shell
make install
```

- Build and push the Controller:

```shell
make docker-build docker-push IMG=liangleon/controller:latest
```

- Deploy the Controller to the cluster:

```shell
make deploy IMG=liangleon/controller:latest
```

- You can now deploy your application with the `CRSDeployment` CRD, as shown below:

```yaml
apiVersion: crs.leonliang.lu/v1
kind: CRSDeployment
metadata:
  name: classifier
spec:
  replicas: 1
  image: liangleon/classifier:latest
  containerName: classifier
  containerPort: 8002
  imagePullPolicy: Always
```
- Build and push FUSE with the following commands:

```shell
docker build -t liangleon/fuse:latest -f ./fuse/Dockerfile .
docker push liangleon/fuse:latest
```

- Overwrite `fuse.pipelineConfig` by specifying the startup and processing time of each service, as well as its position within the dependency chain:

```yaml
services:
  - name: string
    startupTime: float
    processingTime: float
    Dependencies: []
    Dependents: []
```
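Given these per-service timings, the just-in-time pre-warming idea can be illustrated with a small scheduling sketch (a hypothetical simplification, not FUSE's actual implementation): a service must be warm by the time the request reaches it, which is determined by the longest chain of upstream processing times, so its restore trigger fires at that arrival time minus its own startup time. The service names and timings below are made up for illustration:

```python
# Toy three-stage RAG pipeline; names and timings are invented examples.
SERVICES = {
    "retriever": {"startupTime": 2.0, "processingTime": 0.5, "dependencies": []},
    "reranker":  {"startupTime": 4.0, "processingTime": 0.3, "dependencies": ["retriever"]},
    "generator": {"startupTime": 8.0, "processingTime": 1.0, "dependencies": ["reranker"]},
}

def arrival_time(name):
    """Seconds after the request enters the pipeline until it reaches
    `name`: the longest upstream chain of processing times in the DAG."""
    deps = SERVICES[name]["dependencies"]
    if not deps:
        return 0.0
    return max(arrival_time(d) + SERVICES[d]["processingTime"] for d in deps)

def prewarm_offset(name):
    """When to trigger the service's restore, relative to request arrival
    at the pipeline entry. A negative value means the startup time exceeds
    the upstream slack, so the restore must begin as early as possible."""
    return arrival_time(name) - SERVICES[name]["startupTime"]
```

In this toy pipeline the generator is reached last (0.8 s after entry) but has by far the longest startup, so its restore must be triggered first (offset about -7.2 s), while the retriever only needs a 2 s head start.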
A full example of the usage of CRS and FUSE to reduce the cold start of RAG systems can be found in examples/llm-applications.
