This project addresses the high cold-start latency of microservice-based Retrieval-Augmented Generation (RAG) systems by introducing two key components:
- CRS (Checkpoint/Restore Service): A CLI wrapper around CRIU to checkpoint and restore containerized processes, significantly reducing the cold start time of pods by skipping time-consuming startup tasks like dependency loading and model initialization.
- FUSE: A service-aware orchestrator that uses a DAG-based RAG pipeline to pre-warm downstream services just in time to handle requests. It also supports dynamic runtime input, making it adaptable to various RAG architectures.
- LRZ (Leibniz Supercomputing Centre) Compute Cloud VMs;
- NVIDIA Tesla V100, 16 GB (3 VM nodes, each with 1 GPU);
- Ubuntu 22.04 LTS (Jammy);
- NVIDIA Driver 570.133.20, CUDA 12.8;
- Helm v3.17.3;
- Kubernetes 1.32.1;
- KEDA 2.17
A CLI wrapper around CRIU to checkpoint and restore containerized processes.
Contains the CRSDeployment CRD as well as its associated controller. The controller automatically provisions a StatefulSet, mounts the necessary files, grants the elevated privileges CRIU requires, and creates the PersistentVolumeClaims needed for storing checkpoints. It also attaches a finalizer that ensures the `crs clear` command is executed inside the container before deletion.
This setup allows us to start the application normally when deploying new versions while still benefiting from checkpoint-based restores when scaling through KEDA.
Provides a helper function to send the `create_checkpoint` command to CRS, triggering a snapshot.
A service-aware orchestrator that uses a DAG-based RAG pipeline to pre-warm downstream services just in time to handle requests. It also supports dynamic runtime input, making it adaptable to various RAG architectures.
- counter: Simple counter that shows the usage of CRS.
- llm-applications: Full example of the usage of CRS and FUSE to reduce the cold start of RAG systems.
Prerequisite: Follow the steps outlined in infrastructure to create the Kubernetes cluster infrastructure, configure the cloud environment for NVIDIA GPUs, and set up the necessary dependencies.
- Build and push CRS with the following commands:

```shell
docker build -t liangleon/crs:latest -f ./crs/Dockerfile .
docker push liangleon/crs:latest
```

- It can now be copied into your Dockerfile and used to wrap your application as specified below:

```dockerfile
COPY --from=liangleon/crs:latest /usr/local/bin/crs /usr/local/bin/crs
RUN chmod +x /usr/local/bin/crs
ENTRYPOINT ["crs", "run", "--", "fastapi", "run", "main.py"]
```

- The application can send a `create_checkpoint` command to CRS, which will trigger a snapshot to be created. The Python library `crs-python` provides a helper function to send this command to CRS:

```python
from crs_python import Checkpoint

checkpoint = Checkpoint()
checkpoint.create_checkpoint()
```

- When deploying the application to K8s, the `CRSDeployment` custom resource needs to be installed first. From crs-deployment, run the following command to install the CRDs into the cluster:

```shell
make install
```

- Build and push the Controller:

```shell
make docker-build docker-push IMG=liangleon/controller:latest
```

- Deploy the Controller to the cluster:

```shell
make deploy IMG=liangleon/controller:latest
```

- You can now deploy your application with the `CRSDeployment` CRD, as shown below:

```yaml
apiVersion: crs.leonliang.lu/v1
kind: CRSDeployment
metadata:
  name: classifier
spec:
  replicas: 1
  image: liangleon/classifier:latest
  containerName: classifier
  containerPort: 8002
  imagePullPolicy: Always
```
- Build and push FUSE with the following commands:

```shell
docker build -t liangleon/fuse:latest -f ./fuse/Dockerfile .
docker push liangleon/fuse:latest
```

- Overwrite `fuse.pipelineConfig` by specifying the startup and processing time of each service, as well as its position within the dependency chain:

```yaml
services:
  - name: string
    startupTime: float
    processingTime: float
    Dependencies: []
    Dependents: []
```
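Given these per-service timings, the just-in-time pre-warming idea can be illustrated with a small scheduling sketch (a hypothetical simplification, not FUSE's actual implementation): a service must be warm by the time the request reaches it, which is determined by the longest chain of upstream processing times, so its restore trigger fires at that arrival time minus its own startup time. The service names and timings below are made up for illustration:

```python
# Toy three-stage RAG pipeline; names and timings are invented examples.
SERVICES = {
    "retriever": {"startupTime": 2.0, "processingTime": 0.5, "dependencies": []},
    "reranker":  {"startupTime": 4.0, "processingTime": 0.3, "dependencies": ["retriever"]},
    "generator": {"startupTime": 8.0, "processingTime": 1.0, "dependencies": ["reranker"]},
}

def arrival_time(name):
    """Seconds after the request enters the pipeline until it reaches
    `name`: the longest upstream chain of processing times in the DAG."""
    deps = SERVICES[name]["dependencies"]
    if not deps:
        return 0.0
    return max(arrival_time(d) + SERVICES[d]["processingTime"] for d in deps)

def prewarm_offset(name):
    """When to trigger the service's restore, relative to request arrival
    at the pipeline entry. A negative value means the startup time exceeds
    the upstream slack, so the restore must begin as early as possible."""
    return arrival_time(name) - SERVICES[name]["startupTime"]
```

In this toy pipeline the generator is reached last (0.8 s after entry) but has by far the longest startup, so its restore must be triggered first (offset about -7.2 s), while the retriever only needs a 2 s head start.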
A full example of the usage of CRS and FUSE to reduce the cold start of RAG systems can be found in examples/llm-applications.
