leon-liang/serverless-rag

Enabling Efficient Retrieval Augmented Generation in Serverless Computing

System Design

This project addresses the high cold-start latency of microservice-based Retrieval-Augmented Generation (RAG) systems by introducing two key components:

  • CRS (Checkpoint/Restore Service): A CLI wrapper around CRIU to checkpoint and restore containerized processes, significantly reducing the cold start time of pods by skipping time-consuming startup tasks like dependency loading and model initialization.
  • FUSE: A service-aware orchestrator that uses a DAG-based RAG pipeline to pre-warm downstream services just in time to handle requests. It also supports dynamic runtime input, making it adaptable to various RAG architectures.
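
The just-in-time pre-warming idea can be sketched in a few lines of Python. The fragment below is illustrative, not part of the repository: given each service's startup time, processing time, and dependencies (mirroring the FUSE pipeline configuration), it computes how many seconds after a request arrives each service's warm-up should be triggered so the service becomes ready exactly when its upstream stages finish.

```python
from typing import Dict, List, Tuple

# Hypothetical three-stage RAG pipeline: name -> (startupTime,
# processingTime, dependencies). Values are illustrative.
SERVICES: Dict[str, Tuple[float, float, List[str]]] = {
    "embedder":  (1.0, 2.0, []),
    "retriever": (1.5, 3.0, ["embedder"]),
    "generator": (2.0, 4.0, ["retriever"]),
}

def prewarm_schedule(services: Dict[str, Tuple[float, float, List[str]]]) -> Dict[str, float]:
    """Return, per service, the offset (seconds after request arrival)
    at which warm-up should start so the service is ready exactly when
    its dependencies finish processing."""
    finish: Dict[str, float] = {}

    def finish_time(name: str) -> float:
        # A service begins processing once all its dependencies finish.
        if name not in finish:
            _startup, processing, deps = services[name]
            begin = max((finish_time(d) for d in deps), default=0.0)
            finish[name] = begin + processing
        return finish[name]

    schedule: Dict[str, float] = {}
    for name, (startup, _processing, deps) in services.items():
        begin = max((finish_time(d) for d in deps), default=0.0)
        # Trigger warm-up startupTime seconds before processing begins,
        # but never before the request itself arrives.
        schedule[name] = max(0.0, begin - startup)
    return schedule

print(prewarm_schedule(SERVICES))
```

With these numbers, the generator's warm-up can be delayed three seconds after the request arrives and still be ready the moment the retriever hands over its output, which is the window FUSE exploits to avoid paying the full cold start up front.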

Hardware and System Setup

  • LRZ (Leibniz Supercomputing Centre) Compute Cloud VMs (website);
  • 3 VM nodes, each with one NVIDIA Tesla V100 GPU (16 GB);
  • Ubuntu 22.04 LTS (Jammy);
  • NVIDIA Driver 570.133.20, CUDA 12.8;
  • Helm v3.17.3;
  • Kubernetes 1.32.1;
  • KEDA 2.17.

Project Structure

CRS

A CLI wrapper around CRIU to checkpoint and restore containerized processes.

CRS-Deployment

Contains the CRSDeployment CRD as well as the associated Controller. The Controller automatically provisions a StatefulSet, mounts the required files, grants the elevated privileges CRIU needs, and creates the PersistentVolumeClaims used to store checkpoints. It also attaches a finalizer to ensure that the crs clear command is executed inside the container before deletion. This setup allows the application to start normally when new versions are deployed, while still benefiting from checkpoint-based restores when scaling through KEDA.

CRS-Python

Provides a helper function to send the create_checkpoint command to CRS, to trigger a snapshot.

FUSE

A service-aware orchestrator that uses a DAG-based RAG pipeline to pre-warm downstream services just in time to handle requests. It also supports dynamic runtime input, making it adaptable to various RAG architectures.

Examples

  • counter: Simple counter that shows the usage of CRS.
  • llm-applications: Full example of the usage of CRS and FUSE to reduce the cold start of RAG systems.

Usage

Prerequisite: Follow the steps outlined in infrastructure to create the Kubernetes cluster infrastructure, configure the cloud environment for NVIDIA GPUs, and set up the necessary dependencies.

Part 1: CRS

  1. Build and push CRS with the following command:
    docker build -t liangleon/crs:latest -f ./crs/Dockerfile .
    docker push liangleon/crs:latest
    
  2. The CRS binary can now be copied into your image and used to wrap your application, as shown below.
    COPY --from=liangleon/crs:latest /usr/local/bin/crs /usr/local/bin/crs
    RUN chmod +x /usr/local/bin/crs
    ENTRYPOINT ["crs", "run", "--", "fastapi", "run", "main.py"]
    
  3. The application can send a create_checkpoint command to CRS, which will trigger a snapshot to be created. The Python library crs-python provides a helper function to send this command to CRS.
    from crs_python import Checkpoint
    
    checkpoint = Checkpoint()
    checkpoint.create_checkpoint()
    
  4. When deploying the application to K8s, the CRSDeployment custom resource definition (CRD) needs to be installed first. From crs-deployment, run the following command to install the CRDs into the cluster.
    make install
    
  5. Build and push the Controller:
    make docker-build docker-push IMG=liangleon/controller:latest
    
  6. Deploy the Controller to the cluster:
    make deploy IMG=liangleon/controller:latest
    
  7. You can now deploy your application with the CRSDeployment CRD, as shown below:
    apiVersion: crs.leonliang.lu/v1
    kind: CRSDeployment
    metadata:
      name: classifier
    spec:
      replicas: 1
      image: liangleon/classifier:latest
      containerName: classifier
      containerPort: 8002
      imagePullPolicy: Always
    
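Putting steps 2 and 3 together: the snapshot should be requested only after all expensive one-time startup work has finished, so that a restore skips straight past it. A minimal sketch of this placement follows; load_model is an illustrative stand-in, and the crs_python import falls back to a stub so the fragment also runs outside a CRS-wrapped container.

```python
# Illustrative placement of the checkpoint call inside an application
# launched via `crs run -- ...` (step 2).
try:
    from crs_python import Checkpoint  # helper from crs-python (step 3)
except ImportError:
    # Stub so this sketch can run outside a CRS-wrapped container.
    class Checkpoint:
        def create_checkpoint(self):
            print("checkpoint requested")

def load_model():
    # Stand-in for the time-consuming startup work CRS lets us skip:
    # dependency loading, model initialization, index warm-up, ...
    return object()

model = load_model()              # expensive work happens before the snapshot
Checkpoint().create_checkpoint()  # state up to this point is captured
# Request handling below runs both on a normal start and after a restore.
```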

Part 2: FUSE

  1. Build and push FUSE with the following command:
    docker build -t liangleon/fuse:latest -f ./fuse/Dockerfile .
    docker push liangleon/fuse:latest
    
  2. Overwrite fuse.pipelineConfig by specifying the startup and processing time of each service, as well as their position within the dependency chain:
    services:
      - name: string
        startupTime: float 
        processingTime: float
        Dependencies: []
        Dependents: []
    
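For instance, a minimal two-stage pipeline could be described as follows (service names and timing values are illustrative, not taken from the repository):

```yaml
services:
  - name: retriever
    startupTime: 2.0
    processingTime: 1.0
    Dependencies: []
    Dependents: [generator]
  - name: generator
    startupTime: 8.0
    processingTime: 3.0
    Dependencies: [retriever]
    Dependents: []
```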

A full example of the usage of CRS and FUSE to reduce the cold start of RAG systems can be found in examples/llm-applications.

About

Reduce cold start time of RAG systems using checkpointing and pre-warming
