
Model Serving with Red Hat AI Inference Server

Overview and Objectives

Welcome to Model Serving with Red Hat AI Inference Server (RHAIIS), a hands-on lab designed to give you practical experience serving large language models (LLMs) with RHAIIS.

By the end of this course, you will be able to:

  1. Deploy and serve an LLM for inference using two different Granite models (2B and 8B parameters), gaining firsthand experience with a powerful, enterprise-grade platform.

  2. Optimize GPU resource usage by monitoring memory consumption in real time and by understanding how RHAIIS loads model weights and manages the KV cache. You'll learn to tune performance by controlling key parameters such as max_tokens (see the example request after this list).

  3. Troubleshoot and solve deployment challenges, specifically by working through the advanced steps needed to successfully launch a larger 8B model. This will build your skills for real-world scenarios.

  4. Lay a foundation for further exploration by using the Red Hat AI Model Repository on Hugging Face to serve and experiment with more models after the lab exercises.
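As a preview of the kind of control you'll practice, here is a minimal sketch of a request that caps generation length with max_tokens. It assumes the server exposes vLLM's OpenAI-compatible API on localhost port 8000 and that a Granite 2B model is loaded under the name shown; both are assumptions, so adjust them to match your deployment.

[source,bash]
----
# Cap the response at 50 generated tokens with max_tokens.
# Endpoint and model name are assumptions; substitute your own values.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ibm-granite/granite-3.1-2b-instruct",
        "messages": [{"role": "user", "content": "What is the KV cache?"}],
        "max_tokens": 50
      }'
----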

While the lab environment initializes, take this opportunity to review the provided lab guide. It covers essential topics like RHAIIS requirements, supported deployments, and advanced vLLM configuration settings, giving you the context you need to succeed.

Ready to get started? Let’s dive into a powerful platform that helps you deploy AI models with flexibility and high performance across any hybrid cloud environment.

Outcomes

Upon completing this lab, you will be able to:

  • Deploy the AI Inference Server for Hugging Face-based models using Podman (a command sketch follows this list).
  • Verify the model is serving correctly by interacting with its API.
  • Monitor the GPU's video memory (VRAM) usage in real time.
  • Tune server parameters to control memory consumption and context length.
  • Deploy and test an alternative model to see the platform's flexibility.
  • Determine the maximum context length (max-model-len) for a given model.
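To make these outcomes concrete, the following is a hedged sketch of the kind of podman run invocation the lab builds toward. The image tag, model name, cache path, and flag values are illustrative assumptions, not the lab's exact values; the lab guide has the authoritative command.

[source,bash]
----
# Illustrative only: expose the GPU through CDI, pass a Hugging Face
# read token, cache weights on the host, and publish the API port.
# Image tag and model name are assumptions; follow the lab guide.
podman run --rm -it \
  --device nvidia.com/gpu=all \
  -e HF_TOKEN="$HF_TOKEN" \
  -v ~/.cache/huggingface:/opt/app-root/src/.cache/huggingface:Z \
  -p 8000:8000 \
  registry.redhat.io/rhaiis/vllm-cuda-rhel9:latest \
  --model ibm-granite/granite-3.1-2b-instruct \
  --max-model-len 4096
----

While the model loads, running nvidia-smi in a second terminal shows the weights and then the KV cache claiming VRAM, which is exactly what the monitoring and tuning outcomes above exercise.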

Environment Prerequisites

Your lab environment has been pre-configured with the following (a quick verification sketch follows the list):

  • A Red Hat Enterprise Linux 9.x system with a valid subscription.
  • An attached and configured NVIDIA data center GPU with drivers installed.
  • Podman and the NVIDIA Container Toolkit are pre-installed.
  • Credentials for a Red Hat account to access registry.redhat.io (provided for this lab).
  • A Hugging Face account with a User Access Token that has read permissions (provided for this lab).
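If you want to confirm the environment yourself before starting, a few standard commands are enough:

[source,bash]
----
nvidia-smi                        # GPU visible and driver loaded
podman --version                  # Podman installed
podman login registry.redhat.io   # accepts the provided Red Hat credentials
----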

Setting Up Your RHAIIS Lab Environment

A dedicated lab environment for this training is currently in development. In the meantime, you can use the equivalent environment on the Red Hat Demo Platform, https://catalog.demo.redhat.com/catalog?item=babylon-catalog-prod/rhdp.rhaiis-on-rhel.prod&utm_source=webapp&utm_medium=share-link[*Base Red Hat AI Inference Server (RHAIIS)*, window=_blank], which comes pre-configured with RHAIIS.

This environment includes some pre-configured bonus content:

  • A bonus lab that shows you how to connect to the AI model using Python.

  • A Qwen2.5 model running on RHAIIS as a systemd service; it starts automatically when the system boots.

Once you have completed the initial exercises, you can stop this service to free up the environment for this lab's main activities. To stop the service, run the following command:

[source,bash]
----
sudo systemctl stop rhaiis.service
----
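To confirm the service has stopped and its GPU memory has been released, you can check with:

[source,bash]
----
systemctl is-active rhaiis.service   # prints "inactive" once stopped
nvidia-smi                           # VRAM usage drops after vLLM exits
----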
