---
description: Use Llama model with serverless compute to translate text and store results using Nitric
tags:
  - Nitric
  - API
  - AI & Machine Learning
languages:
  - python
---

# Using Llama models with serverless infrastructure

This guide will walk you through setting up a lightweight translation service using a Llama model, combined with Nitric for API routing and bucket storage.

By leveraging serverless compute, you'll be able to deploy and run a machine learning model with minimal infrastructure overhead, making it a great fit for dynamic workloads such as real-time text translation.

## What we'll be doing

We will use the [Llama](https://huggingface.co/) models from Hugging Face for natural language processing, combined with Nitric to manage the API routes and storage.

1. Setting up the environment.
2. Creating the translation service.
3. Deploying the service.
4. Testing the translation functionality.

## Prerequisites

- [uv](https://docs.astral.sh/uv/#getting-started) - for Python dependency management (install command below)
- The [Nitric CLI](/get-started/installation)
- _(optional)_ An [AWS](https://aws.amazon.com) account
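
If you don't already have uv installed, the standalone installer from the uv docs is a one-liner on macOS and Linux (see the uv documentation linked above for Windows and other install methods):

```bash
# Install uv (macOS/Linux) - see the uv docs for other platforms
curl -LsSf https://astral.sh/uv/install.sh | sh
```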

## Project setup

We'll start by creating a new project for our translator service using Nitric's Python starter template.

```bash
nitric new translator py-starter
cd translator
```

Next, let's install the base dependencies, then add the extra dependency we need for loading our language model.

```bash
# Install the base dependencies
uv sync

# Add llama-cpp-python for loading and running the Llama model
uv add llama-cpp-python
```

You will also need to [download the Llama model](https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/tree/main) file and make sure it's located in the `./models/` directory with the correct model file name.

In this guide we'll use `Llama-3.2-1B-Instruct-Q4_K_M.gguf`. This model is a good fit for serverless: its small size and efficient 4-bit quantization make it cost-effective and scalable, and it runs within the resource limits of serverless compute environments while maintaining solid performance.
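
If you'd rather script the download than grab the file from the browser, one option is the Hugging Face CLI. This is just a sketch and assumes you have `huggingface_hub` installed separately; it isn't one of this project's dependencies:

```bash
# Optional helper: fetch the GGUF file straight into ./models/
# Requires the Hugging Face CLI, e.g. `pip install -U "huggingface_hub[cli]"`
huggingface-cli download bartowski/Llama-3.2-1B-Instruct-GGUF \
  Llama-3.2-1B-Instruct-Q4_K_M.gguf --local-dir ./models
```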

Your folder structure should look like this:

```bash
/translator
  /models
    Llama-3.2-1B-Instruct-Q4_K_M.gguf
  /services
    api.py
  nitric.yaml
  pyproject.toml
  python.dockerfile
  python.dockerfile.ignore
  README.md
  uv.lock
```

## Creating the translation service

Our project will use Nitric to handle API requests, and we'll process the text translation using Llama. The results will be stored in a Nitric bucket.

Let's start by defining the translation logic using the Llama model.

Remove the contents of `services/api.py` and replace them with the following code, which loads the Llama model and implements the translation functionality. We'll also do some basic timing so we can report how long each evaluation takes:

```python title:services/api.py
from llama_cpp import Llama
import time

# Load the locally stored Llama model
llama_model = Llama(model_path="./models/Llama-3.2-1B-Instruct-Q4_K_M.gguf")

# Function to perform translation using the Llama model
def translate_text(text):
    prompt = f'Translate "{text}" to Spanish.'

    start_time = time.time()

    # Generate a response using the locally stored model
    response = llama_model(
        prompt=prompt,
        max_tokens=150,
        temperature=0.7,
        top_p=0.9,
        stop=["\n"],
    )

    # Calculate evaluation time
    end_time = time.time()
    t_eval_ms = (end_time - start_time) * 1000

    translated_text = response['choices'][0]['text'].strip()
    return translated_text, response, t_eval_ms
```

## Building the API and adding storage

Now, let's integrate the translation logic into an API and store the results in a bucket.

Expand `api.py` with the following code:

```python title:services/api.py
import uuid
from nitric.resources import api, bucket
from nitric.application import Nitric
from nitric.context import HttpContext

# Define a Nitric bucket resource for storing translations
translations_bucket = bucket("translations").allow("write")

# Define an API for the translation service
main = api("main")

@main.post("/translate")
async def handle_translation(ctx: HttpContext):
    text = ctx.req.json["text"]

    unique_id = str(uuid.uuid4())

    try:
        translated_text, output, t_eval_ms = translate_text(text)

        # Save the translated text to the Nitric bucket
        translated_bytes = translated_text.encode()
        file_path = f"translations/{unique_id}/translated.txt"
        await translations_bucket.file(file_path).write(translated_bytes)

        ctx.res.body = {
            'output': output,
            't_eval_ms': t_eval_ms,
        }

    except Exception as e:
        ctx.res.body = {"error": str(e)}
        ctx.res.status = 500

Nitric.run()
```

### Ok, let's run this thing!

Now that you have your API route defined, it's time to test it locally.

The starter template for Python uses a slim image, `python3.11-bookworm-slim`, which doesn't have the dependencies needed to build and load our Llama model, so let's update our dockerfile to use `python3.11-bookworm` instead.

```docker title:python.dockerfile
# Update line 2
FROM ghcr.io/astral-sh/uv:python3.11-bookworm AS builder

# Update line 19
FROM python:3.11-bookworm
```

Now we can run our services locally:

```bash
nitric run
```

<Note>
Nitric runs your application in a container that already includes the dependencies to use `llama_cpp`. If you'd rather use `nitric start`, you'll need to install the dependencies for `llama-cpp-python` yourself, such as [CMake](https://cmake.org/download/) and [LLVM](https://releases.llvm.org/download.html).
</Note>

Once it starts, you can easily test your application with the Nitric Dashboard. You can find the URL to the dashboard in the terminal running the Nitric CLI; by default it's http://localhost:49152.

![api dashboard](/docs/images/guides/llama/dashboard.png)
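
You can also hit the endpoint directly from your terminal. The request below is illustrative: it assumes the local API is being served on port 4001, so swap in the address shown in your dashboard or CLI output.

```bash
# Example request - replace the port with the one shown by the Nitric CLI
curl -X POST http://localhost:4001/translate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, how are you today?"}'
```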

## Deploying to AWS

<Note>
You are responsible for staying within the limits of the free tier or any costs associated with deployment.
</Note>

Once your project is set up, create a new Nitric stack file for deployment to AWS:

```bash
nitric stack new dev aws
```

Update the stack file `nitric.dev.yaml` with the appropriate AWS region and memory allocation to handle the model:

```yaml title:nitric.dev.yaml
provider: nitric/[email protected]
region: us-east-1
# Configure your deployed functions/services
config:
  # How functions without a type will be deployed
  default:
    # configure a sample rate for telemetry (between 0 and 1) e.g. 0.5 is 50%
    telemetry: 0
    # configure functions to deploy to AWS Lambda
    lambda: # Available since v0.26.0
      # set 6GB of RAM - Lambda vCPUs are allocated in proportion to memory,
      # and more vCPUs will improve LLM inference speed
      memory: 6144
      # set a timeout of 15 seconds
      # See lambda timeout values here:
      # https://docs.aws.amazon.com/lambda/latest/dg/configuration-function-common.html#configuration-timeout-console
      timeout: 15
      # set a provisioned concurrency value
      # For info on provisioned concurrency for AWS Lambda see:
      # https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html
      provisioned-concurrency: 0
```
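
With your stack file updated, you can deploy the project and tear it down again using the standard Nitric CLI commands:

```bash
# Deploy the stack to AWS
nitric up

# Tear down the stack when you're finished
nitric down
```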