---
description: Use a Llama model with serverless compute to translate text and store results using Nitric
tags:
- Nitric
- API
- AI & Machine Learning
languages:
- python
---

# Using Llama models with serverless infrastructure

This guide will walk you through setting up a lightweight translation service using the Llama model, combined with Nitric for API routing and bucket storage.

By leveraging serverless compute, you'll be able to deploy and run a machine learning model with minimal infrastructure overhead, making it a great fit for handling dynamic workloads such as real-time text translation.

## What we'll be doing

We'll use a [Llama](https://huggingface.co/) model from Hugging Face for the translation itself, and Nitric to manage the API routes and storage. The steps are:

1. Setting up the environment.
2. Creating the translation service.
3. Deploying the service.
4. Testing the translation functionality.

## Prerequisites

- [uv](https://docs.astral.sh/uv/#getting-started) - for Python dependency management
- The [Nitric CLI](/get-started/installation)
- _(optional)_ An [AWS](https://aws.amazon.com) account

## Project setup

We'll start by creating a new project for our translator service using Nitric's Python starter template.

```bash
nitric new translator py-starter
cd translator
```

Next, let's install our base dependencies, then add the extra dependencies we need specifically for loading our language model.

```bash
# Install the base dependencies
uv sync
uv add llama-cpp-python
```

You will also need to [download the Llama model](https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/tree/main) file and ensure it is located in the `./models/` directory with the correct model file name.

In this guide we'll be using `Llama-3.2-1B-Instruct-Q4_K_M.gguf`. This model is a good fit for serverless: its small size and efficient 4-bit quantization keep it cost-effective and able to run within the resource limits of serverless compute environments while maintaining solid performance.
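
If you'd rather script the download than fetch the file through your browser, a small helper like the one below works. The repository and file name come from the link above, and the `resolve/main` URL is Hugging Face's standard direct-download path; the script itself is just an illustration and not part of the starter template.

```python
# download_model.py (hypothetical helper, not part of the starter template)
# Downloads the GGUF file into ./models/ using Hugging Face's direct-download URL pattern.
import urllib.request
from pathlib import Path

REPO = "bartowski/Llama-3.2-1B-Instruct-GGUF"
FILENAME = "Llama-3.2-1B-Instruct-Q4_K_M.gguf"
URL = f"https://huggingface.co/{REPO}/resolve/main/{FILENAME}"

destination = Path("models") / FILENAME
destination.parent.mkdir(parents=True, exist_ok=True)

print(f"Downloading {FILENAME}...")
urllib.request.urlretrieve(URL, destination)
print(f"Saved to {destination}")
```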

Your folder structure should look like this:

```bash
/translator
  /models
    Llama-3.2-1B-Instruct-Q4_K_M.gguf
  /services
    api.py
  nitric.yaml
  pyproject.toml
  python.dockerfile
  python.dockerfile.ignore
  README.md
  uv.lock
```

## Creating the translation service

Our project will use Nitric to handle API requests, and we will process the text translation using Llama. The results will be stored in a Nitric bucket.

Let's start by defining the translation logic using the Llama model.

Remove the contents of `services/api.py` and replace them with the following code, which loads the Llama model and implements the translation function. We'll also record how long each model evaluation takes so we can report it alongside the result:

```python title:services/api.py
from llama_cpp import Llama
import time

# Load the locally stored Llama model
llama_model = Llama(model_path="./models/Llama-3.2-1B-Instruct-Q4_K_M.gguf")

# Function to perform translation using the Llama model
def translate_text(text):
    prompt = f'Translate "{text}" to Spanish.'

    start_time = time.time()

    # Generate a response using the locally stored model
    response = llama_model(
        prompt=prompt,
        max_tokens=150,
        temperature=0.7,
        top_p=0.9,
        stop=["\n"]
    )

    # Calculate evaluation time
    end_time = time.time()
    t_eval_ms = (end_time - start_time) * 1000

    translated_text = response['choices'][0]['text'].strip()
    return translated_text, response, t_eval_ms
```
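
Before wiring this function into an API, you can optionally sanity-check that the model file loads and generates text. A minimal throwaway script (hypothetical, run with `uv run python check_model.py`) might look like this:

```python
# check_model.py (hypothetical throwaway script)
# Loads the downloaded GGUF file and runs a single prompt to confirm everything works.
from llama_cpp import Llama

llm = Llama(model_path="./models/Llama-3.2-1B-Instruct-Q4_K_M.gguf")

result = llm(
    prompt='Translate "Good morning" to Spanish.',
    max_tokens=32,
    stop=["\n"],
)
print(result["choices"][0]["text"].strip())
```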

## Building the API and adding storage

Now, let's integrate the translation logic into an API and store the results in a bucket.

Expand `api.py` with the following code:

```python title:services/api.py
import uuid
from nitric.resources import api, bucket
from nitric.application import Nitric
from nitric.context import HttpContext

# Define a Nitric bucket resource for storing translations
translations_bucket = bucket("translations").allow("write")

# Define an API for the translation service
main = api("main")

@main.post("/translate")
async def handle_translation(ctx: HttpContext):
    text = ctx.req.json["text"]

    unique_id = str(uuid.uuid4())

    try:
        translated_text, output, t_eval_ms = translate_text(text)

        # Save the translated text to the Nitric bucket
        translated_bytes = translated_text.encode()
        file_path = f"translations/{unique_id}/translated.txt"
        await translations_bucket.file(file_path).write(translated_bytes)

        # Calculate tokens per second from the model's reported token usage
        total_tokens = output["usage"]["total_tokens"]
        tps = total_tokens / (t_eval_ms / 1000) if t_eval_ms > 0 else 0

        ctx.res.body = {
            'output': output,
            't_eval_ms': t_eval_ms,
            'tps': tps,
        }

    except Exception as e:
        ctx.res.body = {"error": str(e)}
        ctx.res.status = 500

Nitric.run()
```

### Ok, let's run this thing!

Now that you have your API route defined, it's time to test it locally.
The starter template for Python uses a slim image, `python3.11-bookworm-slim`, which doesn't include the dependencies needed to build and load our Llama model, so let's update the Dockerfile to use `python3.11-bookworm` instead.

```dockerfile title:python.dockerfile
# Update line 2:
FROM ghcr.io/astral-sh/uv:python3.11-bookworm AS builder

# Update line 19:
FROM python:3.11-bookworm
```

Now we can run our services locally:

```bash
nitric run
```

<Note>
Nitric runs your application in a container that already includes the dependencies to use `llama_cpp`. If you'd rather use `nitric start`, you'll need to install the dependencies for llama-cpp-python, such as [CMake](https://cmake.org/download/) and [LLVM](https://releases.llvm.org/download.html).
</Note>

Once it starts, you can test your application with the Nitric Dashboard. The dashboard URL is printed in the terminal running the Nitric CLI; by default it is http://localhost:49152.

![api dashboard](/docs/images/guides/serverless-llama/dashboard.png)

## Deploying to AWS

<Note>
You are responsible for staying within the limits of the free tier or any costs associated with deployment.
</Note>

Once your project is set up, create a new Nitric stack file for deployment to AWS:

```bash
nitric stack new dev aws
```

Update the stack file `nitric.dev.yaml` with the appropriate AWS region and memory allocation to handle the model.

```yaml title:nitric.dev.yaml
provider: nitric/[email protected]
region: us-east-1
# Configure your deployed functions/services
config:
  # How functions without a type will be deployed
  default:
    # configure a sample rate for telemetry (between 0 and 1) e.g. 0.5 is 50%
    telemetry: 0
    # configure functions to deploy to AWS lambda
    lambda: # Available since v0.26.0
      # set 6GB of RAM
      # Lambda vCPU allocation is proportional to memory, and more vCPUs improve LLM inference speed
      # See lambda configuration docs here:
      # https://docs.aws.amazon.com/lambda/latest/dg/configuration-function-common.html#configuration-memory-console
      memory: 6144
      # set a timeout of 30 seconds
      # See lambda timeout values here:
      # https://docs.aws.amazon.com/lambda/latest/dg/configuration-function-common.html#configuration-timeout-console
      timeout: 30
      # set 1024MB of ephemeral storage
      # For info on ephemeral-storage for AWS Lambda see:
      # https://docs.aws.amazon.com/lambda/latest/dg/configuration-ephemeral-storage.html
      ephemeral-storage: 1024
      # set a provisioned concurrency value (0 avoids the cost of pre-warmed instances)
      # For info on provisioned concurrency for AWS Lambda see:
      # https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html
      provisioned-concurrency: 0
```

You can then deploy using the following command:

```bash
nitric up
```

To undeploy, run the following command:

```bash
nitric down
```

## Testing the translation functionality

To test the translation service, you can use any API testing tool such as Postman or cURL.

### Example request

Send a POST request to the `/translate` endpoint with the following JSON body:

```json
{
  "text": "Hello, how are you?"
}
```
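
For example, you can send the request with a short Python script using only the standard library. The base URL below is a placeholder: substitute the local address printed by `nitric run` (and shown in the dashboard), or your deployed API gateway URL.

```python
# post_translate.py (hypothetical helper) - sends the request above to the API.
import json
import urllib.request

BASE_URL = "http://localhost:4001"  # placeholder - use the address shown by `nitric run`

payload = json.dumps({"text": "Hello, how are you?"}).encode()
request = urllib.request.Request(
    f"{BASE_URL}/translate",
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)

with urllib.request.urlopen(request) as response:
    print(json.dumps(json.loads(response.read()), indent=2, ensure_ascii=False))
```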

### Example response

The response will include the translation details, evaluation time, and tokens per second:

```json
{
  "output": {
    "choices": [
      {
        "text": "Hola, ¿cómo estás?"
      }
    ],
    "usage": {
      "total_tokens": 15
    }
  },
  "t_eval_ms": 200,
  "tps": 75.0
}
```

The translated text will also be stored in the `translations` bucket with a unique ID.
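
If you later want to retrieve a stored translation from another service, the bucket will also need read permission. A minimal sketch, assuming the Nitric Python SDK's bucket read API and a hypothetical `reader` API, might look like this:

```python
# reader.py (hypothetical service) - reads a stored translation back out of the bucket.
# Note the "read" permission here, in contrast to the write-only permission used above.
from nitric.resources import api, bucket
from nitric.application import Nitric
from nitric.context import HttpContext

translations = bucket("translations").allow("read")
reader = api("reader")

@reader.get("/translations/:id")
async def get_translation(ctx: HttpContext):
    unique_id = ctx.req.params["id"]
    data = await translations.file(f"translations/{unique_id}/translated.txt").read()
    ctx.res.body = data.decode()

Nitric.run()
```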

## Conclusion

In this guide, we demonstrated how you can use a lightweight machine learning model like Llama with serverless compute, enabling you to efficiently handle real-time translation tasks without the need for constant infrastructure management.

The combination of serverless architecture and on-demand model execution provides scalability, flexibility, and cost-efficiency, ensuring that resources are only consumed when necessary. This setup allows you to run lightweight models in a cloud-native way, ideal for dynamic applications requiring minimal operational overhead.