Commit d21e3e8

Browse files
committed
Add a guide which uses nitric and llama
Demonstrate how a lightweight Llama model can be used with serverless compute
---
description: Use Llama model with serverless compute to translate text and store results using Nitric
tags:
  - Nitric
  - API
  - AI & Machine Learning
languages:
  - python
---
# Using Llama models with serverless infrastructure

This guide will walk you through setting up a lightweight translation service using the Llama model, combined with Nitric for API routing and bucket storage.

By leveraging serverless compute, you'll be able to deploy and run a machine learning model with minimal infrastructure overhead, making it a great fit for handling dynamic workloads such as real-time text translation.

## What we'll be doing

We will use the [Llama](https://huggingface.co/) models from Hugging Face for natural language processing, combined with Nitric to manage the API routes and storage. In this guide we'll cover:
1. Setting up the environment.
2. Creating the translation service.
3. Deploying the service.
4. Testing the translation functionality.

## Prerequisites

- [uv](https://docs.astral.sh/uv/#getting-started) - for Python dependency management
- The [Nitric CLI](/get-started/installation)
- _(optional)_ An [AWS](https://aws.amazon.com) account
## Project setup

We'll start by creating a new project for our translator service using Nitric's Python starter template.

```bash
nitric new translator py-starter
cd translator
```
Next, let's install our base dependencies, then add the extra dependency we need for loading our language model.

```bash
# Install the base dependencies
uv sync
# Add llama-cpp-python for running the Llama model
uv add llama-cpp-python
```
You will also need to [download the Llama model](https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/tree/main) file and ensure it is located in the `./models/` directory with the correct model file name.

In this guide we'll be using `Llama-3.2-1B-Instruct-Q4_K_M.gguf`. This model is ideal for serverless because its reduced size and efficient 4-bit quantization make it cost-effective and scalable, running smoothly within the resource limits of serverless compute environments while maintaining solid performance.
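One way to fetch the model is a direct download from the Hugging Face repository linked above (the URL below is an assumption based on that repository and file name; you can also download the file manually from your browser):

```bash
# Create the models directory and download the quantized model file
mkdir -p models
curl -L -o models/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
  "https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf"
```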
Your folder structure should look like this:

```bash
/translator
  /models
    Llama-3.2-1B-Instruct-Q4_K_M.gguf
  /services
    api.py
  nitric.yaml
  pyproject.toml
  python.dockerfile
  python.dockerfile.ignore
  README.md
  uv.lock
```
## Creating the translation service

Our project will use Nitric to handle API requests, and we will process the text translation using Llama. The results will be stored in a Nitric bucket.

Let's start by defining the translation logic using the Llama model. Remove the contents of `services/api.py` and replace them with the following code, which loads the Llama model and implements the translation functionality. We'll also do a basic calculation of the evaluation time for each request:
```python title:services/api.py
from llama_cpp import Llama
import time

# Load the locally stored Llama model
llama_model = Llama(model_path="./models/Llama-3.2-1B-Instruct-Q4_K_M.gguf")

# Function to perform translation using the Llama model
def translate_text(text):
    prompt = f'Translate "{text}" to Spanish.'

    start_time = time.time()

    # Generate a response using the locally stored model
    response = llama_model(
        prompt=prompt,
        max_tokens=150,
        temperature=0.7,
        top_p=0.9,
        stop=["\n"]
    )

    # Calculate evaluation time
    end_time = time.time()
    t_eval_ms = (end_time - start_time) * 1000

    translated_text = response['choices'][0]['text'].strip()
    return translated_text, response, t_eval_ms
```
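If you'd like to sanity-check the model before wiring it into an API, you can call the function directly (a quick sketch, assuming the model file is in `./models/`):

```python
# Quick local check of the translation function, e.g. in a REPL or scratch script
translated, raw_response, eval_ms = translate_text("Good morning")
print(translated, f"({eval_ms:.0f} ms)")
```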
## Building the API and adding storage

Now, let's integrate the translation logic into an API and store the results in a bucket.

Expand `api.py` with the following code:
```python title:services/api.py
import uuid
from nitric.resources import bucket
from nitric.application import Nitric
from nitric.resources import api
from nitric.context import HttpContext

# Define a Nitric bucket resource for storing translations
translations_bucket = bucket("translations").allow("write")

# Define an API for the translation service
main = api("main")

@main.post("/translate")
async def handle_translation(ctx: HttpContext):
    text = ctx.req.json["text"]

    unique_id = str(uuid.uuid4())

    try:
        translated_text, output, t_eval_ms = translate_text(text)

        # Save the translated text to the Nitric bucket
        translated_bytes = translated_text.encode()
        file_path = f"translations/{unique_id}/translated.txt"
        await translations_bucket.file(file_path).write(translated_bytes)

        ctx.res.body = {
            'output': output,
            't_eval_ms': t_eval_ms,
        }

    except Exception as e:
        ctx.res.body = {"error": str(e)}
        ctx.res.status = 500

Nitric.run()
```
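The example response later in this guide also shows a tokens-per-second (`tps`) figure. The handler above doesn't calculate it, but a minimal sketch of how you could derive it from the model output (assuming `llama-cpp-python` populates `usage.total_tokens`, which it does for non-streamed completions) looks like this:

```python
# Hypothetical helper (not part of the service above): tokens generated per second
def tokens_per_second(output: dict, t_eval_ms: float) -> float:
    total_tokens = output.get("usage", {}).get("total_tokens", 0)
    if t_eval_ms <= 0:
        return 0.0
    return total_tokens / (t_eval_ms / 1000)
```

You could then add `'tps': tokens_per_second(output, t_eval_ms)` to `ctx.res.body` if you want that figure in the response.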
### Ok, let's run this thing!

Now that you have your API route defined, it's time to test it locally.

The Python starter template uses a slim image, `python3.11-bookworm-slim`, which doesn't have the dependencies needed to build and load our Llama model. Let's update our dockerfile to use `python3.11-bookworm` instead.
```docker title:python.dockerfile
# Update line 2
FROM ghcr.io/astral-sh/uv:python3.11-bookworm AS builder

# Update line 19
FROM python:3.11-bookworm
```
Now we can run our services locally:

```bash
nitric run
```
<Note>
Nitric runs your application in a container that already includes the dependencies to use `llama_cpp`. If you'd rather use `nitric start` you'll need to install dependencies for llama-cpp-python such as [CMake](https://cmake.org/download/) and [LLVM](https://releases.llvm.org/download.html).
</Note>
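For example, on Debian or Ubuntu the native build tooling for `llama-cpp-python` could typically be installed like this (a sketch; package names and managers vary by platform, e.g. Homebrew on macOS):

```bash
# Build tooling needed to compile llama-cpp-python outside the Nitric container
sudo apt-get update
sudo apt-get install -y build-essential cmake clang
```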
Once it starts, you can easily test your application with the Nitric Dashboard. You can find the URL to the dashboard in the terminal running the Nitric CLI; by default it is http://localhost:49152.

![api dashboard](/docs/images/guides/serverless-llama/dashboard.png)
## Deploying to AWS

<Note>
You are responsible for staying within the limits of the free tier or any costs associated with deployment.
</Note>

Once your project is set up, create a new Nitric stack file for deployment to AWS:

```bash
nitric stack new dev aws
```

Update the stack file `nitric.dev.yaml` with the appropriate AWS region and memory allocation to handle the model.
```yaml title:nitric.dev.yaml
provider: nitric/[email protected]
region: us-east-1
# Configure your deployed functions/services
config:
  # How functions without a type will be deployed
  default:
    # configure a sample rate for telemetry (between 0 and 1) e.g. 0.5 is 50%
    telemetry: 0
    # configure functions to deploy to AWS Lambda
    lambda: # Available since v0.26.0
      # set 6144MB of RAM
      # See lambda configuration docs here:
      # https://docs.aws.amazon.com/lambda/latest/dg/configuration-function-common.html#configuration-memory-console
      memory: 6144
      # set a timeout of 30 seconds
      # See lambda timeout values here:
      # https://docs.aws.amazon.com/lambda/latest/dg/configuration-function-common.html#configuration-timeout-console
      timeout: 30
      # set 1024MB of ephemeral storage
      # For info on ephemeral-storage for AWS Lambda see:
      # https://docs.aws.amazon.com/lambda/latest/dg/configuration-ephemeral-storage.html
      ephemeral-storage: 1024
      # set a provisioned concurrency value
      # For info on provisioned concurrency for AWS Lambda see:
      # https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html
      provisioned-concurrency: 0
```
You can then deploy using the following command:

```bash
nitric up
```

To undeploy, run the following command:

```bash
nitric down
```
## Testing the translation functionality

To test the translation service, you can use any API testing tool such as Postman or cURL.

### Example request

Send a POST request to the `/translate` endpoint with the following JSON body:

```json
{
  "text": "Hello, how are you?"
}
```
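For example, using cURL against the locally running service (the URL below is an assumption; use the API address printed by `nitric run` or shown in the local dashboard):

```bash
# Replace the host/port with the address reported by the Nitric CLI
curl -X POST http://localhost:4001/translate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, how are you?"}'
```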
### Example response

The response includes the raw model output and the evaluation time (the `tps` field will only appear if you added the optional tokens-per-second helper shown earlier):

```json
{
  "output": {
    "choices": [
      {
        "text": "Hola, ¿cómo estás?"
      }
    ],
    "usage": {
      "total_tokens": 15
    }
  },
  "t_eval_ms": 200,
  "tps": 75.0
}
```

The translated text will also be stored in the `translations` bucket with a unique ID.
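If you later want to read a stored translation back out, a minimal sketch could look like this (assuming you also grant the bucket `read` permission and use the Nitric SDK's file `read()` method; the route below is illustrative and not part of the guide's service):

```python
# Illustrative addition: grant read access as well so stored results can be fetched
translations_bucket = bucket("translations").allow("read", "write")

@main.get("/translations/:id")
async def get_translation(ctx: HttpContext):
    unique_id = ctx.req.params["id"]
    file_path = f"translations/{unique_id}/translated.txt"
    data = await translations_bucket.file(file_path).read()
    ctx.res.body = {"translation": data.decode()}
```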
## Conclusion

In this guide, we demonstrated how you can use a lightweight machine learning model like Llama with serverless compute, enabling you to efficiently handle real-time translation tasks without the need for constant infrastructure management.

The combination of serverless architecture and on-demand model execution provides scalability, flexibility, and cost-efficiency, ensuring that resources are only consumed when necessary. This setup allows you to run lightweight models in a cloud-native way, ideal for dynamic applications requiring minimal operational overhead.