This repository was archived by the owner on May 20, 2025. It is now read-only.

Commit ae30868

raksiv, HomelessDinosaur, jyecusch, davemooreuws authored
Guide/serverless llama (#668)
Co-authored-by: Ryan Cartwright <[email protected]> Co-authored-by: Jye Cusch <[email protected]> Co-authored-by: David Moore <[email protected]> Co-authored-by: David Moore <[email protected]>
1 parent e594635 commit ae30868

File tree

9 files changed: +340 −22 lines changed

contentlayer.config.ts

Lines changed: 4 additions & 0 deletions
```diff
@@ -35,6 +35,10 @@ const baseFields: FieldDefs = {
     type: 'boolean',
     description: 'Disable the github edit button',
   },
+  canonical_url: {
+    type: 'string',
+    description: 'The canonical url of the doc, if different from the url',
+  },
 }

 const computedFields: ComputedFields = {
```

cypress/e2e/seo.cy.ts

Lines changed: 18 additions & 20 deletions
```diff
@@ -1,20 +1,18 @@
-import * as pages from '../fixtures/pages.json'
-
-// redirects can go here
-const redirects = {}
-
-describe('canonical urls', () => {
-  pages.forEach((page) => {
-    it(`Should test page ${page} for correct canonical url`, () => {
-      cy.visit(page)
-
-      cy.get('link[rel="canonical"]')
-        .invoke('attr', 'href')
-        .should('equal', `http://localhost:3000${redirects[page] || page}`)
-
-      cy.get('meta[property="og:url"]')
-        .invoke('attr', 'content')
-        .should('equal', `http://localhost:3000${redirects[page] || page}`)
-    })
-  })
-})
+import * as pages from '../fixtures/pages.json'
+
+// redirects can go here
+const redirects = {}
+
+describe('canonical urls', () => {
+  pages.forEach((page) => {
+    it(`Should test page ${page} for correct canonical url`, () => {
+      cy.visit(page)
+
+      cy.get('link[rel="canonical"]').should('exist')
+
+      cy.get('meta[property="og:url"]')
+        .invoke('attr', 'content')
+        .should('equal', `http://localhost:3000${redirects[page] || page}`)
+    })
+  })
+})
```

dictionary.txt

Lines changed: 3 additions & 0 deletions
```diff
@@ -219,6 +219,9 @@ ctx
 reproducibility
 misconfigurations
 DSL
+LM
+1B
+quantized
 UI
 init
 LLMs?
```
Lines changed: 310 additions & 0 deletions

---
description: Llama 3.2 models on serverless infrastructure like AWS Lambda using Nitric
tags:
  - API
  - AI & Machine Learning
languages:
  - python
image: /docs/images/guides/serverless-llama/banner.png
image_alt: 'Serverless llama guide banner'
featured:
  image: /docs/images/guides/serverless-llama/featured.png
  image_alt: 'Serverless llama guide featured image'
published_at: 2024-12-16
canonical_url: https://thenewstack.io/running-llama-3-2-on-aws-lambda
---

# Llama 3.2 on serverless infrastructure like AWS Lambda

This guide demonstrates how to run Llama 3.2 1B on serverless infrastructure. Llama 3.2 1B is a lightweight model, which makes it interesting for serverless applications since it can run relatively quickly without requiring GPU acceleration. We'll use models from [Hugging Face](https://huggingface.co/) and Nitric to manage the surrounding infrastructure, such as API routes and deployments.

## Prerequisites

- [uv](https://docs.astral.sh/uv/#getting-started) - for Python dependency management
- The [Nitric CLI](/get-started/installation)
- _(optional)_ An [AWS](https://aws.amazon.com) account

## Project setup

Let's start by creating a new project using Nitric's Python starter template.

```bash
nitric new llama py-starter
cd llama
```

Next, install the base dependencies, then add the extra dependency we need specifically for loading the language model.

```bash
# Install the base dependencies
uv sync
# Add the llama-cpp-python dependency
uv add llama-cpp-python
```

## Choose a Llama model

Llama 3.2 is available in different sizes and configurations, each with its own trade-offs in performance, accuracy, and resource requirements. For serverless applications without GPU acceleration (such as AWS Lambda), it's important to choose a model that is lightweight and efficient enough to run within the constraints of that environment.

We'll use a quantized version of the lightweight Llama 3.2 1B model, specifically [Llama-3.2-1B-Instruct-Q4_K_M.gguf](https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/blob/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf).

<Note>
  If you're not familiar with
  [quantization](https://huggingface.co/docs/optimum/en/concept_guides/quantization),
  it's a technique that reduces a model's size and resource requirements, which
  in our case makes it suitable for serverless applications, but may impact the
  accuracy of the model.
</Note>

The [LM Studio](https://lmstudio.ai) team provides several quantized versions of Llama 3.2 1B on [Hugging Face](https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF). Consider trying different versions to find one that best fits your needs (e.g. `Q5_K_M`, which is slightly larger but higher quality).

Let's download the chosen model and save it in a `models` directory in your project.

[Download link for Llama-3.2-1B-Instruct-Q4_K_M.gguf](https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf)

```bash
mkdir models
cd models
# This model is 0.81GB, it may take a little while to download
curl -OL https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf
cd ..
```

Your folder structure should look like this:

```bash
/llama
  /models
    Llama-3.2-1B-Instruct-Q4_K_M.gguf
  /services
    api.py
  nitric.yaml
  pyproject.toml
  python.dockerfile
  python.dockerfile.ignore
  README.md
  uv.lock
```
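
Optionally, before wiring the model into an API, you can sanity-check that it loads and generates text. This is a minimal sketch, not part of the guide's required steps; it assumes `llama-cpp-python` installed successfully and the script name and prompt are purely illustrative.

```python
# sanity_check.py - optional, illustrative smoke test (not part of the original guide)
from llama_cpp import Llama

# Load the quantized model from the models directory
llm = Llama(model_path="./models/Llama-3.2-1B-Instruct-Q4_K_M.gguf")

# Run a tiny completion to confirm the model loads and generates text
result = llm("Q: What is the capital of France? A:", max_tokens=16)
print(result["choices"][0]["text"])
```

If this prints a short answer, the model and bindings are working and you can move on to building the API.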
## Create a service to run the model

Next, we'll use Nitric to create an HTTP API that allows you to send prompts to the Llama model and receive the output in a response. The API will return the raw output from the model, but you can adjust this as you see fit.

Replace the contents of `services/api.py` with the following code, which loads the Llama model and implements the prompt functionality. Take a little time to understand the code. It defines an API with a single endpoint `/prompt` that accepts a POST request with a prompt in the body. The `process_prompt` function sends the prompt to the Llama model and returns the response.

```python title:services/api.py
from nitric.resources import api
from nitric.application import Nitric
from nitric.context import HttpContext
from llama_cpp import Llama

# Load the locally stored Llama model
llm = Llama(model_path="./models/Llama-3.2-1B-Instruct-Q4_K_M.gguf")

# Function to execute a prompt using the Llama model
def process_prompt(user_prompt):
    system_prompt = "You are a helpful assistant."

    # See https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1/ for details about prompt format
    prompt = (
        # System prompt
        f'<|start_header_id|>system<|end_header_id|>{system_prompt}<|eot_id|>'
        # User prompt
        f'<|start_header_id|>user<|end_header_id|>{user_prompt}<|eot_id|>'
        # Start the assistant's turn (left open-ended, since the assistant hasn't responded yet)
        f'<|start_header_id|>assistant<|end_header_id|>'
    )

    response = llm(
        prompt=prompt,
        # Unlimited, consider setting a token limit
        max_tokens=-1,
        temperature=0.7,
    )

    return response

# Define an API for the prompt service
main = api("main")

@main.post("/prompt")
async def handle_prompt(ctx: HttpContext):
    # Assume the input is text/plain
    prompt = ctx.req.data

    try:
        ctx.res.body = process_prompt(prompt)
    except Exception as e:
        print(f"Error processing prompt: {e}")
        ctx.res.body = {"error": str(e)}
        ctx.res.status = 500

Nitric.run()
```
### Ok, let's run this thing!

Now that you have an API defined, we can test it locally. The Python starter template uses `python3.11-bookworm-slim` as its base container image, which doesn't include the dependencies needed to load the Llama model. Let's update the Dockerfile to use `python3.11-bookworm` (the non-slim version) instead.

```dockerfile title:python.dockerfile
# The python version must match the version in .python-version
# !diff -
FROM ghcr.io/astral-sh/uv:python3.11-bookworm-slim AS builder
# !diff +
FROM ghcr.io/astral-sh/uv:python3.11-bookworm AS builder
# !collapse(1:16) collapsed

ARG HANDLER
ENV HANDLER=${HANDLER}

ENV UV_COMPILE_BYTECODE=1 UV_LINK_MODE=copy PYTHONPATH=.
WORKDIR /app
RUN --mount=type=cache,target=/root/.cache/uv \
  --mount=type=bind,source=uv.lock,target=uv.lock \
  --mount=type=bind,source=pyproject.toml,target=pyproject.toml \
  uv sync --frozen --no-install-project --no-dev --no-python-downloads
COPY . /app
RUN --mount=type=cache,target=/root/.cache/uv \
  uv sync --frozen --no-dev --no-python-downloads


# Then, use a final image without uv
# !diff -
FROM python:3.11-slim-bookworm
# !diff +
FROM python:3.11-bookworm
# !collapse(1:13) collapsed

ARG HANDLER
ENV HANDLER=${HANDLER} PYTHONPATH=.

# Copy the application from the builder
COPY --from=builder --chown=app:app /app /app
WORKDIR /app

# Place executables in the environment at the front of the path
ENV PATH="/app/.venv/bin:$PATH"

# Run the service using the path to the handler
ENTRYPOINT python -u $HANDLER
```
Now we can run our services locally:

```bash
nitric run
```

<Note>
  `nitric run` will start your application in a container that includes the
  dependencies to use `llama_cpp`. If you'd rather use `nitric start`, you'll
  need to install the dependencies for llama-cpp-python yourself, such as
  [CMake](https://cmake.org/download/) and
  [LLVM](https://releases.llvm.org/download.html).
</Note>

Once it starts, you can test it with the Nitric Dashboard.

You can find the URL to the dashboard in the terminal running the Nitric CLI; by default it's http://localhost:49152. Add a prompt to the body of the request and send it to the `/prompt` endpoint.

![api dashboard](/docs/images/guides/serverless-llama/dashboard.png)
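
If you'd prefer to test from code rather than the dashboard, a small Python client works too. This is a minimal sketch: the base URL below is an assumption, so use the local API address shown in the Nitric CLI output or the dashboard.

```python
# local_test.py - illustrative only; the port below is an assumption,
# check the Nitric CLI output or dashboard for your actual local API address.
import urllib.request

base_url = "http://localhost:4001"  # assumed local API address

req = urllib.request.Request(
    f"{base_url}/prompt",
    data="Hello, how are you?".encode("utf-8"),
    headers={"Content-Type": "text/plain"},
    method="POST",
)

# Model inference can take a while on CPU, so allow a generous timeout
with urllib.request.urlopen(req, timeout=120) as resp:
    print(resp.read().decode("utf-8"))
```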
## Deploying to AWS

When you're ready to deploy the project, we can create a new Nitric [stack file](/get-started/foundations/projects?lang=python#stack-files) which will target AWS:

```bash
nitric stack new dev aws
```

Update the stack file `nitric.dev.yaml` with the appropriate AWS region and memory allocation to handle the model.

```yaml title:nitric.dev.yaml
provider: nitric/[email protected]
region: us-east-1
config:
  # How services will be deployed by default. If you have other services that don't run models,
  # you can give them their own configuration here so they don't use these settings.
  default:
    lambda:
      # Set the memory to 6GB to handle the model; this automatically sets additional CPU allocation
      memory: 6144
      # Set a timeout of 30 seconds (this is the most API Gateway will wait for a response)
      timeout: 30
      # Add more ephemeral storage to the lambda function, so it can store the model
      ephemeral-storage: 1024
```

<Note>
  Nitric defaults aim to keep you within your free-tier limits. In this example,
  we recommend increasing the memory and ephemeral storage values to allow the
  Llama model to load correctly, so running this sample project will likely incur
  more cost than a Nitric guide using the defaults. You are responsible for
  staying within the limits of the free tier and for any costs associated with
  deployment.
</Note>

Since we'll use Nitric's default Pulumi AWS Provider, make sure you're set up to deploy using that provider. You can find more information on how to set up the AWS provider in the [Nitric AWS Provider documentation](/providers/pulumi/aws).

<Note>
  If you'd like to deploy with Terraform or to another cloud provider, that's
  also possible. You can find more information about how Nitric can deploy to
  other platforms in the [Nitric Providers documentation](/providers).
</Note>

You can then deploy using the following command:

```bash
nitric up
```

<Note>
  Take note of the API endpoint URL that is output after the deployment is
  complete.
</Note>

<Note>
  If you're done with the project later, tear it down with `nitric down`.
</Note>

## Testing on AWS

To test the service, you can use any API testing tool you like, such as cURL, Postman, etc. Here's an example using cURL:

```bash
curl -X POST {your endpoint URL here}/prompt -d "Hello, how are you?"
```

### Example response

The response will include the results, plus other metadata. The output can be found in the `choices` array.

```json
{
  "id": "cmpl-61064b38-45f9-496d-86d6-fdae4bc3db97",
  "object": "text_completion",
  "created": 1729655327,
  "model": "./models/Llama-3.2-1B-Instruct-Q4_K_M.gguf",
  "choices": [
    {
      "text": "\"I'm doing well, thank you for asking. I'm here and ready to assist you, so that's a good start! How can I help you today?\"",
      "index": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 26,
    "completion_tokens": 33,
    "total_tokens": 59
  }
}
```
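
If you're calling the endpoint programmatically, you'll usually just want the generated text. Here's a minimal sketch of pulling it out of a response shaped like the one above; the `raw` variable simply stands in for the JSON body returned by `/prompt`.

```python
import json

# `raw` stands in for the JSON body returned by the /prompt endpoint
raw = """
{
  "choices": [
    { "text": "I'm doing well, thank you for asking.", "index": 0, "finish_reason": "stop" }
  ],
  "usage": { "prompt_tokens": 26, "completion_tokens": 33, "total_tokens": 59 }
}
"""

data = json.loads(raw)

# The generated text is in the first entry of the "choices" array
print(data["choices"][0]["text"])
```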
## Summary

At this point, we've demonstrated how you can use a lightweight model like Llama 3.2 1B with serverless compute, enabling the application to respond to prompts quickly without GPU acceleration, on relatively low-cost infrastructure.

As you've seen in the code example, we've set up a fairly basic prompt structure, but you can expand on this to include more complex prompts, including system prompts that help restrict or guide the model's responses, or even richer interactions with the model. Also, in this example we expose the model directly as an API, which limits the response time to 30 seconds on AWS with API Gateway.
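
As one illustration (a sketch, not code from this guide), the `process_prompt` function from earlier could be extended to accept a configurable system prompt and a token cap, which also helps keep responses inside API Gateway's 30-second window. The parameter names and defaults here are assumptions.

```python
# Illustrative variation of the guide's process_prompt function; `llm` is the
# Llama instance loaded at the top of services/api.py.
def process_prompt(user_prompt, system_prompt="You are a helpful assistant.", max_tokens=256):
    prompt = (
        f'<|start_header_id|>system<|end_header_id|>{system_prompt}<|eot_id|>'
        f'<|start_header_id|>user<|end_header_id|>{user_prompt}<|eot_id|>'
        f'<|start_header_id|>assistant<|end_header_id|>'
    )
    # Capping max_tokens bounds generation time, unlike the unlimited (-1) setting used earlier
    return llm(prompt=prompt, max_tokens=max_tokens, temperature=0.7)
```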
In future guides we'll show how you can go beyond simple one-off responses to more complex interactions, such as maintaining context between requests. We can also use WebSockets and streamed responses to provide a better user experience for larger responses.
3 binary image files added (1.75 MB, 349 KB, 343 KB) — not shown.

src/app/[[...slug]]/page.tsx

Lines changed: 1 addition & 1 deletion
```diff
@@ -70,7 +70,7 @@ export async function generateMetadata({
      card: 'summary_large_image',
    },
    alternates: {
-      canonical: url,
+      canonical: doc.canonical_url ? doc.canonical_url : url,
    },
  }
}
```

src/components/code/annotations/collapse.tsx

Lines changed: 4 additions & 1 deletion
```diff
@@ -44,7 +44,10 @@ export const collapseTrigger: AnnotationHandler = {
   name: 'CollapseTrigger',
   onlyIfAnnotated: true,
   AnnotatedLine: ({ annotation, ...props }) => (
-    <CollapsibleTrigger className="group contents">
+    <CollapsibleTrigger
+      className="group contents"
+      aria-label="Toggle show code"
+    >
       <InnerLine merge={props} data={{ icon }} />
     </CollapsibleTrigger>
   ),
```
