This repository was archived by the owner on May 20, 2025. It is now read-only.

Conversation

@raksiv (Member) commented Oct 21, 2024

In this guide, we demonstrate how you can use a lightweight machine learning model like Llama with serverless compute. This example performs language translation using the Llama-3.2-1B-Instruct-Q4_K_M model.
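
For reference, this is roughly the shape of the call the guide builds up to with llama-cpp-python. The model path and sampling parameters match the snippets reviewed below; the prompt itself is just an illustrative example:

from llama_cpp import Llama

# Load the quantized model included with the example project
llama_model = Llama(model_path="./models/Llama-3.2-1B-Instruct-Q4_K_M.gguf")

# Ask the model to translate a short sentence
prompt = "Translate the following English text to French: 'Hello, how are you?'"
response = llama_model(
    prompt=prompt,
    max_tokens=150,
    temperature=0.7,
    top_p=0.9,
    stop=["\n"]
)

# llama-cpp-python returns an OpenAI-style completion dict
print(response["choices"][0]["text"].strip())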

@vercel bot commented Oct 21, 2024

nitric-docs preview deployment: ✅ Ready (updated Oct 23, 2024 2:07pm UTC)

- Nitric
- API
- AI & Machine Learning
languages:
Member

Rebase and add start_steps. See the Go realtime guide for an example.

Member Author

Rebased, but start steps won't work with this repository; it requires users to download Llama separately.

Member

Could the model be downloaded with curl?

Demonstrate how a lightweight Llama model can be used with serverless compute
Comment on lines 204 to 206
# set 128MB of RAM
# See lambda configuration docs here:
# https://docs.aws.amazon.com/lambda/latest/dg/configuration-function-common.html#configuration-memory-console
Member

Suggested change
- # set 128MB of RAM
- # See lambda configuration docs here:
- # https://docs.aws.amazon.com/lambda/latest/dg/configuration-function-common.html#configuration-memory-console
+ # set 6GB of RAM
+ # Lambda vCPUs are allocated in proportion to memory, and more vCPUs will improve LLM inference speed

Comment on lines 208 to 210
# set a timeout of 15 seconds
# See lambda timeout values here:
# https://docs.aws.amazon.com/lambda/latest/dg/configuration-function-common.html#configuration-timeout-console
Member

Suggested change
- # set a timeout of 15 seconds
- # See lambda timeout values here:
- # https://docs.aws.amazon.com/lambda/latest/dg/configuration-function-common.html#configuration-timeout-console

Comment on lines 216 to 219
# # set a provisioned concurrency value
# # For info on provisioned concurrency for AWS Lambda see:
# # https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html
provisioned-concurrency: 0
Member

Suggested change
- # # set a provisioned concurrency value
- # # For info on provisioned concurrency for AWS Lambda see:
- # # https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html
- provisioned-concurrency: 0

Comment on lines 212 to 215
# set the amount of ephemeral-storage: of 512MB
# For info on ephemeral-storage for AWS Lambda see:
# https://docs.aws.amazon.com/lambda/latest/dg/configuration-ephemeral-storage.html
ephemeral-storage: 1024
Member

Update the comment to explain why the extra storage is needed.
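
Putting the memory, timeout, and storage suggestions together, the lambda section of the stack file could end up looking something like this. The key names come from the diff being reviewed; the nesting and the exact values shown are assumptions and should be checked against the Nitric AWS provider docs:

config:
  default:
    lambda:
      # 6GB of RAM - Lambda vCPU count scales with memory, so more memory also speeds up inference
      memory: 6144
      # long enough for a short prompt; the review below suggests keeping prompts under 30s
      timeout: 30
      # extra ephemeral storage so the function has room to work with the model at runtime
      ephemeral-storage: 1024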

Comment on lines 91 to 97
response = llama_model(
prompt=prompt,
max_tokens=150,
temperature=0.7,
top_p=0.9,
stop=["\n"]
)
@tjholm (Member) commented Oct 22, 2024

Not sure if it's worthwhile, but exposing these as options to show off a bit more configurability could be good.

e.g.

@main.post("/translate")
async def handle_translation(ctx: HttpContext):
    # Could still leave max_tokens hardcoded to make sure prompts don't exceed 30s
    max_tokens = ctx.req.query.get("max_tokens", default_max_tokens)
    temperature = ctx.req.query.get("temperature", default_temperature)

    text = ctx.req.json["text"]

We also support using raw text in the dashboard API testing, so not all prompts need to be wrapped in JSON.
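
A rough sketch of how those query parameters could feed into the existing call, continuing the handler above (main, HttpContext, and llama_model come from the guide). The numeric casts, the cap, and the ctx.res.body assignment are assumptions; query values typically arrive as strings and, depending on the SDK version, may need unwrapping from a list:

default_max_tokens = 150
default_temperature = 0.7

@main.post("/translate")
async def handle_translation(ctx: HttpContext):
    # Keep a hard cap on max_tokens so a single prompt can't blow past the function timeout
    max_tokens = min(int(ctx.req.query.get("max_tokens", default_max_tokens)), 300)
    temperature = float(ctx.req.query.get("temperature", default_temperature))

    text = ctx.req.json["text"]
    prompt = f"Translate the following English text to French: '{text}'"

    response = llama_model(
        prompt=prompt,
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=0.9,
        stop=["\n"]
    )
    ctx.res.body = response["choices"][0]["text"].strip()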

Comment on lines 271 to 275
## Conclusion

In this guide, we demonstrated how you can use a lightweight machine learning model like Llama with serverless compute, enabling you to efficiently handle real-time translation tasks without the need for constant infrastructure management.

The combination of serverless architecture and on-demand model execution provides scalability, flexibility, and cost-efficiency, ensuring that resources are only consumed when necessary. This setup allows you to run lightweight models in a cloud-native way, ideal for dynamic applications requiring minimal operational overhead.
Member

It might be really cool to follow on from this guide with a websocket chatbot.
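
Not part of this PR, but as a very rough idea of what that follow-up could look like, reusing the llama_model already loaded in the guide. The websocket resource and handler names here are assumptions based on Nitric's realtime guide, so check the Python SDK docs before reusing any of it:

from nitric.resources import websocket
from nitric.application import Nitric

socket = websocket("chat")

@socket.on("message")
async def on_message(ctx):
    # Treat each incoming message as a prompt (assumed: ctx.req.data holds the message payload)
    data = ctx.req.data
    prompt = data.decode() if isinstance(data, bytes) else str(data)

    response = llama_model(prompt=prompt, max_tokens=150)

    # Reply on the same connection (assumed: send takes a connection id and a payload)
    await socket.send(ctx.req.connection_id, response["choices"][0]["text"].strip())

Nitric.run()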

llama_model = Llama(model_path="./models/Llama-3.2-1B-Instruct-Q4_K_M.gguf")

# Function to perform translation using the Llama model
def translate_text(text):
@tjholm (Member) commented Oct 22, 2024

I think translating text is an interesting use case, but would it also be simpler to pass the prompt through directly from the user's request and let them test any prompt? e.g. "What is the capital of France?" Especially if the goal is just to demonstrate running these models in serverless compute.
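
A sketch of what that pass-through endpoint could look like, reusing the main API and llama_model from the guide. The /prompt route name and the raw-body fallback attribute are assumptions; the JSON access follows the pattern already used in the guide:

@main.post("/prompt")
async def handle_prompt(ctx: HttpContext):
    # Accept either a JSON body like {"prompt": "..."} or raw text from the dashboard's API tester
    try:
        prompt = ctx.req.json["prompt"]
    except Exception:
        body = ctx.req.data  # assumed attribute for the raw request body
        prompt = body.decode() if isinstance(body, bytes) else str(body)

    # Pass the user's prompt straight through to the model
    response = llama_model(prompt=prompt, max_tokens=150, temperature=0.7)
    ctx.res.body = response["choices"][0]["text"].strip()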

- python
---

# Using Llama models with serverless infrastructure
Member

A title of "Building AWS LLAMBDAS" just popped into my head; not sure if it's good, as the last part is a bit hard to read :P (I know it applies to other serverless compute as well, but an opportunity for wordplay seems hard to pass up).

Comment on lines +228 to +229
# We add more storage to the lambda function, so it can store the model
ephemeral-storage: 1024
Member

Is this true? Isn't the model baked into the container already?

@raksiv closed this Oct 23, 2024

@raksiv (Member Author) commented Oct 23, 2024

The guide was retargeted; these reviews are stale and will now cause confusion.
