
Commit f01bbe2

Updates to benchmarks code (meta-llama#577)
2 parents 8374ea8 + bbc55b4 commit f01bbe2

12 files changed: +41 −28 lines
Lines changed: 5 additions & 0 deletions

@@ -0,0 +1,5 @@
+In this folder, we show various examples in a notebook for running Llama model inference on Azure's serverless API offerings. We will cover:
+* HTTP requests API usage for Llama 3 instruct models in CLI
+* HTTP requests API usage for Llama 3 instruct models in Python
+* Plugging the APIs into LangChain
+* Wiring the model with Gradio to build a simple chatbot with memory
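As a minimal sketch of the Python path above — the notebook builds its request body the same way with `str.encode(json.dumps(...))` — a non-streaming call might look like the following. The endpoint URL and auth key are placeholders, and the OpenAI-style response schema is an assumption based on the curl examples further down:

```python
import json
import urllib.request

# Placeholders - substitute the endpoint and key from your own Azure deployment.
url = "https://your-endpoint.inference.ai.azure.com/v1/chat/completions"
headers = {"Content-Type": "application/json", "Authorization": "your-auth-key"}

data = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who wrote the book Innovators dilemma?"},
    ],
    "max_tokens": 500,
}
body = str.encode(json.dumps(data))

request = urllib.request.Request(url, data=body, headers=headers)
with urllib.request.urlopen(request) as response:
    result = json.loads(response.read())
    # Assumes the OpenAI-style response schema used by the curl examples below.
    print(result["choices"][0]["message"]["content"])
```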

recipes/3p_integration/azure/azure_api_example.ipynb

Lines changed: 4 additions & 4 deletions
@@ -96,7 +96,7 @@
 "Streaming allows the generated tokens to be sent as data-only server-sent events whenever they become available. \n",
 "This is extremely important for interactive applications such as chatbots, so the user is always engaged. \n",
 "\n",
-"To use streaming, simply set `\"stream\":\"True\"` as part of the request payload. \n",
+"To use streaming, simply set `\"stream\": true` in the JSON request payload (or `\"stream\": True` in a Python dict that is serialized with `json.dumps`). \n",
 "In the streaming mode, the REST API response will be different from non-streaming mode.\n",
 "\n",
 "Here is an example: "
@@ -108,7 +108,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"!curl -X POST -L https://your-endpoint.inference.ai.azure.com/v1/chat/completions -H 'Content-Type: application/json' -H 'Authorization: your-auth-key' -d '{\"messages\":[{\"content\":\"You are a helpful assistant.\",\"role\":\"system\"},{\"content\":\"Who wrote the book Innovators dilemma?\",\"role\":\"user\"}], \"max_tokens\": 500, \"stream\": \"True\"}'"
+"!curl -X POST -L https://your-endpoint.inference.ai.azure.com/v1/chat/completions -H 'Content-Type: application/json' -H 'Authorization: your-auth-key' -d '{\"messages\":[{\"content\":\"You are a helpful assistant.\",\"role\":\"system\"},{\"content\":\"Who wrote the book Innovators dilemma?\",\"role\":\"user\"}], \"max_tokens\": 500, \"stream\": true}'"
 ]
 },
 {
@@ -170,7 +170,7 @@
 " {\"role\":\"user\", \"content\":\"Who wrote the book Innovators dilemma?\"}], \n",
 " \"max_tokens\": 500,\n",
 " \"temperature\": 0.9,\n",
-" \"stream\": \"True\",\n",
+" \"stream\": True,\n",
 "}\n",
 "\n",
 "body = str.encode(json.dumps(data))\n",
@@ -230,7 +230,7 @@
 " {\"role\":\"user\", \"content\":\"Who wrote the book Innovators dilemma?\"}],\n",
 " \"max_tokens\": 500,\n",
 " \"temperature\": 0.9,\n",
-" \"stream\": \"True\"\n",
+" \"stream\": True\n",
 "}\n",
 "\n",
 "\n",

tools/benchmarks/README.md

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+# Benchmarks
+
+* inference - a folder containing benchmark scripts that run a throughput analysis for Llama model inference on various backends, including on-prem, cloud and on-device.
+* llm_eval_harness - a folder containing a tool to evaluate fine-tuned Llama models, including quantized models, with a focus on quality.

tools/benchmarks/inference/README.md

Lines changed: 4 additions & 4 deletions
@@ -1,8 +1,8 @@
 # Inference Throughput Benchmarks
-In this folder we provide a series of benchmark scripts that apply a throughput analysis for Llama 2 models inference on various backends:
+In this folder we provide a series of benchmark scripts that run a throughput analysis for Llama model inference on various backends:
 * On-prem - Popular serving frameworks and containers (i.e. vLLM)
-* [**WIP**]Cloud API - Popular API services (i.e. Azure Model-as-a-Service)
-* [**WIP**]On-device - Popular on-device inference solutions on Android and iOS (i.e. mlc-llm, QNN)
+* Cloud API - Popular API services (e.g. Azure Model-as-a-Service or Serverless API)
+* [**WIP**] On-device - Popular on-device inference solutions on mobile and desktop (e.g. ExecuTorch, MLC-LLM, Ollama)
 * [**WIP**]Optimization - Popular optimization solutions for faster inference and quantization (i.e. AutoAWQ)
 
 # Why
@@ -16,7 +16,7 @@ Here are the parameters (if applicable) that you can configure for running the b
 * **PROMPT** - Prompt sent in for inference (configure the length of prompt, choose from 5, 25, 50, 100, 500, 1k and 2k)
 * **MAX_NEW_TOKENS** - Max number of tokens generated
 * **CONCURRENT_LEVELS** - Max number of concurrent requests
-* **MODEL_PATH** - Model source
+* **MODEL_PATH** - Model source from Hugging Face
 * **MODEL_HEADERS** - Request headers
 * **SAFE_CHECK** - Content safety check (either Azure service or simulated latency)
 * **THRESHOLD_TPS** - Threshold TPS (threshold for tokens per second below which we deem the query to be slow)
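The scripts load these parameters from a JSON file (the Azure `parameters.json` appears later in this commit). As a minimal sketch of how the `THRESHOLD_TPS` cutoff might be applied — the actual bookkeeping inside the benchmark scripts is not part of this diff:

```python
import json

# Load the benchmark configuration; the Azure scripts in this commit read
# parameters.json into a params dict the same way.
with open("parameters.json") as f:
    params = json.load(f)

THRESHOLD_TPS = params["THRESHOLD_TPS"]

def is_slow(num_output_tokens: int, latency_seconds: float) -> bool:
    # A query is deemed slow when its tokens-per-second rate falls below the threshold.
    return num_output_tokens / latency_seconds < THRESHOLD_TPS
```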

tools/benchmarks/inference/cloud/README.md

Lines changed: 7 additions & 2 deletions
@@ -13,13 +13,18 @@ To get started, there are certain steps we need to take to deploy the models:
 * Take a quick look on what is the [Azure AI Studio](https://learn.microsoft.com/en-us/azure/ai-studio/what-is-ai-studio?tabs=home) and navigate to the website from the link in the article
 * Follow the demos in the article to create a project and [resource](https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/manage-resource-groups-portal) group, or you can also follow the guide [here](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/deploy-models-llama?tabs=azure-studio)
 * Select Llama models from Model catalog
-* Deploy with "Pay-as-you-go"
+* Click the "Deploy" button
+* Select Serverless API with Azure AI Content Safety. Note that this API service is currently offered for the Llama 2 pretrained and chat models and the Llama 3 instruct models
+* Select the project you created in the previous step
+* Choose a deployment name, then go to the deployment
 
 Once deployed successfully, you should be assigned for an API endpoint and a security key for inference.
 For more information, you should consult Azure's official documentation [here](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/deploy-models-llama?tabs=azure-studio) for model deployment and inference.
 
 Now, replace the endpoint url and API key in ```azure/parameters.json```. For parameter `MODEL_ENDPOINTS`, with chat models the suffix should be `v1/chat/completions` and with pretrained models the suffix should be `v1/completions`.
-Note that the API endpoint might implemented a rate limit for token generation in certain amount of time. If you encountered the error, you can try reduce `MAX_NEW_TOKEN` or start with smaller `CONCURRENT_LEVELs`.
+Note that the API endpoint might enforce a rate limit on token generation within a certain time window. If you encounter a rate-limit error, try reducing `MAX_NEW_TOKEN` or start with smaller `CONCURRENT_LEVELS`.
+
+For `MODEL_PATH`, copy the model path from Hugging Face under the meta-llama organization. For Llama 2, make sure you copy the path of the model in hf format. This model path is used to retrieve the corresponding tokenizer for your model of choice. Llama 3 uses a different tokenizer than Llama 2.
 
 Once everything configured, to run chat model benchmark:
 ```python chat_azure_api_benchmark.py```

tools/benchmarks/inference/cloud/azure/chat_azure_api_benchmark.py

Lines changed: 5 additions & 7 deletions
@@ -10,6 +10,7 @@
 from concurrent.futures import ThreadPoolExecutor, as_completed
 from typing import Dict, Tuple, List
 
+# Add your own prompt in input.jsonl for testing.
 with open('input.jsonl') as input:
     prompt_data = json.load(input)
 
@@ -23,23 +24,20 @@
 CONCURRENT_LEVELS = params["CONCURRENT_LEVELS"]
 # Threshold for tokens per second below which we deem the query to be slow
 THRESHOLD_TPS = params["THRESHOLD_TPS"]
-# Default Llama 2 tokenizer, replace with your own tokenizer
-TOKENIZER_PATH = params["TOKENIZER_PATH"]
+MODEL_PATH = params["MODEL_PATH"]
 TEMPERATURE = params["TEMPERATURE"]
 TOP_P = params["TOP_P"]
 # Model endpoint provided with API provider
 MODEL_ENDPOINTS = params["MODEL_ENDPOINTS"]
 API_KEY = params["API_KEY"]
 SYS_PROMPT = params["SYS_PROMPT"]
 
-
-# This tokenizer is downloaded from Azure model catalog for each specific models. The main purpose is to decode the reponses for token calculation
-tokenizer = transformers.AutoTokenizer.from_pretrained(TOKENIZER_PATH)
+# This tokenizer is downloaded from Hugging Face based on MODEL_PATH. Llama 3 uses the tiktoken tokenizer, which is different from Llama 2's.
+tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_PATH)
 
 num_token_input_prompt = len(tokenizer.encode(PROMPT))
 print(f"Number of token for input prompt: {num_token_input_prompt}")
 
-
 def generate_text() -> Tuple[int, int]:
 
     #Configure payload data sending to API endpoint
@@ -49,7 +47,7 @@ def generate_text() -> Tuple[int, int]:
         "max_tokens": MAX_NEW_TOKEN,
         "temperature": TEMPERATURE,
         "top_p" : TOP_P,
-        "stream": "False"
+        "stream": False
     }
     body = str.encode(json.dumps(payload))
     url = MODEL_ENDPOINTS
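The diff above covers only configuration and payload setup; the measurement loop is not shown. A plausible sketch of the driver pattern, built from the `ThreadPoolExecutor`/`as_completed` imports and the `CONCURRENT_LEVELS` and `THRESHOLD_TPS` parameters in the script — the meaning of the returned tuple is an assumption:

```python
# Hypothetical driver loop (not part of this diff): fan generate_text() out at
# each concurrency level and count queries below the tokens-per-second threshold.
# Assumes generate_text() returns (num_output_tokens, latency_in_seconds).
for concurrency in CONCURRENT_LEVELS:
    slow_queries = 0
    with ThreadPoolExecutor(max_workers=concurrency) as executor:
        futures = [executor.submit(generate_text) for _ in range(concurrency)]
        for future in as_completed(futures):
            num_output_tokens, latency = future.result()
            if num_output_tokens / latency < THRESHOLD_TPS:
                slow_queries += 1
    print(f"Concurrency {concurrency}: {slow_queries} slow queries")
```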

tools/benchmarks/inference/cloud/azure/parameters.json

Lines changed: 3 additions & 3 deletions
@@ -2,11 +2,11 @@
     "MAX_NEW_TOKEN" : 256,
     "CONCURRENT_LEVELS" : [1, 2, 4, 8, 16, 32, 64],
     "THRESHOLD_TPS" : 7,
-    "TOKENIZER_PATH" : "../../tokenizer",
-    "RANDOM_PROMPT_LENGTH" : 1000,
+    "MODEL_PATH" : "meta-llama/your-model-path",
+    "RANDOM_PROMPT_LENGTH" : 25,
     "TEMPERATURE" : 0.6,
     "TOP_P" : 0.9,
-    "MODEL_ENDPOINTS" : "https://your-endpoint.inference.ai.azure.com/v1/completions",
+    "MODEL_ENDPOINTS" : "https://your-endpoint.inference.ai.azure.com/v1/chat/completions",
     "API_KEY" : "your-auth-key",
     "SYS_PROMPT" : "You are a helpful assistant."
 }

tools/benchmarks/inference/cloud/azure/pretrained_azure_api_benchmark.py

Lines changed: 4 additions & 4 deletions
@@ -11,7 +11,7 @@
 from concurrent.futures import ThreadPoolExecutor, as_completed
 from typing import Dict, Tuple, List
 
-# Predefined inputs
+# Predefined inputs - optional
 with open('input.jsonl') as input:
     prompt_data = json.load(input)
 
@@ -23,7 +23,7 @@
 # Threshold for tokens per second below which we deem the query to be slow
 THRESHOLD_TPS = params["THRESHOLD_TPS"]
 # Default Llama 2 tokenizer, replace with your own tokenizer
-TOKENIZER_PATH = params["TOKENIZER_PATH"]
+MODEL_PATH = params["MODEL_PATH"]
 RANDOM_PROMPT_LENGTH = params["RANDOM_PROMPT_LENGTH"]
 TEMPERATURE = params["TEMPERATURE"]
 TOP_P = params["TOP_P"]
@@ -32,8 +32,8 @@
 API_KEY = params["API_KEY"]
 
 
-# This tokenizer is downloaded from Azure model catalog for each specific models. The main purpose is to decode the reponses for token calculation
-tokenizer = transformers.AutoTokenizer.from_pretrained(TOKENIZER_PATH)
+# This tokenizer is downloaded from Hugging Face based on MODEL_PATH. Llama 3 uses the tiktoken tokenizer, which is different from Llama 2's.
+tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_PATH)
 
 # Select vocabulary that is longer than 2 tokens (closer to real words) and close to the English (not foolproof)
 vocab = [token for token in tokenizer.get_vocab().keys() if len(token) > 2 and all(ord(c) < 128 for c in token)]
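The diff ends at the vocabulary selection; a plausible sketch of how `RANDOM_PROMPT_LENGTH` tokens could then be drawn from that filtered vocabulary to form the random prompt (the script's actual sampling code is not shown here):

```python
import random

# Hypothetical continuation: sample RANDOM_PROMPT_LENGTH tokens from the filtered
# vocabulary and join them into a pseudo-random, roughly English-like prompt.
random_tokens = random.choices(vocab, k=RANDOM_PROMPT_LENGTH)
prompt = " ".join(random_tokens)
# Re-encoding gives the real token count, which may differ slightly from
# RANDOM_PROMPT_LENGTH once the tokens are joined with spaces.
print(f"Random prompt token count: {len(tokenizer.encode(prompt))}")
```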

tools/benchmarks/inference/on_prem/README.md

Lines changed: 2 additions & 0 deletions
@@ -37,3 +37,5 @@ To run pretrained model benchmark, follow the command below.
 ```
 python pretrained_vllm_benchmark.py
 ```
+
+For more vLLM benchmark details, refer to the official GitHub repo [here](https://github.com/vllm-project/vllm/tree/main/benchmarks).

tools/benchmarks/inference/on_prem/vllm/chat_vllm_benchmark.py

Lines changed: 1 addition & 2 deletions
@@ -4,7 +4,6 @@
 import csv
 import json
 import time
-import random
 import threading
 import numpy as np
 import requests
@@ -18,7 +17,7 @@
 from azure.ai.contentsafety.models import AnalyzeTextOptions
 
 from concurrent.futures import ThreadPoolExecutor, as_completed
-from typing import Dict, Tuple, List
+from typing import Tuple, List
 
 