
Commit 20010e7 (merge commit, 2 parents: 701dc05 + 9b966b1)

File tree: 2 files changed (+267, -257 lines)


recipes/benchmarks/fmbench/README.md

Lines changed: 8 additions & 257 deletions
@@ -4,7 +4,7 @@ The [`FMBench`](https://github.com/aws-samples/foundation-model-benchmarking-too

## The need for benchmarking

-Customers often wonder what is the best AWS service to run Llama models for _my specific use-case_ and _my specific price performance requirements_. While model evaluation metrics are available on several leaderboards ([`HELM`](https://crfm.stanford.edu/helm/lite/latest/#/leaderboard), [`LMSys`](https://chat.lmsys.org/?leaderboard)), but the price performance comparison can be notoriously hard to find and even more harder to trust. In such a scenario, we think it is best to be able to run performance benchmarking yourself on either on your own dataset or on a similar (task wise, prompt size wise) open-source dataset ([`LongBench`](https://huggingface.co/datasets/THUDM/LongBench)), [`QMSum`](https://paperswithcode.com/dataset/qmsum). This is the problem that [`FMBench`](https://github.com/aws-samples/foundation-model-benchmarking-tool/tree/main) solves.
+Customers often wonder which AWS service is best for running Llama models for _my specific use-case_ and _my specific price performance requirements_. While model evaluation metrics are available on several leaderboards ([`HELM`](https://crfm.stanford.edu/helm/lite/latest/#/leaderboard), [`LMSys`](https://chat.lmsys.org/?leaderboard)), the price performance comparison can be notoriously hard to find and even harder to trust. In such a scenario, we think it is best to run performance benchmarking yourself, either on your own dataset or on a similar (task-wise, prompt-size-wise) open-source dataset such as [`LongBench`](https://huggingface.co/datasets/THUDM/LongBench) or [`QMSum`](https://paperswithcode.com/dataset/qmsum). This is the problem that [`FMBench`](https://github.com/aws-samples/foundation-model-benchmarking-tool/tree/main) solves.

## [`FMBench`](https://github.com/aws-samples/foundation-model-benchmarking-tool/tree/main): an open-source Python package for FM benchmarking on AWS

@@ -42,7 +42,7 @@ The report also includes latency Vs prompt size charts for different concurrency

### How to get started with `FMBench`

-The following steps provide a Quick start guide for `FMBench`. For a more detailed DIY version, please see the [`FMBench Readme`](https://github.com/aws-samples/foundation-model-benchmarking-tool?tab=readme-ov-file#the-diy-version-with-gory-details).
+The following steps provide a [Quick start guide for `FMBench`](https://github.com/aws-samples/foundation-model-benchmarking-tool#quickstart). For a more detailed DIY version, please see the [`FMBench Readme`](https://github.com/aws-samples/foundation-model-benchmarking-tool?tab=readme-ov-file#the-diy-version-with-gory-details).

1. Launch the AWS CloudFormation template included in this repository using one of the buttons from the table below. The CloudFormation template creates the following resources within your AWS account: Amazon S3 buckets, an AWS IAM role, and an Amazon SageMaker notebook instance with this repository cloned. A read S3 bucket contains all the files (configuration files, datasets) required to run `FMBench`, and a write S3 bucket holds the metrics and reports generated by `FMBench`. The CloudFormation stack takes about 5 minutes to create; a minimal sketch of the kinds of resources it creates is shown after this step.

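For orientation only, here is a minimal CloudFormation-style sketch of the resources described in the step above. This is not the actual template shipped with the repository; the logical resource names and the notebook instance type are assumptions made for illustration.

```yaml
# Illustrative sketch only -- not the template included in this repository.
# It mirrors the resources described in step 1: read/write S3 buckets,
# an IAM role, and a SageMaker notebook instance.
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  FMBenchReadBucket:                 # hypothetical name; holds configs, datasets, tokenizers
    Type: AWS::S3::Bucket
  FMBenchWriteBucket:                # hypothetical name; receives metrics and reports
    Type: AWS::S3::Bucket
  FMBenchExecutionRole:              # role assumed by the SageMaker notebook
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: sagemaker.amazonaws.com
            Action: sts:AssumeRole
  FMBenchNotebook:                   # notebook instance with the repository cloned
    Type: AWS::SageMaker::NotebookInstance
    Properties:
      InstanceType: ml.t3.xlarge     # placeholder; the real template may differ
      RoleArn: !GetAtt FMBenchExecutionRole.Arn
```
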
@@ -85,267 +85,14 @@ The following steps provide a Quick start guide for `FMBench`. For a more detail

Each `FMBench` run works with a configuration file that contains the information about the model, the deployment steps, and the tests to run. A typical `FMBench` workflow involves either directly using one of the config files provided in the [`configs`](https://github.com/aws-samples/foundation-model-benchmarking-tool/tree/main/src/fmbench/configs) folder of the `FMBench` GitHub repo, or editing a provided config file to match your own requirements (say you want to benchmark on a different instance type, or with a different inference container).

-A simple config file with some key parameters annotated is presented below. The file below benchmarks performance of Llama2-7b on an `ml.g5.xlarge` instance and an `ml.g5.2xlarge` instance.

```yaml
general:
  name: "llama2-7b-v1"
  model_name: "Llama2-7b"

# AWS and SageMaker settings
aws:
  # AWS region, this parameter is templatized, no need to change
  region: {region}
  # SageMaker execution role used to run FMBench, this parameter is templatized, no need to change
  sagemaker_execution_role: {role_arn}
  # S3 bucket to which metrics, plots and reports would be written to
  bucket: {write_bucket} ## add the name of your desired bucket

# directory paths in the write bucket, no need to change these
dir_paths:
  data_prefix: data
  prompts_prefix: prompts
  all_prompts_file: all_prompts.csv
  metrics_dir: metrics
  models_dir: models
  metadata_dir: metadata

# S3 information for reading datasets, scripts and tokenizer
s3_read_data:
  # read bucket name, templatized, if left unchanged will default to sagemaker-fmbench-read-{region}-{account_id}
  read_bucket: {read_bucket}

  # S3 prefix in the read bucket where deployment and inference scripts should be placed
  scripts_prefix: scripts

  # deployment and inference script files to be downloaded are placed in this list
  # only needed if you are creating a new deployment script or inference script
  # your HuggingFace token does need to be in this list and should be called "hf_token.txt"
  script_files:
  - hf_token.txt

  # configuration files (like this one) are placed in this prefix
  configs_prefix: configs

  # list of configuration files to download, for now only pricing.yml needs to be downloaded
  config_files:
  - pricing.yml

  # S3 prefix for the dataset files
  source_data_prefix: source_data
  # list of dataset files, the list below is from the LongBench dataset https://huggingface.co/datasets/THUDM/LongBench
  source_data_files:
  - 2wikimqa_e.jsonl
  - 2wikimqa.jsonl
  - hotpotqa_e.jsonl
  - hotpotqa.jsonl
  - narrativeqa.jsonl
  - triviaqa_e.jsonl
  - triviaqa.jsonl

  # S3 prefix for the tokenizer to be used with the models
  # NOTE 1: the same tokenizer is used with all the models being tested through a config file
  # NOTE 2: place your model specific tokenizers in a prefix named as <model_name>_tokenizer
  # so the mistral tokenizer goes in mistral_tokenizer, Llama2 tokenizer goes in llama2_tokenizer
  tokenizer_prefix: tokenizer

  # S3 prefix for prompt templates
  prompt_template_dir: prompt_template

  # prompt template to use, NOTE: same prompt template gets used for all models being tested through a config file
  # the FMBench repo already contains a bunch of prompt templates so review those first before creating a new one
  prompt_template_file: prompt_template_llama2.txt

# steps to run, usually all of these would be
# set to yes so nothing needs to change here
# you could, however, bypass some steps for example
# set the 2_deploy_model.ipynb to no if you are re-running
# the same config file and the model is already deployed
run_steps:
  0_setup.ipynb: yes
  1_generate_data.ipynb: yes
  2_deploy_model.ipynb: yes
  3_run_inference.ipynb: yes
  4_model_metric_analysis.ipynb: yes
  5_cleanup.ipynb: yes

# dataset related configuration
datasets:
  # Refer to the 1_generate_data.ipynb notebook
  # the dataset you use is expected to have the
  # columns you put in prompt_template_keys list
  # and your prompt template also needs to have
  # the same placeholders (refer to the prompt template folder)
  prompt_template_keys:
  - input
  - context

  # if your dataset has multiple languages and it has a language
  # field then you could filter it for a language. Similarly,
  # you can filter your dataset to only keep prompts between
  # a certain token length limit (the token length is determined
  # using the tokenizer you provide in the tokenizer_prefix prefix in the
  # read S3 bucket). Each of the array entries below create a payload file
  # containing prompts matching the language and token length criteria.
  filters:
  - language: en
    min_length_in_tokens: 1
    max_length_in_tokens: 500
    payload_file: payload_en_1-500.jsonl
  - language: en
    min_length_in_tokens: 500
    max_length_in_tokens: 1000
    payload_file: payload_en_500-1000.jsonl
  - language: en
    min_length_in_tokens: 1000
    max_length_in_tokens: 2000
    payload_file: payload_en_1000-2000.jsonl
  - language: en
    min_length_in_tokens: 2000
    max_length_in_tokens: 3000
    payload_file: payload_en_2000-3000.jsonl
  - language: en
    min_length_in_tokens: 3000
    max_length_in_tokens: 3840
    payload_file: payload_en_3000-3840.jsonl

# While the tests would run on all the datasets
# configured in the experiment entries below but
# the price:performance analysis is only done for 1
# dataset which is listed below as the dataset_of_interest
metrics:
  dataset_of_interest: en_2000-3000

# all pricing information is in the pricing.yml file
# this file is provided in the repo. You can add entries
# to this file for new instance types and new Bedrock models
pricing: pricing.yml

# inference parameters, these are added to the payload
# for each inference request. The list here is not static
# any parameter supported by the inference container can be
# added to the list. Put the sagemaker parameters in the sagemaker
# section, bedrock parameters in the bedrock section (not shown here).
# Use the section name (sagemaker in this example) in the inference_spec.parameter_set
# section under experiments.
inference_parameters:
  sagemaker:
    do_sample: yes
    temperature: 0.1
    top_p: 0.92
    top_k: 120
    max_new_tokens: 100
    return_full_text: False

# Configuration for experiments to be run. The experiments section is an array
# so more than one experiments can be added, these could belong to the same model
# but different instance types, or different models, or even different hosting
# options (such as one experiment is SageMaker and the other is Bedrock).
experiments:
  - name: llama2-7b-g5.xlarge-huggingface-pytorch-tgi-inference-2.0.1-tgi1.1.0
    # model_id is interpreted in conjunction with the deployment_script, so if you
    # use a JumpStart model id then set the deployment_script to jumpstart.py.
    # if deploying directly from HuggingFace this would be a HuggingFace model id
    # see the DJL serving deployment script in the code repo for reference.
    model_id: meta-textgeneration-llama-2-7b-f
    model_version: "3.*"
    model_name: llama2-7b-f
    ep_name: llama-2-7b-g5xlarge
    instance_type: "ml.g5.xlarge"
    image_uri: '763104351884.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi1.1.0-gpu-py39-cu118-ubuntu20.04'
    deploy: yes
    instance_count: 1
    # FMBench comes packaged with multiple deployment scripts, such as scripts for JumpStart
    # scripts for deploying using DJL DeepSpeed, tensorRT etc. You can also add your own.
    # See repo for details
    deployment_script: jumpstart.py
    # FMBench comes packaged with multiple inference scripts, such as scripts for SageMaker
    # and Bedrock. You can also add your own. See repo for details
    inference_script: sagemaker_predictor.py
    inference_spec:
      # this should match one of the sections in the inference_parameters section above
      parameter_set: sagemaker
    # runs are done for each combination of payload file and concurrency level
    payload_files:
    - payload_en_1-500.jsonl
    - payload_en_500-1000.jsonl
    - payload_en_1000-2000.jsonl
    - payload_en_2000-3000.jsonl
    # concurrency level refers to number of requests sent in parallel to an endpoint
    # the next set of requests is sent once responses for all concurrent requests have
    # been received.
    concurrency_levels:
    - 1
    - 2
    - 4
    # Added for models that require accepting a EULA
    accept_eula: true
    # Environment variables to be passed to the container
    # this is not a fixed list, you can add more parameters as applicable.
    env:
      SAGEMAKER_PROGRAM: "inference.py"
      ENDPOINT_SERVER_TIMEOUT: "3600"
      MODEL_CACHE_ROOT: "/opt/ml/model"
      SAGEMAKER_ENV: "1"
      HF_MODEL_ID: "/opt/ml/model"
      MAX_INPUT_LENGTH: "4095"
      MAX_TOTAL_TOKENS: "4096"
      SM_NUM_GPUS: "1"
      SAGEMAKER_MODEL_SERVER_WORKERS: "1"

  - name: llama2-7b-g5.2xlarge-huggingface-pytorch-tgi-inference-2.0.1-tgi1.1.0
    model_id: meta-textgeneration-llama-2-7b-f
    model_version: "3.*"
    model_name: llama2-7b-f
    ep_name: llama-2-7b-g5-2xlarge
    instance_type: "ml.g5.2xlarge"
    image_uri: '763104351884.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi1.1.0-gpu-py39-cu118-ubuntu20.04'
    deploy: yes
    instance_count: 1
    deployment_script: jumpstart.py
    inference_script: sagemaker_predictor.py
    inference_spec:
      parameter_set: sagemaker
    payload_files:
    - payload_en_1-500.jsonl
    - payload_en_500-1000.jsonl
    - payload_en_1000-2000.jsonl
    - payload_en_2000-3000.jsonl

    concurrency_levels:
    - 1
    - 2
    - 4

    accept_eula: true
    env:
      SAGEMAKER_PROGRAM: "inference.py"
      ENDPOINT_SERVER_TIMEOUT: "3600"
      MODEL_CACHE_ROOT: "/opt/ml/model"
      SAGEMAKER_ENV: "1"
      HF_MODEL_ID: "/opt/ml/model"
      MAX_INPUT_LENGTH: "4095"
      MAX_TOTAL_TOKENS: "4096"
      SM_NUM_GPUS: "1"
      SAGEMAKER_MODEL_SERVER_WORKERS: "1"

report:
  latency_budget: 2
  cost_per_10k_txn_budget: 20
  error_rate_budget: 0
  per_inference_request_file: per_inference_request_results.csv
  all_metrics_file: all_metrics.csv
  txn_count_for_showing_cost: 10000
  v_shift_w_single_instance: 0.025
  v_shift_w_gt_one_instance: 0.025
```

+A simple config file with key parameters annotated is included in this repo; see [`config.yml`](./config.yml). This file benchmarks the performance of Llama2-7b on an `ml.g5.xlarge` instance and an `ml.g5.2xlarge` instance.
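
As noted above, a common reason to edit a provided config is to benchmark on a different instance type or with a different inference container. The sketch below shows the handful of `experiments` fields that typically change, using field names from the sample config shown earlier; the `ml.g5.12xlarge` value and the image tag are placeholders, not recommendations.

```yaml
# Illustrative sketch: fields in an experiments entry you would typically edit
# when adapting a provided FMBench config to a different instance type or
# inference container. Placeholder values shown; everything else in the
# config can usually stay as-is.
experiments:
  - name: llama2-7b-g5.12xlarge-tgi          # rename to describe the new experiment
    model_id: meta-textgeneration-llama-2-7b-f
    model_version: "3.*"
    model_name: llama2-7b-f
    ep_name: llama-2-7b-g5-12xlarge          # endpoint name to create
    instance_type: "ml.g5.12xlarge"          # placeholder: the instance type to benchmark
    # placeholder tag: point this at the inference container you want to test
    image_uri: '763104351884.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-tgi-inference:<tag>'
    deploy: yes
    instance_count: 1
```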

## 🚨 Benchmarking Llama3 on Amazon SageMaker 🚨

Llama3 is now available on SageMaker (read the [blog post](https://aws.amazon.com/blogs/machine-learning/meta-llama-3-models-are-now-available-in-amazon-sagemaker-jumpstart/)), and you can now benchmark it using `FMBench`. Here are the config files for benchmarking `Llama3-8b-instruct` and `Llama3-70b-instruct` on `ml.p4d.24xlarge` and `ml.g5` instances.

- [Config file](https://github.com/aws-samples/foundation-model-benchmarking-tool/blob/main/src/fmbench/configs/config-llama3-8b-instruct-g5-p4d.yml) for `Llama3-8b-instruct` on `ml.p4d.24xlarge` and `ml.g5.12xlarge`
-- [Config file](https://github.com/aws-samples/foundation-model-benchmarking-tool/blob/main/src/fmbench/configs/config-llama3-70b-instruct-g5-p4d.yml) for `Llama3-70b-instruct` on `ml.p4d.24xlarge` and `ml.g5.12xlarge`
+- [Config file](https://github.com/aws-samples/foundation-model-benchmarking-tool/blob/main/src/fmbench/configs/config-llama3-70b-instruct-g5-p4d.yml) for `Llama3-70b-instruct` on `ml.p4d.24xlarge` and `ml.g5.48xlarge`

## Benchmarking Llama2 on Amazon SageMaker
@@ -364,3 +111,7 @@ The Llama2-13b-chat and Llama2-70b-chat models are available on [Bedrock](https:

- [Config file](https://github.com/aws-samples/foundation-model-benchmarking-tool/blob/main/src/fmbench/configs/config-bedrock.yml) for `Llama2-13b-chat` and `Llama2-70b-chat` on Bedrock for on-demand throughput.

- For testing provisioned throughput, simply replace the `ep_name` parameter in the `experiments` section of the config file with the ARN of your provisioned throughput; a sketch follows this list.
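
As a rough illustration (not the actual contents of `config-bedrock.yml`), the relevant part of an `experiments` entry for a provisioned-throughput test might look like the following; every value shown is an assumption or placeholder.

```yaml
# Illustrative sketch only: for provisioned throughput, the ep_name field in
# the experiments entry carries your provisioned throughput ARN instead of a
# model id. Field values here are assumptions, not copied from the repo.
experiments:
  - name: llama2-13b-chat-bedrock-provisioned   # hypothetical experiment name
    model_name: llama2-13b-chat
    # placeholder ARN: replace with the ARN of your own provisioned throughput
    ep_name: arn:aws:bedrock:us-east-1:111122223333:provisioned-model/abcd1234
    inference_spec:
      parameter_set: bedrock    # assumed Bedrock section in inference_parameters
```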

+## More
+
+For bug reports, enhancement requests, and any questions, please create a [GitHub issue](https://github.com/aws-samples/foundation-model-benchmarking-tool/issues) on the `FMBench` repo.
