**recipes/benchmarks/fmbench/README.md** (8 additions & 257 deletions)
```diff
@@ -4,7 +4,7 @@ The [`FMBench`](https://github.com/aws-samples/foundation-model-benchmarking-too
 
 ## The need for benchmarking
 
-Customers often wonder what is the best AWS service to run Llama models for _my specific use-case_ and _my specific price performance requirements_. While model evaluation metrics are available on several leaderboards ([`HELM`](https://crfm.stanford.edu/helm/lite/latest/#/leaderboard), [`LMSys`](https://chat.lmsys.org/?leaderboard)), but the price performance comparison can be notoriously hard to find and even more harder to trust. In such a scenario, we think it is best to be able to run performance benchmarking yourself on either on your own dataset or on a similar (task wise, prompt size wise) open-source dataset ([`LongBench`](https://huggingface.co/datasets/THUDM/LongBench)), [`QMSum`](https://paperswithcode.com/dataset/qmsum). This is the problem that [`FMBench`](https://github.com/aws-samples/foundation-model-benchmarking-tool/tree/main) solves.
+Customers often wonder what the best AWS service is to run Llama models for _my specific use-case_ and _my specific price performance requirements_. While model evaluation metrics are available on several leaderboards ([`HELM`](https://crfm.stanford.edu/helm/lite/latest/#/leaderboard), [`LMSys`](https://chat.lmsys.org/?leaderboard)), price performance comparisons can be notoriously hard to find and even harder to trust. In such a scenario, we think it is best to run performance benchmarking yourself, either on your own dataset or on similar (task-wise, prompt-size-wise) open-source datasets such as [`LongBench`](https://huggingface.co/datasets/THUDM/LongBench) and [`QMSum`](https://paperswithcode.com/dataset/qmsum). This is the problem that [`FMBench`](https://github.com/aws-samples/foundation-model-benchmarking-tool/tree/main) solves.
 
 ## [`FMBench`](https://github.com/aws-samples/foundation-model-benchmarking-tool/tree/main): an open-source Python package for FM benchmarking on AWS
 
```
```diff
@@ -42,7 +42,7 @@ The report also includes latency Vs prompt size charts for different concurrency
 
 ### How to get started with `FMBench`
 
-The following steps provide a Quick start guide for `FMBench`. For a more detailed DIY version, please see the [`FMBench Readme`](https://github.com/aws-samples/foundation-model-benchmarking-tool?tab=readme-ov-file#the-diy-version-with-gory-details).
+The following steps provide a [Quick start guide for `FMBench`](https://github.com/aws-samples/foundation-model-benchmarking-tool#quickstart). For a more detailed DIY version, please see the [`FMBench Readme`](https://github.com/aws-samples/foundation-model-benchmarking-tool?tab=readme-ov-file#the-diy-version-with-gory-details).
 
 1. Launch the AWS CloudFormation template included in this repository using one of the buttons from the table below. The CloudFormation template creates the following resources within your AWS account: Amazon S3 buckets, an Amazon IAM role and an Amazon SageMaker Notebook with this repository cloned. A read S3 bucket is created which contains all the files (configuration files, datasets) required to run `FMBench` and a write S3 bucket is created which will hold the metrics and reports generated by `FMBench`. The CloudFormation stack takes about 5 minutes to create.
 
```
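Step 1 above launches the stack from a console button. If you prefer to script it, the same kind of stack can be created programmatically; the sketch below is a minimal `boto3` example under that assumption, and the stack name and template URL are placeholders rather than values taken from this repository.

```python
# Minimal sketch (not part of FMBench): create the CloudFormation stack programmatically.
# STACK_NAME and TEMPLATE_URL are placeholders -- use the template behind the launch
# button for your region from the table in this README.
import boto3

STACK_NAME = "fmbench"
TEMPLATE_URL = "https://<your-bucket>.s3.amazonaws.com/<path>/fmbench-template.yaml"

cfn = boto3.client("cloudformation", region_name="us-east-1")

cfn.create_stack(
    StackName=STACK_NAME,
    TemplateURL=TEMPLATE_URL,
    # the stack creates an IAM role for the SageMaker notebook, so IAM capabilities are required
    Capabilities=["CAPABILITY_NAMED_IAM"],
)

# block until the S3 buckets, IAM role and SageMaker notebook are ready (about 5 minutes)
cfn.get_waiter("stack_create_complete").wait(StackName=STACK_NAME)
print(f"stack {STACK_NAME} created")
```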
````diff
@@ -85,267 +85,14 @@ The following steps provide a Quick start guide for `FMBench`. For a more detail
 
 Each `FMBench` run works with a configuration file that contains the information about the model, the deployment steps, and the tests to run. A typical `FMBench` workflow involves either directly using an already provided config file from the [`configs`](https://github.com/aws-samples/foundation-model-benchmarking-tool/tree/main/src/fmbench/configs) folder in the `FMBench` GitHub repo or editing an already provided config file as per your own requirements (say you want to try benchmarking on a different instance type, or a different inference container etc.).
 
-A simple config file with some key parameters annotated is presented below. The file below benchmarks performance of Llama2-7b on an `ml.g5.xlarge` instance and an `ml.g5.2xlarge` instance.
-
-```{markdown}
-general:
-  name: "llama2-7b-v1"
-  model_name: "Llama2-7b"
-
-# AWS and SageMaker settings
-aws:
-  # AWS region, this parameter is templatized, no need to change
-  region: {region}
-  # SageMaker execution role used to run FMBench, this parameter is templatized, no need to change
-  sagemaker_execution_role: {role_arn}
-  # S3 bucket to which metrics, plots and reports would be written to
-  bucket: {write_bucket} ## add the name of your desired bucket
-
-# directory paths in the write bucket, no need to change these
-dir_paths:
-  data_prefix: data
-  prompts_prefix: prompts
-  all_prompts_file: all_prompts.csv
-  metrics_dir: metrics
-  models_dir: models
-  metadata_dir: metadata
-
-# S3 information for reading datasets, scripts and tokenizer
-s3_read_data:
-  # read bucket name, templatized, if left unchanged will default to sagemaker-fmbench-read-{region}-{account_id}
-  read_bucket: {read_bucket}
-
-  # S3 prefix in the read bucket where deployment and inference scripts should be placed
-  scripts_prefix: scripts
-
-  # deployment and inference script files to be downloaded are placed in this list
-  # only needed if you are creating a new deployment script or inference script
-  # your HuggingFace token does need to be in this list and should be called "hf_token.txt"
-  script_files:
-  - hf_token.txt
-
-  # configuration files (like this one) are placed in this prefix
-  configs_prefix: configs
-
-  # list of configuration files to download, for now only pricing.yml needs to be downloaded
-  config_files:
-  - pricing.yml
-
-  # S3 prefix for the dataset files
-  source_data_prefix: source_data
-  # list of dataset files, the list below is from the LongBench dataset https://huggingface.co/datasets/THUDM/LongBench
-  source_data_files:
-  - 2wikimqa_e.jsonl
-  - 2wikimqa.jsonl
-  - hotpotqa_e.jsonl
-  - hotpotqa.jsonl
-  - narrativeqa.jsonl
-  - triviaqa_e.jsonl
-  - triviaqa.jsonl
-
-  # S3 prefix for the tokenizer to be used with the models
-  # NOTE 1: the same tokenizer is used with all the models being tested through a config file
-  # NOTE 2: place your model specific tokenizers in a prefix named as <model_name>_tokenizer
-  # so the mistral tokenizer goes in mistral_tokenizer, Llama2 tokenizer goes in llama2_tokenizer
-  tokenizer_prefix: tokenizer
-
-  # S3 prefix for prompt templates
-  prompt_template_dir: prompt_template
-
-  # prompt template to use, NOTE: same prompt template gets used for all models being tested through a config file
-  # the FMBench repo already contains a bunch of prompt templates so review those first before creating a new one
-  prompt_template_file: prompt_template_llama2.txt
-
-# steps to run, usually all of these would be
-# set to yes so nothing needs to change here
-# you could, however, bypass some steps for example
-# set the 2_deploy_model.ipynb to no if you are re-running
-# the same config file and the model is already deployed
-run_steps:
-  0_setup.ipynb: yes
-  1_generate_data.ipynb: yes
-  2_deploy_model.ipynb: yes
-  3_run_inference.ipynb: yes
-  4_model_metric_analysis.ipynb: yes
-  5_cleanup.ipynb: yes
-
-# dataset related configuration
-datasets:
-  # Refer to the 1_generate_data.ipynb notebook
-  # the dataset you use is expected to have the
-  # columns you put in prompt_template_keys list
-  # and your prompt template also needs to have
-  # the same placeholders (refer to the prompt template folder)
-  prompt_template_keys:
-  - input
-  - context
-
-  # if your dataset has multiple languages and it has a language
-  # field then you could filter it for a language. Similarly,
-  # you can filter your dataset to only keep prompts between
-  # a certain token length limit (the token length is determined
-  # using the tokenizer you provide in the tokenizer_prefix prefix in the
-  # read S3 bucket). Each of the array entries below create a payload file
-  # containing prompts matching the language and token length criteria.
-  filters:
-  - language: en
-    min_length_in_tokens: 1
-    max_length_in_tokens: 500
-    payload_file: payload_en_1-500.jsonl
-  - language: en
-    min_length_in_tokens: 500
-    max_length_in_tokens: 1000
-    payload_file: payload_en_500-1000.jsonl
-  - language: en
-    min_length_in_tokens: 1000
-    max_length_in_tokens: 2000
-    payload_file: payload_en_1000-2000.jsonl
-  - language: en
-    min_length_in_tokens: 2000
-    max_length_in_tokens: 3000
-    payload_file: payload_en_2000-3000.jsonl
-  - language: en
-    min_length_in_tokens: 3000
-    max_length_in_tokens: 3840
-    payload_file: payload_en_3000-3840.jsonl
-
-# While the tests would run on all the datasets
-# configured in the experiment entries below but
-# the price:performance analysis is only done for 1
-# dataset which is listed below as the dataset_of_interest
-metrics:
-  dataset_of_interest: en_2000-3000
-
-# all pricing information is in the pricing.yml file
-# this file is provided in the repo. You can add entries
-# to this file for new instance types and new Bedrock models
-pricing: pricing.yml
-
-# inference parameters, these are added to the payload
-# for each inference request. The list here is not static
-# any parameter supported by the inference container can be
-# added to the list. Put the sagemaker parameters in the sagemaker
-# section, bedrock parameters in the bedrock section (not shown here).
-# Use the section name (sagemaker in this example) in the inference_spec.parameter_set
-# section under experiments.
-inference_parameters:
-  sagemaker:
-    do_sample: yes
-    temperature: 0.1
-    top_p: 0.92
-    top_k: 120
-    max_new_tokens: 100
-    return_full_text: False
-
-# Configuration for experiments to be run. The experiments section is an array
-# so more than one experiments can be added, these could belong to the same model
-# but different instance types, or different models, or even different hosting
-# options (such as one experiment is SageMaker and the other is Bedrock).
+A simple config file with key parameters annotated is included in this repo, see [`config.yml`](./config.yml). This file benchmarks performance of Llama2-7b on an `ml.g5.xlarge` instance and an `ml.g5.2xlarge` instance.
 
 ## 🚨 Benchmarking Llama3 on Amazon SageMaker 🚨
 
 Llama3 is now available on SageMaker (read [blog post](https://aws.amazon.com/blogs/machine-learning/meta-llama-3-models-are-now-available-in-amazon-sagemaker-jumpstart/)), and you can now benchmark it using `FMBench`. Here are the config files for benchmarking `Llama3-8b-instruct` and `Llama3-70b-instruct` on `ml.p4d.24xlarge` and `ml.g5.12xlarge` instances.
 
 - [Config file](https://github.com/aws-samples/foundation-model-benchmarking-tool/blob/main/src/fmbench/configs/config-llama3-8b-instruct-g5-p4d.yml) for `Llama3-8b-instruct` on `ml.p4d.24xlarge` and `ml.g5.12xlarge`
--[Config file](https://github.com/aws-samples/foundation-model-benchmarking-tool/blob/main/src/fmbench/configs/config-llama3-70b-instruct-g5-p4d.yml) for `Llama3-70b-instruct` on `ml.p4d.24xlarge` and `ml.g5.12xlarge`
+- [Config file](https://github.com/aws-samples/foundation-model-benchmarking-tool/blob/main/src/fmbench/configs/config-llama3-70b-instruct-g5-p4d.yml) for `Llama3-70b-instruct` on `ml.p4d.24xlarge` and `ml.g5.48xlarge`
 
 ## Benchmarking Llama2 on Amazon SageMaker
 
````
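The `inference_parameters.sagemaker` section in the sample config shown in the diff above is merged into the payload of every inference request. The sketch below illustrates what one such request could look like with `boto3`; it is not FMBench's internal code, the endpoint name is hypothetical, and the `inputs`/`parameters` payload shape assumes a TGI/DJL-style Llama 2 container.

```python
# Illustrative only: a single SageMaker inference request carrying the parameters
# from the inference_parameters.sagemaker section of the sample config.
import json

import boto3

smr = boto3.client("sagemaker-runtime")

payload = {
    "inputs": "Summarize the following meeting transcript: ...",
    "parameters": {  # mirrors inference_parameters.sagemaker in the config
        "do_sample": True,
        "temperature": 0.1,
        "top_p": 0.92,
        "top_k": 120,
        "max_new_tokens": 100,
        "return_full_text": False,
    },
}

# "llama2-7b-g5-xlarge" is a hypothetical endpoint name; FMBench deploys and names
# endpoints based on the experiments section of the config file.
response = smr.invoke_endpoint(
    EndpointName="llama2-7b-g5-xlarge",
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read()))
```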
```diff
@@ -364,3 +111,7 @@ The Llama2-13b-chat and Llama2-70b-chat models are available on [Bedrock](https:
 - [Config file](https://github.com/aws-samples/foundation-model-benchmarking-tool/blob/main/src/fmbench/configs/config-bedrock.yml) for `Llama2-13b-chat` and `Llama2-70b-chat` on Bedrock for on-demand throughput.
 
 - For testing provisioned throughput simply replace the `ep_name` parameter in the `experiments` section of the config file with the ARN of your provisioned throughput.
+
+## More..
+
+For bug reports, enhancement requests and any questions please create a [GitHub issue](https://github.com/aws-samples/foundation-model-benchmarking-tool/issues) on the `FMBench` repo.
```
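As a closing illustration of the `datasets.filters` entries in the sample config above (each entry keeps only prompts in a given language and token-length range and writes them to a `payload_file`), here is a minimal, self-contained sketch of that kind of filtering. It is not FMBench's implementation: the record fields and the whitespace token count are stand-ins for the real dataset schema and the tokenizer configured under `tokenizer_prefix`.

```python
# Minimal sketch of the datasets.filters idea: keep prompts that match a language
# and token-length range, and write them to a payload file. Field names and the
# whitespace "tokenizer" are stand-ins, not FMBench's actual implementation.
import json


def token_count(text: str) -> int:
    # stand-in for the tokenizer provided under tokenizer_prefix in the read bucket
    return len(text.split())


def apply_filter(records, language, min_tokens, max_tokens, payload_file):
    with open(payload_file, "w") as f:
        for rec in records:
            n = token_count(rec["prompt"])
            if rec.get("language") == language and min_tokens <= n < max_tokens:
                f.write(json.dumps(rec) + "\n")


records = [
    {"language": "en", "prompt": "Summarize the following meeting transcript: ..."},
    {"language": "en", "prompt": "word " * 2500},  # falls in the 2000-3000 token bucket
]
apply_filter(records, "en", 2000, 3000, "payload_en_2000-3000.jsonl")
```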