
Commit d8983a0

vishalg0wda and Vishal Gowda authored
Provide tooling for running benchmarks against pytorch connectors (#135)

Two CLIs are provided:

- s3torch-benchmark: used to run benchmarks
- s3torch-datagen: used to synthesize, upload, and configure datasets for use in benchmarking experiments.

Some reference benchmarking scenarios have been defined that illustrate how similar experiments can be defined. `s3torchbenchmarking/README.md` has further instructions.

---------

Co-authored-by: Vishal Gowda <[email protected]>
1 parent 8a94efc commit d8983a0

24 files changed, +1112 -0 lines changed

.github/workflows/python-checks.yml

Lines changed: 3 additions & 0 deletions
```diff
@@ -51,11 +51,14 @@ jobs:
           python -m pip install torch --extra-index-url https://download.pytorch.org/whl/cpu
           python -m pip install -e "s3torchconnectorclient[test]"
           python -m pip install -e "s3torchconnector[test]"
+          python -m pip install -e "s3torchbenchmarking[test]"

       - name: s3torchconnectorclient unit tests
         run: pytest s3torchconnectorclient/python/tst/unit --hypothesis-profile ci --hypothesis-show-statistics
       - name: s3torchconnector unit tests
         run: pytest s3torchconnector/tst/unit --ignore s3torchconnector/tst/unit/lightning --hypothesis-profile ci --hypothesis-show-statistics
+      - name: s3torchbenchmarking unit tests
+        run: pytest s3torchbenchmarking/tst --hypothesis-profile ci --hypothesis-show-statistics

   lint:
     name: Python lints
```

.gitignore

Lines changed: 1 addition & 0 deletions
```diff
@@ -28,6 +28,7 @@ venv/
 *.egg-info/
 .installed.cfg
 *.egg
+multirun/

 # Prevent publishing file with third party licenses
 THIRD-PARTY-LICENSES
```

s3torchbenchmarking/README.md

Lines changed: 187 additions & 0 deletions (new file)

# Benchmarking the S3 Connector for PyTorch

This directory contains a modular component for the experimental evaluation of the performance of the Amazon S3 Connector for PyTorch.
The goal of this component is to make it possible to run performance benchmarks for PyTorch connectors in an easy-to-reproduce and extensible fashion. This way, users can experiment with different settings and arrive at the optimal configuration for their workloads before committing to a setup.

By managing the complex configuration space with [Hydra](https://hydra.cc/), we are able to define modular configuration pieces mapped to various stages of the training pipeline. This approach allows one to mix and match configurations and measure their performance impact on the end-to-end training process. To achieve this, we split the configuration into four pieces:

**dataset**: The `dataset` configuration records where the data resides. While we support sharded objects, currently only TAR archives can be loaded.

**dataloader**: Used to configure the [PyTorch DataLoader](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html). Parameters like `batch_size` and `num_workers` are self-explanatory. `kind` specifies which PyTorch dataset to use (`s3iterabledataset`, `s3mapdataset`, `fsspec`).
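
For intuition, here is a minimal sketch of roughly how such a `dataloader` configuration translates at runtime, using the connector's public `S3IterableDataset` API and the standard PyTorch `DataLoader`. The bucket URI and region are placeholders, and the benchmark's actual wiring may differ:

```python
from torch.utils.data import DataLoader
from s3torchconnector import S3IterableDataset

# kind: s3iterabledataset -- stream samples directly from an S3 prefix.
# (S3MapDataset.from_prefix is the analogous call for s3mapdataset.)
dataset = S3IterableDataset.from_prefix(
    "s3://<S3_BUCKET>/<S3_PREFIX>/",  # placeholder URI
    region="<AWS_REGION>",            # placeholder region
)

# batch_size and num_workers map directly onto the DataLoader arguments.
loader = DataLoader(dataset, batch_size=128, num_workers=8)
```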

**training**: Specifies which model to train and for how many epochs. Currently, two models are implemented: `Entitlement` and [ViT](https://huggingface.co/docs/transformers/model_doc/vit). To make it easier to add new models, and to abstract the learning-sample processing logic away from configuration, this module defines a Model interface where each model is expected to implement `load_sample`, `train`, and `save` methods.
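
As a rough illustration of that interface (the exact signatures live in this module's source; the ones below are assumptions for the sketch):

```python
from abc import ABC, abstractmethod


class Model(ABC):
    # Illustrative only: the method names come from the text above,
    # but these signatures are assumptions, not the module's exact API.

    @abstractmethod
    def load_sample(self, sample):
        """Decode one raw sample (e.g. image bytes) into a training input."""

    @abstractmethod
    def train(self, dataloader, epochs):
        """Run the training loop for the given number of epochs."""

    @abstractmethod
    def save(self, path):
        """Persist the model's current state as a checkpoint."""
```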

**checkpoint**: Defines where (`disk`, `s3`) and how frequently checkpoints are to be saved.
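
For the `s3` destination, the connector's `S3Checkpoint` API is the natural underlying primitive. A minimal sketch of saving and restoring a checkpoint with it follows; the URI and region are placeholders, and whether the benchmark uses exactly this path is an implementation detail:

```python
import torch
from s3torchconnector import S3Checkpoint

model = torch.nn.Linear(4, 2)  # stand-in model for the sketch
checkpoint = S3Checkpoint(region="<AWS_REGION>")  # placeholder region

# Write the model state directly to S3 (placeholder URI).
with checkpoint.writer("s3://<S3_BUCKET>/checkpoints/epoch0.ckpt") as writer:
    torch.save(model.state_dict(), writer)

# Read it back later.
with checkpoint.reader("s3://<S3_BUCKET>/checkpoints/epoch0.ckpt") as reader:
    model.load_state_dict(torch.load(reader))
```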

Once the sub-configurations are defined, one can easily create an experiment configuration that uses the Hydra Sweeper to launch multiple experiments sequentially.

For example, the `dataloading` experiment stored at `./conf/dataloading.yaml` has the following content:

```
defaults:
  - dataloader: ???
  - dataset: unsharded_dataset
  - training: entitlement
  - checkpoint: none


hydra:
  mode: MULTIRUN

  sweeper:
    params:
      dataloader: s3iterabledataset, fsspec
      dataloader.num_workers: 2,4,8,16
```

This configuration pins the `dataset` and `training` model while overriding the `dataloader` to vary `kind` and `num_workers`. Running this benchmark will sequentially run 8 different scenarios, each with a different combination of the swept parameters. As `Entitlement` does not really perform any training, this experiment is helpful for seeing the upper limit of dataloader throughput without being susceptible to GPU backpressure.
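
Equivalently, since these are ordinary Hydra overrides, an ad-hoc sweep can be expressed on the command line without editing the YAML (assuming the config layout above):

    s3torch-benchmark -cd conf -m -cn dataloading 'dataloader=s3iterabledataset,fsspec' 'dataloader.num_workers=2,4,8,16'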
## Getting Started

The benchmarking code is available within the `s3torchbenchmarking` directory. First, navigate into it:

    cd s3torchbenchmarking

The tests can be run locally, or you can launch an EC2 instance with a GPU (we used a [g5.2xlarge](https://aws.amazon.com/ec2/instance-types/g5/)), choosing the [AWS Deep Learning AMI GPU PyTorch 2.0.1 (Amazon Linux 2)](https://aws.amazon.com/releasenotes/aws-deep-learning-ami-gpu-pytorch-2-0-amazon-linux-2/) as your AMI. Activate the PyTorch environment on this machine by running:

    source activate pytorch

If running locally, you can optionally configure a Python virtualenv:

    python -m venv <ENV-NAME>
    source <ENV-NAME>/bin/activate

Then, from this directory, install the dependencies:

    python -m pip install .

This makes the `s3torch-benchmark` and `s3torch-datagen` commands available to you. Note: the installation will recommend `$PATH` modifications if necessary, allowing you to use the commands directly.
### (Pre-requisite) Configure AWS Credentials

The commands provided below (`s3torch-datagen`, `s3torch-benchmark`) rely on the standard [AWS credential discovery mechanism](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html). Supplement the commands as necessary to ensure AWS credentials are made available to the process, for example by setting the `AWS_PROFILE` environment variable.
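
For example, to run a benchmark under a specific named profile (the profile name here is a placeholder):

    AWS_PROFILE=<YOUR-PROFILE> s3torch-benchmark -cd conf -m -cn dataloading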
### Configuring the dataset

_Note: This is a one-time setup for each dataset configuration. Once created locally, the dataset configuration files can be reused in subsequent benchmarks, as long as the dataset in the S3 bucket is intact._

If you already have a dataset, you only need to upload it to an S3 bucket and set up a YAML file under `./conf/dataset/` in the following format:

```yaml
# custom_dataset.yaml

prefix_uri: s3://<S3_BUCKET>/<S3_PREFIX>/
region: <AWS_REGION>
sharding: TAR|null # if the samples have been packed into TAR archives.
```
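
For the upload itself, one option is the AWS CLI (bucket, prefix, and local directory are placeholders):

    aws s3 cp ./<LOCAL-DATASET-DIR>/ s3://<S3_BUCKET>/<S3_PREFIX>/ --recursive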

This dataset can then be referenced in an experiment with an entry like `dataset: custom_dataset` (note that we're omitting the `.yaml` extension), which will result in the benchmarks running against this dataset. Some experiments have already been defined for reference - see `./conf/dataloading.yaml` or `./conf/sharding.yaml`.

_Note: Ensure the bucket is in the same region as the EC2 instance to eliminate network latency effects in your measurements._

Alternatively, you can use the `s3torch-datagen` command to procedurally generate an image dataset and upload it to Amazon S3. The script also creates a Hydra configuration file at the appropriate path.

```
$ s3torch-datagen --help
Usage: s3torch-datagen [OPTIONS]

  Synthesizes a dataset that will be used for benchmarking and uploads it to
  an S3 bucket.

Options:
  -n, --num-samples FLOAT  Number of samples to generate. Can be supplied as
                           an IEC or SI prefix. Eg: 1k, 2M. Note: these are
                           case-sensitive notations.  [default: 1k]
  --resolution TEXT        Resolution written in 'widthxheight' format
                           [default: 496x387]
  --shard-size TEXT        If supplied, the images are grouped into tar files
                           of the given size. Size can be supplied as an IEC
                           or SI prefix. Eg: 16Mib, 4Kb, 1Gib. Note: these are
                           case-sensitive notations.
  --s3-bucket TEXT         S3 Bucket name. Note: Ensure the credentials are
                           made available either through environment variables
                           or a shared credentials file.  [required]
  --s3-prefix TEXT         Optional S3 Key prefix where the dataset will be
                           uploaded. Note: a prefix will be autogenerated. eg:
                           s3://<BUCKET>/1k_256x256_16Mib_sharded/
  --region TEXT            Region where the S3 bucket is hosted.  [default:
                           us-east-1]
  --help                   Show this message and exit.
```

Here are some sample dataset configurations that we ran our benchmarks against:

- `-n 20k --resolution 496x387`
- `-n 20k --resolution 496x387 --shard-size {4, 8, 16, 32, 64}MiB`

Example:

```
$ s3torch-datagen -n 20k \
    --resolution 496x387 \
    --shard-size 4MB \
    --s3-bucket swift-benchmark-dataset \
    --region eu-west-2

Generating data: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 1243.50it/s]
Uploading to S3: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 3378.87it/s]
Dataset uploaded to: s3://swift-benchmark-dataset/20k_496x387_images_4MB_shards/
Dataset Configuration created at: ./conf/dataset/20k_496x387_images_4MB_shards.yaml
Configure your experiment by setting the entry:
    dataset: 20k_496x387_images_4MB_shards
Alternatively, you can specify it on the cmd-line when running the benchmark like so:
    s3torch-benchmark -cd conf -m -cn <CONFIG-NAME> 'dataset=20k_496x387_images_4MB_shards'
```
---

Finally, once the dataset and other configuration modules have been defined, you can kick off the benchmark by running:

    $ cd s3torchbenchmarking/

    $ s3torch-benchmark -cd conf -m -cn YOUR-TEST-CONFIGURATION

    # Example-1:
    $ s3torch-benchmark -cd conf -m -cn dataloading 'dataset.prefix_uri=<S3-PREFIX>' 'dataset.region=eu-west-2'

    # Example-2:
    $ s3torch-benchmark -cd conf -m -cn checkpointing 'dataset.prefix_uri=<S3-PREFIX>' 'dataset.region=eu-west-2'

_Note: For overriding any other benchmark parameters, see [Hydra Overrides](https://hydra.cc/docs/advanced/override_grammar/basic/). You can also run `s3torch-benchmark --hydra-help` to learn more._
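
For instance, assuming the `training` config group shown in this commit (`entitlement`, `vit`), the model can be swapped from the command line:

    s3torch-benchmark -cd conf -m -cn dataloading 'training=vit' 'dataset.prefix_uri=<S3-PREFIX>' 'dataset.region=eu-west-2'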

Experiments will report the total training time and the number of training samples, as well as host-level metrics like CPU utilisation and GPU utilisation (if available).

## Next Steps

- Use [Hydra Callbacks](https://hydra.cc/docs/experimental/callbacks/) to aggregate and plot benchmark results.
- Add more models (LLMs?) to monitor training performance.
- Support plugging in user-defined models and automatic discovery of the same.

s3torchbenchmarking/conf/checkpoint/disk.yaml

Lines changed: 4 additions & 0 deletions

```diff
@@ -0,0 +1,4 @@
+save_one_in: 25
+destination: disk
+uri: checkpoints/
+region: eu-west-2
```

s3torchbenchmarking/conf/checkpoint/none.yaml

Lines changed: 1 addition & 0 deletions

```diff
@@ -0,0 +1 @@
+save_one_in: 0
```

s3torchbenchmarking/conf/checkpoint/s3.yaml

Lines changed: 4 additions & 0 deletions

```diff
@@ -0,0 +1,4 @@
+save_one_in: 25
+destination: s3
+uri: s3://swift-benchmark-dataset/checkpoints/
+region: eu-west-2
```

s3torchbenchmarking/conf/checkpointing.yaml

Lines changed: 13 additions & 0 deletions

```diff
@@ -0,0 +1,13 @@
+defaults:
+  - _self_
+  - dataloader: s3iterabledataset
+  - dataset: unsharded_dataset
+  - training: vit
+  - checkpoint: ???
+
+
+hydra:
+  mode: MULTIRUN
+  sweeper:
+    params:
+      checkpoint: disk, s3
```

s3torchbenchmarking/conf/dataloader/fsspec.yaml

Lines changed: 3 additions & 0 deletions

```diff
@@ -0,0 +1,3 @@
+kind: fsspec
+batch_size: 128
+num_workers: 8
```

s3torchbenchmarking/conf/dataloader/s3iterabledataset.yaml

Lines changed: 3 additions & 0 deletions

```diff
@@ -0,0 +1,3 @@
+kind: s3iterabledataset
+batch_size: 128
+num_workers: 8
```

s3torchbenchmarking/conf/dataloader/s3mapdataset.yaml

Lines changed: 3 additions & 0 deletions

```diff
@@ -0,0 +1,3 @@
+kind: s3mapdataset
+batch_size: 128
+num_workers: 8
```
