
Commit 066134b

refactor(benchmarks): rework "dataset" scenarios + other improvements (#285)
Move "dataset" scenario in their own directory, just like "dcp" and "lightning_checkpointing". Create dedicated S3 and DynamoDB Hydra config files. Delete last unused Hydra config files. Make "dataset" scenario compatible with existing results schema + update that schema. Save job results automatically through Hydra callbacks. Simplify further benchmark scripts.
1 parent 18c7c12 commit 066134b

34 files changed (+307 / -500 lines)

s3torchbenchmarking/README.md

Lines changed: 43 additions & 60 deletions
@@ -19,79 +19,55 @@ There are **three scenarios** available:
 - **PyTorch’s Distributed Checkpointing (DCP) benchmarks**: measure our connector against PyTorch default distributed
   checkpointing mechanism — learn more in [this dedicated README](src/s3torchbenchmarking/dcp/README.md).

-For example, the `dataloading` experiment stored at `./conf/dataloading.yaml` has the following
-content:
-
-```
-defaults:
-  - dataloader: ???
-  - dataset: unsharded_dataset
-  - training: entitlement
-  - checkpoint: none
-
-hydra:
-  mode: MULTIRUN
-  sweeper:
-    params:
-      dataloader: s3iterabledataset, fsspec
-      dataloader.num_workers: 2,4,8,16
-```
-
-This configuration pins the `dataset` and `training` model while overriding the `dataloader` to change `kind`
-and `num_workers`. Running this benchmark will result in sequentially running 8 different scenarios,
-each with the different combinations of swept parameters. As `Entitlement` is not really performing any training, this
-experiment is helpful to see upper-limit of dataloader throughput without being susceptible to GPU backpressure.
-
 ## Getting Started

-The benchmarking code is available within the `src/s3torchbenchmarking` module. First, from here, navigate into the
-directory:
+The benchmarking code is available within the `src/s3torchbenchmarking` module.

-    cd src/s3torchbenchmarking
+The tests can be run locally, or you can launch an EC2 instance with a GPU (we used a [g5.2xlarge][g5.2xlarge]),
+choosing the [AWS Deep Learning AMI GPU PyTorch 2.5 (Ubuntu 22.04)][dl-ami] as your AMI.

-The tests can be run locally, or you can launch an EC2 instance with a GPU (we used a [g5.2xlarge](https://aws.amazon.com/ec2/instance-types/g5/)), choosing
-the [AWS Deep Learning AMI GPU PyTorch 2.0.1 (Amazon Linux 2)](https://aws.amazon.com/releasenotes/aws-deep-learning-ami-gpu-pytorch-2-0-amazon-linux-2/) as your AMI. Activate the venv within this machine
-by running:
+First, activate the Conda env within this machine by running:

-    source activate pytorch
+```shell
+source activate pytorch
+```

 If running locally you can optionally configure a Python venv:

-    python -m venv <ENV-NAME>
-    source <PATH-TO-VENV>/bin/activate
-
-
-Then from this directory, install the dependencies:
-
-    python -m pip install .
-
-This would make some commands available to you, which you can find under the [pyproject.toml](pyproject.toml) file.
-Note: the installation would recommend `$PATH` modifications if necessary, allowing you to use the commands directly.
-
-**(Optional) Install Mountpoint**
-
-Required only if you're running benchmarks using PyTorch
-with [Mountpoint for Amazon S3](https://github.com/awslabs/mountpoint-s3).
+```shell
+python -m venv <ENV-NAME>
+source <PATH-TO-VENV>/bin/activate
+```

-    wget https://s3.amazonaws.com/mountpoint-s3-release/latest/x86_64/mount-s3.rpm
-    sudo yum install ./mount-s3.rpm # For an RHEL system
+Then, `cd` to the `s3torchbenchmarking` directory, and run the `utils/prepare_ec2_instance.sh` script: the latter will
+take care of updating the instance's packages (through either `yum` or `apt`), install Mountpoint for Amazon S3, and
+install the required Python packages.

-For other distros see [Installing Mountpoint for Amazon S3](https://github.com/awslabs/mountpoint-s3/blob/main/doc/INSTALL.md).
+> [!NOTE]
+> Some errors may arise while trying to run the benchmarks; below are some workarounds to execute in such cases.

-_Note: Mountpoint benchmarks are currently only supported on *nix-based systems and rely on `sudo` capabilities._
+- Error `RuntimeError: operator torchvision::nms does not exist` while trying the run the benchmarks:
+  ```shell
+  conda install -y pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia
+  ```
+- Error `TypeError: canonicalize_version() got an unexpected keyword argument 'strip_trailing_zero'` while trying to
+  install `s3torchbenchmarking` package:
+  ```shell
+  pip install "setuptools<71"
+  ```

 ### (Pre-requisite) Configure AWS Credentials

-The commands provided below (`datagen.py`, `benchmark.py`) rely on the standard [AWS credential discovery mechanism](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html).
-Supplement the command as necessary to ensure the AWS credentials are made available to the process. For eg: by setting
-the `AWS_PROFILE` environment variable.
+The commands provided below (`datagen.py`, `benchmark.py`) rely on the
+standard [AWS credential discovery mechanism][credentials]. Supplement the command as necessary to ensure the AWS
+credentials are made available to the process, e.g., by setting the `AWS_PROFILE` environment variable.

 ### Configuring the dataset

 _Note: This is a one-time setup for each dataset configuration. The dataset configuration files, once created locally
 and can be used in subsequent benchmarks, as long as the dataset on the S3 bucket is intact._

-If you already have a dataset, you only need upload it to an S3 bucket and setup a YAML file under
+If you already have a dataset, you only need upload it to an S3 bucket and set up a YAML file under
 `./conf/dataset/` in the following format:

 ```yaml
@@ -171,7 +147,7 @@ Finally, once the dataset and other configuration modules have been defined, you

 ```shell
 # For data loading benchmarks:
-$ . utils/prepare_and_run_benchmark.sh s3iterabledataset "./dataset" my-bucket eu-west-1 my-bucket-results "" my-prefix
+$ . utils/run_dataset_benchmarks.sh

 # For PyTorch Lightning Checkpointing benchmarks:
 $ . utils/run_lighning_benchmarks.sh
@@ -180,15 +156,22 @@ $ . utils/run_lighning_benchmarks.sh
 $ . utils/run_dcp_benchmarks.sh
 ```

-_Note: For overriding any other benchmark parameters,
-see [Hydra Overrides](https://hydra.cc/docs/advanced/override_grammar/basic/). You can also run `s3torch-benchmark --hydra-help` to learn more._
+_Note: For overriding any other benchmark parameters, see [Hydra Overrides][hydra-overrides]. You can also run
+`s3torch-benchmark --hydra-help` to learn more._

-Experiments will report total training time, number of training samples as well as host-level metrics like CPU
-Utilisation, GPU Utilisation (if available) etc. The results for individual jobs will be written out to dedicated
-`result.json` files within their corresponding [output dirs](https://hydra.cc/docs/configure_hydra/intro/#hydraruntime).
-When using MULTIRUN mode, a `collated_results.json` will be written out to the [common sweep dir](https://hydra.cc/docs/configure_hydra/intro/#hydrasweep).
+Experiments will report various metrics, like throughput, processed time, etc. The results for individual jobs and runs
+(one run will contain 1 to N jobs) will be written out to dedicated files, respectively `job_results.json` and
+`run_results.json`, within their corresponding output directory (see the YAML config files).

 ## Next Steps

 - Add more models (LLMs?) to monitor training performance.
 - Support plugging in user-defined models and automatic discovery of the same.
+
+[g5.2xlarge]: https://aws.amazon.com/ec2/instance-types/g5/
+
+[dl-ami]: https://docs.aws.amazon.com/dlami/latest/devguide/appendix-ami-release-notes.html
+
+[credentials]: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html
+
+[hydra-overrides]: https://hydra.cc/docs/advanced/override_grammar/basic/
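
To make the credential and launch steps described in the README concrete, here is a minimal sketch; the profile name is a placeholder, and any other standard AWS credential source (environment variables, an instance role, etc.) works equally well:

```shell
# Run from the s3torchbenchmarking directory, after the setup steps above.
export AWS_PROFILE=my-benchmark-profile   # placeholder; any credential source works
. utils/run_dataset_benchmarks.sh         # data loading benchmarks
```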

s3torchbenchmarking/conf/aws/dynamodb.yaml

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+# @package _global_
+# DynamoDB config; used to save run results
+dynamodb:
+  region: ???
+  table: ???

s3torchbenchmarking/conf/aws/s3.yaml

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+# @package _global_
+# S3 config; used for checkpoint storage
+s3:
+  region: ???
+  uri: ???
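
Both new config files declare their values as mandatory (`???`), so a run will fail with a missing mandatory value error unless they are supplied, for instance by editing the files or by passing command-line overrides. A minimal sketch, assuming the `s3torch-benchmark` entry point mentioned in the README and using placeholder bucket, table, and region names:

```shell
# Because both files use "# @package _global_", their keys live at the top level.
s3torch-benchmark \
  s3.region=eu-west-1 s3.uri=s3://my-bucket/checkpoints/ \
  dynamodb.region=eu-west-1 dynamodb.table=my-benchmark-results
```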

s3torchbenchmarking/conf/dataloading.yaml

Lines changed: 0 additions & 15 deletions
This file was deleted.

s3torchbenchmarking/conf/dataloading_sharded_ent.yaml

Lines changed: 0 additions & 14 deletions
This file was deleted.

s3torchbenchmarking/conf/dataloading_sharded_vit.yaml

Lines changed: 0 additions & 14 deletions
This file was deleted.

s3torchbenchmarking/conf/dataloading_unsharded_1epochs.yaml

Lines changed: 0 additions & 15 deletions
This file was deleted.

s3torchbenchmarking/conf/dataloading_unsharded_ent_10epochs.yaml

Lines changed: 0 additions & 15 deletions
This file was deleted.

s3torchbenchmarking/conf/dataloading_unsharded_vit_10epochs.yaml

Lines changed: 0 additions & 15 deletions
This file was deleted.

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+defaults:
+  - hydra/callbacks:
+      - collate_results
+  - aws:
+      - s3
+      - dynamodb # save run results to DynamoDB (see also conf/aws/dynamodb.yaml) -- comment me if not required
+  - _self_
+
+prefix_uri: ??? # where the dataset are stored in S3
+region: ???
+sharding: False
+epochs: 1
+checkpoint:
+  save_one_in: 25
+  destination: disk
+  uri: ./nvme/checkpoints/
+  region: eu-west-2
+
+hydra:
+  mode: MULTIRUN
+  sweep:
+    dir: multirun/${hydra.job.config_name}/${now:%Y-%m-%d_%H-%M-%S}
+  sweeper:
+    params:
+      +model: entitlement, vit
+      +dataloader: s3iterabledataset, s3mapdataset, fsspec, mountpoint, mountpointcache
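
For orientation, the two sweeper lists expand combinatorially: one run of this config launches 2 models x 5 dataloaders = 10 sequential jobs, each with its own `job_results.json` under the timestamped sweep directory. Below is a minimal sketch of supplying the mandatory values and adjusting the non-swept knobs on the command line; the entry-point usage and all values are illustrative assumptions, not taken from this commit:

```shell
# The mandatory s3.* and dynamodb.* values from conf/aws/ must also be supplied
# (or the dynamodb entry commented out in the defaults list above).
s3torch-benchmark \
  prefix_uri=s3://my-bucket/my-dataset/ region=eu-west-1 \
  epochs=2 checkpoint.save_one_in=10
```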
