refactor(benchmarks): rework "dataset" scenarios + other improvements (#285)
Move "dataset" scenario in their own directory, just like "dcp" and
"lightning_checkpointing". Create dedicated S3 and DynamoDB Hydra config
files. Delete last unused Hydra config files. Make "dataset" scenario
compatible with existing results schema + update that schema. Save job
results automatically through Hydra callbacks. Simplify further
benchmark scripts.
checkpointing mechanism — learn more in [this dedicated README](src/s3torchbenchmarking/dcp/README.md).
-For example, the `dataloading` experiment stored at `./conf/dataloading.yaml` has the following
-content:
-
-```
-defaults:
-  - dataloader: ???
-  - dataset: unsharded_dataset
-  - training: entitlement
-  - checkpoint: none
-
-hydra:
-  mode: MULTIRUN
-  sweeper:
-    params:
-      dataloader: s3iterabledataset, fsspec
-      dataloader.num_workers: 2,4,8,16
-```
-
-This configuration pins the `dataset` and `training` model while overriding the `dataloader` to change `kind`
-and `num_workers`. Running this benchmark will sequentially run 8 different scenarios,
-one for each combination of the swept parameters. As `Entitlement` does not actually perform any training, this
-experiment is helpful for seeing the upper limit of dataloader throughput without being susceptible to GPU backpressure.
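For intuition, the 8 jobs in the `dataloading` example above come from the Cartesian product of the swept parameters (2 dataloaders × 4 worker counts). A minimal sketch of that expansion, mirroring what Hydra's basic sweeper produces (illustrative only, not Hydra's actual internals):

```python
from itertools import product

# Swept parameters from the dataloading.yaml example above
dataloaders = ["s3iterabledataset", "fsspec"]
num_workers = [2, 4, 8, 16]

# Hydra's basic sweeper runs one job per combination, sequentially
jobs = [
    {"dataloader": dl, "dataloader.num_workers": nw}
    for dl, nw in product(dataloaders, num_workers)
]
print(len(jobs))  # → 8
```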
## Getting Started

-The benchmarking code is available within the `src/s3torchbenchmarking` module. First, from here, navigate into the
-directory:
+The benchmarking code is available within the `src/s3torchbenchmarking` module.

-    cd src/s3torchbenchmarking
+The tests can be run locally, or you can launch an EC2 instance with a GPU (we used a [g5.2xlarge][g5.2xlarge]),
+choosing the [AWS Deep Learning AMI GPU PyTorch 2.5 (Ubuntu 22.04)][dl-ami] as your AMI.

-The tests can be run locally, or you can launch an EC2 instance with a GPU (we used a [g5.2xlarge](https://aws.amazon.com/ec2/instance-types/g5/)), choosing
-the [AWS Deep Learning AMI GPU PyTorch 2.0.1 (Amazon Linux 2)](https://aws.amazon.com/releasenotes/aws-deep-learning-ami-gpu-pytorch-2-0-amazon-linux-2/) as your AMI. Activate the venv within this machine
-by running:
+First, activate the Conda env within this machine by running:

-    source activate pytorch
+```shell
+source activate pytorch
+```

If running locally you can optionally configure a Python venv:

-    python -m venv <ENV-NAME>
-    source <PATH-TO-VENV>/bin/activate
-
-Then from this directory, install the dependencies:
-
-    python -m pip install .
-
-This would make some commands available to you, which you can find under the [pyproject.toml](pyproject.toml) file.
-Note: the installation would recommend `$PATH` modifications if necessary, allowing you to use the commands directly.
-
-**(Optional) Install Mountpoint**
-
-Required only if you're running benchmarks using PyTorch
-with [Mountpoint for Amazon S3](https://github.com/awslabs/mountpoint-s3).
+- Error `TypeError: canonicalize_version() got an unexpected keyword argument 'strip_trailing_zero'` while trying to
+  install the `s3torchbenchmarking` package:
+  ```shell
+  pip install "setuptools<71"
+  ```

### (Pre-requisite) Configure AWS Credentials

-The commands provided below (`datagen.py`, `benchmark.py`) rely on the standard [AWS credential discovery mechanism](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html).
-Supplement the command as necessary to ensure the AWS credentials are made available to the process. For eg: by setting
-the `AWS_PROFILE` environment variable.
+The commands provided below (`datagen.py`, `benchmark.py`) rely on the
+standard [AWS credential discovery mechanism][credentials]. Supplement the command as necessary to ensure the AWS
+credentials are made available to the process, e.g., by setting the `AWS_PROFILE` environment variable.

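For example, one way to make a named profile visible to the benchmark commands ("benchmarks" is a hypothetical profile name; use any profile defined in your `~/.aws/config`):

```shell
# "benchmarks" is a hypothetical profile name from ~/.aws/config.
export AWS_PROFILE=benchmarks

# Subsequent commands (e.g., datagen.py, benchmark.py) inherit the variable
# and resolve credentials through the standard discovery mechanism.
echo "resolving credentials via profile: $AWS_PROFILE"
```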
### Configuring the dataset

_Note: This is a one-time setup for each dataset configuration. The dataset configuration files, once created locally,
can be used in subsequent benchmarks, as long as the dataset on the S3 bucket is intact._

-If you already have a dataset, you only need upload it to an S3 bucket and setup a YAML file under
+If you already have a dataset, you only need to upload it to an S3 bucket and set up a YAML file under
`./conf/dataset/` in the following format:

```yaml
@@ -171,7 +147,7 @@ Finally, once the dataset and other configuration modules have been defined, you
-_Note: For overriding any other benchmark parameters,
-see [Hydra Overrides](https://hydra.cc/docs/advanced/override_grammar/basic/). You can also run `s3torch-benchmark --hydra-help` to learn more._
+_Note: For overriding any other benchmark parameters, see [Hydra Overrides][hydra-overrides]. You can also run
+`s3torch-benchmark --hydra-help` to learn more._

-Experiments will report total training time, number of training samples, as well as host-level metrics like CPU
-utilisation, GPU utilisation (if available), etc. The results for individual jobs will be written out to dedicated
-`result.json` files within their corresponding [output dirs](https://hydra.cc/docs/configure_hydra/intro/#hydraruntime).
-When using MULTIRUN mode, a `collated_results.json` will be written out to the [common sweep dir](https://hydra.cc/docs/configure_hydra/intro/#hydrasweep).
+Experiments will report various metrics, like throughput, processed time, etc. The results for individual jobs and runs
+(one run will contain 1 to N jobs) will be written out to dedicated files, respectively `job_results.json` and
+`run_results.json`, within their corresponding output directory (see the YAML config files).

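Since the results are written as plain JSON, they can be inspected with standard tooling. A minimal sketch (the key names here are hypothetical; check a generated results file for the actual schema):

```python
import json

# Hypothetical job record: key names are illustrative only,
# not the benchmark's actual results schema.
raw = '{"scenario": "dataset", "throughput_mibs": 512.0}'
job = json.loads(raw)
print(f'{job["scenario"]}: {job["throughput_mibs"]} MiB/s')
```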
## Next Steps
- Add more models (LLMs?) to monitor training performance.
- Support plugging in user-defined models and automatic discovery of the same.