# Benchmarking the S3 Connector for PyTorch

This directory contains a modular component for experimentally evaluating the performance of the Amazon S3 Connector for
PyTorch.
The goal of this component is to run performance benchmarks for PyTorch connectors in an easy-to-reproduce and
extensible fashion, so that users can experiment with different settings and arrive at the optimal configuration for their workloads
before committing to a setup.

By managing the complex configuration space with [Hydra](https://hydra.cc/), we define modular configuration pieces mapped to the various
stages of the training pipeline. This approach lets you mix and match configurations and measure their performance
impact on the end-to-end training process. To achieve this, we split the configuration into four pieces:

**dataset**: The `dataset` configuration records where the data resides. Sharded objects are supported, but currently
only loading from TAR archives is implemented.

**dataloader**: Used to configure the [PyTorch DataLoader](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html). Parameters like
`batch_size` and `num_workers` are self-explanatory. `kind` specifies which PyTorch dataset implementation to use (`s3iterabledataset`, `s3mapdataset`, `fsspec`).
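
For intuition, the number of batches a DataLoader yields per epoch follows directly from `batch_size`; this is plain arithmetic for illustration, not the benchmark's code:

```python
import math

def batches_per_epoch(num_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of batches a DataLoader yields per epoch over a dataset."""
    if drop_last:
        return num_samples // batch_size   # incomplete final batch discarded
    return math.ceil(num_samples / batch_size)

# e.g. the 20k-sample dataset used later in this README, with batch_size 64:
print(batches_per_epoch(20_000, 64))  # 313
```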

**training**: Specifies which model to train and for how many epochs. Currently, two
models are implemented: `Entitlement` and [ViT](https://huggingface.co/docs/transformers/model_doc/vit). To make it easy to add new models, and to keep the sample-processing logic separate from the configuration, this module
defines a `Model` interface where each model is expected to implement `load_sample`, `train`, and `save` methods.
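
The repository's actual `Model` class is not reproduced in this README; as a rough sketch of what such an interface could look like (method names come from the text above, every signature is an assumption):

```python
from abc import ABC, abstractmethod

class Model(ABC):
    """Hypothetical sketch of the Model interface; not the benchmark's real API."""

    @abstractmethod
    def load_sample(self, sample):
        """Decode one raw sample (e.g. image bytes) into a training input."""

    @abstractmethod
    def train(self, dataloader, epochs: int):
        """Run the training loop for the given number of epochs."""

    @abstractmethod
    def save(self, path: str):
        """Persist a checkpoint to the configured destination."""

class Entitlement(Model):
    """No-op model: drains the dataloader to measure pure I/O throughput."""

    def load_sample(self, sample):
        return sample  # no decoding or preprocessing

    def train(self, dataloader, epochs: int):
        seen = 0
        for _ in range(epochs):
            for batch in dataloader:
                seen += len(batch)  # consume batches; no GPU work
        return seen

    def save(self, path: str):
        pass  # nothing to checkpoint

print(Entitlement().train([[1, 2], [3, 4]], epochs=2))  # 8 samples drained
```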

**checkpoint**: Defines where (`disk` or `s3`) and how frequently checkpoints are saved.
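
No checkpoint sub-configuration is shown in this README; purely for illustration (all field names below are assumptions, not taken from the repository), such a file might look like:

```yaml
# ./conf/checkpoint/s3.yaml (hypothetical example)
destination: s3                    # disk | s3
uri: s3://<BUCKET>/checkpoints/    # ignored when destination is disk
save_every_epochs: 1
```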

Once the sub-configurations are defined, you can easily create an experiment configuration that uses the Hydra Sweeper
to launch multiple experiments sequentially.

For example, the `dataloading` experiment stored at `./conf/dataloading.yaml` has the following
content:

```yaml
defaults:
  - dataloader: ???
  - dataset: unsharded_dataset
  - training: entitlement
  - checkpoint: none

hydra:
  mode: MULTIRUN

  sweeper:
    params:
      dataloader: s3iterabledataset, fsspec
      dataloader.num_workers: 2,4,8,16
```

This configuration pins the `dataset` and `training` model while overriding the `dataloader` to vary `kind`
and `num_workers`. Running this benchmark sequentially executes 8 scenarios, one for
each combination of the swept parameters. As `Entitlement` does not perform any actual training, this
experiment is useful for finding the upper limit of dataloader throughput without being subject to GPU backpressure.
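
The 8 scenarios are simply the Cartesian product of the swept values (2 dataloader kinds × 4 worker counts). Illustrated in plain Python rather than Hydra's own sweeper machinery:

```python
from itertools import product

dataloaders = ["s3iterabledataset", "fsspec"]
num_workers = [2, 4, 8, 16]

# Hydra's basic sweeper launches one job per combination of swept params.
jobs = [{"dataloader": dl, "dataloader.num_workers": nw}
        for dl, nw in product(dataloaders, num_workers)]

print(len(jobs))  # 2 * 4 = 8 sequential runs
print(jobs[0])    # {'dataloader': 's3iterabledataset', 'dataloader.num_workers': 2}
```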

## Getting Started

The benchmarking code lives in the `s3torchbenchmarking` directory. First, navigate into it:

    cd s3torchbenchmarking

The tests can be run locally, or you can launch an EC2 instance with a GPU (we used a [g5.2xlarge](https://aws.amazon.com/ec2/instance-types/g5/)), choosing
the [AWS Deep Learning AMI GPU PyTorch 2.0.1 (Amazon Linux 2)](https://aws.amazon.com/releasenotes/aws-deep-learning-ami-gpu-pytorch-2-0-amazon-linux-2/) as your AMI. Activate the preinstalled PyTorch environment on this machine
by running:

    source activate pytorch

If running locally, you can optionally configure a Python virtualenv:

    python -m venv <ENV-NAME>
    source <ENV-NAME>/bin/activate

Then, from this directory, install the dependencies:

    python -m pip install .

This makes the `s3torch-benchmark` and `s3torch-datagen` commands available to you. Note: the installation may
recommend `$PATH` modifications if necessary, allowing you to use the commands directly.

### (Pre-requisite) Configure AWS Credentials

The commands provided below (`s3torch-datagen`, `s3torch-benchmark`) rely on the standard [AWS credential discovery mechanism](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html).
Supplement the commands as necessary to ensure AWS credentials are made available to the process, for example by setting
the `AWS_PROFILE` environment variable.

### Configuring the dataset

_Note: This is a one-time setup for each dataset configuration. Once created locally, the dataset configuration files
can be reused in subsequent benchmarks, as long as the dataset in the S3 bucket is intact._

If you already have a dataset, you only need to upload it to an S3 bucket and set up a YAML file under
`./conf/dataset/` in the following format:

```yaml
# custom_dataset.yaml

prefix_uri: s3://<S3_BUCKET>/<S3_PREFIX>/
region: <AWS_REGION>
sharding: TAR|null # TAR if the samples have been packed into TAR archives, null otherwise.
```

This dataset can then be referenced in an experiment with an entry like `dataset: custom_dataset` (note that the
`.yaml` extension is omitted). The benchmarks will then run against this dataset. Some experiments have
already been defined for reference; see `./conf/dataloading.yaml` or `./conf/sharding.yaml`.

_Note: Ensure the bucket is in the same region as the EC2 instance to eliminate network latency effects in your
measurements._

Alternatively, you can use the `s3torch-datagen` command to procedurally generate an image dataset and upload it to
Amazon S3. The script also creates a Hydra configuration file at the appropriate path.

```
$ s3torch-datagen --help
Usage: s3torch-datagen [OPTIONS]

  Synthesizes a dataset that will be used for benchmarking and uploads it to
  an S3 bucket.

Options:
  -n, --num-samples FLOAT  Number of samples to generate. Can be supplied as
                           an IEC or SI prefix. Eg: 1k, 2M. Note: these are
                           case-sensitive notations. [default: 1k]
  --resolution TEXT        Resolution written in 'widthxheight' format
                           [default: 496x387]
  --shard-size TEXT        If supplied, the images are grouped into tar files
                           of the given size. Size can be supplied as an IEC
                           or SI prefix. Eg: 16Mib, 4Kb, 1Gib. Note: these are
                           case-sensitive notations.
  --s3-bucket TEXT         S3 Bucket name. Note: Ensure the credentials are
                           made available either through environment variables
                           or a shared credentials file. [required]
  --s3-prefix TEXT         Optional S3 Key prefix where the dataset will be
                           uploaded. Note: a prefix will be autogenerated. eg:
                           s3://<BUCKET>/1k_256x256_16Mib_sharded/
  --region TEXT            Region where the S3 bucket is hosted. [default:
                           us-east-1]
  --help                   Show this message and exit.
```
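
For illustration, here is one way such case-sensitive SI/IEC notations could be interpreted. This is an independent sketch, not `s3torch-datagen`'s actual parser, and the accepted suffix table is an assumption:

```python
import re

# Assumed multiplier table: decimal (SI-style) counts and binary (IEC-style)
# byte sizes, matching the examples in the help text above.
_MULTIPLIERS = {
    "k": 10**3, "M": 10**6, "G": 10**9,        # decimal counts: 1k, 2M, ...
    "Kb": 10**3, "Mb": 10**6, "Gb": 10**9,     # decimal byte sizes: 4Kb, ...
    "Kib": 2**10, "Mib": 2**20, "Gib": 2**30,  # binary byte sizes: 16Mib, ...
}

def parse_size(text: str) -> int:
    """Turn '1k', '2M', '16Mib', ... into an integer (case-sensitive)."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([A-Za-z]+)?", text)
    if match is None:
        raise ValueError(f"unparsable value: {text!r}")
    number, suffix = match.groups()
    if suffix is None:
        return int(float(number))
    if suffix not in _MULTIPLIERS:
        raise ValueError(f"unknown prefix: {suffix!r}")
    return int(float(number) * _MULTIPLIERS[suffix])

print(parse_size("1k"))     # 1000
print(parse_size("16Mib"))  # 16777216
```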

Here are some sample dataset configurations that we ran our benchmarks against:

- `-n 20k --resolution 496x387`
- `-n 20k --resolution 496x387 --shard-size {4, 8, 16, 32, 64}MiB`

Example:

```
$ s3torch-datagen -n 20k \
    --resolution 496x387 \
    --shard-size 4MB \
    --s3-bucket swift-benchmark-dataset \
    --region eu-west-2

Generating data: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 1243.50it/s]
Uploading to S3: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 3378.87it/s]
Dataset uploaded to: s3://swift-benchmark-dataset/20k_496x387_images_4MB_shards/
Dataset Configuration created at: ./conf/dataset/20k_496x387_images_4MB_shards.yaml
Configure your experiment by setting the entry:
  dataset: 20k_496x387_images_4MB_shards
Alternatively, you can specify it on the cmd-line when running the benchmark like so:
  s3torch-benchmark -cd conf -m -cn <CONFIG-NAME> 'dataset=20k_496x387_images_4MB_shards'
```

---

Finally, once the dataset and other configuration modules have been defined, you can kick off the benchmark by running:

    $ cd s3torchbenchmarking/

    $ s3torch-benchmark -cd conf -m -cn YOUR-TEST-CONFIGURATION

    # Example-1:
    $ s3torch-benchmark -cd conf -m -cn dataloading 'dataset.prefix_uri=<S3-PREFIX>' 'dataset.region=eu-west-2'

    # Example-2:
    $ s3torch-benchmark -cd conf -m -cn checkpointing 'dataset.prefix_uri=<S3-PREFIX>' 'dataset.region=eu-west-2'

_Note: To override any other benchmark parameters,
see [Hydra Overrides](https://hydra.cc/docs/advanced/override_grammar/basic/). You can also run `s3torch-benchmark --hydra-help` to learn more._

Experiments report the total training time and number of training samples processed, as well as host-level metrics like CPU
utilisation and GPU utilisation (if available).
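
As a sketch of the core figure such a report reduces to, samples per second over a timed dataloading loop (the formula is assumed for illustration, not the benchmark's actual reporting code):

```python
import time

def measure_throughput(batches, process):
    """Drain an iterable of batches, timing the loop; returns (samples, samples/sec)."""
    samples = 0
    start = time.perf_counter()
    for batch in batches:
        process(batch)       # e.g. a model's train step; stubbed out here
        samples += len(batch)
    elapsed = time.perf_counter() - start
    return samples, samples / max(elapsed, 1e-9)  # guard against a zero reading

# 5 batches of 10 dummy samples, with a no-op "training" step:
n, rate = measure_throughput([[0] * 10] * 5, process=lambda batch: None)
print(n)  # 50 samples drained
```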

## Next Steps

- Use [Hydra Callbacks](https://hydra.cc/docs/experimental/callbacks/) to aggregate and plot benchmark results.
- Add more models (LLMs?) to monitor training performance.
- Support plugging in user-defined models and automatic discovery of the same.