# Benchmarking the S3 Connector for PyTorch

This directory contains a modular component for experimentally evaluating the performance of the Amazon S3 Connector for
PyTorch.
The goal of this component is to run performance benchmarks for PyTorch connectors in an easy-to-reproduce and
extensible fashion, so that users can experiment with different settings and arrive at the optimal configuration for their workloads
before committing to a setup.

By managing the complex configuration space with [Hydra](https://hydra.cc/), we define modular configuration pieces mapped to the various
stages of the training pipeline. This approach lets you mix and match configurations and measure their performance
impact on the end-to-end training process. To achieve this, we split the configuration into four pieces:

**dataset**: The `dataset` configuration records where the data resides. Sharded objects are supported, but currently
only loading from TAR archives is implemented.

**dataloader**: Used to configure the [PyTorch DataLoader](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html). Parameters like
`batch_size` and `num_workers` are self-explanatory. `kind` specifies which PyTorch dataset implementation to use (`s3iterabledataset`, `s3mapdataset`, `fsspec`).
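
For intuition, the number of batches a DataLoader yields per epoch follows directly from `batch_size`; this is plain arithmetic for illustration, not the benchmark's code:

```python
import math

def batches_per_epoch(num_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of batches a DataLoader yields per epoch over a dataset."""
    if drop_last:
        return num_samples // batch_size   # incomplete final batch discarded
    return math.ceil(num_samples / batch_size)

# e.g. the 20k-sample dataset used later in this README, with batch_size 64:
print(batches_per_epoch(20_000, 64))  # 313
```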

**training**: Specifies which model to train and for how many epochs. Currently, two
models are implemented: `Entitlement` and [ViT](https://huggingface.co/docs/transformers/model_doc/vit). To make it easy to add new models, and to keep the sample-processing logic separate from the configuration, this module
defines a `Model` interface where each model is expected to implement `load_sample`, `train`, and `save` methods.
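
The repository's actual `Model` class is not reproduced in this README; as a rough sketch of what such an interface could look like (method names come from the text above, every signature is an assumption):

```python
from abc import ABC, abstractmethod

class Model(ABC):
    """Hypothetical sketch of the Model interface; not the benchmark's real API."""

    @abstractmethod
    def load_sample(self, sample):
        """Decode one raw sample (e.g. image bytes) into a training input."""

    @abstractmethod
    def train(self, dataloader, epochs: int):
        """Run the training loop for the given number of epochs."""

    @abstractmethod
    def save(self, path: str):
        """Persist a checkpoint to the configured destination."""

class Entitlement(Model):
    """No-op model: drains the dataloader to measure pure I/O throughput."""

    def load_sample(self, sample):
        return sample  # no decoding or preprocessing

    def train(self, dataloader, epochs: int):
        seen = 0
        for _ in range(epochs):
            for batch in dataloader:
                seen += len(batch)  # consume batches; no GPU work
        return seen

    def save(self, path: str):
        pass  # nothing to checkpoint

print(Entitlement().train([[1, 2], [3, 4]], epochs=2))  # 8 samples drained
```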

**checkpoint**: Defines where (`disk` or `s3`) and how frequently checkpoints are saved.
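
No checkpoint sub-configuration is shown in this README; purely for illustration (all field names below are assumptions, not taken from the repository), such a file might look like:

```yaml
# ./conf/checkpoint/s3.yaml (hypothetical example)
destination: s3                    # disk | s3
uri: s3://<BUCKET>/checkpoints/    # ignored when destination is disk
save_every_epochs: 1
```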

Once the sub-configurations are defined, you can easily create an experiment configuration that uses the Hydra Sweeper
to launch multiple experiments sequentially.

For example, the `dataloading` experiment stored at `./conf/dataloading.yaml` has the following
content:

```yaml
defaults:
  - dataloader: ???
  - dataset: unsharded_dataset
  - training: entitlement
  - checkpoint: none

hydra:
  mode: MULTIRUN

  sweeper:
    params:
      dataloader: s3iterabledataset, fsspec
      dataloader.num_workers: 2,4,8,16
```

This configuration pins the `dataset` and `training` model while overriding the `dataloader` to vary `kind`
and `num_workers`. Running this benchmark sequentially executes 8 scenarios, one for
each combination of the swept parameters. As `Entitlement` does not perform any actual training, this
experiment is useful for finding the upper limit of dataloader throughput without being subject to GPU backpressure.
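
The 8 scenarios are simply the Cartesian product of the swept values (2 dataloader kinds × 4 worker counts). Illustrated in plain Python rather than Hydra's own sweeper machinery:

```python
from itertools import product

dataloaders = ["s3iterabledataset", "fsspec"]
num_workers = [2, 4, 8, 16]

# Hydra's basic sweeper launches one job per combination of swept params.
jobs = [{"dataloader": dl, "dataloader.num_workers": nw}
        for dl, nw in product(dataloaders, num_workers)]

print(len(jobs))  # 2 * 4 = 8 sequential runs
print(jobs[0])    # {'dataloader': 's3iterabledataset', 'dataloader.num_workers': 2}
```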

## Getting Started

The benchmarking code lives in the `s3torchbenchmarking` directory. First, navigate into it:

    cd s3torchbenchmarking

The tests can be run locally, or you can launch an EC2 instance with a GPU (we used a [g5.2xlarge](https://aws.amazon.com/ec2/instance-types/g5/)), choosing
the [AWS Deep Learning AMI GPU PyTorch 2.0.1 (Amazon Linux 2)](https://aws.amazon.com/releasenotes/aws-deep-learning-ami-gpu-pytorch-2-0-amazon-linux-2/) as your AMI. Activate the preinstalled PyTorch environment on this machine
by running:

    source activate pytorch

If running locally, you can optionally configure a Python virtualenv:

    python -m venv <ENV-NAME>
    source <ENV-NAME>/bin/activate

Then, from this directory, install the dependencies:

    python -m pip install .

This makes the `s3torch-benchmark` and `s3torch-datagen` commands available to you. Note: the installation may
recommend `$PATH` modifications if necessary, allowing you to use the commands directly.

### (Pre-requisite) Configure AWS Credentials

The commands provided below (`s3torch-datagen`, `s3torch-benchmark`) rely on the standard [AWS credential discovery mechanism](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html).
Supplement the commands as necessary to ensure AWS credentials are made available to the process, for example by setting
the `AWS_PROFILE` environment variable.

### Configuring the dataset

_Note: This is a one-time setup for each dataset configuration. Once created locally, the dataset configuration files
can be reused in subsequent benchmarks, as long as the dataset in the S3 bucket is intact._

If you already have a dataset, you only need to upload it to an S3 bucket and set up a YAML file under
`./conf/dataset/` in the following format:

```yaml
# custom_dataset.yaml

prefix_uri: s3://<S3_BUCKET>/<S3_PREFIX>/
region: <AWS_REGION>
sharding: TAR|null # TAR if the samples have been packed into TAR archives, null otherwise.
```

This dataset can then be referenced in an experiment with an entry like `dataset: custom_dataset` (note that the
`.yaml` extension is omitted). The benchmarks will then run against this dataset. Some experiments have
already been defined for reference; see `./conf/dataloading.yaml` or `./conf/sharding.yaml`.

_Note: Ensure the bucket is in the same region as the EC2 instance to eliminate network latency effects in your
measurements._

Alternatively, you can use the `s3torch-datagen` command to procedurally generate an image dataset and upload it to
Amazon S3. The script also creates a Hydra configuration file at the appropriate path.

```
$ s3torch-datagen --help
Usage: s3torch-datagen [OPTIONS]

  Synthesizes a dataset that will be used for benchmarking and uploads it to
  an S3 bucket.

Options:
  -n, --num-samples FLOAT  Number of samples to generate. Can be supplied as
                           an IEC or SI prefix. Eg: 1k, 2M. Note: these are
                           case-sensitive notations. [default: 1k]
  --resolution TEXT        Resolution written in 'widthxheight' format
                           [default: 496x387]
  --shard-size TEXT        If supplied, the images are grouped into tar files
                           of the given size. Size can be supplied as an IEC
                           or SI prefix. Eg: 16Mib, 4Kb, 1Gib. Note: these are
                           case-sensitive notations.
  --s3-bucket TEXT         S3 Bucket name. Note: Ensure the credentials are
                           made available either through environment variables
                           or a shared credentials file. [required]
  --s3-prefix TEXT         Optional S3 Key prefix where the dataset will be
                           uploaded. Note: a prefix will be autogenerated. eg:
                           s3://<BUCKET>/1k_256x256_16Mib_sharded/
  --region TEXT            Region where the S3 bucket is hosted. [default:
                           us-east-1]
  --help                   Show this message and exit.
```
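
For illustration, here is one way such case-sensitive SI/IEC notations could be interpreted. This is an independent sketch, not `s3torch-datagen`'s actual parser, and the accepted suffix table is an assumption:

```python
import re

# Assumed multiplier table: decimal (SI-style) counts and binary (IEC-style)
# byte sizes, matching the examples in the help text above.
_MULTIPLIERS = {
    "k": 10**3, "M": 10**6, "G": 10**9,        # decimal counts: 1k, 2M, ...
    "Kb": 10**3, "Mb": 10**6, "Gb": 10**9,     # decimal byte sizes: 4Kb, ...
    "Kib": 2**10, "Mib": 2**20, "Gib": 2**30,  # binary byte sizes: 16Mib, ...
}

def parse_size(text: str) -> int:
    """Turn '1k', '2M', '16Mib', ... into an integer (case-sensitive)."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([A-Za-z]+)?", text)
    if match is None:
        raise ValueError(f"unparsable value: {text!r}")
    number, suffix = match.groups()
    if suffix is None:
        return int(float(number))
    if suffix not in _MULTIPLIERS:
        raise ValueError(f"unknown prefix: {suffix!r}")
    return int(float(number) * _MULTIPLIERS[suffix])

print(parse_size("1k"))     # 1000
print(parse_size("16Mib"))  # 16777216
```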

Here are some sample dataset configurations that we ran our benchmarks against:

- `-n 20k --resolution 496x387`
- `-n 20k --resolution 496x387 --shard-size {4, 8, 16, 32, 64}MiB`

Example:

```
$ s3torch-datagen -n 20k \
    --resolution 496x387 \
    --shard-size 4MB \
    --s3-bucket swift-benchmark-dataset \
    --region eu-west-2

Generating data: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 1243.50it/s]
Uploading to S3: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 3378.87it/s]
Dataset uploaded to: s3://swift-benchmark-dataset/20k_496x387_images_4MB_shards/
Dataset Configuration created at: ./conf/dataset/20k_496x387_images_4MB_shards.yaml
Configure your experiment by setting the entry:
  dataset: 20k_496x387_images_4MB_shards
Alternatively, you can specify it on the cmd-line when running the benchmark like so:
  s3torch-benchmark -cd conf -m -cn <CONFIG-NAME> 'dataset=20k_496x387_images_4MB_shards'
```

---

Finally, once the dataset and other configuration modules have been defined, you can kick off the benchmark by running:

    $ cd s3torchbenchmarking/

    $ s3torch-benchmark -cd conf -m -cn YOUR-TEST-CONFIGURATION

    # Example-1:
    $ s3torch-benchmark -cd conf -m -cn dataloading 'dataset.prefix_uri=<S3-PREFIX>' 'dataset.region=eu-west-2'

    # Example-2:
    $ s3torch-benchmark -cd conf -m -cn checkpointing 'dataset.prefix_uri=<S3-PREFIX>' 'dataset.region=eu-west-2'

_Note: To override any other benchmark parameters,
see [Hydra Overrides](https://hydra.cc/docs/advanced/override_grammar/basic/). You can also run `s3torch-benchmark --hydra-help` to learn more._

Experiments report the total training time and number of training samples processed, as well as host-level metrics like CPU
utilisation and GPU utilisation (if available).
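
As a sketch of the core figure such a report reduces to, samples per second over a timed dataloading loop (the formula is assumed for illustration, not the benchmark's actual reporting code):

```python
import time

def measure_throughput(batches, process):
    """Drain an iterable of batches, timing the loop; returns (samples, samples/sec)."""
    samples = 0
    start = time.perf_counter()
    for batch in batches:
        process(batch)       # e.g. a model's train step; stubbed out here
        samples += len(batch)
    elapsed = time.perf_counter() - start
    return samples, samples / max(elapsed, 1e-9)  # guard against a zero reading

# 5 batches of 10 dummy samples, with a no-op "training" step:
n, rate = measure_throughput([[0] * 10] * 5, process=lambda batch: None)
print(n)  # 50 samples drained
```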

## Next Steps

- Use [Hydra Callbacks](https://hydra.cc/docs/experimental/callbacks/) to aggregate and plot benchmark results.
- Add more models (LLMs?) to monitor training performance.
- Support plugging in user-defined models and automatic discovery of the same.