Commit 2af7e3a

matthieu-d4r authored and IsaevIlya committed
fix(dcp): address bug in DCP + CPU benchmarks (gloo) (#266)
Use `model.to(device=torch.device("cpu"))` for gloo-based benchmarks instead of `model.to(device_id)`. Also update the DCP benchmarks README.md with one extra caveat and its solution.
1 parent 8c9aabe commit 2af7e3a

2 files changed: 23 additions and 6 deletions

s3torchbenchmarking/src/s3torchbenchmarking/dcp/README.md

Lines changed: 19 additions & 4 deletions
@@ -16,8 +16,8 @@ These benchmarks are designed to:
 > [!IMPORTANT]
 > The benchmarks are designed to be run on a EC2 instance.
 
-Install the `s3torchbenchmarking` package with `pip` (see the [root README](../../../README.md) for instructions); once
-installed, the DCP benchmarks can be run with:
+Install the `s3torchbenchmarking` package with `pip` (see the [root README](../../../README.md) for instructions),
+along with the `s3torchconnector[dcp]` extra; once installed, the DCP benchmarks can be run with:
 
 ```shell
 $ s3torch-benchmark-dcp -cd conf -cn dcp
@@ -32,13 +32,28 @@ The command must be executed from the package's root, where it can read from the
 
 #### Potential caveats
 
-If you encounter the following error during installation:
+If you encounter the following errors during installation, try the associated command:
+
+**Error**:
+
+```
+RuntimeError: Failed to import transformers.models.vit.modeling_vit because of the following error (look up to see its traceback):
+operator torchvision::nms does not exist
+```
+
+**Try**:
+
+```shell
+$ conda install -y pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia
+```
+
+**Error**:
 
 ```
 TypeError: canonicalize_version() got an unexpected keyword argument 'strip_trailing_zero'
 ```
 
-Run this command to resolve it:
+**Try**:
 
 ```shell
 $ pip install "setuptools<71"
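
The second caveat's workaround pins `setuptools` below 71. As a rough sketch, the check below reports whether the installed `setuptools` sits at or above that bound. The helper names `setuptools_major` and `needs_setuptools_pin` are hypothetical and not part of `s3torchbenchmarking`. The threshold of 71 is taken from the README's pin, and treating every version at or above it as affected is an assumption.

```python
from importlib import metadata
from typing import Optional


def setuptools_major(version: str) -> int:
    """Extract the leading major number from a version string such as '70.3.0'."""
    return int(version.split(".", 1)[0])


def needs_setuptools_pin(version: Optional[str] = None, limit: int = 71) -> bool:
    """Return True when the given (or installed) setuptools is at or above `limit`.

    Hypothetical helper: `limit=71` mirrors the README's `setuptools<71` pin.
    """
    if version is None:
        try:
            # Look up the version of setuptools installed in this environment.
            version = metadata.version("setuptools")
        except metadata.PackageNotFoundError:
            return False  # setuptools is not installed, so there is nothing to pin
    return setuptools_major(version) >= limit
```

Calling `needs_setuptools_pin()` with no arguments inspects the current environment; if it returns `True`, the `pip install "setuptools<71"` command from the README is the thing to try.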

s3torchbenchmarking/src/s3torchbenchmarking/dcp/benchmark.py

Lines changed: 4 additions & 2 deletions
@@ -109,12 +109,14 @@ def run(
     if cfg.backend == "nccl":
         device_id = rank % torch.cuda.device_count()
         torch.cuda.set_device(device_id)
+        model.to(device_id)
+        model = DistributedDataParallel(model, device_ids=[device_id])
     else:
         device_id = rank % torch.cpu.device_count()
         torch.cpu.set_device(device_id)
+        model.to(device=torch.device("cpu"))
+        model = DistributedDataParallel(model)
 
-    model.to(device_id)
-    model = DistributedDataParallel(model, device_ids=[device_id])
     state_dict = model.state_dict()
 
     begin_save = perf_counter()
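
The branching in this change can be pictured as a small pure function that picks the target device and the keyword arguments for `DistributedDataParallel` from the backend and rank. This is only an illustrative sketch: `plan_ddp_placement` is a hypothetical helper, not code from `benchmark.py`, and it takes the CUDA device count as a plain argument so that it runs without a GPU or PyTorch installed.

```python
from typing import Any, Dict, Tuple


def plan_ddp_placement(
    backend: str, rank: int, cuda_device_count: int = 0
) -> Tuple[str, Dict[str, Any]]:
    """Return (device string, DistributedDataParallel kwargs) for one rank.

    Hypothetical helper mirroring the two branches of the diff above.
    """
    if backend == "nccl":
        if cuda_device_count < 1:
            raise ValueError("the nccl backend needs at least one CUDA device")
        # GPU path: ranks are spread round-robin over the visible CUDA devices,
        # and DDP is told which single device holds the model.
        device_id = rank % cuda_device_count
        return f"cuda:{device_id}", {"device_ids": [device_id]}
    # CPU path (e.g. gloo): the model stays on the CPU and device_ids is left
    # unset, since DDP expects device_ids to be None for CPU modules.
    return "cpu", {}
```

With PyTorch available, the result could be applied as `model.to(torch.device(device))` followed by `DistributedDataParallel(model, **kwargs)`. The pre-fix code passed an integer `device_id` to `model.to(...)` on both paths, which PyTorch interprets as a CUDA device index, hence the failure on CPU-only gloo runs.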
