Commit c4b6463

TF 2.18 Training SM pin nccl to version<13 (#5193)
* Test heavy efa
* enable efa logging
* enable ap
* pin nccl version
* fix nccl version
* revert libnccl
* test efa with usr lib
* add nvidia debug
* add logging
* fix debug call
* remove -x
* fix log
* set +x
* enable set -x
* print output
* try tf 2.19
* print output
* run 2.18
* install aws-ofi-nccl efa and mpi from amazon
* fix ofi
* test pytorch
* print output
* test tf with correct logging
* use prod image
* build nccl cuda 12
* build test all
* revert toml and formatting
1 parent df8a23b commit c4b6463

2 files changed: 16 additions, 15 deletions

.github/PULL_REQUEST_TEMPLATE.md

Lines changed: 7 additions & 6 deletions
@@ -1,13 +1,14 @@
 *GitHub Issue #, if available:*
 
-**Note**:
-- If merging this PR should also close the associated Issue, please also add that Issue # to the Linked Issues section on the right.
+**Note**:
+- If merging this PR should also close the associated Issue, please also add that Issue # to the Linked Issues section on the right.
 
 - All PR's are checked weekly for staleness. This PR will be closed if not updated in 30 days.
 
 ### Description
 
 ### Tests Run
+
 By default, docker image builds and tests are disabled. Two ways to run builds and tests:
 1. Using dlc_developer_config.toml
 2. Using this PR description (currently only supported for PyTorch, TensorFlow, vllm, and base images)
@@ -16,7 +17,7 @@ By default, docker image builds and tests are disabled. Two ways to run builds a
 <summary>How to use the helper utility for updating dlc_developer_config.toml</summary>
 
 Assuming your remote is called `origin` (you can find out more with `git remote -v`)...
-
+
 - Run default builds and tests for a particular buildspec - also commits and pushes changes to remote; Example:
 
 `python src/prepare_dlc_dev_environment.py -b </path/to/buildspec.yml> -cp origin`
@@ -42,7 +43,7 @@ Use the code block below to uncomment commands and run the PR CodeBuild jobs. Th
 
 - `# /buildspec <buildspec_path>`
 - e.g.: `# /buildspec pytorch/training/buildspec.yml`
-- If this line is commented out, dlc_developer_config.toml will be used.
+- If this line is commented out, dlc_developer_config.toml will be used.
 - `# /tests <test_list>`
 - e.g.: `# /tests sanity security ec2`
 - If this line is commented out, it will run the default set of tests (same as the defaults in dlc_developer_config.toml): `sanity, security, ec2, ecs, eks, sagemaker, sagemaker-local`.
@@ -51,13 +52,13 @@ Use the code block below to uncomment commands and run the PR CodeBuild jobs. Th
 
 ```
 # /buildspec <buildspec_path>
-# /tests <test_list>
+# /tests <test_list>
 ```
 
 ### Formatting
 - [ ] I have run `black -l 100` on my code (formatting tool: https://black.readthedocs.io/en/stable/getting_started.html)
 
-### PR Checklist
+### PR Checklist
 <details>
 <summary>Expand</summary>
 
tensorflow/training/docker/2.18/py3/cu125/Dockerfile.gpu

Lines changed: 9 additions & 9 deletions
@@ -106,8 +106,8 @@ RUN apt-get update && apt-get install -y --no-install-recommends --allow-unauthe
 && apt-get install -y --no-install-recommends --allow-unauthenticated --allow-change-held-packages \
 libcublas-dev-${CUDA_DASH} \
 libcublas-${CUDA_DASH} \
-libnccl2 \
-libnccl-dev \
+libnccl2=2.27.7-1+cuda12.9 \
+libnccl-dev=2.27.7-1+cuda12.9 \
 && rm -rf /var/lib/apt/lists/* \
 && apt-get clean \
 && mkdir -p /var/run/sshd
@@ -220,8 +220,8 @@ RUN ${PIP} install --no-cache-dir -U \
 h5py \
 absl-py \
 werkzeug \
-urllib3
-
+urllib3
+
 # Install AWS OFI NCCL plug-in
 RUN apt-get update && apt-get install -y \
 autoconf \
@@ -256,7 +256,7 @@ RUN mkdir -p /tmp/nvjpeg \
 && rm -rf /tmp/nvjpeg \
 # patch cuobjdump and nvdisasm
 && rm -rf /usr/local/cuda/bin/cuobjdump* \
-&& rm -rf /usr/local/cuda/bin/nvdisasm*
+&& rm -rf /usr/local/cuda/bin/nvdisasm*
 
 # Allow OpenSSH to talk to containers without asking for confirmation
 RUN cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new \
@@ -295,7 +295,7 @@ RUN ${PIP} install --no-cache-dir -U \
 ${TF_URL} \
 "tensorflow-io==0.37.*" \
 "tensorflow-datasets==4.9.7" \
-opencv-python
+opencv-python
 
 RUN HOME_DIR=/root \
 && curl -o ${HOME_DIR}/oss_compliance.zip https://aws-dlinfra-utilities.s3.amazonaws.com/oss_compliance.zip \
@@ -381,7 +381,7 @@ RUN $PYTHON -m pip install --no-cache-dir -U \
 
 RUN $PYTHON -m pip install --no-cache-dir -U \
 sagemaker-experiments==0.1.45
-
+
 RUN $PYTHON -m pip install --no-cache-dir -U \
 sagemaker-training
 
@@ -392,11 +392,11 @@ RUN $PYTHON -m pip install --no-cache-dir -U \
 sagemaker-studio-analytics-extension==0.1.4
 
 RUN $PYTHON -m pip install --no-cache-dir -U \
-sagemaker-studio-sparkmagic-lib==0.2.0
+sagemaker-studio-sparkmagic-lib==0.2.0
 
 RUN $PYTHON -m pip install --no-cache-dir -U \
 sparkmagic==0.21.0 \
-smclarify
+smclarify
 
 
 # install boost
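
The only functional change in this commit is the pin of `libnccl2` and `libnccl-dev` to `2.27.7-1+cuda12.9`. The title's "version<13" most likely refers to the CUDA suffix of the package version rather than the NCCL release number, though the commit does not say so explicitly. Under that assumption, a minimal sketch of how the pin could be checked in a test script follows. The helper names `nccl_cuda_major` and `nccl_cuda_below_13` are hypothetical and not part of this repository.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not from this repository.
# Assumes the package version carries a "+cuda<major>.<minor>" suffix,
# as in the pinned string "2.27.7-1+cuda12.9".

# Print the CUDA major version embedded in an NCCL package version string.
# Returns non-zero if the string has no "+cuda" marker.
nccl_cuda_major() {
    local version="$1"
    local suffix="${version##*+cuda}"   # e.g. "12.9"
    if [ "$suffix" = "$version" ]; then
        return 1                        # no "+cuda" marker found
    fi
    printf '%s\n' "${suffix%%.*}"       # keep only the major part
}

# Succeed only if the package version targets a CUDA major below 13.
nccl_cuda_below_13() {
    local major
    major="$(nccl_cuda_major "$1")" || return 1
    [ "$major" -lt 13 ]
}

# Check the version string pinned in the Dockerfile above.
nccl_cuda_below_13 "2.27.7-1+cuda12.9" && echo "libnccl2 targets CUDA < 13"
```

Inside a built image, the installed version string could be read with `dpkg-query -W -f='${Version}' libnccl2` and passed to `nccl_cuda_below_13` in place of the literal.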
