Commit f54d495

dorotat-nv authored and trvachov committed
fix Evo2 training crash - TE commit (#796)
### Description

Pinning TransformerEngine to `TE=9d4e11eaa508383e35b510dc338e58b09c30be73` fixes the Evo2 divergence issue.

### Type of changes
<!-- Mark the relevant option with an [x] -->

- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Refactor
- [ ] Documentation update
- [ ] Other (please describe):

### CI Pipeline Configuration

Configure CI behavior by applying the relevant labels:

- [SKIP_CI](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#skip_ci) - Skip all continuous integration tests
- [INCLUDE_NOTEBOOKS_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_notebooks_tests) - Execute notebook validation tests in pytest
- [INCLUDE_SLOW_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_slow_tests) - Execute tests labelled as slow in pytest for extensive testing

> [!NOTE]
> By default, the notebooks validation tests are skipped unless explicitly enabled.

#### Authorizing CI Runs

We use [copy-pr-bot](https://docs.gha-runners.nvidia.com/apps/copy-pr-bot/#automation) to manage authorization of CI runs on NVIDIA's compute resources.

* If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
* If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an `/ok to test` comment on the pull request to trigger CI. This will need to be done for each new commit.

### Usage
<!--- How does a user interact with the changed code -->

```python
TODO: Add code snippet
```

### Pre-submit Checklist
<!--- Ensure all items are completed before submitting -->

- [ ] I have tested these changes locally
- [ ] I have updated the documentation accordingly
- [ ] I have added/updated tests as needed
- [ ] All existing tests pass successfully
1 parent d61dfa9 commit f54d495

1 file changed: +8 -0 lines changed

Dockerfile

Lines changed: 8 additions & 0 deletions
@@ -56,6 +56,13 @@ apt-get upgrade -qyy \
 rm -rf /tmp/* /var/tmp/*
 EOF
 
+
+## BUMP TE as a solution to the issue https://github.com/NVIDIA/bionemo-framework/issues/422. Drop this when pytorch images ship the fixed commit.
+ARG TE_TAG=9d4e11eaa508383e35b510dc338e58b09c30be73
+RUN PIP_CONSTRAINT= NVTE_FRAMEWORK=pytorch NVTE_WITH_USERBUFFERS=1 MPI_HOME=/usr/local/mpi \
+    pip --disable-pip-version-check --no-cache-dir install \
+    git+https://github.com/NVIDIA/TransformerEngine.git@${TE_TAG}
+
 # Install AWS CLI based on architecture
 RUN if [ "$TARGETARCH" = "arm64" ]; then \
     curl "https://awscli.amazonaws.com/awscli-exe-linux-aarch64.zip" -o "awscliv2.zip"; \
@@ -68,6 +75,7 @@ RUN if [ "$TARGETARCH" = "arm64" ]; then \
     ./aws/install && \
     rm -rf aws awscliv2.zip
 
+
 # Use a branch of causal_conv1d while the repository works on Blackwell support.
 ARG CAUSAL_CONV_TAG=52e06e3d5ca10af0c7eb94a520d768c48ef36f1f
 RUN CAUSAL_CONV1D_FORCE_BUILD=TRUE pip --disable-pip-version-check --no-cache-dir install git+https://github.com/trvachov/causal-conv1d.git@${CAUSAL_CONV_TAG}
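
Both `ARG` pins in this Dockerfile point at exact commits rather than branches, which is what keeps the image reproducible until the base image ships the fix. The sketch below is not part of the commit; it is one possible way to check that each `ARG NAME=value` pin is a full 40-character commit SHA and to show the `git+<repo>@<ref>` URL that pip receives once `${TE_TAG}` is substituted. The helper names (`find_arg_pins`, `is_full_commit_sha`, `pip_vcs_url`) are hypothetical, and the Dockerfile text is limited to the two `ARG` lines copied from the diff.

```python
import re

# Only the two ARG lines from the diff; the rest of the Dockerfile is omitted.
DOCKERFILE_SNIPPET = """\
ARG TE_TAG=9d4e11eaa508383e35b510dc338e58b09c30be73
ARG CAUSAL_CONV_TAG=52e06e3d5ca10af0c7eb94a520d768c48ef36f1f
"""

# Matches `ARG NAME=value` lines (hypothetical, simplified parsing: no quoting).
ARG_PATTERN = re.compile(r"^ARG\s+(\w+)=(\S+)\s*$", re.MULTILINE)
# A full SHA-1 commit id is 40 lowercase hex characters.
FULL_SHA_PATTERN = re.compile(r"[0-9a-f]{40}")


def find_arg_pins(dockerfile_text):
    """Return a dict of ARG name -> default value for `ARG NAME=value` lines."""
    return dict(ARG_PATTERN.findall(dockerfile_text))


def is_full_commit_sha(ref):
    """True when ref is a 40-character lowercase hex string."""
    return FULL_SHA_PATTERN.fullmatch(ref) is not None


def pip_vcs_url(repo_url, ref):
    """Build the `git+<repo>@<ref>` form used in the RUN line of the diff."""
    return f"git+{repo_url}@{ref}"


pins = find_arg_pins(DOCKERFILE_SNIPPET)
for name, ref in pins.items():
    status = "full SHA" if is_full_commit_sha(ref) else "NOT a full SHA"
    print(f"{name}: {ref} ({status})")

print(pip_vcs_url("https://github.com/NVIDIA/TransformerEngine.git", pins["TE_TAG"]))
```

A check like this could run in CI to flag a pin that drifts to a branch name or a shortened hash; whether that is worth adding to this repository is a judgment call for the maintainers.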
