Commit 9317eb1

Merge branch 'main' into jwilber/docs-chatbot
2 parents 91b3106 + 2d8923d commit 9317eb1

112 files changed, +6876 -5001 lines changed

3rdparty/Megatron-LM

Submodule Megatron-LM updated 969 files

3rdparty/NeMo

Submodule NeMo updated from b685967 to 6a78ab8

CODEOWNERS

Lines changed: 3 additions & 2 deletions
@@ -29,6 +29,7 @@
 # @sichu2023 - Simon Chu
 # @skothenhill-nv - Steven Kothen-Hill
 # @trvachov - Timur Rvachov
+# @tshimko-nv - Tyler Shimko
 # @yzhang123 - Yang Zhang
 
 # TODO: make this a team of bionemo-core contributors
@@ -44,8 +45,8 @@ license_header @dorotat-nv @jstjohn @malcolmgreaves @trvachov
 #
 ## DOCUMENTATION
 #
-**.md @dorotat-nv @jstjohn @malcolmgreaves @pstjohn @trvachov @jwilber
-docs @dorotat-nv @jstjohn @malcolmgreaves @pstjohn @trvachov @jwilber
+**.md @dorotat-nv @jstjohn @malcolmgreaves @pstjohn @trvachov @jwilber @tshimko-nv
+docs @dorotat-nv @jstjohn @malcolmgreaves @pstjohn @trvachov @jwilber @tshimko-nv
 
 
 #

Dockerfile

Lines changed: 20 additions & 2 deletions
@@ -19,7 +19,7 @@
 # https://gitlab-master.nvidia.com/dl/JoC/nemo-ci/-/blob/main/.gitlab-ci.yml
 # We should keep versions in our container up to date to ensure that we get the latest tested perf improvements and
 # training loss curves from NeMo.
-ARG BASE_IMAGE=nvcr.io/nvidia/pytorch:25.01-py3
+ARG BASE_IMAGE=nvcr.io/nvidia/pytorch:25.04-py3
 
 FROM rust:1.86.0 AS rust-env
 
@@ -56,6 +56,13 @@ apt-get upgrade -qyy \
 rm -rf /tmp/* /var/tmp/*
 EOF
 
+
+## BUMP TE as a solution to the issue https://github.com/NVIDIA/bionemo-framework/issues/422. Drop this when pytorch images ship the fixed commit.
+ARG TE_TAG=9d4e11eaa508383e35b510dc338e58b09c30be73
+RUN PIP_CONSTRAINT= NVTE_FRAMEWORK=pytorch NVTE_WITH_USERBUFFERS=1 MPI_HOME=/usr/local/mpi \
+pip --disable-pip-version-check --no-cache-dir install \
+git+https://github.com/NVIDIA/TransformerEngine.git@${TE_TAG}
+
 # Install AWS CLI based on architecture
 RUN if [ "$TARGETARCH" = "arm64" ]; then \
 curl "https://awscli.amazonaws.com/awscli-exe-linux-aarch64.zip" -o "awscliv2.zip"; \
@@ -68,6 +75,7 @@ RUN if [ "$TARGETARCH" = "arm64" ]; then \
 ./aws/install && \
 rm -rf aws awscliv2.zip
 
+
 # Use a branch of causal_conv1d while the repository works on Blackwell support.
 ARG CAUSAL_CONV_TAG=52e06e3d5ca10af0c7eb94a520d768c48ef36f1f
 RUN CAUSAL_CONV1D_FORCE_BUILD=TRUE pip --disable-pip-version-check --no-cache-dir install git+https://github.com/trvachov/causal-conv1d.git@${CAUSAL_CONV_TAG}
@@ -123,6 +131,12 @@ fi
 ###############################################################################
 # /end ARM
 ###############################################################################
+# Fix the version of scikit-misc to 0.3.1 because newer versions of scikit-misc require numpy >= 2.0 to be built.
+# Since there are not pre-built wheels for arm64, we need to install this specific version.
+# Once bionemo is compatible with numpy >= 2.0, we can remove this.
+# Technically, this is only needed for the ARM build, but we apply to all architectures to avoid library version
+# divergence.
+RUN pip install scikit-misc==0.3.1
 
 # Mamba dependancy installation
 RUN pip --disable-pip-version-check --no-cache-dir install \
@@ -188,14 +202,16 @@ rm -rf nvidia-resiliency-ext/
 sed -i "/ngcsdk/d" ./sub-packages/bionemo-core/pyproject.toml
 # Remove llama-index because bionemo doesn't use it and it adds CVEs to container
 sed -i "/llama-index/d" ./3rdparty/NeMo/requirements/requirements_nlp.txt
+# Pin 'nvidia-modelopt' to 0.27.1 due to an API incompatibility of version 0.25.0
+sed -i -E "s|nvidia-modelopt\[torch\]>=[^,]+,<=([^ ;]+)|nvidia-modelopt[torch]==\1|" ./3rdparty/NeMo/requirements/requirements_nlp.txt
 uv pip install --no-build-isolation \
 ./3rdparty/* \
 ./sub-packages/bionemo-* \
 -r /requirements-cve.txt \
 -r /requirements-test.txt
 
 # Install back ngcsdk, as a WAR for the protobuf version conflict with nemo_toolkit.
-uv pip install ngcsdk
+uv pip install ngcsdk==3.64.3 # Temporary fix for changed filename, see https://nvidia.slack.com/archives/C074Z808N05/p1746231345981209
 
 # Addressing security scan issue - CVE vulnerability https://github.com/advisories/GHSA-g4r7-86gm-pgqc The package is a
 # dependency of lm_eval from NeMo requirements_eval.txt. We also remove zstandard, another dependency of lm_eval, which
@@ -322,6 +338,8 @@ COPY ./docs ./docs
 COPY --from=rust-env /usr/local/cargo /usr/local/cargo
 COPY --from=rust-env /usr/local/rustup /usr/local/rustup
 
+# Fix a CRIT vuln: https://github.com/advisories/GHSA-vqfr-h8mv-ghfj
+RUN uv pip install h11==0.16.0
 
 # RUN rm -rf /usr/local/cargo /usr/local/rustup
 RUN chmod 777 -R /workspace/bionemo2/
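
Note on the `nvidia-modelopt` pin in the Dockerfile hunk above: the `sed -E` expression rewrites a `>=low,<=high` range into an exact `==high` pin, so the pinned version is whatever upper bound NeMo's requirements file carries. A minimal Python sketch of the same substitution follows. The example requirement line is an assumption for illustration and may not match the actual entry in `requirements_nlp.txt`; `pin_to_upper_bound` is a hypothetical helper name.

```python
import re

# Same pattern and replacement as the sed expression in the Dockerfile hunk.
PIN_PATTERN = re.compile(r"nvidia-modelopt\[torch\]>=[^,]+,<=([^ ;]+)")


def pin_to_upper_bound(requirement_line: str) -> str:
    """Rewrite a '>=low,<=high' range into an exact '==high' pin."""
    return PIN_PATTERN.sub(r"nvidia-modelopt[torch]==\1", requirement_line)


# Hypothetical requirement line; the real entry may differ.
example = "nvidia-modelopt[torch]>=0.25.0,<=0.27.1 ; platform_system != 'Darwin'"
print(pin_to_upper_bound(example))
```

Lines that do not match the range pattern pass through unchanged, which is also how the `sed` command behaves.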

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-2.5
+2.6

ci/benchmarks/partial-conv/geneformer_pretrain.yaml

Lines changed: 4 additions & 2 deletions
@@ -13,12 +13,12 @@ script_args:
 data_path: /data/cellxgene_scdl
 model: geneformer
 variant: train
-config_name: geneformer_config
+config_name: 10M
 precision: [bf16-mixed]
 nodes: [2]
 gpus: 8
 batch_size: 32
-max_steps: 37000
+max_steps: 30000
 lr: 0.001
 val_check_interval: 500
 acc_grad: 1
@@ -38,10 +38,12 @@ script: |-
 --resume-if-exists \
 --log-every-n-steps 50 \
 --lr ${lr} \
+--create-tflops-callback \
 --create-tensorboard-logger \
 --result-dir=${tensorboard_dir} \
 --wandb-project ${wandb_project_name} \
 --wandb-job-type=${pipeline_label} \
+--wandb-group=${model}_${variant}_${config_name}__${target} \
 --cosine-rampup-frac 0.004331629559040111 \
 --cosine-hold-frac 0.021658147795200554 \
 --accumulate-grad-batches ${acc_grad} \
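
Note on the new `--wandb-group` value in the hunk above: it is assembled from four template variables. The sketch below shows how the name would expand, assuming the benchmark runner substitutes `${...}` placeholders the way Python's `string.Template` does (the runner's actual templating engine is not shown in this commit). The `target` value is a made-up placeholder; the other three values come from `script_args` in this file.

```python
from string import Template

# Placeholder string copied from the --wandb-group flag in the hunk above.
group_template = Template("${model}_${variant}_${config_name}__${target}")

values = {
    "model": "geneformer",        # from script_args
    "variant": "train",           # from script_args
    "config_name": "10M",         # new value introduced by this commit
    "target": "example-cluster",  # hypothetical; not defined in this diff
}

wandb_group = group_template.substitute(values)
print(wandb_group)
```

With the `config_name` change from `geneformer_config` to `10M`, runs from before and after this commit would land in differently named groups.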

ci/benchmarks/perf/esm2_pretrain.yaml

Lines changed: 11 additions & 5 deletions
@@ -3,6 +3,7 @@ time_limit: 1800
 key_segments:
 # Modify keys to be renamed (str) or excluded (False) from run identifier. By default, all args under script_args are included.
 data_path: False
+dfpnl: False
 script_args:
 # All arguments referenced in the script string must be specified here.
 # Arguments not referenced in the script string must have the 'arg' field specified.
@@ -17,24 +18,28 @@ script_args:
 stop_steps: 200
 gpus: 8
 acc_grad: 1
+dfpnl: ""
 products:
 - nodes: 1
 batch_size: 16
 pp: 1
 tp: 1
-# FIXME (broken pp): https://github.com/NVIDIA/bionemo-framework/issues/784
-# - nodes: 2
-# batch_size: 16
-# pp: 2
-# tp: 1
+dfpnl: ""
+- nodes: 2
+batch_size: 16
+pp: 2
+tp: 1
+dfpnl: "--decoder-first-pipeline-num-layers=17"
 - nodes: 2
 batch_size: 16
 pp: 1
 tp: 2
+dfpnl: ""
 - nodes: 2
 batch_size: 16
 pp: 1
 tp: 1
+dfpnl: ""
 script: |-
 WANDB_API_KEY=$BIONEMO_WANDB_API_KEY ${variant}_${model} \
 --train-cluster-path=${data_path}/train_clusters.parquet \
@@ -51,6 +56,7 @@ script: |-
 --min-seq-length=1024 \
 --max-seq-length=1024 \
 --num-layers=33 \
+${dfpnl} \
 --hidden-size=1280 \
 --num-attention-heads=20 \
 --ffn-hidden-size=5120 \
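
Note on the re-enabled `pp: 2` product above: the script uses `--num-layers=33`, which does not divide evenly across two pipeline stages, and the new flag assigns 17 layers to the first stage. The sketch below works through that arithmetic. It assumes the remaining layers are spread evenly over the remaining stages, which is how the flag is understood to behave in Megatron-style pipeline parallelism; `pipeline_layer_split` is a hypothetical helper, not part of the benchmark tooling.

```python
def pipeline_layer_split(num_layers, pipeline_stages, first_stage_layers=None):
    """Return the number of layers per pipeline stage.

    Assumption: without an override the layers must divide evenly; with an
    override, the layers left after the first stage are split evenly over
    the remaining stages.
    """
    if first_stage_layers is None:
        if num_layers % pipeline_stages != 0:
            raise ValueError("layers do not divide evenly across stages")
        return [num_layers // pipeline_stages] * pipeline_stages

    remaining_layers = num_layers - first_stage_layers
    remaining_stages = pipeline_stages - 1
    if remaining_stages == 0:
        if remaining_layers != 0:
            raise ValueError("single stage must hold every layer")
        return [first_stage_layers]
    if remaining_layers % remaining_stages != 0:
        raise ValueError("remaining layers do not divide evenly")
    return [first_stage_layers] + [remaining_layers // remaining_stages] * remaining_stages


# Values from the pp=2 product: 33 layers, 2 stages, 17 layers on the first stage.
print(pipeline_layer_split(33, 2, first_stage_layers=17))
```

For the products with `pp: 1`, `dfpnl` is an empty string, so `${dfpnl}` contributes no flag to the command line.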

ci/benchmarks/perf/geneformer_pretrain.yaml

Lines changed: 2 additions & 1 deletion
@@ -13,7 +13,7 @@ script_args:
 data_path: /data/cellxgene_scdl
 model: geneformer
 variant: train
-config_name: geneformer_config
+config_name: 10M
 precision: [bf16-mixed]
 gpus: 8
 max_steps: 1000
@@ -41,6 +41,7 @@ script: |-
 --micro-batch-size ${batch_size} \
 --resume-if-exists \
 --log-every-n-steps 50 \
+--create-tflops-callback \
 --lr ${lr} \
 --create-tensorboard-logger \
 --result-dir=${tensorboard_dir} \

ci/scripts/run_pytest.sh

Lines changed: 16 additions & 4 deletions
@@ -25,11 +25,13 @@ usage() {
 Usage: $(basename "$0") [OPTIONS]
 
 Options:
---skip-docs Skip running tests in the docs directory
---no-nbval Skip jupyter notebook validation tests
---skip-slow Skip tests marked as slow (@pytest.mark.slow)
---only-slow Only run tests marked as slow (@pytest.mark.slow)
+--skip-docs Skip running tests in the docs directory
+--no-nbval Skip jupyter notebook validation tests
+--skip-slow Skip tests marked as slow (@pytest.mark.slow)
+--only-slow Only run tests marked as slow (@pytest.mark.slow)
 --allow-no-tests Allow sub-packages with no found tests (for example no slow tests if --only-slow is set)
+--ignore-files Skip files from tests using glob patterns (comma-separated, no spaces).
+Example: --ignore-files docs/*.ipynb,src/specific_test.py
 
 Note: Documentation tests (docs/) are only run when notebook validation
 is enabled (--no-nbval not set) and docs are not skipped
@@ -54,6 +56,8 @@ NO_NBVAL=false
 SKIP_SLOW=false
 ONLY_SLOW=false
 ALLOW_NO_TESTS=false
+# TODO(@cspades): Ignore this Evo2 notebook test, which has a tendency to leave a 32GB orphaned process in GPU.
+declare -a IGNORE_FILES=("sub-packages/bionemo-evo2/examples/fine-tuning-tutorial.ipynb")
 error=false
 
 # Parse command line arguments
@@ -64,6 +68,10 @@ while (( $# > 0 )); do
 --skip-slow) SKIP_SLOW=true ;;
 --only-slow) ONLY_SLOW=true ;;
 --allow-no-tests) ALLOW_NO_TESTS=true ;;
+--ignore-files)
+shift
+IFS=',' read -ra IGNORE_FILES <<< "$1"
+;;
 -h|--help) usage ;;
 *) echo "Unknown option: $1" >&2; usage 1 ;;
 esac
@@ -82,6 +90,10 @@ PYTEST_OPTIONS=(
 --cov-append
 --cov-report=xml:coverage.xml
 )
+# Add multiple file ignores if specified
+for ignore_file in "${IGNORE_FILES[@]}"; do
+PYTEST_OPTIONS+=(--ignore-glob="$ignore_file")
+done
 [[ "$NO_NBVAL" != true ]] && PYTEST_OPTIONS+=(--nbval-lax)
 [[ "$SKIP_SLOW" == true ]] && PYTEST_OPTIONS+=(-m "not slow")
 [[ "$ONLY_SLOW" == true ]] && PYTEST_OPTIONS+=(-m "slow")
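
Note on `--ignore-files` in the hunks above: `read -ra IGNORE_FILES` overwrites the array, so passing the flag replaces the default Evo2 notebook entry instead of appending to it. The Python sketch below models that behaviour for well-formed input. It is an approximation of the shell logic, not a port: `build_ignore_options` is a hypothetical helper, and empty fields are dropped here for simplicity, which may differ from how `read` treats stray commas.

```python
# Default taken from the `declare -a IGNORE_FILES=(...)` line in the diff.
DEFAULT_IGNORE_FILES = [
    "sub-packages/bionemo-evo2/examples/fine-tuning-tutorial.ipynb",
]


def build_ignore_options(ignore_files_arg=None):
    """Return the --ignore-glob options that would be added to pytest.

    ignore_files_arg is the comma-separated value given to --ignore-files,
    or None when the flag is absent.
    """
    if ignore_files_arg is None:
        patterns = list(DEFAULT_IGNORE_FILES)
    else:
        # The flag replaces the defaults; it does not extend them.
        patterns = [p for p in ignore_files_arg.split(",") if p]
    return [f"--ignore-glob={pattern}" for pattern in patterns]


print(build_ignore_options())
print(build_ignore_options("docs/*.ipynb,src/specific_test.py"))
```

A caller who wants to keep skipping the Evo2 notebook while ignoring additional files would need to list it explicitly in the `--ignore-files` value.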

docs/README.md

Lines changed: 38 additions & 0 deletions
@@ -21,7 +21,45 @@ docker run --rm -it -p 8000:8000 \
 And then navigate to [`http://0.0.0.0:8000`](http://0.0.0.0:8000) on your local
 machine.
 
+## Sub-package Documentation
+
+When adding documentation for a new sub-package, ensure it is properly integrated into the documentation site by:
+
+1. Adding an entry to `docs/docs/user-guide/examples/SUMMARY.md` to include it in the Tutorials section
+2. Adding an entry to `docs/docs/user-guide/developer-guide/SUMMARY.md` to include it in the Developer Guide section
+
+This ensures the sub-package documentation is properly indexed and accessible through the navigation menu.
+
+The sub-package specific documentation itself must be placed alongside the sub-package code in the `sub-packages/bionemo-<sub-package-name>/` directory:
+
+- `README.md` - A root level file that describes the sub-package and how to use it.
+- `examples/` - A directory that contains documentation or examples specific to the sub-package, in the form of `.md` or `.ipynb` files.
+- `assets/` - A folder that contains any static assets used in any of the above files, e.g. `.png` files.
+
+When the docs are built, these documentation files will be fetched (via the [scripts/gen_ref_pages.py](./scripts/gen_ref_pages.py) script) for rendering in the main documentation site.
+
+- The `README.md` will be rendered as an individual page in the `User Guide -> Developer Guide -> <sub-package-name>/` section of the documentation site.
+- Every file in the `examples/` directory will be rendered as an individual page in the `User Guide -> Tutorials -> <sub-package-name>/` section of the documentation site.
+
+An example sub-package structure is shown below:
+
+```
+bionemo-<sub-package-name>/
+└── assets/
+├── example_1.png
+├── examples/
+│ ├── example_1.md
+│ └── example_2.ipynb
+├── src/
+├── tests/
+├── LICENSE
+├── pyproject.toml
+├── README.md
+├── VERSION
+```
+
 ## Hiding/collapsing `.ipynb` cells
+
 To remove cells from the rendered `mkdocs` html you can add a `remove-cell` tag to the cell. Note that `remove-output` is also an option to hide outputs but not the code cell. Unfortunately
 `remove-input` does not seem to be supported.
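
Note on the sub-package layout described in the README hunk above: a quick local check can confirm that a sub-package has the files the docs build expects before its `SUMMARY.md` entries are added. The sketch below is a hypothetical helper and is not part of `scripts/gen_ref_pages.py`. It assumes `README.md` and `examples/` are the items worth flagging and treats `assets/` as optional, which the README does not state explicitly.

```python
import tempfile
from pathlib import Path

DOC_SUFFIXES = {".md", ".ipynb"}


def missing_doc_items(sub_package_dir):
    """List the expected documentation items that are absent."""
    root = Path(sub_package_dir)
    missing = []
    if not (root / "README.md").is_file():
        missing.append("README.md")
    if not (root / "examples").is_dir():
        missing.append("examples/")
    return missing


def example_pages(sub_package_dir):
    """Return the example files that would become Tutorials pages."""
    examples = Path(sub_package_dir) / "examples"
    if not examples.is_dir():
        return []
    return sorted(p.name for p in examples.iterdir() if p.suffix in DOC_SUFFIXES)


# Demo on a throwaway directory that mimics the layout from the README.
with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "bionemo-example"
    (pkg / "examples").mkdir(parents=True)
    (pkg / "README.md").write_text("# bionemo-example\n")
    (pkg / "examples" / "example_1.md").write_text("# Example 1\n")
    (pkg / "examples" / "example_2.ipynb").write_text("{}\n")
    print(missing_doc_items(pkg))
    print(example_pages(pkg))
```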
