Commit 98dea89

Merge branch 'main' into m4/5_initialize
# Conflicts:
#   src/megatron/bridge/training/config.py
#   src/megatron/bridge/training/gpt_step.py
#   src/megatron/bridge/training/optim.py
2 parents bd92363 + ea844b9 commit 98dea89

75 files changed, +1684 -785 lines changed

.github/actions/test-template/action.yml

Lines changed: 5 additions & 2 deletions
@@ -51,8 +51,10 @@ inputs:
     required: true
   test-data-path:
     description: "Test data path"
-    required: false
-    default: "/mnt/datadrive/TestData/nemo-fw/TestData"
+    required: true
+  runner:
+    description: "Runner to use for test"
+    required: true
 
 runs:
   using: "composite"
@@ -103,6 +105,7 @@ runs:
 
     - name: Install uuidgen
       shell: bash -x -e -u -o pipefail {0}
+      if: ${{ contains(inputs.runner, 'aws') }}
       run: |
         apt-get update
         apt-get install -y uuid-runtime
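
Review note: the new `if:` guard means the "Install uuidgen" step only runs when the `runner` input contains `aws`. GitHub's `contains()` does a case-insensitive substring match on strings. A minimal Python sketch of that check follows; the function name and the runner labels in the usage lines are hypothetical, chosen only for illustration.

```python
def should_install_uuidgen(runner: str) -> bool:
    """Mirror `contains(inputs.runner, 'aws')` for a string input.

    GitHub Actions' contains() ignores case when searching a string,
    so both sides are lowercased before the substring test.
    """
    return "aws" in runner.lower()


# Hypothetical runner labels, for illustration only.
print(should_install_uuidgen("nemo-ci-aws-gpu-x2"))
print(should_install_uuidgen("nemo-ci-azure-gpu-x2"))
```

Because `runner` is now a required input, callers that omit it would fail validation rather than silently skip the step.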

.github/workflows/cicd-main.yml

Lines changed: 3 additions & 1 deletion
@@ -65,7 +65,7 @@ env:
 
 jobs:
   pre-flight:
-    uses: NVIDIA-NeMo/FW-CI-templates/.github/workflows/_cicd_preflight.yml@v0.69.0
+    uses: NVIDIA-NeMo/FW-CI-templates/.github/workflows/_cicd_preflight.yml@v0.69.1
     with:
       default_runner_prefix: ${{ vars.DEFAULT_RUNNER_PREFIX }}
       non_nvidia_runner_prefix: ${{ vars.NON_NVIDIA_RUNNER_PREFIX }}
@@ -369,6 +369,7 @@ jobs:
           PAT: ${{ secrets.PAT }}
           container-image: ${{ env.container-registry }}/megatron-bridge:${{ github.sha }}
           test-data-path: ${{ needs.pre-flight.outputs.test_data_path }}
+          runner: ${{ needs.pre-flight.outputs.runner_prefix }}-gpu-x2
 
   cicd-functional-tests:
     strategy:
@@ -464,6 +465,7 @@ jobs:
           PAT: ${{ secrets.PAT }}
           container-image: ${{ env.container-registry }}/megatron-bridge:${{ github.sha }}
           test-data-path: ${{ needs.pre-flight.outputs.test_data_path }}
+          runner: ${{ needs.pre-flight.outputs.runner_prefix }}-gpu-x2
 
   Nemo_CICD_Test:
     needs:

.github/workflows/install-test.yml

Lines changed: 1 addition & 1 deletion
@@ -57,7 +57,7 @@ concurrency:
 
 jobs:
   pre-flight:
-    uses: NVIDIA-NeMo/FW-CI-templates/.github/workflows/_cicd_preflight.yml@v0.69.0
+    uses: NVIDIA-NeMo/FW-CI-templates/.github/workflows/_cicd_preflight.yml@v0.69.1
     with:
       default_runner_prefix: ${{ vars.INSTALL_TEST_DEFAULT_RUNNER_PREFIX }}
       non_nvidia_runner_prefix: ${{ vars.INSTALL_TEST_NON_NVIDIA_RUNNER_PREFIX }}

.github/workflows/release-docs.yml

Lines changed: 1 addition & 1 deletion
@@ -58,7 +58,7 @@ jobs:
       dry-run: ${{ inputs.dry-run }}
       artifacts-name: docs-html
       artifacts-path: _build/html
-      emails-csv: ${{ inputs.notify-emails }}
+      emails-csv: ${{ inputs.notify-emails && format('{0},{1}', vars.docs_release_emails, inputs.notify-emails) || vars.docs_release_emails }}
       overwrite-latest-on-tag: false
       run-on-version-tag-only: ${{ github.ref_name != 'main' }}
       request-name: megatron-bridge-publish-docs-${{ github.run_id }}
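
Review note: the new `emails-csv` expression uses the `a && b || c` idiom as a ternary. When `inputs.notify-emails` is non-empty, the repository-level `vars.docs_release_emails` list is prepended to it; otherwise only the repository-level list is used. The Python sketch below restates that logic. The function name, parameter names, and sample addresses are hypothetical and not part of the workflow.

```python
def resolve_emails_csv(default_emails: str, notify_emails: str = "") -> str:
    """Restate the workflow's `emails-csv` expression in Python.

    default_emails stands in for vars.docs_release_emails and
    notify_emails for inputs.notify-emails (both comma-separated).
    """
    if notify_emails:
        # format('{0},{1}', ...) branch: defaults first, then the extras.
        return f"{default_emails},{notify_emails}"
    # Fallback branch: only the repository-level default list.
    return default_emails


# Hypothetical addresses, for illustration only.
print(resolve_emails_csv("docs@example.com"))
print(resolve_emails_csv("docs@example.com", "a@example.com,b@example.com"))
```

One consequence worth checking: if `vars.docs_release_emails` is unset and `notify-emails` is provided, the formatted value starts with a leading comma.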

3rdparty/Megatron-LM

Submodule Megatron-LM updated 876 files

docker/Dockerfile.ci

Lines changed: 4 additions & 1 deletion
@@ -12,7 +12,8 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-FROM nvcr.io/nvidia/pytorch:25.11-py3
+ARG BASE_IMAGE=nvcr.io/nvidia/pytorch:25.11-py3
+FROM ${BASE_IMAGE} AS megatron_bridge
 WORKDIR /opt/Megatron-Bridge
 ENV PATH="/root/.local/bin:$PATH"
 ENV UV_PROJECT_ENVIRONMENT=/opt/venv
@@ -45,3 +46,5 @@ RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
         uv sync --link-mode copy --locked --all-extras --all-groups; \
     fi && \
     uv cache prune
+
+COPY . /opt/Megatron-Bridge

docs/megatron-lm-to-megatron-bridge.md

Lines changed: 1 addition & 1 deletion
@@ -281,7 +281,7 @@ Additional distributed/optimizer overlap settings:
 | --- | --- | --- |
 | `--error-injection-rate` | `rerun_state_machine.error_injection_rate` | Frequency of injected validation perturbations. |
 | `--error-injection-type` | `rerun_state_machine.error_injection_type` | Kind of injection (correct/transient/persistent). |
-| `--rerun-mode` | `rerun_state_machine.rerun_mode` | Disabled/validate_results/report_stats. |
+| `--rerun-mode` | `rerun_state_machine.rerun_mode` | Disabled/validate_results/report_determinism_stats. |
 
 ### Data / Tokenizer args
 

docs/training/packed-sequences.md

Lines changed: 1 addition & 0 deletions
@@ -60,6 +60,7 @@ The {py:class}`bridge.data.datasets.packed_sequence.PackedSequenceSpecs` class p
 | `packed_train_data_path` | `str` | `None` | Custom path for packed training dataset file (`.npy` format). |
 | `packed_val_data_path` | `str` | `None` | Custom path for packed validation dataset file (`.npy` format). |
 | `packed_metadata_path` | `str` | `None` | Custom path for packing metadata file (`.jsonl` format). |
+| `pad_seq_to_mult` | `int \| None` | `None` | Pad each sample to a multiple of this value when generating packed datasets (e.g., set to `2 * context_parallel_size` for THD CP). |
 | `pad_cu_seqlens` | `bool` | `False` | Whether to pad `cu_seqlens` to constant size, required for CUDA graphs. |
 
 ### Batch Size Considerations
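
Review note: the new `pad_seq_to_mult` row describes rounding each sample's length up to a multiple of the given value. The sketch below shows only that arithmetic; `padded_length` is a hypothetical helper written for this note, not a function from `megatron.bridge`, and treating `None` as "no padding" is an assumption based on the documented default.

```python
from typing import Optional


def padded_length(length: int, pad_seq_to_mult: Optional[int] = None) -> int:
    """Round `length` up to the next multiple of `pad_seq_to_mult`.

    Assumption: None (the documented default) or a value <= 1 means
    the sample length is left unchanged.
    """
    if pad_seq_to_mult is None or pad_seq_to_mult <= 1:
        return length
    # Ceiling division, then scale back up to the multiple.
    return -(-length // pad_seq_to_mult) * pad_seq_to_mult


# Following the table's suggestion of 2 * context_parallel_size:
context_parallel_size = 2
multiple = 2 * context_parallel_size  # 4
print(padded_length(10, multiple))
print(padded_length(8, multiple))
```

With `context_parallel_size = 2`, a 10-token sample would be padded to 12 tokens, while an 8-token sample is already aligned and stays at 8.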

docs/training/resiliency.md

Lines changed: 3 additions & 3 deletions
@@ -438,7 +438,7 @@ from megatron.bridge.training.config import RerunStateMachineConfig
 
 # Configure re-run state machine in your config
 config.rerun_state_machine = RerunStateMachineConfig(
-    rerun_mode="validate_results", # or "report_stats" or "disabled"
+    rerun_mode="validate_results", # or "report_determinism_stats" or "disabled"
     check_for_nan_in_loss=True,
     check_for_spiky_loss=False,
     error_injection_rate=0, # For testing only
@@ -450,7 +450,7 @@ config.rerun_state_machine = RerunStateMachineConfig(
 
 | Parameter | Type | Default | Description |
 |-----------|------|---------|-------------|
-| `rerun_mode` | `str` | `"disabled"` | Operating mode: `"disabled"`, `"validate_results"`, or `"report_stats"` |
+| `rerun_mode` | `str` | `"disabled"` | Operating mode: `"disabled"`, `"validate_results"`, or `"report_determinism_stats"` |
 | `check_for_nan_in_loss` | `bool` | `True` | Check for NaN values in loss |
 | `check_for_spiky_loss` | `bool` | `False` | Check for unexpectedly large loss values |
 | `error_injection_rate` | `int` | `0` | Rate for injecting test errors (testing only) |
@@ -463,7 +463,7 @@ config.rerun_state_machine = RerunStateMachineConfig(
 - **Behavior**: Training proceeds normally without any result checking.
 - **Use Case**: When re-run overhead is not acceptable or validation is not needed.
 
-#### 2. Report Stats Mode (`report_stats`)
+#### 2. Report Stats Mode (`report_determinism_stats`)
 - **Purpose**: Collect statistics on computational determinism.
 - **Behavior**: Re-runs every step once to measure variability.
 - **Output**: Reports on computational non-determinism without stopping training.

examples/conversion/hf_to_megatron_generate_nemotron_vlm.py

Lines changed: 2 additions & 1 deletion
@@ -150,7 +150,8 @@ def process_image_inputs(processor, image_path: Optional[str], prompt: str, syst
         image_paths = image_path.split(",")
         content = []
         for i, path in enumerate(image_paths):
-            content.append({"type": "text", "text": f"{'\n' if i > 0 else ''}Image-{i + 1}: "})
+            prefix = "\n" if i > 0 else ""
+            content.append({"type": "text", "text": f"{prefix}Image-{i + 1}: "})
             content.append({"type": "image", "image": path})
         content.append({"type": "text", "text": "\n" + prompt})
     else:
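
Review note: the removed line put `'\n'` inside the expression part of an f-string, which is a `SyntaxError` on Python versions before 3.12 (PEP 701 lifted that restriction). Hoisting the value into `prefix` keeps the file importable on older interpreters. The standalone sketch below reproduces the fixed loop; `build_image_content` is a hypothetical wrapper written for this note, and it assumes the final prompt entry is appended once, after the loop.

```python
def build_image_content(image_paths, prompt):
    """Build the multi-image chat content list, as in the fixed loop."""
    content = []
    for i, path in enumerate(image_paths):
        # Compute the separator outside the f-string so no backslash
        # appears inside the {...} expression.
        prefix = "\n" if i > 0 else ""
        content.append({"type": "text", "text": f"{prefix}Image-{i + 1}: "})
        content.append({"type": "image", "image": path})
    # Assumption: the prompt is appended once, after all images.
    content.append({"type": "text", "text": "\n" + prompt})
    return content


# Hypothetical file names, for illustration only.
for item in build_image_content(["a.png", "b.png"], "Describe both images."):
    print(item)
```

Every image after the first gets a leading newline before its `Image-N: ` label, which matches the behavior the original f-string intended.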
