
Commit 7af9601

cp: [docs] 25.11 release notes into r0.2.0 (#1518)
Signed-off-by: Ananth Subramaniam <[email protected]>
1 parent feaab97 commit 7af9601

5 files changed: 92 additions & 50 deletions

CHANGELOG.md

Lines changed: 58 additions & 0 deletions
@@ -1,5 +1,63 @@
 # Changelog
 
+## NVIDIA Megatron-Bridge 0.2.0
+
+* Model Collection Support
+
+  * LLM
+    * HuggingFace conversion + training recipes:
+      * GPT-OSS
+      * Qwen3 Next
+      * Nemotron-H
+      * Nemotron Nano v2
+      * Moonlight
+      * OLMoE
+      * GLM 4.5
+      * Gemma 3
+    * HuggingFace conversion support:
+      * Llama Nemotron
+      * Mistral
+      * Gemma
+      * Gemma 2
+  * VLM
+    * Nemotron Nano v2 VL
+    * Qwen 3 VL
+    * Qwen2.5 VL
+    * Gemma3 VL
+
+* Performance
+  * Megatron-Bridge support for new benchmarks
+    * Benchmarks for the GB300 system (same workloads as the GB200 system)
+      * GPT-OSS 120B
+      * Qwen3-Next 80B_a3B
+    * Support for linear attention on Blackwell (Gated Delta Networks)
+    * Pre-training with NVFP4 precision: Llama3 8B, Llama3 70B, Llama3.1 405B
+  * Megatron-Bridge support for benchmarks previously available only in NeMo 2.0
+    * Nemotron-H 56B
+    * Fine-tuning (SFT and LoRA): Llama3 8B and Llama3 70B
+  * HybridEP: DeepSeek V3 benchmarks on GB200 and GB300 systems now use HybridEP
+  * CUDA Graphs
+    * Full-model iteration CUDA graph used for dense models: Llama3 8B, Llama3 70B, Llama3.1 405B
+    * Fine-grained, Transformer-component-specific CUDA graphs used for MoE models
+
+* NVIDIA Model Optimizer integration
+  * Knowledge distillation
+  * Post-training quantization export
+  * Quantization-aware training
+
+* Enhanced LoRA support
+  * Support for expert layers
+  * Support for merging adapters for export to HuggingFace
+
+* Fine-tuning dataset improvements: OpenAI messages format conversion, chat template support
+* Integration with NVIDIA-DLFW-Inspect for tensor statistics collection & monitoring
+* Support for sample-based training
+
+## NVIDIA Megatron-Bridge 0.1.0rc4
+
+* Fix docs build
+* Update performance scripts
+
 ## NVIDIA Megatron-Bridge 0.1.0rc3
 
 * Model Collection Support
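The "Enhanced LoRA support" entry above mentions merging adapters for export to HuggingFace. As an illustration only, the sketch below shows what adapter merging looks like using the Hugging Face PEFT API; the base model name and adapter path are placeholders, and Megatron-Bridge's own exporter (not shown in this diff) may differ.

```python
# Illustrative sketch only: merge a trained LoRA adapter into its base model so
# the result can be saved as a plain HuggingFace checkpoint. Uses the HF PEFT
# API for illustration; the checkpoint name and adapter path are placeholders.
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
lora = PeftModel.from_pretrained(base, "path/to/lora-adapter")  # hypothetical adapter dir
merged = lora.merge_and_unload()             # fold the LoRA deltas into the base weights
merged.save_pretrained("merged-checkpoint")  # loadable with plain transformers, no adapter needed
```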
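The 0.2.0 notes also list fine-tuning dataset improvements: OpenAI messages format conversion and chat template support. The sketch below illustrates that data format and how a chat template renders it into a single training string; it uses the Hugging Face tokenizer API and an arbitrary example checkpoint, not Megatron-Bridge's own dataset utilities.

```python
# Illustrative sketch only: an OpenAI-style "messages" conversation rendered to
# text with a model's chat template. The checkpoint name is an arbitrary example.
from transformers import AutoTokenizer

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What changed in Megatron-Bridge 0.2.0?"},
    {"role": "assistant", "content": "New model bridges, new benchmarks, and enhanced LoRA support."},
]

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)  # the conversation rendered with the model's chat template
```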

docs/releases/changelog.md

Lines changed: 3 additions & 49 deletions
@@ -1,50 +1,4 @@
-# Changelog
-
-## 25.09.01 NeMo Framework Container
-
-- Fix docs build
-- Update performance scripts
-
-## 25.09 NeMo Framework Container
-
-### Model Collection Support
-
-- Llama
-- Qwen 2, Qwen 3, Qwen 3 MoE
-- DeepSeek
-- Mamba
-- [Migration guide from Nemo 2 to Megatron Bridge](https://docs.nvidia.com/nemo/megatron-bridge/0.1.0/nemo2-migration-guide.html)
-- [Contribution guide for adding a new model](https://docs.nvidia.com/nemo/megatron-bridge/0.1.0/adding-new-models.html)
-- [Checkpoint conversion from Hugging Face to Megatron Bridge](https://docs.nvidia.com/nemo/megatron-bridge/0.1.0/bridge-guide.html#get-started-with-hugging-face-conversion)
-
-### [Performance](https://docs.nvidia.com/nemo/megatron-bridge/0.1.0/performance-summary.html)
-
-#### MoE LLM
-
-- Change the model to dropless with balanced gating
-- Fusion of operators in router function
-- Global permutation fusion with A2A dispatcher
-- EP A2A communication overlap with computation in both 1F1B pipelining and non-pipelined training
-- Precision-aware optimizer update to support BF16 states
-
-#### Megatron FSDP
-
-- Migration from mcore FSDP to megatron FSDP
-- Fusion of weight gradient copy to reduce-scatter communication buffer to WGRAD GEMM
-- Removed redundant optimizer operations
-- Use Zero1 (opt and master param sharding) in the replica domain of hybrid FSDP to further lower memory usage
-- IB-SHARP support for the IB AllReduce of hybrid FSDP in a patch with NCCL2.28
-
-#### MXFP8
-
-- Improved act grad all-gather overlap performance via userbuffer
-- Parameter all-gather overlap with computation while the communication buffer sharing with reduce-scatter
-- Fusion of MXFP8 scaling factor swizzling kernels
-- Use PDL (Programmatic Dependent Launch) for quantization kernels to lower CPU overhead
-
-#### Others
-
-- Full iteration cuda graph for dense model without pipelining
-- Fusion of activation and cast (currently tensor-wise scaling only)
-- Store SwiGLU input in FP8 to save activation memory
+```{include} ../../CHANGELOG.md
+:relative-docs: docs/
+```

docs/releases/known-issues.md

Lines changed: 6 additions & 0 deletions
@@ -2,6 +2,12 @@
 
 This page lists known issues and limitations in the current release.
 
+## 25.11
+
+- DeepSeek V3 on H100 fails with `RuntimeError: DeepEP error: timeout (dispatch CPU)` when DeepEP is used.
+- MODEL_TFLOP/s/GPU is printed as 0 to stdout for all hybrid models, such as Nemotron-H 56B.
+
+
 ## 25.09
 
 - **Pretraining DeepSeek in subchannel FP8 precision is not working.** Pretraining DeepSeek with current scaling FP8 is a workaround, but MTP loss does not converge.

docs/releases/software-versions.md

Lines changed: 24 additions & 0 deletions
@@ -1,5 +1,29 @@
 # Software Component Versions
 
+## NeMo Framework 25.11
+
+| Software Component | Version |
+|--------------------|---------|
+| PyTorch | 2.9.0a0 |
+| Megatron Core | dev:0.15.0 |
+| Transformer Engine | 2.9 |
+| Megatron-Bridge | 0.2.0 |
+| Megatron-FSDP | 0.2.0 |
+| Export-Deploy | 0.3.0 |
+| Evaluator | 0.2.0 |
+| NeMo | 2.6.0 |
+| NeMo Run | 0.7.0 |
+| TRT-ModelOpt | 0.37.0 |
+| NVRX | 0.4.1 |
+| CUDA | 13.0.1 |
+| cuDNN | 9.13.1.26 |
+| TRT-LLM | 1.1.0a0 |
+
+```{note}
+The NVIDIA NeMo™ Framework Training container is built on top of the NVIDIA Optimized Frameworks PyTorch 25.06 container: https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html
+```
+
+
 ## NeMo Framework 25.09
 
 | Software Component | Version |

src/megatron/bridge/package_info.py

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@
 MAJOR = 0
 MINOR = 2
 PATCH = 0
-PRE_RELEASE = "rc7"
+PRE_RELEASE = ""
 
 # Use the following formatting: (major, minor, patch, pre-release)
 VERSION = (MAJOR, MINOR, PATCH, PRE_RELEASE)
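Clearing PRE_RELEASE from "rc7" to "" is what turns the release candidate into the final 0.2.0 version. A minimal sketch of how such a version tuple is typically rendered into a version string follows; the actual assembly code in package_info.py is not part of this diff, so treat the expression below as an assumption.

```python
# Minimal sketch, assuming the common pattern of concatenating the tuple fields
# into a PEP 440 style version string; not the literal package_info.py code.
MAJOR, MINOR, PATCH, PRE_RELEASE = 0, 2, 0, ""

__version__ = f"{MAJOR}.{MINOR}.{PATCH}{PRE_RELEASE}"

assert __version__ == "0.2.0"  # with PRE_RELEASE = "rc7" this would be "0.2.0rc7"
```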
