Commit 66fb3c9

Merge branch 'master' into deepspeed_exclude_frozen

2 parents 43090f8 + e55650d
File tree: 14 files changed, +107 −50 lines

docs/source-fabric/advanced/compile.rst

Lines changed: 1 addition & 1 deletion
@@ -417,7 +417,7 @@ Additional Resources
 
 Here are a few resources for further reading after you complete this tutorial:
 
-- `PyTorch 2.0 Paper <https://pytorch.org/blog/pytorch-2-paper-tutorial/>`_
+- `PyTorch 2.0 Paper <https://pytorch.org/get-started/pytorch-2-x/>`_
 - `GenAI with PyTorch 2.0 blog post series <https://pytorch.org/blog/accelerating-generative-ai-4/>`_
 - `Training Production AI Models with PyTorch 2.0 <https://pytorch.org/blog/training-production-ai-models/>`_
 - `Empowering Models with Performance: The Art of Generalized Model Transformation Approach <https://pytorch.org/blog/empowering-models-performance/>`_

docs/source-pytorch/advanced/compile.rst

Lines changed: 1 addition & 1 deletion
@@ -396,7 +396,7 @@ Additional Resources
 
 Here are a few resources for further reading after you complete this tutorial:
 
-- `PyTorch 2.0 Paper <https://pytorch.org/blog/pytorch-2-paper-tutorial/>`_
+- `PyTorch 2.0 Paper <https://pytorch.org/get-started/pytorch-2-x/>`_
 - `GenAI with PyTorch 2.0 blog post series <https://pytorch.org/blog/accelerating-generative-ai-4/>`_
 - `Training Production AI Models with PyTorch 2.0 <https://pytorch.org/blog/training-production-ai-models/>`_
 - `Empowering Models with Performance: The Art of Generalized Model Transformation Approach <https://pytorch.org/blog/empowering-models-performance/>`_

docs/source-pytorch/versioning.rst

Lines changed: 37 additions & 38 deletions
@@ -53,12 +53,8 @@ API Evolution
 
 Lightning's development is driven by research and best practices in a rapidly developing field of AI and machine learning. Change is inevitable and when it happens, the Lightning team is committed to minimizing user friction and maximizing ease of transition from one version to the next. We take backwards compatibility and reproducibility very seriously.
 
-For API removal, renaming or other forms of backwards-incompatible changes, the procedure is:
-
-#. A deprecation process is initiated at a minor version ``MAJOR.MINOR.PATCH`` (e.g. ``1.5.0``), producing a deprecation warning at runtime and removing it from the documentation.
-#. The deprecated API remains unchanged during the deprecation phase for two minor versions or the next major update, whichever comes first.
-#. The breaking change is done in version ``MAJOR.(MINOR+2).0`` (e.g. ``1.7.0``), or ``(MAJOR+1).0.0`` (e.g. ``2.0.0``), whichever comes first.
-#. From that version onward, the deprecation warning gets converted into a helpful error, which will remain until next major release.
+Excepting extenuating circumstances (e.g. a critical bug), API removal, renaming or other forms of backwards-incompatible changes are limited to major version upgrades — that is ``(MAJOR+1).0.0``.
+Concretely, a breaking change for an API introduced in ``2.x.x`` can be introduced with Lightning ``3.0.0``.
 
 This policy is not strict. Shorter or longer deprecation cycles may apply to some cases.
 For example, in the past DDP2 was removed without a deprecation process because the feature was broken and unusable beyond fixing as discussed in `#12584 <https://github.com/Lightning-AI/pytorch-lightning/issues/12584>`_.
@@ -69,6 +65,7 @@ Compatibility matrix
 
 PyTorch Lightning follows `NEP 29 <https://numpy.org/neps/nep-0029-deprecation_policy.html>`_ which PyTorch also follows (`#74203 <https://github.com/pytorch/pytorch/issues/74203>`_).
 The table below indicates the coverage of tested versions in our CI. Versions outside the ranges may unofficially work in some cases.
+Since the release of PyTorch `2.0`, Lightning strives to officially support the latest 5 PyTorch minor releases with no breaking changes within major versions [1]_.
 
 .. list-table::
    :header-rows: 1
@@ -82,102 +79,104 @@ The table below indicates the coverage of tested versions in our CI. Versions outside the ranges may unofficially work in some cases.
    * - 2.5
      - 2.5
      - 2.5
-     - ≥2.1, ≤2.7
+     - ≥2.1, (last tested 2.8)
      - ≥0.7.0
-     - ≥3.9, 3.12
+     - ≥3.9, (last tested 3.12)
    * - 2.4
      - 2.4
      - 2.4
-     - ≥2.1, 2.6
+     - ≥2.1, (last tested 2.6)
      - ≥0.7.0
-     - ≥3.9, 3.12
+     - ≥3.9, (last tested 3.12)
    * - 2.3
      - 2.3
      - 2.3
-     - ≥2.0, 2.3
+     - ≥2.0, (last tested 2.3)
      - ≥0.7.0
-     - ≥3.8, 3.11
+     - ≥3.8, (last tested 3.11)
    * - 2.2
      - 2.2
      - 2.2
-     - ≥1.13, 2.2
+     - ≥1.13, (last tested 2.2)
      - ≥0.7.0
-     - ≥3.8, 3.11
+     - ≥3.8, (last tested 3.11)
    * - 2.1
      - 2.1
      - 2.1
-     - ≥1.12, 2.1
+     - ≥1.12, (last tested 2.1)
      - ≥0.7.0
-     - ≥3.8, 3.11
+     - ≥3.8, (last tested 3.11)
    * - 2.0
      - 2.0
      - 2.0 (GA)
-     - ≥1.11, 2.0
+     - ≥1.11, (last tested 2.0)
      - ≥0.7.0
-     - ≥3.8, 3.10
+     - ≥3.8, (last tested 3.10)
    * - 1.9
      - 1.9
      - 1.9 (experimental)
-     - ≥1.10, 1.13
+     - ≥1.10, (last tested 1.13)
      - ≥0.7.0
-     - ≥3.7, 3.10
+     - ≥3.7, (last tested 3.10)
    * - 1.8**
      - 1.8
      - n/a***
-     - ≥1.10, 1.13
+     - ≥1.10, (last tested 1.13)
      - ≥0.7.0
-     - ≥3.7, 3.10
+     - ≥3.7, (last tested 3.10)
    * - n/a
      - 1.7
      - n/a***
-     - ≥1.9, 1.12
+     - ≥1.9, (last tested 1.12)
      - ≥0.7.0
-     - ≥3.7, 3.10
+     - ≥3.7, (last tested 3.10)
    * - n/a
      - 1.6
      - n/a***
-     - ≥1.8, 1.11
+     - ≥1.8, (last tested 1.11)
      - ≥0.4.1
-     - ≥3.7, 3.9
+     - ≥3.7, (last tested 3.9)
    * - n/a
      - 1.5
      - n/a***
-     - ≥1.7, 1.10
+     - ≥1.7, (last tested 1.10)
      - ≥0.4.1
-     - ≥3.6, 3.9
+     - ≥3.6, (last tested 3.9)
    * - n/a
      - 1.4
      - n/a
-     - ≥1.6, 1.9
+     - ≥1.6, (last tested 1.9)
      - ≥0.4.0
-     - ≥3.6, 3.9
+     - ≥3.6, (last tested 3.9)
    * - n/a
      - 1.3
      - n/a
-     - ≥1.4, 1.8
+     - ≥1.4, (last tested 1.8)
      - ≥0.2.0
-     - ≥3.6, 3.9
+     - ≥3.6, (last tested 3.9)
    * - n/a
      - 1.2
      - n/a
-     - ≥1.4, 1.8
+     - ≥1.4, (last tested 1.8)
      - n/a*
-     - ≥3.6, 3.8
+     - ≥3.6, (last tested 3.8)
    * - n/a
      - 1.1
      - n/a
-     - ≥1.3, 1.8
+     - ≥1.3, (last tested 1.8)
      - n/a*
-     - ≥3.6, 3.8
+     - ≥3.6, (last tested 3.8)
    * - n/a
      - 1.0
      - n/a
-     - ≥1.3, 1.7
+     - ≥1.3, (last tested 1.7)
      - n/a*
-     - ≥3.6, 3.8
+     - ≥3.6, (last tested 3.8)
 
 \* ``torchmetrics`` was part of ``pytorch_lightning`` at the time and was decoupled to a separate package in v1.3.
 
 \*\* The joint ``lightning`` package was first published in version 1.8
 
 \*\*\* Fabric is the evolution of ``LightningLite`` which was released inside ``pytorch_lightning`` 1.5 and was decoupled to a separate package in v1.9
+
+.. [1] See `this community discussion <https://github.com/Lightning-AI/pytorch-lightning/issues/21073#issuecomment-3201706857>`_.
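The "latest 5 PyTorch minor releases" policy added above can be sketched as a small helper. This is an illustration only, not a Lightning API: `supported_torch_minors` is a hypothetical name, the sketch assumes plain `MAJOR.MINOR` strings, ignores patch releases, and does not walk below the `2.0` major-version boundary.

```python
# Hypothetical sketch (not part of Lightning): enumerate the PyTorch minor
# releases covered by a "latest N minors" support window ending at `latest`.
def supported_torch_minors(latest: str, count: int = 5) -> list[str]:
    """Return up to `count` most recent PyTorch minor versions, newest first."""
    major, minor = (int(part) for part in latest.split("."))
    versions = []
    for _ in range(count):
        versions.append(f"{major}.{minor}")
        if minor == 0:
            break  # stop at the major-version boundary (e.g. do not go below 2.0)
        minor -= 1
    return versions

print(supported_torch_minors("2.8"))  # ['2.8', '2.7', '2.6', '2.5', '2.4']
```

With `latest="2.8"` this matches the current matrix row: Lightning 2.5 is tested against ≥2.1 with 2.8 as the last tested release.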

requirements/docs.txt

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 sphinx >5.0, <6.0
-myst-parser >=0.18.1, <4.0.0
+myst-parser >=0.18.1, <5.0.0
 nbsphinx >=0.8.5, <=0.9.7
 nbconvert >7.14, <7.17
 pandoc >=1.0, <=2.4
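To see what the relaxed `myst-parser` pin admits, here is a minimal hand-rolled range check. It is a sketch under stated assumptions: plain `X.Y.Z` versions only, no pre-release or wildcard handling; real tooling should use `packaging.specifiers` instead.

```python
# Toy pip-style specifier check (illustrative only; assumes numeric X.Y.Z versions).
import operator

OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq,
       ">": operator.gt, "<": operator.lt}

def parse(version: str) -> tuple[int, ...]:
    return tuple(int(part) for part in version.split("."))

def satisfies(version: str, spec: str) -> bool:
    """Check `version` against a comma-separated spec like '>=0.18.1, <5.0.0'."""
    for clause in spec.split(","):
        clause = clause.strip()
        # Try the longest operators first so ">=" is not matched as ">".
        for op in sorted(OPS, key=len, reverse=True):
            if clause.startswith(op):
                if not OPS[op](parse(version), parse(clause[len(op):].strip())):
                    return False
                break
    return True

print(satisfies("4.0.0", ">=0.18.1, <5.0.0"))  # True under the new pin
print(satisfies("4.0.0", ">=0.18.1, <4.0.0"))  # False under the old pin
```

The point of the change is visible in the two calls: a hypothetical `myst-parser` 4.x release is excluded by the old `<4.0.0` cap and allowed by the new `<5.0.0` one.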

requirements/fabric/test.txt

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-coverage ==7.10.4
+coverage ==7.10.5
 numpy >=1.21.0, <1.27.0
 pytest ==8.4.1
 pytest-cov ==6.2.1

requirements/pytorch/docs.txt

Lines changed: 1 addition & 1 deletion
@@ -4,6 +4,6 @@ nbformat # used for generate empty notebook
 ipython[notebook] <9.5.0
 setuptools<81.0 # workaround for `error in ipython setup command: use_2to3 is invalid.`
 
-onnxscript >= 0.2.2, <0.4.0
+onnxscript >= 0.2.2, < 0.5.0
 
 #-r ../../_notebooks/.actions/requires.txt

requirements/pytorch/test.txt

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
-coverage ==7.10.4
+coverage ==7.10.5
 pytest ==8.4.1
 pytest-cov ==6.2.1
 pytest-timeout ==2.4.0
@@ -11,7 +11,7 @@ scikit-learn >0.22.1, <1.8.0
 numpy >1.20.0, <1.27.0
 onnx >1.12.0, <1.19.0
 onnxruntime >=1.12.0, <1.23.0
-onnxscript >= 0.1.0, <0.4.0
+onnxscript >= 0.1.0, < 0.5.0
 psutil <7.0.1 # for `DeviceStatsMonitor`
 pandas >2.0, <2.4.0 # needed in benchmarks
 fastapi # for `ServableModuleValidator` # not setting version as re-defined in App

src/lightning/fabric/CHANGELOG.md

Lines changed: 3 additions & 0 deletions
@@ -12,6 +12,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Added `exclude_frozen_parameters` to `DeepSpeedStrategy` ([#21060](https://github.com/Lightning-AI/pytorch-lightning/pull/21060))
 
 
+- Added support for NVIDIA H200 GPUs in `get_available_flops` ([#20913](https://github.com/Lightning-AI/pytorch-lightning/pull/21119))
+
+
 ### Removed
 
 -
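The `exclude_frozen_parameters` option this branch adds can be illustrated with a torch-free sketch of the underlying idea: when enabled, parameters that are frozen (not trainable) are left out of the saved state. `Param` and `checkpoint_names` are illustrative names, not Lightning or DeepSpeed APIs; the real flag is forwarded to DeepSpeed's checkpoint saving.

```python
# Conceptual sketch only: which parameter names end up in a checkpoint when
# frozen (requires_grad=False) parameters are excluded.
from dataclasses import dataclass

@dataclass
class Param:
    name: str
    requires_grad: bool  # False means the parameter is frozen

def checkpoint_names(params: list[Param], exclude_frozen_parameters: bool) -> list[str]:
    """Names that would be written to the checkpoint."""
    if not exclude_frozen_parameters:
        return [p.name for p in params]
    return [p.name for p in params if p.requires_grad]

params = [Param("backbone.weight", False), Param("head.weight", True)]
print(checkpoint_names(params, exclude_frozen_parameters=True))   # ['head.weight']
print(checkpoint_names(params, exclude_frozen_parameters=False))  # both names
```

This matters for fine-tuning workflows where a large frozen backbone would otherwise dominate checkpoint size.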

src/lightning/fabric/utilities/throughput.py

Lines changed: 23 additions & 1 deletion
@@ -304,6 +304,23 @@ def measure_flops(
 
 _CUDA_FLOPS: dict[str, dict[Union[str, torch.dtype], float]] = {
     # Hopper
+    # source: https://nvdam.widen.net/s/nb5zzzsjdf/hpc-datasheet-sc23-h200-datasheet-3002446
+    "h200 sxm1": {
+        torch.float64: 3.4e13,
+        torch.float32: 6.7e13,
+        "tfloat32": 9.9e14,
+        torch.bfloat16: 2.0e15,
+        torch.float16: 2.0e15,
+        torch.int8: 4.0e15,
+    },
+    "h200 nvl1": {
+        torch.float64: 3.0e13,
+        torch.float32: 6.0e13,
+        "tfloat32": 8.4e14,
+        torch.bfloat16: 1.7e15,
+        torch.float16: 1.7e15,
+        torch.int8: 3.3e15,
+    },
     # source: https://resources.nvidia.com/en-us-tensor-core
     "h100 nvl": {
         torch.float64: 67e12,
@@ -536,7 +553,12 @@ def get_available_flops(device: torch.device, dtype: Union[torch.dtype, str]) ->
     if device.type == "cuda":
         device_name = torch.cuda.get_device_name(device)
         chip = device_name.lower()
-        if "h100" in chip:
+        if "h200" in chip:
+            if "sxm1" in chip:
+                chip = "h200 sxm1"
+            elif "nvl1" in chip:
+                chip = "h200 nvl1"
+        elif "h100" in chip:
             if "hbm3" in chip:
                 chip = "h100 sxm"
             elif "nvl" in chip:
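The chip-matching logic added above can be exercised standalone. This is a torch-free sketch, not the real `get_available_flops`: dtype keys are plain strings here (the actual table keys on `torch` dtypes), `available_flops` is an illustrative name, and only a few entries of the table are reproduced. Note the ordering: H200 variants are checked before H100 so that an "H200 ..." device name cannot fall through to an H100 entry via substring overlap.

```python
# Illustrative peak-FLOPS lookup mirroring the substring matching in the diff.
from typing import Optional

_CUDA_FLOPS = {
    "h200 sxm1": {"bfloat16": 2.0e15, "int8": 4.0e15},
    "h200 nvl1": {"bfloat16": 1.7e15, "int8": 3.3e15},
}

def available_flops(device_name: str, dtype: str) -> Optional[float]:
    """Map a CUDA device name to a peak-FLOPS table entry, or None if unknown."""
    chip = device_name.lower()
    # Check "h200" before any "h100" handling, exactly as the new branch does.
    if "h200" in chip:
        if "sxm1" in chip:
            chip = "h200 sxm1"
        elif "nvl1" in chip:
            chip = "h200 nvl1"
    entry = _CUDA_FLOPS.get(chip)
    return None if entry is None else entry.get(dtype)

print(available_flops("NVIDIA H200 SXM1", "bfloat16"))  # 2e+15
print(available_flops("Some Unknown GPU", "int8"))      # None
```

Returning `None` for unmatched names stands in for the real function's behavior of warning and reporting no available FLOPS when the chip is not in the table.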

src/lightning/pytorch/CHANGELOG.md

Lines changed: 3 additions & 0 deletions
@@ -42,6 +42,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 
 - Fixed misalignment column while using rich model summary in `DeepSpeedstrategy` ([#21100](https://github.com/Lightning-AI/pytorch-lightning/pull/21100))
 
+
+- Fixed `RichProgressBar` crashing when sanity checking using val dataloader with 0 len ([#21108](https://github.com/Lightning-AI/pytorch-lightning/pull/21108))
+
 ---
 
 ## [2.5.3] - 2025-08-13
