@Borda Borda commented Jun 25, 2025

What does this PR do?

Brings the GPU testing environment to parity with the CPU one, including testing against the oldest supported dependencies.

Before submitting
  • Was this discussed/agreed via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

@github-actions github-actions bot added the ci Continuous Integration label Jun 25, 2025

github-actions bot commented Jun 25, 2025

⚡ Required checks status: All passing 🟢

Groups summary

🟢 pytorch_lightning: Tests workflow
Check ID         Status
pl-cpu-guardian  success

These checks are required after the changes to requirements/fabric/base.txt, requirements/fabric/examples.txt, requirements/fabric/strategies.txt, requirements/pytorch/base.txt, requirements/pytorch/examples.txt, requirements/pytorch/strategies.txt.

🟢 pytorch_lightning: Azure GPU
Check ID                                              Status
pytorch-lightning (GPUs) (testing Lightning | latest)  success
pytorch-lightning (GPUs) (testing PyTorch | latest)    success

These checks are required after the changes to .azure/gpu-tests-pytorch.yml, requirements/pytorch/base.txt, requirements/pytorch/examples.txt, requirements/pytorch/strategies.txt, requirements/fabric/base.txt, requirements/fabric/examples.txt, requirements/fabric/strategies.txt.

🟢 pytorch_lightning: Benchmarks
Check ID              Status
lightning.Benchmarks  success

These checks are required after the changes to requirements/fabric/base.txt, requirements/fabric/examples.txt, requirements/fabric/strategies.txt, requirements/pytorch/base.txt, requirements/pytorch/examples.txt, requirements/pytorch/strategies.txt.

🟢 fabric: Docs
Check ID                     Status
docs-make (fabric, doctest)  success
docs-make (fabric, html)     success

These checks are required after the changes to requirements/fabric/base.txt, requirements/fabric/examples.txt, requirements/fabric/strategies.txt.

🟢 pytorch_lightning: Docs
Check ID                      Status
docs-make (pytorch, doctest)  success
docs-make (pytorch, html)     success

These checks are required after the changes to requirements/pytorch/base.txt, requirements/pytorch/examples.txt, requirements/pytorch/strategies.txt.

🟢 pytorch_lightning: Docker
Check ID                             Status
build-cuda (3.10, 2.1.2, 12.1.1)     success
build-cuda (3.11, 2.2.2, 12.1.1)     success
build-cuda (3.11, 2.3.1, 12.1.1)     success
build-cuda (3.11, 2.4.1, 12.1.1)     success
build-cuda (3.12, 2.5.1, 12.1.1)     success
build-cuda (3.12, 2.6.0, 12.4.1)     success
build-pl (3.10, 2.1, 12.1.1)         success
build-pl (3.11, 2.2, 12.1.1)         success
build-pl (3.11, 2.3, 12.1.1)         success
build-pl (3.11, 2.4, 12.1.1)         success
build-pl (3.12, 2.5, 12.1.1)         success
build-pl (3.12, 2.6, 12.4.1)         success
build-pl (3.12, 2.7, 12.6.3, true)   success

These checks are required after the changes to requirements/pytorch/base.txt, requirements/pytorch/examples.txt, requirements/pytorch/strategies.txt, requirements/fabric/base.txt, requirements/fabric/examples.txt, requirements/fabric/strategies.txt.

🟢 lightning_fabric: CPU workflow
Check ID             Status
fabric-cpu-guardian  success

These checks are required after the changes to requirements/fabric/base.txt, requirements/fabric/examples.txt, requirements/fabric/strategies.txt.

🟢 lightning_fabric: Azure GPU
Check ID                                              Status
lightning-fabric (GPUs) (testing Fabric | latest)     success
lightning-fabric (GPUs) (testing Lightning | latest)  success

These checks are required after the changes to .azure/gpu-tests-fabric.yml, requirements/fabric/base.txt, requirements/fabric/examples.txt, requirements/fabric/strategies.txt.

🟢 mypy
Check ID  Status
mypy      success

These checks are required after the changes to requirements/fabric/base.txt, requirements/fabric/examples.txt, requirements/fabric/strategies.txt, requirements/pytorch/base.txt, requirements/pytorch/examples.txt, requirements/pytorch/strategies.txt, src/version.info.

🟢 install
Check ID              Status
install-pkg-guardian  success

These checks are required after the changes to src/version.info, requirements/fabric/base.txt, requirements/fabric/examples.txt, requirements/fabric/strategies.txt, requirements/pytorch/base.txt, requirements/pytorch/examples.txt, requirements/pytorch/strategies.txt.


Thank you for your contribution! 💜

Note
This comment is automatically generated and is refreshed every 180 seconds for 60 minutes. If you have any other questions, contact carmocca for help.

@github-actions github-actions bot added fabric lightning.fabric.Fabric pl Generic label for PyTorch Lightning package dependencies Pull requests that update a dependency file labels Jun 25, 2025
This reverts commit c362ab6.

codecov bot commented Jun 25, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86%. Comparing base (242d80f) to head (eb8cca3).
⚠️ Report is 190 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff            @@
##           master   #20939    +/-   ##
========================================
- Coverage      87%      86%    -1%     
========================================
  Files         268      268            
  Lines       23453    23453            
========================================
- Hits        20404    20273   -131     
- Misses       3049     3180   +131     
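
The percentages in the coverage table follow directly from the hit and line counts. As a minimal sketch, assuming coverage is reported as hits divided by lines and rounded to the nearest whole percent (the rounding rule is an assumption, not something the report states):

```python
# Figures copied from the Codecov diff above.
LINES = 23453
BASE_HITS, HEAD_HITS = 20404, 20273


def coverage_percent(hits: int, lines: int) -> int:
    """Return coverage as a whole-number percentage (assumed rounding rule)."""
    return round(100 * hits / lines)


base_cov = coverage_percent(BASE_HITS, LINES)
head_cov = coverage_percent(HEAD_HITS, LINES)

# Misses are simply the lines that were not hit.
base_misses = LINES - BASE_HITS
head_misses = LINES - HEAD_HITS

print(f"base: {base_cov}% ({base_misses} misses)")
print(f"head: {head_cov}% ({head_misses} misses)")
print(f"hits delta: {HEAD_HITS - BASE_HITS}")
```

The line count is unchanged between base and head, so the one-point drop comes entirely from the 131 lines that moved from hits to misses.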

  # note: is a bug around 0.10 with `MPS_Accelerator must implement all abstract methods`
  # shall be resolved by https://github.com/microsoft/DeepSpeed/issues/4372
- deepspeed >=0.8.2, <=0.9.3; platform_system != "Windows" and platform_system != "Darwin" # strict
+ deepspeed >=0.9.3, <=0.9.3; platform_system != "Windows" and platform_system != "Darwin" # strict

Dropping 0.8, since it would need to be compiled from source.
Also, note that we are quite far behind the latest release, 0.17 🤔
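
To see what the changed specifier does, here is a minimal sketch that compares an inclusive range of `>=0.8.2, <=0.9.3` with `>=0.9.3, <=0.9.3`. It assumes plain dotted release numbers and uses simple tuple comparison rather than pip's full PEP 440 logic; the candidate list is illustrative only, and `parse_version` and `in_range` are hypothetical helpers, not part of any library.

```python
from typing import List, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a plain dotted release string such as "0.9.3" into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def in_range(version: str, lower: str, upper: str) -> bool:
    """Check lower <= version <= upper, mirroring a ">=lower, <=upper" specifier."""
    return parse_version(lower) <= parse_version(version) <= parse_version(upper)


# Illustrative candidates only, not an authoritative release list.
candidates: List[str] = ["0.8.2", "0.8.3", "0.9.0", "0.9.3", "0.9.4", "0.10.0"]

old_allowed = [v for v in candidates if in_range(v, "0.8.2", "0.9.3")]
new_allowed = [v for v in candidates if in_range(v, "0.9.3", "0.9.3")]

print("old range:", old_allowed)
print("new range:", new_allowed)
```

With equal lower and upper bounds, the range collapses to a single version, which matches the `deepspeed ==0.9.3` entry in the merged commit message.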

Why upgrade? Upgrading to v0.17 delivers significant performance, stability, and integration benefits, which are vital for training larger models with improved efficiency and reliability.

  • ZeRO Optimizations:
      • v0.9: Early experiments in partitioning model states for memory savings.
      • v0.17: Advanced refinements (ZeRO-Offload and improved stage 3) enable training massively scaled models.

  • Performance Enhancements:
      • Upgraded distributed communication and fused operations.
      • Better mixed precision (fp16/bf16) support for faster training and efficient hardware usage.

  • Stability & API Maturation:
      • Streamlined configuration, enhanced documentation, and robust testing.
      • Fewer bugs and smoother integration with frameworks like HuggingFace Transformers.

  • Inference Improvements:
      • Expanded inference API with support for quantization.
      • Optimized runtime strategies for production deployment.

  • Ecosystem Integration:
      • Broader compatibility with modern AI tools and libraries.
      • Simplifies building and deploying complex deep learning workflows.

@Borda Borda merged commit a651975 into master Jun 25, 2025
134 checks passed
@Borda Borda deleted the ci/gpu-oldest branch June 25, 2025 15:50
Borda added a commit that referenced this pull request Aug 13, 2025
* ci/gpu: setting oldest dependencies
* pip install "cython<3.0"
* deepspeed ==0.9.3
* typing-extensions >=4.5.0
* PyYAML >5.4
* torchmetrics >0.7.0
* lightning-utilities >=0.10.0

(cherry picked from commit a651975)
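
The idea of "setting oldest dependencies" can be sketched as rewriting each inclusive lower bound into an exact pin. This is only an illustration of the technique, not the tooling this repository actually uses: `pin_to_oldest` is a hypothetical helper, and the sketch deliberately ignores extras, environment markers, and trailing comments.

```python
import re

# Matches "name >=version" at the start of a requirement line (spaces optional).
_LOWER_BOUND = re.compile(
    r"^(?P<name>[A-Za-z0-9_.\-]+)\s*>=\s*(?P<version>[0-9][0-9A-Za-z.]*)"
)


def pin_to_oldest(line: str) -> str:
    """Hypothetical helper: rewrite "pkg >=X, <Y" as "pkg ==X"; leave other lines alone."""
    match = _LOWER_BOUND.match(line.strip())
    if match is None:
        # e.g. "PyYAML >5.4" has no inclusive lower bound to pin to
        return line
    return f"{match['name']} =={match['version']}"


# Requirement lines taken from the commit message above.
requirements = [
    "typing-extensions >=4.5.0",
    "lightning-utilities >=0.10.0",
    "PyYAML >5.4",
]
pinned = [pin_to_oldest(line) for line in requirements]
print(pinned)
```

Entries with an exclusive bound such as `>5.4` are left untouched, because the oldest allowed version cannot be derived from the specifier alone.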