
eckit::linalg::sparse::LinearAlgebraTorch backend #165

Merged
pmaciel merged 10 commits into develop from feature/torch on Mar 3, 2026

Conversation

@pmaciel
Member

@pmaciel pmaciel commented Feb 18, 2025

This PR adds a sparse linear algebra backend to allow for GPU-based matrix multiplications (and other operations), which translates into a significant performance increase for interpolations under the right conditions (right environment, advanced use of mir).

It makes use of a deployed version of PyTorch (findable by CMake), specifically its lower-level component "Torch", which is part of the same package (this is how it is released to the public). I've exposed all hardware configuration options available at present. Obviously, the better development is to improve the whole workflow to avoid copies to/from the CPU/GPU, so this development is purely a stepping stone -- it has already allowed me to run both mars-client (C) and pgen on GPUs, and of course mir. It would be great to follow this up with a publication. Possibly, this could be configurable in earthkit-regrid for maximum marketing :-)

I've held back this development for several months, but I couldn't get a definite answer on when to post it -- so here it is.

🌦️ >> Documentation << 🌦️
https://sites.ecmwf.int/docs/dev-section/eckit/pull-requests/PR-165

@codecov-commenter

codecov-commenter commented Feb 18, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 66.33%. Comparing base (8714484) to head (08c4a29).
⚠️ Report is 11 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #165      +/-   ##
===========================================
- Coverage    66.35%   66.33%   -0.02%     
===========================================
  Files         1126     1126              
  Lines        57668    57668              
  Branches      4403     4403              
===========================================
- Hits         38264    38253      -11     
- Misses       19404    19415      +11     


@wdeconinck
Member

This looks neat!
I'd like to test this before merging, preferably on our HPC.
Could you get a working Torch that I could link against, e.g. with a setup using the nvidia compiler?

As an aside, it would be nice to clean up and remove unused backends (armadillo, viennacl, ...)

@pmaciel pmaciel force-pushed the feature/torch branch 2 times, most recently from 06d774e to b548c96 on February 20, 2025 09:43
@pmaciel
Member Author

pmaciel commented Feb 12, 2026

This PR brings GPU to production environments (possibly). At the time of adding GPU-powered mir matrix multiplication to AIFS (late 2024) I didn't want the typical production environment to be without this advantage, hence the PR.

Now I've updated this PR, which comes at a good time for the review of the linear algebra backends. I've just done some light testing on ag (which is modern), and the test build (for me) was:

  • module load prgenv/gnu cuda/12.9 (default 12.8 doesn't compile this backend correctly)
  • pytorch from python3/new (-DTorch_DIR=/usr/local/apps/python3/3.12.11-01/lib/python3.12/site-packages/torch/share/cmake/Torch)

prgenv/nvidia seems incompatible with the python3/new deployment. You might find other options that work.
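Putting the steps above together, a full configure on that system would look roughly like the following. This is a sketch under the assumptions stated in the comment; the ENABLE_TORCH option name is inferred from the "TORCH feature option" in CMakeLists.txt and may not match exactly.

```shell
# Environment as described above (default cuda/12.8 miscompiles the backend)
module load prgenv/gnu cuda/12.9

# Point CMake at the Torch config shipped inside the PyTorch deployment;
# ENABLE_TORCH is an assumed feature-option name.
cmake -S eckit -B build \
    -DENABLE_TORCH=ON \
    -DTorch_DIR=/usr/local/apps/python3/3.12.11-01/lib/python3.12/site-packages/torch/share/cmake/Torch

cmake --build build --parallel
```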

There are a number of supported devices mapped from this LinearAlgebraSparse backend; of note:

  • 'torch-cpu' is universal (also just 'torch')
  • 'torch-cuda' is maybe the interesting one (even for AMD ROCm architecture)
  • 'torch-mps' is for Apple's NPU, but it requires converting to single floating-point precision and CSR isn't working, so this is only for testing.

This PR only tests the default (cpu) device; testing the other devices is out of scope.
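For readers unfamiliar with the operation being accelerated: the sparse backend's core primitive is the CSR matrix-vector product (spmv). A minimal plain-Python sketch of what that computes is below; the function name and layout are illustrative only, not eckit's or Torch's API.

```python
def csr_spmv(row_ptr, col_idx, values, n_rows, x):
    """y = A @ x for A stored in CSR form.

    row_ptr : row-pointer array, length n_rows + 1
    col_idx : column index of each stored entry
    values  : value of each stored entry
    """
    y = [0.0] * n_rows
    for i in range(n_rows):
        # entries of row i live in the half-open slab [row_ptr[i], row_ptr[i+1])
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

# A = [[2, 0], [0, 3]] in CSR: two rows, one stored entry each
print(csr_spmv([0, 1, 2], [0, 1], [2.0, 3.0], 2, [1.0, 1.0]))  # [2.0, 3.0]
```

On the GPU backends this same computation is dispatched to a Torch CSR tensor on the selected device, with the copies to/from the device being the overhead the PR description mentions.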

@pmaciel pmaciel requested a review from Ozaq February 12, 2026 15:57
@pmaciel
Member Author

pmaciel commented Feb 12, 2026

@Ozaq maybe as part of the streamlining of the LinearAlgebra parts for release/2.0.0 (pending the review, of course)

@pmaciel pmaciel self-assigned this Feb 13, 2026
@pmaciel
Member Author

pmaciel commented Feb 13, 2026

I've now added the dense backend version (e.g. for spherical harmonics). This was just for completeness -- please review as necessary
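For reference, the dense counterpart of the operations named below (dot, gemv, gemm) is an ordinary matrix product; a tiny plain-Python gemm sketch follows, purely to illustrate what the dense backend offloads to Torch (no eckit or Torch names are used).

```python
def gemm(A, B):
    """C = A @ B for dense row-major matrices given as nested lists."""
    n, k, m = len(A), len(B), len(B[0])
    assert all(len(row) == k for row in A), "inner dimensions must agree"
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

print(gemm([[1.0, 2.0]], [[3.0], [4.0]]))  # [[11.0]]
```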

Copilot AI (Contributor) left a comment

Pull request overview

This PR adds PyTorch-based linear algebra backends (both dense and sparse) to enable GPU-accelerated matrix operations in eckit. The implementation leverages PyTorch's lower-level Torch library to support various hardware accelerators including CUDA, HIP, MPS, XPU, XLA, and Meta devices, with CPU as the default fallback.

Changes:

  • Added Torch backend support for both dense and sparse linear algebra operations with multiple device type options
  • Created helper functions in detail/Torch.h/cc for tensor creation and device management
  • Added CMake configuration and tests for the Torch backend

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 12 comments.

Summary per file:

  • CMakeLists.txt: adds the TORCH feature option with the Torch package dependency
  • src/eckit/linalg/CMakeLists.txt: adds the Torch source files to the build and links the torch library
  • src/eckit/linalg/detail/Torch.h: declares helper functions for Torch tensor operations and device management
  • src/eckit/linalg/detail/Torch.cc: implements tensor conversion functions and device selection logic
  • src/eckit/linalg/dense/LinearAlgebraTorch.h: declares the dense linear algebra backend using Torch
  • src/eckit/linalg/dense/LinearAlgebraTorch.cc: implements dense operations (dot, gemv, gemm) with device support
  • src/eckit/linalg/sparse/LinearAlgebraTorch.h: declares the sparse linear algebra backend using Torch
  • src/eckit/linalg/sparse/LinearAlgebraTorch.cc: implements sparse operations (spmv, spmm) with device support
  • tests/linalg/CMakeLists.txt: adds test configurations for the Torch dense and sparse backends


Contributor

Copilot AI commented Feb 17, 2026

@pmaciel I've opened a new pull request, #266, to work on those changes. Once the pull request is ready, I'll request review from you.

@pmaciel pmaciel merged commit 4b70d08 into develop Mar 3, 2026
76 checks passed
@pmaciel pmaciel deleted the feature/torch branch March 3, 2026 10:22