
[hipDNN] add matmul dynamic tolerance#5446

Open
oanaoana wants to merge 5 commits into develop from users/omarin/matmul-ALMIOPEN-1014

Conversation

@oanaoana
Contributor

Motivation

Enable dynamic tolerance calculation for matrix multiplications of different sizes and input data types.

Technical details

  • Add calculateMatmulTolerance<Out, In, Comp>(ITensor&, ITensor&) for computing
    element-wise error bounds on matrix multiplication using Higham's forward error analysis
  • Add shared computeGamma(k, epsilon) utility with automatic linear/statistical
    regime switching, reusable across conv and matmul
  • Replace static matmul tolerance in TestMatmulPlan.cpp with the dynamic calculation
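
The shared gamma utility described above can be sketched as follows. This is a minimal standalone sketch: the linear regime is Higham's deterministic bound gamma_k = k*eps / (1 - k*eps), while the statistical-regime formula and the exact switching condition are illustrative assumptions, not the SDK's actual implementation.

```cpp
#include <cmath>
#include <cstdint>

// Hedged sketch of computeGamma(k, epsilon). The deterministic (linear)
// regime follows Higham's gamma_k = k*eps / (1 - k*eps); the statistical
// regime and the switch condition here are assumptions for illustration.
double computeGamma(uint64_t k, double epsilon)
{
    const double ke = static_cast<double>(k) * epsilon;

    // Linear (deterministic) regime: the classic bound is valid while k*eps < 1.
    if(ke < 1.0)
    {
        return ke / (1.0 - ke);
    }

    // Statistical regime (assumed): probabilistic error growth ~ sqrt(k)*eps,
    // used once the deterministic bound becomes vacuous.
    return std::sqrt(static_cast<double>(k)) * epsilon;
}
```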

Uses the matrix infinity-norm ||A||_inf = max_i (sum_j |A[i,j]|) (max absolute row sum).
This is the correct subordinate norm for element-wise error bounds:

  `max_ij |fl(C)_ij - C_ij| <= gamma_k * ||A||_inf * ||B||_inf`

which matches what allClose validates.

Test Plan

19 unit tests covering 6 type combinations (float, double, half, bfloat16) across linear and statistical gamma regimes, with input/output casting error paths.
Tests use non-uniform row data to validate the infinity-norm implementation.
Three edge case tests verify throws on dimension mismatch, gamma overflow (bf16/K=100), and output overflow (tolerance > half::max).
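
The gamma-overflow edge case (bf16/K=100) can be sanity-checked with plain arithmetic. The 0.5 cap below mirrors the GAMMA_MAX constant shown later in the review thread; whether the shipped guard uses exactly this form is an assumption.

```cpp
#include <cmath>

// Deterministic Higham bound gamma_k = k*eps / (1 - k*eps) evaluated for
// bfloat16, which has 7 stored mantissa bits and hence machine epsilon 2^-7.
// For K = 100 this exceeds the assumed 0.5 cap, so the guard must throw.
double bf16Gamma(double k)
{
    const double bf16Eps = std::ldexp(1.0, -7); // 0.0078125
    return k * bf16Eps / (1.0 - k * bf16Eps);
}
```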

@codecov-commenter

codecov-commenter commented Mar 13, 2026

Codecov Report

❌ Patch coverage is 90.69767% with 8 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...de/hipdnn_test_sdk/utilities/DynamicTolerances.hpp 90.70% 6 Missing and 2 partials ⚠️

❌ Your project status has failed because the head coverage (77.21%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #5446   +/-   ##
========================================
  Coverage    67.27%   67.27%           
========================================
  Files         1844     1844           
  Lines       284014   284099   +85     
  Branches     39839    39848    +9     
========================================
+ Hits        191044   191113   +69     
- Misses       76512    76524   +12     
- Partials     16458    16462    +4     
Flag Coverage Δ *Carryforward flag
hipBLAS 90.67% <ø> (ø) Carriedforward from 02f625e
hipBLASLt 43.49% <ø> (ø) Carriedforward from 02f625e
hipCUB 82.21% <ø> (ø) Carriedforward from 02f625e
hipDNN 85.12% <90.70%> (-0.01%) ⬇️
hipFFT 55.59% <ø> (ø) Carriedforward from 02f625e
hipRAND 76.12% <ø> (ø) Carriedforward from 02f625e
hipSOLVER 68.81% <ø> (ø) Carriedforward from 02f625e
hipSPARSE 84.70% <ø> (ø) Carriedforward from 02f625e
rocBLAS 47.97% <ø> (ø) Carriedforward from 02f625e
rocFFT 53.24% <ø> (ø) Carriedforward from 02f625e
rocRAND 57.07% <ø> (ø) Carriedforward from 02f625e
rocSOLVER 77.21% <ø> (ø) Carriedforward from 02f625e
rocSPARSE 71.48% <ø> (ø) Carriedforward from 02f625e

*This pull request uses carry forward flags.

Files with missing lines Coverage Δ
...de/hipdnn_test_sdk/utilities/DynamicTolerances.hpp 95.69% <90.70%> (-2.54%) ⬇️

... and 4 files with indirect coverage changes


@oanaoana oanaoana marked this pull request as ready for review March 16, 2026 15:21
@oanaoana oanaoana requested a review from a team as a code owner March 16, 2026 15:21
* @return The calculated tolerance value as float.
*/
template <typename OutputType, typename InputType, typename ComputeType = float>
float calculateMatmulTolerance(ITensor& a, ITensor& b)
Contributor

@BrianHarrisonAMD BrianHarrisonAMD Mar 16, 2026


We should update the existing hipBLASLt provider tests to use this path for the tolerance calc, and also verify that it's passing with these changes on a variety of shapes. We don't need to commit the extra tests for now, but we should do some exploratory testing to ensure this calc is working across different matmul problems with the hipBLASLt provider.

Tests in the folder here.

Contributor Author


Yes, I am all on board with this. I'll take a look at the tests.

Comment on lines +564 to +571
// For tensors with batch dimensions, compute across all batches
auto batchCount = static_cast<int64_t>(tensor.elementCount()) / (rows * cols);

double maxRowSum = 0.0;

for(int64_t batch = 0; batch < batchCount; ++batch)
{
auto batchOffset = batch * rows * rowStride;
Contributor

@BrianHarrisonAMD BrianHarrisonAMD Mar 16, 2026


I believe this won't work with padded batches since it's not using the batch stride.

Example: dims {2, 3, 4} with strides {24, 4, 1} (padding between batches)
The function would compute offset batch * 3 * 4 = batch * 12 instead of the correct batch * 24.

I think this can be replaced with iterateAlongDimensions + TensorView::getHostValue().
There is already logic to handle the offsets.

Following something similar to layernorm I think it looks like:

  template <typename T>
  double computeMatrixInfNorm(ITensor& tensor)
  {
      using hipdnn_data_sdk::utilities::iterateAlongDimensions;

      const auto& dims = tensor.dims();
      ConstTensorView<T> view(tensor);

      auto cols = dims.back();

      // outerDims = [batch..., rows] — everything except the last dim
      auto outerDims = std::vector<int64_t>(dims.begin(), dims.end() - 1);

      double maxRowSum = 0.0;

      iterateAlongDimensions(outerDims, [&](const std::vector<int64_t>& outerIndices) {
          double rowSum = 0.0;

          auto fullIndices = outerIndices;
          fullIndices.push_back(0);

          for(int64_t j = 0; j < cols; ++j)
          {
              fullIndices.back() = j;
              rowSum += static_cast<double>(
                  hipdnn_data_sdk::types::fabs(view.getHostValue(fullIndices)));
          }

          maxRowSum = std::max(maxRowSum, rowSum);
      });

      return maxRowSum;
  }

You could also look at making it multi-threaded, but probably not worth it yet.

Contributor Author


Thanks! I wasn't planning to address batched matrices and had no tests for this option. I left the batch handling in place mainly because I was working with ITensor. You’re absolutely right though, there’s no good reason not to handle batching properly, and adjusting the batch stride is the correct fix. Fixed this!

Comment on lines +1179 to +1199
template <typename T>
hipdnn_data_sdk::utilities::Tensor<T>
createTensorFromRowValues(const std::vector<int64_t>& dims,
const std::vector<double>& rowValues)
{
hipdnn_data_sdk::utilities::Tensor<T> tensor(dims);
auto* data = static_cast<T*>(tensor.memory().hostData());

auto rows = dims[dims.size() - 2];
auto cols = dims[dims.size() - 1];

for(int64_t i = 0; i < rows; ++i)
{
for(int64_t j = 0; j < cols; ++j)
{
data[i * cols + j] = static_cast<T>(rowValues[static_cast<size_t>(i)]);
}
}

return tensor;
}
Contributor


I think this is fine for the 2D usage in the file, but if we wanted this as a general utility then we would need to change it to be similar to the suggestions in DynamicTolerances.hpp.

Something like:

  template <typename T>
  Tensor<T> createTensorFromRowValues(const std::vector<int64_t>& dims,
                                      const std::vector<double>& rowValues)
  {
      Tensor<T> tensor(dims);

      auto cols = dims.back();
      auto rows = dims[dims.size() - 2];

      // outerDims = [batch..., rows]
      auto outerDims = std::vector<int64_t>(dims.begin(), dims.end() - 1);

      iterateAlongDimensions(outerDims, [&](const std::vector<int64_t>& outerIndices) {
          auto row = outerIndices.back();
          auto fullIndices = outerIndices;
          fullIndices.push_back(0);

          for(int64_t j = 0; j < cols; ++j)
          {
              fullIndices.back() = j;
              tensor.setHostValue(static_cast<T>(rowValues[static_cast<size_t>(row)]),
                                  fullIndices);
          }
      });

      return tensor;
  }

Contributor Author


Addressed together with the proper batch iterator.

Comment on lines +536 to +551
/**
* @brief Computes the infinity-norm (max absolute row sum) of a matrix tensor.
*
* The infinity-norm is defined as: ||A||_inf = max_i (sum_j |A[i,j]|)
*
* This is the appropriate subordinate matrix norm for element-wise error bounds
* via Higham's analysis: max_ij |error_ij| <= gamma_k * ||A||_inf * ||B||_inf.
*
* Uses strides to correctly handle both packed and non-packed tensor layouts.
*
* @tparam T The data type of the tensor elements.
* @param tensor The input tensor (must have at least 2 dimensions).
* @return The infinity-norm as a double.
*/
template <typename T>
double computeMatrixInfNorm(ITensor& tensor)
Contributor


I think it's good to have this version that uses the exact values, but for conv we used estimated values from the min/max to be able to quickly calculate an estimated worst-case tolerance.

Would it be possible to add a version that uses estimated values similar to conv instead of real values?
I think we should probably have both for every operation, and then we can add in some environment variable, or settings that let us swap between the modes.

That way for quick pre-submit checks we can use the quicker/looser estimation of tolerance, and for nightly / weekly we can instead use the calculated form. The calculated form will also likely need to be converted into a GPU version once we scale up as well. Having the estimated version gives us something to use until all this work is in place.

Contributor


I was wondering if we could keep this pretty much the same and just extract out where the computation of the norm happens, since that seems to be the expensive operation. Of course, I'm not amazing with the tolerance stuff yet, so please correct me if this won't work at all.

  // Extract the common math and take in the norm of A & B.
  template <typename OutputType, typename InputType, typename ComputeType = float>
  float calculateMatmulTolerance(double normA, double normB, uint64_t k)
  {
      static_assert(std::is_same_v<ComputeType, float> || std::is_same_v<ComputeType, double>
                        || std::is_same_v<ComputeType, half> || std::is_same_v<ComputeType, bfloat16>,
                    "ComputeType must be float, double, half, or bfloat16");

      auto epsilon = static_cast<double>(std::numeric_limits<ComputeType>::epsilon());
      double gamma = hipdnn_test_sdk::utilities::computeGamma(k, epsilon);

      constexpr double GAMMA_MAX = 0.5;
      if(gamma >= GAMMA_MAX)
      {
          throw std::overflow_error("Error growth factor gamma >= 0.5...");
      }

      double accumulatedTolerance = gamma * normA * normB;

      auto inputEpsilon = static_cast<double>(std::numeric_limits<InputType>::epsilon());
      if(inputEpsilon < epsilon)
      {
          accumulatedTolerance += 2.0 * normA * normB * epsilon;
      }

      auto outputEpsilon = static_cast<double>(std::numeric_limits<OutputType>::epsilon());
      double castTolerance = 0.0;
      if(outputEpsilon > epsilon)
      {
          castTolerance = normA * normB * outputEpsilon;
      }

      double totalTolerance = accumulatedTolerance + castTolerance;

      if(totalTolerance > static_cast<double>(std::numeric_limits<OutputType>::max()))
      {
          throw std::overflow_error("Calculated tolerance exceeds max of output type.");
      }

      return static_cast<float>(totalTolerance);
  }

Then we can make a fast/loose version that takes in the min/max, cols of A/B and computes a maxNorm of A/B.

 template <typename OutputType, typename InputType, typename ComputeType = float>
  float calculateMatmulToleranceLoose(double minA, double maxA, int64_t colsA,
                                 double minB, double maxB, int64_t colsB)
  {
      double maxAbsA = std::max(std::abs(minA), std::abs(maxA));
      double maxAbsB = std::max(std::abs(minB), std::abs(maxB));
      double normA = static_cast<double>(colsA) * maxAbsA;
      double normB = static_cast<double>(colsB) * maxAbsB;
      auto k = static_cast<uint64_t>(colsA);  // K = cols of A = rows of B
      return calculateMatmulTolerance<OutputType, InputType, ComputeType>(normA, normB, k);
  }

Lastly we can have the original implementation that uses the full tensor values

  template <typename OutputType, typename InputType, typename ComputeType = float>
  float calculateMatmulToleranceTight(ITensor& a, ITensor& b)
  {
      const auto& aDims = a.dims();
      const auto& bDims = b.dims();

      if(aDims.size() < 2 || bDims.size() < 2)
      {
          throw std::invalid_argument("Matrices must have at least 2 dimensions for matmul.");
      }

      int64_t k = aDims[aDims.size() - 1];
      int64_t bRows = bDims[bDims.size() - 2];

      if(k != bRows)
      {
          throw std::invalid_argument(
              "Matrix dimensions incompatible for multiplication: A columns != B rows.");
      }

      if(k <= 0)
      {
          throw std::invalid_argument("Reduction dimension k must be positive.");
      }

      double normA = computeMatrixInfNorm<InputType>(a);
      double normB = computeMatrixInfNorm<InputType>(b);

      return calculateMatmulTolerance<OutputType, InputType, ComputeType>(
          normA, normB, static_cast<uint64_t>(k));
  }

Contributor Author

@oanaoana oanaoana Mar 17, 2026


I fully agree this would be more expensive, especially now after adding batching without tensorization etc. This might be a wider conversation, and I support Adam's suggestion for layered APIs, since we might want a few separate categories of checks with different aims. Some that run on-the-fly, like the initial convs approach, some more accurate for rigorous tests, and some in between.
This particular dynamic tol is in-between, since I used the inf and not Frobenius norm. I haven't assessed how to make it faster and simplify the norm further, but it's a step toward removing the N^2 check for convs. The best for local tests would be to compute the norms as the tensors get generated, and pass Anorm, Bnorm as input arguments for the dynamic tol evaluation.

Let me know if you'd like me to add a second simplified option in the spirit of convs or Adam's suggestion. I am leaning toward leaving this PR simple and having a second round of different norm types for different purposes, especially since the incoming GPU refs might lead to more options to weed through.

Contributor


I am okay if we leave this PR simple and do a follow up PR to split this all up.

Contributor


We should add a few non-trivial stride test cases here with padded batches and irregular layouts.
That will help verify that the stride calculations generalize and are correct for the different ways of describing tensors.

Contributor Author


Done! Since I fixed the batch stride, it's needed now. I am wondering, though, whether there shouldn't be separate computations for performance; I address this in your other comment on expedient norms.

Contributor


We probably need to split these test files into separate files per dynamic tolerance operation.
This is getting pretty bulky hehe.

Might be worth also splitting the DynamicTolerances.hpp file into multiple files per operation, and just keeping a forwarding function, or using it as a universal include to pull in the other DynamicTolerance functions. Just thinking that file will get out of hand quickly as we keep adding more.

Contributor Author


Yes, I agree, but maybe it's best to do it once we have all flavors of tolerance checks and each operation has more than one type of evaluation.

Contributor


I think we can leave it for this PR, but it'll be helpful to split the tolerances into multiple files and tests soon. Then it's easier to organize everything, and for someone like myself, it'll be easier to understand what belongs to which tolerance calc. I do see us having some generic functions across all tolerances that we can keep in a generic file like DynamicToleranceHelpers.hpp or something.

Contributor Author


Totally agree! What about having all utilities like norms, gamma functions, etc. in one file and the error evaluations in another? I should maybe wait 1-2 more PRs so there are a few helper functions and the split makes more sense.

@oanaoana oanaoana force-pushed the users/omarin/matmul-ALMIOPEN-1014 branch from 04b404c to e630bca Compare March 17, 2026 20:57
@oanaoana oanaoana force-pushed the users/omarin/matmul-ALMIOPEN-1014 branch from c80dd30 to 21e9f22 Compare March 18, 2026 17:11