
[hipDNN] add matmul dynamic tolerance#5446

Open
oanaoana wants to merge 5 commits into develop from users/omarin/matmul-ALMIOPEN-1014

Conversation

@oanaoana
Contributor

Motivation

Enable dynamic tolerance calculation for matrix multiplications of different sizes and input data types.

Technical details

  • Add calculateMatmulTolerance<Out, In, Comp>(ITensor&, ITensor&) for computing
    element-wise error bounds on matrix multiplication using Higham's forward error analysis
  • Add shared computeGamma(k, epsilon) utility with automatic linear/statistical
    regime switching, reusable across conv and matmul
  • Replace static matmul tolerance in TestMatmulPlan.cpp with the dynamic calculation
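
The shared gamma utility described above can be sketched as follows. This is a minimal standalone sketch: the linear regime is Higham's deterministic bound gamma_k = k*eps / (1 - k*eps), while the statistical-regime formula and the exact switching condition are illustrative assumptions, not the SDK's actual implementation.

```cpp
#include <cmath>
#include <cstdint>

// Hedged sketch of computeGamma(k, epsilon). The deterministic (linear)
// regime follows Higham's gamma_k = k*eps / (1 - k*eps); the statistical
// regime and the switch condition here are assumptions for illustration.
double computeGamma(uint64_t k, double epsilon)
{
    const double ke = static_cast<double>(k) * epsilon;

    // Linear (deterministic) regime: the classic bound is valid while k*eps < 1.
    if(ke < 1.0)
    {
        return ke / (1.0 - ke);
    }

    // Statistical regime (assumed): probabilistic error growth ~ sqrt(k)*eps,
    // used once the deterministic bound becomes vacuous.
    return std::sqrt(static_cast<double>(k)) * epsilon;
}
```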

Uses the matrix infinity-norm ||A||_inf = max_i (sum_j |A[i,j]|) (max absolute row sum).
This is the correct subordinate norm for element-wise error bounds:

  `max_ij |fl(C)_ij - C_ij| <= gamma_k * ||A||_inf * ||B||_inf`

which matches what allClose validates.

Test Plan

19 unit tests covering 6 type combinations (float, double, half, bfloat16) across linear and statistical gamma regimes, with input/output casting error paths.
Tests use non-uniform row data to validate the infinity-norm implementation.
Three edge case tests verify throws on dimension mismatch, gamma overflow (bf16/K=100), and output overflow (tolerance > half::max).
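
The gamma-overflow edge case (bf16/K=100) can be sanity-checked with plain arithmetic. The 0.5 cap below mirrors the GAMMA_MAX constant shown later in the review thread; whether the shipped guard uses exactly this form is an assumption.

```cpp
#include <cmath>

// Deterministic Higham bound gamma_k = k*eps / (1 - k*eps) evaluated for
// bfloat16, which has 7 stored mantissa bits and hence machine epsilon 2^-7.
// For K = 100 this exceeds the assumed 0.5 cap, so the guard must throw.
double bf16Gamma(double k)
{
    const double bf16Eps = std::ldexp(1.0, -7); // 0.0078125
    return k * bf16Eps / (1.0 - k * bf16Eps);
}
```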

@codecov-commenter

codecov-commenter commented Mar 13, 2026

Codecov Report

❌ Patch coverage is 90.69767% with 8 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...de/hipdnn_test_sdk/utilities/DynamicTolerances.hpp 90.70% 6 Missing and 2 partials ⚠️

❌ Your project status has failed because the head coverage (77.21%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #5446   +/-   ##
========================================
  Coverage    67.27%   67.27%           
========================================
  Files         1844     1844           
  Lines       284014   284099   +85     
  Branches     39839    39848    +9     
========================================
+ Hits        191044   191113   +69     
- Misses       76512    76524   +12     
- Partials     16458    16462    +4     
Flag Coverage Δ *Carryforward flag
hipBLAS 90.67% <ø> (ø) Carriedforward from 02f625e
hipBLASLt 43.49% <ø> (ø) Carriedforward from 02f625e
hipCUB 82.21% <ø> (ø) Carriedforward from 02f625e
hipDNN 85.12% <90.70%> (-0.01%) ⬇️
hipFFT 55.59% <ø> (ø) Carriedforward from 02f625e
hipRAND 76.12% <ø> (ø) Carriedforward from 02f625e
hipSOLVER 68.81% <ø> (ø) Carriedforward from 02f625e
hipSPARSE 84.70% <ø> (ø) Carriedforward from 02f625e
rocBLAS 47.97% <ø> (ø) Carriedforward from 02f625e
rocFFT 53.24% <ø> (ø) Carriedforward from 02f625e
rocRAND 57.07% <ø> (ø) Carriedforward from 02f625e
rocSOLVER 77.21% <ø> (ø) Carriedforward from 02f625e
rocSPARSE 71.48% <ø> (ø) Carriedforward from 02f625e

*This pull request uses carry forward flags.

Files with missing lines Coverage Δ
...de/hipdnn_test_sdk/utilities/DynamicTolerances.hpp 95.69% <90.70%> (-2.54%) ⬇️

... and 4 files with indirect coverage changes


@oanaoana oanaoana marked this pull request as ready for review March 16, 2026 15:21
@oanaoana oanaoana requested a review from a team as a code owner March 16, 2026 15:21
* @return The calculated tolerance value as float.
*/
template <typename OutputType, typename InputType, typename ComputeType = float>
float calculateMatmulTolerance(ITensor& a, ITensor& b)
Contributor

@BrianHarrisonAMD BrianHarrisonAMD Mar 16, 2026


We should update the existing hipBLASLt provider tests to use this path for the tolerance calc, and also verify that it's passing with these changes on a variety of shapes. We don't need to commit the extra tests for now, but we should do some exploratory testing to ensure this calc is working across different matmul problems with the hipBLASLt provider.

Tests in the folder here.

Contributor Author


Yes, I am all on board with this. I'll take a look at the tests.

Comment on lines +564 to +571
// For tensors with batch dimensions, compute across all batches
auto batchCount = static_cast<int64_t>(tensor.elementCount()) / (rows * cols);

double maxRowSum = 0.0;

for(int64_t batch = 0; batch < batchCount; ++batch)
{
auto batchOffset = batch * rows * rowStride;
Contributor

@BrianHarrisonAMD BrianHarrisonAMD Mar 16, 2026


I believe this won't work with padded batches since it's not using the batch stride.

Example: dims {2, 3, 4} with strides {24, 4, 1} (padding between batches)
The function would compute offset batch * 3 * 4 = batch * 12 instead of the correct batch * 24.

I think this can be replaced with iterateAlongDimensions + TensorView::getHostValue().
There is already logic to handle the offsets.

Following something similar to layernorm I think it looks like:

  template <typename T>
  double computeMatrixInfNorm(ITensor& tensor)
  {
      using hipdnn_data_sdk::utilities::iterateAlongDimensions;

      const auto& dims = tensor.dims();
      ConstTensorView<T> view(tensor);

      auto cols = dims.back();

      // outerDims = [batch..., rows] — everything except the last dim
      auto outerDims = std::vector<int64_t>(dims.begin(), dims.end() - 1);

      double maxRowSum = 0.0;

      iterateAlongDimensions(outerDims, [&](const std::vector<int64_t>& outerIndices) {
          double rowSum = 0.0;

          auto fullIndices = outerIndices;
          fullIndices.push_back(0);

          for(int64_t j = 0; j < cols; ++j)
          {
              fullIndices.back() = j;
              rowSum += static_cast<double>(
                  hipdnn_data_sdk::types::fabs(view.getHostValue(fullIndices)));
          }

          maxRowSum = std::max(maxRowSum, rowSum);
      });

      return maxRowSum;
  }

You could also look at making it multi-threaded, but probably not worth it yet.

Contributor Author


Thanks! I wasn't planning to address batched matrices and had no tests for this option. I left the batch handling in place mainly because I was working with ITensor. You’re absolutely right though, there’s no good reason not to handle batching properly, and adjusting the batch stride is the correct fix. Fixed this!

Comment on lines +1179 to +1199
template <typename T>
hipdnn_data_sdk::utilities::Tensor<T>
createTensorFromRowValues(const std::vector<int64_t>& dims,
const std::vector<double>& rowValues)
{
hipdnn_data_sdk::utilities::Tensor<T> tensor(dims);
auto* data = static_cast<T*>(tensor.memory().hostData());

auto rows = dims[dims.size() - 2];
auto cols = dims[dims.size() - 1];

for(int64_t i = 0; i < rows; ++i)
{
for(int64_t j = 0; j < cols; ++j)
{
data[i * cols + j] = static_cast<T>(rowValues[static_cast<size_t>(i)]);
}
}

return tensor;
}
Contributor


I think this is fine for the 2D usage in the file, but if we wanted this as a general utility then we would need to change it to be similar to the suggestions in DynamicTolerances.hpp.

Something like:

  template <typename T>
  Tensor<T> createTensorFromRowValues(const std::vector<int64_t>& dims,
                                      const std::vector<double>& rowValues)
  {
      Tensor<T> tensor(dims);

      auto cols = dims.back();
      auto rows = dims[dims.size() - 2];

      // outerDims = [batch..., rows]
      auto outerDims = std::vector<int64_t>(dims.begin(), dims.end() - 1);

      iterateAlongDimensions(outerDims, [&](const std::vector<int64_t>& outerIndices) {
          auto row = outerIndices.back();
          auto fullIndices = outerIndices;
          fullIndices.push_back(0);

          for(int64_t j = 0; j < cols; ++j)
          {
              fullIndices.back() = j;
              tensor.setHostValue(static_cast<T>(rowValues[static_cast<size_t>(row)]),
                                  fullIndices);
          }
      });

      return tensor;
  }

Contributor Author


Addressed together with the proper batch iterator.

Comment on lines +536 to +551
/**
* @brief Computes the infinity-norm (max absolute row sum) of a matrix tensor.
*
* The infinity-norm is defined as: ||A||_inf = max_i (sum_j |A[i,j]|)
*
* This is the appropriate subordinate matrix norm for element-wise error bounds
* via Higham's analysis: max_ij |error_ij| <= gamma_k * ||A||_inf * ||B||_inf.
*
* Uses strides to correctly handle both packed and non-packed tensor layouts.
*
* @tparam T The data type of the tensor elements.
* @param tensor The input tensor (must have at least 2 dimensions).
* @return The infinity-norm as a double.
*/
template <typename T>
double computeMatrixInfNorm(ITensor& tensor)
Contributor


I think it's good to have this version that uses the exact values, but for conv we used estimated values from the min/max to be able to quickly calculate an estimated worst-case tolerance.

Would it be possible to add a version that uses estimated values similar to conv instead of real values?
I think we should probably have both for every operation, and then we can add in some environment variable, or settings that let us swap between the modes.

That way for quick pre-submit checks we can use the quicker/looser estimation of tolerance, and for nightly / weekly we can instead use the calculated form. The calculated form will also likely need to be converted into a GPU version once we scale up as well. Having the estimated version gives us something to use until all this work is in place.

Contributor


I was wondering if we could keep this pretty much the same and just extract out where the computation of the norm happens, since that seems to be the expensive operation. Of course, I'm not amazing with the tolerance stuff yet, so please correct me if this won't work at all.

  // Extract the common math and take in the norm of A & B.
  template <typename OutputType, typename InputType, typename ComputeType = float>
  float calculateMatmulTolerance(double normA, double normB, uint64_t k)
  {
      static_assert(std::is_same_v<ComputeType, float> || std::is_same_v<ComputeType, double>
                        || std::is_same_v<ComputeType, half> || std::is_same_v<ComputeType, bfloat16>,
                    "ComputeType must be float, double, half, or bfloat16");

      auto epsilon = static_cast<double>(std::numeric_limits<ComputeType>::epsilon());
      double gamma = hipdnn_test_sdk::utilities::computeGamma(k, epsilon);

      constexpr double GAMMA_MAX = 0.5;
      if(gamma >= GAMMA_MAX)
      {
          throw std::overflow_error("Error growth factor gamma >= 0.5...");
      }

      double accumulatedTolerance = gamma * normA * normB;

      auto inputEpsilon = static_cast<double>(std::numeric_limits<InputType>::epsilon());
      if(inputEpsilon < epsilon)
      {
          accumulatedTolerance += 2.0 * normA * normB * epsilon;
      }

      auto outputEpsilon = static_cast<double>(std::numeric_limits<OutputType>::epsilon());
      double castTolerance = 0.0;
      if(outputEpsilon > epsilon)
      {
          castTolerance = normA * normB * outputEpsilon;
      }

      double totalTolerance = accumulatedTolerance + castTolerance;

      if(totalTolerance > static_cast<double>(std::numeric_limits<OutputType>::max()))
      {
          throw std::overflow_error("Calculated tolerance exceeds max of output type.");
      }

      return static_cast<float>(totalTolerance);
  }

Then we can make a fast/loose version that takes in the min/max, cols of A/B and computes a maxNorm of A/B.

 template <typename OutputType, typename InputType, typename ComputeType = float>
  float calculateMatmulToleranceLoose(double minA, double maxA, int64_t colsA,
                                 double minB, double maxB, int64_t colsB)
  {
      double maxAbsA = std::max(std::abs(minA), std::abs(maxA));
      double maxAbsB = std::max(std::abs(minB), std::abs(maxB));
      double normA = static_cast<double>(colsA) * maxAbsA;
      double normB = static_cast<double>(colsB) * maxAbsB;
      auto k = static_cast<uint64_t>(colsA);  // K = cols of A = rows of B
      return calculateMatmulTolerance<OutputType, InputType, ComputeType>(normA, normB, k);
  }

Lastly we can have the original implementation that uses the full tensor values

  template <typename OutputType, typename InputType, typename ComputeType = float>
  float calculateMatmulToleranceTight(ITensor& a, ITensor& b)
  {
      const auto& aDims = a.dims();
      const auto& bDims = b.dims();

      if(aDims.size() < 2 || bDims.size() < 2)
      {
          throw std::invalid_argument("Matrices must have at least 2 dimensions for matmul.");
      }

      int64_t k = aDims[aDims.size() - 1];
      int64_t bRows = bDims[bDims.size() - 2];

      if(k != bRows)
      {
          throw std::invalid_argument(
              "Matrix dimensions incompatible for multiplication: A columns != B rows.");
      }

      if(k <= 0)
      {
          throw std::invalid_argument("Reduction dimension k must be positive.");
      }

      double normA = computeMatrixInfNorm<InputType>(a);
      double normB = computeMatrixInfNorm<InputType>(b);

      return calculateMatmulTolerance<OutputType, InputType, ComputeType>(
          normA, normB, static_cast<uint64_t>(k));
  }

Contributor Author

@oanaoana oanaoana Mar 17, 2026


I fully agree this would be more expensive, especially now after adding batching without tensorization etc. This might be a wider conversation, and I support Adam's suggestion for layered APIs, since we might want a few separate categories of checks with different aims. Some that run on-the-fly, like the initial convs approach, some more accurate for rigorous tests, and some in between.
This particular dynamic tol is in-between, since I used the inf and not Frobenius norm. I haven't assessed how to make it faster and simplify the norm further, but it's a step toward removing the N^2 check for convs. The best for local tests would be to compute the norms as the tensors get generated, and pass Anorm, Bnorm as input arguments for the dynamic tol evaluation.

Let me know if you'd like me to add a second simplified option in the spirit of convs or Adam's suggestion. I am leaning toward leaving this PR simple and having a second round of different norm types for different purposes, especially since the incoming GPU refs might lead to more options to weed through.

Contributor


I am okay if we leave this PR simple and do a follow up PR to split this all up.

Contributor


We should add a few non-trivial stride test cases here with padded batches and irregular layouts.
That will help verify that the stride calculations generalize and are correct for the different ways of describing tensors.

Contributor Author


Done! Since I fixed the batch stride, it's needed now. I am wondering, though, whether there shouldn't be separate computations for performance; I address this in your other comment on expedient norms.

Contributor


We probably need to split these test files into separate files per dynamic tolerance operation.
This is getting pretty bulky hehe.

Might be worth also splitting the DynamicTolerances.hpp file into multiple files per operation, and just keeping a forwarding function, or using it as a universal include to pull in the other DynamicTolerance functions. Just thinking that file will get out of hand quickly as we keep adding more.

Contributor Author


Yes, I agree, but maybe it's best to do it once we have all flavors of tolerance checks and each operation has more than one type of evaluation.

Contributor


I think we can leave it for this PR, but it'll be helpful to split the tolerances into multiple files and tests soon. Then it's easier to organize everything, and for someone like myself, it'll be easier to understand what belongs to which tolerance calc. I do see us having some generic functions across all tolerances that we can keep in a generic file like DynamicToleranceHelpers.hpp or something.

Contributor Author


Totally agree! What about having all utilities like norms, gamma functions, etc. in one file and the error evaluations in another? I should maybe wait 1-2 more PRs so there are a few helper functions and the split makes more sense.

@oanaoana oanaoana force-pushed the users/omarin/matmul-ALMIOPEN-1014 branch from 04b404c to e630bca Compare March 17, 2026 20:57
@oanaoana oanaoana force-pushed the users/omarin/matmul-ALMIOPEN-1014 branch from c80dd30 to 21e9f22 Compare March 18, 2026 17:11