Conversation
Codecov Report
❌ Patch coverage is
❌ Your project status has failed because the head coverage (77.21%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

```text
@@            Coverage Diff             @@
##           develop    #5446     +/-  ##
==========================================
  Coverage    67.27%   67.27%
==========================================
  Files         1844     1844
  Lines       284014   284099     +85
  Branches     39839    39848      +9
==========================================
+ Hits        191044   191113     +69
- Misses       76512    76524     +12
- Partials     16458    16462      +4
```
```cpp
 * @return The calculated tolerance value as float.
 */
template <typename OutputType, typename InputType, typename ComputeType = float>
float calculateMatmulTolerance(ITensor& a, ITensor& b)
```
We should update the existing hipBLASLt provider tests to use this path for the tolerance calc, and also verify that it's passing with these changes on a variety of shapes. We don't need to commit the extra tests for now, but we should do some exploratory testing to ensure this calc is working across different matmul problems with the hipBLASLt provider.
Tests in the folder here.
Yes, I am all on board with this. I'll take a look at the tests.
```cpp
// For tensors with batch dimensions, compute across all batches
auto batchCount = static_cast<int64_t>(tensor.elementCount()) / (rows * cols);

double maxRowSum = 0.0;

for(int64_t batch = 0; batch < batchCount; ++batch)
{
    auto batchOffset = batch * rows * rowStride;
```
I believe this won't work with padded batches since it's not using the batch stride.
Example: dims {2, 3, 4} with strides {24, 4, 1} (padding between batches)
The function would compute offset batch * 3 * 4 = batch * 12 instead of the correct batch * 24.
I think this can be replaced with iterateAlongDimensions + TensorView::getHostValue().
There is already logic to handle the offsets.
Following something similar to layernorm I think it looks like:
```cpp
template <typename T>
double computeMatrixInfNorm(ITensor& tensor)
{
    using hipdnn_data_sdk::utilities::iterateAlongDimensions;
    const auto& dims = tensor.dims();
    ConstTensorView<T> view(tensor);
    auto cols = dims.back();
    // outerDims = [batch..., rows] — everything except the last dim
    auto outerDims = std::vector<int64_t>(dims.begin(), dims.end() - 1);
    double maxRowSum = 0.0;
    iterateAlongDimensions(outerDims, [&](const std::vector<int64_t>& outerIndices) {
        double rowSum = 0.0;
        auto fullIndices = outerIndices;
        fullIndices.push_back(0);
        for(int64_t j = 0; j < cols; ++j)
        {
            fullIndices.back() = j;
            rowSum += static_cast<double>(
                hipdnn_data_sdk::types::fabs(view.getHostValue(fullIndices)));
        }
        maxRowSum = std::max(maxRowSum, rowSum);
    });
    return maxRowSum;
}
```
You could also look at making it multi-threaded, but probably not worth it yet.
Thanks! I wasn't planning to address batched matrices and had no tests for this option. I left the batch handling in place mainly because I was working with ITensor. You’re absolutely right though, there’s no good reason not to handle batching properly, and adjusting the batch stride is the correct fix. Fixed this!
```cpp
template <typename T>
hipdnn_data_sdk::utilities::Tensor<T>
    createTensorFromRowValues(const std::vector<int64_t>& dims,
                              const std::vector<double>& rowValues)
{
    hipdnn_data_sdk::utilities::Tensor<T> tensor(dims);
    auto* data = static_cast<T*>(tensor.memory().hostData());

    auto rows = dims[dims.size() - 2];
    auto cols = dims[dims.size() - 1];

    for(int64_t i = 0; i < rows; ++i)
    {
        for(int64_t j = 0; j < cols; ++j)
        {
            data[i * cols + j] = static_cast<T>(rowValues[static_cast<size_t>(i)]);
        }
    }

    return tensor;
}
```
I think this is fine for the 2D usage in the file, but if we wanted this as a general utility then we would need to change it to be similar to the suggestions in DynamicTolerances.hpp.
Something like:
```cpp
template <typename T>
Tensor<T> createTensorFromRowValues(const std::vector<int64_t>& dims,
                                    const std::vector<double>& rowValues)
{
    Tensor<T> tensor(dims);
    auto cols = dims.back();
    auto rows = dims[dims.size() - 2];
    // outerDims = [batch..., rows]
    auto outerDims = std::vector<int64_t>(dims.begin(), dims.end() - 1);
    iterateAlongDimensions(outerDims, [&](const std::vector<int64_t>& outerIndices) {
        auto row = outerIndices.back();
        auto fullIndices = outerIndices;
        fullIndices.push_back(0);
        for(int64_t j = 0; j < cols; ++j)
        {
            fullIndices.back() = j;
            tensor.setHostValue(static_cast<T>(rowValues[static_cast<size_t>(row)]),
                                fullIndices);
        }
    });
    return tensor;
}
```
Addressed together with the proper batch iterator.
```cpp
/**
 * @brief Computes the infinity-norm (max absolute row sum) of a matrix tensor.
 *
 * The infinity-norm is defined as: ||A||_inf = max_i (sum_j |A[i,j]|)
 *
 * This is the appropriate subordinate matrix norm for element-wise error bounds
 * via Higham's analysis: max_ij |error_ij| <= gamma_k * ||A||_inf * ||B||_inf.
 *
 * Uses strides to correctly handle both packed and non-packed tensor layouts.
 *
 * @tparam T The data type of the tensor elements.
 * @param tensor The input tensor (must have at least 2 dimensions).
 * @return The infinity-norm as a double.
 */
template <typename T>
double computeMatrixInfNorm(ITensor& tensor)
```
I think it's good to have this version that uses the exact values, but for conv we used estimated values from the min/max to be able to quickly calculate an estimated worst-case tolerance.
Would it be possible to add a version that uses estimated values similar to conv instead of real values?
I think we should probably have both for every operation, and then we can add an environment variable or setting that lets us swap between the modes.
That way, for quick pre-submit checks we can use the quicker/looser estimation of tolerance, and for nightly/weekly runs we can instead use the calculated form. The calculated form will also likely need to be converted into a GPU version once we scale up. Having the estimated version gives us something to use until all this work is in place.
I was wondering if we could keep this pretty much the same but just extract out where the computation of the norm happens, since that seems to be the expensive operation. Of course, I'm not amazing with the tolerance stuff yet, so please correct me if this won't work at all.
```cpp
// Extract the common math and take in the norm of A & B.
template <typename OutputType, typename InputType, typename ComputeType = float>
float calculateMatmulTolerance(double normA, double normB, uint64_t k)
{
    static_assert(std::is_same_v<ComputeType, float> || std::is_same_v<ComputeType, double>
                      || std::is_same_v<ComputeType, half> || std::is_same_v<ComputeType, bfloat16>,
                  "ComputeType must be float, double, half, or bfloat16");
    auto epsilon = static_cast<double>(std::numeric_limits<ComputeType>::epsilon());
    double gamma = hipdnn_test_sdk::utilities::computeGamma(k, epsilon);
    constexpr double GAMMA_MAX = 0.5;
    if(gamma >= GAMMA_MAX)
    {
        throw std::overflow_error("Error growth factor gamma >= 0.5...");
    }
    double accumulatedTolerance = gamma * normA * normB;
    auto inputEpsilon = static_cast<double>(std::numeric_limits<InputType>::epsilon());
    if(inputEpsilon < epsilon)
    {
        accumulatedTolerance += 2.0 * normA * normB * epsilon;
    }
    auto outputEpsilon = static_cast<double>(std::numeric_limits<OutputType>::epsilon());
    double castTolerance = 0.0;
    if(outputEpsilon > epsilon)
    {
        castTolerance = normA * normB * outputEpsilon;
    }
    double totalTolerance = accumulatedTolerance + castTolerance;
    if(totalTolerance > static_cast<double>(std::numeric_limits<OutputType>::max()))
    {
        throw std::overflow_error("Calculated tolerance exceeds max of output type.");
    }
    return static_cast<float>(totalTolerance);
}
```
Then we can make a fast/loose version that takes in the min/max, cols of A/B and computes a maxNorm of A/B.
```cpp
template <typename OutputType, typename InputType, typename ComputeType = float>
float calculateMatmulToleranceLoose(double minA, double maxA, int64_t colsA,
                                    double minB, double maxB, int64_t colsB)
{
    double maxAbsA = std::max(std::abs(minA), std::abs(maxA));
    double maxAbsB = std::max(std::abs(minB), std::abs(maxB));
    double normA = static_cast<double>(colsA) * maxAbsA;
    double normB = static_cast<double>(colsB) * maxAbsB;
    auto k = static_cast<uint64_t>(colsA); // K = cols of A = rows of B
    return calculateMatmulTolerance<OutputType, InputType, ComputeType>(normA, normB, k);
}
```
Lastly, we can keep the original implementation that uses the full tensor values:

```cpp
template <typename OutputType, typename InputType, typename ComputeType = float>
float calculateMatmulToleranceTight(ITensor& a, ITensor& b)
{
    const auto& aDims = a.dims();
    const auto& bDims = b.dims();
    if(aDims.size() < 2 || bDims.size() < 2)
    {
        throw std::invalid_argument("Matrices must have at least 2 dimensions for matmul.");
    }
    int64_t k = aDims[aDims.size() - 1];
    int64_t bRows = bDims[bDims.size() - 2];
    if(k != bRows)
    {
        throw std::invalid_argument(
            "Matrix dimensions incompatible for multiplication: A columns != B rows.");
    }
    if(k <= 0)
    {
        throw std::invalid_argument("Reduction dimension k must be positive.");
    }
    double normA = computeMatrixInfNorm<InputType>(a);
    double normB = computeMatrixInfNorm<InputType>(b);
    return calculateMatmulTolerance<OutputType, InputType, ComputeType>(
        normA, normB, static_cast<uint64_t>(k));
}
```
I fully agree this would be more expensive, especially now after adding batching without tensorization etc. This might be a wider conversation, and I support Adam's suggestion for layered APIs, since we might want a few separate categories of checks with different aims. Some that run on-the-fly, like the initial convs approach, some more accurate for rigorous tests, and some in between.
This particular dynamic tol is in-between, since I used the inf and not Frobenius norm. I haven't assessed how to make it faster and simplify the norm further, but it's a step toward removing the N^2 check for convs. The best for local tests would be to compute the norms as the tensors get generated, and pass Anorm, Bnorm as input arguments for the dynamic tol evaluation.
Let me know if you'd like me to add a second simplified option in the spirit of convs or Adam's suggestion. I am leaning toward leaving this PR simple and having a second round of different norm types for different purposes, especially since the incoming GPU refs might lead to more options to weed through.
I am okay if we leave this PR simple and do a follow-up PR to split this all up.
We should add a few non-trivial stride test cases in here with padded batches and irregular layouts.
That will help verify that the stride calcs are generalized and correct for the different ways of describing tensors.
Done! Since I fixed the batch stride, these cases are now needed. I am wondering, though, whether there shouldn't be separate computations for performance; I address this in your other comment on expedient norms.
We probably need to split these test files into separate files per dynamic-tolerance operation.
This is getting pretty bulky, hehe.
Might be worth also splitting the DynamicTolerances.hpp file into multiple files per operation, and just keeping a forwarding function, or using it as a universal include to pull in the other DynamicTolerance functions. Just thinking that file will get out of hand quickly as we keep adding more.
Yes, I agree, but maybe it's best to do it once we have all flavors of tolerance checks and each operation has more than one type of evaluation.
I think we can leave it for this PR, but I think it'll be helpful to split up the tolerances into multiple files and tests soon. Then it's easier to organize everything, and for someone like myself it'll be easier to understand what belongs to which tolerance calc. I do see us having some generic functions across all tolerances, and those we can keep in a generic file like DynamicToleranceHelpers.hpp or something.
Totally agree! What about having all utilities like norms, gamma functions, etc. in one file and the error evaluations in another? I should maybe wait 1-2 more PRs, so there are a few helper functions and the split makes more sense.
projects/hipdnn/test_sdk/tests/utilities/TestDynamicTolerances.cpp (outdated, resolved)
Motivation
Enable dynamic tolerance calculations for matrix multiplications of different sizes and data types.
Technical details
calculateMatmulTolerance<Out, In, Comp>(ITensor&, ITensor&)for computingelement-wise error bounds on matrix multiplication using Higham's forward error analysis
computeGamma(k, epsilon)utility with automatic linear/statisticalregime switching, reusable across conv and matmul
TestMatmulPlan.cppwith the dynamic calculationUses the matrix infinity-norm
||A||_inf = max_i (sum_j |A[i,j]|)(max absolute row sum).This is the correct subordinate norm for element-wise error bounds:
which matches what
allClosevalidates.Test Plan
19 unit tests covering 6 type combinations (float, double, half, bfloat16) across linear and statistical gamma regimes, with input/output casting error paths.
Tests use non-uniform row data to validate the infinity-norm implementation.
Three edge case tests verify throws on dimension mismatch, gamma overflow (bf16/K=100), and output overflow (tolerance > half::max).