
Commit 758d3e8

Add unsafe aliased memory checking system (#1079)
Implement detection of unsafe memory aliasing between input and output tensors in transform operations. Enabled via the MATX_EN_UNSAFE_ALIAS_DETECTION flag. Includes a can_alias() trait function and a matmul aliasing unit test.
1 parent 2be1dcf commit 758d3e8

35 files changed: +745 −126 lines

CMakeLists.txt

Lines changed: 5 additions & 0 deletions
@@ -80,6 +80,7 @@ option(MATX_EN_COMPLEX_OP_NAN_CHECKS "Enable full NaN/Inf handling for complex m
option(MATX_EN_CUDA_LINEINFO "Enable line information for CUDA kernels via -lineinfo nvcc flag" OFF)
option(MATX_EN_EXTENDED_LAMBDA "Enable extended lambda support for device/host lambdas" ON)
option(MATX_EN_MATHDX "Enable MathDx support for kernel fusion" OFF)
+option(MATX_EN_UNSAFE_ALIAS_DETECTION "Enable aliased memory detection" OFF)

set(MATX_EN_PYBIND11 OFF CACHE BOOL "Enable pybind11 support")

@@ -212,6 +213,10 @@ else()
  set(MATX_NVPL_INT_TYPE "ilp64")
endif()

+if (MATX_EN_UNSAFE_ALIAS_DETECTION)
+  target_compile_definitions(matx INTERFACE MATX_EN_UNSAFE_ALIAS_DETECTION)
+endif()
+
# Host support
if (MATX_EN_NVPL OR MATX_EN_X86_FFTW OR MATX_EN_BLIS OR MATX_EN_OPENBLAS)
  message(STATUS "Enabling OpenMP support")
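
The docs added below note that the same flag can also be supplied as a compiler define rather than through this CMake option. A minimal sketch of that route, assuming MatX's header-only layout makes a pre-include define equivalent to passing -DMATX_EN_UNSAFE_ALIAS_DETECTION on the compile line:

    // Hedged sketch: enable alias detection without the CMake option.
    // Assumes this definition is visible before any MatX header is included.
    #define MATX_EN_UNSAFE_ALIAS_DETECTION
    #include "matx.h"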

docs_input/basics/debug.rst

Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
.. _debugging:

Debugging
#########

MatX employs several tools for debugging and improving the correctness of the code.

Logging
--------

MatX provides a logging system that can be used to log messages to the console. This is useful for debugging your code and for tracing its execution.

See :ref:`logging_basics` for more information on the logging system.

Compile Time
------------

At compile time MatX uses `static_assert` calls where possible to provide helpful error messages. Static assertions have the limitation that
they cannot display a formatted string, so the values of the invalid parameters are not shown. Common compile-time errors include:

- Invalid rank
- Invalid type
- Invalid tensor shapes (for static tensor sizes)

Runtime
-------

At runtime MatX uses C++ exceptions to report errors. These errors are typically based on expected vs. actual outcomes. Several macros are used
to raise these errors:

- MATX_ASSERT (boolean assertion)
- MATX_ASSERT_STR (boolean assertion with a formatted string)
- MATX_ASSERT_STR_EXP (boolean assertion with a formatted string and an expected value)

These macros are listed in order of usefulness, with `MATX_ASSERT_STR_EXP` providing the most information to the user. Common
runtime errors include:

- Invalid sizes
- Invalid indexing
- Errors returned from CUDA APIs


Null Pointer Checking
---------------------

Tensors in MatX may be left uninitialized on declaration. This is common when a tensor is used as a class member and is not initialized in the constructor. For example:

.. code-block:: cpp

   class MyClass {
     public:
       MyClass() {
       }
     private:
       tensor_t<float> t; // Uninitialized
   };

Typically `make_tensor` is called at a later time to set the shape and allocate the memory backing the tensor. Detecting an uninitialized tensor on the device
has a non-zero performance penalty and is disabled by default. To detect an uninitialized tensor on the device, build your application in debug mode with the
`NDEBUG` flag undefined. When `NDEBUG` is undefined, MatX checks for uninitialized tensors on the device and asserts if one is found.

Unsafe Aliased Memory Checking
------------------------------

MatX provides an imperfect unsafe aliased memory checking system that can detect when an input tensor may overlap with output tensor memory,
causing a data race. The word *unsafe* is used here because there are cases where aliasing is safe, such as a direct element-wise operation.
Achieving a false-positive rate of zero would require checking every possible input and output location for overlap, which would be impractical
for most applications. Instead, several checks are used that catch the most common cases of memory aliasing. Since alias checking can be
expensive and is not perfect, it must be explicitly enabled with the CMake option `MATX_EN_UNSAFE_ALIAS_DETECTION` or the compiler define
of the same name.

The types of aliasing that can be detected are:

- Safe element-wise aliasing: (a = a + a) // No hazard since it's a direct element-wise operation
- Safe element-wise aliasing: (slice(a, {0}, {5}) = slice(a, {0}, {5}) - slice(a, {0}, {5})) // No hazard since it's a direct element-wise operation
- Unsafe element-wise aliasing: (slice(a, {0}, {5}) = slice(a, {3}, {8}) - slice(a, {0}, {5})) // Unsafe since the input and output slices overlap at different offsets
- Unsafe matrix multiplication: (c = matmul(c, d)) // Unsafe since matmul doesn't allow aliasing between input and output memory
- Safe FFT: (c = fft(c)) // Safe since FFT supports in-place operation
- False positive: (slice(a, {0}, {6}, {2}) = slice(a, {0}, {6}, {2}) + slice(a, {0}, {6}, {2})) // Non-unity strides currently trigger a false positive
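
A minimal usage sketch of the safe and unsafe cases listed above, assuming detection was enabled at build time and using the usual make_tensor/run(executor) workflow; the tensor sizes are illustrative, and what MatX reports when an unsafe alias is found (log, assertion, etc.) is not shown here:

    #include "matx.h"

    int main() {
      matx::cudaExecutor exec{};
      auto a = matx::make_tensor<float>({10});
      auto c = matx::make_tensor<cuda::std::complex<float>>({8, 8});
      auto d = matx::make_tensor<cuda::std::complex<float>>({8, 8});

      (a = a + a).run(exec);               // Safe: direct element-wise aliasing
      (c = matx::fft(c)).run(exec);        // Safe: FFT supports in-place operation
      (c = matx::matmul(c, d)).run(exec);  // Unsafe: matmul input aliases its output
      exec.sync();
      return 0;
    }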

include/matx/core/capabilities.h

Lines changed: 13 additions & 0 deletions
@@ -68,6 +68,8 @@ namespace detail {
    SET_GROUPS_PER_BLOCK, // Set the number of groups per block for the operator.
    ASYNC_LOADS_REQUESTED, // Whether the operator requires asynchronous loads.
    MAX_EPT_VEC_LOAD, // The maximum EPT for a vector load.
+    ELEMENT_WISE, // Whether the operator is element-wise (safe with aliasing)
+    ALIASED_MEMORY, // Whether the operator's input and output pointers alias
    // Add more capabilities as needed
  };

@@ -139,6 +141,15 @@
    static constexpr bool and_identity = true;
  };

+  template <>
+  struct capability_attributes<OperatorCapability::ALIASED_MEMORY> {
+    using type = bool;
+    using input_type = AliasedMemoryQueryInput;
+    static constexpr bool default_value = false;
+    static constexpr bool or_identity = false;
+    static constexpr bool and_identity = true;
+  };
+
  template <>
  struct capability_attributes<OperatorCapability::GROUPS_PER_BLOCK> {
    using type = cuda::std::array<int, 2>; // min/max elements per thread

@@ -266,6 +277,8 @@
      return CapabilityQueryType::AND_QUERY; // The expression should use the range of groups per block of its children.
    case OperatorCapability::GROUPS_PER_BLOCK:
      return CapabilityQueryType::RANGE_QUERY; // The expression should use the range of groups per block of its children.
+    case OperatorCapability::ALIASED_MEMORY:
+      return CapabilityQueryType::OR_QUERY; // The expression aliases if any of its children alias.
    case OperatorCapability::MAX_EPT_VEC_LOAD:
      return CapabilityQueryType::MIN_QUERY; // The expression should use the minimum EPT for a vector load of its children.
    case OperatorCapability::JIT_CLASS_QUERY:
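
The OR_QUERY chosen for ALIASED_MEMORY can be illustrated with a standalone sketch (plain std:: types, not MatX's internal query machinery): an expression reports aliasing if any of its children do, starting from the or_identity value declared above.

    #include <array>
    #include <cstddef>

    // Illustration only: fold child ALIASED_MEMORY results with logical OR.
    constexpr bool or_identity = false;  // mirrors capability_attributes<ALIASED_MEMORY>::or_identity

    template <std::size_t N>
    constexpr bool aliased_memory_or_query(std::array<bool, N> child_results) {
      bool result = or_identity;
      for (std::size_t i = 0; i < N; ++i) {
        result = result || child_results[i];  // one aliasing child flags the whole expression
      }
      return result;
    }

    static_assert(aliased_memory_or_query(std::array<bool, 3>{false, true, false}));
    static_assert(!aliased_memory_or_query(std::array<bool, 2>{false, false}));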

include/matx/core/operator_options.h

Lines changed: 8 additions & 0 deletions
@@ -168,6 +168,7 @@ namespace detail {

  // Input structure for types that require it

+  // Capabilities structures

  struct EPTQueryInput {
    bool jit;

@@ -187,6 +188,13 @@
    int groups_per_block;
  };

+  struct AliasedMemoryQueryInput {
+    bool permutes_input_output;
+    bool is_prerun;
+    void *start_ptr;
+    void *end_ptr;
+  };
+
}

};
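
For illustration, a hedged sketch of how a caller might describe an output buffer with this structure as the half-open byte range [data, data + bytes); the field semantics are inferred from the names and from the tensor_impl.h hunk below, and the struct is mirrored locally so the sketch stands alone:

    #include <cstddef>

    // Local mirror of the struct above so this sketch compiles on its own.
    struct AliasedMemoryQueryInput {
      bool permutes_input_output;
      bool is_prerun;
      void *start_ptr;
      void *end_ptr;
    };

    // Hypothetical helper, not part of MatX.
    AliasedMemoryQueryInput make_alias_query(void *data, std::size_t bytes,
                                             bool permutes, bool prerun) {
      return AliasedMemoryQueryInput{permutes, prerun, data,
                                     static_cast<char *>(data) + bytes};
    }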

include/matx/core/tensor_impl.h

Lines changed: 43 additions & 0 deletions
@@ -1462,6 +1462,49 @@ MATX_IGNORE_WARNING_POP_GCC
      return false;
#endif
    }
+    else if constexpr (Cap == OperatorCapability::ALIASED_MEMORY) {
+      // Check if this tensor's memory overlaps with the query input range
+      static_assert(std::is_same_v<remove_cvref_t<InType>, detail::AliasedMemoryQueryInput>,
+                    "ALIASED_MEMORY capability requires AliasedMemoryQueryInput");
+
+      // Rank-0 (scalar) tensors don't need aliasing checks
+      if constexpr (Rank() == 0) {
+        return false;
+      }
+      else if constexpr (is_sparse_data_v<TensorData>) {
+        return false;
+      }
+      else {
+        // The logic to detect overlaps is as follows: if the tensor's first and last pointers match the
+        // query range exactly (a complete overlap, e.g. (a = a)), the access only aliases when the tensor
+        // is non-contiguous or the operation permutes the input and output. Otherwise we have a partial
+        // overlap, and a partial overlap is always reported as a potential alias.

+        // Get address of first element using operator()(0, 0, ...)
+        auto get_first = [this]<size_t... Is>(cuda::std::index_sequence<Is...>) {
+          return &(const_cast<tensor_impl_t*>(this)->operator()(static_cast<index_t>(Is*0)...));
+        };
+        void* tensor_start = static_cast<void*>(const_cast<T*>(get_first(cuda::std::make_index_sequence<Rank()>{})));

+        // Get address of last element using operator()(Size(0)-1, Size(1)-1, ...)
+        auto get_last = [this]<size_t... Is>(cuda::std::index_sequence<Is...>) {
+          return &(const_cast<tensor_impl_t*>(this)->operator()(static_cast<index_t>(Size(Is)-1)...));
+        };
+        void* tensor_end = static_cast<void*>(static_cast<char*>(static_cast<void*>(const_cast<T*>(get_last(cuda::std::make_index_sequence<Rank()>{})))) + sizeof(T));

+        bool complete_overlap = tensor_start == in.start_ptr && tensor_end == in.end_ptr;
+        if (complete_overlap) {
+          MATX_LOG_TRACE("Complete overlap of tensors. Contiguous: {}", IsContiguous());
+          return !IsContiguous() || in.permutes_input_output;
+        }

+        // Check for overlap: two ranges [a1, a2) and [b1, b2) overlap if a1 < b2 && b1 < a2
+        bool overlaps = (tensor_start < in.end_ptr) && (in.start_ptr < tensor_end);

+        MATX_LOG_TRACE("Overlap of tensors: {}", overlaps);
+        return overlaps;
+      }
+    }
    else {
      return detail::capability_attributes<Cap>::default_value;
    }
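
The partial-overlap test in the hunk above is the standard half-open interval check; restated as a standalone helper (hypothetical name; pointers compared as integers, since relational comparison of unrelated pointers is unspecified in standard C++):

    #include <cstdint>

    // [a_start, a_end) and [b_start, b_end) overlap iff a_start < b_end && b_start < a_end.
    inline bool ranges_overlap(const void *a_start, const void *a_end,
                               const void *b_start, const void *b_end) {
      const auto a1 = reinterpret_cast<std::uintptr_t>(a_start);
      const auto a2 = reinterpret_cast<std::uintptr_t>(a_end);
      const auto b1 = reinterpret_cast<std::uintptr_t>(b_start);
      const auto b2 = reinterpret_cast<std::uintptr_t>(b_end);
      return a1 < b2 && b1 < a2;
    }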

include/matx/core/type_utils_both.h

Lines changed: 22 additions & 0 deletions
@@ -172,6 +172,28 @@ template <typename T> constexpr __MATX_HOST__ __MATX_DEVICE__ bool is_matx_trans
  return detail::is_matx_transform_op_impl<typename remove_cvref<T>::type>::value;
}

+namespace detail {
+template <typename T, typename = void>
+struct has_can_alias_impl : cuda::std::false_type {
+};
+
+template <typename T>
+struct has_can_alias_impl<T, cuda::std::void_t<typename remove_cvref_t<T>::can_alias>> : cuda::std::true_type {
+};
+}
+
+/**
+ * @brief Determine if operator can alias
+ *
+ * Returns true if the type is a transform operator and has the can_alias trait set
+ *
+ * @tparam T Type to test
+ */
+template <typename T> constexpr __MATX_HOST__ __MATX_DEVICE__ bool can_alias()
+{
+  return is_matx_transform_op<T>() && detail::has_can_alias_impl<typename remove_cvref<T>::type>::value;
+}
+
namespace detail {
template <typename T, typename = void>
struct has_matx_op_type : cuda::std::false_type {
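
The detection idiom added here can be restated with standard library types (std:: instead of cuda::std::, toy operator types instead of MatX transforms) to show what the trait keys on: only the presence of a nested can_alias alias matters, not its value.

    #include <type_traits>

    // Standalone restatement of the idiom above using std:: traits.
    template <typename T, typename = void>
    struct has_can_alias : std::false_type {};

    template <typename T>
    struct has_can_alias<T, std::void_t<typename std::decay_t<T>::can_alias>>
        : std::true_type {};

    // Hypothetical toy operators, not MatX types.
    struct FftLikeOp  { using can_alias = void; };  // opts in to input/output aliasing
    struct GemmLikeOp {};                           // does not

    static_assert(has_can_alias<FftLikeOp>::value);
    static_assert(!has_can_alias<GemmLikeOp>::value);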

include/matx/generators/alternate.h

Lines changed: 0 additions & 11 deletions
@@ -60,17 +60,6 @@ namespace matx
      }
    }

-    template <OperatorCapability Cap>
-    __MATX_INLINE__ __MATX_HOST__ auto get_capability_proc() const {
-      if constexpr (Cap == OperatorCapability::ELEMENTS_PER_THREAD) {
-        const auto my_cap = cuda::std::array<ElementsPerThread, 2>{ElementsPerThread::ONE, ElementsPerThread::ONE};
-        return my_cap;
-      } else {
-        auto self_has_cap = detail::capability_attributes<Cap>::default_value;
-        return self_has_cap;
-      }
-    }
-
    template <typename CapType>
    __MATX_INLINE__ __MATX_HOST__ __MATX_DEVICE__ auto operator()(index_t i) const
    {

include/matx/generators/diag.h

Lines changed: 0 additions & 10 deletions
@@ -68,16 +68,6 @@ namespace matx
      }
    }

-    template <OperatorCapability Cap>
-    __MATX_INLINE__ __MATX_HOST__ auto get_capability_proc() const {
-      if constexpr (Cap == OperatorCapability::ELEMENTS_PER_THREAD) {
-        const auto my_cap = cuda::std::array<ElementsPerThread, 2>{ElementsPerThread::ONE, ElementsPerThread::ONE};
-        return my_cap;
-      } else {
-        return detail::capability_attributes<Cap>::default_value;
-      }
-    }
-
    // Does not support vectorization yet
    template <typename CapType, typename... Is>
    __MATX_INLINE__ __MATX_DEVICE__ __MATX_HOST__ auto operator()(Is... indices) const {

include/matx/generators/linspace.h

Lines changed: 0 additions & 11 deletions
@@ -79,17 +79,6 @@ namespace matx
      }
    }

-    template <OperatorCapability Cap>
-    __MATX_INLINE__ __MATX_HOST__ auto get_capability_proc() const {
-      if constexpr (Cap == OperatorCapability::ELEMENTS_PER_THREAD) {
-        const auto my_cap = cuda::std::array<ElementsPerThread, 2>{ElementsPerThread::ONE, ElementsPerThread::ONE};
-        return my_cap;
-      } else {
-        auto self_has_cap = detail::capability_attributes<Cap>::default_value;
-        return self_has_cap;
-      }
-    }
-
    template <typename CapType, typename... Is>
    __MATX_DEVICE__ __MATX_HOST__ __MATX_INLINE__ auto operator()(Is... indices) const {
      static_assert(sizeof...(indices) == NUM_RC, "Number of indices incorrect in linspace");

include/matx/generators/logspace.h

Lines changed: 0 additions & 10 deletions
@@ -98,16 +98,6 @@ namespace matx
      }
    }

-    template <OperatorCapability Cap>
-    __MATX_INLINE__ __MATX_HOST__ auto get_capability_proc() const {
-      if constexpr (Cap == OperatorCapability::ELEMENTS_PER_THREAD) {
-        const auto my_cap = cuda::std::array<ElementsPerThread, 2>{ElementsPerThread::ONE, ElementsPerThread::ONE};
-        return my_cap;
-      } else {
-        auto self_has_cap = detail::capability_attributes<Cap>::default_value;
-        return self_has_cap;
-      }
-    }

    __MATX_DEVICE__ __MATX_HOST__ __MATX_INLINE__ auto operator()(index_t idx) const
    {
