
Commit 77336b8

Add documentation
1 parent f1c12d5 commit 77336b8

File tree

1 file changed: +50 -51 lines changed


src/core/reference/include/openvino/reference/adaptive_rkv_diversity.hpp

Lines changed: 50 additions & 51 deletions
@@ -19,26 +19,18 @@
 namespace ov::reference {
 
 
-/** @brief Reference implementation of the XAttention sparse attention prefill mechanism
- * (https://arxiv.org/abs/2503.16428) */
+/** @brief Reference implementation of the Adaptive R-KV token diversity calculation mechanism
+ * (https://arxiv.org/pdf/2505.24133v3) */
 template <typename T>
 class AdaptiveRKVDiversityCalculator {
 public:
-    /** @param threshold Defines a threshold for introduced block sparsity - XAttention attempts to preserve the
-     * smallest subset of attention score matrix blocks so that the ratio of the attention score sum to the total sum of
-     * attention score matrix elements is no less than `threshold`. In other words, `threshold` defines a fraction of
-     * the attention score mass which is to be preserved by most "important" blocks. Valid range is 0.0-1.0, with 0.0
-     * corresponding to 0% of the blocks retained, and 1.0 corresponding to 100% of the blocks retained.
-     * @param block_size The size of blocks into which the attention score matrix [num_heads, query_token_dimension,
-     * key_token_dimension] will be subdivided for purposes of determining the subset of the most important blocks
-     * according to `threshold`. This subdivision occurs on query and key dimensions of the attention score matrix with
-     * the same granularity, i.e. the resulting blocks have equal size on both dimensions. Essentially `block_size`
-     * defines the granularity of the eventual sparse attention computations. Must be a multiple of `stride`.
-     * @param stride The stride at which the full attention matrix is subsampled in a block-antidiagonal fashion to
-     * estimate the block importance. Note that the full attention matrix is not computed, instead the original query
-     * and key matrices are reshaped appropriately so that only the necessary elements are computed. Ideally, the
-     * computational complexity of the entire block estimation operation is `stride` times lower than the full attention
-     * matrix computation.
+    /** @param start_size Size, in tokens, of the key cache area that will be ignored for purposes of diversity
+     * calculation, starting from the beginning of the token dimension ("start area"). Must be a multiple of `block_size`.
+     * @param eviction_size Size, in tokens, of the area immediately following the start area, the tokens in which will be
+     * considered for purposes of diversity calculation ("eviction area"). The rest of the tokens after the eviction area,
+     * if any, are ignored. Must be a multiple of `block_size`.
+     * @param block_size Block size of the underlying paged attention implementation. The diversity values will be sum-reduced
+     * from per-token values to per-block values based on this number of tokens in a block.
      * */
     AdaptiveRKVDiversityCalculator(size_t start_size, size_t eviction_size, size_t block_size)
         : m_start_size(start_size),
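
To make the three constructor parameters documented above concrete, here is a small standalone sketch (not part of the header; all numeric values are hypothetical) of how the token dimension of the key cache is partitioned under `start_size`, `eviction_size` and `block_size`:

```cpp
#include <cassert>
#include <cstddef>
#include <iostream>

int main() {
    // Hypothetical values; the real ones come from the eviction configuration.
    const std::size_t block_size = 16;       // paged attention block size
    const std::size_t start_size = 32;       // "start area", ignored for diversity
    const std::size_t eviction_size = 128;   // "eviction area", scored for diversity
    const std::size_t num_key_tokens = 192;  // total tokens currently in the key cache

    // Mirrors the OPENVINO_ASSERT checks in the constructor above.
    assert(start_size % block_size == 0);
    assert(eviction_size % block_size == 0);
    // The key cache must cover both areas (see the @param key_shape note further below).
    assert(num_key_tokens >= start_size + eviction_size);

    // Tokens [0, start_size) are skipped, tokens [start_size, start_size + eviction_size)
    // are scored, and anything past the eviction area is ignored as well.
    std::cout << "eviction area tokens: [" << start_size << ", "
              << start_size + eviction_size << ")\n";
    std::cout << "eviction blocks: " << eviction_size / block_size << "\n";
    return 0;
}
```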
@@ -48,17 +40,11 @@ class AdaptiveRKVDiversityCalculator {
         OPENVINO_ASSERT(eviction_size % block_size == 0);
     }
 
-    /** Divides the input rank-3 tensor into blocks along last two dimensions, performs the addition of the values
-     * inside each block and outputs each block sum into corresponding positions in the output tensor downsampled along
-     * the same dimensions. The output tensor dimensions are such that the query and key token dimensions are
-     * downsampled by `block_size` when compared to the *original* query and key tensors.
-     * @param attention_scores_data Pointer to the attention score input.
-     * @param attention_score_shape Shape of the attention score input tensor. Expected shape is [num_heads,
-     * num_query_tokens / stride, num_key_tokens / stride], where `num_query_tokens` and `num_key_tokens` must be
-     * multiples of `block_size`.
-     * @param out Pointer to the output tensor data (block sums)
-     * @param out_shape Shape of the output tensor data. Expected shape is [num_heads, num_query_tokens / block_size,
-     * num_key_tokens / block_size].
+    /** Fills the diagonal of each square matrix slice (at ranks 1 and 2, zero-based) of the input rank-3 tensor with
+     * a provided value. The operation is done in-place.
+     * @param in_out Pointer to the matrix data.
+     * @param in_out_shape Shape of the matrix data. Expected shape is [num_heads, token_dim, token_dim].
+     * @param val Value to fill in the diagonal positions.
      */
     void fill_diagonal_(T* in_out,
                         const Shape& in_out_shape,
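
For illustration, a minimal standalone approximation of the diagonal-fill operation documented in this hunk. The free-function form and flat `std::vector` layout are assumptions standing in for the raw-pointer/`Shape` interface of the member function:

```cpp
#include <cstddef>
#include <vector>

// Fill element (i, i) of every [token_dim, token_dim] slice of a flat
// [num_heads, token_dim, token_dim] tensor with `val`, in place.
template <typename T>
void fill_diagonal(std::vector<T>& in_out, std::size_t num_heads, std::size_t token_dim, T val) {
    for (std::size_t head = 0; head < num_heads; ++head) {
        T* slice = in_out.data() + head * token_dim * token_dim;
        for (std::size_t i = 0; i < token_dim; ++i) {
            slice[i * token_dim + i] = val;  // diagonal element of the square slice
        }
    }
}

int main() {
    const std::size_t num_heads = 2, token_dim = 4;
    std::vector<float> similarity(num_heads * token_dim * token_dim, 0.5f);
    // E.g. neutralize self-similarity on the diagonal before diversity scoring (assumed use case).
    fill_diagonal(similarity, num_heads, token_dim, 0.0f);
    return 0;
}
```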
@@ -77,6 +63,12 @@ class AdaptiveRKVDiversityCalculator {
         }
     }
 
+    /** For a rank-3 tensor, zeroes out each value that is less than the mean of the values of its rank-2 (zero-based) slice. The input tensor must have equal sizes along ranks 1 and 2. Mean values are computed and provided externally. The operation is done in-place.
+     * @param in_out Pointer to the tensor data.
+     * @param in_out_shape Shape of the tensor data. Expected shape is [num_heads, token_dim, token_dim].
+     * @param means Pointer to the tensor data containing the means of each slice of the `in_out` tensor along its rank 2 (zero-based).
+     * @param means_shape Shape of the means tensor. Expected shape is [num_heads, token_dim].
+     */
     void fill_low_values_with_zeros_(T* in_out,
                                      const Shape& in_out_shape,
                                      const T* means,
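
Similarly, a standalone sketch of the mean-thresholding step documented above; the function name, layout, and example values are illustrative assumptions:

```cpp
#include <cstddef>
#include <vector>

// Zero every entry that falls below the externally provided mean of its rank-2 slice, in place.
template <typename T>
void zero_below_slice_mean(std::vector<T>& in_out,       // [num_heads, token_dim, token_dim]
                           const std::vector<T>& means,  // [num_heads, token_dim]
                           std::size_t num_heads,
                           std::size_t token_dim) {
    for (std::size_t head = 0; head < num_heads; ++head) {
        for (std::size_t row = 0; row < token_dim; ++row) {
            const T mean = means[head * token_dim + row];
            T* row_ptr = in_out.data() + (head * token_dim + row) * token_dim;
            for (std::size_t col = 0; col < token_dim; ++col) {
                if (row_ptr[col] < mean) {
                    row_ptr[col] = T{0};  // suppress below-average entries
                }
            }
        }
    }
}

int main() {
    std::vector<float> t = {0.1f, 0.9f, 0.4f, 0.6f};  // 1 head, token_dim = 2
    std::vector<float> means = {0.5f, 0.5f};          // per-row means, provided externally
    zero_below_slice_mean(t, means, 1, 2);            // -> {0, 0.9, 0, 0.6}
    return 0;
}
```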
@@ -102,17 +94,23 @@ class AdaptiveRKVDiversityCalculator {
         }
     }
 
-    void block_sum_diversity_values(const T* processed_similarity_token_data,
-                                    const Shape& processed_similarity_token_data_shape,
+    /** For a square matrix, sums each `block_size`-sized group of matrix rows to produce a row in the output matrix.
+     * @param in_data Pointer to the matrix data.
+     * @param in_shape Shape of the matrix data. Expected shape is [token_dim, token_dim], where `token_dim` must be a multiple of `block_size`.
+     * @param out Pointer to the output matrix data.
+     * @param out_shape Shape of the output matrix. Expected shape is [token_dim / block_size, token_dim].
+     */
+    void block_sum_diversity_values(const T* in_data,
+                                    const Shape& in_shape,
                                     T* out,
                                     const Shape& out_shape) {
-        OPENVINO_ASSERT(processed_similarity_token_data_shape.size() == 2);  // [token_dim, token_dim]
-        OPENVINO_ASSERT(processed_similarity_token_data_shape[0] == processed_similarity_token_data_shape[1]);
-        OPENVINO_ASSERT(processed_similarity_token_data_shape[0] % m_block_size == 0);
+        OPENVINO_ASSERT(in_shape.size() == 2);  // [token_dim, token_dim]
+        OPENVINO_ASSERT(in_shape[0] == in_shape[1]);
+        OPENVINO_ASSERT(in_shape[0] % m_block_size == 0);
 
         OPENVINO_ASSERT(out_shape.size() == 2);  // [block_dim, token_dim]
-        OPENVINO_ASSERT(out_shape[0] == processed_similarity_token_data_shape[0] / m_block_size);
-        OPENVINO_ASSERT(out_shape[1] == processed_similarity_token_data_shape[1]);
+        OPENVINO_ASSERT(out_shape[0] == in_shape[0] / m_block_size);
+        OPENVINO_ASSERT(out_shape[1] == in_shape[1]);
 
         std::memset(out, 0, out_shape[0] * out_shape[1] * sizeof(T));

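A standalone sketch of the per-block row reduction documented in this hunk. Matching the `-=` accumulation visible in the next hunk, larger similarity sums translate into lower (more redundant) diversity scores; names and the flat layout are assumptions:

```cpp
#include <cstddef>
#include <vector>

// Reduce a [token_dim, token_dim] matrix to [token_dim / block_size, token_dim] by
// accumulating the negated sum of each block_size-sized group of rows, column by column.
template <typename T>
std::vector<T> block_sum_rows(const std::vector<T>& in,  // [token_dim, token_dim]
                              std::size_t token_dim,
                              std::size_t block_size) {
    const std::size_t num_blocks = token_dim / block_size;
    std::vector<T> out(num_blocks * token_dim, T{0});
    for (std::size_t block = 0; block < num_blocks; ++block) {
        for (std::size_t col = 0; col < token_dim; ++col) {
            for (std::size_t row_in_block = 0; row_in_block < block_size; ++row_in_block) {
                const std::size_t row = block * block_size + row_in_block;
                out[block * token_dim + col] -= in[row * token_dim + col];
            }
        }
    }
    return out;
}

int main() {
    // token_dim = 4, block_size = 2 -> output is 2 x 4 (negated column sums per row block).
    std::vector<float> in = {1, 2, 3, 4,
                             5, 6, 7, 8,
                             1, 1, 1, 1,
                             2, 2, 2, 2};
    auto out = block_sum_rows(in, 4, 2);  // row 0: {-6, -8, -10, -12}; row 1: {-3, -3, -3, -3}
    return 0;
}
```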
@@ -121,28 +119,29 @@ class AdaptiveRKVDiversityCalculator {
             for (size_t out_token_dim_idx = 0; out_token_dim_idx < out_shape[1]; out_token_dim_idx++) {
                 size_t in_block_offset = (out_block_dim_idx * m_block_size) * out_shape[1];
                 for (size_t in_token_in_block_idx = 0; in_token_in_block_idx < m_block_size; in_token_in_block_idx++) {
-                    size_t source_offset = in_block_offset + in_token_in_block_idx * processed_similarity_token_data_shape[1] + out_token_dim_idx;
-                    out[out_block_offset + out_token_dim_idx] -= processed_similarity_token_data[source_offset];
+                    size_t source_offset = in_block_offset + in_token_in_block_idx * in_shape[1] + out_token_dim_idx;
+                    out[out_block_offset + out_token_dim_idx] -= in_data[source_offset];
                 }
             }
         }
     }
 
-    /** Applies XAttention to the provided query and key matrices, returning the subset of the most important blocks for
-     * each attention head, according to the configured block size and threshold, which are to be preserved in the
-     * subsequent sparse attention computation.
-     * @param query_data Pointer to the query input tensor data
-     * @param query_shape Shape of the query input tensor data. Expected shape is [num_heads, num_query_tokens,
-     * head_size], where `num_query_tokens` must be a multiple of both `block_size` and `stride`, padded with zeroes if
-     * necessary to do so in the real-world scenario.
-     * @param key_data Pointer to the key input tensor data
+    /** Calculates token diversity in the eviction area, partially aggregating the results per-block. The resulting
+     * diversity values have the shape of [num_eviction_blocks (== eviction_size / block_size), eviction_size]. Note
+     * that the rank-1 dimension is left unaggregated when compared to the full diversity calculation algorithm. The reason
+     * for this is as follows. The final per-block diversity value computation relies on knowing the subset of blocks
+     * in the eviction area that will be retained regardless of calculated diversity. This subset must be filtered out
+     * from the rank-1 dimension when performing reduce-mean in the original algorithm to get 1 diversity value per block
+     * in the eviction area. Due to implementation specifics, the paged attention kernel does not know ahead of time which
+     * blocks will be "retained" - this information is only available on the openvino.genai level after the PA kernel has executed.
+     * Therefore the PA kernel will provide raw per-token values along rank 1 of the returned diversity value matrix and delegate
+     * the final reduce-mean and filtering to the openvino.genai level.
+     * @param key_data Pointer to the key cache tensor data
      * @param key_shape Shape of the key input tensor data. Expected shape is [num_heads, num_key_tokens, head_size],
-     * where `num_key_tokens` must be a multiple of both `block_size` and `stride`, padded with zeroes if necessary to
-     * do so in the real-world scenario.
-     * @return A vector of size `num_heads` of sets, each set containing pairs of block indices (.first is the block
-     * index along the query dimension, .second - along the key). Each set is the head-specific subset of blocks that
-     * must be preserved in the sparse attention computation. Indices are given in units of XAttention-specific
-     * `block_size` (as configured), which may differ from the block size in the paged attention implementation.
+     * where `num_key_tokens` must be no less than `start_size + eviction_size`.
+     * @return A rank-2 matrix in the std::vector representation with dimensions [eviction_size / block_size, eviction_size] containing
+     * the diversity values. The values are expected to be further mean-reduced along rank 1 (zero-based) at the point in time when the
+     * subset of blocks to be exclusively retained is known.
      */
     std::vector<std::vector<T>> calculate_block_diversity(const T* key_data,
                                                           const Shape& key_shape) {
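
Finally, an illustrative sketch of the follow-up step that the comment above delegates to the openvino.genai level: mean-reducing each row of the returned [eviction_size / block_size, eviction_size] matrix over the tokens that do not belong to the unconditionally retained blocks. The helper name and data layout are hypothetical:

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Mean-reduce per-token diversity values to one value per eviction block, skipping tokens
// that belong to blocks which are retained regardless of diversity.
template <typename T>
std::vector<T> reduce_block_diversity(const std::vector<std::vector<T>>& per_token_diversity,
                                      std::size_t block_size,
                                      const std::unordered_set<std::size_t>& retained_blocks) {
    std::vector<T> per_block(per_token_diversity.size(), T{0});
    for (std::size_t block = 0; block < per_token_diversity.size(); ++block) {
        T sum = T{0};
        std::size_t count = 0;
        for (std::size_t token = 0; token < per_token_diversity[block].size(); ++token) {
            if (retained_blocks.count(token / block_size) != 0) {
                continue;  // skip tokens of blocks that are kept regardless of diversity
            }
            sum += per_token_diversity[block][token];
            ++count;
        }
        per_block[block] = (count != 0) ? sum / static_cast<T>(count) : T{0};
    }
    return per_block;
}

int main() {
    // Two eviction blocks of block_size = 2; block 1 is retained unconditionally.
    std::vector<std::vector<float>> per_token = {{-1.0f, -2.0f, -3.0f, -4.0f},
                                                 {-5.0f, -6.0f, -7.0f, -8.0f}};
    auto per_block = reduce_block_diversity(per_token, 2, {1});  // means over tokens 0..1 only
    return 0;
}
```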
