
Commit f1f6e8e

xiaoxmeng authored and meta-codesync[bot] committed
feat: Add IndexReader interface for index-based lookups (#16330)
Summary:
Pull Request resolved: #16330

This diff introduces the `IndexReader` interface in `Reader.h` to provide a clean abstraction for index-based data lookups, separate from the general-purpose `RowReader` interface.

**New `IndexReader` class**
- Abstract interface for index-based lookups with methods:
  - `encodeIndexBounds()` - Encodes index bounds into format-specific encoded key bounds
  - `lookupStripes()` - Looks up stripes that contain data matching the encoded key bounds
  - `setStripeRowRanges()` - Sets up row ranges for reading a specific stripe based on encoded bounds
  - `next()` - Pure virtual method to fetch the next portion of rows (without mutation support)

**New `createIndexReader()` method in `Reader`**
- Factory method to create an `IndexReader` instance. Default implementation throws `VELOX_UNSUPPORTED`, allowing format-specific readers (e.g., Nimble) to override and provide index reading support.

**Supporting data structures**
- `RowRange`, `StripeRowRanges`, and `StripeLookupResult` structs for representing row ranges and stripe lookup results.

This separation allows `HiveIndexReader` to use a dedicated `IndexReader` interface optimized for batched key-based lookups, while keeping the `RowReader` interface focused on sequential row-by-row reading with mutation support.

Reviewed By: Yuhta

Differential Revision: D92851481

fbshipit-source-id: 09838895db5a7dfe895a4722fa7d9223aeccc49d
1 parent b88ce66 commit f1f6e8e

4 files changed: +143 -26 lines changed


velox/dwio/common/Reader.h

Lines changed: 98 additions & 0 deletions
@@ -27,6 +27,7 @@
 #include "velox/dwio/common/SelectiveColumnReader.h"
 #include "velox/dwio/common/Statistics.h"
 #include "velox/dwio/common/TypeWithId.h"
+#include "velox/serializers/KeyEncoder.h"
 #include "velox/type/Type.h"
 #include "velox/vector/BaseVector.h"

@@ -186,6 +187,92 @@ class RowReader {
       VectorPtr& result);
 };

+/// Represents a row range within a stripe [startRow, endRow).
+struct RowRange {
+  vector_size_t startRow{0}; // Inclusive
+  vector_size_t endRow{0}; // Exclusive
+
+  RowRange() = default;
+  RowRange(vector_size_t _startRow, vector_size_t _endRow)
+      : startRow(_startRow), endRow(_endRow) {}
+
+  /// Returns true if this row range is empty (no rows to read).
+  bool empty() const {
+    return startRow >= endRow;
+  }
+};
+
+/**
+ * Abstract index reader interface for index-based lookups.
+ *
+ * IndexReader provides methods for encoding index bounds, looking up stripes,
+ * and reading data within specific row ranges. This interface is used by
+ * HiveIndexReader to perform efficient key-based lookups on indexed files.
+ */
+class IndexReader {
+ public:
+  virtual ~IndexReader() = default;
+
+  using KeyBoundsVector = std::vector<velox::serializer::EncodedKeyBounds>;
+
+  /// Encodes index bounds into format-specific encoded key bounds.
+  /// Different file formats may use different key encoding schemes, so this
+  /// allows the format-specific reader to handle the encoding.
+  ///
+  /// @param indexBounds The index bounds to encode, containing column names
+  /// and lower/upper bound values.
+  /// @return A vector of encoded key bounds, one for each row in the input
+  /// bounds.
+  /// @throws if encoding is not supported by the implementation or if any
+  /// index bound fails to encode.
+  virtual KeyBoundsVector encodeIndexBounds(
+      const velox::serializer::IndexBounds& indexBounds) = 0;
+
+  /// Looks up stripes that contain data matching the encoded key bounds.
+  /// For each request row, returns the list of stripe indices that may contain
+  /// matching data based on the encoded lower and upper key bounds.
+  ///
+  /// @param keyBounds The encoded key bounds for each request row.
+  /// @return Stripe indices for each request row. Each inner vector contains
+  /// the indices of stripes that may contain matching data for that
+  /// request.
+  /// @throws if lookup is not supported by the implementation.
+  virtual std::vector<std::vector<uint32_t>> lookupStripes(
+      const KeyBoundsVector& keyBounds) = 0;
+
+  /// Looks up row ranges within a specific stripe based on encoded key bounds.
+  /// Computes row ranges per request without setting up state for iteration.
+  ///
+  /// @param stripeIndex The index of the stripe to compute row ranges for.
+  /// @param keyBounds The encoded key bounds for each request.
+  /// @return Row ranges for each request, one per input encoded key bounds.
+  /// Empty ranges (startRow >= endRow) are included for requests with
+  /// no matching data.
+  /// @throws if lookup is not supported by the implementation.
+  virtual std::vector<RowRange> lookupRowRanges(
+      uint32_t stripeIndex,
+      const KeyBoundsVector& keyBounds) = 0;
+
+  /// Sets row ranges for reading from a specific stripe. Must be called before
+  /// next() to set up the iteration state.
+  ///
+  /// @param stripeIndex The index of the stripe to read from.
+  /// @param rowRanges The row ranges to read within the stripe.
+  /// @throws if setting row ranges is not supported by the implementation.
+  virtual void setRowRanges(
+      uint32_t stripeIndex,
+      const std::vector<RowRange>& rowRanges) = 0;
+
+  /**
+   * Fetch the next portion of rows.
+   * @param size Max number of rows to read
+   * @param result output vector
+   * @return number of rows scanned in the file (including any rows filtered
+   * out), 0 if there are no more rows to read.
+   */
+  virtual uint64_t next(uint64_t size, velox::VectorPtr& result) = 0;
+};
+
 /**
  * Abstract reader class.
  *
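
Reviewer sketch (not part of the commit): the hunk above implies a call order of `lookupStripes()`, then `lookupRowRanges()` per stripe, then `setRowRanges()` followed by `next()` until it returns 0. The toy below walks that order without depending on Velox. `ToyKeyBounds`, `ToyIndexReader`, and `readMatching` are hypothetical names; plain integer keys stand in for `EncodedKeyBounds`, and a `std::vector<int64_t>` stands in for `VectorPtr`. How `HiveIndexReader` actually drives the interface is not shown in this diff, so treat the driver loop as an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Stand-in for dwio::common::RowRange: half-open [startRow, endRow).
struct RowRange {
  int32_t startRow{0};
  int32_t endRow{0};
  bool empty() const { return startRow >= endRow; }
};

// Hypothetical stand-in for serializer::EncodedKeyBounds. Plain integers
// replace the encoded byte strings; the key range is [lower, upper).
struct ToyKeyBounds {
  int64_t lower;
  int64_t upper;
};

// Toy in-memory "file": keys are sorted within and across stripes.
class ToyIndexReader {
 public:
  explicit ToyIndexReader(std::vector<std::vector<int64_t>> stripes)
      : stripes_(std::move(stripes)) {}

  // Per request, the stripes whose [min, max] key span overlaps the bounds.
  std::vector<std::vector<uint32_t>> lookupStripes(
      const std::vector<ToyKeyBounds>& keyBounds) const {
    std::vector<std::vector<uint32_t>> result(keyBounds.size());
    for (std::size_t i = 0; i < keyBounds.size(); ++i) {
      for (uint32_t s = 0; s < stripes_.size(); ++s) {
        const auto& keys = stripes_[s];
        if (!keys.empty() && keys.front() < keyBounds[i].upper &&
            keys.back() >= keyBounds[i].lower) {
          result[i].push_back(s);
        }
      }
    }
    return result;
  }

  // One row range per request; requests without matches get an empty range.
  std::vector<RowRange> lookupRowRanges(
      uint32_t stripeIndex,
      const std::vector<ToyKeyBounds>& keyBounds) const {
    const auto& keys = stripes_.at(stripeIndex);
    std::vector<RowRange> ranges;
    for (const auto& bounds : keyBounds) {
      const auto first =
          std::lower_bound(keys.begin(), keys.end(), bounds.lower);
      const auto last =
          std::lower_bound(keys.begin(), keys.end(), bounds.upper);
      ranges.push_back(RowRange{
          static_cast<int32_t>(first - keys.begin()),
          static_cast<int32_t>(last - keys.begin())});
    }
    return ranges;
  }

  // Stores the rows to visit; must be called before next().
  void setRowRanges(uint32_t stripeIndex, const std::vector<RowRange>& ranges) {
    stripeIndex_ = stripeIndex;
    cursor_ = 0;
    rows_.clear();
    for (const auto& range : ranges) {
      for (int32_t row = range.startRow; row < range.endRow; ++row) {
        rows_.push_back(row);
      }
    }
  }

  // Appends up to 'size' keys to 'result'; returns 0 once exhausted.
  // (The toy has no filters, so rows scanned equals rows appended.)
  uint64_t next(uint64_t size, std::vector<int64_t>& result) {
    uint64_t numRead = 0;
    for (; numRead < size && cursor_ < rows_.size(); ++numRead) {
      result.push_back(stripes_[stripeIndex_][rows_[cursor_++]]);
    }
    return numRead;
  }

 private:
  std::vector<std::vector<int64_t>> stripes_;
  uint32_t stripeIndex_{0};
  std::size_t cursor_{0};
  std::vector<int32_t> rows_;
};

// Assumed driver loop: stripes -> row ranges -> iteration.
std::vector<int64_t> readMatching(
    ToyIndexReader& reader,
    const std::vector<ToyKeyBounds>& keyBounds) {
  std::vector<uint32_t> stripes;
  for (const auto& perRequest : reader.lookupStripes(keyBounds)) {
    stripes.insert(stripes.end(), perRequest.begin(), perRequest.end());
  }
  std::sort(stripes.begin(), stripes.end());
  stripes.erase(std::unique(stripes.begin(), stripes.end()), stripes.end());

  std::vector<int64_t> out;
  for (const auto stripe : stripes) {
    reader.setRowRanges(stripe, reader.lookupRowRanges(stripe, keyBounds));
    while (reader.next(2, out) > 0) {
    }
  }
  return out;
}
```

Keeping `lookupRowRanges()` free of iteration state, as the doc comment in the hunk describes, is what lets the driver compute ranges for a whole batch of requests before committing to a read with `setRowRanges()`.
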
@@ -236,6 +323,17 @@ class Reader {
   virtual std::unique_ptr<RowReader> createRowReader(
       const RowReaderOptions& options = {}) const = 0;

+  /**
+   * Create index reader object for index-based lookups.
+   * @param options Row reader options describing the data to fetch
+   * @return Index reader for efficient key-based lookups
+   * @throws if index reading is not supported by the implementation
+   */
+  virtual std::unique_ptr<IndexReader> createIndexReader(
+      const RowReaderOptions& options = {}) const {
+    VELOX_UNSUPPORTED("Reader::createIndexReader() is not supported");
+  }
+
   static TypePtr updateColumnNames(
       const TypePtr& fileType,
       const TypePtr& tableType);
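
Reviewer sketch (not part of the commit): `createIndexReader()` is virtual with a throwing default rather than pure virtual, so existing `Reader` subclasses keep compiling and only formats with an index need to override it. A minimal standalone illustration of that pattern follows. All type names are hypothetical, and `std::logic_error` stands in for whatever exception `VELOX_UNSUPPORTED` raises.

```cpp
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for dwio::common::IndexReader.
struct ToyIndexReader {
  virtual ~ToyIndexReader() = default;
  virtual std::string name() const = 0;
};

// Hypothetical stand-in for dwio::common::Reader.
class ToyReader {
 public:
  virtual ~ToyReader() = default;

  // Throwing default: formats without an index inherit this unchanged.
  virtual std::unique_ptr<ToyIndexReader> createIndexReader() const {
    throw std::logic_error("createIndexReader() is not supported");
  }
};

// A format without index support does not need to mention the method.
class PlainReader : public ToyReader {};

// A format with index support overrides the factory.
class IndexedReader : public ToyReader {
 public:
  std::unique_ptr<ToyIndexReader> createIndexReader() const override {
    return std::make_unique<Impl>();
  }

 private:
  struct Impl : ToyIndexReader {
    std::string name() const override {
      return "indexed";
    }
  };
};

// Callers can probe for support by catching the "unsupported" error.
bool supportsIndexReader(const ToyReader& reader) {
  try {
    return reader.createIndexReader() != nullptr;
  } catch (const std::logic_error&) {
    return false;
  }
}
```

The trade-off is that a missing override surfaces at run time instead of compile time, which seems acceptable here because index support is optional per format.
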

velox/dwio/common/tests/ReaderTest.cpp

Lines changed: 12 additions & 0 deletions
@@ -211,5 +211,17 @@ TEST_F(ReaderTest, projectColumnsMutation) {
   EXPECT_NE(0, numNonMax);
 }

+TEST_F(ReaderTest, rowRangeEmpty) {
+  // Empty when startRow >= endRow
+  EXPECT_TRUE((RowRange{0, 0}.empty()));
+  EXPECT_TRUE((RowRange{5, 5}.empty()));
+  EXPECT_TRUE((RowRange{10, 5}.empty()));
+
+  // Not empty when startRow < endRow
+  EXPECT_FALSE((RowRange{0, 1}.empty()));
+  EXPECT_FALSE((RowRange{0, 10}.empty()));
+  EXPECT_FALSE((RowRange{5, 10}.empty()));
+}
+
 } // namespace
 } // namespace facebook::velox::dwio::common
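
Reviewer sketch (not part of the commit): the test treats an inverted range such as `{10, 5}` as empty. One reason that matters is that any consumer computing `endRow - startRow` would otherwise see a negative count. The helper below, `countRows`, is hypothetical and uses `int32_t` in place of `vector_size_t`.

```cpp
#include <cstdint>
#include <vector>

// Stand-in for dwio::common::RowRange: half-open [startRow, endRow).
struct RowRange {
  int32_t startRow{0};
  int32_t endRow{0};
  bool empty() const {
    return startRow >= endRow;
  }
};

// Sums the rows covered by a list of ranges, skipping empty and inverted
// ranges so that they contribute zero rather than a negative count.
int64_t countRows(const std::vector<RowRange>& ranges) {
  int64_t total = 0;
  for (const auto& range : ranges) {
    if (range.empty()) {
      continue;
    }
    total += static_cast<int64_t>(range.endRow) - range.startRow;
  }
  return total;
}
```
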

velox/serializers/KeyEncoder.cpp

Lines changed: 24 additions & 17 deletions
@@ -20,34 +20,41 @@
 #include "velox/vector/FlatVector.h"

 namespace facebook::velox::serializer {
+namespace {

-bool IndexBounds::validate() const {
-  if (!lowerBound.has_value() && !upperBound.has_value()) {
+bool validateBound(
+    const IndexBound& indexBound,
+    const std::vector<std::string>& indexColumns) {
+  if (indexBound.bound == nullptr || indexBound.bound->size() == 0) {
     return false;
   }

-  const auto validateBound = [this](const IndexBound& bound) {
-    if (bound.bound == nullptr || bound.bound->size() == 0) {
+  const auto& rowType = asRowType(indexBound.bound->type());
+  if (rowType->size() != indexColumns.size()) {
+    return false;
+  }
+  for (const auto& columnName : indexColumns) {
+    if (!rowType->containsChild(columnName)) {
       return false;
     }
+  }
+  return true;
+}

-    const auto& rowType = asRowType(bound.bound->type());
-    if (rowType->size() != indexColumns.size()) {
-      return false;
-    }
-    for (const auto& columnName : indexColumns) {
-      if (!rowType->containsChild(columnName)) {
-        return false;
-      }
-    }
-    return true;
-  };
+} // namespace
+
+bool IndexBounds::validate() const {
+  if (!lowerBound.has_value() && !upperBound.has_value()) {
+    return false;
+  }

-  if (lowerBound.has_value() && !validateBound(lowerBound.value())) {
+  if (lowerBound.has_value() &&
+      !validateBound(lowerBound.value(), indexColumns)) {
     return false;
   }

-  if (upperBound.has_value() && !validateBound(upperBound.value())) {
+  if (upperBound.has_value() &&
+      !validateBound(upperBound.value(), indexColumns)) {
     return false;
   }
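
Reviewer sketch (not part of the commit): the refactor moves the `[this]`-capturing lambda into a free function in an anonymous namespace and passes `indexColumns` explicitly. The standalone version below mirrors that shape with simplified, hypothetical types: `ToyBound` carries a list of column names instead of a `RowVector`, and the trailing `return true` in `validate()` is an assumption because the hunk is cut off before the end of the function. It needs C++17 for `std::optional`.

```cpp
#include <algorithm>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for IndexBound: only the column names of the bound's
// row type are modeled.
struct ToyBound {
  std::vector<std::string> columns;
};

// Hypothetical stand-in for IndexBounds.
struct ToyIndexBounds {
  std::vector<std::string> indexColumns;
  std::optional<ToyBound> lowerBound;
  std::optional<ToyBound> upperBound;

  bool validate() const;
};

namespace {

// Free function with explicit inputs, replacing a lambda that captured 'this'.
bool validateBound(
    const ToyBound& bound,
    const std::vector<std::string>& indexColumns) {
  if (bound.columns.empty()) {
    return false;
  }
  if (bound.columns.size() != indexColumns.size()) {
    return false;
  }
  for (const auto& columnName : indexColumns) {
    if (std::find(bound.columns.begin(), bound.columns.end(), columnName) ==
        bound.columns.end()) {
      return false;
    }
  }
  return true;
}

} // namespace

bool ToyIndexBounds::validate() const {
  // At least one side of the range must be present.
  if (!lowerBound.has_value() && !upperBound.has_value()) {
    return false;
  }
  if (lowerBound.has_value() && !validateBound(*lowerBound, indexColumns)) {
    return false;
  }
  if (upperBound.has_value() && !validateBound(*upperBound, indexColumns)) {
    return false;
  }
  return true; // Assumed: the tail of the real function is not in the hunk.
}
```

Note that the check is order-insensitive: a bound whose columns are a permutation of `indexColumns` passes, which matches the `containsChild()` lookup in the diff.
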

velox/serializers/KeyEncoder.h

Lines changed: 9 additions & 9 deletions
@@ -174,14 +174,12 @@ class KeyEncoder {
   /// Increment fails when values are at their maximum (e.g., INT_MAX, strings
   /// with all \xFF characters, or nulls in NULLS_LAST ordering).
   ///
-  /// For multi-row bounds, returns a vector with one EncodedKeyBounds per row.
-  /// Each row is processed independently.
-  /// Encodes index bounds into byte-comparable key strings.
   /// Takes an IndexBounds containing lower and/or upper bounds and encodes them
-  /// into EncodedKeyBounds for efficient range comparison.
-  /// Throws if any lower bound fails to bump up (for exclusive bounds).
-  /// For upper bound bump up failures, the upperKey is set to std::nullopt
-  /// (unbounded).
+  /// into EncodedKeyBounds for efficient range comparison. Returns a vector
+  /// with one EncodedKeyBounds per row in 'indexBounds'. Each row is encoded
+  /// into a byte-comparable key string. Throws if any lower bound fails to bump
+  /// up (for exclusive bounds). For upper bound bump up failures, the upperKey
+  /// is set to std::nullopt (unbounded).
   std::vector<EncodedKeyBounds> encodeIndexBounds(
       const IndexBounds& indexBounds);

@@ -204,8 +202,10 @@ class KeyEncoder {
   std::vector<std::string> encode(const RowVectorPtr& input);

   // Creates a new row vector with the key columns incremented by 1 for multiple
-  // rows. Returns nullptr if any row fails to increment (all key columns
-  // overflow), otherwise returns RowVectorPtr with incremented values.
+  // rows. For each row, the increment takes place from the rightmost (least
+  // significant) column.
+  // Returns nullptr if any row fails to increment (all key columns overflow),
+  // otherwise returns RowVectorPtr with incremented values.
   RowVectorPtr createIncrementedBounds(const RowVectorPtr& bounds) const;

   // Encodes a single column for all rows.
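
Reviewer sketch (not part of the commit): the updated comment on `createIncrementedBounds()` describes an increment that starts at the rightmost key column and fails only when every column overflows. The toy below shows one way to read that for `int32_t`-only keys. `incrementKey` and `incrementRows` are hypothetical names, and resetting an overflowed column to its minimum before carrying left is an assumption; the real implementation also has to handle other types, nulls, and sort order, none of which is modeled here. It needs C++17 for `std::optional`.

```cpp
#include <cstdint>
#include <limits>
#include <optional>
#include <utility>
#include <vector>

// Increments a composite key, starting from the rightmost (least significant)
// column. Returns std::nullopt if every column is already at its maximum.
std::optional<std::vector<int32_t>> incrementKey(std::vector<int32_t> key) {
  constexpr auto kMax = std::numeric_limits<int32_t>::max();
  constexpr auto kMin = std::numeric_limits<int32_t>::min();
  for (auto it = key.rbegin(); it != key.rend(); ++it) {
    if (*it < kMax) {
      ++*it;
      return key;
    }
    // Assumed carry behavior: reset this column and move one column left.
    *it = kMin;
  }
  return std::nullopt;
}

// Batch version: fails as a whole if any row fails, mirroring the documented
// "returns nullptr if any row fails to increment".
std::optional<std::vector<std::vector<int32_t>>> incrementRows(
    const std::vector<std::vector<int32_t>>& rows) {
  std::vector<std::vector<int32_t>> out;
  out.reserve(rows.size());
  for (const auto& row : rows) {
    auto next = incrementKey(row);
    if (!next.has_value()) {
      return std::nullopt;
    }
    out.push_back(std::move(*next));
  }
  return out;
}
```

Under these assumptions the result is the smallest key that sorts strictly after the input in ascending lexicographic order, which is the usual way to turn an exclusive lower bound into an inclusive one.
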
