Contributor

@pmattione-nvidia pmattione-nvidia commented Jan 22, 2026

The Parquet repetition and definition levels are decoded multiple times over the course of a read: not just during the decode itself but also during setup, in compute_page_sizes_kernel() and compute_string_page_bounds_kernel(). During chunked reads even these setup steps run multiple times, which multiplies the cost of re-decoding the levels.

Instead we decode the levels just once per subpass into a temporary buffer, and just read these results wherever they're needed. This dramatically speeds up the list and chunked cuDF benchmarks, as highlighted below.
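As a rough host-side illustration of the decode-once idea (not the actual libcudf kernels), the sketch below expands the levels into a buffer a single time and lets two independent consumers read from it. Everything in it is hypothetical: `level_run`, `decoded_levels`, `decode_levels_once`, `count_rows`, and `count_valid` are made-up names, and the encoded stream is simplified to plain (value, count) runs rather than Parquet's RLE/bit-packed hybrid.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified stand-in for an encoded level stream: a list of (value, count) runs.
// Real Parquet levels use an RLE/bit-packed hybrid encoding.
struct level_run {
  std::uint8_t value;
  std::uint32_t count;
};

// Hypothetical per-subpass cache: levels are expanded once and afterwards only read.
struct decoded_levels {
  std::vector<std::uint8_t> def;  // definition levels, one per input value
  std::vector<std::uint8_t> rep;  // repetition levels, one per input value
};

// Expand (value, count) runs into one level per input value.
inline std::vector<std::uint8_t> expand_runs(std::vector<level_run> const& runs)
{
  std::vector<std::uint8_t> out;
  for (auto const& r : runs) { out.insert(out.end(), r.count, r.value); }
  return out;
}

// The only place where decoding happens.
inline decoded_levels decode_levels_once(std::vector<level_run> const& def_runs,
                                         std::vector<level_run> const& rep_runs)
{
  decoded_levels lv{expand_runs(def_runs), expand_runs(rep_runs)};
  assert(lv.def.size() == lv.rep.size());
  return lv;
}

// Setup-style consumer: a repetition level of 0 starts a new row.
inline std::size_t count_rows(decoded_levels const& lv)
{
  return static_cast<std::size_t>(std::count(lv.rep.begin(), lv.rep.end(), std::uint8_t{0}));
}

// Decode-style consumer: a leaf value is non-null when its definition level is the maximum.
inline std::size_t count_valid(decoded_levels const& lv, std::uint8_t max_def)
{
  return static_cast<std::size_t>(std::count(lv.def.begin(), lv.def.end(), max_def));
}
```

Both consumers take the already-decoded buffer by const reference, so adding a third reader (or re-running setup for another chunk) costs no additional decode.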

Centralizing the level decode has several advantages. First, the old (non-rle_stream) rep/def decode is removed entirely from decode_split_page_data_kernel(), decode_page_data(), and the delta decode kernels, which simplifies maintenance. Second, the decode kernels need less shared memory, since they no longer hold the rle_run and result buffers. Third, as the decode kernel complexity decreases, unnecessary buffer loops are removed and the register count drops. Finally, future improvements to the rle_stream decode can be studied in their own isolated environment (except that dictionary and bool decode still need it).

Benchmarks

  • Non-chunked int/float/bool: 4-14% faster
  • Non-chunked list<int>: 48% faster
  • Non-chunked list<string>: 20-35% faster
  • Chunked int/float/bool: 28-38% faster
  • Chunked list<int>: 59-67% faster
  • Chunked list<string>: 45-59% faster
  • All (non-list) string decodes: 10-14% faster
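The summary does not state how the percentages were computed. One plausible reading, assumed here, is the relative reduction in run time between the BEFORE and AFTER tables posted in this conversation. The helper name `percent_time_reduction` is made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// Percent reduction in run time: 100 * (before - after) / before.
// Assumption: this is how "N% faster" is meant; the PR text does not say.
inline double percent_time_reduction(double before_ms, double after_ms)
{
  assert(before_ms > 0.0);
  return 100.0 * (before_ms - after_ms) / before_ms;
}
```

For example, the chunked LIST_INT row with cardinality 0 and run_length 1 goes from 609.720 ms to 199.406 ms, a reduction of about 67.3%, and the chunked LIST_STR row with cardinality 1000 and run_length 1 goes from 181.387 ms to 74.376 ms, about 59.0%.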

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@pmattione-nvidia pmattione-nvidia self-assigned this Jan 22, 2026
@pmattione-nvidia pmattione-nvidia requested a review from a team as a code owner January 22, 2026 18:17
@pmattione-nvidia pmattione-nvidia added the Performance (Performance related issue), improvement (Improvement / enhancement to an existing function), and non-breaking (Non-breaking change) labels Jan 22, 2026
@github-actions github-actions bot added the libcudf (Affects libcudf (C++/CUDA) code.) label Jan 22, 2026
Contributor Author

pmattione-nvidia commented Jan 22, 2026

BEFORE Benchmarks:

parquet_read_decode

[0] NVIDIA RTX A5000

| data_type | io_type | cardinality | run_length | null_probability | data_size | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| INTEGRAL | DEVICE_BUFFER | 0 | 1 | 0.01 | 536870912 | 752x | 12.946 ms | 0.60% | 12.942 ms | 0.60% | 41483381052 | 480.065 MiB | 477.518 MiB |
| INTEGRAL | DEVICE_BUFFER | 1000 | 1 | 0.01 | 536870912 | 430x | 13.868 ms | 0.50% | 13.864 ms | 0.50% | 38725269081 | 158.141 MiB | 155.566 MiB |
| INTEGRAL | DEVICE_BUFFER | 0 | 32 | 0.01 | 536870912 | 544x | 11.477 ms | 0.55% | 11.473 ms | 0.55% | 46794974361 | 28.749 MiB | 26.174 MiB |
| INTEGRAL | DEVICE_BUFFER | 1000 | 32 | 0.01 | 536870912 | 608x | 11.341 ms | 0.55% | 11.337 ms | 0.55% | 47356869950 | 16.038 MiB | 13.463 MiB |
| FLOAT | DEVICE_BUFFER | 0 | 1 | 0.01 | 536870912 | 864x | 6.634 ms | 0.95% | 6.630 ms | 0.95% | 80979950416 | 501.348 MiB | 499.885 MiB |
| FLOAT | DEVICE_BUFFER | 1000 | 1 | 0.01 | 536870912 | 1312x | 9.279 ms | 0.54% | 9.275 ms | 0.54% | 57882832786 | 109.106 MiB | 107.603 MiB |
| FLOAT | DEVICE_BUFFER | 0 | 32 | 0.01 | 536870912 | 1232x | 7.249 ms | 0.53% | 7.245 ms | 0.53% | 74101802777 | 24.559 MiB | 23.056 MiB |
| FLOAT | DEVICE_BUFFER | 1000 | 32 | 0.01 | 536870912 | 278x | 7.441 ms | 0.50% | 7.437 ms | 0.50% | 72190318758 | 10.685 MiB | 9.182 MiB |
| BOOL8 | DEVICE_BUFFER | 0 | 1 | 0.01 | 536870912 | 592x | 24.121 ms | 0.58% | 24.116 ms | 0.58% | 22261570770 | 79.550 MiB | 71.654 MiB |
| BOOL8 | DEVICE_BUFFER | 1000 | 1 | 0.01 | 536870912 | 108x | 23.950 ms | 0.50% | 23.945 ms | 0.50% | 22421074654 | 79.024 MiB | 71.128 MiB |
| BOOL8 | DEVICE_BUFFER | 0 | 32 | 0.01 | 536870912 | 24x | 20.884 ms | 0.26% | 20.879 ms | 0.26% | 25713418544 | 29.456 MiB | 21.560 MiB |
| BOOL8 | DEVICE_BUFFER | 1000 | 32 | 0.01 | 536870912 | 25x | 20.820 ms | 0.44% | 20.816 ms | 0.44% | 25791807569 | 29.395 MiB | 21.499 MiB |
| STRING | DEVICE_BUFFER | 0 | 1 | 0.01 | 536870912 | 528x | 26.078 ms | 0.99% | 26.075 ms | 0.99% | 20589297063 | 476.884 MiB | 476.417 MiB |
| STRING | DEVICE_BUFFER | 1000 | 1 | 0.01 | 536870912 | 960x | 8.577 ms | 0.69% | 8.573 ms | 0.69% | 62623610865 | 34.437 MiB | 33.949 MiB |
| STRING | DEVICE_BUFFER | 0 | 32 | 0.01 | 536870912 | 572x | 26.149 ms | 1.00% | 26.146 ms | 1.00% | 20533667501 | 476.884 MiB | 476.417 MiB |
| STRING | DEVICE_BUFFER | 1000 | 32 | 0.01 | 536870912 | 60x | 8.419 ms | 0.49% | 8.415 ms | 0.49% | 63797663277 | 4.179 MiB | 3.691 MiB |
| LIST_INT | DEVICE_BUFFER | 0 | 1 | 0.01 | 536870912 | 299x | 50.137 ms | 1.04% | 50.133 ms | 1.04% | 10708840380 | 464.953 MiB | 462.473 MiB |
| LIST_INT | DEVICE_BUFFER | 1000 | 1 | 0.01 | 536870912 | 270x | 55.449 ms | 1.73% | 55.444 ms | 1.73% | 9683170117 | 158.619 MiB | 156.984 MiB |
| LIST_INT | DEVICE_BUFFER | 0 | 32 | 0.01 | 536870912 | 312x | 48.027 ms | 1.46% | 48.023 ms | 1.46% | 11179436062 | 40.115 MiB | 38.479 MiB |
| LIST_INT | DEVICE_BUFFER | 1000 | 32 | 0.01 | 536870912 | 318x | 47.078 ms | 1.35% | 47.074 ms | 1.35% | 11404802537 | 27.370 MiB | 25.735 MiB |
| LIST_STR | DEVICE_BUFFER | 0 | 1 | 0.01 | 536870912 | 444x | 33.648 ms | 1.07% | 33.644 ms | 1.07% | 15957184786 | 455.089 MiB | 452.602 MiB |
| LIST_STR | DEVICE_BUFFER | 1000 | 1 | 0.01 | 536870912 | 581x | 25.685 ms | 1.91% | 25.682 ms | 1.91% | 20904846493 | 35.655 MiB | 35.128 MiB |
| LIST_STR | DEVICE_BUFFER | 0 | 32 | 0.01 | 536870912 | 432x | 34.627 ms | 0.88% | 34.623 ms | 0.88% | 15506322061 | 455.089 MiB | 452.602 MiB |
| LIST_STR | DEVICE_BUFFER | 1000 | 32 | 0.01 | 536870912 | 634x | 23.513 ms | 1.63% | 23.509 ms | 1.63% | 22837049612 | 7.177 MiB | 6.650 MiB |

parquet_read_chunks

[0] NVIDIA RTX A5000

| T | io_type | cardinality | run_length | chunk_read_limit | data_size | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| INTEGRAL | DEVICE_BUFFER | 0 | 1 | 500000 | 536870912 | 5x | 211.268 ms | 0.26% | 211.262 ms | 0.26% | 2541260124 | 480.560 MiB | 477.518 MiB |
| INTEGRAL | DEVICE_BUFFER | 1000 | 1 | 500000 | 536870912 | 5x | 201.871 ms | 0.12% | 201.864 ms | 0.12% | 2659566528 | 158.641 MiB | 155.566 MiB |
| INTEGRAL | DEVICE_BUFFER | 0 | 32 | 500000 | 536870912 | 5x | 183.345 ms | 0.19% | 183.338 ms | 0.19% | 2928307030 | 29.249 MiB | 26.174 MiB |
| INTEGRAL | DEVICE_BUFFER | 1000 | 32 | 500000 | 536870912 | 5x | 180.026 ms | 0.19% | 180.020 ms | 0.19% | 2982287017 | 16.537 MiB | 13.462 MiB |
| FLOAT | DEVICE_BUFFER | 0 | 1 | 500000 | 536870912 | 8x | 67.228 ms | 0.20% | 67.223 ms | 0.20% | 7986408016 | 501.639 MiB | 499.885 MiB |
| FLOAT | DEVICE_BUFFER | 1000 | 1 | 500000 | 536870912 | 6x | 93.068 ms | 0.48% | 93.063 ms | 0.48% | 5768919551 | 109.405 MiB | 107.603 MiB |
| FLOAT | DEVICE_BUFFER | 0 | 32 | 500000 | 536870912 | 7x | 80.863 ms | 0.12% | 80.858 ms | 0.13% | 6639659984 | 24.858 MiB | 23.056 MiB |
| FLOAT | DEVICE_BUFFER | 1000 | 32 | 500000 | 536870912 | 7x | 79.875 ms | 0.24% | 79.870 ms | 0.24% | 6721797551 | 10.984 MiB | 9.182 MiB |
| BOOL8 | DEVICE_BUFFER | 0 | 1 | 500000 | 536870912 | 5x | 1.112 s | 0.10% | 1.112 s | 0.10% | 482682156 | 81.047 MiB | 71.654 MiB |
| BOOL8 | DEVICE_BUFFER | 1000 | 1 | 500000 | 536870912 | 5x | 1.114 s | 0.07% | 1.114 s | 0.07% | 481762714 | 80.521 MiB | 71.128 MiB |
| BOOL8 | DEVICE_BUFFER | 0 | 32 | 500000 | 536870912 | 5x | 1.098 s | 0.05% | 1.098 s | 0.05% | 488842234 | 30.953 MiB | 21.560 MiB |
| BOOL8 | DEVICE_BUFFER | 1000 | 32 | 500000 | 536870912 | 5x | 1.099 s | 0.04% | 1.099 s | 0.04% | 488591607 | 30.892 MiB | 21.499 MiB |
| STRING | DEVICE_BUFFER | 0 | 1 | 500000 | 536870912 | 10x | 54.543 ms | 0.21% | 54.538 ms | 0.21% | 9843898427 | 476.960 MiB | 476.417 MiB |
| STRING | DEVICE_BUFFER | 1000 | 1 | 500000 | 536870912 | 45x | 44.911 ms | 0.50% | 44.907 ms | 0.50% | 11955286923 | 34.517 MiB | 33.950 MiB |
| STRING | DEVICE_BUFFER | 0 | 32 | 500000 | 536870912 | 10x | 54.489 ms | 0.23% | 54.485 ms | 0.23% | 9853547814 | 476.960 MiB | 476.417 MiB |
| STRING | DEVICE_BUFFER | 1000 | 32 | 500000 | 536870912 | 11x | 45.870 ms | 0.34% | 45.866 ms | 0.34% | 11705261968 | 4.260 MiB | 3.692 MiB |
| LIST_INT | DEVICE_BUFFER | 0 | 1 | 500000 | 536870912 | 5x | 609.720 ms | 0.07% | 609.707 ms | 0.07% | 880538710 | 464.953 MiB | 462.473 MiB |
| LIST_INT | DEVICE_BUFFER | 1000 | 1 | 500000 | 536870912 | 28x | 536.036 ms | 1.05% | 536.024 ms | 1.05% | 1001579818 | 158.619 MiB | 156.984 MiB |
| LIST_INT | DEVICE_BUFFER | 0 | 32 | 500000 | 536870912 | 31x | 484.475 ms | 1.56% | 484.463 ms | 1.56% | 1108177452 | 40.115 MiB | 38.479 MiB |
| LIST_INT | DEVICE_BUFFER | 1000 | 32 | 500000 | 536870912 | 32x | 479.636 ms | 1.38% | 479.624 ms | 1.38% | 1119357441 | 27.369 MiB | 25.734 MiB |
| LIST_STR | DEVICE_BUFFER | 0 | 1 | 500000 | 536870912 | 5x | 251.797 ms | 0.07% | 251.792 ms | 0.07% | 2132203437 | 455.089 MiB | 452.602 MiB |
| LIST_STR | DEVICE_BUFFER | 1000 | 1 | 500000 | 536870912 | 83x | 181.387 ms | 1.85% | 181.383 ms | 1.85% | 2959880715 | 35.655 MiB | 35.128 MiB |
| LIST_STR | DEVICE_BUFFER | 0 | 32 | 500000 | 536870912 | 5x | 251.669 ms | 0.06% | 251.664 ms | 0.06% | 2133286744 | 455.089 MiB | 452.602 MiB |
| LIST_STR | DEVICE_BUFFER | 1000 | 32 | 500000 | 536870912 | 72x | 210.132 ms | 2.22% | 210.127 ms | 2.22% | 2554982962 | 7.177 MiB | 6.650 MiB |

Contributor Author

pmattione-nvidia commented Jan 22, 2026

AFTER Benchmarks:

parquet_read_decode

[0] NVIDIA RTX A5000

| data_type | io_type | cardinality | run_length | null_probability | data_size | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| INTEGRAL | DEVICE_BUFFER | 0 | 1 | 0.01 | 536870912 | 43x | 11.893 ms | 0.22% | 11.889 ms | 0.22% | 45157090704 | 480.065 MiB | 477.518 MiB |
| INTEGRAL | DEVICE_BUFFER | 1000 | 1 | 0.01 | 536870912 | 41x | 12.318 ms | 0.26% | 12.314 ms | 0.26% | 43599208970 | 158.141 MiB | 155.566 MiB |
| INTEGRAL | DEVICE_BUFFER | 0 | 32 | 0.01 | 536870912 | 50x | 10.116 ms | 0.11% | 10.112 ms | 0.11% | 53092842315 | 28.749 MiB | 26.174 MiB |
| INTEGRAL | DEVICE_BUFFER | 1000 | 32 | 0.01 | 536870912 | 51x | 9.918 ms | 0.11% | 9.914 ms | 0.11% | 54152539520 | 16.037 MiB | 13.463 MiB |
| FLOAT | DEVICE_BUFFER | 0 | 1 | 0.01 | 536870912 | 78x | 6.456 ms | 0.20% | 6.452 ms | 0.20% | 83209670930 | 501.348 MiB | 499.885 MiB |
| FLOAT | DEVICE_BUFFER | 1000 | 1 | 0.01 | 536870912 | 63x | 8.000 ms | 0.30% | 7.996 ms | 0.30% | 67143053727 | 109.106 MiB | 107.603 MiB |
| FLOAT | DEVICE_BUFFER | 0 | 32 | 0.01 | 536870912 | 82x | 6.141 ms | 0.17% | 6.137 ms | 0.17% | 87482757272 | 24.559 MiB | 23.056 MiB |
| FLOAT | DEVICE_BUFFER | 1000 | 32 | 0.01 | 536870912 | 78x | 6.439 ms | 0.16% | 6.435 ms | 0.16% | 83432937620 | 10.685 MiB | 9.182 MiB |
| BOOL8 | DEVICE_BUFFER | 0 | 1 | 0.01 | 536870912 | 23x | 22.288 ms | 0.14% | 22.283 ms | 0.14% | 24093010706 | 79.550 MiB | 71.654 MiB |
| BOOL8 | DEVICE_BUFFER | 1000 | 1 | 0.01 | 536870912 | 23x | 22.195 ms | 0.15% | 22.190 ms | 0.15% | 24193991316 | 79.024 MiB | 71.128 MiB |
| BOOL8 | DEVICE_BUFFER | 0 | 32 | 0.01 | 536870912 | 25x | 20.048 ms | 0.28% | 20.044 ms | 0.28% | 26785130812 | 29.456 MiB | 21.560 MiB |
| BOOL8 | DEVICE_BUFFER | 1000 | 32 | 0.01 | 536870912 | 25x | 20.013 ms | 0.16% | 20.009 ms | 0.16% | 26831640124 | 29.395 MiB | 21.499 MiB |
| STRING | DEVICE_BUFFER | 0 | 1 | 0.01 | 536870912 | 636x | 23.527 ms | 0.83% | 23.523 ms | 0.83% | 22823582834 | 476.884 MiB | 476.417 MiB |
| STRING | DEVICE_BUFFER | 1000 | 1 | 0.01 | 536870912 | 67x | 7.500 ms | 0.37% | 7.496 ms | 0.37% | 71619430280 | 34.437 MiB | 33.949 MiB |
| STRING | DEVICE_BUFFER | 0 | 32 | 0.01 | 536870912 | 623x | 24.017 ms | 1.80% | 24.013 ms | 1.80% | 22357918475 | 476.884 MiB | 476.417 MiB |
| STRING | DEVICE_BUFFER | 1000 | 32 | 0.01 | 536870912 | 154x | 7.469 ms | 0.50% | 7.465 ms | 0.50% | 71916146297 | 4.180 MiB | 3.692 MiB |
| LIST_INT | DEVICE_BUFFER | 0 | 1 | 0.01 | 536870912 | 611x | 24.435 ms | 0.64% | 24.431 ms | 0.64% | 21975343821 | 464.953 MiB | 462.473 MiB |
| LIST_INT | DEVICE_BUFFER | 1000 | 1 | 0.01 | 536870912 | 567x | 26.350 ms | 0.92% | 26.346 ms | 0.92% | 20377525620 | 158.619 MiB | 156.984 MiB |
| LIST_INT | DEVICE_BUFFER | 0 | 32 | 0.01 | 536870912 | 528x | 23.556 ms | 0.74% | 23.552 ms | 0.74% | 22795468478 | 40.115 MiB | 38.479 MiB |
| LIST_INT | DEVICE_BUFFER | 1000 | 32 | 0.01 | 536870912 | 654x | 22.825 ms | 0.81% | 22.821 ms | 0.81% | 23525091962 | 27.369 MiB | 25.734 MiB |
| LIST_STR | DEVICE_BUFFER | 0 | 1 | 0.01 | 536870912 | 557x | 26.820 ms | 0.88% | 26.816 ms | 0.88% | 20020683623 | 455.089 MiB | 452.602 MiB |
| LIST_STR | DEVICE_BUFFER | 1000 | 1 | 0.01 | 536870912 | 901x | 16.523 ms | 0.57% | 16.520 ms | 0.57% | 32498814970 | 35.655 MiB | 35.128 MiB |
| LIST_STR | DEVICE_BUFFER | 0 | 32 | 0.01 | 536870912 | 550x | 27.186 ms | 0.73% | 27.182 ms | 0.73% | 19750653072 | 455.089 MiB | 452.602 MiB |
| LIST_STR | DEVICE_BUFFER | 1000 | 32 | 0.01 | 536870912 | 47x | 16.210 ms | 0.50% | 16.206 ms | 0.50% | 33127881571 | 7.177 MiB | 6.650 MiB |

parquet_read_chunks

[0] NVIDIA RTX A5000

| T | io_type | cardinality | run_length | chunk_read_limit | data_size | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| INTEGRAL | DEVICE_BUFFER | 0 | 1 | 500000 | 536870912 | 5x | 150.765 ms | 0.24% | 150.759 ms | 0.24% | 3561125055 | 480.560 MiB | 477.518 MiB |
| INTEGRAL | DEVICE_BUFFER | 1000 | 1 | 500000 | 536870912 | 5x | 139.731 ms | 0.09% | 139.724 ms | 0.09% | 3842356717 | 158.641 MiB | 155.566 MiB |
| INTEGRAL | DEVICE_BUFFER | 0 | 32 | 500000 | 536870912 | 5x | 126.437 ms | 0.12% | 126.431 ms | 0.12% | 4246341656 | 29.249 MiB | 26.174 MiB |
| INTEGRAL | DEVICE_BUFFER | 1000 | 32 | 500000 | 536870912 | 5x | 123.280 ms | 0.19% | 123.275 ms | 0.19% | 4355083233 | 16.538 MiB | 13.463 MiB |
| FLOAT | DEVICE_BUFFER | 0 | 1 | 500000 | 536870912 | 11x | 48.432 ms | 0.10% | 48.427 ms | 0.10% | 11086147961 | 501.639 MiB | 499.885 MiB |
| FLOAT | DEVICE_BUFFER | 1000 | 1 | 500000 | 536870912 | 8x | 67.549 ms | 0.19% | 67.544 ms | 0.19% | 7948413478 | 109.405 MiB | 107.603 MiB |
| FLOAT | DEVICE_BUFFER | 0 | 32 | 500000 | 536870912 | 9x | 58.779 ms | 0.12% | 58.775 ms | 0.13% | 9134407498 | 24.858 MiB | 23.056 MiB |
| FLOAT | DEVICE_BUFFER | 1000 | 32 | 500000 | 536870912 | 9x | 57.956 ms | 0.18% | 57.951 ms | 0.18% | 9264268153 | 10.984 MiB | 9.182 MiB |
| BOOL8 | DEVICE_BUFFER | 0 | 1 | 500000 | 536870912 | 5x | 692.868 ms | 0.04% | 692.852 ms | 0.04% | 774870464 | 81.047 MiB | 71.654 MiB |
| BOOL8 | DEVICE_BUFFER | 1000 | 1 | 500000 | 536870912 | 5x | 692.756 ms | 0.07% | 692.741 ms | 0.07% | 774995564 | 80.521 MiB | 71.128 MiB |
| BOOL8 | DEVICE_BUFFER | 0 | 32 | 500000 | 536870912 | 5x | 692.155 ms | 0.03% | 692.139 ms | 0.03% | 775669291 | 30.953 MiB | 21.560 MiB |
| BOOL8 | DEVICE_BUFFER | 1000 | 32 | 500000 | 536870912 | 5x | 692.004 ms | 0.04% | 691.988 ms | 0.04% | 775838607 | 30.892 MiB | 21.499 MiB |
| STRING | DEVICE_BUFFER | 0 | 1 | 500000 | 536870912 | 11x | 49.093 ms | 0.30% | 49.088 ms | 0.30% | 10936811575 | 476.960 MiB | 476.417 MiB |
| STRING | DEVICE_BUFFER | 1000 | 1 | 500000 | 536870912 | 39x | 38.463 ms | 0.50% | 38.459 ms | 0.50% | 13959562975 | 34.517 MiB | 33.949 MiB |
| STRING | DEVICE_BUFFER | 0 | 32 | 500000 | 536870912 | 11x | 49.166 ms | 0.23% | 49.162 ms | 0.23% | 10920515660 | 476.960 MiB | 476.417 MiB |
| STRING | DEVICE_BUFFER | 1000 | 32 | 500000 | 536870912 | 13x | 41.128 ms | 0.47% | 41.123 ms | 0.47% | 13055123245 | 4.259 MiB | 3.692 MiB |
| LIST_INT | DEVICE_BUFFER | 0 | 1 | 500000 | 536870912 | 5x | 199.406 ms | 0.29% | 199.399 ms | 0.29% | 2692447267 | 464.953 MiB | 462.473 MiB |
| LIST_INT | DEVICE_BUFFER | 1000 | 1 | 500000 | 536870912 | 5x | 190.193 ms | 0.26% | 190.186 ms | 0.26% | 2822873273 | 158.619 MiB | 156.984 MiB |
| LIST_INT | DEVICE_BUFFER | 0 | 32 | 500000 | 536870912 | 5x | 201.782 ms | 0.29% | 201.775 ms | 0.29% | 2660742419 | 40.115 MiB | 38.479 MiB |
| LIST_INT | DEVICE_BUFFER | 1000 | 32 | 500000 | 536870912 | 5x | 198.272 ms | 0.17% | 198.265 ms | 0.17% | 2707845749 | 27.369 MiB | 25.734 MiB |
| LIST_STR | DEVICE_BUFFER | 0 | 1 | 500000 | 536870912 | 5x | 134.152 ms | 0.15% | 134.146 ms | 0.15% | 4002131818 | 455.089 MiB | 452.602 MiB |
| LIST_STR | DEVICE_BUFFER | 1000 | 1 | 500000 | 536870912 | 12x | 74.376 ms | 0.48% | 74.371 ms | 0.48% | 7218815087 | 35.655 MiB | 35.128 MiB |
| LIST_STR | DEVICE_BUFFER | 0 | 32 | 500000 | 536870912 | 5x | 134.265 ms | 0.22% | 134.259 ms | 0.22% | 3998766024 | 455.089 MiB | 452.602 MiB |
| LIST_STR | DEVICE_BUFFER | 1000 | 32 | 500000 | 536870912 | 129x | 116.319 ms | 1.15% | 116.313 ms | 1.15% | 4615729767 | 7.177 MiB | 6.650 MiB |

return max_depth_valid_count;
}

// is the page marked nullable or not
Contributor Author

Moved to a common header

(s->input_row_count <= last_row)) {
int next_valid_count;
block.sync();
processed_count += min(rolling_buf_size, s->page.num_input_values - processed_count);
Contributor Author

Same for all cases

}
block.sync();

if (!t) {
Contributor Author

This is still present; it is just lower in the diff.

// the core loop. decode batches of level stream data using rle_stream objects
// and pass the results to update_page_sizes
int processed = 0;
while (processed < s->page.num_input_values) {
Contributor Author

@pmattione-nvidia pmattione-nvidia Jan 22, 2026

We no longer need to loop over rep/def buffers; update_page_sizes() can be called in one shot.
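To make the difference in loop shape concrete, here is a small host-side sketch, not the actual kernel code. `count_row_starts` is a hypothetical stand-in for update_page_sizes() that only counts rows, using the Parquet rule that a repetition level of 0 starts a new row. `rows_batched` mimics consuming levels through a small rolling buffer, while `rows_one_shot` reads the fully decoded levels in a single call.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for update_page_sizes(): counts row starts in levels [begin, end).
inline std::size_t count_row_starts(std::vector<std::uint8_t> const& rep,
                                    std::size_t begin,
                                    std::size_t end)
{
  assert(begin <= end && end <= rep.size());
  auto const first = rep.begin() + static_cast<std::ptrdiff_t>(begin);
  auto const last  = rep.begin() + static_cast<std::ptrdiff_t>(end);
  return static_cast<std::size_t>(std::count(first, last, std::uint8_t{0}));
}

// Old shape: consume the level stream in fixed-size batches, as when levels
// were decoded into a small rolling buffer inside the kernel.
inline std::size_t rows_batched(std::vector<std::uint8_t> const& rep, std::size_t batch_size)
{
  assert(batch_size > 0);
  std::size_t rows      = 0;
  std::size_t processed = 0;
  while (processed < rep.size()) {
    std::size_t const n = std::min(batch_size, rep.size() - processed);
    rows += count_row_starts(rep, processed, processed + n);
    processed += n;
  }
  return rows;
}

// New shape: the levels are already fully decoded, so one call covers the whole page.
inline std::size_t rows_one_shot(std::vector<std::uint8_t> const& rep)
{
  return count_row_starts(rep, 0, rep.size());
}
```

Both functions return the same count for any batch size; the one-shot form simply has no loop state to carry between batches.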

return total_len;
}

/**
Contributor Author

@pmattione-nvidia pmattione-nvidia Jan 22, 2026

This old code path is finally no longer needed; it has been superseded by rle_stream.

__syncthreads();

// do something with the level data
while (start_val < processed) {
Contributor Author

@pmattione-nvidia pmattione-nvidia Jan 22, 2026

Most of the changes in here come from removing this inner loop (we no longer need to buffer the level decode) and, of course, from removing the decode itself. I highly recommend hiding whitespace diffs.

// Fixed length byte array: Offsets are fixed, no need to allocate offset buffer
if (chunk.physical_type == Type::FIXED_LEN_BYTE_ARRAY) { return 0; }

// Estimate number of offsets based on page.num_input_values
Contributor Author

Optimized how the number of string offsets is determined by combining that logic with the new level decode logic.

Member

@mhaseeb123 mhaseeb123 left a comment

Quick glance, will do a detailed review a bit later. Couple of questions and minor comments.

Comment on lines +931 to +933
rmm::exec_policy_nosync(stream),
iter,
iter + pages.size(),
Member

We can just iterate over pages here, since the functor only uses the page_idx to access pages[page_idx] anyway. In that case, we can also remove the pages struct member.

Contributor Author

No, we want to utilize the GPU parallelism. If we loop over pages here then we get no parallelism.

s, pp, chunks, min_row, num_rows, all_types_filter{}, page_processing_stage::PREPROCESS)) {
return;
}

Member

We might need code to skip pages based on subpass_page_mask here and other kernels?

Contributor Author

No code is needed here. We won't use the def/rep levels for these pages at all so there's nothing to set.

Contributor Author

I have, however, updated allocate_level_decode_space() to skip allocating rep/def level memory for those pages, since we don't need it.
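The real allocate_level_decode_space() is not shown in this conversation, so the following is only a guess at the general shape of "skip allocation for masked-out pages". It assumes one shared level buffer with per-page offsets computed by a running sum. `page_desc`, `level_buffer_offsets`, and the `page_mask` vector are all hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified page description; not the real PageInfo.
struct page_desc {
  std::size_t num_input_values;
};

// Returns per-page start offsets (in level entries) into one shared level buffer.
// The final element is the total number of entries to allocate.
// Pages whose mask entry is false get zero space.
inline std::vector<std::size_t> level_buffer_offsets(std::vector<page_desc> const& pages,
                                                     std::vector<bool> const& page_mask)
{
  assert(pages.size() == page_mask.size());
  std::vector<std::size_t> offsets(pages.size() + 1, 0);
  for (std::size_t i = 0; i < pages.size(); ++i) {
    std::size_t const needed = page_mask[i] ? pages[i].num_input_values : 0;
    offsets[i + 1]           = offsets[i] + needed;
  }
  return offsets;
}
```

With pages of 100, 50, and 25 input values and the middle page masked out, the offsets come out as 0, 100, 100, with a total of 125 entries rather than 175.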

struct compute_page_string_offset_size {
device_span<PageInfo const> pages;
device_span<ColumnChunkDesc const> chunks;
size_t skip_rows;
Member

We can also update this functor's () op to directly iterate over pages instead of page_idx.

Contributor Author

No, this is executed by thrust::transform. If we loop over pages within the operator then we get no parallelism.
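For readers unfamiliar with the pattern under discussion, below is a host-only sketch of a functor that holds a view of all pages and is invoked once per page index. `std::transform` is used purely as a sequential stand-in for `thrust::transform`; with Thrust and a device execution policy, each invocation of the functor can run on its own GPU thread. `page_desc`, `page_value_count`, and `per_page_counts` are hypothetical names, not the code from this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical, simplified page description; not the real PageInfo.
struct page_desc {
  std::size_t num_input_values;
};

// Functor that keeps a view of all pages and is called with a single page index.
// It contains no loop: the transform algorithm supplies one index per call.
struct page_value_count {
  page_desc const* pages;
  std::size_t num_pages;

  std::size_t operator()(std::size_t page_idx) const
  {
    assert(page_idx < num_pages);
    return pages[page_idx].num_input_values;
  }
};

// Applies the functor to every page index. std::transform stands in for
// thrust::transform over a counting iterator.
inline std::vector<std::size_t> per_page_counts(std::vector<page_desc> const& pages)
{
  std::vector<std::size_t> idx(pages.size());
  std::iota(idx.begin(), idx.end(), std::size_t{0});
  std::vector<std::size_t> out(pages.size());
  std::transform(
    idx.begin(), idx.end(), out.begin(), page_value_count{pages.data(), pages.size()});
  return out;
}
```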
