Releases: rapidsai/cudf
v26.02.01
What's Changed
🐛 Bug Fixes
- Backport #21301: Only serialize column slice by @pentschev in #21328
Full Changelog: v26.02.00...v26.02.01
v26.02.00
What's Changed
🚨 Breaking Changes
- Avoid counting nulls and creating null mask in groupby aggregation MERGE_M2 by @ttnghia in #20716
- Remove cudf::get_current_device_resource by @bdice in #20688
- Avoid creating null mask in groupby aggregation M2 by @ttnghia in #20726
- Remove deprecated left semi- and anti- join APIs by @shrshi in #20668
- Inline and simplify some column methods by @vyasr in #20819
- Enable copy-on-write in cudf.pandas by @vyasr in #20401
- [FEA] Improve Null-Aware Operator Support in AST-Codegen by @lamarrr in #20206
- Remove legacy hash-combine logic and unify hashing with row hasher by @PointKernel in #20796
- Remove deprecated .from_pandas constructors by @mroeschke in #20925
- Remove deprecated Series.data by @mroeschke in #20914
- Remove all base attributes from ColumnBase by @vyasr in #20961
- Fix handling of unquoted strings in the CSV reader by @vuule in #20996
🐛 Bug Fixes
- Avoid duplicate streaming nodes for the rapidsmpf runtime by @rjzamora in #20586
- Handle scalar arguments in ternary expression by @Matt711 in #20600
- fix(noarch): use noarch build script in noarch build by @gforsyth in #20654
- fix(conda): matrix out noarch builds by cuda-major version by @gforsyth in #20678
- Include RMM in type checking environment and update type annotations for optional stream by @TomAugspurger in #20636
- Add no-op path for ArrowExtensionArray.astype by @Matt711 in #20580
- Skip pytorch integration tests if CUDA is not available by @Matt711 in #20729
- Always delay CUDA Array Interface pointer access by @vyasr in #20719
- Fix various copy-on-write bugs by @vyasr in #20744
- Fix leaks in cuDF java tests by @abellina in #20767
- Fix plc.Scalar.from_py(datetime.datetime) incorrectly localizing naive datetimes by @mroeschke in #20769
- Don't remove double casts in cudf_polars by @mroeschke in #20773
- Fixes struct column handling in sort-merge joins by @shrshi in #20664
- Fix for synccheck compute-sanitizer errors across Parquet gtest by @mhaseeb123 in #20775
- Pin numpy<2.4.0a0 in mypy pre-commit environment by @TomAugspurger in #20781
- Raise when trying to run queries on different devices in same process by @wence- in #20617
- Ensure min_periods=0 is passed through rolling aggregations by @Matt711 in #20653
- Fix racecheck errors in the ORC reader by @vuule in #20792
- Fix the crash of multi-threaded parquet reader benchmark by @kingcrimsontianyu in #20783
- Fix racecheck reported by DATA_CHUNK_SOURCE_TEST in inflate_kernel by @davidwendt in #20804
- Fix racecheck in the gpu_debrotli_kernel by @davidwendt in #20806
- Ensure literal groupby aggregations are broadcasted to key length in cudf_polars by @mroeschke in #20776
- Pin aiobotocore<3 to fix CI failures by @TomAugspurger in #20844
- Fix racecheck in parquet decode_page_data_generic kernel by @davidwendt in #20850
- Avoid generating empty TableChunks in streaming scan nodes by @rjzamora in #20815
- Fix dask imports in CudfFusedParquetIOHost by @rjzamora in #20845
- Fix UB due to OOM Exception in ParquetReaderTest.ManyLargeLists by @lamarrr in #20841
- Fix racecheck/synccheck in JSON parse_fn_string_parallel kernel by @davidwendt in #20856
- Fix racecheck in ORC decode_column_data_kernel by @davidwendt in #20853
- Disable flatbuffers tests in CMake configuration by @bdice in #20848
- Upper bound on aiosqlite in polars-upstream job by @TomAugspurger in #20866
- Fix boolean casting consistency with Pandas (#20746) by @aryansri05 in #20747
- Add retries to requests made to PyPI's JSON API by @TomAugspurger in #20865
- Fix size_type overflow in multiple APIs by @vuule in #20857
- Fix racecheck in parquet compute_string_page_bounds_kernel by @davidwendt in #20868
- Fix dictionary::encode to honor indices-type parameter by @davidwendt in #20842
- Add missing headers to row_ir.hpp, row_ir.cpp by @bdice in #20834
- Fix parquet_options in pdsh benchmark by @TomAugspurger in #20893
- Add stream synchronize to tdigest generate_group_cluster_info by @davidwendt in #20846
- Only install RMM in mypy env on linux by @TomAugspurger in #20878
- Make nvcomp export unconditional by @vyasr in #20828
- Ensure we have nvjitlink from the CUDA version used at build time or newer and upgrade numba-cuda lower bound by @bdice in #20873
- Fix size_type overflow in the ORC writer by @vuule in #20889
- Constrain pyparsing version by @vyasr in #20935
- Revert #20902 by @vyasr in #20955
- Add force-blocking-launches to run_compute_sanitizer_test script by @davidwendt in #20962
- Fix racecheck error in parquet delta_byte_array_decoder::string_scan by @davidwendt in #20967
- Fix racechecks reported in parquet gpuEncodePages kernel by @davidwendt in #20975
- Don't encode s3 paths for kvikio_remote_io in read_json by @mroeschke in #20976
- Allow sort merge join to go above int32 output row limits by @revans2 in #20960
- Correct stream ordered deallocation in Join by @TomAugspurger in #20981
- Reintroduce Buffer.nbytes property by @pentschev in #21027
- Fix SHA hash OOB on strings that are exact multiples of message chunk size by @rishic3 in #21004
- Temporarily disable IWYU for nightly tests by @davidwendt in #21045
- Fix cudf-polars multi-partition distributed sort by @TomAugspurger in #21047
- Backport #21051 by @wence- in #21086
- Pin pandas for pylibcudf testing by @galipremsagar in #21124
- Hide pinned pool instantiation to avoid symbol conflicts with nvcomp by @vyasr in #21161
- Specialize field type checking for bool in Parquet thrift list decoder by @mhaseeb123 in #21144
- Fix reading of CSV files with double quotes in unquoted strings by @vuule in #21151
- Revert the multithreaded optimization in the CSV reader by @vuule in #21198
- Pin sqlglot in third-party integration tests by @Matt711 in #21271
- Exclude sqlglot version 28.7 from CI by @Matt711 in #21293
📖 Documentation
- Add note to developer guide about null values being undefined by @bdice in #20645
- [DOC] Add cudf-polars to the example build command by @Matt711 in #20763
- Clarify internal API header placement guidelines for details headers by @PointKernel in #20985
- Clarify deprecation message for cudf::round by @nirandaperera in #20809
- Require nvcc 12.9 in contributing guide by @bdice in #21186
🚀 New Features
- Expose cudf::compute_column_jit to python by @Matt711 in #20697
- Add configuration option for max-io-threads by @quasiben in #20606
- Return stats from lower_ir_graph by @rjzamora in #20528
- Promote join_kind from detail namespace to public by @PointKernel in #20703
- Make DataFrameScan and DataFrameSourceInfo pickle-able by @rjzamora in #20732
- Add compute-sanitizer dispatch action by @bdice in #20542
- Add RapidsMPF Al...
v25.12.00
What's Changed
🚨 Breaking Changes
- Rewrite JNI functions to use JNI_TRY/JNI_CATCH by @ttnghia in #19053
- Remove compatibility with nvCOMP versions before 5.0 by @vuule in #20140
- Remove DataFrame.apply_chunks, Groupby.apply_grouped by @mroeschke in #20194
- Change .str.starts/endswith with tuple argument to match any pattern instead of pairwise matching by @mroeschke in #20249
- [cudf-polars] CUDA stream by @madsbk in #20154
- Chunked read parquet, prepend index column, and apply deletion vector by @mhaseeb123 in #20201
- Zero-copy hostdevice_vector on integrated systems by @vuule in #20225
- Use int64_t for the num_rows slot in parquet_reader_options by @wence- in #20256
- Require CUDA 12.2+ by @jakirkham in #20416
- Remove compatibility for CCCL < 3.1 by @bdice in #20468
- Remove deprecated types and APIs by @vuule in #20422
- Support signed integers and decimals in SUM_WITH_OVERFLOW groupby by @PointKernel in #19598
- Change groupby-scan COUNT to 1-based results by @davidwendt in #20168
- Change strings::like() pattern parameter from string_scalar to string_view by @davidwendt in #20428
- No-op performance tracking wrappers by @galipremsagar in #20595
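
The .str.starts/endswith entry above distinguishes two ways of applying a tuple of patterns. As a rough illustration of the difference, here is a minimal sketch in plain Python (not cuDF); the sample values and patterns are invented for the example, and the built-in str.startswith is used only as an analogy for the "match any pattern" behavior.

```python
# Illustration only: plain Python strings, not the cuDF API.
# The values and patterns below are made up for this example.
values = ["apple", "banana", "cherry"]
patterns = ("ba", "ch", "ap")

# "Match any pattern": each value is tested against every pattern in the tuple.
match_any = [v.startswith(patterns) for v in values]

# "Pairwise": value i is tested only against pattern i.
pairwise = [v.startswith(p) for v, p in zip(values, patterns)]

print(match_any)  # [True, True, True]
print(pairwise)   # [False, False, False]
```

Code that relied on the pairwise interpretation would see different results under the any-pattern interpretation, which is presumably why the change is listed as breaking.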
🐛 Bug Fixes
- Copy attrs at correct place in DataFrame constructor by @galipremsagar in #20074
- Handle missing nightly runs in pandas tests job by @galipremsagar in #20081
- Fix numpy ufunc for DataFrame by @galipremsagar in #20070
- Unproxy few unnecessary testing utilities in pandas by @galipremsagar in #20088
- Fix libcudf groupby benchmarks to not include internal cache by @davidwendt in #20038
- Fix cudf.date_range with non-iso start and end date strings by @mroeschke in #20116
- Fix create_distinct_rows_column to create non-nullable columns by @davidwendt in #20082
- Fix arrow timestamp frequency cases in cudf.pandas by @galipremsagar in #20128
- Cast inputs to true division from decimal to float by @Matt711 in #20077
- Handle NVMLError_NotSupported in cudf-polars by @TomAugspurger in #20179
- Fix RMM JNI pinned_fallback_host_memory_resource for CCCL 3.1.0 by @bdice in #20160
- Require passing memory resources to from_libcudf methods by @vyasr in #20171
- Enable hash-groupby for decimal32/64 type and MEAN aggregation by @davidwendt in #20040
- Align decimal dtypes in predicate before conditional join by @Matt711 in #20060
- Change stream_checking_resource_adaptor::do_deallocate to noexcept by @vyasr in #20218
- Deallocation should be noexcept by @bdice in #20219
- Fix a race condition in the decode of delta encoded Parquet columns by @vuule in #20216
- Fix the host-device tdigest offsets by using cuda::std::span by @PointKernel in #20220
- Add stream and mr arguments to Column.from_arrow type stub by @TomAugspurger in #20244
- Pin deltalake in cudf-polars-polars-tests CI job by @TomAugspurger in #20255
- Pin ibis-framework<11.0.0 by @Matt711 in #20267
- Add private attributes for cudf.pandas proxy objects by @galipremsagar in #20276
- Add Proxy for SparseAccessor by @galipremsagar in #20278
- We need this to pacify mypy by @wence- in #20285
- Purge non-empty nulls for the generated lists columns in data generation utility by @ttnghia in #20283
- Fix missing table compatibility check in two_table_comparator constructor by @PointKernel in #20305
- Fix the check for equal num_cols across empty parquet sources by @mhaseeb123 in #20320
- Add nans_to_nulls to Frame by @galipremsagar in #20314
- Add support for list type in get by @galipremsagar in #20332
- Fix decimal dtype serialization in cudf-polars by @Matt711 in #20300
- Make the GroupedRollingWindow expression node reconstructable in cudf-polars by @Matt711 in #20288
- Ensure pylibcudf.Scalar.from_py uses CUDA streams by @TomAugspurger in #20340
- Skip failing cudf-polars test due to hash groupby bug by @Matt711 in #20356
- Support order by keys for order-sensitive scalar aggregations in grouped windows by @Matt711 in #20350
- Honor user-passed stream in slice_strings for scalar inputs by @mroeschke in #20349
- Thread missing streams in column/table view creation to char size calculation by @vyasr in #20351
- Fix missed-sync for mapping_indices_kernel in hash-based groupby aggregation by @ttnghia in #20370
- Fix a few SPDX-related issues by @KyleFromNVIDIA in #20364
- Fix a dtype bug in column constructor by @galipremsagar in #20384
- Refactor as_column dtype parameter calls by @galipremsagar in #20379
- Add CUDA stream to cudf_polars.Column.deserialize by @TomAugspurger in #20396
- Add missing CUDA stream to cudf-polars left-semi join by @TomAugspurger in #20398
- Fix various string APIs to work with extension types by @galipremsagar in #20368
- Add parameter validation for merge and MultiIndex.from_frame by @galipremsagar in #20382
- Fix nvtext::normalize_characters special token case by @davidwendt in #20242
- Fix pinned memory resource shared_pointer lifetime in tests. by @bdice in #20407
- Support new nvcompStatus_t enum value by @vuule in #20376
- Don't skip blank CSV lines rows after the header in cudf-polars scan_csv by @mroeschke in #20341
- Fix OOB accesses in JSON_CornerCase_Empty test and get_row_array_parent_col_id function by @bdice in #20421
- Change calls to cudaMemcpyToSymbol to cudaMemcpyToSymbolAsync by @davidwendt in #20374
- Do not accelerate pandas._config.config by @Matt711 in #20413
- Return timedelta instead of datetime type with std with datetime type with missing values by @mroeschke in #20439
- Disallow non-bool skipna arguments to reduction methods by @mroeschke in #20436
- Fix parquet scans for duckDB PDS-DS by @Matt711 in #20388
- Support __array_function__ on the proxy array type by @Matt711 in #20419
- Make memory_usage and __sizeof__ proxy attributes and always skip all memory usage tests by @Matt711 in #20425
- Add input validation for from_records by @galipremsagar in #20412
- Use computed reduction result type for empty sum and product aggregations by @mroeschke in #20438
- Correct level arg validation for Index.isin, unique by @mroeschke in #20449
- Add private _grouper attribute to DataFrameGroupBy proxy type by @Matt711 in #20448
- Raise ValueError when indexing with zero step slice by @mroeschke in #20453
- Raise IndexError for float-like indexers in RangeIndex/MultiIndex.getitem by @mroeschke in #20454
- Disallow slice(bool, ...) in DataFrame.loc with MultiIndex by @mroeschke in #20457
- Fix core dump in MemoryCleaner by @res-life in #19872
- Disallow multiple ellipse values in loc/iloc indexing by @mroeschke in #20456
- Fix scan operations for string columns by @galipremsagar in #20460
- Fix UTF8 data generator in libcudf benchmarks utility by @davidwendt in #20465
- Handle dealloc in stream-ordered cudf-polars ops by @TomAugspurger in #20467
- Raise on unsupported unsta...
v25.10.00
🚨 Breaking Changes
- Remove UCX-Py (#19979) @pentschev
- Revert "Migrate mixed join to use multiset #19660" (#19933) @PointKernel
- Fill missing values in Series/Index.values for numeric types with np.nan by default (#19923) @mroeschke
- Remove deprecated DataFrame.apply_rows, deprecate DataFrame.apply_chunks and Groupby.apply_grouped (#19896) @mroeschke
- Move prefetching out of experimental and simplify the API (#19875) @vyasr
- Add join *_match_context APIs to hash join (#19835) @PointKernel
- Vendor libnvcomp in libcudf (#19743) @bdice
- Migrate mixed join to use multiset (#19660) @PointKernel
- Separate row mask and page mask computation and usage (#19537) @mhaseeb123
- [FEA] Implement null-aware transforms and filters (#19502) @lamarrr
- Support output-type for MEDIAN/QUANTILE aggregation in cudf::reduce (#19267) @davidwendt
🐛 Bug Fixes
- Fix edge cases in statistics collection (#20094) @rjzamora
- Fix multi-partition Filter bug (#20075) @rjzamora
- Fix reindex to fill only the reindexed values with fill_value (#20063) @galipremsagar
- Fix arrow arrays + numpy ufunc interaction (#20047) @galipremsagar
- Fix race conditions in ORC reader decimal decoding (#20044) @vuule
- Keep mr alive along with arrow tables and columns (#20028) @vyasr
- Fix value_counts missing nan bug (#20026) @galipremsagar
- Compatibility for rapidsmpf's unspill_partitions (#20020) @TomAugspurger
- Fix type metadata preservation in shift (#20017) @galipremsagar
- Fix incorrect type propagation in dataframe assignment (#20010) @galipremsagar
- Fix OOB memory read in decode_page_data_generic kernel (#19995) @davidwendt
- Fix data_type creation in ast::operation::instantiate (#19994) @davidwendt
- Skip Narwhals pandas get_dtype_backend[pyarrow] tests after ArrowDtype proxy changes (#19992) @Matt711
- Make cudf.pandas callables usable with inspect.getfullargspec (#19988) @mroeschke
- Align decimal dtypes to schema after parquet IO scan (#19974) @Matt711
- Avoid undefined numpy protocols on cudf.pandas proxy objects (#19968) @mroeschke
- Skip failing polars iceberg test (#19955) @Matt711
- Revert "Migrate mixed join to use multiset #19660" (#19933) @PointKernel
- Define FrozenList proxy independently in cudf.pandas (#19931) @mroeschke
- Ignore scalars when broadcasting for horizontal string concatenation in cudf-polars (#19893) @Matt711
- Fix is_valid_rolling_aggregation for STD aggregation (#19888) @davidwendt
- Fix a decompression parameter in the chunked ORC reader (#19882) @vuule
- Skip flaky stats tests pending follow up (#19881) @brandon-b-miller
- Require list type for is_valid_aggregation and MERGE_LISTS/SETS (#19876) @davidwendt
- Temporary solution to ensure data-source/sink stream ordering (#19874) @kingcrimsontianyu
- Check for integer overflow in cudf::strings::find_multiple (#19867) @davidwendt
- Fix missing stream from cudf::top_k_order (#19866) @davidwendt
- Disallow loc.setitem with list-like indexer when list elements not in index (#19851) @mroeschke
- Fix .str.replace ignoring n for single character replacements (#19848) @mroeschke
- Fix strings::find_instance warp parallel logic (#19845) @davidwendt
- Add changed-files to the needs of every job that requires it (#19830) @Matt711
- xfail polars decimal(precision=None) test (#19821) @Matt711
- Fix empty column returned by cudf::from_arrow_stream_column (#19812) @davidwendt
- Filter pandas warning in dask_cudf test (#19808) @TomAugspurger
- Update identify_stream_usage CUDA runtime hooks to CUDA 13 (#19807) @robertmaynard
- When bundling libnvcomp.so.X only append the major version value (#19786) @robertmaynard
- Improvements to pylibcudf.from_iterable_of_py (#19781) @Matt711
- Avoid using multiple Cache nodes with the same hash (#19769) @rjzamora
- Fix window var() test failures from float rounding (#19761) @Matt711
- Use is_compressed field from Parquet V2 data page headers to determine if they are compressed (#19755) @mhaseeb123
- Fix bug in eval function with nvtx-0.2.11 (#19754) @galipremsagar
- Fix ndsh benchmarks nvtx range usage (#19753) @davidwendt
- Support nan in non-floating point column in cudf-polars (#19742) @Matt711
- Fix filter call in benchmark (#19732) @vyasr
- Suppress NVRTC warning from stdint.h (#19712) @davidwendt
- Correctly decode boolean lists in chunked parquet reader (#19707) @mhaseeb123
- Add new xfails for xarray release (#19705) @vyasr
- Fix "--executor" pytest parameter for cudf-polars (#19703) @rjzamora
- Match polars semantics for rolling-sum with all-null windows (non-empty) (#19680) @Matt711
- [BUG] Set query_set arg when validating/running cudf-polars PDS-DS benchmarks (#19674) @Matt711
- Fix group_by().agg() on non-aggregatable dtypes (#19669) @Matt711
- Fix broken links in 10min notebook (#19665) @Matt711
- Skip managed memory test if managed memory not supported in cudf-polars (#19653) @Matt711
- Fix integer overflow in warp-per-row grid calculation (#19638) @davidwendt
- Propagate exceptions thrown in async IO operations (#19628) @vuule
- Make DataFrame.dtypes not fallback to CPU always (#19627) @galipremsagar
- Set scalar to valid in range_window_bounds unbounded/current_row (#19622) @davidwendt
- Enable data page mask computation for nullable list and struct columns (#19617) @mhaseeb123
- Fix cudf::sequence() to throw exception for invalid scalar inputs (#19612) @davidwendt
- Fix uninitialized variable and misaligned write in parquet generic decoder (#19601) @mhaseeb123
- Compatibility with rapidsmpf 25.10.0 (#19591) @TomAugspurger
- Avoid querying device memory on systems without it in dask-cudf (#19577) @Matt711
- Avoid querying device memory on systems without it in cudf-polars benchmarks (#19575) @Matt711
- Increase alignment requirement for parquet bloom filter to 256 (#19573) @mhaseeb123
- Fix strftime with non-exact %a, %A, %b, %B (#19570) @mroeschke
- Fix OOB memcheck error in group_rank_to_percentage utility (#19567) @davidwendt
- Fix logic for number of unique values generated by data profile in benchmarks (#19540) @shrshi
- Fix contiguous-split nvbench cmake build (#19534) @davidwendt
- Fix value counts expression when the column has nulls (#19524) @Matt711
- Prefer Column.astype over plc.unary.cast in the fill null unary function expression (#19479) @Matt711
- Fix missing return in StringFunction.Strptime strict=True path (#19464) @Matt711
- Make dividing a boolean column return f64 dtype in cudf-polars (#19443) @Matt711
- branch-25.10-merge-branch-25.08 (#19429) @davidwendt
- Replace sprintf with std::format in libcudf parquet tests (#19364) @davidwendt
📖 Documentation
- Update missing docs (#19925) @vyasr
- Add examples of null handling to doxygen for cudf::rank (#19774) @davidwendt
- Fix cudf-polars dependency list docs (#19750) @pentschev
- Update cuDF classic testing documentation regarding testing organization (#19745) @mroeschke
- Improve documentation around why we need no_gc_clear on pylibcudf Scalars (#19661) @vyasr
🚀 New Features
- Add memory resource parameters to interop, merge, and transpose (#20007) @vyasr
- Add mixed join benchmark with complex AST operators (#20004) @PointKernel
- Add memory resource arguments to join, round, and labeling (#20001) @vyasr
- cudf-polars strptime format inference (#19997) @brandon-b-miller
- Filter parquet row groups using byte offset bounds (#19991) @mhaseeb123
- Add memory resource arguments to concatenate (#19943) @vyasr
- Use column statistics to generate the physical plan in cuDF-Polars (#19940) @rjzamora
- Add all missing stream parameters (#19922) @vyasr
- Remote IO support in cudf-polars (#19921) @Matt711
- Add streams to io/timezone and io/text modules (#19913) @vyasr
- Add stream support to all nvtext modules (#19911) @vyasr
- Add streams to all top-level strings modules (#19910) @vyasr
- Update strings split APIs with stream parameters (#19909) @vyasr
- Support ordered grouped windows in cudf-polars (#19891) @Matt711
- Add local row-count and unique-count estimates to explain(... logical=True) (#19864) @rjzamora
- Add join *_match_context APIs to hash join (#19835) @PointKernel
- Support rank(...).over(...) expressions in cudf-polars (#19803) @Matt711
- Add strings to/from encoded integer APIs (#19789) @davidwendt
- Add to_arrow method to pylibcudf core types (#19787) @Matt711
- Add streams to strings convert APIs (#19780) @vyasr
- Add an option to support reading ORC timestamp column as UTC time. (#19773) @res-life
- Support null_count in groupby/rolling context (#19739) @Matt711
- Collect join-key information in cudf-polars (#19736) @rjzamora
- Add count aggregation support to cudf::reduce (#19734) @davidwendt
- [FEA] Implement AST Expression - JIT codegen (#19733) @lamarrr
- Add streams to all scalar factories (#19729) @vyasr
- Add streams to reshape (#19728) @vyasr
- Add streams to null mask APIs (#19727) @vyasr
- Add streams to column APIs (#19726) @vyasr
- Construct next-gen parquet reader with pre-populated footer (#19724) @mhaseeb123
- Require numba-cuda>=0.19.0,<0.20.0a0 (#19711) @brandon-b-miller
- Support over expression (window mapping) in cudf-polars (#19684) @Matt711
- Add streams support to all list APIs (#19683) @vyasr
- [FEA] Add Filter Benchmark (#19678) @lamarrr
- Add streams to pylibcudf join APIs (#19672) @vyasr
- Add streams to sorting APIs (#19671) @vyasr
- [FEA] Remove excessive copies of JITIFY's ProgramData during JIT kernel launch (#19667) @lamarrr
- Add streams to hashing APIs (#19663) @vyasr
- Use a more robust metric for sorting (de)compression tasks (#19656) @vuule
- Add streams support to datetime APIs (#19654) @vyasr
- Add streams to stream_compaction (#19651) @vyasr
- Enable casting pl.Datetime to integer types in cudf-polars (#19647) @brandon-b-miller
- Add Java JNI interface to get Gpu UUID (#19646) @res-life
- Add reduction with overflow detection (#19641) @PointKernel
- Upgrade to nvCOM...
[NIGHTLY] v25.12.00
🚨 Breaking Changes
- Change .str.starts/endswith with tuple argument to match any pattern instead of pairwise matching (#20249) @mroeschke
- Remove DataFrame.apply_chunks, Groupby.apply_grouped (#20194) @mroeschke
- [cudf-polars] CUDA stream (#20154) @madsbk
- Remove compatibility with nvCOMP versions before 5.0 (#20140) @vuule
- Rewrite JNI functions to use JNI_TRY/JNI_CATCH (#19053) @ttnghia
🐛 Bug Fixes
- We need this to pacify mypy (#20285) @wence-
- Purge non-empty nulls for the generated lists columns in data generation utility (#20283) @ttnghia
- Add Proxy for SparseAccessor (#20278) @galipremsagar
- Add private attributes for cudf.pandas proxy objects (#20276) @galipremsagar
- Pin ibis-framework<11.0.0 (#20267) @Matt711
- Pin deltalake in cudf-polars-polars-tests CI job (#20255) @TomAugspurger
- Add stream and mr arguments to Column.from_arrow type stub (#20244) @TomAugspurger
- Fix the host-device tdigest offsets by using cuda::std::span (#20220) @PointKernel
- Deallocation should be noexcept (#20219) @bdice
- Change stream_checking_resource_adaptor::do_deallocate to noexcept (#20218) @vyasr
- Fix a race condition in the decode of delta encoded Parquet columns (#20216) @vuule
- Handle NVMLError_NotSupported in cudf-polars (#20179) @TomAugspurger
- Require passing memory resources to from_libcudf methods (#20171) @vyasr
- Fix RMM JNI pinned_fallback_host_memory_resource for CCCL 3.1.0 (#20160) @bdice
- Fix arrow timestamp frequency cases in cudf.pandas (#20128) @galipremsagar
- Fix cudf.date_range with non-iso start and end date strings (#20116) @mroeschke
- Unproxy few unnecessary testing utilities in pandas (#20088) @galipremsagar
- Fix create_distinct_rows_column to create non-nullable columns (#20082) @davidwendt
- Handle missing nightly runs in pandas tests job (#20081) @galipremsagar
- Cast inputs to true division from decimal to float (#20077) @Matt711
- Copy attrs at correct place in DataFrame constructor (#20074) @galipremsagar
- Fix numpy ufunc for DataFrame (#20070) @galipremsagar
- Align decimal dtypes in predicate before conditional join (#20060) @Matt711
- Enable hash-groupby for decimal32/64 type and MEAN aggregation (#20040) @davidwendt
- Fix libcudf groupby benchmarks to not include internal cache (#20038) @davidwendt
📖 Documentation
- Add profiling guide (#20292) @bdice
- Add note that --rmm-async only affects distributed scheduler. (#20129) @bdice
🚀 New Features
- Implement ARGMIN and ARGMAX aggregations for reduction (#20207) @ttnghia
- Add remaining memory resources (#20197) @vyasr
- Add memory resources to scalars (#20196) @vyasr
- Skip decompression of pruned parquet pages (#20192) @mhaseeb123
- Add memory resources to replace, json, and hashing (#20150) @vyasr
- Support decimal literals in cudf-polars (#20147) @Matt711
- Add pylibcudf is_valid_reduce_aggregation API (#20145) @davidwendt
- Add memory resources to I/O modules (#20136) @vyasr
- Add memory resources to reduce, column, column_factories, and contiguous_split (#20135) @vyasr
- Passthrough unary ops through Parquet predicate pushdown (#20127) @mhaseeb123
- Add memory resource to all strings modules (#20123) @vyasr
- Add memory resources to all nvtext APIs (#20119) @vyasr
- Add an example to inspect parquet files and dump row group and page level metadata information (#20117) @mhaseeb123
- Allow multiple calls to cudf::initialize and cudf::deinitialize (#20111) @vuule
- Remove rounding from cudf java (#20110) @pmattione-nvidia
- Add memory resources to groupby, datetime, and lists modules (#20102) @vyasr
- Add memory resources to search, reshape, and partitioning module (#20101) @vyasr
- Add memory resources to rolling, sorting, and quantiles modules (#20099) @vyasr
- Add memory resources to binaryop, copying, and stream_compaction (#20059) @vyasr
- Add memory resources to unary, transform, and filling modules (#20054) @vyasr
- Support cum_sum(...).over(...) expressions in cudf-polars (#19908) @Matt711
- Support forward/backward filling null values in a grouped window context (#19907) @Matt711
- [FEA] Implement JIT Filter for read_parquet (#19831) @lamarrr
- Add an example to demonstrate the use of next-gen parquet reader to read a parquet file with highly selective filters (#19469) @mhaseeb123
- Rewrite JNI functions to use JNI_TRY/JNI_CATCH (#19053) @ttnghia
- Add support for maintain_order param in joins (#17698) @Matt711
🛠️ Improvements
- Add more Python type annotations to cudf/core (#20287) @mroeschke
- Skip mypy in pre-commit.ci (#20286) @bdice
- Remove extraneous host_memory_resource include (#20284) @bdice
- Add numpy to the mypy pre-commit environment (#20282) @vyasr
- Add MultiIndex.dtypes (#20279) @galipremsagar
- Add more type annotations to cudf/core/column subclasses (#20277) @mroeschke
- Handle unordered grouped windows properly for null filling and cum sums (#20275) @Matt711
- Unpin DuckDB and Ibis in cudf.pandas thirdparty tests (#20269) @mroeschke
- Enable sccache-dist connection pool (#20264) @trxcllnt
- Add ability to set the source_info of parquet_reader_options (#20253) @wence-
- Update ConfigOptions for rapidsmpf-streaming integration (#20252) @rjzamora
- Add arm testing of cudf.pandas unit tests (#20251) @vyasr
- Add pylibcudf to pre-commit linting and fix outstanding errors (#20250) @vyasr
- Change .str.starts/endswith with tuple argument to match any pattern instead of pairwise matching (#20249) @mroeschke
- Move and rename ScanPartitionPlan (#20248) @rjzamora
- Standardize setting StructDtype field names post libcudf conversion (#20235) @mroeschke
- Prevent accidental copies of expensive-to-copy object types (#20226) @vuule
- More mypy and docs fixes (#20224) @vyasr
- Configuration for which metrics are enabled during tracing (#20223) @TomAugspurger
- Fix parquet row number check for page bounds (#20217) @pmattione-nvidia
- Rename comparison_binop_generator to arg_minmax_binop_generator and corresponding file to nested_types_extrema_utils.cuh (#20212) @Copilot
- Fix various typing errors (#20205) @vyasr
- Stop using libcudf default parameters in pylibcudf (#20204) @vyasr
- Pin pydantic<2.12 in ci/test_cudf_polars_polars_tests.sh (#20200) @mroeschke
- Support binops between float scalar to decimal column (#20199) @mroeschke
- Add an overhead field to cudf-polars tracing (#20198) @TomAugspurger
- Remove DataFrame.apply_chunks, Groupby.apply_grouped (#20194) @mroeschke
- [pre-commit.ci] pre-commit autoupdate (#20189) @pre-commit-ci[bot]
- Revert "Temporarily disable conda-java-tests" (#20184) @bdice
- Don't assume cudf_polars benchmarking scale factor is always an integer (#20182) @mroeschke
- Remove unnecessary work from read_parquet_metadata (#20180) @vuule
- Reduce execution times for parquet dictionary tests (#20176) @mhaseeb123
- Skip filtering Parquet row groups with dictionaries if there are non-dict encoded pages (#20175) @mhaseeb123
- Improve performance of groupby tdigests gtests (#20173) @davidwendt
- Update to rapids-logger 0.2 (#20172) @bdice
- Split row operator header (#20166) @PointKernel
- Add PDSH benchmark runner for cudf.pandas (#20164) @mroeschke
- Temporarily disable conda-java-tests (#20162) @bdice
- Manual forward merger for Branch 25.12 - branch 25.10 (#20157) @galipremsagar
- [cudf-polars] CUDA stream (#20154) @madsbk
- Avoid NumericalColumn call from CategoricalColumn.children (#20153) @mroeschke
- Branch 25.12 merge branch 25.10 (#20152) @vyasr
- Make ListColumn._transform_leaves convert via pylibcudf (#20151) @mroeschke
- Make ColumnBase.as_*_column convert via pylibcudf (#20149) @mroeschke
- Make ColumnBase.deserialize construct via pylibcudf (#20142) @mroeschke
- Remove unused ColumnBase.view (#20141) @mroeschke
- Remove compatibility with nvCOMP versions before 5.0 (#20140) @vuule
- Adjust rmm pool handling in PDSH benchmarks (#20138) @TomAugspurger
- Fix slowdown in cudf-polars distributed tests (#20137) @TomAugspurger
- Disable async MR priming in cudf.pandas (#20133) @bdice
- Fix type annotations in cudf-polars (#20131) @TomAugspurger
- Add tests for AUTO and HYBRID (de)compression modes (#20126) @vuule
- Run cudf-polars wheels unit tests with more than 1 process (#20124) @mroeschke
- Add pyarrow stubs to mypy environment and fix associated errors (#20118) @vyasr
- Avoid running pandas unit tests for private functionality with cudf.pandas (#20115) @mroeschke
- Remove MultiIndex.from_pandas pytest benchmark (#20112) @mroeschke
- Use 8 processes for pandas tests, show top 10 test times (#20109) @bdice
- Reduce verbosity of running the pandas test suite (#20107) @vyasr
- Switch host_vector and host_span dependency (#20106) @davidwendt
- Make Column.set_mask go through pylibcudf (#20103) @mroeschke
- Have ListColumn.from_sequence go through pylibcudf (#20098) @mroeschke
- Deprecate legacy public row operators (#20097) @PointKernel
- Fix `RAPIDS_BRANCH` version and update script (#20091) @galipremsagar
- Reduce output buffer sizes for pruned pages of columns with a `list` parent (#20086) @mhaseeb123
- Avoid direct CategoricalColumn calls in dask_cudf (#20080) @mroeschke
- Rework reduction case statement as dispatch_type_and_aggregation (#20078) @davidwendt
- Avoid shadowing module names (#20071) @vyasr
- Fix typing issues in pylibcudf (#20069) @vyasr
- Avoid more explicit calls to IntervalColumn and StructColumn (#20064) @mroeschke
- Cleanup of some libcudf aggregation code (#20053) @davidwendt
- Prune entries in Sphinx nitpick_ignore (#20045) @mroeschke
- Deprecate .from_pandas constructor (#19996) @mroeschke
- Improve performance of string column size computation during parquet reads. (#19986) @nvdbaranec
- Run cudf-polars conda unit tests with more than 1 process (#19980) @mroeschk...
v25.08.00
🚨 Breaking Changes
- Allow `np.dtype('object')` for cases that are valid (#19478) @galipremsagar
- [FEA] Remove CUDA JIT-Compatibility Checks & CCCL WARs (#19470) @lamarrr
- Drop cuda 11 usages (#19386) @galipremsagar
- Deprecate cudf::round for float types (#19298) @davidwendt
- Support output_dtype in cudf::reduce for nunique aggregation (#19265) @davidwendt
- Change default cudf-polars executor to "streaming" (#19263) @TomAugspurger
- Fix Handling of Complex Types in AST (#19248) @lamarrr
- Enable chunked reading of PQ sources with `>2B` rows (#19245) @mhaseeb123
- Refactor `grid_1d` class (#19211) @lamarrr
- Return valid for all-nulls in reduce() with nunique include-nulls aggregation (#19196) @davidwendt
- Refactor JNI error handling (#19149) @ttnghia
- Remove CUDA 11 from dependencies.yaml (#19139) @KyleFromNVIDIA
- Quick fixes of `modernize-use-constraints` rule (#19105) @vuule
- Filter Parquet row groups using row bounds (#19082) @mhaseeb123
- Temporarily revert "Refactor JNI error handling (#18983)" (#19076) @abellina
- Rename `parquet_chunked_writer` to `chunked_parquet_writer` for consistency with the reader (#19047) @mhaseeb123
- Compile libcudf using C++20 Standard (#19045) @vuule
- Refactor JNI error handling (#18983) @ttnghia
- stop uploading packages to downloads.rapids.ai (#18973) @jameslamb
- Remove deprecated Series methods, isclose (#18947) @mroeschke
- Remove deprecated groupby.collect (#18946) @mroeschke
- Remove deprecated get_dummies(cats=, ...) (#18944) @mroeschke
- Add pylibcudf.Column.from_arrow factory method (#18937) @Matt711
- Add pylibcudf.Table.from_arrow factory method (#18936) @Matt711
- Remove deprecated APIs (#18933) @vuule
- Remove cudf.Scalar (#18927) @mroeschke
- Remove deprecated `cudf::io::host_buffer` (#18881) @Matt711
- Null-handling for Transforms (#18845) @lamarrr
- Enable `skip_rows` in the chunked parquet reader. (#18130) @mhaseeb123
🐛 Bug Fixes
- Increase alignment requirement for parquet bloom filter to 256 (#19595) @mhaseeb123
- Revert "Add primitive row dispatch support for semi/anti join and cudf::contains" (#19503) @PointKernel
- Allow `np.dtype('object')` for cases that are valid (#19478) @galipremsagar
- Add conda dependency on nvidia-ml-py. (#19454) @bdice
- Mark `cudf.pandas` notebook repr test as flaky (#19441) @Matt711
- Fix pytest to properly expose a bug (#19433) @galipremsagar
- Switch from `thrust::sort` to `cub::DeviceRadixSort` in Parquet chunked reader (#19414) @ttnghia
- Use numba-cuda>=0.15.2,<0.16 (#19413) @bdice
- Update String Transform Examples (#19407) @lamarrr
- [BUG] Make floor division and modulo by 0 match CPU polars (#19406) @Matt711
- Handle empty input in cudf::strings::extract APIs (#19398) @davidwendt
- Fix jitify error on exit from FILTER_TEST (#19395) @davidwendt
- Update cudf.pandas tests to silence deprecation warnings (#19377) @Matt711
- Replace sprintf with snprintf in libcudf parquet tests (#19371) @davidwendt
- Make DateOffset respect timezone (#19366) @Matt711
- Fix flaky tests in `cudf.pandas` (#19345) @TomAugspurger
- Update protocol choices for ucxx in PDSH benchmark (#19343) @TomAugspurger
- Remove passing pandas tests from xfail list (#19341) @Matt711
- Fix Union-Slice bug (#19336) @Matt711
- Fix bit shift overflow in segmented_offset_bitmask_binop utility (#19329) @davidwendt
- Fix job filters for `pandas-tests` (#19322) @galipremsagar
- Fix compile warning in interop_stringview.cpp (#19320) @davidwendt
- Fix a use-after-free issue in TDigest aggregation code. (#19311) @nvdbaranec
- Always represent datetime aware data as UTC in strftime (#19304) @mroeschke
- Do not pass cupy objects to numba kernels directly (#19283) @brandon-b-miller
- Correct docstring for `DataFrame.apply` to match code (#19262) @dagardner-nv
- Cast `n_unique` aggregation result to match polars (#19256) @Matt711
- Fix Handling of Complex Types in AST (#19248) @lamarrr
- Add missing include (#19239) @vyasr
- Raised `MixedTypeErrors` for condition that lead to mixed types (#19232) @galipremsagar
- Fix errors in the nvCOMP adapter (#19221) @vuule
- Remove nvToolsExt usage (#19209) @vyasr
- Fix a pair of bugs in get_decompression_scratch() size. (#19207) @nvdbaranec
- Allow `is_list_like` to return correct values by disabling it (#19188) @galipremsagar
- Fix slicing after `Join` and `GroupBy` in streaming cudf-polars (#19187) @rjzamora
- Fix `binops` type preservation for some dtypes (#19183) @galipremsagar
- Fix streaming `GroupBy` on non-trivial keys (#19181) @rjzamora
- Fix bitmask in from_arrow_host for sliced stringview type (#19174) @davidwendt
- Fixed group_by mean with missing values and multiple partitions (#19165) @TomAugspurger
- Add fallback to `HStack` lowering in cudf-polars (#19163) @rjzamora
- Fix `Literal` partitioning in cudf-polars (#19160) @rjzamora
- Fix `from_array_interface` for empty arrays (#19144) @Matt711
- Adding GH_TOKEN pass-through to summarize job (#19143) @msarahan
- Fix hash collision in Union([MapFunction]) (#19124) @TomAugspurger
- Fix bug in `group_by().n_unique()` in streaming cudf-polars (#19108) @rjzamora
- Parse (non-MultiIndex) label-based keys to structured data (#19103) @mroeschke
- Fix cudf_polars spilling (#19101) @TomAugspurger
- Fix libcudf strings case logic to set null-row size to zero (#19095) @davidwendt
- Temporarily revert "Refactor JNI error handling (#18983)" (#19076) @abellina
- Temporary workaround for incorrect `SplitScan` results in cuDF-Polars (#19071) @rjzamora
- Use default memory resource for JSON_QUOTE_NORMALIZATION gtests (#19057) @davidwendt
- Added null-probability to polynomial benchmarks and fixed transform call-sites (#18972) @lamarrr
- Fix flaky custreamz test (#18961) @TomAugspurger
- Fix tdigest percentile correctness for low row-counts (#18952) @mythrocks
- Enable `skip_rows` in the chunked parquet reader. (#18130) @mhaseeb123
📖 Documentation
- Update conda environment file for CUDA 12.9 compatibility (#19376) @a-hirota
- Update recommended gcc version in contributing guide (#19365) @davidwendt
- Autodoc DateOffset (#19297) @wence-
- Fix cudf::column_device_view::element() doxygen (#19296) @davidwendt
- Document aggregations for cudf::reduce in doxygen (#19264) @davidwendt
- add docs on CI workflow inputs (#19234) @jameslamb
- Update README and CONTRIBUTING to reflect new CUDA requirements (#19138) @PointKernel
- Remove the extra index URL for CUDA 12 (#19128) @vyasr
- Improve WordPieceVocabulary.tokenize documentation (#19098) @davidwendt
- Add some basic streaming engine documentation (#19088) @wence-
- Update the contributing guide to include pylibcudf in the build command (#19011) @Matt711
- Fix pylibcudf docs for some strings APIs (#19004) @davidwendt
- Update cuDF Python library design with BaseIndex and pylibcudf updates (#18903) @mroeschke
🚀 New Features
- Avoid using UVM on systems without a traditional memory resource (#19444) @Matt711
- Add parquet-sampling configuration options (#19423) @rjzamora
- Add new JSON reader interface accepting string column input to pylibcudf (#19400) @shrshi
- Add a parquet reader utility to update output null masks (#19370) @mhaseeb123
- Build and ship `shim.cu` file as LTOIR (#19368) @brandon-b-miller
- Add cudf::strings::find_instance API (#19326) @davidwendt
- Add single-file streaming `Sink` support (#19317) @rjzamora
- Support null_count expression (#19314) @Matt711
- Materialize tables in the experimental Parquet reader (#19308) @mhaseeb123
- Add new cudf::top_k API (#19303) @davidwendt
- Add cudf::strings::split_part API (#19289) @davidwendt
- Support output_dtype in cudf::reduce for nunique aggregation (#19265) @davidwendt
- Add `post_traversal` API to cudf-polars (#19258) @rjzamora
- Deprecate `DataFrame.apply_rows` (#19218) @brandon-b-miller
- Require `numba-cuda>=0.16.0` (#19213) @brandon-b-miller
- Add a mode to co-process decompression and compression on host and device (#19203) @vuule
- Return valid for all-nulls in reduce() with nunique include-nulls aggregation (#19196) @davidwendt
- Refactor JNI error handling (#19149) @ttnghia
- Add support for horizontal string concatenation `pl.concat_str` (#19142) @Matt711
- Add PDS-DS Query 1 (#19131) @Matt711
- Support `cudf-polars` `str.reverse` (#19117) @brandon-b-miller
- Support `cudf-polars` `str.pad_end` and `str.pad_start` (#19116) @brandon-b-miller
- Support `cudf-polars` `str.head` and `str.tail` (#19115) @brandon-b-miller
- Support `cudf-polars` `str.to_titlecase` (#19114) @brandon-b-miller
- Add `cudf/io/codec.hpp` to expose compression/decompression APIs (#19113) @ttnghia
- Support converting decimals to/from pylibcudf scalars (#19106) @Matt711
- Support resource-constrained sort-merge inner join operation through left table partitioning (#19102) @shrshi
- Filter Parquet row groups using row bounds (#19082) @mhaseeb123
- Implement UDF Filters (#19070) @lamarrr
- Move the remaining libcudf pieces to C++20 (#19065) @vuule
- Allow using a stream per thread at runtime (#19051) @vyasr
- Remove stacktrace retrieval code (#19048) @ttnghia
- Compile libcudf using C++20 Standard (#19045) @vuule
- String Transform Examples: Added Branching, Public API Versions, and Sampling (#19038) @lamarrr
- Refactor JNI error handling (#18983) @ttnghia
- Add basic `Sink` support for streaming cudf-polars executor (#18963) @rjzamora
- Fix debug-build Failure in JIT Tests (#18939) @lamarrr
- Add from_arrow factory methods for Scalar and DataType (#18938) @Matt711
- Add pylibcudf.Column.from_arrow factory method (#18937) @Matt711
- Add pylibcudf.Table.from_arrow factory method (#18936) @Matt711
- Update nvCOMP adapter (#18931) @vuule
- Create a pylibcudf Column from a iterable of python strings (#18916) @Matt711
- Add CLI argument to enable OOM protection in PDS-H (#18914) @pentschev
- Implement data page pruning using Parquet page index stats (#18873) @mhaseeb123
- Null-handlin...
[NIGHTLY] v25.10.00
🚨 Breaking Changes
- Remove UCX-Py (#19979) @pentschev
- Revert "Migrate mixed join to use multiset #19660" (#19933) @PointKernel
- Fill missing values in `Series/Index.values` for numeric types with np.nan by default (#19923) @mroeschke
- Remove deprecated `DataFrame.apply_rows`, deprecate `DataFrame.apply_chunks` and `Groupby.apply_grouped` (#19896) @mroeschke
- Move prefetching out of experimental and simplify the API (#19875) @vyasr
- Add join `*_match_context` APIs to hash join (#19835) @PointKernel
- Vendor libnvcomp in libcudf (#19743) @bdice
- Migrate mixed join to use multiset (#19660) @PointKernel
- Separate row mask and page mask computation and usage (#19537) @mhaseeb123
- [FEA] Implement null-aware transforms and filters (#19502) @lamarrr
- Support output-type for MEDIAN/QUANTILE aggregation in cudf::reduce (#19267) @davidwendt
🐛 Bug Fixes
- Fix edge cases in statistics collection (#20094) @rjzamora
- Fix multi-partition `Filter` bug (#20075) @rjzamora
- Fix `reindex` to fill only the reindexed values with `fill_value` (#20063) @galipremsagar
- Fix arrow arrays + numpy ufunc interaction (#20047) @galipremsagar
- Fix race conditions in ORC reader decimal decoding (#20044) @vuule
- Keep mr alive along with arrow tables and columns (#20028) @vyasr
- Fix `value_counts` missing `nan` bug (#20026) @galipremsagar
- Compatibility for rapidsmpf's unspill_partitions (#20020) @TomAugspurger
- Fix type metadata preservation in `shift` (#20017) @galipremsagar
- Fix incorrect type propagation in dataframe assignment (#20010) @galipremsagar
- Fix OOB memory read in decode_page_data_generic kernel (#19995) @davidwendt
- Fix data_type creation in ast::operation::instantiate (#19994) @davidwendt
- Skip Narwhals pandas get_dtype_backend[pyarrow] tests after ArrowDtype proxy changes (#19992) @Matt711
- Make cudf.pandas callables usable with inspect.getfullargspec (#19988) @mroeschke
- Align decimal dtypes to schema after parquet IO scan (#19974) @Matt711
- Avoid undefined numpy protocols on cudf.pandas proxy objects (#19968) @mroeschke
- Skip failing polars iceberg test (#19955) @Matt711
- Revert "Migrate mixed join to use multiset #19660" (#19933) @PointKernel
- Define FrozenList proxy independently in cudf.pandas (#19931) @mroeschke
- Ignore scalars when broadcasting for horizontal string concatenation in cudf-polars (#19893) @Matt711
- Fix is_valid_rolling_aggregation for STD aggregation (#19888) @davidwendt
- Fix a decompression parameter in the chunked ORC reader (#19882) @vuule
- Skip flaky stats tests pending follow up (#19881) @brandon-b-miller
- Require list type for is_valid_aggregation and MERGE_LISTS/SETS (#19876) @davidwendt
- Temporary solution to ensure data-source/sink stream ordering (#19874) @kingcrimsontianyu
- Check for integer overflow in cudf::strings::find_multiple (#19867) @davidwendt
- Fix missing stream from cudf::top_k_order (#19866) @davidwendt
- Disallow loc.setitem with list-like indexer when list elements not in index (#19851) @mroeschke
- Fix .str.replace ignoring n for single character replacements (#19848) @mroeschke
- Fix strings::find_instance warp parallel logic (#19845) @davidwendt
- Add changed-files to the needs of every job that requires it (#19830) @Matt711
- xfail polars `decimal(precision=None)` test (#19821) @Matt711
- Fix empty column returned by cudf::from_arrow_stream_column (#19812) @davidwendt
- Filter pandas warning in dask_cudf test (#19808) @TomAugspurger
- Update identify_stream_usage CUDA runtime hooks to CUDA 13 (#19807) @robertmaynard
- When bundling `libnvcomp.so.X` only append the major version value (#19786) @robertmaynard
- Improvements to `pylibcudf.from_iterable_of_py` (#19781) @Matt711
- Avoid using multiple `Cache` nodes with the same hash (#19769) @rjzamora
- Fix window var() test failures from float rounding (#19761) @Matt711
- Use `is_compressed` field from Parquet V2 data page headers to determine if they are compressed (#19755) @mhaseeb123
- Fix bug in `eval` function with `nvtx-0.2.11` (#19754) @galipremsagar
- Fix ndsh benchmarks nvtx range usage (#19753) @davidwendt
- Support `nan` in non-floating point column in cudf-polars (#19742) @Matt711
- Fix filter call in benchmark (#19732) @vyasr
- Suppress NVRTC warning from stdint.h (#19712) @davidwendt
- Correctly decode boolean lists in chunked parquet reader (#19707) @mhaseeb123
- Add new xfails for xarray release (#19705) @vyasr
- Fix "--executor" pytest parameter for cudf-polars (#19703) @rjzamora
- Match polars semantics for rolling-sum with all-null windows (non-empty) (#19680) @Matt711
- [BUG] Set `query_set` arg when validating/running cudf-polars PDS-DS benchmarks (#19674) @Matt711
- Fix `group_by().agg()` on non-aggregatable dtypes (#19669) @Matt711
- Fix broken links in 10min notebook (#19665) @Matt711
- Skip managed memory test if managed memory not supported in cudf-polars (#19653) @Matt711
- Fix integer overflow in warp-per-row grid calculation (#19638) @davidwendt
- Propagate exceptions thrown in async IO operations (#19628) @vuule
- Make `DataFrame.dtypes` not fallback to CPU always (#19627) @galipremsagar
- Set scalar to valid in range_window_bounds unbounded/current_row (#19622) @davidwendt
- Enable data page mask computation for nullable `list` and `struct` columns (#19617) @mhaseeb123
- Fix cudf::sequence() to throw exception for invalid scalar inputs (#19612) @davidwendt
- Fix uninitialized variable and misaligned write in parquet generic decoder (#19601) @mhaseeb123
- Compatibility with rapidsmpf 25.10.0 (#19591) @TomAugspurger
- Avoid querying device memory on systems without it in dask-cudf (#19577) @Matt711
- Avoid querying device memory on systems without it in cudf-polars benchmarks (#19575) @Matt711
- Increase alignment requirement for parquet bloom filter to 256 (#19573) @mhaseeb123
- Fix strftime with non-exact %a, %A, %b, %B (#19570) @mroeschke
- Fix OOB memcheck error in group_rank_to_percentage utility (#19567) @davidwendt
- Fix logic for number of unique values generated by data profile in benchmarks (#19540) @shrshi
- Fix contiguous-split nvbench cmake build (#19534) @davidwendt
- Fix value counts expression when the column has nulls (#19524) @Matt711
- Prefer `Column.astype` over `plc.unary.cast` in the fill null unary function expression (#19479) @Matt711
- Fix missing return in StringFunction.Strptime strict=True path (#19464) @Matt711
- Make dividing a boolean column return f64 dtype in cudf-polars (#19443) @Matt711
- branch-25.10-merge-branch-25.08 (#19429) @davidwendt
- Replace sprintf with std::format in libcudf parquet tests (#19364) @davidwendt
📖 Documentation
- Update missing docs (#19925) @vyasr
- Add examples of null handling to doxygen for cudf::rank (#19774) @davidwendt
- Fix cudf-polars dependency list docs (#19750) @pentschev
- Update cuDF classic testing documentation regarding testing organization (#19745) @mroeschke
- Improve documentation around why we need no_gc_clear on pylibcudf Scalars (#19661) @vyasr
🚀 New Features
- Add memory resource parameters to interop, merge, and transpose (#20007) @vyasr
- Add mixed join benchmark with complex AST operators (#20004) @PointKernel
- Add memory resource arguments to join, round, and labeling (#20001) @vyasr
- `cudf-polars` `strptime` format inference (#19997) @brandon-b-miller
- Filter parquet row groups using byte offset bounds (#19991) @mhaseeb123
- Add memory resource arguments to concatenate (#19943) @vyasr
- Use column statistics to generate the physical plan in cuDF-Polars (#19940) @rjzamora
- Add all missing stream parameters (#19922) @vyasr
- Remote IO support in cudf-polars (#19921) @Matt711
- Add streams to io/timezone and io/text modules (#19913) @vyasr
- Add stream support to all nvtext modules (#19911) @vyasr
- Add streams to all top-level strings modules (#19910) @vyasr
- Update strings split APIs with stream parameters (#19909) @vyasr
- Support ordered grouped windows in cudf-polars (#19891) @Matt711
- Add local row-count and unique-count estimates to `explain(... logical=True)` (#19864) @rjzamora
- Add join `*_match_context` APIs to hash join (#19835) @PointKernel
- Support `rank(...).over(...)` expressions in cudf-polars (#19803) @Matt711
- Add strings to/from encoded integer APIs (#19789) @davidwendt
- Add to_arrow method to pylibcudf core types (#19787) @Matt711
- Add streams to strings convert APIs (#19780) @vyasr
- Add an option to support reading ORC timestamp column as UTC time. (#19773) @res-life
- Support null_count in groupby/rolling context (#19739) @Matt711
- Collect join-key information in cudf-polars (#19736) @rjzamora
- Add count aggregation support to cudf::reduce (#19734) @davidwendt
- [FEA] Implement AST Expression - JIT codegen (#19733) @lamarrr
- Add streams to all scalar factories (#19729) @vyasr
- Add streams to reshape (#19728) @vyasr
- Add streams to null mask APIs (#19727) @vyasr
- Add streams to column APIs (#19726) @vyasr
- Construct next-gen parquet reader with pre-populated footer (#19724) @mhaseeb123
- Require `numba-cuda>=0.19.0,<0.20.0a0` (#19711) @brandon-b-miller
- Support `over` expression (window mapping) in cudf-polars (#19684) @Matt711
- Add streams support to all list APIs (#19683) @vyasr
- [FEA] Add Filter Benchmark (#19678) @lamarrr
- Add streams to pylibcudf join APIs (#19672) @vyasr
- Add streams to sorting APIs (#19671) @vyasr
- [FEA] Remove excessive copies of JITIFY's ProgramData during JIT kernel launch (#19667) @lamarrr
- Add streams to hashing APIs (#19663) @vyasr
- Use a more robust metric for sorting (de)compression tasks (#19656) @vuule
- Add streams support to datetime APIs (#19654) @vyasr
- Add streams to stream_compaction (#19651) @vyasr
- Enable casting `pl.Datetime` to integer types in ...
v25.06.00
🚨 Breaking Changes
- Remove cudf.BaseIndex (#18751) @mroeschke
- Implement `BIT_COUNT` unary operation (#18589) @ttnghia
- Expose column chunk metadata in `read_parquet_metadata()` (#18579) @mhaseeb123
- Fix overflow for `MERGE_M2` groupby aggregation (#18546) @ttnghia
- Deduplicate parquet physical type enums (#18526) @mhaseeb123
- Implemented String Output & User-data Support for Transforms (#18490) @lamarrr
- Promote Parquet type enums to enum classes (#18441) @mhaseeb123
- Move parquet schema types and structs to public headers (#18424) @mhaseeb123
- Start removal of vector factories with `_sync` suffix by deprecating them and adding versions without the suffix (#18414) @vuule
- Skip decoding of pages marked as pruned in PQ reader (#18347) @mhaseeb123
- Deprecate nvtext subword tokenizer (#18334) @davidwendt
- Add standard data ingestion pipelines to pylibcudf for ndarrays (#18311) @Matt711
- Remove extraneous modules from top level cudf namespace (#18287) @mroeschke
- Add Keep Option Parameter to Distinct (#18237) @warrickhe
- Update to CCCL 2.8.x with no CCCL patches (#18235) @bdice
🐛 Bug Fixes
- Disable pytest benchmark for Narwhals CI job (#19074) @Matt711
- Avoid undefined behaviour in rolling_store_output_functor (#19069) @wence-
- Filter out pkg_resources UserWarning to make nightly CI pass (#19058) @Matt711
- Pin deltalake to <1.0.0 (#19017) @Matt711
- [BUG] Incorrectly getting the caller's frame when searching for locals and globals in cudf.pandas (#18979) @Matt711
- Ensure gc fixture is used in custreamz test (#18915) @TomAugspurger
- Fix a potential segfault in PQ reader's number of rows per source calculation (#18906) @mhaseeb123
- Fix Dataframe `getitem` when `MultiIndex` columns exist (#18880) @galipremsagar
- Ensure eq/ne between Columns in public objects don't return bool (#18875) @mroeschke
- Fix fencepost error in `Repartition` task generation (#18854) @wence-
- Fix cudf_polars pl.col(...).len() always excluding null values (#18849) @mroeschke
- Throw a descriptive exception in Parquet reader when trying to read files with more than two billion rows (#18835) @mhaseeb123
- Skip a decompression test (#18825) @vuule
- Update strings benchmarks to use alloc_size column/table function (#18822) @davidwendt
- Fix host decompression of empty DEFLATE data (#18805) @vuule
- Avoid going OOM in `test_row_limit_exceed_raises` by using dummy array (#18802) @Matt711
- Fix host decompression of empty Snappy data (#18800) @vuule
- Skip test that fails due to polars issue (#18787) @wence-
- Ensure scalar dtype is always set in from_py (#18780) @vyasr
- Fix reading of Snappy compressed Avro files (#18774) @vuule
- Fix missing semicolon in label_bins.cu (#18765) @evanramos-nvidia
- Fix noexcept annotations on strings_column_view (#18763) @wence-
- Fix integer overflows in pylibcudf `from_column_view_of_arbitrary` (#18758) @wence-
- Fix overflow case and clean up some logic (#18734) @vyasr
- Link to `nvtx3::nvtx3-cpp` instead of `nvToolsExt` (#18730) @jakirkham
- Revise `DaskIntegration` protocol to align with `rapidsmpf` (#18720) @rjzamora
- Fix `skip_compression` option in the Parquet writer with host compression (#18714) @vuule
- Add missing header (#18671) @vyasr
- Revert "Set flag to always use unsafe atomic storage" (#18657) @PointKernel
- Fix optional operator* called on a disengaged value in clamp.cu (#18655) @davidwendt
- Add missing header to host_memory.cpp (#18649) @alliepiper
- Fix device compression when writing Parquet files without using nvCOMP (#18644) @vuule
- Add CUDA_ARCHITECTURES setting to cpp-linters script (#18637) @davidwendt
- Pin to cython<3.1 (#18617) @wence-
- Fix `DataFrame.memory_usage` output order (#18595) @mroeschke
- Set flag to always use unsafe atomic storage (#18590) @PointKernel
- Update KvikIO S3 endpoint usage (#18565) @kingcrimsontianyu
- Skip cuml third-party integration tests that may segfault (#18561) @Matt711
- Allow .iloc with cuDF objects as column indexers (#18558) @mroeschke
- Fix overflow for `MERGE_M2` groupby aggregation (#18546) @ttnghia
- Add back cudf root (#18544) @vyasr
- Change default memory resource for 'distributed' cudf-polars (#18531) @rjzamora
- Fix copy-on-write buffer separation and cleanup (#18530) @galipremsagar
- Fix cpp examples cmake to use the rapids_config.cmake (#18501) @davidwendt
- Rename rapidsmp to rapidsmpf (#18493) @rjzamora
- Fix compilation with the C++20 standard (#18486) @vuule
- Fix an error when reading some compressed Parquet V2 files (#18478) @vuule
- Support title-case characters in strings capitalize() and title() APIs (#18457) @davidwendt
- Ensure DataFrame column label operations reset label_dtype (#18452) @mroeschke
- Fix a segfault when reading a Parquet file with unsupported compression type (#18451) @vuule
- Fix logger macros (#18444) @vyasr
- Fix auto-detection of compression type in host-side decompression (#18440) @shrshi
- Use delete not free to release data allocated with new (#18412) @wence-
- Fix synchronization issues in host compression and decompression (#18395) @vuule
- Update Dask array-conversion handling (#18382) @rjzamora
- Fixed indexing on empty DataFrame with no columns (#18381) @TomAugspurger
- Deterministic hashing for DataFrameScan nodes in cudf-polars multi-partition executor (#18351) @TomAugspurger
- Fix index of right table in unary operators in AST, in Joins (#18333) @karthikeyann
- Add offsetalator to contiguous-split (#18312) @davidwendt
- Support large strings in nvtext vocabulary-tokenizer (#18283) @davidwendt
- Handle empty aggregations in multi-partition cudf.polars group_by (#18277) @TomAugspurger
📖 Documentation
- Docs for streaming executor options (#18934) @quasiben
- Fix some duplicate toctree issues and improve groupby docs (#18580) @vyasr
- [DOC] Running libcudf benchmarks and comparing output results (#18548) @Matt711
- Fix doxygen usage of the contraction for it is (#18517) @davidwendt
- Clarify @brief tag as description/title on documentation guide (#18515) @davidwendt
- [DOC] Improve clarity in parquet APIs set_row_groups and set_columns parquet (#18466) @Matt711
- Add a usage page to cudf-polars documentation (#18460) @Matt711
- [DOC] Fix typo in CONTRIBUTING.md on build type tests (#18456) @JigaoLuo
- improve docs related to documentation contribution (#18418) @ncclementi
- Add restart kernel note in cudf pandas docs (#18374) @ncclementi
🚀 New Features
- Add CLI argument to enable RMM async memory resource in PDS-H (#18899) @pentschev
- Scan a headerless CSV file with column names provided (#18816) @Matt711
- Add fast paths for `DataFrame.to_cupy` (#18801) @Matt711
- Require `numba-cuda>=0.11.0` (#18770) @brandon-b-miller
- Create a pylibcudf Column from a python iterable (#18768) @Matt711
- Support `ConditianalJoin` via broadcasting in cudf-polars streaming engine (#18723) @rjzamora
- Experimental PQ reader utility to calculate total rows in input row groups (#18716) @mhaseeb123
- Extend `explain_query` to support printing the logical plan (pre lowered plan) (#18708) @Matt711
- Reuse `libcudf` dependencies for Java JNI build when they are available (#18682) @ttnghia
- Add alloc_size member function to cudf::column and cudf::table (#18639) @davidwendt
- Print the physical cudf-polars plan in `pdsh.py` (#18635) @rjzamora
- String Transform Examples (#18616) @lamarrr
- Add streaming support for `group_by -> n_unique` to cudf-polars (#18606) @rjzamora
- Export cudf compiler flags and definitions (#18604) @ttnghia
- Implement `BIT_COUNT` unary operation (#18589) @ttnghia
- Expose column chunk metadata in `read_parquet_metadata()` (#18579) @mhaseeb123
- Add APIs to check ORC and Parquet compression support at runtime (#18578) @vuule
- Add `Distinct` support to the cudf-polars streaming executor (#18576) @rjzamora
- Add support for large list host Arrow data conversion (#18562) @vyasr
- Implement `BITWISE_AGG` aggregations (bitwise `AND`, `OR` and `XOR`) for sort-based groupby and reduction (#18551) @ttnghia
- Implement row group pruning with bloom filters in experimental PQ reader (#18545) @mhaseeb123
- Implement row group pruning with stats in experimental PQ reader (#18543) @mhaseeb123
- [JNI] Expose row-wise sha1 api (#18540) @warrickhe
- Add `Sort+head/tail` support to streaming cudf-polars executor (#18538) @rjzamora
- Add multi-partition MapFunction support to cudf-polars (#18523) @rjzamora
- Adds support for writing raw UTF-8 characters (without escaping) in the JSON writer (#18508) @Matt711
- Support reading from device buffers in the pylibcudf IO APIs (#18496) @Matt711
- Support multi-partition `Select` operations with aggregations (#18492) @rjzamora
- Implemented String Output & User-data Support for Transforms (#18490) @lamarrr
- Add a utility to bulk set multiple null masks (#18489) @mhaseeb123
- High level interface for experimental PQ reader and implementation of metadata APIs (#18480) @mhaseeb123
- Added `pylibcudf.utilities.is_ptds_enabled` (#18467) @TomAugspurger
- Add a public API for copying a table_view to device array (#18450) @Matt711
- Support `cudf-polars` `cast_time_unit` (#18442) @brandon-b-miller
- Support creating a pylibcudf Column from a host array (#18425) @Matt711
- Move parquet schema types and structs to public headers (#18424) @mhaseeb123
- Add optional dtype argument to `Scalar.from_any` (#18415) @Matt711
- Expose `cudf::chunked_pack` in pylibcudf (#18411) @wence-
- Add support for long string columns in cudf::contiguous_split (#18393) @nvdbaranec
- Implemented String Input support for Transforms and Removed `jit::column_device_view` (#18378) @lamarrr
- Automatically dispatch between host and device decompression/compression based on the number of buffers (#18363) @vuule
- Expose join hash table load factor (#18361) @PointKernel
- Skip decoding of pages marked as pruned in PQ reader (#18347) @mhaseeb123
- Sort-based inner join for high-multiplic...
v25.04.00
🚨 Breaking Changes
- Remove unused `group_range_rolling_window` API (#18313) @wence-
- [BUG] Disabled JIT for CUDA Runtime < 11.5 (#18296) @lamarrr
- Remove cudf.Scalar from binops (#18240) @mroeschke
- Enforce deprecation of dtype parameter in sum/product (#18070) @mroeschke
- Remove deprecated single component datetime extract APIs (#18010) @Matt711
- Remove deprecated rolling window functionality (#17993) @wence-
- Remove deprecated nvtext::minhash_permuted APIs (#17939) @davidwendt
- Remove dataframe protocol (#17909) @vyasr
- Use new rapids-logger library (#17899) @vyasr
- Added Multi-input & Scalar Support for Transform UDFs (#17881) @lamarrr
- Fixed incorrect PTX parsing of `ret` instruction after branch label (#17859) @lamarrr
- Use KvikIO to enable file's fast host read and host write (#17764) @kingcrimsontianyu
🐛 Bug Fixes
- Fix alpha versions of cudf package. (#18429) @bdice
- Backport: Deterministic hashing for DataFrameScan nodes in cudf-polars multi-partition executor (#18351) (#18420) @bdice
- Skip failing Narwhals rolling groupby tests (#18398) @Matt711
- Pin cmake in test_java to be less than 4.0.0 (#18392) @abellina
- Skip polars tests that fail with pydantic deprecation warnings (#18388) @Matt711
- Backport: Fix index of right table in unary operators in AST, in Joins (#18342) @bdice
- xfail narwhals sqlframe tests (#18297) @Matt711
- [BUG] Disabled JIT for CUDA Runtime < 11.5 (#18296) @lamarrr
- Make a pylibcudf Column from a device array object with `strides=None` (#18295) @Matt711
- Fix `cudf.pandas` objects to not be `Callable` (#18288) @galipremsagar
- Skip failing polars test test_general_prefiltering (#18264) @Matt711
- Filter all cudf.pandas profiler tests from running in parallel (#18262) @Matt711
- Allow cudf.Series([pd.NA], dtype=, nan_as_null=False) (#18259) @mroeschke
- Fix cross join with extra columns (#18256) @galipremsagar
- Fix Dataframe.loc to not modify the actual dataframe (#18254) @galipremsagar
- Remove RMM macro usage from to_arrow_device.cu (#18252) @davidwendt
- Skip Narwhals cross join tests for cudf.pandas CI run (#18249) @Matt711
- Fix cudf-polars tests for polars < 1.24 (#18246) @wence-
- Fix experimental cudf-polars tests (#18244) @rjzamora
- Fix datetime64 vs datetime binops max resolution (#18241) @galipremsagar
- Use CCCL::libcudacxx include directories in Jitify preprocessing. (#18233) @bdice
- Disable conda prefix patching to avoid mangling binaries (#18225) @vyasr
- Workaround for ARM compiler issue with single space literal string (#18220) @davidwendt
- Bump nightly check limit (#18213) @Matt711
- Support comparative binops between categorical and non categorical (#18200) @mroeschke
- Make the version file inside cudf.pandas not a symlink (#18198) @vyasr
- Ensure RAPIDS_ARTIFACTS_DIR is set for build metrics reports. (#18192) @bdice
- Ignore run exports of libcufile. (#18190) @bdice
- Skip flaky multi GPU test (#18187) @Matt711
- Fix BPE merges table static-map capacity size (#18184) @davidwendt
- Drop CUB_QUOTIENT_CEILING (#18179) @miscco
- Disable ARM CI in C++ and Python test CI jobs (#18175) @Matt711
- Add fmt to the test/benchmarks env (#18173) @vyasr
- Fix merge(how=left, left_on=, right_index=True, sort=True) (#18166) @mroeschke
- Allow nonnative cupy dtype in cudf.Series (#18164) @mroeschke
- Fix Series construction from numpy array with non-native byte order (#18151) @mroeschke
- Use protocol for dlpack instead of deprecated function in cupy notebook (#18147) @Matt711
- Skip failing test (#18146) @vyasr
- Update calls to KvikIO's config setter (#18144) @kingcrimsontianyu
- Reduce memory use when writing tables with very short columns to ORC (#18136) @vuule
- Handle empty dictionary in to_arrow_device interop (#18121) @davidwendt
- Allow pivot_table to accept single label index and column arguments (#18115) @mroeschke
- Preserve DataFrame.column subclass and type during binop (#18113) @mroeschke
- Fix rmm macro call (#18108) @pmattione-nvidia
- Add include for <functional> (#18102) @miscco
- Remove static column vectors from window function tests. (#18099) @mythrocks
- Fix scatter_by_map with spilling enabled (#18095) @mroeschke
- Use the right version macro CCCL_MAJOR_VERSION (#18073) @miscco
- Fix test_scan_csv_multi cudf-polars test (#18064) @rjzamora
- Fix memcopy direction for concatenate (#18058) @tgujar
- Fix upstream dask loc test (#18045) @rjzamora
- Fix hang on invalid UTF-8 data in string_view iterator (#18039) @davidwendt
- Fix dask_cudf.to_orc deprecation (#18038) @rjzamora
- Compatibility with dask.dataframe's is_scalar (#18030) @TomAugspurger
- Fix the build error due to KvikIO update (#18025) @kingcrimsontianyu
- Fix failing ibis test (#18022) @Matt711
- Skip failing polars tests (#18015) @Matt711
- Fix to_arrow to return consistent pandas-metadata (#18009) @galipremsagar
- Prevent setting custom attributes to ColumnMethods (#18005) @galipremsagar
- Compatibility with Dask main (#17992) @TomAugspurger
- [Bug] Fix Parquet-metadata sampling in cudf-polars (#17991) @rjzamora
- Add missing include for calling std::iota() (#17983) @davidwendt
- Fix pickle and unpickling for all objects (#17980) @galipremsagar
- Install duckdb the default backend for ibis in the cudf.pandas integration tests (#17972) @Matt711
- Check null count too in sum aggregation (#17964) @Matt711
- Raise NotImplementedError for groupby.agg if duplicate columns would be created (#17956) @mroeschke
- Ensure disabling the module accelerator is thread-safe (#17955) @vyasr
- Fix DataFrame/Series.rank for int and null data in mode.pandas_compatible (#17954) @mroeschke
- Limit buffer size in reallocation policy in JSON reader (#17940) @shrshi
- Make cudf.pandas proxy array picklable (#17929) @Matt711
- Add missing standard includes (#17928) @miscco
- Fix torch integration test (#17923) @Matt711
- Fix to_pandas writable bug for datetime and timedelta types (#17913) @galipremsagar
- Raise NotImplementedError if .merge(suffixes=) introduces duplicate labels (#17905) @mroeschke
- Fix groupby scans with int and NA data in mode.pandas_compatible (#17895) @mroeschke
- Patch __init__ of cudf constructors to parse through cudf.pandas proxy objects (#17878) @galipremsagar
- Fixed incorrect PTX parsing of ret instruction after branch label (#17859) @lamarrr
- Relax inconsistent schema handling in dask_cudf.read_parquet (#17554) @rjzamora
📖 Documentation
- Clarify that cudf.pandas should be enabled before importing pandas. (#18339) @bdice
- [DOC] Add wordpiece tokenizer to cudf documentation (#18247) @davidwendt
- Added pylibcudf.contiguous_split to API docs (#18194) @TomAugspurger
- Fix build.sh docs for default behavior (#18180) @bdice
- Update Dask-cuDF documentation to fix all warnings and errors (#18157) @TomAugspurger
- [DOC] Document character normalizer (#18125) @Matt711
🚀 New Features
- Add and revise experimental cudf-polars config options (#18284) @rjzamora
- Support top-k and bottom_k expressions (#18222) @Matt711
- Support cudf-polars is_leap_year (#18212) @brandon-b-miller
- Support cudf-polars month_start/month_end (#18211) @brandon-b-miller
- Support cudf-polars ordinal_day (#18152) @brandon-b-miller
- Add pylibcudf.gpumemoryview support for len()/nbytes (#18133) @pentschev
- Link to libzstd for ZSTD compression and decompression APIs (#18129) @shrshi
- Added NDSH Q09 Benchmark for Transforms (#18127) @lamarrr
- Make pylibcudf traits raise exceptions gracefully rather than terminating in C++ (#18117) @Matt711
- Host decompression (#18114) @vuule
- Add owning types to hold Arrow data (#18084) @vyasr
- Bump polars version to <1.24 (#18076) @Matt711
- Support sorted merges in cudf.polars (#18075) @Matt711
- Add a slice expression to polars IR (#18050) @Matt711
- Expose num_rows_per_source (IO metadata) to pylibcudf (#18049) @Matt711
- Added Imbalanced Tree Benchmarks for Transforms (#18032) @lamarrr
- Run the narwhals test suite with cudf.pandas (#18031) @Matt711
- Add host_read_async interfaces to datasource (#18018) @vuule
- Make most cudf-polars Node objects pickleable (#17998) @rjzamora
- Add Column.serialize to cudf-polars (#17990) @rjzamora
- Bump polars version to <1.23 (#17986) @Matt711
- Implemented Decimal Transforms (#17968) @lamarrr
- Introduce ZSTD host-side compression and decompression APIs (#17935) @shrshi
- Add catboost integration tests (#17931) @Matt711
- [FEA] Expose stripe_size_rows setting for ORCWriterOptions (#17927) @ustcfy
- Test narwhals in CI (#17884) @bdice
- Added Multi-input & Scalar Support for Transform UDFs (#17881) @lamarrr
- Host Snappy compression (#17824) @vuule
- Run spark-rapids-jni CI (#17781) @KyleFromNVIDIA
- Add multi-partition Shuffle operation to cuDF Polars (#17744) @rjzamora
- Added polynomials benchmark (#17695) @lamarrr
- Add stream parameters in pylibcudf IO APIs (#17620) @Matt711
- New nvtext::wordpiece_tokenizer APIs (#17600) @davidwendt
- Add support for unary negation operator (#17560) @Matt711
- Add multi-partition Join support to cuDF-Polars (#17518) @rjzamora
- Add basic multi-partition GroupBy support to cuDF-Polars (#17503) @rjzamora
- Support Distributed in cudf-polars tests and IR evaluation (#17364) @pentschev
🛠️ Improvements
- Use pyarrow 15 in oldest dependency CI jobs (#18409) @bdice
- Bump librdkafka to 2.8.0 (#18370) @raydouglass
- fix(rattler): ignore libzlib run dependency to avoid pandoc collision (#18368) @gforsyth
- Fix zstd build interface include definition (#18366) @trxcllnt
- test: Install pytest-env and hypothesis in test_narwhals.sh (#18337) @MarcoGorelli
- Remove unused group_range_rolling_window API (#18313) @wence-
- Cache column view creation from arrow types (#18302) @vyasr
- Split Narwhals cudf.pandas tests failures into to fix and to skip (#18267) @mroeschke
- Support BinOp, min, and max Aggregations in cudf-polars parallel ...
v25.02.02
🚨 Breaking Changes
- Expose stream-ordering in scalar and avro APIs (#17766) @shrshi
- Add seed parameter to hash_character_ngrams (#17643) @davidwendt
- Performance improvements and simplifications for fixed size row-based rolling windows (#17623) @wence-
- Refactor distinct hash join to handle multiple probes with the same build table (#17609) @PointKernel
- Deprecate cudf::grouped_time_range_rolling_window (#17589) @wence-
- Remove "legacy" Dask DataFrame support from Dask cuDF (#17558) @rjzamora
- Return empty result for segmented_reduce if input and offsets are both empty (#17437) @davidwendt
- Rework minhash APIs for deprecation cycle (#17421) @davidwendt
- Change indices for dictionary column to signed integer type (#17390) @davidwendt
🐛 Bug Fixes
- Use protocol for dlpack instead of deprecated function (#18134) @vyasr
- Skip the failing connectorx polars tests (#18037) @Matt711
- Fix 'Unexpected short subpass' exception in parquet chunked reader. (#18019) @nvdbaranec
- Fix race check failures in shared memory groupby (#17985) @PointKernel
- Pin ibis version in the cudf.pandas integration tests <10.0.0 (#17975) @Matt711
- Fix the index type in the indexing operator of the span types (#17971) @vuule
- Add missing pin (#17915) @vyasr
- Fix third-party cudf.pandas tests (#17900) @galipremsagar
- Fix numpy data access by making attribute private (#17890) @galipremsagar
- Remove extra local var declaration from cudf.pandas 3rd-party integration shell script (#17886) @Matt711
- Move isinstance_cudf_pandas to fast_slow_proxy (#17875) @galipremsagar
- Make _Series_dtype method a property (#17854) @Matt711
- Fix the bug in determining the heuristics for shared memory groupby (#17851) @PointKernel
- Fix possible OOB mem access in Parquet decoder (#17841) @mhaseeb123
- Require batches to be non-empty in multi-batch JSON reader (#17837) @shrshi
- Fix rolling(min_periods=) with int and null data with mode.pandas_compat (#17822) @mroeschke
- Resolve race-condition in disable_module_accelerator (#17811) @galipremsagar
- Make Series(dtype=object) raise in mode.pandas_compat with non string data (#17804) @mroeschke
- Disable intended disabled ORC tests (#17790) @davidwendt
- Fix empty DataFrame construction not returning RangeIndex columns (#17784) @mroeschke
- Fix various .str methods for pandas compatibility (#17782) @mroeschke
- Fix count API issue about ignoring nan values (#17779) @galipremsagar
- Add numba pinning to cudf repo (#17777) @galipremsagar
- Allow .sort_values(na_position=) to include NaNs in mode.pandas_compatible (#17776) @mroeschke
- allow deselecting nvcomp wheels (#17774) @jameslamb
- Use the aligned_resource_adaptor to allocate bloom filter device buffers (#17758) @mhaseeb123
- Avoid instantiating bloom filter query function for nested and bool types (#17753) @mhaseeb123
- Fix DataFrame.merge(Series, how="left"/"right") on column and index not resulting in a RangeIndex (#17739) @mroeschke
- [BUG] xfail Polars excel test (#17731) @Matt711
- Require to implement AutoCloseable for the classes derived from HostUDFWrapper (#17727) @ttnghia
- Remove jlowe as a java committer since he retired (#17725) @tgravescs
- Prevent use of invalid grid sizes in ORC reader and writer (#17709) @vuule
- Enforce schema for partial tables in multi-source multi-batch JSON reader (#17708) @shrshi
- Compute and use the initial string offset when building nested large string cols with chunked parquet reader (#17702) @mhaseeb123
- Fix writing of compressed ORC files with large stripe footers (#17700) @vuule
- Fix cudf.polars sum of empty not equalling zero (#17685) @mroeschke
- Fix formatting in logging (#17680) @vuule
- convert all nulls to nans in a specific scenario (#17677) @galipremsagar
- Define cudf repr methods on the Column (#17675) @mroeschke
- Fix groupby.len with null values in cudf.polars (#17671) @mroeschke
- Fix: DataFrameGroupBy.get_group was raising with length>1 tuples (#17653) @MarcoGorelli
- Fix possible int overflow in compute_mixed_join_output_size (#17633) @davidwendt
- Fix a minor potential i32 overflow in thrust::transform_exclusive_scan in PQ reader preprocessing (#17617) @mhaseeb123
- Fix failing xgboost test in the cudf.pandas third-party integration tests (#17616) @Matt711
- Fix dask_cudf.read_csv (#17612) @rjzamora
- Fix memcheck error in ReplaceTest.NormalizeNansAndZerosMutable gtest (#17610) @davidwendt
- Correctly accept a pandas.CategoricalDtype(pandas.IntervalDtype(...), ...) type (#17604) @mroeschke
- Add ability to modify and propagate names of columns object (#17597) @galipremsagar
- Ignore NaN correctly in .quantile (#17593) @mroeschke
- Fix groupby argmin/max gather of sorted-order indices (#17591) @davidwendt
- Fix ctest fail running libcudf tests in a Debug build (#17576) @davidwendt
- Specify a version for rapids_logger dependency (#17573) @jlowe
- Fix the ORC decoding bug for the timestamp data (#17570) @kingcrimsontianyu
- [JNI] remove rmm argument to set rw access for fabric handles (#17553) @abellina
- Document undefined behavior in div_rounding_up_safe (#17542) @davidwendt
- Fix nvcc-imposed UB in constexpr functions (#17534) @vuule
- Add anonymous namespace to libcudf test source (#17529) @davidwendt
- Propagate failures in pandas integration tests and Skip failing tests (#17521) @Matt711
- Fix libcudf compile error when logging is disabled (#17512) @davidwendt
- Fix Dask-cuDF clip APIs (#17509) @rjzamora
- Fix pylibcudf to_arrow with multiple nested data types (#17504) @mroeschke
- Fix groupby(as_index=False).size not resetting index (#17499) @mroeschke
- Revert "Temporarily skip tests due to dask/distributed#8953" (#17492) @Matt711
- Workaround for a misaligned access in read_csv on some CUDA versions (#17477) @vuule
- Fix some possible thread-id overflow calculations (#17473) @davidwendt
- Temporarily skip tests due to dask/distributed#8953 (#17472) @wence-
- Detect mismatches in begin and end tokens returned by JSON tokenizer FST (#17471) @shrshi
- Support dask>=2024.11.2 in Dask cuDF (#17439) @rjzamora
- Fix write_json failure for zero columns in table/struct (#17414) @karthikeyann
- Fix Debug-mode failing Arrow test (#17405) @zeroshade
- Fix all null list column with missing child column in JSON reader (#17348) @karthikeyann
📖 Documentation
- Fix forward merge 24.12->25.02 (#18002) @raydouglass
- Fix incorrect example in pylibcudf docs (#17912) @Matt711
- Explicitly call out that the GPU open beta runs on a single GPU (#17872) @taureandyernv
- Update cudf.pandas colab link in docs (#17846) @taureandyernv
- [DOC] Make pylibcudf docs more visible (#17803) @Matt711
- Cross-link cudf.pandas profiler documentation. (#17668) @bdice
- Document interpreter install command for cudf.pandas (#17358) @bdice
- add comment to Series.tolist method (#17350) @tequilayu
🚀 New Features
- Bump polars version to <1.22 (#17771) @Matt711
- Make more constexpr available on device for cuIO (#17746) @PointKernel
- Add public interop functions between pylibcudf and cudf classic (#17730) @Matt711
- Support dask_expr migration into dask.dataframe (#17704) @rjzamora
- Make tests build without relaxed constexpr (#17691) @PointKernel
- Set default logger level to warn (#17684) @vyasr
- Support multithreaded reading of compressed buffers in JSON reader (#17670) @shrshi
- Control pinned memory use with environment variables (#17657) @vuule
- Host compression (#17656) @vuule
- Enable text build without relying on relaxed constexpr (#17647) @PointKernel
- Implement HOST_UDF aggregation for reduction and segmented reduction (#17645) @ttnghia
- Add JSON reader options structs to pylibcudf (#17614) @Matt711
- Refactor distinct hash join to handle multiple probes with the same build table (#17609) @PointKernel
- Add JSON Writer options classes to pylibcudf (#17606) @Matt711
- Add ORC reader options structs to pylibcudf (#17601) @Matt711
- Add Avro Reader options classes to pylibcudf (#17599) @Matt711
- Enable binaryop build without relying on relaxed constexpr (#17598) @PointKernel
- Measure the number of Parquet row groups filtered by predicate pushdown (#17594) @mhaseeb123
- Implement HOST_UDF aggregation for groupby (#17592) @ttnghia
- Plumb pylibcudf.io.parquet options classes through cudf python (#17506) @Matt711
- Add partition-wise Select support to cuDF-Polars (#17495) @rjzamora
- Add multi-partition Scan support to cuDF-Polars (#17494) @rjzamora
- Migrate cudf::io::merge_row_group_metadata to pylibcudf (#17491) @Matt711
- Add Parquet Reader options classes to pylibcudf (#17464) @Matt711
- Add multi-partition DataFrameScan support to cuDF-Polars (#17441) @rjzamora
- Return empty result for segmented_reduce if input and offsets are both empty (#17437) @davidwendt
- Abstract polars function expression nodes to ensure they are serializable (#17418) @pentschev
- Add CSV Reader options classes to pylibcudf (#17412) @Matt711
- Add support for pylibcudf.DataType serialization (#17352) @pentschev
- Enable rounding for Decimal32 and Decimal64 in cuDF (#17332) @a-hirota
- Remove upper bounds on cuda-python to allow 12.6.2 and 11.8.5 (#17326) @bdice
- Expose stream-ordering to groupby APIs (#17324) @shrshi
- Migrate ORC Writer to pylibcudf (#17310) @Matt711
- Support reading bloom filters from Parquet files and filter row groups using them (#17289) @mhaseeb123
🛠️ Improvements
- Update to nvcomp 4.2.0.11 (#18042) @bdice
- Remove pandas backend from cudf.pandas - ibis integration tests (#17945) @Matt711
- Revert CUDA 12.8 shared workflow branch changes (#17879) @vyasr
- Remove predicate param from DataFrameScan IR (#17852) @Matt711
- Remove cudf.Scalar from scatter APIs (#17847) @mroeschke
- Remove cudf.Scalar from interval_range (#17844) @mroeschke
- Add verify-codeowners hook (#17840) @KyleFromNVIDIA
- Build and test with CUDA 12.8.0 (#17834) @bdice
- In...