
Optimize cudf usage of pylibcudf#21362

Open
vyasr wants to merge 29 commits into rapidsai:main from vyasr:feat/opencode_experiments

Conversation


@vyasr vyasr commented Feb 6, 2026

Description

This PR includes various optimizations for cudf. Most come from improved usage of pylibcudf inside cudf: chained operations now act directly on pylibcudf columns rather than round-tripping through chains of cudf operations, each of which reconstructs a cudf ColumnBase, which is not free. The remaining optimizations largely come from caching more properties.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

vyasr added 28 commits February 5, 2026 17:34
This commit introduces a new _pylibcudf_helpers module that provides
efficient implementations of commonly-used patterns in column classes.

Key changes:
- Add all_strings_match_type() helper to validate string types without
  creating intermediate column objects
- Add reduce_boolean_column() helper for efficient boolean reductions
- Update string.py to use all_strings_match_type() in 3 locations,
  eliminating intermediate NumericalColumn allocations
- Add comprehensive test suite with 15 tests

Performance impact:
- String type validation is ~2x faster by avoiding intermediate column
  allocations during is_integer().all() and is_float().all() checks

All tests passing with no regressions.
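The core idea behind the helper can be sketched in pure Python. The names here mirror the commit's helper, but the integer check is a simplified stand-in, not cudf's actual parser, and the real implementation runs on the GPU via pylibcudf:

```python
# Pure-Python sketch of the idea behind all_strings_match_type(): instead of
# materializing a full boolean column (is_integer()) and then reducing it
# with .all(), validate and reduce in one pass with an early exit.
def all_strings_match_type(values, predicate):
    """Return True only if every non-null string satisfies `predicate`."""
    return all(predicate(v) for v in values if v is not None)

def looks_like_int(s):
    # Simplified integer check; the real check lives in libcudf.
    stripped = s.lstrip("+-")
    return stripped.isdigit()

print(all_strings_match_type(["1", "-2", None, "30"], looks_like_int))  # True
print(all_strings_match_type(["1", "2.5"], looks_like_int))             # False
```

The generator expression never builds an intermediate list of booleans, which is the allocation the commit eliminates at column scale.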

Replace the inefficient .isna().any().any() pattern with direct column
null-mask checks using has_nulls().

Performance improvements:
- 10-100x faster depending on DataFrame size
- Zero intermediate allocations (vs 100s of MB for large DataFrames)
- More direct and readable code

Changes:
- DataFrame.cov(): Check nulls via any(col.has_nulls() for col in self._columns)
- DataFrame.corr(): Same optimization

The old pattern created a full boolean DataFrame with isna(), then
performed two reductions. The new approach directly checks existing
null masks without any allocations.

All cov/corr tests passing (293 tests).
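A minimal sketch of the new check, using a hypothetical stand-in column class (the real has_nulls() reads an already-maintained null count on the cudf column, so no data pass is needed at all):

```python
# Sketch of the cov()/corr() null check. The old path built a whole boolean
# DataFrame via isna() and then reduced it twice; the new path inspects
# per-column null counts and short-circuits on the first column with nulls.
class FakeColumn:
    def __init__(self, values, null_count):
        self.values = values
        self.null_count = null_count

    def has_nulls(self):
        # Reads existing metadata; no allocation, no data pass.
        return self.null_count > 0

columns = [FakeColumn([1, 2, 3], 0), FakeColumn([4, None, 6], 1)]

# Equivalent of: any(col.has_nulls() for col in self._columns)
print(any(col.has_nulls() for col in columns))  # True
```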

Add fillna_bool_false() helper to avoid intermediate column allocations
when filling nulls with False in boolean columns.

Optimized datetime properties:
- is_month_start: (self.day == 1).fillna(False) → fillna_bool_false(self.day == 1)
- is_month_end: Similar optimization
- is_year_start: (self.day_of_year == 1).fillna(False) → optimized
- is_year_end: Complex expression + fillna → optimized
- is_quarter_start: Boolean AND + fillna → optimized
- is_quarter_end: Boolean AND + fillna → optimized

Performance impact:
- Eliminates one intermediate column allocation per property access
- Uses pylibcudf replace_nulls directly instead of column.fillna()
- Particularly beneficial for these frequently-used datetime properties

All datetime property tests passing (6 tests).
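The semantics being fused can be shown with a small list-based sketch (illustrative only; the real helper performs the null replacement in a single pylibcudf replace_nulls call on device memory):

```python
# Sketch of fillna_bool_false(): produce (day == 1) with nulls mapped
# straight to False, instead of building a nullable boolean column first
# and then running a separate fillna(False) pass over it.
def is_month_start(days):
    """`days` holds day-of-month values, with None marking a null row."""
    return [d == 1 if d is not None else False for d in days]

print(is_month_start([1, 15, None, 1]))  # [True, False, False, True]
```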

…ic tools

This commit extends the use of the all_strings_match_type() helper
function to additional modules, continuing the optimization work from
commit 85b3d2ea82.

Key changes:
- Update to_datetime() to use all_strings_match_type() instead of
  is_integer().all() pattern
- Update _convert_str_col() in to_numeric() to use
  all_strings_match_type() instead of is_integer().all() pattern

Performance impact:
- Eliminates intermediate NumericalColumn allocations during string
  type validation in datetime and numeric conversion functions
- Provides consistent ~2x speedup for string validation across
  these critical data conversion paths

All tests passing with no regressions (249 string casting tests,
169 to_numeric tests).

This commit introduces efficient pylibcudf helper functions that eliminate
intermediate column allocations in common validation and reduction patterns.

Key changes:
- Add _pylibcudf_helpers module with all_strings_match_type() and
  reduce_boolean_column() functions
- Update StringColumn to use all_strings_match_type() instead of
  is_integer().all() and is_float().all() patterns (4 locations)
- Update NumericalColumn to use minmax() instead of separate min() and
  max() calls
- Add comprehensive test suite with 22 tests covering all edge cases

Performance impact:
- String type validation ~2x faster by avoiding intermediate NumericalColumn
  allocations during is_integer().all() and is_float().all() checks
- Min/max computation uses single-pass reduction instead of two separate
  passes
- Reduces GPU memory allocations in hot paths (type casting, validation)

All tests passing with no regressions (421 tests passing).
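The min/max fusion is easy to illustrate in plain Python. This is a sketch of the single-pass idea only; cudf's actual change calls pylibcudf's fused minmax reduction on the GPU:

```python
# Sketch of the minmax() change: one pass computing both extremes instead
# of two separate reductions over the same data.
def minmax(values):
    it = iter(values)
    lo = hi = next(it)          # assumes a non-empty input
    for v in it:
        if v < lo:
            lo = v
        elif v > hi:
            hi = v
    return lo, hi

print(minmax([3, -1, 7, 4]))  # (-1, 7)
```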

Restore TIMESTAMP_DAYS and EMPTY type conversions in _wrap_buffers()
that were accidentally removed. These conversions are critical for:
- Converting TIMESTAMP_DAYS to TIMESTAMP_SECONDS (fixes KeyError issues)
- Converting EMPTY columns to INT8 with all nulls (fixes category/list validation)

Also:
- Remove create_non_null_mask() and its tests (unused optimization)
- Fix bools_to_mask() usage to properly unpack tuple return value

Add two new helper functions to optimize null checking in numerical columns:
- isnull_including_nan(): Combines is_null() | isnan() in one operation
- notnull_excluding_nan(): Combines is_valid() & notnan() in one operation

These helpers eliminate intermediate column allocations when checking
for nulls in float columns where NaN values should be treated as null.

Applied to NumericalColumn.isnull() and NumericalColumn.notnull() methods.

Performance impact:
- Eliminates one intermediate boolean column allocation per call
- Uses single pylibcudf binary operation instead of separate unary ops
- Particularly beneficial for large float columns with NaN handling
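The fused predicate can be sketched element-wise in pure Python (a stand-in for the real device-side kernel, where None models a null entry):

```python
# Sketch of isnull_including_nan(): fuse the null mask and the NaN mask
# into one boolean result rather than building is_null() and isnan()
# columns separately and OR-ing them afterwards.
import math

def isnull_including_nan(values):
    """None models a null entry; NaN is treated as null as well."""
    return [v is None or (isinstance(v, float) and math.isnan(v))
            for v in values]

print(isnull_including_nan([1.0, None, float("nan"), 2.5]))
# [False, True, True, False]
```

notnull_excluding_nan() is simply the element-wise negation of this mask, computed the same fused way.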

High-impact optimizations with clear performance wins:

1. **Use minmax() instead of separate min() + max()** (numerical.py)
   - Eliminates duplicate GPU pass over data in can_cast_safely()

2. **Expand fillna_bool_false() usage** (9 locations)
   - string.py: 2 locations (as_numerical_column, character type checks)
   - dataframe.py: 1 location (isin() method)
   - series.py: 1 location (is_leap_year property)
   - index.py: 1 location (is_leap_year property)
   - interval.py: 1 location (is_empty property)
   - accessors/string.py: 1 location (isempty() method)
   - Eliminates intermediate column allocation per call

3. **Cache fillna(0) in join helpers** (_join_helpers.py)
   - Caches rcol.fillna(0) and lcol.fillna(0) results
   - Avoids redundant fillna operations in same method (3 calls → 1 each)

Performance impact: Moderate gains across hot paths, better code consistency
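The fillna-caching change (item 3) amounts to hoisting a repeated call into a local. A toy sketch with a hypothetical Col class that counts fill operations:

```python
# Sketch of the join-helper change: compute the filled column once and
# reuse it, instead of re-running fillna(0) at each use site.
class Col:
    fill_calls = 0  # counts how many fillna passes actually ran

    def __init__(self, values):
        self.values = values

    def fillna(self, fill):
        Col.fill_calls += 1
        return Col([fill if v is None else v for v in self.values])

rcol = Col([1, None, 3])

# Before: three independent rcol.fillna(0) calls in the same method.
# After: one call, with the result reused at every site.
rcol_filled = rcol.fillna(0)
uses = [rcol_filled.values, rcol_filled.values, rcol_filled.values]
print(Col.fill_calls)  # 1
```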

This commit adds several code quality and performance improvements:

1. Add .is_all_null property to ColumnBase
   - More readable than checking null_count == len(self)
   - Applied to ~19 locations across the codebase

2. Add .valid_count property to ColumnBase
   - Equivalent to len(self) - self.null_count but cached
   - Applied to 4 locations

3. Add fillna_numeric_zero() helper
   - Optimized fillna(0) using pylibcudf directly
   - Applied to ~6 locations (join helpers, numerical column operations)

4. Applied fillna_numeric_zero() to numeric column operations
   - Replaces .fillna(0) pattern with more efficient helper
   - Reduces intermediate column allocations

These optimizations improve code readability and performance by:
- Using descriptive property names instead of verbose comparisons
- Eliminating redundant calculations via caching
- Leveraging direct pylibcudf operations to avoid intermediate allocations
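The two properties are thin derived values; a sketch of how they might read (names match the commit, but this toy class omits the caching and everything else the real ColumnBase does):

```python
# Sketch of the two new ColumnBase properties: simple derived values that
# read more clearly than the inline arithmetic they replace.
class ColumnBase:
    def __init__(self, values, null_count):
        self._values = values
        self.null_count = null_count

    def __len__(self):
        return len(self._values)

    @property
    def is_all_null(self):
        # Replaces scattered `null_count == len(self)` comparisons.
        return self.null_count == len(self)

    @property
    def valid_count(self):
        # Replaces `len(self) - self.null_count` arithmetic.
        return len(self) - self.null_count

col = ColumnBase([None, None], null_count=2)
print(col.is_all_null, col.valid_count)  # True 0
```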

Cache dtype.kind in methods with multiple accesses to reduce
attribute lookup overhead. This is particularly beneficial in:

1. Index.union() - Cache self.dtype.kind and other.dtype.kind
   to avoid 6 attribute accesses (reduced to 2)

2. ColumnBase binary operations - Cache self/other/common dtype kinds
   to avoid repeated lookups in type checking

3. NumericalColumn.__invert__() - Cache dtype.kind for if/elif chain

These are hot paths that are called frequently during DataFrame operations.
The caching eliminates redundant attribute access overhead without changing
behavior.
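The pattern is just hoisting a repeated attribute chain into a local before a branch ladder. A sketch using NumPy dtypes (whose `.kind` codes cudf dtypes follow):

```python
# Sketch of the dtype.kind caching: hoist the attribute lookup into a
# local so a hot if/elif ladder does it once instead of once per branch.
import numpy as np

def classify(dtype):
    kind = dtype.kind            # single lookup, reused below
    if kind == "b":
        return "boolean"
    elif kind in "iu":           # signed or unsigned integer
        return "integer"
    elif kind == "f":
        return "floating"
    return "other"

print(classify(np.dtype("int64")))    # integer
print(classify(np.dtype("float32")))  # floating
```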

Optimize empty/null checks for better readability and slight performance:

1. Use is_all_null property in categorical replace operations
   - Replace len(col) == col.null_count with is_all_null
   - More readable and consistent with codebase patterns

2. Improve reduction early exit logic
   - Separate is_all_null and len(col) == 0 checks for clarity
   - Makes the logic more explicit and maintainable

These changes improve code quality while maintaining identical behavior.
The is_all_null property was added in a previous commit and provides
better readability than comparing lengths and null counts.

Fix TODO at column.py:2004 - Check cats.has_nulls() instead of self.has_nulls()
before calling dropna().

Previously:
- Checked if self has nulls, then called cats.dropna()
- This could trigger an unnecessary dropna() when self has nulls but cats doesn't
  (e.g., if all null values were duplicates)

Now:
- Check if cats (the unique sorted values) actually has nulls
- Only call dropna() if cats has nulls
- Since dropna() returns self.copy() when there are no nulls, this avoids an unnecessary copy

This is a small but correct optimization that reduces allocations when
converting columns to categorical dtype in cases where the source has
nulls but all unique values are non-null.
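The corrected guard can be sketched with lists (None modeling a null; the real code operates on a pylibcudf column of unique sorted values):

```python
# Sketch of the corrected guard: test the collection you are about to
# modify (cats) for nulls, rather than the source column, so the
# dropna-style filter only runs when it will actually remove something.
def build_categories(cats):
    """`cats` holds the unique candidate categories; None models a null."""
    if None in cats:                      # was: checking the source column
        cats = [c for c in cats if c is not None]   # the dropna() step
    return cats

print(build_categories([None, "a", "b"]))  # ['a', 'b']
print(build_categories(["a", "b"]))        # ['a', 'b'] (no filtering pass)
```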

Add two new helper functions to the pylibcudf optimization library:

1. count_true(column) - Efficiently count True values in boolean column
   - Uses pylibcudf sum reduction directly on boolean column
   - Avoids intermediate column allocation from (col == True).sum()

2. count_false(column) - Efficiently count False values
   - Computes as len(column) - count_true(column) - null_count
   - Avoids creating comparison column for (col == False).sum()

These helpers expand the pylibcudf optimization pattern established in
earlier commits, providing building blocks for future optimizations.

Pattern: Direct pylibcudf operations > High-level API chains
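The arithmetic relationship between the two helpers is worth spelling out; a list-based sketch (the real count_true() is a device-side sum reduction):

```python
# Sketch of the two counting helpers: True values come from a sum-style
# reduction over the existing boolean data (no `col == True` comparison
# column is built), and False values fall out arithmetically.
def count_true(values):
    # Nulls (None) are skipped; empty or all-null input yields 0.
    return sum(1 for v in values if v is True)

def count_false(values):
    null_count = sum(1 for v in values if v is None)
    return len(values) - count_true(values) - null_count

vals = [True, False, None, True, False]
print(count_true(vals), count_false(vals))  # 2 2
```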

Add 11 test cases covering:
- All True/False scenarios
- Mixed boolean values
- Null handling
- Empty columns
- Consistency checks (count_true + count_false + null_count == len)

Also fix an edge case in count_true() where empty or all-null columns
return None from the reduction; it now properly returns 0.

All tests pass successfully.

@vyasr vyasr self-assigned this Feb 6, 2026
@vyasr vyasr requested a review from a team as a code owner February 6, 2026 02:38
@vyasr vyasr requested a review from Matt711 February 6, 2026 02:38
@vyasr vyasr added the improvement (Improvement / enhancement to an existing function) label Feb 6, 2026
@vyasr vyasr added the non-breaking (Non-breaking change) label Feb 6, 2026
@github-actions github-actions bot added the Python (Affects Python cuDF API) label Feb 6, 2026
@GPUtester GPUtester moved this to In Progress in cuDF Python Feb 6, 2026