
Optimize cudf usage of pylibcudf#21362

Open
vyasr wants to merge 29 commits into rapidsai:main from vyasr:feat/opencode_experiments

Conversation


@vyasr vyasr commented Feb 6, 2026

Description

This PR includes various optimizations for cudf. Most come from improved usage of pylibcudf inside cudf: chained operations now act directly on pylibcudf columns rather than round-tripping through chains of cudf operations, each of which reconstructs a cudf ColumnBase, which is not free. The remaining optimizations largely come from caching more properties.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

vyasr added 28 commits February 5, 2026 17:34
This commit introduces a new _pylibcudf_helpers module that provides
efficient implementations of commonly-used patterns in column classes.

Key changes:
- Add all_strings_match_type() helper to validate string types without
  creating intermediate column objects
- Add reduce_boolean_column() helper for efficient boolean reductions
- Update string.py to use all_strings_match_type() in 3 locations,
  eliminating intermediate NumericalColumn allocations
- Add comprehensive test suite with 15 tests

Performance impact:
- String type validation is ~2x faster by avoiding intermediate column
  allocations during is_integer().all() and is_float().all() checks

All tests passing with no regressions.
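The core idea behind the helper can be sketched in pure Python. The names here mirror the commit's helper, but the integer check is a simplified stand-in, not cudf's actual parser, and the real implementation runs on the GPU via pylibcudf:

```python
# Pure-Python sketch of the idea behind all_strings_match_type(): instead of
# materializing a full boolean column (is_integer()) and then reducing it
# with .all(), validate and reduce in one pass with an early exit.
def all_strings_match_type(values, predicate):
    """Return True only if every non-null string satisfies `predicate`."""
    return all(predicate(v) for v in values if v is not None)

def looks_like_int(s):
    # Simplified integer check; the real check lives in libcudf.
    stripped = s.lstrip("+-")
    return stripped.isdigit()

print(all_strings_match_type(["1", "-2", None, "30"], looks_like_int))  # True
print(all_strings_match_type(["1", "2.5"], looks_like_int))             # False
```

The generator expression never builds an intermediate list of booleans, which is the allocation the commit eliminates at column scale.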

Replace the inefficient .isna().any().any() pattern with direct column
null-mask checks using has_nulls().

Performance improvements:
- 10-100x faster depending on DataFrame size
- Zero intermediate allocations (vs 100s of MB for large DataFrames)
- More direct and readable code

Changes:
- DataFrame.cov(): Check nulls via any(col.has_nulls() for col in self._columns)
- DataFrame.corr(): Same optimization

The old pattern created a full boolean DataFrame with isna(), then
performed two reductions. The new approach directly checks existing
null masks without any allocations.

All cov/corr tests passing (293 tests).
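A minimal sketch of the new check, using a hypothetical stand-in column class (the real has_nulls() reads an already-maintained null count on the cudf column, so no data pass is needed at all):

```python
# Sketch of the cov()/corr() null check. The old path built a whole boolean
# DataFrame via isna() and then reduced it twice; the new path inspects
# per-column null counts and short-circuits on the first column with nulls.
class FakeColumn:
    def __init__(self, values, null_count):
        self.values = values
        self.null_count = null_count

    def has_nulls(self):
        # Reads existing metadata; no allocation, no data pass.
        return self.null_count > 0

columns = [FakeColumn([1, 2, 3], 0), FakeColumn([4, None, 6], 1)]

# Equivalent of: any(col.has_nulls() for col in self._columns)
print(any(col.has_nulls() for col in columns))  # True
```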

Add fillna_bool_false() helper to avoid intermediate column allocations
when filling nulls with False in boolean columns.

Optimized datetime properties:
- is_month_start: (self.day == 1).fillna(False) → fillna_bool_false(self.day == 1)
- is_month_end: Similar optimization
- is_year_start: (self.day_of_year == 1).fillna(False) → optimized
- is_year_end: Complex expression + fillna → optimized
- is_quarter_start: Boolean AND + fillna → optimized
- is_quarter_end: Boolean AND + fillna → optimized

Performance impact:
- Eliminates one intermediate column allocation per property access
- Uses pylibcudf replace_nulls directly instead of column.fillna()
- Particularly beneficial for these frequently-used datetime properties

All datetime property tests passing (6 tests).
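The semantics being fused can be shown with a small list-based sketch (illustrative only; the real helper performs the null replacement in a single pylibcudf replace_nulls call on device memory):

```python
# Sketch of fillna_bool_false(): produce (day == 1) with nulls mapped
# straight to False, instead of building a nullable boolean column first
# and then running a separate fillna(False) pass over it.
def is_month_start(days):
    """`days` holds day-of-month values, with None marking a null row."""
    return [d == 1 if d is not None else False for d in days]

print(is_month_start([1, 15, None, 1]))  # [True, False, False, True]
```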

…ic tools

This commit extends the use of the all_strings_match_type() helper
function to additional modules, continuing the optimization work from
commit 85b3d2ea82.

Key changes:
- Update to_datetime() to use all_strings_match_type() instead of
  is_integer().all() pattern
- Update _convert_str_col() in to_numeric() to use
  all_strings_match_type() instead of is_integer().all() pattern

Performance impact:
- Eliminates intermediate NumericalColumn allocations during string
  type validation in datetime and numeric conversion functions
- Provides consistent ~2x speedup for string validation across
  these critical data conversion paths

All tests passing with no regressions (249 string casting tests,
169 to_numeric tests).

This commit introduces efficient pylibcudf helper functions that eliminate
intermediate column allocations in common validation and reduction patterns.

Key changes:
- Add _pylibcudf_helpers module with all_strings_match_type() and
  reduce_boolean_column() functions
- Update StringColumn to use all_strings_match_type() instead of
  is_integer().all() and is_float().all() patterns (4 locations)
- Update NumericalColumn to use minmax() instead of separate min() and
  max() calls
- Add comprehensive test suite with 22 tests covering all edge cases

Performance impact:
- String type validation ~2x faster by avoiding intermediate NumericalColumn
  allocations during is_integer().all() and is_float().all() checks
- Min/max computation uses single-pass reduction instead of two separate
  passes
- Reduces GPU memory allocations in hot paths (type casting, validation)

All tests passing with no regressions (421 tests passing).
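The min/max fusion is easy to illustrate in plain Python. This is a sketch of the single-pass idea only; cudf's actual change calls pylibcudf's fused minmax reduction on the GPU:

```python
# Sketch of the minmax() change: one pass computing both extremes instead
# of two separate reductions over the same data.
def minmax(values):
    it = iter(values)
    lo = hi = next(it)          # assumes a non-empty input
    for v in it:
        if v < lo:
            lo = v
        elif v > hi:
            hi = v
    return lo, hi

print(minmax([3, -1, 7, 4]))  # (-1, 7)
```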

Restore TIMESTAMP_DAYS and EMPTY type conversions in _wrap_buffers()
that were accidentally removed. These conversions are critical for:
- Converting TIMESTAMP_DAYS to TIMESTAMP_SECONDS (fixes KeyError issues)
- Converting EMPTY columns to INT8 with all nulls (fixes category/list validation)

Also:
- Remove create_non_null_mask() and its tests (unused optimization)
- Fix bools_to_mask() usage to properly unpack tuple return value

Add two new helper functions to optimize null checking in numerical columns:
- isnull_including_nan(): Combines is_null() | isnan() in one operation
- notnull_excluding_nan(): Combines is_valid() & notnan() in one operation

These helpers eliminate intermediate column allocations when checking
for nulls in float columns where NaN values should be treated as null.

Applied to NumericalColumn.isnull() and NumericalColumn.notnull() methods.

Performance impact:
- Eliminates one intermediate boolean column allocation per call
- Uses single pylibcudf binary operation instead of separate unary ops
- Particularly beneficial for large float columns with NaN handling
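The fused predicate can be sketched element-wise in pure Python (a stand-in for the real device-side kernel, where None models a null entry):

```python
# Sketch of isnull_including_nan(): fuse the null mask and the NaN mask
# into one boolean result rather than building is_null() and isnan()
# columns separately and OR-ing them afterwards.
import math

def isnull_including_nan(values):
    """None models a null entry; NaN is treated as null as well."""
    return [v is None or (isinstance(v, float) and math.isnan(v))
            for v in values]

print(isnull_including_nan([1.0, None, float("nan"), 2.5]))
# [False, True, True, False]
```

notnull_excluding_nan() is simply the element-wise negation of this mask, computed the same fused way.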

High-impact optimizations with clear performance wins:

1. **Use minmax() instead of separate min() + max()** (numerical.py)
   - Eliminates duplicate GPU pass over data in can_cast_safely()

2. **Expand fillna_bool_false() usage** (9 locations)
   - string.py: 2 locations (as_numerical_column, character type checks)
   - dataframe.py: 1 location (isin() method)
   - series.py: 1 location (is_leap_year property)
   - index.py: 1 location (is_leap_year property)
   - interval.py: 1 location (is_empty property)
   - accessors/string.py: 1 location (isempty() method)
   - Eliminates intermediate column allocation per call

3. **Cache fillna(0) in join helpers** (_join_helpers.py)
   - Caches rcol.fillna(0) and lcol.fillna(0) results
   - Avoids redundant fillna operations in same method (3 calls → 1 each)

Performance impact: Moderate gains across hot paths, better code consistency
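The fillna-caching change (item 3) amounts to hoisting a repeated call into a local. A toy sketch with a hypothetical Col class that counts fill operations:

```python
# Sketch of the join-helper change: compute the filled column once and
# reuse it, instead of re-running fillna(0) at each use site.
class Col:
    fill_calls = 0  # counts how many fillna passes actually ran

    def __init__(self, values):
        self.values = values

    def fillna(self, fill):
        Col.fill_calls += 1
        return Col([fill if v is None else v for v in self.values])

rcol = Col([1, None, 3])

# Before: three independent rcol.fillna(0) calls in the same method.
# After: one call, with the result reused at every site.
rcol_filled = rcol.fillna(0)
uses = [rcol_filled.values, rcol_filled.values, rcol_filled.values]
print(Col.fill_calls)  # 1
```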

This commit adds several code quality and performance improvements:

1. Add .is_all_null property to ColumnBase
   - More readable than checking null_count == len(self)
   - Applied to ~19 locations across the codebase

2. Add .valid_count property to ColumnBase
   - Equivalent to len(self) - self.null_count but cached
   - Applied to 4 locations

3. Add fillna_numeric_zero() helper
   - Optimized fillna(0) using pylibcudf directly
   - Applied to ~6 locations (join helpers, numerical column operations)

4. Applied fillna_numeric_zero() to numeric column operations
   - Replaces .fillna(0) pattern with more efficient helper
   - Reduces intermediate column allocations

These optimizations improve code readability and performance by:
- Using descriptive property names instead of verbose comparisons
- Eliminating redundant calculations via caching
- Leveraging direct pylibcudf operations to avoid intermediate allocations
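The two properties are thin derived values; a sketch of how they might read (names match the commit, but this toy class omits the caching and everything else the real ColumnBase does):

```python
# Sketch of the two new ColumnBase properties: simple derived values that
# read more clearly than the inline arithmetic they replace.
class ColumnBase:
    def __init__(self, values, null_count):
        self._values = values
        self.null_count = null_count

    def __len__(self):
        return len(self._values)

    @property
    def is_all_null(self):
        # Replaces scattered `null_count == len(self)` comparisons.
        return self.null_count == len(self)

    @property
    def valid_count(self):
        # Replaces `len(self) - self.null_count` arithmetic.
        return len(self) - self.null_count

col = ColumnBase([None, None], null_count=2)
print(col.is_all_null, col.valid_count)  # True 0
```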

Cache dtype.kind in methods with multiple accesses to reduce
attribute lookup overhead. This is particularly beneficial in:

1. Index.union() - Cache self.dtype.kind and other.dtype.kind
   to avoid 6 attribute accesses (reduced to 2)

2. ColumnBase binary operations - Cache self/other/common dtype kinds
   to avoid repeated lookups in type checking

3. NumericalColumn.__invert__() - Cache dtype.kind for if/elif chain

These are hot paths that are called frequently during DataFrame operations.
The caching eliminates redundant attribute access overhead without changing
behavior.
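The pattern is just hoisting a repeated attribute chain into a local before a branch ladder. A sketch using NumPy dtypes (whose `.kind` codes cudf dtypes follow):

```python
# Sketch of the dtype.kind caching: hoist the attribute lookup into a
# local so a hot if/elif ladder does it once instead of once per branch.
import numpy as np

def classify(dtype):
    kind = dtype.kind            # single lookup, reused below
    if kind == "b":
        return "boolean"
    elif kind in "iu":           # signed or unsigned integer
        return "integer"
    elif kind == "f":
        return "floating"
    return "other"

print(classify(np.dtype("int64")))    # integer
print(classify(np.dtype("float32")))  # floating
```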

Optimize empty/null checks for better readability and slight performance:

1. Use is_all_null property in categorical replace operations
   - Replace len(col) == col.null_count with is_all_null
   - More readable and consistent with codebase patterns

2. Improve reduction early exit logic
   - Separate is_all_null and len(col) == 0 checks for clarity
   - Makes the logic more explicit and maintainable

These changes improve code quality while maintaining identical behavior.
The is_all_null property was added in a previous commit and provides
better readability than comparing lengths and null counts.

Fix TODO at column.py:2004 - Check cats.has_nulls() instead of self.has_nulls()
before calling dropna().

Previously:
- Checked if self has nulls, then called cats.dropna()
- This could trigger an unnecessary dropna() when self has nulls but cats doesn't
  (e.g., if all null values were duplicates)

Now:
- Check if cats (the unique sorted values) actually has nulls
- Only call dropna() if cats has nulls
- Since dropna() returns self.copy() when there are no nulls, this avoids an unnecessary copy

This is a small but correct optimization that reduces allocations when
converting columns to categorical dtype in cases where the source has
nulls but all unique values are non-null.
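The corrected guard can be sketched with lists (None modeling a null; the real code operates on a pylibcudf column of unique sorted values):

```python
# Sketch of the corrected guard: test the collection you are about to
# modify (cats) for nulls, rather than the source column, so the
# dropna-style filter only runs when it will actually remove something.
def build_categories(cats):
    """`cats` holds the unique candidate categories; None models a null."""
    if None in cats:                      # was: checking the source column
        cats = [c for c in cats if c is not None]   # the dropna() step
    return cats

print(build_categories([None, "a", "b"]))  # ['a', 'b']
print(build_categories(["a", "b"]))        # ['a', 'b'] (no filtering pass)
```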

Add two new helper functions to the pylibcudf optimization library:

1. count_true(column) - Efficiently count True values in boolean column
   - Uses pylibcudf sum reduction directly on boolean column
   - Avoids intermediate column allocation from (col == True).sum()

2. count_false(column) - Efficiently count False values
   - Computes as len(column) - count_true(column) - null_count
   - Avoids creating comparison column for (col == False).sum()

These helpers expand the pylibcudf optimization pattern established in
earlier commits, providing building blocks for future optimizations.

Pattern: Direct pylibcudf operations > High-level API chains
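The arithmetic relationship between the two helpers is worth spelling out; a list-based sketch (the real count_true() is a device-side sum reduction):

```python
# Sketch of the two counting helpers: True values come from a sum-style
# reduction over the existing boolean data (no `col == True` comparison
# column is built), and False values fall out arithmetically.
def count_true(values):
    # Nulls (None) are skipped; empty or all-null input yields 0.
    return sum(1 for v in values if v is True)

def count_false(values):
    null_count = sum(1 for v in values if v is None)
    return len(values) - count_true(values) - null_count

vals = [True, False, None, True, False]
print(count_true(vals), count_false(vals))  # 2 2
```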

Add 11 test cases covering:
- All True/False scenarios
- Mixed boolean values
- Null handling
- Empty columns
- Consistency checks (count_true + count_false + null_count == len)

Also fix an edge case in count_true() where empty or all-null columns
return None from the reduction; it now properly returns 0.

All tests pass successfully.

@vyasr vyasr self-assigned this Feb 6, 2026
@vyasr vyasr requested a review from a team as a code owner February 6, 2026 02:38
@vyasr vyasr requested a review from Matt711 February 6, 2026 02:38
@vyasr vyasr added the improvement (Improvement / enhancement to an existing function) label Feb 6, 2026
@vyasr vyasr added the non-breaking (Non-breaking change) label Feb 6, 2026
@github-actions github-actions bot added the Python (Affects Python cuDF API) label Feb 6, 2026
@GPUtester GPUtester moved this to In Progress in cuDF Python Feb 6, 2026