@aaron-ang
Contributor

Changes Made

Compare the hashes of the item and each child element first, and do the full value match in Arrow format only when the hashes are equal.

Related Issues

Closes #2749

@github-actions github-actions bot added the feat label Jan 28, 2026
@aaron-ang aaron-ang force-pushed the list-contains branch 2 times, most recently from 7f98673 to 5b11491 Compare January 28, 2026 23:05
@greptile-apps
Contributor

greptile-apps bot commented Jan 28, 2026

Greptile Overview

Greptile Summary

Implemented a list_contains function that checks whether each list contains a specified item, addressing issue #2749.

Key Changes:

  • Added new list_contains expression method and function to Python API
  • Implemented efficient Rust kernel using hash-based comparison followed by full value comparison
  • Supports both List and FixedSizeList types with proper broadcasting
  • Comprehensive null handling (null lists, null items, null elements within lists)
  • Complete test coverage with 15 test cases covering edge cases and various data types

Implementation Highlights:

  • Performance optimization: compares hashes first before doing full value equality check
  • Proper type validation ensuring item type matches list element type
  • Broadcasting support for scalar items across multiple lists
  • Follows existing patterns from list_append and other list functions
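
The hash-first comparison described in the highlights can be sketched with the standard library alone. This is a minimal illustration, not the kernel from this PR: it assumes plain slices of `Hash + Eq` values instead of Arrow arrays, and the names `hash_of` and `contains_hash_first` are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash a single value with the standard library's default hasher.
fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Returns true if `list` contains `item`.
/// Hashes are computed up front; the full equality check runs only
/// for elements whose hash matches the item's hash.
fn contains_hash_first<T: Hash + Eq>(list: &[T], item: &T) -> bool {
    let item_hash = hash_of(item);
    let element_hashes: Vec<u64> = list.iter().map(hash_of).collect();
    list.iter()
        .zip(&element_hashes)
        .any(|(element, &hash)| hash == item_hash && element == item)
}
```

The equality check after a hash match is what keeps the result correct when two different values collide on the same hash. The hash comparison only serves as a cheap filter, which pays off most when full comparison is expensive (long strings, nested values).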

Confidence Score: 5/5

  • Safe to merge - well-implemented feature with comprehensive tests and no breaking changes
  • The implementation follows established patterns in the codebase, includes thorough test coverage, and properly handles edge cases including nulls and various data types. The hash-based optimization is sound and the code structure is clean.
  • No files require special attention

Important Files Changed

| Filename | Overview |
|---|---|
| src/daft-functions-list/src/contains.rs | Added new ListContains UDF with proper type checking and broadcasting support |
| src/daft-functions-list/src/kernels.rs | Implemented list_contains kernel with hash-based optimization for efficient lookup |
| daft/functions/list.py | Added list_contains function with comprehensive documentation and examples |
| tests/recordbatch/list/test_list_contains.py | Comprehensive test coverage including edge cases, null handling, and various data types |

Sequence Diagram

sequenceDiagram
    participant User
    participant Expression
    participant ListContains UDF
    participant SeriesListExtension
    participant ListArray Kernel
    participant Hash & Compare

    User->>Expression: col("lists").list_contains(item)
    Expression->>ListContains UDF: call(list_series, item)
    
    alt item.len() == 1
        ListContains UDF->>ListContains UDF: broadcast item to match list length
    end
    
    ListContains UDF->>SeriesListExtension: list_series.list_contains(item)
    
    alt DataType is List
        SeriesListExtension->>ListArray Kernel: list().list_contains(item)
    else DataType is FixedSizeList
        SeriesListExtension->>ListArray Kernel: fixed_size_list().to_list().list_contains(item)
    end
    
    alt flat_child is Null type
        ListArray Kernel->>ListArray Kernel: return all false (with proper null handling)
    else normal case
        ListArray Kernel->>Hash & Compare: cast item, compute hashes
        Hash & Compare-->>ListArray Kernel: item_hashes, child_hashes
        
        loop for each list
            alt list is null or item is null
                ListArray Kernel->>ListArray Kernel: push false with null flag
            else
                loop for each element in list
                    ListArray Kernel->>Hash & Compare: compare hash first
                    alt hashes match
                        Hash & Compare->>Hash & Compare: full value comparison
                        alt values equal
                            ListArray Kernel->>ListArray Kernel: found = true, break
                        end
                    end
                end
                ListArray Kernel->>ListArray Kernel: push result
            end
        end
    end
    
    ListArray Kernel-->>SeriesListExtension: BooleanArray result
    SeriesListExtension-->>ListContains UDF: Series result
    ListContains UDF-->>Expression: Series result
    Expression-->>User: Boolean column
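
The broadcasting and null-handling branches in the diagram can be sketched roughly as follows. This is a hedged, std-only approximation, not the code in this PR: `Row` and this `list_contains` signature are hypothetical, `Option` stands in for Arrow validity, and treating a null element as a non-match is an assumption that the diagram does not spell out.

```rust
/// One row of a nullable list column with nullable i64 elements (hypothetical stand-in).
type Row = Option<Vec<Option<i64>>>;

/// For each row, report whether the list contains the item.
/// A single item is broadcast across all rows; a null list or null item yields a null result.
fn list_contains(lists: &[Row], items: &[Option<i64>]) -> Result<Vec<Option<bool>>, String> {
    if items.len() != 1 && items.len() != lists.len() {
        return Err(format!(
            "item length {} does not match list length {}",
            items.len(),
            lists.len()
        ));
    }
    // Broadcast: when only one item is given, reuse it for every row.
    let item_at = |i: usize| -> Option<i64> {
        if items.len() == 1 { items[0] } else { items[i] }
    };

    let mut out = Vec::with_capacity(lists.len());
    for (i, row) in lists.iter().enumerate() {
        match (row, item_at(i)) {
            // Both present: search the elements. Null elements never match (assumption).
            (Some(elements), Some(item)) => {
                out.push(Some(elements.iter().any(|e| *e == Some(item))));
            }
            // Null list or null item: the result for this row is null.
            _ => out.push(None),
        }
    }
    Ok(out)
}
```

For example, with the rows `[1, null, 3]`, `null`, and `[]` and the scalar item `3`, this sketch returns `true`, null, and `false`.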

Contributor

@greptile-apps greptile-apps bot left a comment

4 files reviewed, no comments

@codecov

codecov bot commented Jan 28, 2026

Codecov Report

❌ Patch coverage is 6.36364% with 103 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.38%. Comparing base (aa8add2) to head (21a51f2).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/daft-functions-list/src/kernels.rs | 0.00% | 54 Missing ⚠️ |
| src/daft-functions-list/src/contains.rs | 0.00% | 40 Missing ⚠️ |
| src/daft-functions-list/src/series.rs | 0.00% | 8 Missing ⚠️ |
| src/daft-functions-list/src/lib.rs | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files

@@             Coverage Diff             @@
##             main    #6095       +/-   ##
===========================================
- Coverage   72.91%   43.38%   -29.54%     
===========================================
  Files         973      910       -63     
  Lines      126196   112849    -13347     
===========================================
- Hits        92016    48954    -43062     
- Misses      34180    63895    +29715     

| Files with missing lines | Coverage Δ |
|---|---|
| daft/expressions/expressions.py | 95.14% <100.00%> (+0.01%) ⬆️ |
| daft/functions/__init__.py | 100.00% <ø> (ø) |
| daft/functions/list.py | 100.00% <100.00%> (ø) |
| daft/series.py | 92.00% <100.00%> (+0.02%) ⬆️ |
| src/daft-functions-list/src/lib.rs | 0.00% <0.00%> (-100.00%) ⬇️ |
| src/daft-functions-list/src/series.rs | 1.22% <0.00%> (-68.40%) ⬇️ |
| src/daft-functions-list/src/contains.rs | 0.00% <0.00%> (ø) |
| src/daft-functions-list/src/kernels.rs | 5.39% <0.00%> (-88.78%) ⬇️ |

... and 650 files with indirect coverage changes

@aaron-ang
Contributor Author

I'm not sure why my Python test cases do not cover the Rust code path. Do I need to add native Rust tests?


fn list_contains(&self, item: &Series) -> DaftResult<BooleanArray> {
    let list_nulls = self.nulls();
    let mut result = Vec::with_capacity(self.len());
Member

Use arrow's arrow::array::BooleanBuilder instead of a Vec. Collecting into a Vec results in a double iteration: once to fill the Vec, and again to build the Arrow array from it.

This should instead be something like:

let mut builder = arrow::array::BooleanBuilder::new();

for (..) in .. {
  if .. {
    builder.append_null();
  } else {
    builder.append_value(value);
  }
}

let arr = builder.finish();

let arr = BooleanArray::from_arrow(
  field,
  Arc::new(arr),
);

Ok(arr)

While this is more focused on the arrow2 migration, many of these patterns are also relevant when adding net-new functionality.
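
To illustrate why a builder avoids the second pass, here is a std-only sketch of the idea. It is not arrow's `BooleanBuilder` and does not reproduce its internals: `NullableBoolBuilder` is a hypothetical stand-in that only mirrors the `append_value` / `append_null` / `finish` shape from the suggestion above, writing bit-packed values and validity directly as each result is produced.

```rust
/// Hypothetical stand-in for a nullable boolean builder (not arrow's type).
#[derive(Default)]
struct NullableBoolBuilder {
    values: Vec<u8>,   // bit-packed values, least-significant bit first
    validity: Vec<u8>, // bit-packed validity, 1 = non-null
    len: usize,
}

impl NullableBoolBuilder {
    /// Write one slot into both bitmaps, growing them a byte at a time.
    fn push_bits(&mut self, value: bool, valid: bool) {
        let (byte, bit) = (self.len / 8, self.len % 8);
        if bit == 0 {
            self.values.push(0);
            self.validity.push(0);
        }
        if value {
            self.values[byte] |= 1u8 << bit;
        }
        if valid {
            self.validity[byte] |= 1u8 << bit;
        }
        self.len += 1;
    }

    fn append_value(&mut self, value: bool) {
        self.push_bits(value, true);
    }

    fn append_null(&mut self) {
        self.push_bits(false, false);
    }

    /// Hand back the finished buffers: (values, validity, length).
    fn finish(self) -> (Vec<u8>, Vec<u8>, usize) {
        (self.values, self.validity, self.len)
    }
}
```

With this shape, the kernel loop appends each result once and `finish` hands over the buffers as they are, so no intermediate `Vec<bool>` has to be walked a second time to produce the packed representation.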

Development

Successfully merging this pull request may close these issues.

Add a .list.contains() expression

2 participants