### Describe the bug
The `ListOfColumns` type annotation fails validation when PySpark `Column` objects are passed in a list. The validation function `_list_of_columns_validation` performs a truthiness check (`if col`) on each element, which triggers PySpark's `CANNOT_CONVERT_COLUMN_INTO_BOOL` error for `Column` objects.
This breaks the intended behavior: `ListOfColumns` is supposed to support both string column names and PySpark `Column` expressions.
### Steps to Reproduce

1. Create a Pydantic model with a `ListOfColumns` field
2. Try to instantiate the model with PySpark `Column` objects in a list
3. Observe the validation error
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as sf

from koheesio.models import BaseModel, Field, ListOfColumns

spark = SparkSession.builder.appName('test').getOrCreate()

class TestModel(BaseModel):
    columns: ListOfColumns = Field(description='Test columns')

# This works fine
model1 = TestModel(columns=['col1', 'col2'])  # ✅ Works with strings

# This fails
col_obj = sf.col('right.country_nm')
model2 = TestModel(columns=[col_obj])  # ❌ Fails with Column objects
```

### Expected behavior
ListOfColumns should accept both string column names and PySpark Column objects, as the type is designed to be a flexible column specification type for Spark operations.
### Environment
- OS: macOS 15.6.1 (arm64)
- Python: 3.12.9
- Koheesio version: 0.10.5
- PySpark version: 3.5.3
- Pydantic version: 2.11.0
### Additional context
The root cause is in the `_list_of_columns_validation` function in `koheesio/models.py`:

```python
def _list_of_columns_validation(columns_value: Union[str, list]) -> list:
    columns = [columns_value] if isinstance(columns_value, str) else [*columns_value]
    columns = [col for col in columns if col]  # ← This line causes the issue
    return list(dict.fromkeys(columns))
```

The `if col` check attempts to convert PySpark `Column` objects to boolean, which PySpark explicitly prohibits to prevent ambiguous boolean operations in DataFrame expressions.
Suggested Fix:
Replace the boolean check with an explicit type-aware check. Note that a bare `col != ""` comparison is not enough on its own: comparing a PySpark `Column` with `!=` returns another `Column`, whose truthiness the comprehension would then try to evaluate, raising the same error. Checking `isinstance` first ensures `bool()` is never called on a `Column`:

```python
from pyspark.sql import Column

def _list_of_columns_validation(columns_value: Union[str, list]) -> list:
    columns = [columns_value] if isinstance(columns_value, str) else [*columns_value]
    # Filter out None and empty strings, but preserve PySpark Column objects;
    # isinstance is checked first so bool() is never called on a Column
    columns = [
        col for col in columns
        if isinstance(col, Column) or (col is not None and col != "")
    ]
    return list(dict.fromkeys(columns))
```

This affects real-world usage in join operations and other transformations where `Column` expressions with aliases are commonly used in the `select` parameter.
### Applicable Stack Trace
```
pydantic_core.pydantic_core.ValidationError: 1 validation error for TestModel
columns
  Value error, [CANNOT_CONVERT_COLUMN_INTO_BOOL] Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions. [type=value_error, input_value=[Column<'right.country_nm'>], input_type=list]
    For further information visit https://errors.pydantic.dev/2.11/v/value_error
```