Bug Description
make_pandas_udf raises ValueError when the schema contains column names with spaces, dots, or other characters that are not valid Python identifiers.
Reproduction
import xorq.api as xo
import xorq.expr.datatypes as dt
t = xo.read_parquet("financial_table.parquet")
# Schema has columns like "No. of deaths", "october 31 2009", "payments due by period total"
def parse_col(df):
    return df["No. of deaths"].apply(float)
parse_udf = xo.make_pandas_udf(
    fn=parse_col,
    schema=t.select(["No. of deaths"]).schema(),
    return_type=dt.float64,
    name="parse_deaths"
)
t.mutate(val=parse_udf.on_expr(t)).execute()
Error:
ValueError: 'No. of deaths' is not a valid parameter name
Root Cause
make_pandas_udf internally uses schema column names as Python function parameter names (likely in the generated wrapper function). Column names like "No. of deaths", "october 31 2009", "$ in millions" are not valid Python identifiers and cause a ValueError.
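The error text matches the message that the standard library's inspect.Parameter raises for names that are not identifiers. The sketch below reproduces that message without xorq. It assumes, without having confirmed it in the xorq source, that make_pandas_udf builds a signature from the schema's column names; try_make_parameter is a hypothetical helper used only for illustration.

```python
import inspect


def try_make_parameter(name):
    """Return None if `name` is accepted as a parameter name, else the error text."""
    try:
        inspect.Parameter(name, inspect.Parameter.POSITIONAL_OR_KEYWORD)
    except ValueError as exc:
        return str(exc)
    return None


# Column names taken from the report, plus one valid identifier for contrast.
for column in ["deaths", "No. of deaths", "october 31 2009", "$ in millions"]:
    print(repr(column), "->", try_make_parameter(column))
```

If the assumption holds, any column name for which str.isidentifier() is False (or which is a Python keyword) would trigger the same failure.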
Impact
This makes make_pandas_udf unusable for its most natural use case: parsing/transforming columns in financial and document-scraped tables, which commonly have column names with spaces, dots, dollar signs, parentheses, etc.
Suggested Fix
Sanitize column names internally when constructing the UDF wrapper function (e.g., map them to positional arg0, arg1, ... or slugified names), while preserving the original column name mapping for the schema → column extraction in .on_expr().
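A minimal sketch of the positional-name approach, independent of xorq's actual internals: make_safe_wrapper and its build_frame parameter are hypothetical names, not part of the xorq API. The wrapper exposes parameters arg0, arg1, ... and rebuilds the frame under the original column names before calling the user function.

```python
import inspect


def make_safe_wrapper(fn, column_names, build_frame=dict):
    """Wrap `fn(frame)` behind identifier-safe positional parameters.

    Hypothetical helper: returns the wrapper and a safe-name -> original-name map.
    """
    safe_names = [f"arg{i}" for i in range(len(column_names))]

    def wrapper(*columns):
        if len(columns) != len(column_names):
            raise TypeError(f"expected {len(column_names)} columns, got {len(columns)}")
        # Rebuild the frame under the original (possibly non-identifier) names.
        return fn(build_frame(dict(zip(column_names, columns))))

    # Advertise only the sanitized names, so signature inspection never sees the originals.
    wrapper.__signature__ = inspect.Signature(
        [inspect.Parameter(name, inspect.Parameter.POSITIONAL_ONLY) for name in safe_names]
    )
    return wrapper, dict(zip(safe_names, column_names))


def parse_col(frame):
    # Stand-in for the UDF in the reproduction; uses plain lists instead of a Series.
    return [float(value) for value in frame["No. of deaths"]]


wrapper, name_map = make_safe_wrapper(parse_col, ["No. of deaths"])
print(name_map)
print(wrapper(["1", "2.5"]))
```

The example uses plain dicts and lists so it runs without dependencies; passing pandas.DataFrame as build_frame would hand the user function a DataFrame keyed by the original column names. Positional names avoid the collisions that slugifying can produce (for example "a b" and "a.b" both becoming "a_b").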
Environment
- xorq: latest main branch
- Python 3.12
- macOS / DataFusion backend