make_pandas_udf fails when schema contains column names with spaces or special characters #1605

@paddymul

Bug Description

make_pandas_udf raises ValueError when the schema contains column names with spaces, dots, or other characters that are not valid Python identifiers.

Reproduction

import xorq.api as xo
import xorq.expr.datatypes as dt

t = xo.read_parquet("financial_table.parquet")
# Schema has columns like "No. of deaths", "october 31 2009", "payments due by period total"

def parse_col(df):
    return df["No. of deaths"].apply(float)

parse_udf = xo.make_pandas_udf(
    fn=parse_col,
    schema=t.select(["No. of deaths"]).schema(),
    return_type=dt.float64,
    name="parse_deaths"
)

t.mutate(val=parse_udf.on_expr(t)).execute()

Error:

ValueError: 'No. of deaths' is not a valid parameter name

Root Cause

make_pandas_udf appears to use the schema column names as Python function parameter names (likely in the generated wrapper function). Column names like "No. of deaths", "october 31 2009", "$ in millions" are not valid Python identifiers, so building those parameters raises a ValueError.
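The error text matches what the standard library's inspect.Parameter raises for a name that is not an identifier. The sketch below reproduces that message without xorq. It assumes, without having confirmed it in the xorq source, that the wrapper builds its signature this way; try_parameter is a hypothetical helper used only for illustration.

```python
import inspect


def try_parameter(name):
    """Return None if `name` is accepted as a parameter name, else the error text."""
    try:
        inspect.Parameter(name, inspect.Parameter.POSITIONAL_OR_KEYWORD)
    except ValueError as exc:
        return str(exc)
    return None


# One valid identifier plus the problematic names from this report.
columns = ["deaths", "No. of deaths", "october 31 2009", "$ in millions"]
results = {col: try_parameter(col) for col in columns}

for col, err in results.items():
    print(f"{col!r}: {'ok' if err is None else err}")
```

Only "deaths" passes; the other three fail the identifier check, which is consistent with the ValueError shown above.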

Impact

This makes make_pandas_udf unusable for its most natural use case: parsing/transforming columns in financial and document-scraped tables, which commonly have column names with spaces, dots, dollar signs, parentheses, etc.

Suggested Fix

Sanitize column names internally when constructing the UDF wrapper function (e.g., map them to positional arg0, arg1, ... or slugified names), while preserving the original column name mapping for the schema → column extraction in .on_expr().
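A minimal sketch of the positional-name idea, using only the standard library. It is not xorq's implementation: build_positional_signature and wrap_with_original_names are hypothetical names, and a plain dict of lists stands in for the pandas DataFrame that the user function would receive.

```python
import inspect


def build_positional_signature(column_names):
    """Map each column to a safe parameter name (arg0, arg1, ...)."""
    safe_names = [f"arg{i}" for i in range(len(column_names))]
    params = [
        inspect.Parameter(name, inspect.Parameter.POSITIONAL_OR_KEYWORD)
        for name in safe_names
    ]
    # Keep the safe -> original mapping so the caller's names survive.
    return inspect.Signature(params), dict(zip(safe_names, column_names))


def wrap_with_original_names(fn, column_names):
    """Expose fn under a sanitized signature while passing it original column names."""
    signature, safe_to_original = build_positional_signature(column_names)

    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        # Re-key the bound arguments by the original column names
        # before handing them to the user function.
        frame = {safe_to_original[k]: v for k, v in bound.arguments.items()}
        return fn(frame)

    wrapper.__signature__ = signature
    return wrapper


def parse_col(frame):
    # Stand-in for the DataFrame-based function from the reproduction.
    return [float(v) for v in frame["No. of deaths"]]


columns = ["No. of deaths", "$ in millions"]
wrapped = wrap_with_original_names(parse_col, columns)
result = wrapped(["1", "2.5"], ["10", "20"])
print(list(inspect.signature(wrapped).parameters), result)
```

The wrapper's visible signature contains only valid identifiers, while the user function still looks columns up by their original names. Slugified names would work the same way but need collision handling, which positional names avoid.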

Environment

  • xorq: latest main branch
  • Python 3.12
  • macOS / DataFusion backend
