sink_parquet arrow_schema doesn't work with schema generated from pyiceberg #26427

@ldacey

Description
Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

  1. We are unable to use the Arrow schema generated by pyiceberg:
import tempfile
from pathlib import Path

import polars as pl
from pyiceberg.schema import Schema
from pyiceberg.types import LongType, NestedField, StringType
from pyiceberg.io.pyarrow import schema_to_pyarrow
from pyiceberg.catalog import load_catalog

_iceberg_schema = Schema(
    NestedField(1, "id", LongType()),
    NestedField(2, "name", StringType()),
)

_arrow_schema = schema_to_pyarrow(_iceberg_schema)
print(f"Iceberg arrow schema: {_arrow_schema}")

_df = pl.LazyFrame(
    {
        "id": [1, 2],
        "name": ["test1", "test2"],
    }
)

with tempfile.TemporaryDirectory() as tmpdir:
    _df.sink_parquet(
        f"{tmpdir}/test.parquet", arrow_schema=_arrow_schema
    )
  2. If we write files with string-view types, we cannot delete or filter rows; the string_view output is rejected:
with tempfile.TemporaryDirectory() as _tmpdir:
    _warehouse = Path(_tmpdir)

    _catalog = load_catalog(
        "test",
        type="sql",
        uri=f"sqlite:///{_warehouse}/catalog.db",
        warehouse=str(_warehouse),
    )
    _catalog.create_namespace("ns")

    _table = _catalog.create_table(
        "ns.test",
        schema=Schema(
            NestedField(1, "id", LongType()),
            NestedField(2, "name", StringType()),
        ),
    )

    _data_dir = _warehouse / "data"
    _data_dir.mkdir()
    _file_path = _data_dir / "test.parquet"

    _df = pl.DataFrame(
        {
            "id": [1, 2, 3],
            "name": ["test1", "test2", "test3"],
        }
    )
    _df.write_parquet(_file_path)

    _table.add_files([str(_file_path)])

    _table.delete(delete_filter="id = 1")

Log output

1. SchemaError: to_arrow(): provided dtype (LargeUtf8) does not match output dtype (Utf8View)
   Resolved plan until failure: ---> FAILED HERE RESOLVING THIS_NODE <--- DF ["id", "name"]; PROJECT */2 COLUMNS


2. ArrowNotImplementedError: Function 'array_filter' has no kernel matching input types (string_view, bool)


Issue description

sink_parquet with arrow_schema should cast strings to match the provided schema type rather than rejecting the mismatch. I would also love for this to handle fixed-size binary columns (I have around 130 Iceberg tables with fixed binary columns).
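The requested behavior can be sketched as a cast-compatibility rule (illustrative pseudologic only, not Polars internals): dtype pairs that are losslessly convertible, such as Utf8View and LargeUtf8, would be cast to the schema's dtype instead of raising the SchemaError shown in the log. The dtype names and the `CASTABLE` set below are assumptions for illustration.

```python
# Illustrative sketch (not Polars internals): pairs of
# (output dtype, provided schema dtype) treated as cast-compatible.
CASTABLE = {
    ("Utf8View", "LargeUtf8"),
    ("Utf8View", "Utf8"),
    ("BinaryView", "LargeBinary"),
    ("Binary", "FixedSizeBinary"),
}


def resolve(output_dtype: str, provided_dtype: str) -> str:
    """Return the dtype to write, casting when the pair is compatible."""
    if output_dtype == provided_dtype:
        return output_dtype
    if (output_dtype, provided_dtype) in CASTABLE:
        # Cast to the schema's dtype instead of raising.
        return provided_dtype
    raise TypeError(
        f"provided dtype ({provided_dtype}) does not match "
        f"output dtype ({output_dtype})"
    )
```

Under this rule, the failing case from the log, an Utf8View output against a LargeUtf8 schema field, would resolve to a LargeUtf8 cast rather than an error.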

@nameexhaustion

Expected behavior

The arrow_schema should be used to cast compatible types instead of raising an error.

Installed versions

Details
--------Version info---------
Polars:              1.38.0
Index type:          UInt32
Platform:            Linux-6.18.3-arch1-1-x86_64-with-glibc2.42
Python:              3.12.12 (main, Nov 19 2025, 22:46:53) [Clang 21.1.4 ]
Runtime:             rt32

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
altair               6.0.0
azure.identity       1.25.1
boto3                <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            1.4.0
fastexcel            0.19.0
fsspec               2026.1.0
gevent               25.9.1
google.auth          2.48.0
great_tables         <not installed>
matplotlib           3.10.8
numpy                2.4.2
openpyxl             3.1.5
pandas               3.0.0
polars_cloud         <not installed>
pyarrow              23.0.0
pydantic             2.12.5
pyiceberg            0.10.0
sqlalchemy           2.0.46
torch                <not installed>
xlsx2csv             0.8.6
xlsxwriter           <not installed>

Metadata
Labels

  • A-io-iceberg: Related to Apache Iceberg tables
  • A-io-parquet: Area: reading/writing Parquet files
  • enhancement: New feature or an improvement of an existing feature
  • python: Related to Python Polars
