Labels
A-io-iceberg (Related to Apache Iceberg tables), A-io-parquet (Area: reading/writing Parquet files), enhancement (New feature or an improvement of an existing feature), python (Related to Python Polars)
Description
Checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of Polars.
Reproducible example
- We are unable to use the arrow schema from pyiceberg
import tempfile
from pathlib import Path

import polars as pl
from pyiceberg.catalog import load_catalog
from pyiceberg.io.pyarrow import schema_to_pyarrow
from pyiceberg.schema import Schema
from pyiceberg.types import LongType, NestedField, StringType
_iceberg_schema = Schema(
NestedField(1, "id", LongType()),
NestedField(2, "name", StringType()),
)
_arrow_schema = schema_to_pyarrow(_iceberg_schema)
print(f"Iceberg arrow schema: {_arrow_schema}")
_df = pl.LazyFrame(
{
"id": [1, 2],
"name": ["test1", "test2"],
}
)
with tempfile.TemporaryDirectory() as tmpdir:
_df.sink_parquet(
f"{tmpdir}/test.parquet", arrow_schema=_arrow_schema
)

- If we write files with the string view types, it seems we cannot delete or filter rows: the output string_view is rejected.
with tempfile.TemporaryDirectory() as _tmpdir:
_warehouse = Path(_tmpdir)
_catalog = load_catalog(
"test",
type="sql",
uri=f"sqlite:///{_warehouse}/catalog.db",
warehouse=str(_warehouse),
)
_catalog.create_namespace("ns")
_table = _catalog.create_table(
"ns.test",
schema=Schema(
NestedField(1, "id", LongType()),
NestedField(2, "name", StringType()),
),
)
_data_dir = _warehouse / "data"
_data_dir.mkdir()
_file_path = _data_dir / "test.parquet"
_df = pl.DataFrame(
{
"id": [1, 2, 3],
"name": ["test1", "test2", "test3"],
}
)
_df.write_parquet(_file_path)
_table.add_files([str(_file_path)])
_table.delete(delete_filter="id = 1")

Log output
1. SchemaError:
to_arrow(): provided dtype (LargeUtf8) does not match output dtype (Utf8View) Resolved plan until failure: ---> FAILED HERE RESOLVING THIS_NODE <--- DF ["id", "name"]; PROJECT */2 COLUMNS
2. ArrowNotImplementedError:
Function 'array_filter' has no kernel matching input types (string_view, bool)
See the console area for a traceback.

Issue description
sink_parquet with arrow_schema should cast string columns to the provided schema's type rather than reject them. It would also be great if this handled fixed-size binary columns (I have around 130 Iceberg tables with fixed-size binary columns).
Expected behavior
The provided arrow_schema should be used to cast compatible types instead of raising an error.
Installed versions
Details
--------Version info---------
Polars: 1.38.0
Index type: UInt32
Platform: Linux-6.18.3-arch1-1-x86_64-with-glibc2.42
Python: 3.12.12 (main, Nov 19 2025, 22:46:53) [Clang 21.1.4 ]
Runtime: rt32
----Optional dependencies----
Azure CLI <not installed>
adbc_driver_manager <not installed>
altair 6.0.0
azure.identity 1.25.1
boto3 <not installed>
cloudpickle <not installed>
connectorx <not installed>
deltalake 1.4.0
fastexcel 0.19.0
fsspec 2026.1.0
gevent 25.9.1
google.auth 2.48.0
great_tables <not installed>
matplotlib 3.10.8
numpy 2.4.2
openpyxl 3.1.5
pandas 3.0.0
polars_cloud <not installed>
pyarrow 23.0.0
pydantic 2.12.5
pyiceberg 0.10.0
sqlalchemy 2.0.46
torch <not installed>
xlsx2csv 0.8.6
xlsxwriter <not installed>