Description
Apache Iceberg version
0.10.0 (latest release)
Please describe the bug 🐞
Starting from version 20, PyArrow has support for Azure filesystems.
ABFS URIs have this format: abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<file_name>
But the PyArrow library expects the following path format for Azure: abfs[s]://<file_system>/<file_name>.
As you can see, the "@<account_name>.<dfs|blob>.core.windows.net" part prevents users from using the PyArrow file IO in an Azure environment. This issue can be fixed in PyIceberg by removing the account_name part.
The proposed fix is just meant to start a conversation around the issue; I am not 100% sure how and where this should be fixed.
Similar issues do not occur with the fsspec file IO.
Examples
We have a very basic setup with RestCatalog:
```python
from pyiceberg.catalog.rest import RestCatalog
from pyiceberg.io import ADLS_ACCOUNT_NAME, PY_IO_IMPL


def create_iceberg_catalog():
    CATALOG_URI = "https://lakehouse.../catalog"
    catalog_config = {
        "uri": CATALOG_URI,
        PY_IO_IMPL: "pyiceberg.io.pyarrow.PyArrowFileIO",
        ADLS_ACCOUNT_NAME: "lakehouseaccount",
    }
    return RestCatalog("lakehouse", **catalog_config)
```
When we create a table "testns.testtable", it is assigned the following location: abfss://[email protected]/testns/testtable
Then, when we try to append data to the table:
```python
import random

import pyarrow as pa

data = pa.table(
    {
        "id": pa.array(range(5), type=pa.int32()),  # 'id' is int32 to match the Iceberg schema
        "value": [random.choice(["Heads", "Tails"]) for _ in range(5)],
    }
)
table.append(data)
```
it throws the following exception:
OSError: ListBlobsByHierarchy failed for prefix='aip_test/test_table-xxx/metadata/snap-xxx.avro'. GetFileInfo is unable to determine whether the path exists. Azure Error: [InvalidResourceName] 400 The specified resource name contains invalid characters.
This is because exists() method is called:
```
File ~/.official-venvs/amd64.ipykernel-default.master/lib/python3.12/site-packages/pyiceberg/io/pyarrow.py:368, in PyArrowFile.create(self, overwrite)
    366     if not overwrite and self.exists() is True:
```
And it expects the URI without "@lakehouseaccount.dfs.core.windows.net". When we monkey-patch PyArrowFile.__init__, everything works fine:
```python
from pyarrow.fs import FileSystem

from pyiceberg.io.pyarrow import ONE_MEGABYTE, PyArrowFile

PyArrowFile.old_init = PyArrowFile.__init__


def patched_init(self, location: str, path: str, fs: FileSystem, buffer_size: int = ONE_MEGABYTE):
    # Call the original __init__ method
    self.old_init(location, path, fs, buffer_size)
    # Strip the "@<account_name>.<suffix>" segment from the path
    self._path = remove_section_between_at_and_slash(path)
    print("Logging: PyArrowFile initialized")


PyArrowFile.__init__ = patched_init
```
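For reference, `remove_section_between_at_and_slash` is our own helper, not a PyIceberg or PyArrow function. A minimal sketch of it, assuming all it needs to do is drop the "@<account_name>.<dfs|blob>.core.windows.net" segment between the first "@" and the next "/", could look like this:

```python
import re


def remove_section_between_at_and_slash(path: str) -> str:
    """Drop the '@<account_name>.<dfs|blob>.core.windows.net' segment,
    e.g. 'container@account.dfs.core.windows.net/a/b' -> 'container/a/b'."""
    # Remove the first '@' and everything after it up to the next '/'
    return re.sub(r"@[^/]+", "", path, count=1)


print(remove_section_between_at_and_slash(
    "aip_test@lakehouseaccount.dfs.core.windows.net/testns/testtable"
))
# aip_test/testns/testtable
```

A path without an "@" passes through unchanged, so the patch is harmless for non-Azure filesystems.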
It does not matter how, or with which engine, the table was created and written before: none of the PyArrow file IO methods work, including those on the read path, so scanning a non-empty table is impossible as well. We tested this by creating a table with the fsspec file IO and reading it with the PyArrow file IO.
It is hard to test this behavior with Azurite, because Azurite URIs are different and do not contain the "@<account_name>" part.
Willingness to contribute
- I can contribute a fix for this bug independently
- I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- I cannot contribute a fix for this bug at this time