PyArrowFile class is not compatible with ABFS uri syntax #2698

@NikitaMatskevich

Description

Apache Iceberg version

0.10.0 (latest release)

Please describe the bug 🐞

Starting from version 20, PyArrow supports Azure filesystems.

ABFS URIs have this format: abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<file_name>

But the PyArrow library expects the following path format for Azure: abfs[s]://<file_system>/<file_name>.

As you can see, the "@<account_name>.<dfs|blob>.core.windows.net" part prevents users from using the PyArrow file IO in an Azure environment. This issue can be fixed in PyIceberg by removing the account_name part.
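The rewrite described above can be sketched as follows. This is a minimal illustration, assuming the "@<account_name>.<dfs|blob>.core.windows.net" suffix in the authority can simply be dropped; the helper name is hypothetical:

```python
import re

def strip_account_host(uri: str) -> str:
    # Drop "@<account>.dfs.core.windows.net" (or the blob variant) from the
    # authority, keeping the scheme, container, and path intact.
    return re.sub(r"@[^/]*\.(?:dfs|blob)\.core\.windows\.net", "", uri, count=1)

print(strip_account_host(
    "abfss://container@lakehouseaccount.dfs.core.windows.net/testns/testtable"
))
# abfss://container/testns/testtable
```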

The proposed fix is just to start a conversation around the issue. I am not 100% sure how and where this should be fixed.

We know that similar issues do not occur with the fsspec file IO.

Examples

We have a very basic setup with RestCatalog:

from pyiceberg.catalog.rest import RestCatalog
from pyiceberg.io import ADLS_ACCOUNT_NAME, PY_IO_IMPL

def create_iceberg_catalog():
    CATALOG_URI = "https://lakehouse.../catalog"

    catalog_config = {
        "uri": CATALOG_URI,
        PY_IO_IMPL: "pyiceberg.io.pyarrow.PyArrowFileIO",
        ADLS_ACCOUNT_NAME: "lakehouseaccount",
    }

    return RestCatalog("lakehouse", **catalog_config)

When we create a table "testns.testtable", it is assigned the following location: abfss://[email protected]/testns/testtable

Then, when we try to append data to the table:

import random

import pyarrow as pa

data = pa.table(
    {
        "id": pa.array(range(5), type=pa.int32()),  # Ensure 'id' is int32 to match the Iceberg schema
        "value": [random.choice(["Heads", "Tails"]) for _ in range(5)],
    }
)
table.append(data)

it throws the following exception:

OSError: ListBlobsByHierarchy failed for prefix='aip_test/test_table-xxx/metadata/snap-xxx.avro'. GetFileInfo is unable to determine whether the path exists. Azure Error: [InvalidResourceName] 400 The specified resource name contains invalid characters.

This is because the exists() method is called:

File ~/.official-venvs/amd64.ipykernel-default.master/lib/python3.12/site-packages/pyiceberg/io/pyarrow.py:368, in PyArrowFile.create(self, overwrite)
    366     if not overwrite and self.exists() is True:

And it expects the URI without "@lakehouseaccount.dfs.core.windows.net". When we monkey-patch PyArrowFile.__init__, everything works fine:

import re

from pyarrow.fs import FileSystem
from pyiceberg.io.pyarrow import ONE_MEGABYTE, PyArrowFile

def remove_section_between_at_and_slash(path: str) -> str:
    # Assumed implementation: drop "@<account_name>.<dfs|blob>.core.windows.net"
    return re.sub(r"@[^/]*", "", path, count=1)

PyArrowFile.old_init = PyArrowFile.__init__
def patched_init(self, location: str, path: str, fs: FileSystem, buffer_size: int = ONE_MEGABYTE):
    # Call the original __init__ method, then strip the account part from the path
    self.old_init(location, path, fs, buffer_size)
    self._path = remove_section_between_at_and_slash(path)
    print("Logging: PyArrowFile initialized")
PyArrowFile.__init__ = patched_init

It does not matter how, or with which engine, the table was created and written before: none of the PyArrow methods work, even those on the read path, so it is impossible to scan a non-empty table as well. We tested this by creating a table with the fsspec file IO and reading it with the PyArrow file IO.

It is hard to test this behavior with Azurite, because Azurite URIs are different and do not contain the "@<account_name>" part.

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
