Absolute path reading #865

@kosinsky

Environment

Delta-rs version:

Binding: Python (however, I think it's true for all bindings)

Environment:

  • Cloud provider: Azure; the same happens with local filesystems
  • OS: Windows or Linux

Bug

What happened:
I'm trying to read a Delta table where absolute paths are used for the add and remove actions. If the path looks like abfss://<container>@<account_name>.blob.core.windows.net/full/part-00000-a72b1fb3-f2df-41fe-a8f0-e65b746382dd-c000.snappy.parquet, or is any other absolute path (including one on a local filesystem), the file_uris method concatenates it with the table URI.

What you expected to happen:
The provided absolute URI should be used as-is. The protocol (https://github.com/delta-io/delta/blob/master/PROTOCOL.md#add-file-and-remove-file) describes the path field as:

A relative path to a file from the root of the table or an absolute path to a file that should be added to the table.

How to reproduce it:
Delta Log entry looks like:

{"commitInfo":{"timestamp":1587968586154,"operation":"WRITE","operationParameters":{"mode":"ErrorIfExists","partitionBy":"[]"},"isBlindAppend":true}}
{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
{"metaData":{"id":"5fba94ed-9794-4965-ba6e-6ee3c0d22af9","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"configuration":{},"createdTime":1587968585495}}
{"add":{"path":"abfss://<container>@<account_name>.blob.core.windows.net/full/part-00000-a72b1fb3-f2df-41fe-a8f0-e65b746382dd-c000.snappy.parquet","partitionValues":{},"size":262,"modificationTime":1587968586000,"dataChange":true}}

Python code:

from deltalake import DeltaTable

# Replace the placeholder with the storage account access key.
storage_options = {"AZURE_STORAGE_ACCOUNT_KEY": "<access key>"}
table_path = "abfss://<container>@<account_name>.blob.core.windows.net/full"
dt = DeltaTable(table_path, storage_options=storage_options)
print(dt.file_uris())
print(dt.to_pyarrow_dataset().to_table().to_pydict())

I'm getting the following output:

# result of file_uris(): note that the table URI and the absolute path were simply concatenated
['abfss://<container>@<account_name>.blob.core.windows.net/fullabfss://<container>@<account_name>.blob.core.windows.net/full/part-00000-a72b1fb3-f2df-41fe-a8f0-e65b746382dd-c000.snappy.parquet']

# exception caused by the concatenated file URI
Traceback (most recent call last):
  File "test.py", line 21, in <module>
    print(dt.to_pyarrow_dataset().to_table().to_pydict())
  File "pyarrow\_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table
  File "pyarrow\_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table
  File "pyarrow\error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow\_fs.pyx", line 1544, in pyarrow._fs._cb_open_input_file
  File "\deltalake\fs.py", line 156, in open_input_file
    raw = self._storage.get_obj(path)
deltalake.PyDeltaTableError: Object at location full/abfss://<container>@<account_name>.blob.core.windows.net/full/part-00000-a72b1fb3-f2df-41fe-a8f0-e65b746382dd-c000.snappy.parquet not found: response error "<?xml version="1.0" encoding="utf-8"?><Error><Code>BlobNotFound</Code><Message>The specified blob does not exist.

Time:2022-10-05T23:32:33.5408464Z</Message></Error>", after 0 retries: HTTP status client error (404 Not Found) for url (https://<account_name>.blob.core.windows.net/<container>/full/abfss://<container>@<account_name>.blob.core.windows.net/full/part-00000-a72b1fb3-f2df-41fe-a8f0-e65b746382dd-c000.snappy.parquet)

More details:
I guess the following code just concatenates the URIs:

    /// Returns a URIs for all active files present in the current table version.
    pub fn get_file_uris(&self) -> impl Iterator<Item = String> + '_ {
        self.state
            .files()
            .iter()
            .map(|add| self.storage.to_uri(&Path::from(add.path.as_ref())))
    }
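
The behaviour I would expect can be sketched as follows. This is written in Python for illustration only and is not a patch to delta-rs: resolve_file_uri is a hypothetical helper, the container and account names are placeholders, and a path is assumed to be absolute when it has a URI scheme or starts with a slash. Percent-decoding of paths is not considered.

```python
from urllib.parse import urlsplit


def resolve_file_uri(table_uri, add_path):
    """Hypothetical helper: keep absolute paths as-is, join relative ones to the table root."""
    if urlsplit(add_path).scheme or add_path.startswith("/"):
        return add_path  # already absolute, do not prepend the table URI
    return table_uri.rstrip("/") + "/" + add_path


# Placeholder table location and file names.
table_uri = "abfss://container@account.blob.core.windows.net/full"
absolute = table_uri + "/part-0.parquet"

print(resolve_file_uri(table_uri, absolute))          # absolute path is returned unchanged
print(resolve_file_uri(table_uri, "part-1.parquet"))  # relative path is joined to the table root
```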

Labels

binding/rust (Issues for the Rust crate), bug (Something isn't working)
