Description
Environment
Delta-rs version:
Binding: Python (however, I think it's true for all versions)
Environment:
- Cloud provider: Azure, but the same happens with local filesystems
- OS: Windows or Linux
Bug
What happened:
I'm trying to read a Delta table where absolute paths are used for add and remove actions. If the path looks like abfss://<container>@<account_name>.blob.core.windows.net/full/part-00000-a72b1fb3-f2df-41fe-a8f0-e65b746382dd-c000.snappy.parquet or is any other absolute path (including one on a local filesystem), the file_uris method concatenates it with the table URI.
What you expected to happen:
Just use the provided absolute URI. The protocol (https://github.com/delta-io/delta/blob/master/PROTOCOL.md#add-file-and-remove-file) describes the path field as:
A relative path to a file from the root of the table or an absolute path to a file that should be added to the table.
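A minimal sketch of the behaviour I'd expect, written in Python only for illustration. The helpers `is_absolute_path` and `resolve_add_path` are hypothetical names, not part of deltalake or delta-rs, and the sketch ignores URL-encoding of paths in the log:

```python
from pathlib import PurePosixPath, PureWindowsPath
from urllib.parse import urlparse


def is_absolute_path(path: str) -> bool:
    """Return True if `path` is a URI with a scheme or an absolute local path."""
    scheme = urlparse(path).scheme
    # A one-letter "scheme" is most likely a Windows drive letter (C:\...),
    # so only longer schemes (abfss, s3, file, ...) are treated as URIs.
    if len(scheme) > 1:
        return True
    return PurePosixPath(path).is_absolute() or PureWindowsPath(path).is_absolute()


def resolve_add_path(table_uri: str, path: str) -> str:
    """Hypothetical helper: keep absolute paths, join relative ones to the table root."""
    if is_absolute_path(path):
        return path
    return table_uri.rstrip("/") + "/" + path
```

With this logic, an add action whose path already starts with abfss:// would be returned unchanged, while a plain part-00000-....parquet file name would still be joined to the table URI.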
How to reproduce it:
Delta Log entry looks like:
{"commitInfo":{"timestamp":1587968586154,"operation":"WRITE","operationParameters":{"mode":"ErrorIfExists","partitionBy":"[]"},"isBlindAppend":true}}
{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
{"metaData":{"id":"5fba94ed-9794-4965-ba6e-6ee3c0d22af9","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"configuration":{},"createdTime":1587968585495}}
{"add":{"path":"abfss://<container>@<account_name>.blob.core.windows.net/full/part-00000-a72b1fb3-f2df-41fe-a8f0-e65b746382dd-c000.snappy.parquet","partitionValues":{},"size":262,"modificationTime":1587968586000,"dataChange":true}}
Python code:
import os
from deltalake import DeltaTable
storage_options = {"AZURE_STORAGE_ACCOUNT_KEY": "<access key>"}
table_path = "abfss://<container>@<account_name>.blob.core.windows.net/full"
dt = DeltaTable(table_path, storage_options=storage_options)
print(dt.file_uris())
print(dt.to_pyarrow_dataset().to_table().to_pydict())
I'm getting the following output:
# Result of file_uris(); note that the table URI and the file URI were simply concatenated:
['abfss://<container>@<account_name>.blob.core.windows.net/fullabfss://<container>@<account_name>.blob.core.windows.net/full/part-00000-a72b1fb3-f2df-41fe-a8f0-e65b746382dd-c000.snappy.parquet']
# Exception caused by the concatenated file URI:
Traceback (most recent call last):
File "test.py", line 21, in <module>
print(dt.to_pyarrow_dataset().to_table().to_pydict())
File "pyarrow\_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table
File "pyarrow\_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table
File "pyarrow\error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow\_fs.pyx", line 1544, in pyarrow._fs._cb_open_input_file
File "\deltalake\fs.py", line 156, in open_input_file
raw = self._storage.get_obj(path)
deltalake.PyDeltaTableError: Object at location full/abfss://<container>@<account_name>.blob.core.windows.net/full/part-00000-a72b1fb3-f2df-41fe-a8f0-e65b746382dd-c000.snappy.parquet not found: response error "<?xml version="1.0" encoding="utf-8"?><Error><Code>BlobNotFound</Code><Message>The specified blob does not exist.
Time:2022-10-05T23:32:33.5408464Z</Message></Error>", after 0 retries: HTTP status client error (404 Not Found) for url (https://<account_name>.blob.core.windows.net/<container>/full/abfss://<container>@<account_name>.blob.core.windows.net/full/part-00000-a72b1fb3-f2df-41fe-a8f0-e65b746382dd-c000.snappy.parquet)
More details:
I guess the following code just concatenates the URIs:
/// Returns a URIs for all active files present in the current table version.
pub fn get_file_uris(&self) -> impl Iterator<Item = String> + '_ {
self.state
.files()
.iter()
.map(|add| self.storage.to_uri(&Path::from(add.path.as_ref())))
}
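To check whether a given table is affected, something like the following Python sketch could be run over a commit file from the table's _delta_log directory. The function name `absolute_paths_in_commit` is hypothetical, and for simplicity it only detects paths with a URI scheme (such as abfss://), not absolute local paths:

```python
import json
from urllib.parse import urlparse


def absolute_paths_in_commit(lines):
    """Yield (action, path) for add/remove actions whose path carries a URI scheme."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)  # each commit line is one JSON action
        for action in ("add", "remove"):
            path = entry.get(action, {}).get("path")
            # Require a scheme longer than one character so that Windows
            # drive letters are not mistaken for URI schemes.
            if path and len(urlparse(path).scheme) > 1:
                yield action, path
```

It accepts any iterable of lines, so an open file handle for a commit JSON file can be passed directly.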