Feature/Bug: allow reading from continuously appendable logs #135

@landeholt

Hey.

We are experimenting with writing data to append-only (AOF) logs in Azure ADLS Gen2.

However, when we tried writing to the logs while concurrently reading from them, we ran into what looks like a race condition. Our assumption is that the Azure HTTP client used by this extension rejects reads of files whose ETag has changed between listing/globbing the files and actually reading them.

D SELECT count(*) FROM 'abfss://<path>/**.jsonl';
 94% ▕███████████████████████████████████▋  ▏ (~10 seconds remaining)   IO Error:
AzureBlobStorageFileSystem Read to 'abfss://<path>/<hive-partition>/*.jsonl' failed with ConditionNotMet Reason Phrase: The condition specified using HTTP conditional header(s) is not met.
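
To make the suspected sequence concrete, here is a minimal sketch that reproduces it against an in-memory stand-in. `MockBlob` and `ReproduceRace` are hypothetical names used only for illustration; nothing here calls the Azure SDK, and the sketch only models our assumption that the read is pinned to the ETag seen at listing time.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical in-memory stand-in for an appendable file; not an Azure SDK type.
struct MockBlob {
    std::string data;
    unsigned version = 0; // bumped on every append

    std::string ETag() const { return "etag-" + std::to_string(version); }

    void Append(const std::string &chunk) {
        data += chunk;
        ++version;
    }

    // Ranged read that mimics an HTTP If-Match precondition.
    std::string ReadRange(std::size_t offset, std::size_t length,
                          const std::optional<std::string> &if_match) const {
        if (if_match && *if_match != ETag()) {
            throw std::runtime_error("ConditionNotMet");
        }
        if (offset > data.size()) {
            offset = data.size(); // clamp so substr cannot throw
        }
        return data.substr(offset, length);
    }
};

// Returns true when a read pinned to the listing-time ETag fails after an append.
bool ReproduceRace() {
    MockBlob blob;
    blob.Append("{\"a\":1}\n");
    const std::string listed_etag = blob.ETag(); // captured while globbing
    blob.Append("{\"a\":2}\n");                  // writer appends before the read
    assert(listed_etag != blob.ETag());          // the append changed the ETag
    try {
        blob.ReadRange(0, 8, listed_etag);
    } catch (const std::runtime_error &) {
        return true; // same shape of failure as the ConditionNotMet error above
    }
    return false;
}
```

If this model matches what the extension does, any append that lands between the glob and the last ranged read of a file is enough to fail the whole query.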

void AzureDfsStorageFileSystem::ReadRange(AzureFileHandle &handle, idx_t file_offset, char *buffer_out,
                                          idx_t buffer_out_len) {
    auto &afh = handle.Cast<AzureDfsStorageFileHandle>();
    try {
        // Specify the range
        Azure::Core::Http::HttpRange range;
        range.Offset = (int64_t)file_offset;
        range.Length = buffer_out_len;
        Azure::Storage::Files::DataLake::DownloadFileToOptions options;
        options.Range = range;
        options.TransferOptions.Concurrency = afh.read_options.transfer_concurrency;
        options.TransferOptions.InitialChunkSize = afh.read_options.transfer_chunk_size;
        options.TransferOptions.ChunkSize = afh.read_options.transfer_chunk_size;
        auto res = afh.file_client.DownloadTo((uint8_t *)buffer_out, buffer_out_len, options);
    } catch (const Azure::Storage::StorageException &e) {
        throw IOException("AzureBlobStorageFileSystem Read to '%s' failed with %s Reason Phrase: %s", afh.path,
                          e.ErrorCode, e.ReasonPhrase);
    }
}

We propose adding one or more of the following options:

  • Read files gracefully and skip those that fail [lets the reader read most of the data]
  • Lease the file during the read [blocks the writer while the reader holds the lease]
  • Allow skipping ETag validation during partial reads [lets the reader read all available data, but can leave the reader in an unwanted state when it actually needs the content as of the specified ETag]
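
The three options can be compared in a small simulation. This is only a sketch against an in-memory stand-in, not a proposed patch: `MockBlob`, `LeaseGuard`, `ListedFile`, `Policy`, and `ReadAll` are hypothetical names, and bounding the unvalidated read to the size seen at listing time is our own assumption about how to keep the third option consistent for append-only files.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for an appendable file; not an Azure SDK type.
struct MockBlob {
    std::string data;
    unsigned version = 0;
    bool leased = false;

    std::string ETag() const { return "etag-" + std::to_string(version); }

    void Append(const std::string &chunk) {
        if (leased) {
            throw std::runtime_error("LeaseHeld"); // writer is blocked by the lease
        }
        data += chunk;
        ++version;
    }

    std::string ReadRange(std::size_t offset, std::size_t length,
                          const std::optional<std::string> &if_match) const {
        if (if_match && *if_match != ETag()) {
            throw std::runtime_error("ConditionNotMet");
        }
        if (offset > data.size()) {
            offset = data.size();
        }
        return data.substr(offset, length);
    }
};

// Option 2: hold a lease for the duration of the read (RAII).
struct LeaseGuard {
    MockBlob &blob;
    explicit LeaseGuard(MockBlob &b) : blob(b) { blob.leased = true; }
    ~LeaseGuard() { blob.leased = false; }
};

// What the reader remembers from the listing/globbing step.
struct ListedFile {
    MockBlob *blob;
    std::string etag;
    std::size_t size;
};

ListedFile List(MockBlob &b) { return {&b, b.ETag(), b.data.size()}; }

enum class Policy { Strict, SkipChanged, IgnoreETag };

// Reads every listed file under the given policy and concatenates the bytes.
std::string ReadAll(const std::vector<ListedFile> &files, Policy policy) {
    std::string out;
    for (const auto &f : files) {
        try {
            std::optional<std::string> condition;
            if (policy != Policy::IgnoreETag) {
                condition = f.etag; // pin the read to the listing-time ETag
            }
            // Option 3 reads only the first f.size bytes, i.e. the listed prefix.
            out += f.blob->ReadRange(0, f.size, condition);
        } catch (const std::runtime_error &) {
            if (policy != Policy::SkipChanged) {
                throw; // Strict: current behaviour, the whole read fails
            }
            // Option 1: drop the changed file and keep going.
        }
    }
    return out;
}
```

In this model, `SkipChanged` loses the whole file that was appended to, `IgnoreETag` still returns the prefix that existed at listing time, and `LeaseGuard` prevents the append from happening at all while the reader holds it.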
