Skip to content

Best practice for appending streams of rows #151

@adamjones2

Description

@adamjones2

Could I get a bit of advice on the best way to this problem using delta-dotnet? I have an app which I want to stream out a large number of rows into a Delta table. Eg.:

Schema GetTableSchema();
IAsyncEnumerable<RecordBatch> GetBatches(); // Long-lived sequence, each RecordBatch element is a large payload

var engine = new DeltaEngine(EngineOptions.Default);
var tableSchema = GetTableSchema();
var tableLoadOptions = new TableOptions { WithoutFiles = true };
var tableInsertOptions = new InsertOptions { SaveMode = SaveMode.Append };

await foreach (var batch in GetBatches())
{
    var table = await engine.LoadTableAsync(tableLoadOptions, default);
    
    await table.InsertAsync([batch], tableSchema, tableInsertOptions, default);
}

My understanding is I have to first call LoadTableAsync in order to get a handle to the table that I want to append the rows to so I can call InsertAsync on it, but this won't pull the data in the table down to my app (which I'm not interested in), just the metadata, schema, etc.

My issue is that for my table LoadTableAsync takes quite a long time (~2-10s), which stalls the rate at which I can call InsertAsync. I'm not really clear on why this is given I thought it was only pulling metadata not Parquet files. If I instead try to move the LoadTableAsync outside of the await foreach, and load the table once at the beginning of the process and reuse it for each RecordBatch, I find that the memory of my app quickly blows up - it looks like retaining the ITable reference also retains in-memory all the RecordBatches that have been inserted into it, and they're only deallocated when the table is disposed, is that right?

What would be the best advised calling pattern for a stream of appends like this? (Ideally some way I can pull the metadata of the table once on startup while also releasing the memory of inserted rows immediately after insertion.) For what it's worth I think I'm seeing similar behaviour in delta-rs so it might be more a question for that repo, I'm not sure.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions