-
Notifications
You must be signed in to change notification settings - Fork 12
Description
Could I get a bit of advice on the best way to this problem using delta-dotnet? I have an app which I want to stream out a large number of rows into a Delta table. Eg.:
Schema GetTableSchema();
IAsyncEnumerable<RecordBatch> GetBatches(); // Long-lived sequence, each RecordBatch element is a large payload
var engine = new DeltaEngine(EngineOptions.Default);
var tableSchema = GetTableSchema();
var tableLoadOptions = new TableOptions { WithoutFiles = true };
var tableInsertOptions = new InsertOptions { SaveMode = SaveMode.Append };
await foreach (var batch in GetBatches())
{
var table = await engine.LoadTableAsync(tableLoadOptions, default);
await table.InsertAsync([batch], tableSchema, tableInsertOptions, default);
}My understanding is I have to first call LoadTableAsync in order to get a handle to the table that I want to append the rows to so I can call InsertAsync on it, but this won't pull the data in the table down to my app (which I'm not interested in), just the metadata, schema, etc.
My issue is that for my table LoadTableAsync takes quite a long time (~2-10s), which stalls the rate at which I can call InsertAsync. I'm not really clear on why this is given I thought it was only pulling metadata not Parquet files. If I instead try to move the LoadTableAsync outside of the await foreach, and load the table once at the beginning of the process and reuse it for each RecordBatch, I find that the memory of my app quickly blows up - it looks like retaining the ITable reference also retains in-memory all the RecordBatches that have been inserted into it, and they're only deallocated when the table is disposed, is that right?
What would be the best advised calling pattern for a stream of appends like this? (Ideally some way I can pull the metadata of the table once on startup while also releasing the memory of inserted rows immediately after insertion.) For what it's worth I think I'm seeing similar behaviour in delta-rs so it might be more a question for that repo, I'm not sure.