110 changes: 110 additions & 0 deletions docs/guides/DataframesViaArrow.md
@@ -0,0 +1,110 @@
# Working with DataFrames via Arrow

ParquetSharp now provides Arrow-based APIs for reading Parquet data into .NET `DataFrame` objects. Using Arrow can improve performance and avoid unnecessary memory copies. **However, there are limitations**, described below.

## Prerequisites

You'll need these packages:
```xml
<PackageReference Include="ParquetSharp" Version="5.*" />
<PackageReference Include="Apache.Arrow" Version="14.*" />
<PackageReference Include="Microsoft.Data.Analysis" Version="0.23.*" />
```

## Reading a Single Batch from Parquet

Arrow integration works reliably for reading a single batch. Here's how to read one batch and convert it to a DataFrame:

```csharp
using ParquetSharp.Arrow;
using Microsoft.Data.Analysis;
using Apache.Arrow;

using var fileReader = new FileReader("sample.parquet");
using var batchReader = fileReader.GetRecordBatchReader();

var batch = await batchReader.ReadNextRecordBatchAsync();
if (batch != null)
{
    using (batch)
    {
        // Clone so the DataFrame remains valid after the batch is disposed
        var df = DataFrame.FromArrowRecordBatch(batch).Clone();
        Console.WriteLine($"Rows: {df.Rows.Count}, Columns: {df.Columns.Count}");
        Console.WriteLine(df.Head(5));
    }
}
```

This approach works reliably for standard column types.

## Reading All Batches Separately

For files containing multiple record batches, convert each batch into its own DataFrame.

**Note**: Combining multiple batches with `Append()` is unreliable, particularly with string columns, so keep the per-batch DataFrames separate.

```csharp
using var fileReader = new FileReader("sample.parquet");
using var batchReader = fileReader.GetRecordBatchReader();

var dataFrames = new List<DataFrame>();
RecordBatch batch;

// Read until the stream is exhausted, cloning each batch's DataFrame
while ((batch = await batchReader.ReadNextRecordBatchAsync()) != null)
{
    using (batch)
    {
        var df = DataFrame.FromArrowRecordBatch(batch).Clone();
        dataFrames.Add(df);
    }
}

Console.WriteLine($"Read {dataFrames.Count} batch(es)");
foreach (var df in dataFrames)
{
    Console.WriteLine("\nDataFrame Batch:");
    Console.WriteLine($"Rows: {df.Rows.Count}, Columns: {df.Columns.Count}");
    Console.WriteLine(df.Head(5));
}
```
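If you need an aggregate over the whole file, compute it across the per-batch DataFrames rather than merging them. A minimal sketch continuing from the example above, assuming the file contains a numeric column named `value` (a hypothetical name for illustration):

```csharp
// Total row count across all batches
long totalRows = dataFrames.Sum(df => df.Rows.Count);

// Sum a numeric column across batches; "value" is a hypothetical column name
double grandTotal = dataFrames.Sum(df => Convert.ToDouble(df.Columns["value"].Sum()));

Console.WriteLine($"{totalRows} rows, total value {grandTotal}");
```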

## Key Notes

- **Clone to avoid disposal issues:** Clone each DataFrame so it remains valid after the underlying record batch is disposed (see the sketch after this list).

- **Do not rely on merging Arrow-backed DataFrames:** Combining multiple batches with `Append()` is unreliable, particularly with string columns.
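
A sketch of the failure mode the first note guards against, reusing the `batchReader` from the earlier examples:

```csharp
DataFrame df;
using (var batch = await batchReader.ReadNextRecordBatchAsync())
{
    // Without Clone(), the DataFrame shares the batch's Arrow buffers
    df = DataFrame.FromArrowRecordBatch(batch);
} // batch (and its memory) is disposed here
// df may now reference freed buffers; call .Clone() before disposal instead
```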

## Writing DataFrames to Parquet

- `ToArrowRecordBatches()` is not reliable for string columns.
- For safe writing, continue using the separate `ParquetSharp.DataFrame` NuGet package:

```csharp
using Microsoft.Data.Analysis;
using ParquetSharp;

using var reader = new ParquetFileReader("input.parquet");
var df = reader.ToDataFrame();
df.ToParquet("output.parquet");
```
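
For example, here is a minimal sketch that builds a DataFrame in memory, including a string column (the case the Arrow write path cannot handle), and writes it out; the column and file names are illustrative:

```csharp
using Microsoft.Data.Analysis;
using ParquetSharp;

var names = new StringDataFrameColumn("name", new[] { "alpha", "beta" });
var values = new PrimitiveDataFrameColumn<int>("value", new[] { 1, 2 });
var table = new DataFrame(names, values);

// ToParquet comes from the ParquetSharp.DataFrame package
table.ToParquet("strings.parquet");
```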

## When to Use Arrow vs ParquetSharp.DataFrame

| Task | Arrow API | ParquetSharp.DataFrame |
|------|-----------|------------------------|
| **Reading** Parquet to DataFrame | ✅ Recommended - Faster, less memory copying | ✅ Works - Simple one-line API |
| **Writing** DataFrame to Parquet | ❌ Unreliable - Fails with string columns | ✅ Recommended - Reliable for all column types |
| **String columns** | ⚠️ Read-only support | ✅ Full read/write support |
| **Merging batches** | ❌ `Append()` is unreliable | ✅ Works reliably |
| **Performance** | ⚠️ Faster for reads only | ⚠️ Slower but more reliable |
| **Use case** | Large file reads, streaming | Writing, string data, combining data |

## Key Takeaways

- **Arrow + FromArrowRecordBatch()** is safe and faster for reading Parquet files into DataFrames.
- **ParquetSharp.DataFrame is more reliable** for writing DataFrames back to Parquet.
- `ToArrowRecordBatches()` and `Append()` are unreliable for writing or merging batches.
- **Writing and combining DataFrames** still requires `ParquetSharp.DataFrame`.

## See Also

For more details, check out:
- [ParquetSharp Arrow API Documentation](https://g-research.github.io/ParquetSharp/guides/Arrow.html)
- [DataFrame.FromArrowRecordBatch Method](https://learn.microsoft.com/en-us/dotnet/api/microsoft.data.analysis.dataframe.fromarrowrecordbatch?view=ml-dotnet-preview)
- [DataFrame.ToArrowRecordBatches Method](https://learn.microsoft.com/en-us/dotnet/api/microsoft.data.analysis.dataframe.toarrowrecordbatches?view=ml-dotnet-preview)
1 change: 1 addition & 0 deletions docs/index.md
@@ -162,6 +162,7 @@ For more detailed information on how to use ParquetSharp, see the following guid
* [Reading Parquet files](guides/Reading.md)
* [Working with nested data](guides/Nested.md)
* [Reading and writing Arrow data](guides/Arrow.md) &mdash; how to read and write data using the [Apache Arrow format](https://arrow.apache.org/)
* [Working with DataFrames via Arrow](guides/DataframesViaArrow.md)
* [Row-oriented API](guides/RowOriented.md) &mdash; a higher level API that abstracts away the column-oriented nature of Parquet files
* [Custom types](guides/TypeFactories.md) &mdash; how to customize the mapping between .NET and Parquet types,
including using the `DateOnly` and `TimeOnly` types added in .NET 6.