Commit 244dd0e

committed
Expression Trees
- serialise and deserialise simple first-level properties
- Schema reflector supports simple dictionaries
1 parent 3b8430b commit 244dd0e

10 files changed, with 372 additions and 151 deletions.

docs/legacy_serialisation.md

Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
1+
# Class Serialisation
2+
3+
The Parquet library is generally extremely flexible in supporting the internals of the Apache Parquet format and lets you do whatever the low-level API allows. However, writing boilerplate code is often unsuitable when you are working with business objects and just want to serialise them into a Parquet file.
4+
5+
Class serialisation is **really fast** as it generates [MSIL](https://en.wikipedia.org/wiki/Common_Intermediate_Language) on the fly. That means there is a small delay when serialising the first entity, which in most cases is negligible. Once a class has been serialised at least once, further operations become blazingly fast (around a *40x* speed improvement compared to reflection on relatively large amounts of data, ~5 million records).
6+
7+
> At the moment class serialisation supports only simple first-level class *properties* (having a getter and a setter). None of the complex types such as arrays etc. are supported. This is mostly due to lack of time rather than technical limitations.
8+
9+
## Quick Start
10+
11+
Both the serialiser and the deserialiser work with arrays of class instances. Let's say you have the following class definition:
12+
13+
```csharp
14+
class Record {
15+
public DateTime Timestamp { get; set; }
16+
public string EventName { get; set; }
17+
public double MeterValue { get; set; }
18+
}
19+
```
20+
21+
Let's generate a few instances of those for a test:
22+
23+
```csharp
24+
var data = Enumerable.Range(0, 1_000_000).Select(i => new Record {
25+
Timestamp = DateTime.UtcNow.AddSeconds(i),
26+
EventName = i % 2 == 0 ? "on" : "off",
27+
MeterValue = i
28+
}).ToList();
29+
```
30+
31+
Here is what you can do to write out those classes in a single file:
32+
33+
```csharp
34+
await ParquetConvert.SerializeAsync(data, "/mnt/storage/data.parquet");
35+
```
36+
37+
That's it! Of course the `.SerializeAsync()` method also has overloads and optional parameters allowing you to control the serialization process slightly, such as selecting compression method, row group size etc.
38+
39+
Parquet.Net will automatically figure out file schema by reflecting class structure, types, nullability and other parameters for you.
40+
41+
To deserialise this file back into an array of class instances, you would write the following:
42+
43+
```csharp
44+
Record[] data = await ParquetConvert.DeserializeAsync<Record>("/mnt/storage/data.parquet");
45+
```
46+
### Retrieve and Deserialize records by RowGroup:
47+
48+
If you have a huge Parquet file (~10 million records), you can also retrieve records by row group index, which helps keep the memory footprint low because you don't load everything into memory.
49+
```csharp
50+
SimpleStructure[] structures = ParquetConvert.Deserialize<SimpleStructure>(stream,rowGroupIndex);
51+
```
52+
### Deserialize only few properties:
53+
54+
If you have a Parquet file with a huge number of columns and only need a few of them for processing, you can retrieve just the required columns, as shown in the snippet below.
55+
```csharp
56+
class MyClass
57+
{
58+
public int Id { get; set; }
59+
public string Name{get;set;}
60+
public string Address{get;set;}
61+
public int Age{get;set;}
62+
}
63+
class MyClassV1
64+
{
65+
public string Name { get; set; }
66+
}
67+
MyClass[] structures = Enumerable
68+
.Range(0, 1000)
69+
.Select(i => new MyClass
70+
{
71+
Id = i,
72+
Name = $"row {i}",
73+
})
74+
.ToArray();
75+
ParquetConvert.Serialize(structures, stream);
76+
77+
MyClassV1[] v1structures = ParquetConvert.Deserialize<MyClassV1>(stream,rowGroupIndex);
78+
```
79+
80+
## Customising Serialisation
81+
82+
Serialisation tries to fit into C# ecosystem like a ninja 🥷, including customisations. It supports the following attributes from [`System.Text.Json.Serialization` Namespace](https://learn.microsoft.com/en-us/dotnet/api/system.text.json.serialization?view=net-7.0):
83+
84+
- [`JsonPropertyName`](https://learn.microsoft.com/en-us/dotnet/api/system.text.json.serialization.jsonpropertynameattribute?view=net-7.0) - changes mapping of column name to property name.
85+
- [`JsonIgnore`](https://learn.microsoft.com/en-us/dotnet/api/system.text.json.serialization.jsonignoreattribute?view=net-7.0) - ignores property when reading or writing.
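
As a minimal sketch of how the two attributes listed above are applied: the `AnnotatedRecord` class and the `ColumnNames` helper below are hypothetical and are not part of Parquet.Net. The helper only uses reflection to show which column names a serialiser honouring these attributes would be expected to derive.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;
using System.Text.Json.Serialization;

public class AnnotatedRecord {
    public DateTime Timestamp { get; set; }

    // Mapped to the column name "event_name" instead of "EventName".
    [JsonPropertyName("event_name")]
    public string EventName { get; set; } = "";

    // Skipped entirely when reading or writing.
    [JsonIgnore]
    public double Scratch { get; set; }
}

public static class AttributeDemo {
    // Hypothetical helper (an assumption for illustration, not library code):
    // drops [JsonIgnore] properties and applies [JsonPropertyName] overrides.
    public static List<string> ColumnNames(Type t) =>
        t.GetProperties(BindingFlags.Public | BindingFlags.Instance)
         .Where(p => p.GetCustomAttribute<JsonIgnoreAttribute>() == null)
         .Select(p => p.GetCustomAttribute<JsonPropertyNameAttribute>()?.Name ?? p.Name)
         .ToList();
}
```

Calling `AttributeDemo.ColumnNames(typeof(AnnotatedRecord))` yields `Timestamp` and `event_name`; `Scratch` is left out.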

docs/serialisation.md

Lines changed: 21 additions & 39 deletions
@@ -1,14 +1,16 @@
11
# Class Serialisation
22

3+
> For legacy serialisation, refer to [this doc](legacy_serialisation.md).
4+
35
The Parquet library is generally extremely flexible in supporting the internals of the Apache Parquet format and lets you do whatever the low-level API allows. However, writing boilerplate code is often unsuitable when you are working with business objects and just want to serialise them into a Parquet file.
46

5-
Class serialisation is **really fast** as it generates [MSIL](https://en.wikipedia.org/wiki/Common_Intermediate_Language) on the fly. That means there is a tiny bit of delay when serialising a first entity, which in most cases is negligible. Once the class is serialised at least once, further operations become blazingly fast (around *x40* speed improvement comparing to reflection on relatively large amounts of data (~5 million records)).
7+
Class serialisation is **really fast** because internally it generates [compiled expression trees](https://learn.microsoft.com/en-US/dotnet/csharp/programming-guide/concepts/expression-trees/) on the fly. That means there is a small delay when serialising the first entity, which in most cases is negligible. Once a class has been serialised at least once, further operations become blazingly fast (around a *40x* speed improvement compared to reflection on relatively large amounts of data, ~5 million records).
68

7-
> At the moment class serialisation supports only simple first-level class *properties* (having a getter and a setter). None of the complex types such as arrays etc. are supported. This is mostly due to lack of time rather than technical limitations.
9+
The class serialisation philosophy is simply to mimic .NET's built-in **JSON** serialisation infrastructure, in order to ease the learning path and reuse as much existing code as possible.
810

911
## Quick Start
1012

11-
Both serialiser and deserialiser works with array of classes. Let's say you have the following class definition:
13+
Both the serialiser and the deserialiser work with collections of class instances. Let's say you have the following class definition:
1214

1315
```csharp
1416
class Record {
@@ -31,7 +33,7 @@ var data = Enumerable.Range(0, 1_000_000).Select(i => new Record {
3133
Here is what you can do to write out those classes in a single file:
3234

3335
```csharp
34-
await ParquetConvert.SerializeAsync(data, "/mnt/storage/data.parquet");
36+
await ParquetSerializer.SerializeAsync(data, "/mnt/storage/data.parquet");
3537
```
3638

3739
That's it! Of course the `.SerializeAsync()` method also has overloads and optional parameters allowing you to control the serialization process slightly, such as selecting compression method, row group size etc.
@@ -41,45 +43,25 @@ Parquet.Net will automatically figure out file schema by reflecting class struct
4143
To deserialise this file back into a collection of class instances, you would write the following:
4244

4345
```csharp
44-
Record[] data = await ParquetConvert.DeserializeAsync<Record>("/mnt/storage/data.parquet");
45-
```
46-
### Retrieve and Deserialize records by RowGroup:
47-
48-
If you have a huge parquet file(~10million records), you can also retrieve records by rowgroup index (which could help to keep low memory footprint as you don't load everything into memory).
49-
```csharp
50-
SimpleStructure[] structures = ParquetConvert.Deserialize<SimpleStructure>(stream,rowGroupIndex);
51-
```
52-
### Deserialize only few properties:
53-
54-
If you have a parquet file with huge number of columns and you only need few columns for processing, you can retrieve required columns only as described in the below code snippet.
55-
```csharp
56-
class MyClass
57-
{
58-
public int Id { get; set; }
59-
public string Name{get;set;}
60-
public string Address{get;set;}
61-
public int Age{get;set;}
62-
}
63-
class MyClassV1
64-
{
65-
public string Name { get; set; }
66-
}
67-
SimpleStructure[] structures = Enumerable
68-
.Range(0, 1000)
69-
.Select(i => new SimpleStructure
70-
{
71-
Id = i,
72-
Name = $"row {i}",
73-
})
74-
.ToArray();
75-
ParquetConvert.Serialize(structures, stream);
76-
77-
MyClassV1[] v1structures = ParquetConvert.Deserialize<MyClassV1>(stream,rowGroupIndex);
46+
IList<Record> data = await ParquetSerializer.DeserializeAsync<Record>("/mnt/storage/data.parquet");
7847
```
79-
8048
## Customising Serialisation
8149

8250
Serialisation tries to fit into C# ecosystem like a ninja 🥷, including customisations. It supports the following attributes from [`System.Text.Json.Serialization` Namespace](https://learn.microsoft.com/en-us/dotnet/api/system.text.json.serialization?view=net-7.0):
8351

8452
- [`JsonPropertyName`](https://learn.microsoft.com/en-us/dotnet/api/system.text.json.serialization.jsonpropertynameattribute?view=net-7.0) - changes mapping of column name to property name.
8553
- [`JsonIgnore`](https://learn.microsoft.com/en-us/dotnet/api/system.text.json.serialization.jsonignoreattribute?view=net-7.0) - ignores property when reading or writing.
54+
55+
## Non-Trivial Types
56+
57+
You can also serialize more complex types supported by the Parquet format.
58+
59+
### Maps (Dictionaries)
60+
61+
62+
63+
## FAQ
64+
65+
**Q.** Can I specify a schema for serialisation/deserialisation?
66+
67+
**A.** No. Your class definition is the schema, so you don't need to supply it separately.
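
The updated docs/serialisation.md above attributes the speed to compiled expression trees. As a rough illustration of that technique, and not Parquet.Net's actual implementation, a property getter can be compiled once and then called like any ordinary delegate. `Sample` and `GetterCompiler` are hypothetical names introduced for this sketch.

```csharp
using System;
using System.Linq.Expressions;
using System.Reflection;

public class Sample {
    public double MeterValue { get; set; }
}

public static class GetterCompiler {
    // Builds and compiles the lambda `(T instance) => instance.<propertyName>`.
    // Compilation is the one-off cost; afterwards the delegate is invoked
    // directly, without a reflection lookup on every call.
    public static Func<T, TValue> Compile<T, TValue>(string propertyName) {
        PropertyInfo prop = typeof(T).GetProperty(propertyName)
            ?? throw new ArgumentException($"No property '{propertyName}' on {typeof(T).Name}");

        ParameterExpression instance = Expression.Parameter(typeof(T), "instance");
        MemberExpression access = Expression.Property(instance, prop);

        return Expression.Lambda<Func<T, TValue>>(access, instance).Compile();
    }
}
```

A caller would compile once, for example `var getter = GetterCompiler.Compile<Sample, double>("MeterValue");`, and then reuse `getter` for every record. That matches the "small delay on the first entity, fast afterwards" behaviour the documentation describes.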

src/Parquet.Test/Serialisation/CILProgramTest.cs

Lines changed: 0 additions & 36 deletions
This file was deleted.

src/Parquet.Test/Serialisation/ParquetSerializerTest.cs

Lines changed: 17 additions & 1 deletion
@@ -1,5 +1,6 @@
11
using System;
22
using System.Collections.Generic;
3+
using System.Diagnostics;
34
using System.IO;
45
using System.Linq;
56
using System.Text;
@@ -10,14 +11,24 @@
1011
namespace Parquet.Test.Serialisation {
1112
public class ParquetSerializerTest {
1213

13-
class Record {
14+
class Record : IEquatable<Record> {
1415
public DateTime Timestamp { get; set; }
1516
public string? EventName { get; set; }
1617
public double MeterValue { get; set; }
18+
19+
public bool Equals(Record? other) {
20+
if(other == null)
21+
return false;
22+
23+
return Timestamp == other.Timestamp &&
24+
EventName == other.EventName &&
25+
MeterValue == other.MeterValue;
26+
}
1727
}
1828

1929
[Fact]
2030
public async Task SerializeDeserializeRecord() {
31+
2132
var data = Enumerable.Range(0, 1_000_000).Select(i => new Record {
2233
Timestamp = DateTime.UtcNow.AddSeconds(i),
2334
EventName = i % 2 == 0 ? "on" : "off",
@@ -26,6 +37,11 @@ public async Task SerializeDeserializeRecord() {
2637

2738
using var ms = new MemoryStream();
2839
await ParquetSerializer.SerializeAsync(data, ms);
40+
41+
ms.Position = 0;
42+
IList<Record> data2 = await ParquetSerializer.DeserializeAsync<Record>(ms);
43+
44+
Assert.Equal(data, data2);
2945
}
3046
}
3147
}
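
The test above makes `Record` implement `IEquatable<Record>` so that the round-tripped list can be compared by value rather than by reference. A minimal, self-contained sketch of the same idea follows. It uses a hypothetical `Reading` class and LINQ's `SequenceEqual` in place of xUnit's `Assert.Equal`, and it also overrides `object.Equals` and `GetHashCode`, which the test class does not do.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Reading : IEquatable<Reading> {
    public string Name { get; set; } = "";
    public double Value { get; set; }

    // Value equality: two readings are equal when all their properties match.
    public bool Equals(Reading? other) =>
        other is not null && Name == other.Name && Value == other.Value;

    // Keep object.Equals and GetHashCode consistent with the typed Equals.
    public override bool Equals(object? obj) => Equals(obj as Reading);
    public override int GetHashCode() => HashCode.Combine(Name, Value);
}

public static class EqualityDemo {
    // SequenceEqual compares element by element with EqualityComparer<T>.Default,
    // which picks up the IEquatable<Reading> implementation above.
    public static bool SameContent(IEnumerable<Reading> a, IEnumerable<Reading> b) =>
        a.SequenceEqual(b);
}
```

Without the `IEquatable<T>` implementation (or an `Equals` override), two separately constructed instances with identical property values would compare as different.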

src/Parquet.Test/Serialisation/SchemaReflectorTest.cs

Lines changed: 18 additions & 0 deletions
@@ -1,3 +1,4 @@
1+
using System.Collections.Generic;
12
using System.Text.Json.Serialization;
23
using Parquet.Schema;
34
using Parquet.Serialization;
@@ -136,5 +137,22 @@ public void IgnoredProperties() {
136137
Assert.Equal(new ParquetSchema(
137138
new DataField<int>("NotIgnored")), schema);
138139
}
140+
141+
class SimpleMapPoco {
142+
public int? Id { get; set; }
143+
144+
public Dictionary<string, int> Tags { get; set; } = new Dictionary<string, int>();
145+
}
146+
147+
[Fact]
148+
public void SimpleMap() {
149+
ParquetSchema schema = typeof(SimpleMapPoco).GetParquetSchema(true);
150+
151+
Assert.Equal(new ParquetSchema(
152+
new DataField<int?>("Id"),
153+
new MapField("Tags",
154+
new DataField<string>("Key"),
155+
new DataField<int>("Value"))), schema);
156+
}
139157
}
140158
}

src/Parquet/Schema/MapField.cs

Lines changed: 1 addition & 1 deletion
@@ -26,7 +26,7 @@ public class MapField : Field {
2626
/// <summary>
2727
/// Declares a map field
2828
/// </summary>
29-
public MapField(string name, DataField keyField, DataField valueField)
29+
public MapField(string name, Field keyField, Field valueField)
3030
: base(name, SchemaType.Map) {
3131
Key = keyField;
3232
Value = valueField;

src/Parquet/Serialization/CILProgram.cs

Lines changed: 0 additions & 65 deletions
This file was deleted.

src/Parquet/Serialization/MSILGenerator.cs

Lines changed: 1 addition & 1 deletion
@@ -355,7 +355,7 @@ static class PropertyHelpers {
355355
PropertyInfo? prop = classType.GetTypeInfo().GetDeclaredProperty(fieldName);
356356

357357
// TODO: trying to get build, probably not the best solution
358-
var baseType = classType.BaseType;
358+
Type? baseType = classType.BaseType;
359359
while(prop == null && baseType != null) {
360360
// if pi is null, try the base class
361361
prop = baseType?.GetTypeInfo()?.GetDeclaredProperty(fieldName);
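
The hunk above is cut off inside the loop, so the rest of the loop body is not visible here. The sketch below shows the general pattern under that caveat: `GetDeclaredProperty` only sees properties declared on the exact type, so inherited properties are found by climbing the `BaseType` chain. The `BaseEntity`, `DerivedEntity` and `PropertyLookup` names are hypothetical, and the step that advances `baseType` is an assumption about how such a loop would normally terminate.

```csharp
using System;
using System.Reflection;

public class BaseEntity {
    public int Id { get; set; }
}

public class DerivedEntity : BaseEntity {
    public string Name { get; set; } = "";
}

public static class PropertyLookup {
    // Looks for a property on the type itself, then on each base type in turn.
    public static PropertyInfo? FindProperty(Type classType, string name) {
        PropertyInfo? prop = classType.GetTypeInfo().GetDeclaredProperty(name);

        Type? baseType = classType.BaseType;
        while (prop == null && baseType != null) {
            prop = baseType.GetTypeInfo().GetDeclaredProperty(name);
            // Assumed step: move one level up so the loop ends at System.Object.
            baseType = baseType.BaseType;
        }

        return prop;
    }
}
```

For example, `PropertyLookup.FindProperty(typeof(DerivedEntity), "Id")` returns the property declared on `BaseEntity`, whereas `typeof(DerivedEntity).GetTypeInfo().GetDeclaredProperty("Id")` alone returns `null`.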
