Parquet library is generally extremely flexible in terms of supporting the internals of the Apache Parquet format, and allows you to do whatever the low-level API allows. However, in many cases writing boilerplate code is not practical if you are working with business objects and just want to serialise them into a parquet file.

> For legacy serialisation refer to [this doc](legacy_serialisation.md).

Class serialisation is **really fast**, as internally it generates [compiled expression trees](https://learn.microsoft.com/en-US/dotnet/csharp/programming-guide/concepts/expression-trees/) on the fly. That means there is a tiny bit of delay when serialising the first entity, which in most cases is negligible. Once a class has been serialised at least once, further operations become blazingly fast (around a *x40* speed improvement compared to reflection on relatively large amounts of data, ~5 million records).

Class serialisation philosophy simply tries to mimic .NET's built-in **json** serialisation infrastructure, in order to ease the learning path and reuse as much existing code as possible.
## Quick Start
Both serialiser and deserialiser work with a collection of classes. Let's say you have the following class definition:
```csharp
class Record {
    public DateTime Timestamp { get; set; }
    public string EventName { get; set; }
    public double MeterValue { get; set; }
}
```
Let's generate a few instances of those for a test:
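
For example, a quick sketch that fills the fields with arbitrary test values:

```csharp
var data = Enumerable.Range(0, 1_000_000).Select(i => new Record {
    // the values below are arbitrary test data
    Timestamp = DateTime.UtcNow.AddSeconds(i),
    EventName = i % 2 == 0 ? "on" : "off",
    MeterValue = i
}).ToList();
```

Here is what you can do to write out those classes in a single file:

```csharp
// a minimal sketch assuming Parquet.Net's ParquetSerializer (Parquet.Serialization namespace)
// and its file-path overload; the output path is only an example
await ParquetSerializer.SerializeAsync(data, "/tmp/data.parquet");
```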
That's it! Of course the `.SerializeAsync()` method also has overloads and optional parameters allowing you to control the serialisation process, such as selecting the compression method, row group size and so on.
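
For instance, a sketch assuming the `ParquetSerializerOptions` type with compression and row group settings (option names may vary slightly between library versions):

```csharp
// sketch: pass options to control compression and row group size (assumed option names)
await ParquetSerializer.SerializeAsync(data, "/tmp/data.parquet",
    new ParquetSerializerOptions {
        CompressionMethod = CompressionMethod.Gzip,
        RowGroupSize = 100_000
    });
```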
Parquet.Net will automatically figure out the file schema for you by reflecting the class structure, types, nullability and other parameters.

In order to deserialise this file back to an array of classes, you would write the following:
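
A minimal sketch, assuming the `ParquetSerializer.DeserializeAsync<T>` method and the example path used above:

```csharp
// reads the whole file back into a list of Record instances
using System.IO.FileStream fs = System.IO.File.OpenRead("/tmp/data.parquet");
IList<Record> data = await ParquetSerializer.DeserializeAsync<Record>(fs);
```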
If you have a huge parquet file (~10 million records), you can also retrieve records by row group index, which helps keep the memory footprint low because you don't load everything into memory at once.
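
A sketch, assuming a `DeserializeAsync` overload that accepts a row group index:

```csharp
// read only a single row group (here the first one) instead of the whole file
using System.IO.FileStream fs = System.IO.File.OpenRead("/tmp/data.parquet");
IList<Record> firstGroup = await ParquetSerializer.DeserializeAsync<Record>(fs, 0);
```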
If you have a parquet file with a huge number of columns and you only need a few of them for processing, you can retrieve just the required columns, as described in the code snippet below.
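
One way to do this (a sketch, assuming columns without a matching property are simply skipped by the deserialiser) is to declare a narrower class that contains only the properties you need:

```csharp
// hypothetical narrower view over the same file: only these two columns are read back
class RecordView {
    public DateTime Timestamp { get; set; }
    public double MeterValue { get; set; }
}

using System.IO.FileStream fs = System.IO.File.OpenRead("/tmp/data.parquet");
IList<RecordView> slim = await ParquetSerializer.DeserializeAsync<RecordView>(fs);
```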
Serialisation tries to fit into the C# ecosystem like a ninja 🥷, including customisations. It supports the following attributes from the [`System.Text.Json.Serialization` namespace](https://learn.microsoft.com/en-us/dotnet/api/system.text.json.serialization?view=net-7.0):

- [`JsonPropertyName`](https://learn.microsoft.com/en-us/dotnet/api/system.text.json.serialization.jsonpropertynameattribute?view=net-7.0) - changes the mapping of a column name to a property name.
- [`JsonIgnore`](https://learn.microsoft.com/en-us/dotnet/api/system.text.json.serialization.jsonignoreattribute?view=net-7.0) - ignores the property when reading or writing.
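
For example, a sketch combining both attributes (the class and property names here are made up for illustration):

```csharp
using System.Text.Json.Serialization;

class LogEntry {
    public DateTime Timestamp { get; set; }

    // stored in the parquet file under the column name "event_name"
    [JsonPropertyName("event_name")]
    public string EventName { get; set; }

    // never written to or read from the file
    [JsonIgnore]
    public double InternalScore { get; set; }
}
```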
## Non-Trivial Types
You can also serialize more complex types supported by the Parquet format.
### Maps (Dictionaries)
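
A minimal sketch, assuming that dictionary-typed properties are mapped to the Parquet `MAP` type:

```csharp
class Device {
    public string? Id { get; set; }

    // serialised as a Parquet MAP column (assuming dictionary support in your Parquet.Net version)
    public Dictionary<string, int>? Tags { get; set; }
}
```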
## FAQ
**Q.** Can I specify a schema for serialisation/deserialisation?

**A.** No. Your class definition is the schema, so you don't need to supply it separately.