Commit c53049b

committed
Expression Tree Serializer
- documentation update
1 parent 244dd0e commit c53049b

11 files changed: +144 -206 lines changed

docs/complex-types.md

Lines changed: 15 additions & 53 deletions
@@ -11,71 +11,33 @@ Arrays *aka repeatable fields* are the basis for understanding how more complex data
 ```csharp
 var field = new DataField<IEnumerable<int>>("items");
 ```
-or
-```csharp
-var field= new DataField("items", DataType.Int32, isArray: true);
-```
-
-Apparently to check if the field is repeated you can always check `.IsArray` Boolean flag.
+To check if a field is repeated you can always test the `.IsArray` Boolean flag.
 
-Array column is also a usual instance of the `DataColumm` class, however in order to populate it you need to pass **repetition levels**. Repetition levels specify *at which level array starts* (please read more details on this in the link above).
+Parquet columns are flat, and a column can only keep simple elements, not other arrays, so to store an array of arrays you have to *flatten* it. For instance, to store the two arrays
 
-### Example
-
-Let's say you have a following array of integer arrays:
-
-```
-[1 2 3]
-[4 5]
-[6 7 8 9]
-```
+- `[1, 2, 3]`
+- `[4, 5]`
 
-This can be represented as:
+in a flat array, they become `[1, 2, 3, 4, 5]`, and that's exactly how parquet stores them. The problem starts when you want to read the values back: is this `[1, 2]` and `[3, 4, 5]`, or `[1]` and `[2, 3, 4, 5]`? There is no way to know without extra information. Therefore, parquet also stores that extra information as an extra column per data column, called *repetition levels*. In the previous example, our array of arrays expands into the following two columns:
 
-```
-values: [1 2 3 4 5 6 7 8 9]
-repetition levels: [0 1 1 0 1 0 1 1 1]
-```
+| #    | Data Column | Repetition Levels Column |
+| ---- | ----------- | ------------------------ |
+| 0    | 1           | 0                        |
+| 1    | 2           | 1                        |
+| 2    | 3           | 1                        |
+| 3    | 4           | 0                        |
+| 4    | 5           | 1                        |
 
-Where `0` means that this is a start of an array and `1` - it's a value continuation.
+In other words, the repetition level marks the level at which we have to create a new list for the current value: it tells you when to start a new list, and at which nesting level.
 
 To represent this in C# code:
 
 ```csharp
 var field = new DataField<IEnumerable<int>>("items");
 var column = new DataColumn(
     field,
-    new int[] { 1, 2, 3, 4, 5, 6, 7, 8, 9 },
-    new int[] { 0, 1, 1, 0, 1, 0, 1, 1, 1 });
+    new int[] { 1, 2, 3, 4, 5 },
+    new int[] { 0, 1, 1, 0, 1 });
 ```
 
-### Empty Arrays
-
-Empty arrays can be represented by simply having no element in them. For instance
-
-```
-[1 2]
-[]
-[3 4]
-```
-
-Goes into following:
-
-```
-values: [1 2 null 3 4]
-repetition levels: [0 1 0 0 1]
-```
-
-> Note that anything other than plain columns add a performance overhead due to obvious reasons for the need to pack and unpack data structures.
-
-## Structures
-
-todo
-
-## Lists
-
-todo
-
-## Maps
 
-##
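
To see the repetition-level rule from the updated page in action, here is a small self-contained sketch (illustrative only, not part of this commit or of the library API) that rebuilds the nested lists from the flat values and repetition levels above:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class RepetitionLevelDemo {
    // repetition level 0 starts a new list, 1 continues the current one
    static List<List<int>> Unflatten(int[] values, int[] repetitionLevels) {
        var lists = new List<List<int>>();
        for(int i = 0; i < values.Length; i++) {
            if(repetitionLevels[i] == 0)
                lists.Add(new List<int>());   // level 0: open a new list
            lists[^1].Add(values[i]);         // append to the most recent list
        }
        return lists;
    }

    static void Main() {
        List<List<int>> lists = Unflatten(
            new int[] { 1, 2, 3, 4, 5 },
            new int[] { 0, 1, 1, 0, 1 });
        // prints: [1, 2, 3] [4, 5]
        Console.WriteLine(string.Join(" ", lists.Select(l => $"[{string.Join(", ", l)}]")));
    }
}
```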

docs/serialisation.md

Lines changed: 4 additions & 0 deletions
@@ -56,6 +56,10 @@ Serialisation tries to fit into C# ecosystem like a ninja 🥷, including custom
 
 You can also serialize more complex types supported by the Parquet format.
 
+### Lists
+
+
+
 ### Maps (Dictionaries)
 
 
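
The Lists section is added empty in this commit. As a sketch of what list serialization looks like with the class-based `ParquetSerializer` covered by this document (the `MovementHistory` class and its properties are hypothetical):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using Parquet.Serialization;

class MovementHistory {
    public int PersonId { get; set; }
    // a List<T> property is serialized as a parquet list
    public List<int> ParentIds { get; set; } = new();
}

static class ListSerialisationDemo {
    static async Task Main() {
        var data = new List<MovementHistory> {
            new() { PersonId = 1, ParentIds = new List<int> { 10, 20 } },
            new() { PersonId = 2, ParentIds = new List<int> { 30 } }
        };
        using var ms = new MemoryStream();
        // the schema, including the list field, is inferred from the class
        await ParquetSerializer.SerializeAsync(data, ms);
    }
}
```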

docs/writing.md

Lines changed: 7 additions & 3 deletions
@@ -33,7 +33,7 @@ using(Stream fileStream = System.IO.File.OpenWrite("c:\\test.parquet")) {
 }
 ```
 
-## Specifying Compression Method and Level
+# Specifying Compression Method and Level
 
 After constructing `ParquetWriter` you can optionally set compression method ([`CompressionMethod`](../src/Parquet/CompressionMethod.cs)), which defaults to `Snappy`, and/or compression level ([`CompressionLevel`](https://learn.microsoft.com/en-us/dotnet/api/system.io.compression.compressionlevel?view=net-7.0)). Unless you have specific needs to override compression, the defaults are very reasonable.
 
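
As a quick illustration of that paragraph (a sketch only; the wrapper method is hypothetical, and the document's own example appears in the next hunk):

```csharp
using System.Threading.Tasks;
using Parquet;
using Parquet.Schema;

static class CompressionDemo {
    // sketch: override the default compression after constructing the writer
    static async Task WriteCompressedAsync(ParquetSchema schema, System.IO.Stream stream) {
        using ParquetWriter writer = await ParquetWriter.CreateAsync(schema, stream);
        writer.CompressionMethod = CompressionMethod.Gzip;                         // default is Snappy
        writer.CompressionLevel = System.IO.Compression.CompressionLevel.Optimal;  // .NET enum
        // ...create row groups and write columns as usual
    }
}
```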

@@ -48,7 +48,7 @@ using(ParquetWriter parquetWriter = await ParquetWriter.CreateAsync(schema, file
4848
```
4949

5050

51-
## Appending to Files
51+
# Appending to Files
5252

5353
This lib supports pseudo appending to files, however it's worth keeping in mind that *row groups are immutable* by design, therefore the only way to append is to create a new row group at the end of the file. It's worth mentioning that small row groups make data compression and reading extremely ineffective, therefore the larger your row group the better.
5454

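
The append example itself is outside the changed lines; as a rough sketch, appending amounts to reopening the file and passing the append flag when creating the writer (`AppendBatchAsync` is a hypothetical helper):

```csharp
using System.IO;
using System.Threading.Tasks;
using Parquet;
using Parquet.Data;
using Parquet.Schema;

static class AppendDemo {
    // hypothetical helper: adds one more row group to an existing file
    static async Task AppendBatchAsync(string path, ParquetSchema schema, DataColumn[] batch) {
        using Stream fs = System.IO.File.Open(path, FileMode.Open, FileAccess.ReadWrite);
        // append: true adds a new row group instead of overwriting the file
        using(ParquetWriter writer = await ParquetWriter.CreateAsync(schema, fs, append: true)) {
            using ParquetRowGroupWriter rg = writer.CreateRowGroup();
            foreach(DataColumn c in batch)
                await rg.WriteColumnAsync(c);
        }
    }
}
```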
@@ -96,7 +96,7 @@ Note that you have to specify that you are opening `ParquetWriter` in **append**
 
 Please keep in mind that row groups are designed to hold a large amount of data (50'000 rows on average), therefore try to find a large enough batch to append to the file. Do not treat a parquet file as a row stream by creating a row group and placing 1-2 rows in it, because this will both increase file size massively and cause a huge performance degradation for a client reading such a file.
 
-### Custom Metadata
+# Custom Metadata
 
 To read and write custom file metadata, you can use the `CustomMetadata` property on `ParquetFileReader` and `ParquetFileWriter`, i.e.
 
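
The hunk below shows only the read side of that example; the write side sets the same property on the writer. A sketch (the wrapper method is hypothetical):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using Parquet;
using Parquet.Schema;

static class MetadataDemo {
    // sketch: attach custom key/value metadata to the file being written
    static async Task WriteWithMetadataAsync(ParquetSchema schema, Stream output) {
        using ParquetWriter writer = await ParquetWriter.CreateAsync(schema, output);
        writer.CustomMetadata = new Dictionary<string, string> {
            ["key1"] = "value1",
            ["key2"] = "value2"
        };
        // ...write row groups as usual; metadata is stored in the file footer
    }
}
```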
@@ -122,3 +122,7 @@ using(ParquetReader reader = await ParquetReader.CreateAsync(ms)) {
     Assert.Equal("value2", reader.CustomMetadata["key2"]);
 }
 ```
+
+# Complex Types
+
+To write complex types (arrays, lists, maps, structs) read [this guide](complex-types.md).
Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
+using System;
+using System.Collections.Generic;
+using System.Text;
+using Parquet.File;
+using Xunit;
+
+namespace Parquet.Test.Extensions {
+    public class TypeExtensionsTest {
+        [Fact]
+        public void String_array_is_enumerable() {
+            Assert.True(typeof(string[]).TryExtractEnumerableType(out Type? et));
+            Assert.Equal(typeof(string), et);
+        }
+
+        [Fact]
+        public void String_is_not_enumerable() {
+            Assert.False(typeof(string).TryExtractEnumerableType(out Type? et));
+        }
+
+        [Fact]
+        public void StringIenumerable_is_enumerable() {
+            Assert.True(typeof(IEnumerable<string>).TryExtractEnumerableType(out Type? et));
+            Assert.Equal(typeof(string), et);
+        }
+
+        [Fact]
+        public void Nullable_element_is_not_stripped() {
+            Assert.True(typeof(IEnumerable<int?>).TryExtractEnumerableType(out Type? et));
+            Assert.Equal(typeof(int?), et);
+        }
+
+        [Fact]
+        public void ListOfT_is_ienumerable() {
+            Assert.True(typeof(List<int>).TryExtractEnumerableType(out Type? baseType));
+            Assert.Equal(typeof(int), baseType);
+        }
+    }
+}

src/Parquet.Test/RepeatableFieldsTest.cs

Lines changed: 0 additions & 34 deletions
This file was deleted.

src/Parquet.Test/SchemaTest.cs renamed to src/Parquet.Test/Schema/SchemaTest.cs

Lines changed: 31 additions & 39 deletions
@@ -11,7 +11,7 @@
 using System.Numerics;
 using Parquet.Encodings;
 
-namespace Parquet.Test {
+namespace Parquet.Test.Schema {
     public class SchemaTest : TestBase {
         [Fact]
         public void Creating_element_with_unsupported_type_throws_exception() {
@@ -30,7 +30,7 @@ public void SchemaElement_different_names_not_equal() {
 
         [Fact]
         public void SchemaElement_different_types_not_equal() {
-            Assert.NotEqual((Field)(new DataField<int>("id")), (Field)(new DataField<double>("id")));
+            Assert.NotEqual(new DataField<int>("id"), (Field)new DataField<double>("id"));
         }
 
         [Fact]
@@ -96,10 +96,10 @@ public void But_i_can_declare_a_dictionary() {
         [Fact]
         public void Map_fields_with_same_types_are_equal() {
             Assert.Equal(
-                new MapField("dictionary",
-                    new DataField<int>("key"),
+                new MapField("dictionary",
+                    new DataField<int>("key"),
                     new DataField<string>("value")),
-                new MapField("dictionary",
+                new MapField("dictionary",
                     new DataField<int>("key"),
                     new DataField<string>("value")));
         }
@@ -242,20 +242,19 @@ public void List_of_structures_valid_levels() {
         [InlineData("legacy-list-onearray.parquet")]
         [InlineData("legacy-list-onearray.v2.parquet")]
         public async Task BackwardCompat_list_with_one_array(string parquetFile) {
-            using(Stream input = OpenTestFile(parquetFile)) {
-                using(ParquetReader reader = await ParquetReader.CreateAsync(input)) {
-                    ParquetSchema schema = reader.Schema;
-
-                    //validate schema
-                    Assert.Equal("impurityStats", schema[3].Name);
-                    Assert.Equal(SchemaType.List, schema[3].SchemaType);
-                    Assert.Equal("gain", schema[4].Name);
-                    Assert.Equal(SchemaType.Data, schema[4].SchemaType);
-
-                    //smoke test we can read it
-                    using(ParquetRowGroupReader rg = reader.OpenRowGroupReader(0)) {
-                        DataColumn values4 = await rg.ReadColumnAsync((DataField)schema[4]);
-                    }
+            using(Stream input = OpenTestFile(parquetFile))
+            using(ParquetReader reader = await ParquetReader.CreateAsync(input)) {
+                ParquetSchema schema = reader.Schema;
+
+                //validate schema
+                Assert.Equal("impurityStats", schema[3].Name);
+                Assert.Equal(SchemaType.List, schema[3].SchemaType);
+                Assert.Equal("gain", schema[4].Name);
+                Assert.Equal(SchemaType.Data, schema[4].SchemaType);
+
+                //smoke test we can read it
+                using(ParquetRowGroupReader rg = reader.OpenRowGroupReader(0)) {
+                    DataColumn values4 = await rg.ReadColumnAsync((DataField)schema[4]);
                 }
             }
         }
@@ -266,32 +265,26 @@ public async Task Column_called_root() {
             var columns = new List<DataColumn>();
            columns.Add(new DataColumn(new DataField<string>("root"), new string[] { "AAA" }));
            columns.Add(new DataColumn(new DataField<string>("other"), new string[] { "BBB" }));
-            List<Field> fields = new List<Field>();
-            foreach(DataColumn column in columns) {
+            var fields = new List<Field>();
+            foreach(DataColumn column in columns)
                 fields.Add(column.Field);
-            }
 
             // the writer used to create structure type under "root" (https://github.com/aloneguid/parquet-dotnet/issues/143)
             var schema = new ParquetSchema(fields);
             var ms = new MemoryStream();
-            using(ParquetWriter parquetWriter = await ParquetWriter.CreateAsync(schema, ms)) {
-                using(ParquetRowGroupWriter groupWriter = parquetWriter.CreateRowGroup()) {
-                    foreach(DataColumn column in columns) {
-                        await groupWriter.WriteColumnAsync(column);
-                    }
-                }
-            }
+            using(ParquetWriter parquetWriter = await ParquetWriter.CreateAsync(schema, ms))
+            using(ParquetRowGroupWriter groupWriter = parquetWriter.CreateRowGroup())
+                foreach(DataColumn column in columns)
+                    await groupWriter.WriteColumnAsync(column);
 
             ms.Position = 0;
             using(ParquetReader parquetReader = await ParquetReader.CreateAsync(ms)) {
                 DataField[] dataFields = parquetReader.Schema.GetDataFields();
-                for(int i = 0; i < parquetReader.RowGroupCount; i++) {
-                    using(ParquetRowGroupReader groupReader = parquetReader.OpenRowGroupReader(i)) {
+                for(int i = 0; i < parquetReader.RowGroupCount; i++)
+                    using(ParquetRowGroupReader groupReader = parquetReader.OpenRowGroupReader(i))
                     foreach(DataColumn column in columns) {
                         DataColumn c = await groupReader.ReadColumnAsync(column.Field);
                     }
-                }
-            }
             }
         }
 
@@ -301,14 +294,13 @@ public async Task ReadSchemaActuallyEqualToWriteSchema() {
             var schema = new ParquetSchema(field);
 
             using(var memoryStream = new MemoryStream()) {
-                using(var parquetWriter = await ParquetWriter.CreateAsync(schema, memoryStream)) {
-                    using(var groupWriter = parquetWriter.CreateRowGroup()) {
-                        var dataColumn = new DataColumn(field, new List<DateTime>() { DateTime.Now }.ToArray());
-                        await groupWriter.WriteColumnAsync(dataColumn);
-                    }
+                using(ParquetWriter parquetWriter = await ParquetWriter.CreateAsync(schema, memoryStream))
+                using(ParquetRowGroupWriter groupWriter = parquetWriter.CreateRowGroup()) {
+                    var dataColumn = new DataColumn(field, new List<DateTime>() { DateTime.Now }.ToArray());
+                    await groupWriter.WriteColumnAsync(dataColumn);
                 }
 
-                using(var parquetReader = await ParquetReader.CreateAsync(memoryStream)) {
+                using(ParquetReader parquetReader = await ParquetReader.CreateAsync(memoryStream)) {
                     parquetReader.Schema.Fields.ToString();
 
                     Assert.Single(schema.Fields);

src/Parquet.Test/TypeExtensionsTest.cs

Lines changed: 0 additions & 45 deletions
This file was deleted.
