-Apparently to check if the field is repeated you can always check `.IsArray` Boolean flag.
+To check whether a field is repeated, you can always test its `.IsArray` Boolean flag.

-Array column is also a usual instance of the `DataColumn` class, however in order to populate it you need to pass **repetition levels**. Repetition levels specify *at which level array starts* (please read more details on this in the link above).
+Parquet columns are flat, so to store an array in a column that can hold only simple elements and not other arrays, you have to *flatten* it. For instance, to store two elements:

-### Example
-
-Let's say you have a following array of integer arrays:
-
-```
-[1 2 3]
-[4 5]
-[6 7 8 9]
-```
+- `[1, 2, 3]`
+- `[4, 5]`

-This can be represented as:
+in a flat array, it will look like `[1, 2, 3, 4, 5]`. And that's exactly how parquet stores them. The problem starts when you want to read the values back. Is this `[1, 2]` and `[3, 4, 5]`, or `[1]` and `[2, 3, 4, 5]`? There's no way to know without extra information. Therefore, parquet also stores that extra information in an extra column per data column, which is called *repetition levels*. In the previous example, our array of arrays will expand into the following two columns:

-```
-values: [1 2 3 4 5 6 7 8 9]
-repetition levels: [0 1 1 0 1 0 1 1 1]
-```
+| # | Data Column | Repetition Levels Column |
+| ---- | ----------- | ------------------------ |
+| 0 | 1 | 0 |
+| 1 | 2 | 1 |
+| 2 | 3 | 1 |
+| 3 | 4 | 0 |
+| 4 | 5 | 1 |

-Where `0` means that this is a start of an array and `1` - it's a value continuation.
+In other words, the repetition level is the level at which we have to create a new list for the current value. It can be seen as a marker of when to start a new list and at which level.

To represent this in C# code:

```csharp
var field = new DataField<IEnumerable<int>>("items");
var column = new DataColumn(
    field,
-    new int[] { 1, 2, 3, 4, 5, 6, 7, 8, 9 },
-    new int[] { 0, 1, 1, 0, 1, 0, 1, 1, 1 });
+    new int[] { 1, 2, 3, 4, 5 },
+    new int[] { 0, 1, 1, 0, 1 });
```
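
The flattening described above can be sketched without the library at all. The snippet below is only an illustration for a single level of nesting: `RepetitionLevelsSketch`, `Flatten` and `Unflatten` are hypothetical names that are not part of the library, and empty inner lists are deliberately not handled.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of the library.
public static class RepetitionLevelsSketch
{
    // Flattens a list of lists into values plus repetition levels:
    // 0 = the value starts a new list, 1 = the value continues the current list.
    // Assumption: every inner list has at least one element.
    public static (int[] Values, int[] Levels) Flatten(IEnumerable<IEnumerable<int>> lists)
    {
        var values = new List<int>();
        var levels = new List<int>();
        foreach (IEnumerable<int> list in lists)
        {
            bool first = true;
            foreach (int value in list)
            {
                values.Add(value);
                levels.Add(first ? 0 : 1);
                first = false;
            }
        }
        return (values.ToArray(), levels.ToArray());
    }

    // Rebuilds the lists: a level of 0 opens a new list, 1 appends to the last one.
    public static List<List<int>> Unflatten(int[] values, int[] levels)
    {
        if (values.Length != levels.Length)
            throw new ArgumentException("values and levels must have the same length");

        var result = new List<List<int>>();
        for (int i = 0; i < values.Length; i++)
        {
            if (levels[i] == 0 || result.Count == 0)
                result.Add(new List<int>());
            result[result.Count - 1].Add(values[i]);
        }
        return result;
    }
}
```

Calling `Flatten` on `[1, 2, 3]` and `[4, 5]` gives the values `1, 2, 3, 4, 5` with the levels `0, 1, 1, 0, 1`, the same two arrays that are passed to `DataColumn`, and `Unflatten` turns them back into the two original lists.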

-### Empty Arrays
-
-Empty arrays can be represented by simply having no element in them. For instance
-
-```
-[1 2]
-[]
-[3 4]
-```
-
-Goes into following:
-
-```
-values: [1 2 null 3 4]
-repetition levels: [0 1 0 0 1]
-```
-
-> Note that anything other than plain columns add a performance overhead due to obvious reasons for the need to pack and unpack data structures.
After constructing `ParquetWriter`, you can optionally set the compression method ([`CompressionMethod`](../src/Parquet/CompressionMethod.cs)), which defaults to `Snappy`, and/or the compression level ([`CompressionLevel`](https://learn.microsoft.com/en-us/dotnet/api/system.io.compression.compressionlevel?view=net-7.0)). Unless you have specific needs to override compression, the defaults are very reasonable.
This library supports pseudo-appending to files. Keep in mind, however, that *row groups are immutable* by design, so the only way to append is to create a new row group at the end of the file. Small row groups make data compression and reading extremely inefficient, so the larger your row group, the better.
@@ -96,7 +96,7 @@ Note that you have to specify that you are opening `ParquetWriter` in **append**

Please keep in mind that row groups are designed to hold a large amount of data (50,000 rows on average), so try to find a large enough batch to append to the file. Do not treat a parquet file as a row stream by creating a row group and placing 1-2 rows in it, because this will both increase the file size massively and cause a huge performance degradation for any client reading such a file.
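
One way to follow this advice is to buffer incoming rows in memory and hand them over in large batches, so that each append writes a single big row group. The sketch below shows only the buffering part. `RowBatcher<T>` is a hypothetical helper, not part of the library, and the `flush` callback is assumed to be the place where the caller appends one row group to the file.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of the library.
// Buffers rows and passes them to a callback in large batches.
public sealed class RowBatcher<T>
{
    private readonly int _batchSize;
    private readonly Action<IReadOnlyList<T>> _flush;
    private List<T> _buffer = new List<T>();

    public RowBatcher(int batchSize, Action<IReadOnlyList<T>> flush)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));
        _batchSize = batchSize;
        _flush = flush ?? throw new ArgumentNullException(nameof(flush));
    }

    // Adds one row and flushes automatically once the batch is full.
    public void Add(T row)
    {
        _buffer.Add(row);
        if (_buffer.Count >= _batchSize) Flush();
    }

    // Hands the buffered rows to the callback; call once more at the end
    // to write the final, possibly smaller, batch.
    public void Flush()
    {
        if (_buffer.Count == 0) return;
        List<T> batch = _buffer;
        _buffer = new List<T>();
        _flush(batch);
    }
}
```

With a batch size in the tens of thousands, the callback runs once per large batch instead of once per row, which keeps the number of row groups in the file small.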

-### Custom Metadata
+# Custom Metadata

To read and write custom file metadata, you can use the `CustomMetadata` property on `ParquetFileReader` and `ParquetFileWriter`, i.e.