modules/components/pages/processors/parquet_decode.adoc (+46 −16 lines)
@@ -17,42 +17,72 @@ Introduced in version 4.4.0.
 endif::[]
 
 ```yml
-# Config fields, showing default values
+# Configuration fields, showing default values
 label: ""
-parquet_decode: {}
+parquet_decode:
+  handle_logical_types: v1
 ```
 
-This processor uses https://github.com/parquet-go/parquet-go[https://github.com/parquet-go/parquet-go^], which is itself experimental. Therefore changes could be made into how this processor functions outside of major version releases.
+== Fields
+
+=== `handle_logical_types`
+
+Set to `v2` to enable enhanced decoding of logical types, or keep the default value (`v1`) to ignore logical type metadata when decoding values.
+
+In Parquet format, logical types are represented using standard physical types along with metadata that provides additional context. For example, UUIDs are stored as a `FIXED_LEN_BYTE_ARRAY` physical type, but the schema metadata identifies them as UUIDs. By enabling `v2`, this processor uses the metadata descriptions of logical types to produce more meaningful values during decoding.
+
+NOTE: For backward compatibility, this field enables logical-type handling for the specified Parquet format version, and all earlier versions. When creating new pipelines, Redpanda recommends that you use the newest documented version.
+
+*Type*: `string`
+
+*Default*: `v1`
+
+Options:
+
+[cols="2,8"]
+|===
+| Option | Description
+
+| `v1`
+| No special handling of logical types.
+
+| `v2`
+a| Logical types with enhanced decoding:
+
+* `TIMESTAMP`: Decodes as an RFC3339 string describing the time. If the `isAdjustedToUTC` flag is set to `true` in the Parquet file, the time zone is set to UTC. If the flag is set to `false`, the time zone is set to local time.
+
+* `UUID`: Decodes as a string: `00112233-4455-6677-8899-aabbccddeeff`.
+
+|===
+
+```yml
+# Examples
+
+handle_logical_types: v2
+```
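A pipeline opts in to the enhanced decoding by setting this field on the processor. The following sketch (the column names `user_id` and `created_at` are hypothetical) summarizes how the same columns decode under each setting, based on the descriptions above:

```yml
# With handle_logical_types: v1, logical metadata is ignored and raw
# physical values are returned, for example the 16 raw bytes of a
# FIXED_LEN_BYTE_ARRAY for a UUID column.
#
# With handle_logical_types: v2, logical metadata is applied:
#   user_id:    00112233-4455-6677-8899-aabbccddeeff   # UUID as a string
#   created_at: "2024-01-01T00:00:00Z"                 # TIMESTAMP as RFC3339
pipeline:
  processors:
    - parquet_decode:
        handle_logical_types: v2
```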
 
 == Examples
 
-[tabs]
-======
-Reading Parquet Files from AWS S3::
-
---
+
+=== Reading Parquet files from AWS S3
 
-In this example we consume files from AWS S3 as they're written by listening onto an SQS queue for upload events. We make sure to use the `to_the_end` scanner which means files are read into memory in full, which then allows us to use a `parquet_decode` processor to expand each file into a batch of messages. Finally, we write the data out to local files as newlinedelimited JSON.
+In this example, a pipeline consumes Parquet files as soon as they are uploaded to an AWS S3 bucket. The pipeline listens to an SQS queue for upload events, and uses the `to_the_end` scanner to read the files into memory in full. The `parquet_decode` processor then decodes each file into a batch of structured messages. Finally, the data is written to local files in newline-delimited JSON format.
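A configuration matching this description might look like the following sketch. The bucket name, queue URL, and output path are placeholders, and `meta("s3_key")` assumes the default object-key metadata set by the `aws_s3` input:

```yml
input:
  aws_s3:
    bucket: example-bucket # placeholder
    sqs:
      url: https://sqs.us-east-1.amazonaws.com/123456789012/example-queue # placeholder
    scanner:
      to_the_end: {} # read each file into memory in full

pipeline:
  processors:
    - parquet_decode: {} # expand each file into a batch of messages

output:
  file:
    codec: lines # newline-delimited JSON
    path: './decoded-${! meta("s3_key") }.jsonl' # placeholder
```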
modules/components/pages/processors/parquet_encode.adoc (+43 −53 lines)
@@ -12,6 +12,7 @@ component_type_dropdown::[]
 
 Encodes https://parquet.apache.org/docs/[Parquet files^] from a batch of structured messages.
 
+
 ifndef::env-cloud[]
 Introduced in version 4.4.0.
 endif::[]
@@ -23,7 +24,7 @@ Common::
23
24
--
24
25
25
26
```yml
26
-
# Common config fields, showing default values
27
+
# Common configuration fields, showing default values
27
28
label: ""
28
29
parquet_encode:
29
30
schema: [] # No default (required)
@@ -36,7 +37,7 @@ Advanced::
36
37
--
37
38
38
39
```yml
39
-
# All config fields, showing default values
40
+
# All configuration fields, showing default values
40
41
label: ""
41
42
parquet_encode:
42
43
schema: [] # No default (required)
@@ -47,64 +48,25 @@ parquet_encode:
 --
 ======
 
-This processor uses https://github.com/parquet-go/parquet-go[https://github.com/parquet-go/parquet-go^], which is itself experimental. Therefore changes could be made into how this processor functions outside of major version releases.
-
-
-== Examples
-
-[tabs]
-======
-Writing Parquet Files to AWS S3::
-
---
-
-In this example we use the batching mechanism of an `aws_s3` output to collect a batch of messages in memory, which then converts it to a parquet file and uploads it.

[…]

-The type of the column, only applicable for leaf columns with no child fields. Some logical types can be specified here such as UTF8.
+The data type of the column to encode. This field is only applicable for leaf columns with no child fields. The following options include logical types.
 
 *Type*: `string`
@@ -116,12 +78,15 @@ Options:
 , `FLOAT`
 , `DOUBLE`
 , `BYTE_ARRAY`
-, `UTF8`
-.
+, `TIMESTAMP`
+, `BSON`
+, `ENUM`
+, `JSON`
+, `UUID`
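A schema can mix physical and logical types in its columns. The fragment below is a sketch with hypothetical column names:

```yml
parquet_encode:
  schema:
    - { name: id, type: UUID }              # logical type
    - { name: created_at, type: TIMESTAMP } # logical type
    - { name: weight, type: DOUBLE }        # physical type
    - { name: payload, type: JSON }         # logical type
```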
 
 === `schema[].repeated`
 
-Whether the field is repeated.
+Whether a field is repeated.
 
 *Type*: `bool`
@@ -130,7 +95,7 @@ Whether the field is repeated.
 
 === `schema[].optional`
 
-Whether the field is optional.
+Whether a field is optional.
 
 *Type*: `bool`
@@ -162,7 +127,7 @@ The default compression type to use for fields.
 
 *Type*: `string`
 
-*Default*: `"uncompressed"`
+*Default*: `uncompressed`
 
 Options:
 `uncompressed`
@@ -171,16 +136,14 @@ Options:
 , `brotli`
 , `zstd`
 , `lz4raw`
-.
 
 === `default_encoding`
 
-The default encoding type to use for fields. A custom default encoding is only necessary when consuming data with libraries that do not support `DELTA_LENGTH_BYTE_ARRAY` and is therefore best left unset where possible.
-
+The default encoding type to use for fields. A custom default encoding is only necessary when consuming data with libraries that do not support `DELTA_LENGTH_BYTE_ARRAY`.
 
 *Type*: `string`
 
-*Default*: `"DELTA_LENGTH_BYTE_ARRAY"`
+*Default*: `DELTA_LENGTH_BYTE_ARRAY`
 
 ifndef::env-cloud[]
 Requires version 4.11.0 or newer
@@ -189,6 +152,33 @@ endif::[]
 Options:
 `DELTA_LENGTH_BYTE_ARRAY`
 , `PLAIN`
-.
+
+
+== Examples
+
+=== Writing Parquet files to AWS S3
+
+In this example, a pipeline uses an `aws_s3` output as a batching mechanism. Messages are collected in memory and encoded into a Parquet file, which is then uploaded to an AWS S3 bucket.
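A configuration matching this description might look like the following sketch; the bucket name, path interpolation, schema, and batching thresholds are illustrative placeholders:

```yml
output:
  aws_s3:
    bucket: example-bucket # placeholder
    path: 'rows-${! timestamp_unix() }-${! uuid_v4() }.parquet'
    batching:
      count: 1000 # flush a batch after 1000 messages...
      period: 10s # ...or after 10 seconds, whichever comes first
      processors:
        - parquet_encode:
            schema:
              - { name: id, type: INT64 }
              - { name: weight, type: DOUBLE }
              - { name: content, type: BYTE_ARRAY }
            default_compression: zstd
```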