Commit f679813

DOC-1242 Handling logical types in parquet_encode and parquet_decode processors (#222)
1 parent d54a31f commit f679813

2 files changed: +89 -69 lines changed

modules/components/pages/processors/parquet_decode.adoc

Lines changed: 46 additions & 16 deletions
@@ -17,42 +17,72 @@ Introduced in version 4.4.0.
 endif::[]
 
 ```yml
-# Config fields, showing default values
+# Configuration fields, showing default values
 label: ""
-parquet_decode: {}
+parquet_decode:
+  handle_logical_types: v1
 ```
 
-This processor uses https://github.com/parquet-go/parquet-go[https://github.com/parquet-go/parquet-go^], which is itself experimental. Therefore changes could be made into how this processor functions outside of major version releases.
+== Fields
+
+=== `handle_logical_types`
+
+Set to `v2` to enable enhanced decoding of logical types, or keep the default value (`v1`) to ignore logical type metadata when decoding values.
+
+In Parquet format, logical types are represented using standard physical types along with metadata that provides additional context. For example, UUIDs are stored as a `FIXED_LEN_BYTE_ARRAY` physical type, but the schema metadata identifies them as UUIDs. By enabling `v2`, this processor uses the metadata descriptions of logical types to produce more meaningful values during decoding.
+
+NOTE: For backward compatibility, this field enables logical-type handling for the specified Parquet format version, and all earlier versions. When creating new pipelines, Redpanda recommends that you use the newest documented version.
+
+*Type*: `string`
+
+*Default*: `v1`
+
+Options:
+
+[cols="2,8"]
+|===
+| Option | Description
+
+| `v1`
+| No special handling of logical types.
+
+| `v2`
+a| Logical types with enhanced decoding:
+
+* `TIMESTAMP`: Decodes as an RFC3339 string describing the time. If the `isAdjustedToUTC` flag is set to `true` in the Parquet file, the time zone is set to UTC. If the flag is set to `false`, the time zone is set to local time.
+
+* `UUID`: Decodes as a string: `00112233-4455-6677-8899-aabbccddeeff`.
+
+|===
+
+```yml
+# Examples
+
+handle_logical_types: v2
+```
 
 == Examples
 
-[tabs]
-======
-Reading Parquet Files from AWS S3::
-+
---
+=== Reading Parquet files from AWS S3
 
-In this example we consume files from AWS S3 as they're written by listening onto an SQS queue for upload events. We make sure to use the `to_the_end` scanner which means files are read into memory in full, which then allows us to use a `parquet_decode` processor to expand each file into a batch of messages. Finally, we write the data out to local files as newline delimited JSON.
+In this example, a pipeline consumes Parquet files as soon as they are uploaded to an AWS S3 bucket. The pipeline listens to an SQS queue for upload events, and uses the `to_the_end` scanner to read the files into memory in full. The `parquet_decode` processor then decodes each file into a batch of structured messages. Finally, the data is written to local files in newline-delimited JSON format.
 
 ```yaml
 input:
   aws_s3:
     bucket: TODO
-    prefix: foos/
+    prefix: files/
     scanner:
       to_the_end: {}
     sqs:
       url: TODO
   processors:
-    - parquet_decode: {}
-
+    - parquet_decode:
+        handle_logical_types: v2
 output:
   file:
     codec: lines
-    path: './foos/${! meta("s3_key") }.jsonl'
+    path: './files/${! meta("s3_key") }.jsonl'
 ```
 
---
-======
-
 // end::single-source[]
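
To see what `v2` changes in practice, here is a minimal sketch that is not part of this commit: it decodes a local Parquet file instead of S3 objects, and the `./events.parquet` path and the `created_at`/`user_id` column names are hypothetical. The commented output assumes the file's schema marks those columns with the `TIMESTAMP` and `UUID` logical types.

```yaml
# Hypothetical sketch: decoding a local Parquet file that contains logical types.
input:
  file:
    paths: [ ./events.parquet ]
    scanner:
      to_the_end: {}
  processors:
    - parquet_decode:
        handle_logical_types: v2

output:
  stdout: {}

# With handle_logical_types: v2, a decoded row might look like:
#   {"created_at":"2024-05-01T12:34:56Z","user_id":"00112233-4455-6677-8899-aabbccddeeff"}
# With the default v1, the same columns surface as raw physical values instead.
```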

modules/components/pages/processors/parquet_encode.adoc

Lines changed: 43 additions & 53 deletions
@@ -12,6 +12,7 @@ component_type_dropdown::[]
 
 Encodes https://parquet.apache.org/docs/[Parquet files^] from a batch of structured messages.
 
+
 ifndef::env-cloud[]
 Introduced in version 4.4.0.
 endif::[]
@@ -23,7 +24,7 @@ Common::
 --
 
 ```yml
-# Common config fields, showing default values
+# Common configuration fields, showing default values
 label: ""
 parquet_encode:
   schema: [] # No default (required)
@@ -36,7 +37,7 @@ Advanced::
 --
 
 ```yml
-# All config fields, showing default values
+# All configuration fields, showing default values
 label: ""
 parquet_encode:
   schema: [] # No default (required)
@@ -47,64 +48,25 @@ parquet_encode:
 --
 ======
 
-This processor uses https://github.com/parquet-go/parquet-go[https://github.com/parquet-go/parquet-go^], which is itself experimental. Therefore changes could be made into how this processor functions outside of major version releases.
-
-
-== Examples
-
-[tabs]
-======
-Writing Parquet Files to AWS S3::
-+
---
-
-In this example we use the batching mechanism of an `aws_s3` output to collect a batch of messages in memory, which then converts it to a parquet file and uploads it.
-
-```yaml
-output:
-  aws_s3:
-    bucket: TODO
-    path: 'stuff/${! timestamp_unix() }-${! uuid_v4() }.parquet'
-    batching:
-      count: 1000
-      period: 10s
-      processors:
-        - parquet_encode:
-            schema:
-              - name: id
-                type: INT64
-              - name: weight
-                type: DOUBLE
-              - name: content
-                type: BYTE_ARRAY
-            default_compression: zstd
-```
-
---
-======
-
 == Fields
 
 === `schema`
 
 Parquet schema.
 
-
 *Type*: `array`
 
-
 === `schema[].name`
 
-The name of the column.
+The name of the column you want to encode.
 
 
 *Type*: `string`
 
 
 === `schema[].type`
 
-The type of the column, only applicable for leaf columns with no child fields. Some logical types can be specified here such as UTF8.
-
+The data type of the column to encode. This field is only applicable for leaf columns with no child fields. The following options include logical types.
 
 *Type*: `string`
 
@@ -116,12 +78,15 @@ Options:
 , `FLOAT`
 , `DOUBLE`
 , `BYTE_ARRAY`
-, `UTF8`
-.
+, `TIMESTAMP`
+, `BSON`
+, `ENUM`
+, `JSON`
+, `UUID`
 
 === `schema[].repeated`
 
-Whether the field is repeated.
+Whether a field is repeated.
 
 
 *Type*: `bool`
@@ -130,7 +95,7 @@ Whether the field is repeated.
 
 === `schema[].optional`
 
-Whether the field is optional.
+Whether a field is optional.
 
 
 *Type*: `bool`
@@ -162,7 +127,7 @@ The default compression type to use for fields.
 
 *Type*: `string`
 
-*Default*: `"uncompressed"`
+*Default*: `uncompressed`
 
 Options:
 `uncompressed`
@@ -171,16 +136,14 @@ Options:
 , `brotli`
 , `zstd`
 , `lz4raw`
-.
 
 === `default_encoding`
 
-The default encoding type to use for fields. A custom default encoding is only necessary when consuming data with libraries that do not support `DELTA_LENGTH_BYTE_ARRAY` and is therefore best left unset where possible.
-
+The default encoding type to use for fields. A custom default encoding is only necessary when consuming data with libraries that do not support `DELTA_LENGTH_BYTE_ARRAY`.
 
 *Type*: `string`
 
-*Default*: `"DELTA_LENGTH_BYTE_ARRAY"`
+*Default*: `DELTA_LENGTH_BYTE_ARRAY`
 
 ifndef::env-cloud[]
 Requires version 4.11.0 or newer
@@ -189,6 +152,33 @@ endif::[]
 Options:
 `DELTA_LENGTH_BYTE_ARRAY`
 , `PLAIN`
-.
+
+== Examples
+
+=== Writing Parquet files to AWS S3
+
+In this example, a pipeline uses an `aws_s3` output as a batching mechanism. Messages are collected in memory and encoded into a Parquet file, which is then uploaded to an AWS S3 bucket.
+
+```yaml
+output:
+  aws_s3:
+    bucket: TODO
+    path: 'stuff/${! timestamp_unix() }-${! uuid_v4() }.parquet'
+    batching:
+      count: 1000
+      period: 10s
+      processors:
+        - parquet_encode:
+            schema:
+              - name: id
+                type: INT64
+              - name: weight
+                type: DOUBLE
+              - name: content
+                type: BYTE_ARRAY
+            default_compression: zstd
+```
+
+
 
 // end::single-source[]
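
Because the `schema[].type` options now list logical types, an encode schema can reference them directly. The following sketch is not part of this commit; the column names are hypothetical, and it simply assumes the type names are accepted exactly as listed in the options above.

```yaml
# Hypothetical sketch: an encode schema using the newly documented logical types.
pipeline:
  processors:
    - parquet_encode:
        schema:
          - name: id
            type: UUID
          - name: created_at
            type: TIMESTAMP
          - name: payload
            type: JSON
        default_compression: zstd
```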
