Description
- What's the issue
In the Parquet files generated by the conversion, string values come out base64-encoded. This happens to every string field, which is probably not what users intend.
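To make the symptom concrete, here is a minimal sketch that uses only the Java standard library; no Parquet is involved, and the class name and the sample value "role" are made up for illustration. It renders the same UTF-8 bytes two ways: as base64 (what readers of the generated files currently see) and as decoded text (what they would expect).

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical demo class: not part of the converter.
public class Base64Symptom {
    // Renders raw bytes the way a generic tool renders untyped binary data.
    static String asBase64(String value) {
        byte[] raw = value.getBytes(StandardCharsets.UTF_8);
        return Base64.getEncoder().encodeToString(raw);
    }

    // Renders the same bytes as text, which is what a string column should give.
    static String asUtf8(String value) {
        byte[] raw = value.getBytes(StandardCharsets.UTF_8);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(asBase64("role")); // prints "cm9sZQ=="
        System.out.println(asUtf8("role"));   // prints "role"
    }
}
```

The stored bytes are identical in both cases; only the interpretation differs.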
Take RelationWriteSupport.java as an example.
memberRoleType = new PrimitiveType(REQUIRED, BINARY, "role");
When we call this PrimitiveType constructor, the field's logicalTypeAnnotation is left as null. The Parquet converter therefore knows nothing about the field's actual type and falls back to its default handling of binary data, which is base64.
- How to fix
To fix this, we can set the logicalTypeAnnotation parameter to the string type. Since we know the tags are actually strings, this should be safe: the Parquet converter will then know the field is a string and decode it as UTF-8 instead of base64.
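A possible shape for the change is sketched below. It is not the project's actual patch: it assumes parquet-column 1.11.0 or later on the classpath (where LogicalTypeAnnotation is available) and uses the Types builder rather than a PrimitiveType constructor. The class and method names are hypothetical.

```java
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.PrimitiveType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Type.Repetition;
import org.apache.parquet.schema.Types;

// Hypothetical sketch class: compares the current and the proposed field definitions.
public class StringFieldSketch {
    // Current form: no logical type, so the column is plain BINARY.
    static PrimitiveType untypedRole() {
        return new PrimitiveType(Repetition.REQUIRED, PrimitiveTypeName.BINARY, "role");
    }

    // Proposed form: the same BINARY column, annotated as a string.
    static PrimitiveType stringRole() {
        return Types.required(PrimitiveTypeName.BINARY)
                .as(LogicalTypeAnnotation.stringType())
                .named("role");
    }

    public static void main(String[] args) {
        // Print both schema fragments to compare the annotations.
        System.out.println(untypedRole());
        System.out.println(stringRole());
    }
}
```

The physical type stays BINARY in both cases, so the bytes written to the file should not change; only the schema metadata tells readers to treat them as text. The same annotation would need to be applied to every string field, not just role.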