Strings are in base64 encoding after conversion. #15

@ciqle

  • What's the issue
    In the Parquet files generated by the conversion, strings are encoded in base64. This affects every string field, which is probably not what users intend.
    Take RelationWriteSupport.java as an example:
    memberRoleType = new PrimitiveType(REQUIRED, BINARY, "role");
    Calling this PrimitiveType constructor sets the field's logicalTypeAnnotation to null. The Parquet converter therefore knows nothing about the field's actual type and falls back to its default handling of binary data, which is base64.

  • How to fix
    Set the logicalTypeAnnotation parameter to the string type. We know the tags are actually strings, so this should be safe, and the Parquet converter will then know the field is a string and convert it as UTF-8 instead of base64.
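
To make the symptom concrete, here is a minimal sketch using only the Java standard library. It assumes, as described above, that a reader shows an un-annotated BINARY value as base64 and a string-annotated one as UTF-8 text. The class and method names (BinaryRenderingDemo, renderAsRawBinary, renderAsString) and the sample value "outer" are hypothetical and not part of the converter.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical demo class; it only illustrates the two renderings of the same bytes.
class BinaryRenderingDemo {

    // A reader with no logical type sees opaque bytes and (by assumption) prints base64.
    static String renderAsRawBinary(byte[] value) {
        return Base64.getEncoder().encodeToString(value);
    }

    // A reader that knows the column is a string decodes the same bytes as UTF-8.
    static String renderAsString(byte[] value) {
        return new String(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Sample role value (an assumption for illustration), stored as UTF-8 bytes.
        byte[] role = "outer".getBytes(StandardCharsets.UTF_8);

        System.out.println(renderAsRawBinary(role)); // b3V0ZXI=
        System.out.println(renderAsString(role));    // outer

        // In the schema itself, the annotated column could likely be declared with
        // the parquet-java builder API (assuming version 1.11 or later), roughly:
        //   Types.required(BINARY).as(LogicalTypeAnnotation.stringType()).named("role")
    }
}
```

The stored bytes are identical in both cases; only the type information available to the reader changes, which is why annotating the schema should be enough and no data conversion is needed.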
