ISO_8859_1 breaking UTF-8 in CLP logtype String #42

@intr3p1d

Description

Bug

Cause

clp-ffi-java internally uses StandardCharsets.ISO_8859_1 in EncodedMessage.getLogTypeAsString():

    public String getLogTypeAsString() {
        if (null == logtype) {
            return null;
        } else {
            return new String(logtype, StandardCharsets.ISO_8859_1);
        }
    }

(getDictionaryVarsAsStrings does the same.)

Effect

https://github.com/apache/pinot/blob/0a4398634be81cdbbe891b3da249134ef98743e7/pinot-plugins/pinot-input-format/pinot-clp-log/src/main/java/org/apache/pinot/plugin/inputformat/clplog/CLPLogRecordExtractor.java#L151-L154

As a result, characters outside ISO_8859_1 come out corrupted. For example, this turns
Request processing failed: jakarta.validation.ConstraintViolationException: getAgentsList.from: /u0011 이상이어야 합니다
into
Request processing failed: jakarta.validation.ConstraintViolationException: getAgentsList.from: � ����� ���

The full message is restored correctly once it goes through the decode function, but the logtype string on its own stays corrupted, which does not seem appropriate when the logtype is used directly (LIKE searches, etc.).
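As a possible stopgap on the caller side, the corrupted string can be re-encoded with ISO_8859_1 and decoded again as UTF-8. This works because Java's ISO_8859_1 decoding maps each byte to exactly one char, so no bytes are lost. The sketch below is only an illustration: `LogtypeRedecode` and `redecodeAsUtf8` are hypothetical names, not part of clp-ffi-java, and it assumes that the string came straight from getLogTypeAsString() and that the underlying logtype bytes are valid UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class LogtypeRedecode {
    // Hypothetical helper (not part of clp-ffi-java).
    // Assumption: the input was produced by decoding UTF-8 bytes as ISO_8859_1,
    // so every char is <= U+00FF and maps back to exactly one original byte.
    static String redecodeAsUtf8(String latin1Decoded) {
        if (null == latin1Decoded) {
            return null;
        }
        // Recover the original bytes, then decode them with the intended charset.
        byte[] raw = latin1Decoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }
}
```

A caller such as the Pinot extractor linked above could apply this to the returned logtype before storing it, although fixing the charset inside the library would avoid the extra copy.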

clp-ffi version

0.4.4

Environment

Linux, Java
https://github.com/apache/pinot/blob/1d490c1ac3268103a16d77ddfa70f8f8602f9e96/pom.xml#L160

Reproduction steps

Encode a message containing characters that ISO_8859_1 cannot represent:
Request processing failed: jakarta.validation.ConstraintViolationException: getAgentsList.from: /u0011 이상이어야 합니다
Then get the logtype as a string:
Request processing failed: jakarta.validation.ConstraintViolationException: getAgentsList.from: � ����� ���
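The charset mismatch itself can be shown with the JDK alone, without clp-ffi. The following is a minimal sketch: `Latin1DecodeRepro` and `decodeAsLatin1` are illustrative names that only mirror the decoding step quoted under Cause, and the input is a short fragment of the Korean text above, written with Unicode escapes.

```java
import java.nio.charset.StandardCharsets;

public class Latin1DecodeRepro {
    // Mirrors the decoding step from getLogTypeAsString(): bytes are read as ISO_8859_1.
    static String decodeAsLatin1(byte[] logtype) {
        return null == logtype ? null : new String(logtype, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "\uC774\uC0C1"; // two Hangul syllables from the message above
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        String garbled = decodeAsLatin1(utf8);

        // Each UTF-8 byte becomes its own char, so the text no longer matches.
        System.out.println(original.equals(garbled));        // false
        System.out.println(garbled.length() == utf8.length); // true
    }
}
```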
