Byte strings aren't able to compare UTF32 strings #177

@ate47

I've noticed that if we compare a character that needs a surrogate pair in UTF-16 (a code point above U+FFFF), for example 𦳣 (U+26CE3), with one that doesn't, for example U+F4D1, the byte string comparison and the String comparison don't give the same result.

Here is the code to reproduce it. I've built the strings from their code points.

String ss1 = new String(Character.toChars(0x26ce3)); // 𦳣
String ss2 = new String(Character.toChars(0xf4d1)); // U+F4D1 (private-use character, no surrogate needed)

CompactString b1 = new CompactString(ss1);
CompactString b2 = new CompactString(ss2);

assertEquals(ss1, b1.toString());
assertEquals(ss2, b2.toString());

// I clamp the value between -1 and 1 to have the same result
int cmpByte = Math.max(-1, Math.min(1, b1.compareTo(b2)));
int cmpStr = Math.max(-1, Math.min(1, b1.toString().compareTo(b2.toString())));

assertEquals(cmpStr, cmpByte);
// java.lang.AssertionError: 
// Expected :-1
// Actual   :1
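
The mismatch can be reproduced without CompactString. The sketch below assumes the byte string comparison behaves like an unsigned lexicographic comparison of the UTF-8 bytes (I haven't verified that this is exactly what CompactString does). String.compareTo compares UTF-16 code units, so the surrogate (in the range 0xD800-0xDBFF) sorts before 0xF4D1, while in UTF-8 the 4-byte sequence starts with 0xF0 and the 3-byte one with 0xEF. The class and method names are only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf16VsUtf8Order {
    // Sign of String.compareTo, which compares UTF-16 code units.
    static int utf16Order(String a, String b) {
        return Integer.signum(a.compareTo(b));
    }

    // Sign of an unsigned lexicographic comparison of the UTF-8 bytes
    // (assumed here to be how a byte string would be compared).
    static int utf8Order(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        return Integer.signum(Arrays.compareUnsigned(x, y));
    }

    // Sign of a comparison by Unicode code point.
    static int codePointOrder(String a, String b) {
        int[] x = a.codePoints().toArray();
        int[] y = b.codePoints().toArray();
        return Integer.signum(Arrays.compare(x, y));
    }

    public static void main(String[] args) {
        String ss1 = new String(Character.toChars(0x26ce3)); // needs a surrogate pair
        String ss2 = new String(Character.toChars(0xf4d1));  // single UTF-16 code unit

        System.out.println(utf16Order(ss1, ss2));     // -1
        System.out.println(utf8Order(ss1, ss2));      // 1
        System.out.println(codePointOrder(ss1, ss2)); // 1
    }
}
```

The UTF-8 byte order agrees with the code point order here, and only the UTF-16 code unit order used by String.compareTo differs. Arrays.compareUnsigned and Arrays.compare on arrays need Java 9 or later.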

This causes a bug when generating an HDT from a chunk of Wikidata; hdtVerify then reports errors while checking the object entries:

> .\rdf2hdt.bat .\chunk.nt.gz test.hdt
...
File converted in: 2 min 30 sec 463 ms 185 us
Total Triples: 49996305
Different subjects: 1206364
Different predicates: 3655
Different objects: 9917883
Common Subject/Object:603515
HDT saved to file in: 1 sec 242 ms 73 us

> .\hdtVerify.bat .\test.hdt
Checking subject entries
Checking predicate entries
Checking object entries
ERRA: "????"@zh-hant / "??"@lzh
ERRB: "????"@zh-hant / "??"@lzh
ERRA: "????"@zh-hant / "???"@lzh
ERRB: "????"@zh-hant / "???"@lzh
ERRA: "???????"@zh-hant / "?????"@got
ERRB: "???????"@zh-hant / "?????"@got
Checking shared entries
