> perl -MDevel::Peek -MJSON -e'my $str = "é"; my $str2 = JSON::decode_json( JSON::encode_json([$str]) )->[0]; Dump $str; Dump $str2'
SV = PV(0xd3cd60) at 0xd58620
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK)
  PV = 0xd5f3b0 "\303\251"\0
  CUR = 2
  LEN = 10
  COW_REFCNT = 1
SV = PV(0xd3ceb0) at 0xd58698
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0xd4b0f0 "\303\203\302\251"\0 [UTF8 "\x{c3}\x{a9}"]
  CUR = 4
  LEN = 10
In default mode, CDB_File will print $str and $str2 differently: 2 bytes and 4 bytes, respectively.
The utf8 => 1 mode offers one fix: it prints both strings as 4 bytes. Effectively this mode says, “my strings are characters; please store their UTF-8 representation.”
Ideally there should be another mode that tells the encoder that the strings are bytes. It would do the opposite of utf8 => 1: store $str’s internal PV as-is, but store the result of sv_2pvbyte() for $str2. Both strings would then be written as the same 2 bytes. This has the desirable effect of croaking if any UTF8-flagged SV contains a code point that exceeds 255 (and thus cannot be a byte string).
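To make the two behaviours concrete, here is a rough pure-Perl sketch. The helpers as_utf8_bytes and as_raw_bytes are hypothetical names, not part of CDB_File; they only mimic what each mode would store, using the core utf8::encode and utf8::downgrade functions (utf8::downgrade being the Perl-level counterpart of what sv_2pvbyte() does).

```perl
use strict;
use warnings;

# Hypothetical helpers (NOT CDB_File API) that mimic what each mode would store.

# "Characters" mode, like utf8 => 1: store the UTF-8 encoding of the characters.
sub as_utf8_bytes {
    my ($s) = @_;          # work on a copy; the caller's scalar is untouched
    utf8::encode($s);      # characters -> UTF-8 octets
    return $s;
}

# Proposed "bytes" mode: store the octets, whatever the internal UTF8 flag is.
sub as_raw_bytes {
    my ($s) = @_;
    utf8::downgrade($s);   # dies ("Wide character ...") if any code point > 255
    return $s;
}

my $str  = "\xc3\xa9";     # two bytes, UTF8 flag off (like $str above)
my $str2 = $str;
utf8::upgrade($str2);      # same two code points, UTF8 flag on (like $str2 above)

printf "utf8 mode:  %d and %d bytes\n",
    length(as_utf8_bytes($str)), length(as_utf8_bytes($str2));
printf "bytes mode: %d and %d bytes\n",
    length(as_raw_bytes($str)), length(as_raw_bytes($str2));

# A string that cannot be a byte string is rejected in bytes mode.
my $ok = eval { as_raw_bytes("\x{100}"); 1 };
print "code point 256 rejected in bytes mode\n" unless $ok;
```

In this sketch both modes treat $str and $str2 identically, which is the point: the result no longer depends on Perl's internal representation.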
Thank you!