Commit 95f8a0b

Add utf8_to_uv()
This performs the same function as utf8_to_uvchr_buf() with a more convenient API that is much harder to misuse. All code should convert to use this new function instead of the old.

The behavior of utf8_to_uvchr_buf() varies depending on whether <utf8> warnings are enabled, and no code in core actually takes that into account.

If warnings are enabled:

    A zero return can mean either success or failure, so a zero return must be disambiguated. Success would come from the next character being a NUL.

    On failure, <retlen> will be -1, so it can't be used to find where to start parsing again.

If warnings are disabled:

    Both the return and <retlen> will be usable values, but a return of the REPLACEMENT CHARACTER is ambiguous. It could mean failure, or it could mean that the REPLACEMENT CHARACTER was the next character in the input and was successfully decoded. It may very well not matter to you what the source of this particular value was; it likely means a failure somewhere. But there are occasions where you might care.

The new function returns true upon success and false upon failure. It is passed pointers through which it returns the computed code point and byte length. These values always contain the correct information, regardless of whether the input is malformed.

It is easy to test for failure in a conditional and then take appropriate action. Most often, however, the appropriate action seems to be to use, going forward, the REPLACEMENT CHARACTER returned in failure cases. And if you don't particularly care whether it succeeds, you just use the result without testing it. This happens when you are confident that the input is well-formed, or, say, when converting a string for display.
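To make the contract described in the commit message concrete, here is a minimal, self-contained sketch of a caller loop. It does not use Perl's headers: `demo_utf8_to_uv()` and `demo_count_malformed()` are hypothetical stand-ins written for illustration, with `uint8_t`/`uint32_t` substituted for Perl's `U8`/`UV`. The stand-in's idea of "malformed" is deliberately simplified (bad start byte, truncation, bad continuation byte, overlong encoding) and should not be read as Perl's actual rules. Only the calling shape mirrors the new function: a boolean result, with the code point and advance length always usable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_REPLACEMENT 0xFFFDu   /* U+FFFD REPLACEMENT CHARACTER */

/* Hypothetical stand-in for utf8_to_uv(); NOT Perl's implementation.
 * Decodes one character from [s, e).  On success or failure it stores a
 * usable code point in *cp_p and, if advance_p is non-NULL, a nonzero
 * byte count in *advance_p. */
static bool
demo_utf8_to_uv(const uint8_t *s, const uint8_t *e,
                uint32_t *cp_p, size_t *advance_p)
{
    assert(s < e);                       /* caller must supply >= 1 byte */

    size_t   need = 0, used = 1;
    uint32_t cp = 0, min = 0;
    bool     ok = true;

    if      (s[0] < 0x80)           { cp = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; need = 1; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; need = 2; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; need = 3; min = 0x10000; }
    else                            { ok = false; }   /* invalid start byte */

    while (ok && need > 0) {
        /* Stop at end of buffer or at a byte that is not a continuation */
        if (s + used >= e || (s[used] & 0xC0) != 0x80) { ok = false; break; }
        cp = (cp << 6) | (s[used] & 0x3F);
        used++;
        need--;
    }
    if (ok && cp < min)                  /* overlong encoding */
        ok = false;

    *cp_p = ok ? cp : DEMO_REPLACEMENT;
    if (advance_p)
        *advance_p = used;               /* always >= 1, so loops progress */
    return ok;
}

/* Walks a buffer, counting characters and malformed sequences.  The
 * return value is tested only for counting; cp and advance are safe to
 * use either way. */
static size_t
demo_count_malformed(const uint8_t *s, const uint8_t *e, size_t *nchars_p)
{
    size_t bad = 0, nchars = 0;

    while (s < e) {
        uint32_t cp;
        size_t   advance;

        if (!demo_utf8_to_uv(s, e, &cp, &advance))
            bad++;                       /* cp is U+FFFD here */
        nchars++;
        s += advance;
    }
    *nchars_p = nchars;
    return bad;
}
```

In this sketch a decoded NUL and a decoded U+FFFD are both reported as successes, so neither needs the extra disambiguation the old interface required. A caller that only wants displayable output could ignore the boolean entirely.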
1 parent 16d0f3c commit 95f8a0b

4 files changed: 10 additions, 0 deletions

embed.fnc

Lines changed: 4 additions & 0 deletions
@@ -3727,6 +3727,10 @@ EMXp |U8 * |utf16_to_utf8_reversed \
 |NN U8 *d \
 |Size_t bytelen \
 |NN Size_t *newlen
+ATmp |bool |utf8_to_uv |NN const U8 * const s \
+|NN const U8 * const e \
+|NN UV *cp_p \
+|NULLOK Size_t *advance_p
 ADbdp |UV |utf8_to_uvchr |NN const U8 *s \
 |NULLOK STRLEN *retlen
 AMdip |UV |utf8_to_uvchr_buf \

embed.h

Lines changed: 1 addition & 0 deletions
@@ -862,6 +862,7 @@
 # define utf8_to_bytes_new_pv(a,b,c) Perl_utf8_to_bytes_new_pv(aTHX,a,b,c)
 # define utf8_to_bytes_overwrite(a,b) Perl_utf8_to_bytes_overwrite(aTHX,a,b)
 # define utf8_to_bytes_temp_pv(a,b) Perl_utf8_to_bytes_temp_pv(aTHX,a,b)
+# define utf8_to_uv Perl_utf8_to_uv
 # define utf8_to_uv_errors Perl_utf8_to_uv_errors
 # define utf8_to_uv_flags Perl_utf8_to_uv_flags
 # define utf8_to_uv_msgs Perl_utf8_to_uv_msgs

proto.h

Lines changed: 3 additions & 0 deletions
Some generated files are not rendered by default.

utf8.h

Lines changed: 2 additions & 0 deletions
@@ -159,6 +159,8 @@ typedef enum {
 #define Perl_utf8n_to_uvchr_error(s, len, lenp, flags, errors) \
     Perl_utf8n_to_uvchr_msgs(s, len, lenp, flags, errors, 0)

+#define Perl_utf8_to_uv( s, e, cp_p, advance_p) \
+    Perl_utf8_to_uv_flags( s, e, cp_p, advance_p, 0)
 #define Perl_utf8_to_uv_flags( s, e, cp_p, advance_p, flags) \
     Perl_utf8_to_uv_errors( s, e, cp_p, advance_p, flags, 0)
 #define Perl_utf8_to_uv_errors( s, e, cp_p, advance_p, flags, errors) \
