Add utf8_to_uv()

khwilliamson · khwilliamson · commit 95f8a0bcabcf · 2024-12-02T10:48:02.000-07:00
This performs the same function as utf8_to_uvchr_buf() with a more
convenient API that is much harder to misuse.

All code should be converted to use this new function instead of the
old one.

The behavior of utf8_to_uvchr_buf() varies depending on whether <utf8>
warnings are enabled, and no code in core actually takes that into
account.

If warnings are enabled:

 A zero return can mean either success or failure.

     Hence a zero return must be disambiguated.  Success would come
     from the next character being a NUL.

 On failure, <retlen> will be -1, so it can't be used to find where to
 start parsing again.

If disabled:

 Both the return and <retlen> will be usable values, but the return
 of the REPLACEMENT CHARACTER is ambiguous.  It could mean failure,
 or it could mean that that was the next character in the input and
 was successfully decoded.  It may very well not matter to you what
 the source of this particular value was.  It likely means a failure
 somewhere.  But there are occasions where you might care.
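
The two conventions above can be sketched as follows.  This is a
hypothetical stand-in, not utf8_to_uvchr_buf() itself: the names
old_style_decode() and old_style_failed() are invented for
illustration, the decoder treats every byte >= 0x80 as malformed to
stay short, and a signed length is used instead of STRLEN so that -1
is visible as such.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHARACTER 0xFFFDu

/* Hypothetical stand-in that mimics only the return conventions described
 * in this message.  It decodes ASCII and reports any byte >= 0x80 as
 * malformed; it is NOT Perl's decoder. */
static uint32_t
old_style_decode(const uint8_t *s, const uint8_t *e,
                 ptrdiff_t *retlen, bool utf8_warnings_enabled)
{
    assert(s < e);

    if (*s < 0x80) {                /* well-formed (ASCII) input */
        *retlen = 1;
        return *s;
    }

    if (utf8_warnings_enabled) {
        *retlen = -1;               /* no usable length on failure */
        return 0;                   /* same value a decoded NUL yields */
    }

    *retlen = 1;                    /* usable length, but the return ... */
    return REPLACEMENT_CHARACTER;   /* ... looks like a genuine U+FFFD */
}

/* What a careful caller has to do in the warnings-enabled case: a zero
 * return is a failure unless the input byte really was a NUL. */
static bool
old_style_failed(uint32_t cp, const uint8_t *s)
{
    return cp == 0 && *s != '\0';
}
```

Every caller would need the extra old_style_failed() check, and even
then has no length to resume from.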

The new function returns true on success and false on failure.  It is
passed pointers through which it returns the computed code point and
byte length.  These values always contain the correct information,
whether or not the input is malformed.

It is easy to test for failure in a conditional and then to take
appropriate action.  However, most often it seems the appropriate action
is to use, going forward, the REPLACEMENT CHARACTER returned in failure
cases.

And if you don't particularly care whether it succeeds or not, you just
use it without testing the result.  This happens when you are confident
that the input is well-formed, or, say, when converting a string for
display.
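
Both calling patterns can be sketched with a stand-in that has the same
call shape as the new function.  sketch_utf8_to_uv(),
count_malformed() and decode_for_display() are hypothetical names
invented here; the decoder is a simplified one that can be compiled
outside of Perl, and its acceptance rules are not meant to match those
of utf8_to_uv().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHARACTER 0xFFFDu

/* Hypothetical stand-in: true on success, false on failure; *cp_p and
 * *advance_p are usable either way.  advance_p may be NULL. */
static bool
sketch_utf8_to_uv(const uint8_t *s, const uint8_t *e,
                  uint32_t *cp_p, size_t *advance_p)
{
    /* Smallest code point that legitimately needs 'need' bytes. */
    static const uint32_t min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t need, i;
    uint32_t cp;

    assert(s < e);

    if      (*s < 0x80)           { need = 1; cp = *s; }
    else if ((*s & 0xE0) == 0xC0) { need = 2; cp = *s & 0x1F; }
    else if ((*s & 0xF0) == 0xE0) { need = 3; cp = *s & 0x0F; }
    else if ((*s & 0xF8) == 0xF0) { need = 4; cp = *s & 0x07; }
    else                          { need = 0; cp = 0; } /* bad start byte */

    for (i = 1; i < need; i++) {
        if ((size_t)(e - s) <= i || (s[i] & 0xC0) != 0x80)
            break;                  /* truncated, or not a continuation */
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    if (need == 0 || i < need || cp < min_cp[need] || cp > 0x10FFFF) {
        *cp_p = REPLACEMENT_CHARACTER;
        if (advance_p) *advance_p = i;  /* always >= 1, so callers progress */
        return false;
    }

    *cp_p = cp;
    if (advance_p) *advance_p = need;
    return true;
}

/* Pattern 1: test the result and act on failure. */
static size_t
count_malformed(const uint8_t *s, const uint8_t *e)
{
    size_t bad = 0;
    while (s < e) {
        uint32_t cp;
        size_t advance;
        if (! sketch_utf8_to_uv(s, e, &cp, &advance))
            bad++;
        s += advance;
    }
    return bad;
}

/* Pattern 2: ignore the result; malformed input shows up as U+FFFD.
 * Returns the number of code points written (at most out_max). */
static size_t
decode_for_display(const uint8_t *s, const uint8_t *e,
                   uint32_t *out, size_t out_max)
{
    size_t n = 0;
    while (s < e && n < out_max) {
        size_t advance;
        (void) sketch_utf8_to_uv(s, e, &out[n], &advance);
        n++;
        s += advance;
    }
    return n;
}
```

In both loops the advance is usable after a failure, so no separate
recovery path is needed to find where to resume parsing.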
diff --git a/embed.fnc b/embed.fnc
@@ -3727,6 +3727,10 @@ EMXp	|U8 *	|utf16_to_utf8_reversed 				\
 				|NN U8 *d				\
 				|Size_t bytelen 			\
 				|NN Size_t *newlen
+ATmp	|bool	|utf8_to_uv	|NN const U8 * const s			\
+				|NN const U8 * const e			\
+				|NN UV *cp_p				\
+				|NULLOK Size_t *advance_p
 ADbdp	|UV	|utf8_to_uvchr	|NN const U8 *s 			\
 				|NULLOK STRLEN *retlen
 AMdip	|UV	|utf8_to_uvchr_buf					\
diff --git a/embed.h b/embed.h
@@ -862,6 +862,7 @@
 # define utf8_to_bytes_new_pv(a,b,c)            Perl_utf8_to_bytes_new_pv(aTHX,a,b,c)
 # define utf8_to_bytes_overwrite(a,b)           Perl_utf8_to_bytes_overwrite(aTHX,a,b)
 # define utf8_to_bytes_temp_pv(a,b)             Perl_utf8_to_bytes_temp_pv(aTHX,a,b)
+# define utf8_to_uv                             Perl_utf8_to_uv
 # define utf8_to_uv_errors                      Perl_utf8_to_uv_errors
 # define utf8_to_uv_flags                       Perl_utf8_to_uv_flags
 # define utf8_to_uv_msgs                        Perl_utf8_to_uv_msgs
diff --git a/proto.h b/proto.h
diff --git a/utf8.h b/utf8.h
@@ -159,6 +159,8 @@ typedef enum {
 #define Perl_utf8n_to_uvchr_error(s, len, lenp, flags, errors)                 \
                     Perl_utf8n_to_uvchr_msgs(s, len, lenp, flags, errors, 0)
 
+#define Perl_utf8_to_uv(         s, e, cp_p, advance_p)                     \
+        Perl_utf8_to_uv_flags(   s, e, cp_p, advance_p, 0)
 #define Perl_utf8_to_uv_flags(   s, e, cp_p, advance_p, flags)              \
         Perl_utf8_to_uv_errors(  s, e, cp_p, advance_p, flags, 0)
 #define Perl_utf8_to_uv_errors(  s, e, cp_p, advance_p, flags, errors)      \