Commit 959e6a9

Add utf8_hop_back_overshoot()
This is like plain utf8_hop_back(), except that it also reports how many characters the request would have overshot the edge of the string had it been allowed to continue past it. The caller can use that count for error handling. To support this, the code now counts the characters actually consumed in the hop more carefully than it did before this commit.
1 parent 76d4023 commit 959e6a9

4 files changed, +61 -23 lines changed

embed.fnc

Lines changed: 7 additions & 2 deletions
@@ -3613,9 +3613,14 @@ ARdip |IV |utf8_distance |NN const U8 *a \
 |NN const U8 *b
 ARTdip |U8 * |utf8_hop |NN const U8 *s \
 |SSize_t off
-ARTdip |U8 * |utf8_hop_back |NN const U8 *s \
+ARTdmp |U8 * |utf8_hop_back |NN const U8 *s \
 |SSize_t off \
-|NN const U8 *start
+|NN const U8 * const start
+ARTdip |U8 * |utf8_hop_back_overshoot \
+|NN const U8 *s \
+|SSize_t off \
+|NN const U8 * const start \
+|NULLOK SSize_t *remaining
 ARTdip |U8 * |utf8_hop_forward \
 |NN const U8 *s \
 |SSize_t off \

embed.h

Lines changed: 1 addition & 0 deletions
@@ -787,6 +787,7 @@
 # define utf8_distance(a,b) Perl_utf8_distance(aTHX_ a,b)
 # define utf8_hop Perl_utf8_hop
 # define utf8_hop_back Perl_utf8_hop_back
+# define utf8_hop_back_overshoot Perl_utf8_hop_back_overshoot
 # define utf8_hop_forward Perl_utf8_hop_forward
 # define utf8_hop_safe Perl_utf8_hop_safe
 # define utf8_length(a,b) Perl_utf8_length(aTHX_ a,b)

inline.h

Lines changed: 47 additions & 19 deletions
@@ -2708,28 +2708,48 @@ Perl_utf8_hop_forward(const U8 *s, SSize_t off, const U8 *end)
 }
 
 /*
-=for apidoc utf8_hop_back
-
-Return the UTF-8 pointer C<s> displaced by up to C<off> characters,
-backward. C<s> does not need to be pointing to the starting byte of a
-character. If it isn't, one count of C<off> will be used up to get to that
-start.
-
-C<off> must be non-positive.
-
-C<s> must be after or equal to C<start>.
-
-When moving backward it will not move before C<start>.
-
-Will not exceed this limit even if the string is not valid "UTF-8".
+=for apidoc utf8_hop_back
+=for apidoc_item utf8_hop_back_overshoot
+
+These each take as input a string encoded as UTF-8 which starts at C<start>,
+and a position into it given by C<s>, and return the position within it that is
+C<s> displaced by up to C<off> characters backwards.
+
+If there are fewer than C<off> characters between C<start> and C<s>, the
+functions return C<start>.
+
+The functions differ in that C<utf8_hop_back_overshoot> can return how many
+characters C<off> beyond the edge the request was for. When its parameter,
+C<&remaining>, is not NULL, the function stores into it the count of the
+excess; zero if the request was completely fulfilled. The actual number of
+characters that were displaced can then be calculated as S<C<off - remaining>>.
+This function acts identically to plain C<utf8_hop_back> when this parameter is
+NULL.
+
+C<s> does not need to be pointing to the starting byte of a character. If it
+isn't, one count of C<off> will be used up to get to that start.
+
+C<off> must be non-positive, and if zero, no action is taken; C<s> is returned
+unchanged. That it otherwise must be negative means that the earlier
+description is a lie, to avoid burdening you with this detail too soon. An
+C<off> of C<-2> means to displace two characters backwards, so the displacement
+is actually the absolute value of C<off>. C<remaining> will also be
+non-positive. If there was only one character between C<start> and C<s>, and a
+displacement of C<-2> was requested, C<remaining> would be set to C<-1>. The
+subtraction formula works, yielding the result that only C<-1> character was
+displaced.
 
 =cut
 */
 
+# define Perl_utf8_hop_back( s, off, start) \
+    Perl_utf8_hop_back_overshoot(s, off, start, NULL)
+
 PERL_STATIC_INLINE U8 *
-Perl_utf8_hop_back(const U8 *s, SSize_t off, const U8 *start)
+Perl_utf8_hop_back_overshoot(const U8 *s, SSize_t off,
+                             const U8 * const start, SSize_t *remaining)
 {
-    PERL_ARGS_ASSERT_UTF8_HOP_BACK;
+    PERL_ARGS_ASSERT_UTF8_HOP_BACK_OVERSHOOT;
     assert(start <= s);
     assert(off <= 0);
 
@@ -2740,10 +2760,18 @@ Perl_utf8_hop_back(const U8 *s, SSize_t off, const U8 *start)
      * moved is large, and core perl doesn't currently move more than a few
      * characters at a time. You can reinstate it if it does become
      * advantageous. */
-    while (off++ && s > start) {
-        do {
+    while (off < 0 && s > start) {
+        do {    /* Find the beginning of this character */
             s--;
-        } while (s > start && UTF8_IS_CONTINUATION(*s));
+            if (! UTF8_IS_CONTINUATION(*s)) {
+                off++;
+                break;
+            }
+        } while (s > start);
+    }
+
+    if (remaining) {
+        *remaining = off;
     }
 
     GCC_DIAG_IGNORE(-Wcast-qual)
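
To make the counting in the hunk above easier to follow, here is a minimal standalone sketch of the same loop. It is not the code from inline.h: the typedefs, the `IS_CONTINUATION` macro, and the un-prefixed names `hop_back_overshoot`, `hop_back`, and `chars_hopped` are stand-ins invented for illustration so that the block compiles without the Perl headers. The continuation test assumes an ASCII platform, and the sketch returns a `const` pointer instead of casting the qualifier away as the real function does.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char U8;    /* stand-in for Perl's U8 */
typedef ptrdiff_t SSize_t;   /* stand-in for Perl's SSize_t */

/* Stand-in for UTF8_IS_CONTINUATION, assuming an ASCII platform:
 * UTF-8 continuation bytes have the bit pattern 10xxxxxx. */
#define IS_CONTINUATION(c) (((c) & 0xC0) == 0x80)

/* Move s backwards by up to -off characters without going before start.
 * If remaining is not NULL, store the unfulfilled part of the request
 * there: 0 when the hop completed, negative otherwise. */
const U8 *
hop_back_overshoot(const U8 *s, SSize_t off,
                   const U8 * const start, SSize_t *remaining)
{
    assert(start <= s);
    assert(off <= 0);

    while (off < 0 && s > start) {
        do {    /* Find the beginning of this character */
            s--;
            if (!IS_CONTINUATION(*s)) {
                off++;      /* count a hop only once a start byte is found */
                break;
            }
        } while (s > start);
    }

    if (remaining) {
        *remaining = off;
    }

    return s;
}

/* Mirrors the wrapper macro added by the commit: the plain form is the
 * overshoot form with a NULL 'remaining'. */
#define hop_back(s, off, start) hop_back_overshoot(s, off, start, NULL)

/* Hypothetical helper: how many characters were actually hopped, as a
 * non-negative count, derived from the 'off - remaining' formula. */
SSize_t
chars_hopped(SSize_t off, SSize_t remaining)
{
    return -(off - remaining);
}
```

With the six-byte string `{ 'a', 0xC3, 0xA9, 0xE2, 0x82, 0xAC }` (three characters of 1, 2, and 3 bytes), hopping back `-2` from the end lands on byte 1 with `remaining` set to `0`. Hopping back `-5` stops at the start of the string with `remaining` set to `-2`, so `chars_hopped(-5, -2)` gives `3`. A caller that treats overshoot as an error can test for `remaining != 0`.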

proto.h

Lines changed: 6 additions & 2 deletions
Some generated files are not rendered by default.
