From ce6e6e69d2cd678d454fa73a006032980e83a5ac Mon Sep 17 00:00:00 2001
From: Karl Williamson
Date: Tue, 6 May 2025 07:05:46 -0600
Subject: [PATCH 1/3] perlintro: Define metacharacter before using the term

This adds a bit of text about metacharacters that was missing from this
introductory pod.
---
 pod/perlintro.pod | 30 ++++++++++++++++++++++++++----
 1 file changed, 26 insertions(+), 4 deletions(-)

diff --git a/pod/perlintro.pod b/pod/perlintro.pod
index 4fdee7b16796..da428397ae7b 100644
--- a/pod/perlintro.pod
+++ b/pod/perlintro.pod
@@ -584,10 +584,32 @@ the meantime, here's a quick cheat sheet:
     ^                   start of string
     $                   end of string
 
-Quantifiers can be used to specify how many of the previous thing you
-want to match on, where "thing" means either a literal character, one
-of the metacharacters listed above, or a group of characters or
-metacharacters in parentheses.
+Note that in the above, C<$> doesn't match a dollar sign. Similarly
+C<.>, C<\>, C<[>, C<]>, C<(>, C<)>, and C<^> don't match the characters
+you might expect. These are called "metacharacters". In contrast, the
+characters C, C, C, C, and C, for example, are not
+metacharacters. They match themselves literally. Metacharacters
+normally match something that isn't their literal value. There are a few
+more metacharacters than the ones above. Some quantifier ones are
+given below, and the full list is in L.
+
+To make a metacharacter match its literal value, you "escape" (or "quote")
+it, by preceding it with a backslash. Hence, C<\$> does match a dollar sign,
+and C<\\> matches a literal backslash.
+
+Note also that above, the string C<\s>, for example, doesn't match a
+backslash followed by the letter C. In this case, preceding the
+non-metacharacter C with a backslash turns it into something that
+doesn't match its literal value. Such a sequence is called an "escape
+sequence". L documents all of the current ones.
+
+A warning is raised if you escape a character that isn't a metacharacter
+and isn't part of a currently defined escape sequence.
+
+You can specify how many of the previous thing you want to match on by
+using quantifiers (where "thing" means one of: a literal character, one
+of the constructs listed above, or a group of either of them in
+parentheses).
 
     *                   zero or more of the previous thing
     +                   one or more of the previous thing

From 7a6669488174ee5438a9db1bfe15bfadd52f19f6 Mon Sep 17 00:00:00 2001
From: Karl Williamson
Date: Tue, 6 May 2025 20:57:43 -0600
Subject: [PATCH 2/3] pod and comments: Note escape vs quote

Fixes #15221

The documentation and comments were misleading about conflating quoting
a metacharacter and escaping it. Since \Q stands for quote, we have to
continue to use that terminology. This commit clarifies that the two
terms are often equivalent. This also adds detail about quotemeta and
\Q.
---
 pod/perldiag.pod        |  8 +++---
 pod/perlfunc.pod        |  5 ++++
 pod/perlre.pod          | 61 ++++++++++++++++++++++++++++++-----------
 pod/perlrebackslash.pod | 14 +++++-----
 pod/perlreref.pod       |  2 +-
 pod/perlretut.pod       |  2 +-
 pp.c                    |  4 +--
 7 files changed, 65 insertions(+), 31 deletions(-)

diff --git a/pod/perldiag.pod b/pod/perldiag.pod
index 5cf5fc7b3fde..6c9948f861e9 100644
--- a/pod/perldiag.pod
+++ b/pod/perldiag.pod
@@ -2602,8 +2602,8 @@ and perl's F emulation was unable to create an empty temporary file.
 (W regexp)(F) A character class range must start and end at a literal
 character, not another character class like C<\d> or C<[:alpha:]>. The
 "-" in your false range is interpreted as a literal "-". In a C<(?[...])>
-construct, this is an error, rather than a warning. Consider quoting
-the "-", "\-". The S<<-- HERE> shows whereabouts in the regular expression
+construct, this is an error, rather than a warning. Consider escaping
+the "-" as "\-". The S<<-- HERE> shows whereabouts in the regular expression
 the problem was discovered. See L.
 
 =item Fatal VMS error (status=%d) at %s, line %d
@@ -5453,7 +5453,7 @@ S<<-- HERE> in m/%s/
 (F) Within regular expression character classes ([]) the syntax beginning
 with "[." and ending with ".]" is reserved for future extensions. If you
 need to represent those character sequences inside a regular expression
-character class, just quote the square brackets with the backslash: "\[."
+character class, just escape the square brackets with the backslash: "\[."
 and ".\]". The S<<-- HERE> shows whereabouts in the regular expression
 the problem was discovered. See L.
 
@@ -5463,7 +5463,7 @@ S<<-- HERE> in m/%s/
 (F) Within regular expression character classes ([]) the syntax beginning
 with "[=" and ending with "=]" is reserved for future extensions. If you
 need to represent those character sequences inside a regular expression
-character class, just quote the square brackets with the backslash: "\[="
+character class, just escape the square brackets with the backslash: "\[="
 and "=\]". The S<<-- HERE> shows whereabouts in the regular expression
 the problem was discovered. See L.
 
diff --git a/pod/perlfunc.pod b/pod/perlfunc.pod
index 99f40e54c6d2..388b4b9ee1a4 100644
--- a/pod/perlfunc.pod
+++ b/pod/perlfunc.pod
@@ -6536,6 +6536,11 @@ the C<\Q> escape in double-quoted strings.
 
 If EXPR is omitted, uses L|perlvar/$_>.
 
+The motivation behind this is to make all characters in EXPR match their
+literal selves. Otherwise any metacharacters in it could trigger
+their "magic" matching behaviors. The characters this function has been
+applied to are said to be "quoted" or "escaped".
+
 quotemeta (and C<\Q> ... C<\E>) are useful when interpolating strings into
 regular expressions, because by default an interpolated variable will be
 considered a mini-regular expression. For example:
diff --git a/pod/perlre.pod b/pod/perlre.pod
index 3d046ac64f26..b834f9e1423d 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -1350,24 +1350,42 @@ X</p> X<p flag>
 
 =head2 Quoting metacharacters
 
-Backslashed metacharacters in Perl are alphanumeric, such as C<\b>,
-C<\w>, C<\n>. Unlike some other regular expression languages, there
-are no backslashed symbols that aren't alphanumeric. So anything
-that looks like C<\\>, C<\(>, C<\)>, C<\[>, C<\]>, C<\{>, or C<\}> is
-always
-interpreted as a literal character, not a metacharacter. This was
-once used in a common idiom to disable or quote the special meanings
-of regular expression metacharacters in a string that you want to
-use for a pattern. Simply quote all non-"word" characters:
+(Also known as "escaping".)
 
-    $pattern =~ s/(\W)/\\$1/g;
+To cause a metacharacter to match its literal self, you precede it with
+a backslash. Unlike some other regular expression languages, any
+sequence consisting of a backslash followed by a non-alphanumeric
+matches that non-alphanumeric, literally. So things like C<\\>, C<\(>,
+C<\)>, C<\[>, C<\]>, C<\{>, or C<\}> are always interpreted as the
+literal character that follows the backslash.
+
+(That's not true when an alphanumeric character is preceded by a
+backslash. There are a few such "escape sequences", like C<\w>, which have
+special matching behaviors in Perl. All such are currently limited to
+ASCII-range alphanumerics.)
+
+The best method to escape metacharacters is to use the
+C> function, or the equivalent, but the
+more flexible, and often more convenient, C<\Q> metaquoting escape
+sequence
+
+    quotemeta $pattern;
+
+This changes C<$pattern> so that the metacharacters are quoted. You can
+then do
+
+    $string =~ s/$pattern/foo/;
 
-(If C is set, then this depends on the current locale.)
-Today it is more common to use the C>
-function or the C<\Q> metaquoting escape sequence to disable all
-metacharacters' special meanings like this:
+and be assured that any metacharacters in C<$pattern> will match their
+literal selves. If you instead use C<\Q>, like:
 
-    /$unquoted\Q$quoted\E$unquoted/
+    $string =~ s/\Qpattern/foo/;
+
+you don't have to have a separate C<$pattern> variable. Further, there
+is an additional escape sequence, C<\E> that can be combined with C<\Q>
+to allow you to escape whatever portions of the pattern you desire:
+
+    $string =~ s/$unquoted\Q$quoted\E$unquoted/foo/;
 
 Beware that if you put literal backslashes (those not inside
 interpolated variables) between C<\Q> and C<\E>, double-quotish
@@ -1375,7 +1393,18 @@ backslash interpolation may lead to confusing results. If you
 I to use literal backslashes within C<\Q...\E>,
 consult L.
 
-C and C<\Q> are fully described in L.
+In older code, you may see something like this:
+
+    $pattern =~ s/(\W)/\\$1/g;
+    $string =~ s/$pattern/foo/;
+
+This simply adds backslashes before all non-"word" characters to disable
+any special meanings they might have. (If S> is in
+effect, the current locale can affect the results.) This paradigm is
+inadequate for Unicode.
+
+C and C<\Q> are more fully described in
+L.
 
 =head2 Extended Patterns
 
diff --git a/pod/perlrebackslash.pod b/pod/perlrebackslash.pod
index 14afb9728455..39153f6a1f60 100644
--- a/pod/perlrebackslash.pod
+++ b/pod/perlrebackslash.pod
@@ -90,8 +90,8 @@ as C
  \o{}              Octal escape sequence.
  \p{}, \pP         Match any character with the given Unicode property.
  \P{}, \PP         Match any character without the given property.
- \Q                Quote (disable) pattern metacharacters till \E. Not
-                   in [].
+ \Q                Quote (disable) pattern metacharacters till \E.
+                   (Also called "escape".) Not in [].
  \r                Return character.
  \R                Generic new line. Not in [].
  \s                Match any whitespace character.
@@ -350,11 +350,11 @@ them, until either the end of the pattern or the next occurrence of
 C<\E>, whichever comes first. They provide functionality similar to what
 the functions C and C provide.
 
-C<\Q> is used to quote (disable) pattern metacharacters, up to the next
-C<\E> or the end of the pattern. C<\Q> adds a backslash to any character
-that could have special meaning to Perl. In the ASCII range, it quotes
-every character that isn't a letter, digit, or underscore. See
-L for details on what gets quoted for non-ASCII
+C<\Q> is used to quote or escape (disable) pattern metacharacters, up to
+the next C<\E> or the end of the pattern. C<\Q> adds a backslash to any
+character that could have special meaning to Perl. In the ASCII range,
+it quotes every character that isn't a letter, digit, or underscore.
+See L for details on what gets quoted for non-ASCII
 code points. Using this ensures that any character between C<\Q> and
 C<\E> will be matched literally, not interpreted as a metacharacter by the
 regex engine.
diff --git a/pod/perlreref.pod b/pod/perlreref.pod
index 6955a7fb7a65..624edf2a5e7d 100644
--- a/pod/perlreref.pod
+++ b/pod/perlreref.pod
@@ -318,7 +318,7 @@ Captured groups are numbered according to their I paren.
    fc          Foldcase a string
 
    pos         Return or set current match position
-   quotemeta   Quote metacharacters
+   quotemeta   Quote metacharacters (escape their normal meaning)
    reset       Reset m?pattern? status
    study       Analyze string for optimizing matching
 
diff --git a/pod/perlretut.pod b/pod/perlretut.pod
index 03ddaffe612f..43320963dafb 100644
--- a/pod/perlretut.pod
+++ b/pod/perlretut.pod
@@ -187,7 +187,7 @@ C<"["> respectively; other gotchas apply.
 The significance of each of these will be explained
 in the rest of the tutorial, but for now, it is important only to know
 that a metacharacter can be matched as-is by putting a backslash before
-it:
+it. This is called "escaping" or "quoting" it. Some examples:
 
     "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
     "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
diff --git a/pp.c b/pp.c
index 6e53f9b71129..5d174fd45829 100644
--- a/pp.c
+++ b/pp.c
@@ -5082,7 +5082,7 @@ PP(pp_quotemeta)
                 else if (UTF8_IS_NEXT_CHAR_DOWNGRADEABLE(s, s + len)) {
                     if (
 #ifdef USE_LOCALE_CTYPE
-                    /* In locale, we quote all non-ASCII Latin1 chars.
+                    /* In locale, we escape all non-ASCII Latin1 chars.
                      * Otherwise use the quoting rules */
 
                     IN_LC_RUNTIME(LC_CTYPE)
@@ -5116,7 +5116,7 @@ PP(pp_quotemeta)
             }
         }
         else {
-            /* For non UNI_8_BIT (and hence in locale) just quote all \W
+            /* For non UNI_8_BIT (and hence in locale) just escape all \W
              * including everything above ASCII */
             while (len--) {
                 if (!isWORDCHAR_A(*s))

From 1c9bc0228c36a68361131b1551b5172ac5ec535a Mon Sep 17 00:00:00 2001
From: Karl Williamson
Date: Sat, 10 May 2025 07:59:36 -0600
Subject: [PATCH 3/3] perlre: Simplify some text

By adding a word to a =heading, a sentence can be removed, and is
clearer. But since some pod somewhere may link to that heading, a
section at the end is added with the old name, and pointing to the new
one. Dan Book searched and found no instances in CPAN of the old
heading being linked to, but this guarantees that nothing breaks.
---
 pod/perlre.pod | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/pod/perlre.pod b/pod/perlre.pod
index b834f9e1423d..9436d352cb8a 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -1348,9 +1348,7 @@ their punctuation character equivalents, however at the trade-off that you
 have to tell perl when you want to use them.
 X</p> X<p flag>
 
-=head2 Quoting metacharacters
-
-(Also known as "escaping".)
+=head2 Quoting (escaping) metacharacters
 
 To cause a metacharacter to match its literal self, you precede it with
 a backslash. Unlike some other regular expression languages, any
@@ -3413,6 +3411,10 @@ Subroutine call to a named capture group. Equivalent to C<< (?&I) >>.
 
 =back
 
+=head2 Quoting metacharacters
+
+This section has been replaced by L.
+
 =head1 BUGS
 
 There are a number of issues with regard to case-insensitive matching