diff --git a/pod/perldiag.pod b/pod/perldiag.pod index 5cf5fc7b3fde..6c9948f861e9 100644 --- a/pod/perldiag.pod +++ b/pod/perldiag.pod @@ -2602,8 +2602,8 @@ and perl's F emulation was unable to create an empty temporary file. (W regexp)(F) A character class range must start and end at a literal character, not another character class like C<\d> or C<[:alpha:]>. The "-" in your false range is interpreted as a literal "-". In a C<(?[...])> -construct, this is an error, rather than a warning. Consider quoting -the "-", "\-". The S<<-- HERE> shows whereabouts in the regular expression +construct, this is an error, rather than a warning. Consider escaping +the "-" as "\-". The S<<-- HERE> shows whereabouts in the regular expression the problem was discovered. See L. =item Fatal VMS error (status=%d) at %s, line %d @@ -5453,7 +5453,7 @@ S<<-- HERE> in m/%s/ (F) Within regular expression character classes ([]) the syntax beginning with "[." and ending with ".]" is reserved for future extensions. If you need to represent those character sequences inside a regular expression -character class, just quote the square brackets with the backslash: "\[." +character class, just escape the square brackets with the backslash: "\[." and ".\]". The S<<-- HERE> shows whereabouts in the regular expression the problem was discovered. See L. @@ -5463,7 +5463,7 @@ S<<-- HERE> in m/%s/ (F) Within regular expression character classes ([]) the syntax beginning with "[=" and ending with "=]" is reserved for future extensions. If you need to represent those character sequences inside a regular expression -character class, just quote the square brackets with the backslash: "\[=" +character class, just escape the square brackets with the backslash: "\[=" and "=\]". The S<<-- HERE> shows whereabouts in the regular expression the problem was discovered. See L. 
diff --git a/pod/perlfunc.pod b/pod/perlfunc.pod index 99f40e54c6d2..388b4b9ee1a4 100644 --- a/pod/perlfunc.pod +++ b/pod/perlfunc.pod @@ -6536,6 +6536,11 @@ the C<\Q> escape in double-quoted strings. If EXPR is omitted, uses L|perlvar/$_>. +The motivation behind this is to make all characters in EXPR match their +literal selves. Otherwise any metacharacters in it could trigger +their "magic" matching behaviors. The characters this function has been +applied to are said to be "quoted" or "escaped". + quotemeta (and C<\Q> ... C<\E>) are useful when interpolating strings into regular expressions, because by default an interpolated variable will be considered a mini-regular expression. For example: diff --git a/pod/perlintro.pod b/pod/perlintro.pod index 4fdee7b16796..da428397ae7b 100644 --- a/pod/perlintro.pod +++ b/pod/perlintro.pod @@ -584,10 +584,32 @@ the meantime, here's a quick cheat sheet: ^ start of string $ end of string -Quantifiers can be used to specify how many of the previous thing you -want to match on, where "thing" means either a literal character, one -of the metacharacters listed above, or a group of characters or -metacharacters in parentheses. +Note that in the above, C<$> doesn't match a dollar sign. Similarly +C<.>, C<\>, C<[>, C<]>, C<(>, C<)>, and C<^> don't match the characters +you might expect. These are called "metacharacters". In contrast, the +characters C, C, C, C, and C, for example, are not +metacharacters. They match themselves literally. Metacharacters +normally match something that isn't their literal value. There are a few +more metacharacters than the ones above. Some quantifier ones are +given below, and the full list is in L. + +To make a metacharacter match its literal value, you "escape" (or "quote") +it, by preceding it with a backslash. Hence, C<\$> does match a dollar sign, +and C<\\> matches a literal backslash. + +Note also that above, the string C<\s>, for example, doesn't match a +backslash followed by the letter C. 
In this case, preceding the +non-metacharacter C with a backslash turns it into something that +doesn't match its literal value. Such a sequence is called an "escape +sequence". L documents all of the current ones. + +A warning is raised if you escape a character that isn't a metacharacter +and isn't part of a currently defined escape sequence. + +You can specify how many of the previous thing you want to match on by +using quantifiers (where "thing" means one of: a literal character, one +of the constructs listed above, or a group of either of them in +parentheses). * zero or more of the previous thing + one or more of the previous thing diff --git a/pod/perlre.pod b/pod/perlre.pod index 3d046ac64f26..9436d352cb8a 100644 --- a/pod/perlre.pod +++ b/pod/perlre.pod @@ -1348,26 +1348,42 @@ their punctuation character equivalents, however at the trade-off that you have to tell perl when you want to use them. X

X

-=head2 Quoting metacharacters +=head2 Quoting (escaping) metacharacters -Backslashed metacharacters in Perl are alphanumeric, such as C<\b>, -C<\w>, C<\n>. Unlike some other regular expression languages, there -are no backslashed symbols that aren't alphanumeric. So anything -that looks like C<\\>, C<\(>, C<\)>, C<\[>, C<\]>, C<\{>, or C<\}> is -always -interpreted as a literal character, not a metacharacter. This was -once used in a common idiom to disable or quote the special meanings -of regular expression metacharacters in a string that you want to -use for a pattern. Simply quote all non-"word" characters: +To cause a metacharacter to match its literal self, you precede it with +a backslash. Unlike some other regular expression languages, any +sequence consisting of a backslash followed by a non-alphanumeric +matches that non-alphanumeric, literally. So things like C<\\>, C<\(>, +C<\)>, C<\[>, C<\]>, C<\{>, or C<\}> are always interpreted as the +literal character that follows the backslash. - $pattern =~ s/(\W)/\\$1/g; +(That's not true when an alphanumeric character is preceded by a +backslash. There are a few such "escape sequences", like C<\w>, which have +special matching behaviors in Perl. All such are currently limited to +ASCII-range alphanumerics.) + +The best method to escape metacharacters is to use the +C> function, or the equivalent, but +more flexible and often more convenient, C<\Q> metaquoting escape +sequence: + + $pattern = quotemeta $pattern; + +This changes C<$pattern> so that the metacharacters are quoted. You can +then do + + $string =~ s/$pattern/foo/; + +and be assured that any metacharacters in C<$pattern> will match their +literal selves. If you instead use C<\Q>, like: + + $string =~ s/\Q$pattern/foo/; -(If C is set, then this depends on the current locale.) 
-Today it is more common to use the C> -function or the C<\Q> metaquoting escape sequence to disable all -metacharacters' special meanings like this: +you don't have to have a separate C<$pattern> variable. Further, there +is an additional escape sequence, C<\E>, that can be combined with C<\Q> +to allow you to escape whatever portions of the pattern you desire: - /$unquoted\Q$quoted\E$unquoted/ + $string =~ s/$unquoted\Q$quoted\E$unquoted/foo/; Beware that if you put literal backslashes (those not inside interpolated variables) between C<\Q> and C<\E>, double-quotish @@ -1375,7 +1391,18 @@ backslash interpolation may lead to confusing results. If you I to use literal backslashes within C<\Q...\E>, consult L. -C and C<\Q> are fully described in L. +In older code, you may see something like this: + + $pattern =~ s/(\W)/\\$1/g; + $string =~ s/$pattern/foo/; + +This simply adds backslashes before all non-"word" characters to disable +any special meanings they might have. (If S> is in +effect, the current locale can affect the results.) This paradigm is +inadequate for Unicode. + +C and C<\Q> are more fully described in +L. =head2 Extended Patterns @@ -3384,6 +3411,10 @@ Subroutine call to a named capture group. Equivalent to C<< (?&I) >>. =back +=head2 Quoting metacharacters + +This section has been replaced by L. + =head1 BUGS There are a number of issues with regard to case-insensitive matching diff --git a/pod/perlrebackslash.pod b/pod/perlrebackslash.pod index 14afb9728455..39153f6a1f60 100644 --- a/pod/perlrebackslash.pod +++ b/pod/perlrebackslash.pod @@ -90,8 +90,8 @@ as C \o{} Octal escape sequence. \p{}, \pP Match any character with the given Unicode property. \P{}, \PP Match any character without the given property. - \Q Quote (disable) pattern metacharacters till \E. Not - in []. + \Q Quote (disable) pattern metacharacters till \E. + (Also called "escape".) Not in []. \r Return character. \R Generic new line. Not in []. \s Match any whitespace character. 
@@ -350,11 +350,11 @@ them, until either the end of the pattern or the next occurrence of C<\E>, whichever comes first. They provide functionality similar to what the functions C and C provide. -C<\Q> is used to quote (disable) pattern metacharacters, up to the next -C<\E> or the end of the pattern. C<\Q> adds a backslash to any character -that could have special meaning to Perl. In the ASCII range, it quotes -every character that isn't a letter, digit, or underscore. See -L for details on what gets quoted for non-ASCII +C<\Q> is used to quote or escape (disable) pattern metacharacters, up to +the next C<\E> or the end of the pattern. C<\Q> adds a backslash to any +character that could have special meaning to Perl. In the ASCII range, +it quotes every character that isn't a letter, digit, or underscore. +See L for details on what gets quoted for non-ASCII code points. Using this ensures that any character between C<\Q> and C<\E> will be matched literally, not interpreted as a metacharacter by the regex engine. diff --git a/pod/perlreref.pod b/pod/perlreref.pod index 6955a7fb7a65..624edf2a5e7d 100644 --- a/pod/perlreref.pod +++ b/pod/perlreref.pod @@ -318,7 +318,7 @@ Captured groups are numbered according to their I paren. fc Foldcase a string pos Return or set current match position - quotemeta Quote metacharacters + quotemeta Quote metacharacters (escape their normal meaning) reset Reset m?pattern? status study Analyze string for optimizing matching diff --git a/pod/perlretut.pod b/pod/perlretut.pod index 03ddaffe612f..43320963dafb 100644 --- a/pod/perlretut.pod +++ b/pod/perlretut.pod @@ -187,7 +187,7 @@ C<"["> respectively; other gotchas apply. The significance of each of these will be explained in the rest of the tutorial, but for now, it is important only to know that a metacharacter can be matched as-is by putting a backslash before -it: +it. This is called "escaping" or "quoting" it. 
Some examples: "2+2=4" =~ /2+2/; # doesn't match, + is a metacharacter "2+2=4" =~ /2\+2/; # matches, \+ is treated like an ordinary + diff --git a/pp.c b/pp.c index 6e53f9b71129..5d174fd45829 100644 --- a/pp.c +++ b/pp.c @@ -5082,7 +5082,7 @@ PP(pp_quotemeta) else if (UTF8_IS_NEXT_CHAR_DOWNGRADEABLE(s, s + len)) { if ( #ifdef USE_LOCALE_CTYPE - /* In locale, we quote all non-ASCII Latin1 chars. + /* In locale, we escape all non-ASCII Latin1 chars. * Otherwise use the quoting rules */ IN_LC_RUNTIME(LC_CTYPE) @@ -5116,7 +5116,7 @@ PP(pp_quotemeta) } } else { - /* For non UNI_8_BIT (and hence in locale) just quote all \W + /* For non UNI_8_BIT (and hence in locale) just escape all \W * including everything above ASCII */ while (len--) { if (!isWORDCHAR_A(*s))