Backslash sequences #5

SicroAtGit · 2021-08-14T10:56:50Z

SicroAtGit
Aug 14, 2021
Maintainer

In addition to character classes, there will also be shorthand character classes. However, I'm not quite sure yet which ones there should be and which characters they should cover.

According to this website, the different RegEx engines cover different characters in the shorthand character classes:
https://www.regular-expressions.info/shorthand.html

The current listing:

\r for carriage return (Add escape sequence \r (carriage return) #8)
\n for new line (Add escape sequence \n (line feed) #9)
\t for horizontal tab character (Add escape sequence \t (horizontal tab) #10)
\f for form feed (Add escape sequence \f (form feed) #22)
\d for digit (Add predefined character class \d (digit) #14)
\D for non-digit (Add predefined character class \D (no digit) #23)
\s for whitespace character (Add predefined character class \s (whitespace) #15)
\S for non-whitespace character (Add predefined character class \S (no whitespace) #24)
\w for word character (Add predefined character class \w (word character) #25)
\W for non-word character (Add predefined character class \W (no word character) #26)
\xhh (Add escape sequence \xhh (character with hex code hh) #17)
\uhhhh (Add escape sequence \uhhhh (character with hex code hhhh) #18)
\Q...\E (Add escape sequence \Q...\E #13)

tajmone · 2021-08-14T12:53:55Z

tajmone
Aug 14, 2021

According to this website, the different RegEx engines cover different characters in the shorthand character classes:
https://www.regular-expressions.info/shorthand.html

BTW, I've bought both RegexBuddy and RegexMagic from JGS (author of the website you linked), so if you need me to test some RegExs for you I'll happily do it. Both tools have a custom engine that emulates all versions of the major RegEx engines (so that you can test backward compatibility issues with any engine) plus the custom engine by JGS, which is very powerful (also documented at the website).

One of these two programs also allows debugging a RegEx by stepping through each individual matching step, in case you need to compare the expected behaviour of your code with the actual behaviour of other engines.

As for the shorthand classes to implement, it really depends on what your engine's goals are, which I'm guessing are mostly oriented toward lexer creation?

I'm not quite sure that \h and \v would be all that useful (vertical tabs are not used much in Western languages, and \t should suffice in place of \h). These also tend to have different meanings across engines.

Some other useful shorthands can be found here:

Special Characters — \Q...\E for escaping literal sequences.
Non-Printable Characters — \x ASCII chars, \u Unicode.
Anchors — \A and \Z are very useful for lexing.
Word Boundaries — Tcl style \m and \M.

I know that the above don't all qualify as character shorthands, since some of them are more abstract in nature, but still...

14 replies

tajmone Aug 29, 2021

You mean PB 6 (upcoming version) and PB 5.46

Yes, I must have typed the wrong keys (my eyesight has degraded a lot in the last year, so I can't really read well on the screen).

Where do you see a lot of multiplication calculations in the code?

In offsets, where the number of chars is multiplied by SizeOfChar().

SicroAtGit Aug 30, 2021
Maintainer Author

In offsets, where the number of chars is multiplied by SizeOfChar().

https://github.com/SicroAtGit/PB-RegEx-Engine/blob/b84cefe688ec9978bb0cac644f5497b26e97d91a/RegExEngine.pbi#L242

Where do you see multiplication in this line of code?

currentPosition is a memory pointer into a string. SizeOf(Character) evaluates to 1 (ASCII mode) or 2 (Unicode mode), and I add that value to currentPosition so that the pointer then points to the next character in the string. It is an addition, not a multiplication.

It works like this little example code:

*text.Character = @"Example text" ; Pointer to the first character of the string
Debug Chr(*text\c) ; Outputs "E"
*text + SizeOf(Character) ; Advance the pointer by the size of one character
Debug Chr(*text\c) ; Outputs "x"

SicroAtGit Sep 5, 2021
Maintainer Author

I was confused by the article on the Regular Expression website. I think I have understood it now. \x saves writing the leading 00 of \u if you only want to specify character codes between 00 and FF. So \x21 is the same as \u0021.
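
As a quick cross-check of that equivalence in another engine, here is a minimal sketch using Python's `re` module. It only illustrates the general convention and says nothing about how the PB engine implements it; the names `short_form` and `long_form` are made up for this example.

```python
import re

# Python's re module understands both \xhh and \uhhhh in str patterns.
short_form = re.compile(r"\x21")
long_form = re.compile(r"\u0021")

# Both patterns describe the same single character, "!" (code 0x21).
assert short_form.fullmatch("!") is not None
assert long_form.fullmatch("!") is not None

# \xhh can only reach 0x00-0xFF; a code such as 0x20AC needs the \u form.
euro_match = re.fullmatch(r"\u20AC", chr(0x20AC))
assert euro_match is not None
```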

tajmone Sep 5, 2021

Yes, but then it's up to you to decide how to implement these. My suggestion would be to have \x handle ISO-8859-1 characters, and \u Unicode code points. If you don't specify a single-byte encoding for \x, there's no way to know which Unicode character a value above 127 should translate to. Bear in mind that many of these RegEx engines date back to the era of single-byte encodings, code pages, etc., so their conventions might be rooted in some old standards.
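
To illustrate why the choice of encoding matters for values above 127, here is a small Python sketch (Python is used only for demonstration; the helper name `decode_byte` is hypothetical). ISO-8859-1 maps each byte value to the Unicode code point with the same number, while another single-byte encoding such as Windows-1252 does not.

```python
def decode_byte(value, encoding):
    """Decode one byte value (0-255) under the given single-byte encoding."""
    return bytes([value]).decode(encoding)

# ISO-8859-1: byte value N always becomes code point N.
latin1_is_identity = all(
    ord(decode_byte(value, "iso-8859-1")) == value for value in range(256)
)

# Windows-1252: the same byte 0x80 becomes the euro sign (U+20AC) instead.
cp1252_0x80 = decode_byte(0x80, "cp1252")

print(latin1_is_identity, hex(ord(cp1252_0x80)))
```

With ISO-8859-1 as the fixed meaning of \x, \xhh and \u00hh would always denote the same character.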

SicroAtGit Sep 18, 2021
Maintainer Author

OK, thank you very much!

I have now added to the list everything from your comment that can be implemented with a standard DFA.

Please create a new comment (no reply) for more suggestions.

SicroAtGit · 2022-01-02T13:18:40Z

SicroAtGit
Jan 2, 2022
Maintainer Author

I have now looked again at the documentation of Rust's regex crate and now have a picture of which characters (Unicode character classes) the various predefined RegEx character classes should contain.

With the help of the very useful UnicodeSet tool on the official Unicode website, I can easily have the Unicode character classes converted to RegEx character classes. I wrote a tool that generates the character class tables from this output.

I have adjusted the listing at the top and the relevant issues to use the Unicode character classes. For example, \d now matches not only [0-9], but all characters of the Unicode character class Nd, except those above \uFFFF. The RegEx engine will still only support Unicode characters up to \uFFFF.
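
The idea of such a table can be sketched in Python as follows. This is not the tool mentioned above, and it uses Python's bundled `unicodedata` database rather than the UnicodeSet website, so the exact ranges depend on the Unicode version shipped with the interpreter. The names `bmp_ranges` and `in_ranges` are hypothetical.

```python
import unicodedata

BMP_LIMIT = 0xFFFF  # assumption taken from the post: nothing above \uFFFF


def bmp_ranges(category):
    """Return sorted inclusive (first, last) code point ranges for a category."""
    ranges = []
    for cp in range(BMP_LIMIT + 1):
        if unicodedata.category(chr(cp)) != category:
            continue
        if ranges and ranges[-1][1] == cp - 1:
            # Extend the current range by one code point.
            ranges[-1] = (ranges[-1][0], cp)
        else:
            # Start a new range.
            ranges.append((cp, cp))
    return ranges


def in_ranges(ranges, char):
    """Check whether a single character falls inside any of the ranges."""
    cp = ord(char)
    return any(first <= cp <= last for first, last in ranges)


digit_ranges = bmp_ranges("Nd")
print(len(digit_ranges), "ranges, starting with", digit_ranges[0])
```

With this table, `in_ranges(digit_ranges, "\u0660")` (ARABIC-INDIC DIGIT ZERO) is true, while a digit outside the BMP such as U+1D7CE is rejected because the scan stops at \uFFFF.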

0 replies

SicroAtGit · 2022-08-06T21:24:28Z

SicroAtGit
Aug 6, 2022
Maintainer Author

A backslash sequence that matches a byte value could also be useful. The RegEx engine could then also be used to match byte sequences (besides characters) in binary data. This is no longer unusual: the question comes up again and again on the Internet (search for: regex binary data).

The NFA/DFA of the RegEx engine already works byte-based, so the implementation should not be difficult.

Example:

Syntax: \b00 up to \bFF

Match the byte sequence: 3C, 3D
Match the string: Test

\b3C\b3DTest

\b has been used only for the example. We should use another letter to avoid confusion, because \b is commonly the backslash sequence for a word boundary.

2 replies

tajmone Aug 8, 2022

The NFA/DFA of the RegEx engine already works byte-based, so the implementation should not be difficult.

Interesting. I haven't yet had the time to look at the implementation details, but I assumed that since PB uses the UCS2 encoding internally to represent strings and "Unicode" characters every character would map to two bytes — i.e. if it's an ASCII char it would be followed by a null char (e.g. A → 0x41 + 0x00).

When using RegExs to parse binary data, end users probably want a per-byte fine-grain control over matched data, i.e. no null-padding to ensure a Char is a two-bytes UCS entity (A → 0x41). Isn't this going to create alignment problems?

SicroAtGit Aug 8, 2022
Maintainer Author

I assumed that since PB uses the UCS2 encoding internally to represent strings and "Unicode" characters every character would map to two bytes — i.e. if it's an ASCII char it would be followed by a null char (e.g. A → 0x41 + 0x00).

Yes, that's how the engine works:

A → 0x41 + 0x00
\x41 → 0x41 + 0x00
\u0041 → 0x41 + 0x00

When using RegExs to parse binary data, end users probably want a per-byte fine-grain control over matched data, i.e. no null-padding to ensure a Char is a two-bytes UCS entity (A → 0x41). Isn't this going to create alignment problems?

This is what the \b from the example above is for:

It matches a single byte, instead of a character (2 bytes)
\b41 → 0x41
\b3C\b3DTest → 0x3C + 0x3D + 0x54 + 0x00 + 0x65 + 0x00 + 0x73 + 0x00 + 0x74 + 0x00

Alternatively, we could implement a new RegEx mode that only allows characters from \x00 to \xFF and where every character matches only a single byte:

(?b)Test → 0x54 + 0x65 + 0x73 + 0x74
(?b)\x41\x00 → 0x41 + 0x00
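
The byte layouts listed above can be reproduced with a short Python sketch. It only models the expected byte sequences, assuming 2-byte characters stored low byte first as in the examples; it is not the engine's implementation, and `char_bytes` and `byte_mode_bytes` are hypothetical helper names.

```python
def char_bytes(text):
    """Normal mode: every character becomes 2 bytes, low byte first."""
    return text.encode("utf-16-le")


def byte_mode_bytes(text):
    """Proposed (?b) mode: every character (0x00-0xFF) becomes 1 byte."""
    return text.encode("latin-1")


# \b3C\b3DTest: two raw bytes, then the 2-byte characters of "Test".
mixed = bytes([0x3C, 0x3D]) + char_bytes("Test")

# (?b)Test and (?b)\x41\x00: one byte per character, no padding.
plain = byte_mode_bytes("Test")
with_null = byte_mode_bytes("\x41\x00")

print(mixed.hex(" "), "|", plain.hex(" "), "|", with_null.hex(" "))
```

The difference between the two proposals shows up in the lengths: `mixed` is 10 bytes long, while `plain` is 4.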
