Commit 89d254a

add captures to MATCH
1 parent 741ab57 commit 89d254a

17 files changed

+855
-453
lines changed

docpages/basic-language-reference/keywords/MATCH.md

Lines changed: 65 additions & 50 deletions
@@ -2,42 +2,46 @@
22

33
```basic
44
MATCH result, pattern$, haystack$
5+
MATCH result, pattern$, haystack$, var1$, var2$, ...
56
```
67

7-
Evaluates a **POSIX ERE** (extended regular expression) against a string and stores **1** for a match or **0** for no match into `result`.
8+
Evaluates a **POSIX ERE** (extended regular expression) against a string.
89

9-
* `result` must be an **integer** variable.
10-
* `pattern$` and `haystack$` are **strings**.
11-
* Matching is **ASCII-only** (no locale/Unicode).
12-
* No capture groups or sub-matches are returned; this is a **yes/no** test.
13-
14-
`MATCH` runs **cooperatively**: very large or pathological patterns are executed in slices.
10+
* In the first form, stores **1** for a match or **0** for no match into `result`.
11+
* In the second form, also assigns text captured by **parenthesised sub-expressions** to additional string variables (`var1$`, `var2$`, …).
1512

16-
\remark If the pattern is invalid, an error is raised with a descriptive message from the regex engine. Without an error handler, the program terminates. With an `ON ERROR` handler, control passes there.
13+
\remark Matching is **ASCII-only** (no locale or Unicode).
14+
\remark All regular expressions follow **POSIX ERE** syntax.
1715

1816
---
1917

20-
### Supported syntax (POSIX ERE subset)
18+
### Forms
2119

22-
* Literals: `ABC`
23-
* Any char: `.`
24-
* Quantifiers: `* + ?` (greedy)
25-
* Character classes: `[abc]`, ranges `[a-z]`, negation `[^0-9]`
26-
* Alternation: `A|B`
27-
* Anchors: `^` (start of string), `$` (end of string)
20+
#### Boolean match
2821

29-
### Not supported
22+
```basic
23+
MATCH result, pattern$, haystack$
24+
```
3025

31-
* Backreferences `\1`, `\2`, …
32-
* Inline flags like `(?i)` (use explicit classes instead, or upper/lower where appropriate)
33-
* PCRE extensions (`\d`, `\w`, lookaround, etc.)
34-
* Multiline mode: `^` and `$` match **string** boundaries only.
26+
* `result` must be an **integer** variable.
27+
* `pattern$` and `haystack$` are **strings**.
28+
* Returns 1 for a match, 0 for no match.
29+
30+
#### Match with captures
31+
32+
```basic
33+
MATCH result, pattern$, haystack$, cap1$, cap2$, ...
34+
```
35+
36+
* Each parenthesised group in `pattern$` (e.g. `(abc)`) is captured and copied into successive string variables.
37+
* Missing or non-participating groups yield `""`.
38+
* If the pattern contains fewer capture groups than variables, the extras receive empty strings.
3539

3640
---
3741

3842
### Examples
3943

40-
**Simple literal**
44+
**Simple match**
4145

4246
```basic
4347
MATCH R, "HELLO", "HELLO WORLD"
@@ -52,48 +56,33 @@ PRINT R ' 1
5256
5357
MATCH R, "END$", "THE END"
5458
PRINT R ' 1
55-
56-
MATCH R, "^A", "BA"
57-
PRINT R ' 0
5859
```
5960

60-
**Alternation**
61+
**Alternation and character classes**
6162

6263
```basic
6364
MATCH R, "CAT|DOG", "HOTDOG"
6465
PRINT R ' 1
6566
66-
MATCH R, "RED|GREEN", "BLUE"
67-
PRINT R ' 0
68-
```
69-
70-
**Character classes and ranges**
71-
72-
```basic
73-
MATCH R, "[0-9]+", "foo123bar"
74-
PRINT R ' 1
75-
7667
MATCH R, "[A-Z][a-z]+", "Title"
7768
PRINT R ' 1
78-
79-
MATCH R, "[^x]*z$", "crab ballz"
80-
PRINT R ' 1
8169
```
8270

83-
**Wildcard and quantifiers**
71+
**Capturing sub-expressions**
8472

8573
```basic
86-
MATCH R, "A.*C", "AXYZC"
87-
PRINT R ' 1
74+
MATCH R, "([A-Za-z]+),([A-Za-z]+)", "Hello,World", FIRST$, SECOND$
75+
PRINT R, FIRST$, SECOND$ ' 1 Hello World
76+
```
8877

89-
MATCH R, "A.+C", "AC"
90-
PRINT R ' 0
78+
**No match clears captures**
9179

92-
MATCH R, "B*", "AAAA"
93-
PRINT R ' 1 ' empty match is allowed
80+
```basic
81+
MATCH R, "([0-9]+)", "No digits here", NUM$
82+
PRINT R, NUM$ ' 0 ""
9483
```
9584

96-
**Handling invalid patterns**
85+
**Invalid pattern handling**
9786

9887
```basic
9988
ON ERROR PROCbad
@@ -108,9 +97,35 @@ END
10897

10998
---
11099

100+
### Supported syntax (POSIX ERE subset)
101+
102+
| Feature | Example | Description |
103+
| ----------------- | -------------------------- | ---------------------------- |
104+
| Literals | `ABC` | exact characters |
105+
| Any char | `.` | matches any single character |
106+
| Quantifiers | `* + ?` | greedy repetition |
107+
| Character classes | `[abc]`, `[A-Z]`, `[^0-9]` | set, range, negation |
108+
| Alternation | `A\|B` | match A or B |
109+
| Anchors | `^`, `$` | start / end of string |
110+
| Capturing groups | `(ABC)` | capture substring |
111+
112+
---
113+
114+
### Not supported
115+
116+
* Backreferences `\1`, `\2`, …
117+
* Inline flags `(?i)` etc.
118+
* PCRE-style escapes (`\d`, `\w`, lookaround, …)
119+
* Multiline mode (`^` and `$` match string boundaries only)
120+
121+
---
122+
111123
### Notes
112124

113-
* Matching is **case-sensitive** by default. To approximate case-insensitive tests, normalise your data (e.g., convert both strings to upper case before matching) or use character classes (e.g., `[Hh][Ee][Ll][Ll][Oo]`).
114-
* Because `MATCH` is cooperative, very large inputs or patterns may take multiple idle ticks to complete. You do not need to poll—control returns to your program automatically once finished.
115-
* `^` and `$` are **string** anchors, not line anchors; there is no multiline mode.
116-
* The engine is compiled with `REG_NOSUB`; capture offsets are not available to BASIC code.
125+
* Matching is **case-sensitive**. To simulate case-insensitive matching, normalise both strings or use explicit character classes.
126+
* With captures, **co-operative execution is disabled** — the operation completes immediately.
127+
* Without captures, matching runs **co-operatively** across idle ticks for long inputs.
128+
* If the pattern is invalid, the engine reports a descriptive message.
129+
Without an error handler, the program terminates;
130+
with `ON ERROR PROCname`, control transfers to the handler.
131+
* Capture results are always independent copies; modifying the original string has no effect on captured values.
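
The capture rules documented in this file can be modelled outside BASIC. The sketch below is only an illustration under stated assumptions: it uses Python's `re` module as a stand-in (not the POSIX ERE engine that `MATCH` uses), and `match_with_captures` and its `n_vars` parameter are hypothetical names, not part of the interpreter. It shows the documented behaviour: `result` is 1 or 0, non-participating groups yield `""`, surplus variables receive `""`, and a failed match clears every capture. What happens when a pattern has more groups than variables is not stated in the documentation; the sketch simply drops the surplus groups.

```python
import re


def match_with_captures(pattern, haystack, n_vars):
    """Model of `MATCH result, pattern$, haystack$, var1$, ...`.

    Returns (result, captures), where captures has exactly n_vars strings.
    Assumption: Python's `re` stands in for the real POSIX ERE engine.
    """
    m = re.search(pattern, haystack)
    if m is None:
        # No match: result is 0 and every capture variable is cleared.
        return 0, [""] * n_vars
    # Groups that did not participate in the match become empty strings.
    groups = [g if g is not None else "" for g in m.groups()]
    # Pad with "" when there are more variables than groups.
    # Assumption: surplus groups beyond n_vars are dropped.
    groups = (groups + [""] * n_vars)[:n_vars]
    return 1, groups


result, caps = match_with_captures(r"([A-Za-z]+),([A-Za-z]+)", "Hello,World", 2)
print(result, caps)
```

The padding step keeps the number of returned captures equal to the number of variables, which mirrors the rule that extra variables receive empty strings.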

docs/MATCH.html

Lines changed: 67 additions & 51 deletions
@@ -96,38 +96,38 @@
9696
</div><!--header-->
9797
<div class="contents">
9898
<div class="textblock"><div class="fragment"><div class="line">MATCH result, pattern$, haystack$</div>
99-
</div><!-- fragment --><p>Evaluates a <b>POSIX ERE</b> (extended regular expression) against a string and stores <b>1</b> for a match or <b>0</b> for no match into <span class="tt">result</span>.</p>
99+
<div class="line">MATCH result, pattern$, haystack$, var1$, var2$, ...</div>
100+
</div><!-- fragment --><p>Evaluates a <b>POSIX ERE</b> (extended regular expression) against a string.</p>
100101
<ul>
102+
<li>In the first form, stores <b>1</b> for a match or <b>0</b> for no match into <span class="tt">result</span>.</li>
103+
<li>In the second form, also assigns text captured by <b>parenthesised sub-expressions</b> to additional string variables (<span class="tt">var1$</span>, <span class="tt">var2$</span>, …).</li>
104+
</ul>
105+
<dl class="section remark"><dt>Remarks</dt><dd>Matching is <b>ASCII-only</b> (no locale or Unicode). </dd>
106+
<dd>
107+
All regular expressions follow <b>POSIX ERE</b> syntax.</dd></dl>
108+
<hr />
109+
<h3 class="doxsection"><a class="anchor" id="forms-1"></a>
110+
Forms</h3>
111+
<h4 class="doxsection"><a class="anchor" id="boolean-match"></a>
112+
Boolean match</h4>
113+
<div class="fragment"><div class="line">MATCH result, pattern$, haystack$</div>
114+
</div><!-- fragment --><ul>
101115
<li><span class="tt">result</span> must be an <b>integer</b> variable.</li>
102116
<li><span class="tt">pattern$</span> and <span class="tt">haystack$</span> are <b>strings</b>.</li>
103-
<li>Matching is <b>ASCII-only</b> (no locale/Unicode).</li>
104-
<li>No capture groups or sub-matches are returned; this is a <b>yes/no</b> test.</li>
117+
<li>Returns 1 for a match, 0 for no match.</li>
105118
</ul>
106-
<p><span class="tt">MATCH</span> runs <b>cooperatively</b>: very large or pathological patterns are executed in slices.</p>
107-
<dl class="section remark"><dt>Remarks</dt><dd>If the pattern is invalid, an error is raised with a descriptive message from the regex engine. Without an error handler, the program terminates. With an <span class="tt">ON ERROR</span> handler, control passes there.</dd></dl>
108-
<hr />
109-
<h3 class="doxsection"><a class="anchor" id="supported-syntax-posix-ere-subset"></a>
110-
Supported syntax (POSIX ERE subset)</h3>
111-
<ul>
112-
<li>Literals: <span class="tt">ABC</span></li>
113-
<li>Any char: <span class="tt">.</span></li>
114-
<li>Quantifiers: <span class="tt">* + ?</span> (greedy)</li>
115-
<li>Character classes: <span class="tt">[abc]</span>, ranges <span class="tt">[a-z]</span>, negation <span class="tt">[^0-9]</span></li>
116-
<li>Alternation: <span class="tt">A|B</span></li>
117-
<li>Anchors: <span class="tt">^</span> (start of string), <span class="tt">$</span> (end of string)</li>
118-
</ul>
119-
<h3 class="doxsection"><a class="anchor" id="not-supported"></a>
120-
Not supported</h3>
121-
<ul>
122-
<li>Backreferences <span class="tt">\1</span>, <span class="tt">\2</span>, …</li>
123-
<li>Inline flags like <span class="tt">(?i)</span> (use explicit classes instead, or upper/lower where appropriate)</li>
124-
<li>PCRE extensions (<span class="tt">\d</span>, <span class="tt">\w</span>, lookaround, etc.)</li>
125-
<li>Multiline mode: <span class="tt">^</span> and <span class="tt">$</span> match <b>string</b> boundaries only.</li>
119+
<h4 class="doxsection"><a class="anchor" id="match-with-captures"></a>
120+
Match with captures</h4>
121+
<div class="fragment"><div class="line">MATCH result, pattern$, haystack$, cap1$, cap2$, ...</div>
122+
</div><!-- fragment --><ul>
123+
<li>Each parenthesised group in <span class="tt">pattern$</span> (e.g. <span class="tt">(abc)</span>) is captured and copied into successive string variables.</li>
124+
<li>Missing or non-participating groups yield <span class="tt">""</span>.</li>
125+
<li>If the pattern contains fewer capture groups than variables, the extras receive empty strings.</li>
126126
</ul>
127127
<hr />
128128
<h3 class="doxsection"><a class="anchor" id="examples-140"></a>
129129
Examples</h3>
130-
<p><b>Simple literal</b></p>
130+
<p><b>Simple match</b></p>
131131
<div class="fragment"><div class="line">MATCH R, &quot;HELLO&quot;, &quot;HELLO WORLD&quot;</div>
132132
<div class="line">IF R THEN PRINT &quot;Found&quot;</div>
133133
</div><!-- fragment --><p><b>Anchors</b></p>
@@ -136,34 +136,19 @@ <h3 class="doxsection"><a class="anchor" id="examples-140"></a>
136136
<div class="line"> </div>
137137
<div class="line">MATCH R, &quot;END$&quot;, &quot;THE END&quot;</div>
138138
<div class="line">PRINT R &#39; 1</div>
139-
<div class="line"> </div>
140-
<div class="line">MATCH R, &quot;^A&quot;, &quot;BA&quot;</div>
141-
<div class="line">PRINT R &#39; 0</div>
142-
</div><!-- fragment --><p><b>Alternation</b></p>
139+
</div><!-- fragment --><p><b>Alternation and character classes</b></p>
143140
<div class="fragment"><div class="line">MATCH R, &quot;CAT|DOG&quot;, &quot;HOTDOG&quot;</div>
144141
<div class="line">PRINT R &#39; 1</div>
145142
<div class="line"> </div>
146-
<div class="line">MATCH R, &quot;RED|GREEN&quot;, &quot;BLUE&quot;</div>
147-
<div class="line">PRINT R &#39; 0</div>
148-
</div><!-- fragment --><p><b>Character classes and ranges</b></p>
149-
<div class="fragment"><div class="line">MATCH R, &quot;[0-9]+&quot;, &quot;foo123bar&quot;</div>
150-
<div class="line">PRINT R &#39; 1</div>
151-
<div class="line"> </div>
152143
<div class="line">MATCH R, &quot;[A-Z][a-z]+&quot;, &quot;Title&quot;</div>
153144
<div class="line">PRINT R &#39; 1</div>
154-
<div class="line"> </div>
155-
<div class="line">MATCH R, &quot;[^x]*z$&quot;, &quot;crab ballz&quot;</div>
156-
<div class="line">PRINT R &#39; 1</div>
157-
</div><!-- fragment --><p><b>Wildcard and quantifiers</b></p>
158-
<div class="fragment"><div class="line">MATCH R, &quot;A.*C&quot;, &quot;AXYZC&quot;</div>
159-
<div class="line">PRINT R &#39; 1</div>
160-
<div class="line"> </div>
161-
<div class="line">MATCH R, &quot;A.+C&quot;, &quot;AC&quot;</div>
162-
<div class="line">PRINT R &#39; 0</div>
163-
<div class="line"> </div>
164-
<div class="line">MATCH R, &quot;B*&quot;, &quot;AAAA&quot;</div>
165-
<div class="line">PRINT R &#39; 1 &#39; empty match is allowed</div>
166-
</div><!-- fragment --><p><b>Handling invalid patterns</b></p>
145+
</div><!-- fragment --><p><b>Capturing sub-expressions</b></p>
146+
<div class="fragment"><div class="line">MATCH R, &quot;([A-Za-z]+),([A-Za-z]+)&quot;, &quot;Hello,World&quot;, FIRST$, SECOND$</div>
147+
<div class="line">PRINT R, FIRST$, SECOND$ &#39; 1 Hello World</div>
148+
</div><!-- fragment --><p><b>No match clears captures</b></p>
149+
<div class="fragment"><div class="line">MATCH R, &quot;([0-9]+)&quot;, &quot;No digits here&quot;, NUM$</div>
150+
<div class="line">PRINT R, NUM$ &#39; 0 &quot;&quot;</div>
151+
</div><!-- fragment --><p><b>Invalid pattern handling</b></p>
167152
<div class="fragment"><div class="line">ON ERROR PROCbad</div>
168153
<div class="line">MATCH R, &quot;(?i)HELLO&quot;, &quot;hello&quot; &#39; invalid: (?i) not supported</div>
169154
<div class="line">PRINT &quot;this line is not reached&quot;</div>
@@ -173,13 +158,44 @@ <h3 class="doxsection"><a class="anchor" id="examples-140"></a>
173158
<div class="line">PRINT &quot;Regex error!&quot;</div>
174159
<div class="line">END</div>
175160
</div><!-- fragment --><hr />
161+
<h3 class="doxsection"><a class="anchor" id="supported-syntax-posix-ere-subset"></a>
162+
Supported syntax (POSIX ERE subset)</h3>
163+
<table class="markdownTable">
164+
<tr class="markdownTableHead">
165+
<th class="markdownTableHeadNone">Feature </th><th class="markdownTableHeadNone">Example </th><th class="markdownTableHeadNone">Description </th></tr>
166+
<tr class="markdownTableRowOdd">
167+
<td class="markdownTableBodyNone">Literals </td><td class="markdownTableBodyNone"><span class="tt">ABC</span> </td><td class="markdownTableBodyNone">exact characters </td></tr>
168+
<tr class="markdownTableRowEven">
169+
<td class="markdownTableBodyNone">Any char </td><td class="markdownTableBodyNone"><span class="tt">.</span> </td><td class="markdownTableBodyNone">matches any single character </td></tr>
170+
<tr class="markdownTableRowOdd">
171+
<td class="markdownTableBodyNone">Quantifiers </td><td class="markdownTableBodyNone"><span class="tt">* + ?</span> </td><td class="markdownTableBodyNone">greedy repetition </td></tr>
172+
<tr class="markdownTableRowEven">
173+
<td class="markdownTableBodyNone">Character classes </td><td class="markdownTableBodyNone"><span class="tt">[abc]</span>, <span class="tt">[A-Z]</span>, <span class="tt">[^0-9]</span> </td><td class="markdownTableBodyNone">set, range, negation </td></tr>
174+
<tr class="markdownTableRowOdd">
175+
<td class="markdownTableBodyNone">Alternation </td><td class="markdownTableBodyNone"><span class="tt">A|B</span> </td><td class="markdownTableBodyNone">match A or B </td></tr>
176+
<tr class="markdownTableRowEven">
177+
<td class="markdownTableBodyNone">Anchors </td><td class="markdownTableBodyNone"><span class="tt">^</span>, <span class="tt">$</span> </td><td class="markdownTableBodyNone">start / end of string </td></tr>
178+
<tr class="markdownTableRowOdd">
179+
<td class="markdownTableBodyNone">Capturing groups </td><td class="markdownTableBodyNone"><span class="tt">(ABC)</span> </td><td class="markdownTableBodyNone">capture substring </td></tr>
180+
</table>
181+
<hr />
182+
<h3 class="doxsection"><a class="anchor" id="not-supported"></a>
183+
Not supported</h3>
184+
<ul>
185+
<li>Backreferences <span class="tt">\1</span>, <span class="tt">\2</span>, …</li>
186+
<li>Inline flags <span class="tt">(?i)</span> etc.</li>
187+
<li>PCRE-style escapes (<span class="tt">\d</span>, <span class="tt">\w</span>, lookaround, …)</li>
188+
<li>Multiline mode (<span class="tt">^</span> and <span class="tt">$</span> match string boundaries only)</li>
189+
</ul>
190+
<hr />
176191
<h3 class="doxsection"><a class="anchor" id="notes-151"></a>
177192
Notes</h3>
178193
<ul>
179-
<li>Matching is <b>case-sensitive</b> by default. To approximate case-insensitive tests, normalise your data (e.g., convert both strings to upper case before matching) or use character classes (e.g., <span class="tt">[Hh][Ee][Ll][Ll][Oo]</span>).</li>
180-
<li>Because <span class="tt">MATCH</span> is cooperative, very large inputs or patterns may take multiple idle ticks to complete. You do not need to poll—control returns to your program automatically once finished.</li>
181-
<li><span class="tt">^</span> and <span class="tt">$</span> are <b>string</b> anchors, not line anchors; there is no multiline mode.</li>
182-
<li>The engine is compiled with <span class="tt">REG_NOSUB</span>; capture offsets are not available to BASIC code. </li>
194+
<li>Matching is <b>case-sensitive</b>. To simulate case-insensitive matching, normalise both strings or use explicit character classes.</li>
195+
<li>With captures, <b>co-operative execution is disabled</b> — the operation completes immediately.</li>
196+
<li>Without captures, matching runs <b>co-operatively</b> across idle ticks for long inputs.</li>
197+
<li>If the pattern is invalid, the engine reports a descriptive message. Without an error handler, the program terminates; with <span class="tt">ON ERROR PROCname</span>, control transfers to the handler.</li>
198+
<li>Capture results are always independent copies; modifying the original string has no effect on captured values. </li>
183199
</ul>
184200
</div></div><!-- contents -->
185201
</div><!-- PageDoc -->

docs/doxygen_crawl.html

Lines changed: 3 additions & 0 deletions
@@ -371,7 +371,10 @@
371371
<a href="LTRIM.html#examples-104"/>
372372
<a href="LTRIM.html#notes-103"/>
373373
<a href="MATCH.html"/>
374+
<a href="MATCH.html#boolean-match"/>
374375
<a href="MATCH.html#examples-140"/>
376+
<a href="MATCH.html#forms-1"/>
377+
<a href="MATCH.html#match-with-captures"/>
375378
<a href="MATCH.html#not-supported"/>
376379
<a href="MATCH.html#notes-151"/>
377380
<a href="MATCH.html#supported-syntax-posix-ere-subset"/>
