Commit 83eeeeb

compilation flags (#3)
* refactoring
* compilation flags through `/.../flags` syntax
* parser fix for flags
* caseless groups, better compilation
* better anchoring, README updates
1 parent 02e0c8d commit 83eeeeb

11 files changed: +378 -365 lines changed

MIK.md

Lines changed: 20 additions & 13 deletions
@@ -7,7 +7,10 @@ Accepts `mikmatch` syntax, along with some nice to haves.
 The grammar accepted by this extensions is the following

 ```bnf
-<main_match_case> ::= "/" <pattern> "/" EOF
+<main_match_case> ::= <pattern> EOF
+| "/" <pattern> "/" <flags> EOF
+
+<flags> ::= "" | "i" <flags> | "u" <flags>

 <main_let_expr> ::= <pattern> EOF

@@ -24,6 +27,7 @@ The grammar accepted by this extensions is the following
 | <basic_atom> "*"
 | <basic_atom> "+"
 | <basic_atom> "?"
+| <basic_atom> "~" # caseless matching
 | <basic_atom> "{" INT (n) "}" # match n times
 | <basic_atom> "{" INT (n) "-" INT (m) "}" # match at least n times, at most m times

@@ -39,12 +43,12 @@ The grammar accepted by this extensions is the following
 | "(" <pattern> ")"
 | "(" IDENT ")"
 | "(" IDENT "as" IDENT ")"
-| "(" IDENT "as" IDENT ":" INT_CONVERTER ")"
-| "(" IDENT "as" IDENT ":" FLOAT_CONVERTER ")"
+| "(" IDENT "as" IDENT ":" "int" ")"
+| "(" IDENT "as" IDENT ":" "float" ")"
 | "(" IDENT "as" IDENT ":=" <func_name> ")"
 | "(" <pattern> "as" IDENT ")"
-| "(" <pattern> "as" IDENT ":" INT_CONVERTER ")"
-| "(" <pattern> "as" IDENT ":" FLOAT_CONVERTER ")"
+| "(" <pattern> "as" IDENT ":" "int" ")"
+| "(" <pattern> "as" IDENT ":" "float" ")"
 | "(" <pattern> "as" IDENT ":=" <func_name> ")"

 <func_name> ::= IDENT
@@ -142,13 +146,19 @@ let mk_example name num mode = match mode with
 | Some _ | None -> { name; num; mode = `Default}

 let mk_example_re = function%mikmatch
-| {|/ (['a'-'z'] as name := String.capitalize_ascii) ' ' (digit+ as num : int) ' ' ('a'|'b' as mode)? >>> mk_example as res /|} -> (* (res : example) available here, and all other bound variables *)
+| {|/ (['a'-'z']~ as name := String.capitalize_ascii) ' ' (digit+ as num : int) ' ' ('a'|'b' as mode)? >>> mk_example as res /|} -> (* (res : example) available here, and all other bound variables *)
 | _ -> ...
 ```

-## Case Insensitive Match
+### Default catch-all case
+The PPX generates a default catch-all case if none is provided. This catch-all case executes if none of the RE match cases does, and it raises a `Failure` exception with the location of the function and name of the file where it was raised.
+
+## Flags

-You can use `%mikmatch_i`: `match%mikmatch_i` and `function%mikmatch_i`. (not available at the variable level)
+The `/` delimiters are optional, except if flags are needed using the syntax `/ ... / flags`, where `flags` can be
+- `i` for caseless matching
+- `u` for unanchored matching (`%mikmatch` is anchored at the beginning and end by default)
+- or both

 ## Alternatives
 ### Defining variables
@@ -159,7 +169,6 @@ let%mikmatch re = {|some regex|}
 let re = {%mikmatch|some regex|}
 ```

-No `/` delimiters are needed here.

 ### Matching:
 #### `match%mikmatch` and `function%mikmatch`
@@ -172,7 +181,7 @@ function%mikmatch
 | _ -> ...
 ```

-This match expression will compile all of the REs in the branches into one, and use marks to find which branch was executed.
+This match expression will compile all of the REs in the branches into one, with some exceptions around pattern guards, and use marks to find which branch was executed.
 Efficient if you have multiple branches.

 The regexes are anchored both at the beginning, and at the end. So, for example, the first match case will be compiled to `^some regex$`.
@@ -185,12 +194,10 @@ function
 | {%mikmatch|/ some regex /|} -> ...
 ...
 | "another string" -> ...
-| {%mikmatch_i|/ some regex /|} -> ...
+| {%mikmatch|/ some regex /|} -> ...
 ...
 | _ -> ...
 ```

 This match expression will compile all of the REs **individually**, and test each one in sequence.
-Recommended if you only matching one RE. It is less efficient than the first option for more than one RE, but allows raw string matching.
-
 It keeps all of the features (guards and such) of the previous extension, explored in [Semantics](#Semantics_and_Examples)
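
Note on the `<flags>` rule introduced in the MIK.md grammar: it accepts any sequence of `i` and `u` characters, including the empty one. A minimal sketch of that rule in plain OCaml follows. The `flags` record, `no_flags`, and `parse_flags` are hypothetical names chosen for illustration; the PPX's actual parser and internal representation are not shown in this diff.

```ocaml
(* Hypothetical representation of the two documented flags;
   not taken from the commit. *)
type flags = { caseless : bool; unanchored : bool }

let no_flags = { caseless = false; unanchored = false }

(* Follows <flags> ::= "" | "i" <flags> | "u" <flags>:
   consume one character at a time, rejecting anything else. *)
let parse_flags (s : string) : (flags, string) result =
  let rec go i acc =
    if i >= String.length s then Ok acc
    else
      match s.[i] with
      | 'i' -> go (i + 1) { acc with caseless = true }
      | 'u' -> go (i + 1) { acc with unanchored = true }
      | c -> Error (Printf.sprintf "unknown flag %C" c)
  in
  go 0 no_flags
```

Under this reading of the grammar, `parse_flags "iu"` and `parse_flags "ui"` both enable caseless and unanchored matching, and repeated flags are accepted without error. Whether the real parser rejects repeats is not stated in the diff.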

README.md

Lines changed: 79 additions & 112 deletions
Original file line numberDiff line numberDiff line change
@@ -1,19 +1,11 @@
1-
# PPXes for Working with Regular Expressions
1+
# PPX for Working with Regular Expressions
22

3-
This repo provides PPXes providing regular expression-based routing:
3+
This repo provides a PPX providing regular expression-based routing:
44

55
- `ppx_regexp_extended` maps to [re][] with the conventional last-match extraction
66
into `string` and `string option`. Two syntaxes for regular expressions available:
77
- `pcre`: The syntax of regular PCRE expressions
88
- `mikmatch`: Mimics the syntax of the [mikmatch](https://mjambon.github.io/mjambon2016/mikmatch-manual.html) tool
9-
- `ppx_tyre` maps to [Tyre][tyre] providing typed extraction into options,
10-
lists, tuples, objects, and polymorphic variants.
11-
12-
Another difference is that `ppx_regexp` works directly on strings
13-
essentially hiding the library calls, while `ppx_tyre` provides `Tyre.t` and
14-
`Tyre.route` which can be composed an applied using the Tyre library.
15-
16-
## `ppx_regexp_extended` - Regular Expression Matching with OCaml Patterns
179

1810
This syntax extension turns:
1911
```ocaml
@@ -32,6 +24,83 @@ let%pcre var = {| some regex |}
 let%mikmatch var = {| some regex |}
 ```

+### `%mikmatch`
+
+Full [%mikmatch guide](./MIK.md).
+
+#### Quick Links
+- [Variable capture](./MIK.md#variable-capture)
+- [Type conversion](./MIK.md#type-conversion)
+- [Different extensions](./MIK.md#alternatives)
+
+#### Motivational Examples
+
+URL parsing:
+```ocaml
+let parse s =
+let (scheme, first) =
+match s.[4] with
+| ':' -> `Http, 7
+| 's' -> `Https, 8
+| _ -> Exn.fail "parse: %S" s
+in
+let last = String.index_from s first '/' in
+let host = String.slice s ~first ~last in
+let (host,port) =
+match Stre.splitc host ':' with
+| exception _ -> host, default_port scheme
+| (host,port) -> host, int_of_string port
+in
+let (path,query,fragment) = make_path @@ String.slice s ~first:last in
+{ scheme; host; port; path; query; fragment }
+
+(* in mikmatch: *)
+
+let parse s =
+match%mikmatch s with
+| {|/ "http" ('s' as https)? "://" ([^ '/' ':']+ as host) (":" (digit+ as port : int))? '/'? (_* as rest) /|} ->
+let scheme = match https with Some _ -> `Https | None -> `Http in
+let port = match port with Some p -> p | None -> default_port scheme in
+let path, query, fragment = make_path ("/" ^ rest) in
+{ scheme; host; port; path; query; fragment }
+| _ -> Exn.fail "Url.parse: %S" s
+
+```
+
+```ocaml
+let rex =
+let origins = "csv|pdf|html|xlsv|xml"
+Re2.create_exn (sprintf {|^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z)(?:\.(\d+))?\.(%s)\.(\d+)\.(\d+)$|} origins)
+
+let of_string s =
+try
+let m = Re2.first_match_exn rex s in
+let start = Re2.Match.get_exn ~sub:(`Index 1) m |> U.strptime "%Y-%m-%dT%H:%M:%S%z" |> U.timegm in
+let shard = int_of_string (Re2.Match.get_exn ~sub:(`Index 2) m) in
+let origin = origin_of_string (Re2.Match.get_exn ~sub:(`Index 3) m) in
+let partition = int_of_string (Re2.Match.get_exn ~sub:(`Index 4) m) in
+let worker = int_of_string (Re2.Match.get_exn ~sub:(`Index 5) m) in
+{ start; shard; origin; partition; worker }
+with _ -> invalid_arg (sprintf "error: %s" s)
+
+(* in mikmatch: *)
+
+let%mikmatch origins = {| "csv" | "pdf" | "html" | "xlsv" | "xml" |}
+
+let of_string s =
+match%mikmatch s with
+| {|/ (digit{4} '-' digit{2} '-' digit{2} 'T' digit{2} ':' digit{2} ':' digit{2} 'Z' as timestamp)
+('.' (digit+ as shard : int))?
+'.' (origins as origin := origin_of_string)
+'.' (digit+ as partition : int)
+'.' (digit+ as worker : int) /|} ->
+let start = U.strptime "%Y-%m-%dT%H:%M:%S%z" timestamp |> U.timegm in
+let shard = match shard with Some s -> s | None -> 0 in
+{ start; shard; origin; partition; worker }
+| _ -> invalid_arg (sprintf "error: %s" s)
+
+```
+
 ### `%pcre`

 The patterns are plain strings of the form accepted by `Re.Pcre`, with the following additions:
@@ -58,15 +127,6 @@ The patterns are plain strings of the form accepted by `Re.Pcre`, with the follo
 A variable is allowed for the universal case and is bound to the matched
 string.

-### `%mikmatch`
-
-Full [%mikmatch guide](./MIK.md).
-
-#### Quick Links
-- [Variable capture](./MIK.md#variable-capture)
-- [Type conversion](./MIK.md#type-conversion)
-- [Different extensions](./MIK.md#alternatives)
-
 ### Example

 The following prints out times and hosts for SMTP connections to the Postfix daemon:
@@ -115,100 +175,8 @@ let () = Lwt_main.run begin
 end
 ```

-## `ppx_tyre` - Syntax Support for Tyre Routes
-
-### Typed regular expressions
-
-This PPX compiles
-```ocaml
-[%tyre {|re|}]
-```
-into `'a Tyre.t`.
-
-For instance, We can define a pattern that recognize strings of the form "dim:3x5" like so:
-
-```ocaml
-# open Tyre ;;
-# let dim = [%tyre "dim:(?&int)x(?&int)"] ;;
-val dim : (int * int) Tyre.t
-```
-
-The syntax `(?&id)` allows to call a typed regular expression named `id` of type `'a Tyre.t`, such as `Tyre.int`.
-
-For convenience, you can also use *named* capture groups to name the captured elements.
-```ocaml
-# let dim = [%tyre "dim:(?<x>(?&int))x(?&y:int)"] ;;
-val dim : < x : int; y : int > Tyre.t
-```
-
-Names given using the syntax `(?<foo>re)` will be used for the fields
-of the results. `(?&y:int)` is a shortcut for `(?<y>(?&int))`.
-This can also be used for alternatives, for instance:
-
-```ocaml
-# let id_or_name = [%tyre "id:(?&id:int)|name:(?<name>[[:alnum:]]+)"] ;;
-val id_or_name : [ `id of int | `name of string ] Tyre.t
-```
-
-Expressions of type `Tyre.t` can then be composed as part of bigger regular
-expressions, or compiled with `Tyre.compile`.
-See [tyre][]'s documentation for details.
-
-### Routes
-
-`ppx_tyre` can also be used for routing, in the style of `ppx_regexp`:
-
-```ocaml
-function%tyre
-| {|re1|} -> e1
-...
-| {|reN|} -> eN
-```
-
-is turned into a `'a Type.route`, where `re`, `re1`, ... are regular expressions
-using the same syntax as above. `"re" as v` is considered like `(?<v>re)` and
-`"re1" | "re2"` is turned into a regular expression alternative.
-
-Once routes are defined, matching is done with `Tyre.exec`.
-
-### Details
-
-The syntax follow Perl's syntax:
-
-- `re?` extracts an option of what `re` extracts.
-- `re+`, `re*`, `re{n,m}` extracts a list of what `re` extracts.
-- `(?&qname)` refers to any identifier bound to a typed regular expression
-of type `'a Tyre.t`.
-- Normal parens are *non-capturing*.
-- There are two ways to capture:
-- Anonymous capture `(+re)`
-- Named capture `(?<v>re)`
-- One or more `(?<v>re)` at the top level can be used to bind variables
-instead of `as ...`.
-- One or more `(?<v>re)` in a sequence extracts an object where each method
-`v` is bound to what `re` extracts.
-- An alternative with one `(?<v>re)` per branch extracts a polymorphic
-variant where each constructor `` `v`` receives what `re` extracts as its
-argument.
-- `(?&v:qname)` is a shortcut for `(?<v>(?&qname))`.
-
 ## Limitations

-### (ppx_tyre) No Pattern Guards
-
-Pattern guards are not supported for `ppx_tyre`.
-This is due to the fact that all match cases are combined into a single regular expression, so if one of the
-patterns succeed, the match is committed before we can check the guard
-condition.
-
-`ppx_regexp_extended` gets around this by grouping match cases with the same guards and compiling those together, instead
-of every match case being compiled into one RE.
-> [!WARNING]
-> There is still a limitation with the guards: if two branches have overlapping REs, and the first has a guard that evaluates to false,
-> then the second branch will not be ran. This is because of a limitation with `ocaml-re`'s Marking machine, it only
-> tests until a mark is found, and doesn't search further.
-
-
 ### No Exhaustiveness Check

 The syntax extension will always warn if no catch-all case is provided. No
@@ -223,4 +191,3 @@ file bug reports in the GitHub issue tracker. Any exception raised by
 generated code except for `Match_failure` is a bug.

 [re]: https://github.com/ocaml/ocaml-re
-[tyre]: https://github.com/Drup/tyre

common/mik_lexer.mll

Lines changed: 1 addition & 0 deletions
@@ -86,6 +86,7 @@ rule token = parse
 | '_' { UNDERSCORE }
 | ':' { COLON }
 | '=' { EQUAL }
+| '~' { TILDE }
 | "as" { AS }
 | ">>>" { PIPE }
 | "int" { INT_CONVERTER }
