Skip to content

Commit 3e0202e

Browse files
committed
function conversion, README improved
1 parent ef73879 commit 3e0202e

File tree

6 files changed

+153
-55
lines changed

6 files changed

+153
-55
lines changed

README.md

Lines changed: 96 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -4,48 +4,102 @@
44

55
This repo provides two PPXes providing regular expression-based routing:
66

7-
- `ppx_regexp` maps to [re][] with the conventional last-match extraction
8-
into `string` and `string option`.
7+
- `ppx_regexp_extended` maps to [re][] with the conventional last-match extraction
8+
into `string` and `string option`. Two syntaxes for regular expressions available:
9+
- `pcre`: The syntax of regular PCRE expressions
10+
- `mikmatch`: Mimics the syntax of the [mikmatch](https://mjambon.github.io/mjambon2016/mikmatch-manual.html) tool
911
- `ppx_tyre` maps to [Tyre][tyre] providing typed extraction into options,
1012
lists, tuples, objects, and polymorphic variants.
1113

1214
Another difference is that `ppx_regexp` works directly on strings
1315
essentially hiding the library calls, while `ppx_tyre` provides `Tyre.t` and
1416
`Tyre.route` which can be composed an applied using the Tyre library.
1517

16-
## `ppx_regexp` - Regular Expression Matching with OCaml Patterns
18+
## `ppx_regexp_extended` - Regular Expression Matching with OCaml Patterns
1719

18-
This syntax extension turns
20+
This syntax extension turns:
1921
```ocaml
2022
function%pcre
2123
| {|re1|} -> e1
2224
...
2325
| {|reN|} -> eN
2426
| _ -> e0
2527
```
26-
into suitable invocations of the [Re library][re], and similar for
27-
`match%pcre`. The patterns are plain strings of the form accepted by
28-
`Re_pcre`, with the following additions:
28+
(or `function%mik`) into suitable invocations of the [Re library][re], and similar for `match%pcre`/`match%mik`.
29+
30+
It also accepts:
31+
```ocaml
32+
let%pcre var = {| some regex |}
33+
(* and *)
34+
let%mik var = {| some regex |}
35+
```
36+
37+
### `%pcre`
38+
39+
The patterns are plain strings of the form accepted by `Re.Pcre`, with the following additions:
2940

3041
- `(?<var>...)` defines a group and binds whatever it matches as `var`.
3142
The type of `var` will be `string` if the match is guaranteed given that
3243
the whole pattern matches, and `string option` if the variable is bound
3344
to or nested below an optionally matched group.
3445

46+
- `(N?<var>)` gets substituted by the value of the globally defined string variable named `var`,
47+
and binds whatever it matches as `var`.
48+
The type of `var` will be the same as `(?<var>...)`.
49+
50+
- `(N?<var as name>)` gets substituted by the value of the globally defined string variable named `var`,
51+
and binds whatever it matches as `name`.
52+
The type of `name` will be the same as `(?<var>...)`.
53+
54+
- `(U?<var>)` gets substituted by the value of the globally defined string variable named `var`,
55+
and does not bind its match to any name.
56+
3557
- `?<var>` at the start of a pattern binds group 0 as `var : string`.
3658
This may not be the full string if the pattern is unanchored.
3759

3860
A variable is allowed for the universal case and is bound to the matched
39-
string. A regular alias is currently not allowed for patterns, since it is
40-
not obvious whether is should bind the full string or group 0.
61+
string.
62+
63+
### `%mik`
64+
65+
The syntax that this extension accepts is as follows:
66+
67+
- `char-literal`: Match the given character (priority 0).
68+
- `_` (underscore): Match any character (priority 0).
69+
- `string-literal`: Match the given sequence of characters (priority 0).
70+
- `[set-of-characters]`: Character class, match one of the characters given by set-of-characters (priority 0). The grammar for set-of-characters is the following:
71+
- `char-literal``char-literal`: defines a range of characters according to the iso-8859-1 encoding (includes ASCII).
72+
- `char-literal`: defines a singleton (a set containing just this character).
73+
- `string-literal`: defines a set that contains all the characters present in the given string.
74+
- `lowercase-identifier`: is replaced by the corresponding predefined regular expression; this regular expression must be exactly of length 1 and therefore represents a set of characters.
75+
- `set-of-characters`: set-of-characters defines the union of two sets of characters.
76+
- `[^set-of-characters]`: Negative character class
77+
- `regexp *`: Match the pattern given by regexp 0 time or more (priority 0).
78+
- `regexp +`: Match the pattern given by regexp 1 time or more (priority 0).
79+
- `regexp ?`: Match the pattern given by regexp at most once (priority 0).
80+
- `regexp{m−n}`: Match regexp at least `m` times and up to `n` times. `m` and `n` must be integer literals (priority 0).
81+
- `regexp{n}`: Same as regexp{n−n} (priority 0).
82+
- `( regexp )`: Match regexp (priority 0).
83+
- `regexp regexp`: Match the first regular expressions and then the second one (priority 1).
84+
- `regexp | regexp`: Match one of these two regular expressions (priority 2).
85+
- `regexp as lowercase-identifier`: Give a name to the substring that will be matched by the given pattern. This string becomes available under this name (priority 3).
86+
In-place conversions of the matched substring can be performed using one these three mechanisms:
87+
- `regexp as lowercase-identifier : int`: `int` behaves as `int_of_string`
88+
- `regexp as lowercase-identifier : float`: `float` behaves as `float_of_string`
89+
- `regexp as lowercase-identifier := converter`: where `converter` is any function which converts a string into something else.
90+
91+
In addition, the following predefined character classes are available:
92+
- **POSIX character classes:** `lower`, `upper`, `alpha`, `digit`, `alnum`, `punct`, `graph`, `print`, `blank`, `space`, `cntrl`, `xdigit`, `word`
93+
- **Control sequences:** `eos` (same as `$`), `eol` (end of string or newline), `bnd` (word boundary `\b`), `bos` (same as `^`), `any` (any character except newline)
4194

4295
### Example
4396

44-
The following prints out times and hosts for SMTP connections to the Postfix
45-
daemon:
97+
The following prints out times and hosts for SMTP connections to the Postfix daemon:
98+
99+
#### `%pcre`
46100
```ocaml
47101
(* Link with re, re.pcre, lwt, lwt.unix.
48-
Preprocess with ppx_regexp.
102+
Preprocess with ppx_regexp_extended.
49103
Adjust to your OS. *)
50104
51105
open Lwt.Infix
@@ -63,6 +117,29 @@ let () = Lwt_main.run begin
63117
end
64118
```
65119

120+
#### `%mik`
121+
```ocaml
122+
(* Link with re, re.pcre, lwt, lwt.unix.
123+
Preprocess with ppx_regexp_extended.
124+
Adjust to your OS. *)
125+
126+
open Lwt.Infix
127+
128+
let%mik host = {| [a-z0-9.-]+ |}
129+
130+
let check_line =
131+
(function%mik
132+
| {|/ (any* ':' digit digit as t) ' ' (any*) ' ' "postfix/smtpd" '[' digit+ ']' ": connect from " (host) /|} ->
133+
Lwt_io.printlf "%s %s" t host
134+
| _ ->
135+
Lwt.return_unit)
136+
137+
let () = Lwt_main.run begin
138+
Lwt_io.printl "SMTP connections from:" >>= fun () ->
139+
Lwt_stream.iter_s check_line (Lwt_io.lines_of_file "/var/log/syslog")
140+
end
141+
```
142+
66143
## `ppx_tyre` - Syntax Support for Tyre Routes
67144

68145
### Typed regular expressions
@@ -142,13 +219,17 @@ The syntax follow Perl's syntax:
142219

143220
## Limitations
144221

145-
### No Pattern Guards
222+
### (ppx_tyre) No Pattern Guards
146223

147-
Pattern guards are not supported. This is due to the fact that all match
148-
cases are combined into a single regular expression, so if one of the
224+
Pattern guards are not supported for `ppx_tyre`.
225+
This is due to the fact that all match cases are combined into a single regular expression, so if one of the
149226
patterns succeed, the match is committed before we can check the guard
150227
condition.
151228

229+
`ppx_regexp_extended` gets around this by grouping match cases with the same guards and compiling those together, instead
230+
of every match case being compiled into one RE.
231+
232+
152233
### No Exhaustiveness Check
153234

154235
The syntax extension will always warn if no catch-all case is provided. No
@@ -162,8 +243,5 @@ The processor is currently new and not well tested. Please break it and
162243
file bug reports in the GitHub issue tracker. Any exception raised by
163244
generated code except for `Match_failure` is a bug.
164245

165-
166-
[ci]: https://travis-ci.org/paurkedal/ppx_regexp
167-
[ci-build-status]: https://travis-ci.org/paurkedal/ppx_regexp.svg?branch=master
168246
[re]: https://github.com/ocaml/ocaml-re
169247
[tyre]: https://github.com/Drup/tyre

common/mik_lexer.mll

Lines changed: 22 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -6,23 +6,24 @@ let new_line lexbuf =
66
Lexing.new_line lexbuf
77

88
let predefined_classes = [
9-
("lower", {|[a-z]|});
10-
("upper", {|[A-Z]|});
11-
("alpha", {|[a-zA-Z]|});
12-
("digit", {|[0-9]|});
13-
("alnum", {|[a-zA-Z0-9]|});
14-
("punct", {|[!-/:-@\\[-`{-~]|});
15-
(* ("graph", {|[!-~]|}); *)
16-
(* ("print", {|[!-~]|}); *)
17-
("blank", {|[ \t]|});
9+
("lower", {|[[:lower:]]|});
10+
("upper", {|[[:upper:]]|});
11+
("alpha", {|[[:alpha:]]|});
12+
("digit", {|[[:digit:]]|});
13+
("alnum", {|[[:alnum:]]|});
14+
("punct", {|[[:punct:]]|});
15+
("graph", {|[[:graph:]]|});
16+
("print", {|[[:print:]]|});
17+
("blank", {|[[:blank:]]|});
1818
("cntrl", {|[[:cntrl:]]|});
19-
("xdigit", {|[0-9A-Fa-f]|});
19+
("xdigit", {|[[:xdigit:]]|});
2020
("space", {|[[:space:]]|});
21-
("word", {|[a-zA-Z0-9_]|});
21+
("word", {|[[:word:]]|});
2222
("eos", {|$|});
2323
("eol", {|$|[\n]|});
24+
("bnd", {|\b|});
2425
("bos", {|^|});
25-
("any", {|.|});
26+
("any", {|[^\n]|});
2627
]
2728

2829
let escape_char = function
@@ -77,11 +78,12 @@ rule token = parse
7778
| '?' { QUESTION }
7879
| '_' { UNDERSCORE }
7980
| ':' { COLON }
81+
| '=' { EQUAL }
8082
| "as" { AS }
8183
| "int" { INT_CONVERTER }
8284
| "float" { FLOAT_CONVERTER }
8385
| digit+ as n { INT (int_of_string n) }
84-
| ident as id {
86+
| ident as id {
8587
match List.assoc_opt id predefined_classes with
8688
| Some pcre_class -> PREDEFINED_CLASS pcre_class
8789
| None -> IDENT id
@@ -99,33 +101,33 @@ and char_literal buf = parse
99101
| '\'' { CHAR_LITERAL (Buffer.contents buf) }
100102
| ('(' as c) | (')' as c) {
101103
Buffer.add_string buf (escape_special c);
102-
char_literal buf lexbuf
104+
char_literal buf lexbuf
103105
}
104106
| ('{' as c) | ('}' as c) {
105107
Buffer.add_string buf (escape_special c);
106-
char_literal buf lexbuf
108+
char_literal buf lexbuf
107109
}
108110
| ('[' as c) | (']' as c) {
109111
Buffer.add_string buf (escape_special c);
110-
char_literal buf lexbuf
112+
char_literal buf lexbuf
111113
}
112114
| ('.' as c) | ('*' as c) | ('+' as c) | ('?' as c) | ('^' as c) | ('$' as c) | ('|' as c) {
113115
Buffer.add_string buf (escape_special c);
114-
char_literal buf lexbuf
116+
char_literal buf lexbuf
115117
}
116-
| _ as c {
118+
| _ as c {
117119
Buffer.add_char buf c;
118120
char_literal buf lexbuf
119121
}
120122
| eof { raise (Error "Unterminated character literal") }
121123

122124
and string_literal buf = parse
123-
| '\\' (_ as c) {
125+
| '\\' (_ as c) {
124126
Buffer.add_char buf (escape_char c);
125127
string_literal buf lexbuf
126128
}
127129
| '"' { STRING_LITERAL (Buffer.contents buf) }
128-
| _ as c {
130+
| _ as c {
129131
Buffer.add_char buf c;
130132
string_literal buf lexbuf
131133
}

common/mik_parser.mly

Lines changed: 26 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -43,7 +43,7 @@ let unclosed_error what startpos endpos =
4343
%token <string> CHAR_LITERAL STRING_LITERAL IDENT PREDEFINED_CLASS
4444
%token <int> INT
4545
%token SLASH LPAREN RPAREN LBRACKET RBRACKET CARET LBRACE RBRACE
46-
%token DASH BAR STAR PLUS QUESTION UNDERSCORE COLON AS
46+
%token DASH BAR STAR PLUS QUESTION UNDERSCORE COLON EQUAL AS
4747
%token INT_CONVERTER FLOAT_CONVERTER EOF
4848

4949
%start <string t> main_match_case
@@ -160,13 +160,13 @@ basic_atom:
160160
| LPAREN RPAREN { missing_error "pattern inside parentheses" $startpos $endpos }
161161
| LPAREN pattern EOF? { unclosed_error "parentheses (missing ')')" $startpos($1) $endpos($2) }
162162

163-
| LPAREN IDENT RPAREN {
163+
| LPAREN IDENT RPAREN {
164164
let ident_loc = wrap_loc $startpos($2) $endpos($2) $2 in
165165
let pattern_node = to_pcre_regex $2 $startpos($2) $endpos($2) in
166166
wrap_loc $startpos $endpos (Named_subs (ident_loc, None, None, pattern_node))
167167
}
168168
| LPAREN IDENT AS RPAREN { missing_error "name after 'as'" $startpos($3) $endpos($4) }
169-
| LPAREN IDENT AS name = IDENT RPAREN {
169+
| LPAREN IDENT AS name = IDENT RPAREN {
170170
let ident_loc = wrap_loc $startpos($2) $endpos($2) $2 in
171171
let name_loc = wrap_loc $startpos(name) $endpos(name) name in
172172
let pattern_node = to_pcre_regex $2 $startpos($2) $endpos($2) in
@@ -175,20 +175,26 @@ basic_atom:
175175
| LPAREN IDENT AS IDENT COLON RPAREN {
176176
missing_error "type converter after ':'" $startpos($5) $endpos($6)
177177
}
178-
| LPAREN IDENT AS name = IDENT COLON INT_CONVERTER RPAREN {
178+
| LPAREN IDENT AS name = IDENT COLON INT_CONVERTER RPAREN {
179179
let ident_loc = wrap_loc $startpos($2) $endpos($2) $2 in
180180
let name_loc = wrap_loc $startpos(name) $endpos(name) name in
181181
let pattern_node = to_pcre_regex $2 $startpos($2) $endpos($2) in
182182
wrap_loc $startpos $endpos (Named_subs (ident_loc, Some name_loc, Some Int, pattern_node))
183183
}
184-
| LPAREN IDENT AS name = IDENT COLON FLOAT_CONVERTER RPAREN {
184+
| LPAREN IDENT AS name = IDENT COLON FLOAT_CONVERTER RPAREN {
185185
let ident_loc = wrap_loc $startpos($2) $endpos($2) $2 in
186186
let name_loc = wrap_loc $startpos(name) $endpos(name) name in
187187
let pattern_node = to_pcre_regex $2 $startpos($2) $endpos($2) in
188188
wrap_loc $startpos $endpos (Named_subs (ident_loc, Some name_loc, Some Float, pattern_node))
189189
}
190+
| LPAREN IDENT AS name = IDENT COLON EQUAL func = IDENT RPAREN {
191+
let ident_loc = wrap_loc $startpos($2) $endpos($2) $2 in
192+
let name_loc = wrap_loc $startpos(name) $endpos(name) name in
193+
let pattern_node = to_pcre_regex $2 $startpos($2) $endpos($2) in
194+
wrap_loc $startpos $endpos (Named_subs (ident_loc, Some name_loc, Some (Func func), pattern_node))
195+
}
190196
| LPAREN IDENT AS IDENT EOF? {
191-
unclosed_error "parentheses (missing ')')" $startpos($1) $endpos($4)
197+
unclosed_error "parentheses (missing ')')" $startpos($1) $endpos($4)
192198
}
193199
| LPAREN IDENT AS IDENT COLON INT_CONVERTER EOF? {
194200
unclosed_error "parentheses (missing ')')" $startpos($1) $endpos($6)
@@ -198,29 +204,36 @@ basic_atom:
198204
}
199205

200206
| LPAREN pattern AS RPAREN { missing_error "capture name after 'as'" $startpos($3) $endpos($4) }
201-
| LPAREN pattern AS name = IDENT RPAREN {
207+
| LPAREN pattern AS name = IDENT RPAREN {
202208
let name_loc = wrap_loc $startpos(name) $endpos(name) name in
203209
wrap_loc $startpos $endpos (Capture_as (name_loc, None, $2))
204210
}
205-
| LPAREN pattern AS IDENT COLON RPAREN {
206-
missing_error "type converter after ':'" $startpos($5) $endpos($6)
211+
| LPAREN pattern AS IDENT COLON RPAREN {
212+
missing_error "type converter after ':'" $startpos($5) $endpos($6)
207213
}
208-
| LPAREN pattern AS name = IDENT COLON INT_CONVERTER RPAREN {
214+
| LPAREN pattern AS IDENT COLON EQUAL RPAREN {
215+
missing_error "function name after ':='" $startpos($5) $endpos($6)
216+
}
217+
| LPAREN pattern AS name = IDENT COLON INT_CONVERTER RPAREN {
209218
let name_loc = wrap_loc $startpos(name) $endpos(name) name in
210219
wrap_loc $startpos $endpos (Capture_as (name_loc, Some Int, $2))
211220
}
212221
| LPAREN pattern AS name = IDENT COLON FLOAT_CONVERTER RPAREN {
213222
let name_loc = wrap_loc $startpos(name) $endpos(name) name in
214223
wrap_loc $startpos $endpos (Capture_as (name_loc, Some Float, $2))
215224
}
225+
| LPAREN pattern AS name = IDENT COLON EQUAL func = IDENT RPAREN {
226+
let name_loc = wrap_loc $startpos(name) $endpos(name) name in
227+
wrap_loc $startpos $endpos (Capture_as (name_loc, Some (Func func), $2))
228+
}
216229
| LPAREN pattern AS IDENT EOF? {
217-
unclosed_error "parentheses (missing ')')" $startpos($1) $endpos($4)
230+
unclosed_error "parentheses (missing ')')" $startpos($1) $endpos($4)
218231
}
219232
| LPAREN pattern AS IDENT COLON INT_CONVERTER EOF? {
220-
unclosed_error "parentheses (missing ')')" $startpos($1) $endpos($6)
233+
unclosed_error "parentheses (missing ')')" $startpos($1) $endpos($6)
221234
}
222235
| LPAREN pattern AS IDENT COLON FLOAT_CONVERTER EOF? {
223-
unclosed_error "parentheses (missing ')')" $startpos($1) $endpos($6)
236+
unclosed_error "parentheses (missing ')')" $startpos($1) $endpos($6)
224237
}
225238

226239
| LPAREN error { syntax_error "Invalid expression in parentheses" $startpos($2) $endpos }

common/regexp_types.ml

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -20,5 +20,6 @@ and 'a node =
2020
(* TODO: | Case_sense of t | Case_blind of t *)
2121

2222
and conv_ty =
23-
| Int : conv_ty
24-
| Float : conv_ty
23+
| Int
24+
| Float
25+
| Func of string

common/regexp_types.mli

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -20,5 +20,6 @@ and 'a node =
2020
(* TODO: | Case_sense of t | Case_blind of t *)
2121

2222
and conv_ty =
23-
| Int : conv_ty
24-
| Float : conv_ty
23+
| Int
24+
| Float
25+
| Func of string

ppx_regexp/ppx_regexp.ml

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -182,6 +182,9 @@ let rec wrap_group_bindings ~loc rhs offG = function
182182
| None -> eG
183183
| Some Regexp_types.Int -> [%expr int_of_string [%e eG]]
184184
| Some Regexp_types.Float -> [%expr float_of_string [%e eG]]
185+
| Some (Regexp_types.Func func_name) ->
186+
let func_ident = pexp_ident ~loc { txt = Lident func_name; loc } in
187+
[%expr [%e func_ident] [%e eG]]
185188
in
186189
let eG = if mustG then eG else [%expr try Some [%e eG] with Not_found -> None] in
187190
[%expr

0 commit comments

Comments
 (0)