Commit c13460a

Fix grammar, style, and formatting in Approximate Matching doc (#1240)
Signed-off-by: Dave <[email protected]> Co-authored-by: RD WebDesign <[email protected]>
1 parent 05d6537 commit c13460a

1 file changed

+31
-28
lines changed


docs/regex/approximate.md

Lines changed: 31 additions & 28 deletions
Original file line number | Diff line number | Diff line change
@@ -1,14 +1,14 @@
1-
# Approximative matching
1+
# Approximate matching
22

3-
You may or not know `agrep`. It is basically a "forgiving" `grep` and is, for instance, used for searching through (offline) dictionaries. It is tolerant against errors (up to degree you specify). It may be beneficial is you want to match against domains where you don't really know the pattern. It is just an idea, we will have to see if it is actually useful.
3+
You may or may not know `agrep`, it is basically a "forgiving" `grep` and is, for instance, used for searching through (offline) dictionaries. It is tolerant against errors up to a degree you specify. It may be beneficial if you want to match against domains where you don't really know the pattern. It is just an idea, we will have to see if it is actually useful.
44

5-
This is a somewhat complicated topic, we'll approach it by examples as it is very complicated to get the head around it by just listening to the specifications.
5+
This is a somewhat complicated topic, we'll approach it by examples as it is very complicated to get your head around it by just listening to the specifications.
66

77
The approximate matching settings for a subpattern can be changed by appending *approx-settings* to the subpattern. Limits for the number of errors can be set and an expression for specifying and limiting the costs can be given:
88

99
## Accepted **insertions** (`+`)
1010

11-
Use `(something){+x}` to specify that the regex should still be matching when `x` characters would need it be *inserted* into the sub-expression `something`.
11+
Use `(something){+x}` to specify that the regex should still match when `x` characters would need to be *inserted* into the sub-expression `something`.
1212

1313
Example:
1414

@@ -24,7 +24,7 @@ The missing characters in the domain are substituted. The maximum number of inse
2424

2525
## Accepted **deletions** (`-`)
2626

27-
Use `(something){-x}` to specify that the regex should still be matching when `x` characters would need it be *deleted* from the sub-expression `something`:
27+
Use `(something){-x}` to specify that the regex should still match when `x` characters would need to be *deleted* from the sub-expression `something`:
2828

2929
Example:
3030

@@ -35,60 +35,63 @@ The surplus `e` in `neet` is deleted.
3535
Similarly:
3636

3737
- `doubleclick.net` is matched by `^(doubleclicky\.netty){-3}$`
38-
- `doubleclick.net` is NOT matched by `^(doubleclicky\.nettfy){-3}$`
38+
- `doubleclick.net` is **not** matched by `^(doubleclicky\.nettfy){-3}$`
3939

4040
## Accepted **substitutions** (`#`)
4141

42-
Use `(something){#x}` to specify that the regex should still be matching when `x` characters would need to be *substituted* from the sub-expression `something`:
42+
Use `(something){#x}` to specify that the regex should still match when `x` characters would need to be *substituted* in the sub-expression `something`:
4343

4444
Example 1:
4545

46-
- `oobargoobaploowap` is matched by `(foobar){#2~2}`
47-
Hint: `goobap` is `foobar` with two substitutions `f->g` and `r->p`
46+
- `oobargoobaploowap` is matched by `(foobar){#2}`
47+
- Hint: `goobap` is `foobar` with `f` substituted for `g` and `r` substituted for `p`
4848

4949
Example 2:
5050

5151
- `doubleclick.net` is matched by `^doubleclick\.n(tt){#1}$`
5252

53-
The incorrect `t` in `ntt` is substituted. Note that substitutions are necessary when a character needs to be replaced as the corresponding realization with one insertion and one deletion is **not identical**:
53+
The incorrect `t` in `ntt` is substituted. Note that substitutions are necessary when a character needs to be replaced as the following example (with 1 insertion and 1 deletion) is **not identical**:
5454

55-
`doubleclick.net` is matched by `^doubleclick\.n(tt){+1-1}$`
55+
- `doubleclick.net` is matched by `^doubleclick\.n(tt){+1-1}$`
5656

5757
(`t` is removed, `e` is added), however
5858

59-
- `doubleclick.nt` is ALSO matched by `^doubleclick\.n(tt){+1-1}$`
59+
- `doubleclick.nt` is **also** matched by `^doubleclick\.n(tt){+1-1}$`
6060

61-
(the `t` is just removed, nothing had to be added) but
61+
(the `t` is removed, but nothing has to be added) but
6262

63-
- `doubleclick.nt` is NOT matched by `^doubleclick\.n(tt){#1}$`
63+
- `doubleclick.nt` is **not** matched by `^doubleclick\.n(tt){#1}$`
6464

65-
doesn't match as substitutions always require characters to be swapped by others.
65+
doesn't match as substitutions always require characters to be replaced by others.
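
The difference between `{#1}` and `{+1-1}` described above can be checked with a small Python sketch. It is not how the regex engine works internally; it only compares a literal sub-expression with a candidate string under per-type error limits. The function name `approx_equal` and its parameters are hypothetical, and the sketch handles plain literals only (an escaped `\.` would be written as `.`).

```python
from functools import lru_cache


def approx_equal(literal, text, max_ins=0, max_del=0, max_sub=0):
    """Return True if `literal` can be turned into `text` within the limits.

    Insertions add a character to `literal`, deletions drop one from it,
    and substitutions replace one character with a different one.
    """

    @lru_cache(maxsize=None)
    def walk(p, t, ins, dele, sub):
        # p and t are the current positions in `literal` and `text`.
        if p == len(literal) and t == len(text):
            return True
        in_literal = p < len(literal)
        in_text = t < len(text)
        if in_literal and in_text:
            if literal[p] == text[t]:
                # Characters agree: consume both at no cost.
                if walk(p + 1, t + 1, ins, dele, sub):
                    return True
            elif sub < max_sub:
                # Substitution: consume both and spend one substitution.
                if walk(p + 1, t + 1, ins, dele, sub + 1):
                    return True
        # Insertion: `text` has a character that `literal` lacks.
        if in_text and ins < max_ins:
            if walk(p, t + 1, ins + 1, dele, sub):
                return True
        # Deletion: `literal` has a surplus character.
        if in_literal and dele < max_del:
            if walk(p + 1, t, ins, dele + 1, sub):
                return True
        return False

    return walk(0, 0, 0, 0, 0)


print(approx_equal("tt", "et", max_sub=1))            # (tt){#1} against "et": True
print(approx_equal("tt", "t", max_sub=1))             # (tt){#1} against "t": False
print(approx_equal("tt", "t", max_ins=1, max_del=1))  # (tt){+1-1} against "t": True
```

The second call fails because a substitution always consumes one character on both sides, so it can never change the length of the sub-expression.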
6666

6767
## Combinations and total error limit (`~`)
6868

69-
All rules from above can be combined like as `{+2-5#6}` allowing (up to!) two insertions, five deletions, and six substitutions. You can enforce an upper limit on the number of tried realizations using the tilde. Even when `{+2-5#6}` can lead to up to 13 operations being tried, this can be limited to (at most) seven tries using `{+2-5#6~7}`.
69+
All rules from above can be combined, for example `{+2-5#6}` allows up to 2 insertions, 5 deletions, and 6 substitutions. You can enforce an upper limit on the number of attempted operations using `~x`, for example even though `{+2-5#6}` can lead to up to 13 operations being tried, this can be limited to at most 7 operations using `{+2-5#6~7}`.
7070

7171
Example:
7272

7373
- `oobargoobploowap` is matched by `(foobar){+2#2~3}`
74+
- Hint: `goobaap` is `foobar` with
75+
- 2 substitutions (`f` to `g` and `r` to `p`)
76+
- 1 addition (`a` in `bar` to make `baap`)
7477

75-
Hint: `goobaap` is `foobar` with
76-
- two substitutions `f->g` and `r->p`, and
77-
- one addition `a` between `bar` (to have `baap`)
78-
79-
Specifying `~2` instead of `~3` will lead to no match as three errors need to be corrected in total for a match in this example.
78+
Specifying `~2` instead of `~3` will not match as there are 3 errors which need to be corrected in this example.
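
How the per-type limits and `~` interact can be sketched in Python. The helpers `parse_limits` and `max_operations` are hypothetical and only read the `+`, `-`, `#`, and `~` counts described above. They are not part of any regex library, and treating an omitted per-type count as 0 is an assumption of this sketch, not documented engine behavior.

```python
import re


def parse_limits(settings):
    """Read the counts from a settings string such as '{+2-5#6~7}'."""
    names = {"+": "insertions", "-": "deletions", "#": "substitutions", "~": "total"}
    limits = {name: None for name in names.values()}
    for symbol, count in re.findall(r"([+\-#~])(\d+)", settings):
        limits[names[symbol]] = int(count)
    return limits


def max_operations(settings):
    """Largest number of corrections the settings allow for one match."""
    limits = parse_limits(settings)
    # Assumption of this sketch: an omitted per-type count is treated as 0.
    per_type = sum(
        limits[name] or 0 for name in ("insertions", "deletions", "substitutions")
    )
    if limits["total"] is None:
        return per_type
    return min(per_type, limits["total"])


print(max_operations("{+2-5#6}"))    # 13
print(max_operations("{+2-5#6~7}"))  # 7
print(max_operations("{+2#2~3}"))    # 3
```

With `{+2#2~2}` the cap drops to 2, which is below the 3 corrections needed in the example above.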
8079

8180
## Advanced topic: Cost-equation
8281

83-
You can even weight the "costs" of insertions, deletions or substitutions. This is really an advanced topic and should only be touched when really needed.
82+
You can even weight the "costs" of insertions, deletions or substitutions. This is an advanced topic and should only be touched when really needed.
83+
84+
A *cost-equation* can be thought of as a mathematical equation where `i`, `d`, and `s` stand for the number of insertions, deletions, and substitutions respectively. The equation can have a multiplier for each of `i`, `d`, and `s`.
85+
The multiplier is the **cost of the error**, and the number after `<` is the maximum allowed total cost of a match. Spaces and pluses can be inserted to make the equation more readable. When specifying only a cost equation, adding a space after the opening `{` is **required**.
8486

85-
A *cost-equation* can be thought of as a mathematical equation, where `i`, `d`, and `s` stand for the number of insertions, deletions, and substitutions, respectively. The equation can have a multiplier for each of `i`, `d`, and `s`.
86-
The multiplier is the **cost of the error**, and the number after `<` is the maximum allowed total cost of a match. Spaces and pluses can be inserted to make the equation more readable. When specifying only a cost equation, adding a space after the opening `{` is **required** .
87+
Example 1:
8788

88-
Example 1: `{ 2i + 1d + 2s < 5 }`
89+
- `{ 2i + 1d + 2s < 5 }`
8990

90-
This sets the cost of an insertion to two, a deletion to one, a substitution to two, and the maximum cost to five.
91+
This sets the cost of an insertion to 2, a deletion to 1, a substitution to 2, and the maximum cost to 5.
92+
93+
Example 2:
9194

92-
Example 2: `{+2-5#6, 2i + 1d + 2s < 5 }`
95+
- `{ +2-5#6, 2i + 1d + 2s < 5 }`
9396

94-
This sets the cost of an insertion to two, a deletion to one, a substitution to two, and the maximum cost to five. Furthermore, it allows only up to 2 insertions (coming at a total cost of 4), five deletions and up to 6 substitutions. As six substitutions would come at a cost of `6*2 = 12`, exceeding the total allowed costs of 5, they cannot all be realized.
97+
This sets the cost of an insertion to 2, a deletion to 1, a substitution to 2, and the maximum cost to 5. Furthermore, it allows only up to 2 insertions (for a total cost of 4), up to 5 deletions, and up to 6 substitutions. As 6 substitutions would come at a cost of `6*2 = 12`, exceeding the total allowed costs of 5, they cannot all be performed.
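
The arithmetic behind Example 2 can be written out as a short Python sketch. The function `is_allowed` and its parameter names are hypothetical, and reading `<` as a strict comparison is an assumption of this sketch, not a statement about the engine.

```python
def is_allowed(insertions, deletions, substitutions,
               limits=(2, 5, 6), costs=(2, 1, 2), max_cost=5):
    """Check error counts against `{ +2-5#6, 2i + 1d + 2s < 5 }`.

    `limits` holds the per-type maxima (+, -, #) and `costs` the
    multipliers of i, d, and s in the cost equation.
    """
    counts = (insertions, deletions, substitutions)
    # Every count must stay within its own per-type limit.
    within_limits = all(count <= limit for count, limit in zip(counts, limits))
    # Weighted sum of the errors, as in 2i + 1d + 2s.
    total = sum(count * cost for count, cost in zip(counts, costs))
    # Assumption: `<` is read as strictly less than the maximum cost.
    return within_limits and total < max_cost


print(is_allowed(2, 0, 0))  # cost 2*2 = 4: True
print(is_allowed(0, 0, 6))  # cost 6*2 = 12: False
print(is_allowed(1, 0, 1))  # cost 2 + 2 = 4: True
```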
