* Update API:
  * Switch internal terminology from `Token` to `Tokenizer`.
  * Change the external API from `matches(:)` back to the original `tokens(:)`.
  * Add a convenience method `components(:)` that returns just substrings rather than an array of `Token`.
* Update file name.
* Update README to show the `Token` type.
* Include `import Mustard` in code snippet.
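
The renamed entry points might be exercised like this (a sketch based on the bullets above; the argument label `matchedWith:` for `components(:)` is an assumption, not confirmed by this changelog):

````Swift
import Mustard

// `tokens(matchedWith:)` returns full `Token` tuples...
let tokens = "123Hello world&^45.67".tokens(matchedWith: .decimalDigits, .letters)

// ...while the convenience `components(matchedWith:)` returns just the
// matched substrings. (Argument label assumed, not confirmed.)
let substrings = "123Hello world&^45.67".components(matchedWith: .decimalDigits, .letters)
````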
**Documentation/Expressive matching.md** (+39 −43)
# Example: expressive matching

The results returned by `tokens(matchedWith:)` are an array of `Token`, which in turn is a tuple with the signature `(tokenizer: TokenizerType, text: String, range: Range<String.Index>)`.

To make use of the `tokenizer` element, you need to use either type casting (using `as?`) or type checking (using `is`).
Maybe we want to filter out only tokens that were matched with a number tokenizer:

````Swift
import Mustard

let tokens = "123Hello world&^45.67".tokens(matchedWith: .decimalDigits, .letters)
// tokens.count -> 5

let numberTokens = tokens.filter({ $0.tokenizer is NumberTokenizer })
// numberTokens.count -> 0
````
This can lead to bugs in your logic; in the example above `numberTokens` will be empty because the tokenizers used were `CharacterSet.decimalDigits` and `CharacterSet.letters`, so the filter won't match any of the tokens.

This may seem like an obvious error, but it's the type of unexpected bug that can slip in when we're using loosely typed results.

Thankfully, Mustard can return a strongly typed set of matches if a single `TokenizerType` is used:
````Swift
import Mustard

// call the `tokens()` method on `String` to get matching tokens from the string
let numberTokens: [NumberTokenizer.Token] = "123Hello world&^45.67".tokens()
// numberTokens.count -> 2
````
Used in this way this isn't very useful, but it does allow multiple tokenizers to be bundled together as a single `TokenizerType` implemented with an `enum`.

An enum tokenizer can either manage its own internal state, or potentially act as a lightweight wrapper to other existing tokenizers.
Here's an example `TokenizerType` that acts as a wrapper for word, number, and emoji tokenizers:

````Swift
enum MixedTokenizer: TokenizerType {

    case word
    case number
    case emoji
    case none // 'none' case not strictly needed, and
              // in this implementation will never be matched

    init() {
        self = .none
    }

    static let wordTokenizer = WordTokenizer()
    static let numberTokenizer = NumberTokenizer()
    static let emojiTokenizer = EmojiTokenizer()

    func tokenCanTake(_ scalar: UnicodeScalar) -> Bool {
        switch self {
        case .word: return MixedTokenizer.wordTokenizer.tokenCanTake(scalar)
        case .number: return MixedTokenizer.numberTokenizer.tokenCanTake(scalar)
        case .emoji: return MixedTokenizer.emojiTokenizer.tokenCanTake(scalar)
        case .none: return false
        }
    }
}
````
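
A sketch of how the combined tokenizer could then be used, following the strongly typed `tokens()` pattern shown earlier (this assumes `WordTokenizer`, `NumberTokenizer`, and `EmojiTokenizer` are defined elsewhere; output not verified):

````Swift
import Mustard

// The `tokenizer` element of each token is a MixedTokenizer case,
// so it can be switched over directly with no casting.
let tokens: [MixedTokenizer.Token] = "123👨‍👨‍👧‍👦Hello".tokens()

for token in tokens {
    switch token.tokenizer {
    case .word: print("word:", token.text)
    case .number: print("number:", token.text)
    case .emoji: print("emoji:", token.text)
    case .none: break // never matched in this implementation
    }
}
````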
Tokenizers are greedy. The order that tokenizers are passed into `matches(from: TokenizerType...)` will affect how substrings are matched.

Here's an example using the `CharacterSet.decimalDigits` tokenizer and the custom tokenizer `DateTokenizer` that matches dates in the format `MM/dd/yy` ([see example](Tokens with internal state.md) for implementation).
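
A sketch of the kind of ordering effect this describes (assuming `DateTokenizer` conforms to `TokenizerType` and so provides the required `init()`; the exact matches are illustrative, not verified):

````Swift
import Mustard

let input = "12/01/27 35x"

// With the digits tokenizer listed first, it greedily consumes runs of
// digits ('12', '01', '27', '35') before DateTokenizer gets a chance.
let digitsFirst = input.tokens(matchedWith: CharacterSet.decimalDigits, DateTokenizer())

// With DateTokenizer listed first, '12/01/27' is matched as a single date,
// and only the remaining '35' falls through to the digits tokenizer.
let dateFirst = input.tokens(matchedWith: DateTokenizer(), CharacterSet.decimalDigits)
````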
**Documentation/Matching emoji.md** (+6 −6)
As an example, the character '👶🏿' is composed of two scalars: '👶' and '🏿'.

The rainbow flag character '🏳️‍🌈' is again composed of two adjacent scalars: '🏳' and '🌈'.

A final example: the character '👨‍👨‍👧‍👦' is actually 7 scalars: '👨' '👨' '👧' '👦' joined by three ZWJs (zero-width joiners).
To create a `TokenizerType` that matches emoji we can instead check to see if a scalar falls within a known range, or if it's a ZWJ.

This isn't the most *accurate* emoji tokenizer, because it would potentially match an emoji scalar followed by 100 zero-width joiners, but for basic use it might be enough.
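
The scalar counts described above can be checked in plain Swift, without Mustard; a minimal standalone check:

````Swift
// Standalone sketch: count scalars and spot ZWJs, no Mustard required.
let family = "👨‍👨‍👧‍👦"
let scalars = Array(family.unicodeScalars)
print(scalars.count) // 7 scalars: four people joined by three ZWJs

let zwj: UnicodeScalar = "\u{200D}" // the zero-width joiner scalar
let zwjCount = scalars.filter { $0 == zwj }.count
print(zwjCount) // 3
````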