Commit 6cbc2d8

Change from token type to tokenizer type (#2)

* Update API: switch internal terminology from `Token` to `Tokenizer`; change the external API from `matches(:)` back to the original `tokens(:)`; add a convenience method `components(:)` that returns just substrings rather than an array of `Token`.
* Update file name.
* Update README to show `Token` type.
* Include `import Mustard` in code snippet.

1 parent 56bfad7 · commit 6cbc2d8

19 files changed: +614 -567 lines changed

Documentation/Expressive matching.md

Lines changed: 39 additions & 43 deletions

`````diff
@@ -1,83 +1,79 @@
 # Example: expressive matching
 
-The results returned by `matches(from:)` are an array of tuples with the signature `(tokenizer: TokenType, text: String, range: Range<String.Index>)`
+The results returned by `tokens(matchedWith:)` are an array of `Token`, which in turn is a tuple with the signature `(tokenizer: TokenizerType, text: String, range: Range<String.Index>)`
 
 To make use of the `tokenizer` element, you need to use either type casting (using `as?`) or type checking (using `is`).
 
-Maybe we want to filter out only tokens that are numbers:
+Maybe we want to filter out only tokens that were matched with a number tokenizer:
 
 ````Swift
 import Mustard
 
-let messy = "123Hello world&^45.67"
-let matches = messy.matches(from: .decimalDigits, .letters)
-// matches.count -> 5
+let tokens = "123Hello world&^45.67".tokens(matchedWith: .decimalDigits, .letters)
+// tokens.count -> 5
 
-let numbers = matches.filter({ $0.tokenizer is NumberToken })
-// numbers.count -> 0
+let numberTokens = tokens.filter({ $0.tokenizer is NumberTokenizer })
+// numberTokens.count -> 0
 ````
 
-This can lead to bugs in your logic: in the example above, `numbers` will be empty because the tokenizers used were the character sets `.decimalDigits` and `.letters`, so the filter won't match any of the tokens.
+This can lead to bugs in your logic: in the example above, `numberTokens` will be empty because the tokenizers used were `CharacterSet.decimalDigits` and `CharacterSet.letters`, so the filter won't match any of the tokens.
 
 This may seem like an obvious error, but it's the type of unexpected bug that can slip in when we're using loosely typed results.
 
-Thankfully, Mustard can return a strongly typed set of matches if a single `TokenType` is used:
+Thankfully, Mustard can return a strongly typed set of matches if a single `TokenizerType` is used:
 
 ````Swift
 import Mustard
 
-let messy = "123Hello world&^45.67"
-
-// call the `matches()` method on a string to get matching tokens from the string
-let numberMatches: [NumberToken.Match] = messy.matches()
-// numberMatches.count -> 2
+// call the `tokens()` method on `String` to get matching tokens from the string
+let numberTokens: [NumberTokenizer.Token] = "123Hello world&^45.67".tokens()
+// numberTokens.count -> 2
 ````
 
-Used in this way, this isn't very useful, but it does allow multiple `TokenType`s to be bundled together as a single `TokenType` implemented with an `enum`.
+Used in this way, this isn't very useful, but it does allow multiple `TokenizerType`s to be bundled together as a single tokenizer implemented with an `enum`.
 
-An enum token type can either manage its own internal state, or act as a lightweight wrapper around existing tokenizers.
-Here's an example `TokenType` that acts as a wrapper for word, number, and emoji tokenizers:
+An enum tokenizer can either manage its own internal state, or act as a lightweight wrapper around other existing tokenizers.
 
-````Swift
+Here's an example `TokenizerType` that acts as a wrapper for word, number, and emoji tokenizers:
 
-enum MixedToken: TokenType {
+````Swift
+enum MixedTokenizer: TokenizerType {
 
     case word
     case number
     case emoji
     case none // 'none' case not strictly needed, and
               // in this implementation will never be matched
-
     init() {
         self = .none
     }
 
-    static let wordToken = WordToken()
-    static let numberToken = NumberToken()
-    static let emojiToken = EmojiToken()
+    static let wordTokenizer = WordTokenizer()
+    static let numberTokenizer = NumberTokenizer()
+    static let emojiTokenizer = EmojiTokenizer()
 
-    func canAppend(next scalar: UnicodeScalar) -> Bool {
+    func tokenCanTake(_ scalar: UnicodeScalar) -> Bool {
         switch self {
-        case .word: return MixedToken.wordToken.canAppend(next: scalar)
-        case .number: return MixedToken.numberToken.canAppend(next: scalar)
-        case .emoji: return MixedToken.emojiToken.canAppend(next: scalar)
+        case .word: return MixedTokenizer.wordTokenizer.tokenCanTake(scalar)
+        case .number: return MixedTokenizer.numberTokenizer.tokenCanTake(scalar)
+        case .emoji: return MixedTokenizer.emojiTokenizer.tokenCanTake(scalar)
         case .none:
             return false
         }
     }
 
-    func token(startingWith scalar: UnicodeScalar) -> TokenType? {
+    func token(startingWith scalar: UnicodeScalar) -> TokenizerType? {
 
-        if let _ = MixedToken.wordToken.token(startingWith: scalar) {
-            return MixedToken.word
+        if let _ = MixedTokenizer.wordTokenizer.token(startingWith: scalar) {
+            return MixedTokenizer.word
         }
-        else if let _ = MixedToken.numberToken.token(startingWith: scalar) {
-            return MixedToken.number
+        else if let _ = MixedTokenizer.numberTokenizer.token(startingWith: scalar) {
+            return MixedTokenizer.number
         }
-        else if let _ = MixedToken.emojiToken.token(startingWith: scalar) {
-            return MixedToken.emoji
+        else if let _ = MixedTokenizer.emojiTokenizer.token(startingWith: scalar) {
+            return MixedTokenizer.emoji
         }
         else {
             return nil
@@ -90,25 +86,25 @@ Mustard defines a default typealias for `Token` that exposes the specific type i
 results tuple.
 
 ````Swift
-public extension TokenType {
-    typealias Match = (tokenizer: Self, text: String, range: Range<String.Index>)
+public extension TokenizerType {
+    typealias Token = (tokenizer: Self, text: String, range: Range<String.Index>)
 }
 ````
 
-Setting your results array to this type gives you the option to use the shorter `matches()` method,
+Setting your results array to this type gives you the option to use the shorter `tokens()` method,
 where Mustard uses the inferred type to perform tokenization.
 
-Since the matches array is strongly typed, you can be more expressive with the results, and the
+Since the tokens array is strongly typed, you can be more expressive with the results, and the
 compiler can give you more hints to prevent you from making mistakes.
 
 ````Swift
 
-// use the `matches()` method to grab matching substrings using a single tokenizer
-let matches: [MixedToken.Match] = "123👩‍👩‍👦‍👦Hello world👶 again👶🏿 45.67".matches()
-// matches.count -> 8
+// use the `tokens()` method to grab matching substrings using a single tokenizer
+let tokens: [MixedTokenizer.Token] = "123👩‍👩‍👦‍👦Hello world👶 again👶🏿 45.67".tokens()
+// tokens.count -> 8
 
-matches.forEach({ match in
-    switch (match.tokenizer, match.text) {
+tokens.forEach({ token in
+    switch (token.tokenizer, token.text) {
     case (.word, let word): print("word:", word)
     case (.number, let number): print("number:", number)
    case (.emoji, let emoji): print("emoji:", emoji)
`````
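The commit message also mentions a new `components(:)` convenience that returns plain substrings instead of `Token` values. Its implementation isn't part of the hunks shown above, so the following is only a rough, self-contained sketch of the idea (the `SimpleToken` shape and the free function are illustrative assumptions, not Mustard's source): a components list is just a projection of the matched text out of each token.

```swift
// Hypothetical sketch: `components` as a projection over already-matched
// tokens. `SimpleToken` mirrors the `(text:, range:)` part of Mustard's
// token tuple, keeping the example runnable without the library.
typealias SimpleToken = (text: String, range: Range<String.Index>)

func components(from tokens: [SimpleToken]) -> [String] {
    return tokens.map { $0.text }
}

let source = "123Hello"
let splitIndex = source.index(source.startIndex, offsetBy: 3)
let tokens: [SimpleToken] = [
    (text: "123", range: source.startIndex..<splitIndex),
    (text: "Hello", range: splitIndex..<source.endIndex),
]
print(components(from: tokens)) // ["123", "Hello"]
```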
Lines changed: 29 additions & 29 deletions

`````diff
@@ -1,43 +1,43 @@
 # Greedy tokens and tokenizer order
 
-Tokenizers are greedy. The order in which tokenizers are passed to `matches(from: TokenType...)` will affect how substrings are matched.
+Tokenizers are greedy. The order in which tokenizers are passed to `tokens(matchedWith: TokenizerType...)` will affect how substrings are matched.
 
-Here's an example using the `CharacterSet.decimalDigits` tokenizer and the custom tokenizer `DateToken`, which matches dates in the format `MM/dd/yy` ([see example](Tokens with internal state.md) for implementation).
+Here's an example using the `CharacterSet.decimalDigits` tokenizer and the custom tokenizer `DateTokenizer`, which matches dates in the format `MM/dd/yy` ([see example](Tokens with internal state.md) for implementation).
 
 ````Swift
 import Mustard
 
 let numbers = "03/29/17 36"
-let matches = numbers.matches(from: CharacterSet.decimalDigits, DateToken.tokenizer)
-// matches.count -> 4
+let tokens = numbers.tokens(matchedWith: CharacterSet.decimalDigits, DateTokenizer.defaultTokenizer)
+// tokens.count -> 4
 //
-// matches[0].text -> "03"
-// matches[0].tokenizer -> CharacterSet.decimalDigits
+// tokens[0].text -> "03"
+// tokens[0].tokenizer -> CharacterSet.decimalDigits
 //
-// matches[1].text -> "29"
-// matches[1].tokenizer -> CharacterSet.decimalDigits
+// tokens[1].text -> "29"
+// tokens[1].tokenizer -> CharacterSet.decimalDigits
 //
-// matches[2].text -> "17"
-// matches[2].tokenizer -> CharacterSet.decimalDigits
+// tokens[2].text -> "17"
+// tokens[2].tokenizer -> CharacterSet.decimalDigits
 //
-// matches[3].text -> "36"
-// matches[3].tokenizer -> CharacterSet.decimalDigits
+// tokens[3].text -> "36"
+// tokens[3].tokenizer -> CharacterSet.decimalDigits
 ````
 
-To get the expected behavior, the `matches` method should be called with more specific tokenizers placed before more general tokenizers:
+To get the expected behavior, the `tokens` method should be called with more specific tokenizers placed before more general tokenizers:
 
 ````Swift
 import Mustard
 
 let numbers = "03/29/17 36"
-let matches = numbers.matches(from: DateToken.tokenizer, CharacterSet.decimalDigits)
-// matches.count -> 2
+let tokens = numbers.tokens(matchedWith: DateTokenizer.defaultTokenizer, CharacterSet.decimalDigits)
+// tokens.count -> 2
 //
-// matches[0].text -> "03/29/17"
-// matches[0].tokenizer -> DateToken()
+// tokens[0].text -> "03/29/17"
+// tokens[0].tokenizer -> DateTokenizer()
 //
-// matches[1].text -> "36"
-// matches[1].tokenizer -> CharacterSet.decimalDigits
+// tokens[1].text -> "36"
+// tokens[1].tokenizer -> CharacterSet.decimalDigits
 ````
 
 If the more specific tokenizer fails to match a token, the more general tokenizers still have a chance to perform matches:
@@ -46,18 +46,18 @@ If the more specific tokenizer fails to match a token, the more general tokenize
 import Mustard
 
 let numbers = "99/99/99 36"
-let matches = numbers.matches(from: DateToken.tokenizer, CharacterSet.decimalDigits)
-// matches.count -> 4
+let tokens = numbers.tokens(matchedWith: DateTokenizer.defaultTokenizer, CharacterSet.decimalDigits)
+// tokens.count -> 4
 //
-// matches[0].text -> "99"
-// matches[0].tokenizer -> CharacterSet.decimalDigits
+// tokens[0].text -> "99"
+// tokens[0].tokenizer -> CharacterSet.decimalDigits
 //
-// matches[1].text -> "99"
-// matches[1].tokenizer -> CharacterSet.decimalDigits
+// tokens[1].text -> "99"
+// tokens[1].tokenizer -> CharacterSet.decimalDigits
 //
-// matches[2].text -> "99"
-// matches[2].tokenizer -> CharacterSet.decimalDigits
+// tokens[2].text -> "99"
+// tokens[2].tokenizer -> CharacterSet.decimalDigits
 //
-// matches[3].text -> "36"
-// matches[3].tokenizer -> CharacterSet.decimalDigits
+// tokens[3].text -> "36"
+// tokens[3].tokenizer -> CharacterSet.decimalDigits
 ````
`````
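The ordering rule shown in this file doesn't depend on Mustard specifically. Below is a minimal, self-contained sketch of a greedy scanner that reproduces the behavior; the `Matcher` closures and the `tokens(in:using:)` helper are illustrative assumptions (Mustard's real protocol works scalar-by-scalar), not the library's implementation.

```swift
import Foundation

// A matcher returns the length of the token it can greedily take at the
// start of the input, or nil if it can't match there.
typealias Matcher = (Substring) -> Int?

// Specific: an MM/dd/yy-shaped run of digits and slashes (simplified).
let date: Matcher = { s in
    guard let r = s.range(of: "^[0-9]{2}/[0-9]{2}/[0-9]{2}",
                          options: .regularExpression) else { return nil }
    return s.distance(from: r.lowerBound, to: r.upperBound)
}

// General: any run of decimal digits.
let digits: Matcher = { s in
    let run = s.prefix(while: { $0.isNumber })
    return run.isEmpty ? nil : run.count
}

// Greedy scan: at each position, the first matcher that succeeds wins.
func tokens(in text: String, using matchers: [Matcher]) -> [String] {
    var result: [String] = []
    var rest = Substring(text)
    while !rest.isEmpty {
        if let length = matchers.compactMap({ $0(rest) }).first {
            result.append(String(rest.prefix(length)))
            rest = rest.dropFirst(length)
        } else {
            rest = rest.dropFirst() // skip characters no matcher wants
        }
    }
    return result
}

// General tokenizer first: the date gets eaten piecemeal.
print(tokens(in: "03/29/17 36", using: [digits, date]))
// ["03", "29", "17", "36"]

// Specific tokenizer first: the whole date is matched as one token.
print(tokens(in: "03/29/17 36", using: [date, digits]))
// ["03/29/17", "36"]
```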

Documentation/Matching emoji.md

Lines changed: 6 additions & 6 deletions

`````diff
@@ -6,21 +6,21 @@ As an example, the character '👶🏿' is comprised by two scalars: '👶', and
 The rainbow flag character '🏳️‍🌈' is again composed of the adjacent scalars '🏳' and '🌈' (joined by a ZWJ and a variation selector).
 A final example: the character '👨‍👨‍👧‍👦' is actually 7 scalars: '👨' '👨' '👧' '👦' joined by three ZWJs (zero-width joiners).
 
-To create a TokenType that matches emoji, we can instead check to see if a scalar falls within a known range, or if it's a ZWJ.
+To create a `TokenizerType` that matches emoji, we can instead check to see if a scalar falls within a known range, or if it's a ZWJ.
 
 This isn't the most *accurate* emoji tokenizer, because it would potentially match an emoji scalar followed by 100 zero-width joiners, but for basic use it might be enough.
 
 ````Swift
-struct EmojiToken: TokenType {
+struct EmojiTokenizer: TokenizerType {
 
     // (e.g. can't start with a ZWJ)
-    func canStart(with scalar: UnicodeScalar) -> Bool {
-        return EmojiToken.isEmojiScalar(scalar)
+    func tokenCanStart(with scalar: UnicodeScalar) -> Bool {
+        return EmojiTokenizer.isEmojiScalar(scalar)
     }
 
     // either in the known range for an emoji, or a ZWJ
-    func canTake(_ scalar: UnicodeScalar) -> Bool {
-        return EmojiToken.isEmojiScalar(scalar) || EmojiToken.isJoiner(scalar)
+    func tokenCanTake(_ scalar: UnicodeScalar) -> Bool {
+        return EmojiTokenizer.isEmojiScalar(scalar) || EmojiTokenizer.isJoiner(scalar)
     }
 
     static func isJoiner(_ scalar: UnicodeScalar) -> Bool {
`````
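The scalar counts this file describes can be verified directly in plain Swift, without Mustard:

```swift
// Counting the Unicode scalars behind the characters discussed above.
let baby = "👶🏿"            // baby scalar + skin-tone modifier scalar
print(baby.unicodeScalars.count)    // 2

let family = "👨‍👨‍👧‍👦"          // four person scalars joined by three ZWJs
print(family.unicodeScalars.count)  // 7

// A ZWJ is U+200D; we can spot the three of them inside the family character:
let zwjCount = family.unicodeScalars.filter { $0.value == 0x200D }.count
print(zwjCount)                     // 3
```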

Documentation/TallyType protocol.md

Lines changed: 0 additions & 114 deletions
This file was deleted.
