Tokenizer instances should return lists, and ideally lists of strings. Right now, here's an example of what's going on:

    import stringcompare
    tokenizer = stringcompare.NGramTokenizer(3)
    tokenizer("hello world")
    <zip at ...>
    list(tokenizer("Hello World"))
    [('H', 'e', 'l'),
     ('e', 'l', 'l'),
     ('l', 'l', 'o'),
     ('l', 'o', ' '),
     ('o', ' ', 'W'),
     (' ', 'W', 'o'),
     ('W', 'o', 'r'),
     ('o', 'r', 'l'),
     ('r', 'l', 'd')]

So NGramTokenizer currently returns a zip object that has to be wrapped in list() to see its contents. It would be better for it to return a list of strings instead of character tuples (I'll fix that), but the two are basically equivalent. Here's how WhitespaceTokenizer works:

    tokenizer = stringcompare.WhitespaceTokenizer()
    tokenizer("Hello world")
    ['Hello', 'world']
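
For illustration, here is a rough sketch of what returning a list of strings could look like. This is not the stringcompare implementation: char_ngrams is a hypothetical standalone helper, and it only assumes that the n-grams are built with zip, as the output above suggests.

```python
def char_ngrams(text, n):
    """Return the character n-grams of `text` as a list of strings.

    Hypothetical helper for illustration only; not part of stringcompare.
    """
    # n staggered views of the text: text[0:], text[1:], ..., text[n-1:]
    shifted = [text[i:] for i in range(n)]
    # zip yields tuples of characters lazily; join each tuple into a string
    # and collect everything into a list so the caller gets a concrete value.
    return ["".join(gram) for gram in zip(*shifted)]


print(char_ngrams("Hello World", 3))
```

Joining each tuple keeps the same n-grams in the same order, so anything that compares token collections should behave the same, while callers get a plain list they can index, measure with len(), and print directly.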
I was wondering about the return type of a tokenizer call.
It appears to use "zip", which seems to return a list of tuples, but I am unsure. Could you clarify what type a tokenizer returns when it is called?