Skip already "tokenized" string in doc #11587
-
It's a little hard to understand what's going on in your code. What does your pipeline look like, and what order are the components in? Where is […] defined? You normally shouldn't call […]. Based on your sample, it looks like the way you're checking […]. I'm also not sure what you mean by "already tokenized": when you get a Doc, every part of it is already tokenized. Is your hashtag-related code running […]?
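To illustrate the usual pattern the questions above are getting at, here is a minimal sketch of a custom pipeline component that sets a custom extension attribute on tokens. All names (`mark_hashtags`, `is_hashtag`) are hypothetical, chosen for this example; the component and extension registration follow spaCy v3's documented API.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Token

# Hypothetical extension; force=True avoids "already registered" errors on re-run.
Token.set_extension("is_hashtag", default=False, force=True)

@Language.component("mark_hashtags")
def mark_hashtags(doc):
    # The default English tokenizer splits "#london" into "#" + "london",
    # so flag the token that follows a "#" token.
    for i, tok in enumerate(doc):
        if tok.text == "#" and i + 1 < len(doc):
            doc[i + 1]._.is_hashtag = True
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("mark_hashtags")
doc = nlp("visiting #london today")
print([t.text for t in doc if t._.is_hashtag])
```

The important detail is that the extension is registered once with `Token.set_extension` before the pipeline runs, and every component that reads or writes it goes through the `token._.` namespace.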
-
I'm working with code that splits hashtags using multiple word lists: 1) one full of city names, 2) another with specific terms, and 3) two big, common wordlists (Spanish & English), in order to detect city names in hashtags and then geolocate them. The script works OK, but it's really slow, because it looks for lots of substrings in the city-list array after a similar word is found in a wordlist, and each match is then validated with the geonamescache library: https://github.com/yaph/geonamescache
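Not the asker's actual code, but a sketch of the kind of hashtag splitting described above, assuming the word lists are stored as Python `set`s (membership tests on a `set` are O(1), whereas scanning a list for every substring is what typically makes this slow). All names and word lists here are made up for illustration:

```python
# Hypothetical word lists; in the real script these would be the
# city-name list plus the Spanish/English common wordlists.
city_names = {"london", "paris", "madrid"}
common_words = {"visit", "love", "city"}

def split_hashtag(tag, vocab):
    """Greedy longest-match split of a hashtag into known words."""
    tag = tag.lstrip("#").lower()
    words, i = [], 0
    while i < len(tag):
        # Try the longest possible substring first, shrinking until a match.
        for j in range(len(tag), i, -1):
            if tag[i:j] in vocab:
                words.append(tag[i:j])
                i = j
                break
        else:
            return None  # some part of the tag matched nothing
    return words

print(split_hashtag("#visitlondon", city_names | common_words))
```

This prints `['visit', 'london']`. Simply converting the existing arrays to sets before the loop is often enough to remove most of the slowdown, independent of any spaCy-level skipping.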
I'm trying to skip hashtags that have already passed through the tokenizer, or at least those that the language component has already tagged.
I have this code:
It detects the `is_hashtag` attribute, but the `is_geo` check doesn't work and the `parse_tag()` function is always executed.
```python
@Language.component("mention_hashtags_set_extension")
```
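Without the full snippet one can only guess why `is_geo` never takes effect, but a common cause is checking the flag before any component has set it, or forgetting that the flag must be read through the `._.` namespace after being registered with `set_extension`. A minimal sketch of the guard pattern, with hypothetical names (`geo_hashtags`, `is_geo`):

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Hypothetical extension; force=True avoids "already registered" errors on re-run.
Doc.set_extension("is_geo", default=False, force=True)

@Language.component("geo_hashtags")
def geo_hashtags(doc):
    # Guard first: if the flag was already set, skip the expensive work.
    if doc._.is_geo:
        return doc
    # ... the expensive parse_tag() / geonamescache lookup would go here ...
    doc._.is_geo = True
    return doc

nlp = spacy.blank("es")
nlp.add_pipe("geo_hashtags")
doc = nlp("#madrid")
print(doc._.is_geo)
```

Note that extension values live on a specific `Doc` object: calling `nlp()` again on new text creates a fresh `Doc` with `is_geo` back at its default, so the flag cannot carry information across separate tweets by itself.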
So, is there a way for spaCy to mark already-processed strings/words (in this case, hashtags), so that I can use this mark as a condition to skip the whole process of splitting the hashtag, looking for substrings in the arrays, and validating with geonamescache, and make the whole thing faster?
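Since each `nlp()` call produces a fresh `Doc`, one way to skip hashtags seen in earlier texts is to cache the expensive splitting/validation function itself, keyed on the hashtag string. This is a generic Python technique, not a spaCy feature; `parse_tag_cached` is a hypothetical stand-in for the real `parse_tag()`:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def parse_tag_cached(tag):
    # Stand-in for the expensive split + geonamescache validation;
    # the real body would go here.
    return tag.lstrip("#").lower()

parse_tag_cached("#London")   # computed on first call
parse_tag_cached("#London")   # served from the cache on repeats
print(parse_tag_cached.cache_info().hits)
```

This prints `1`: the second call for the same hashtag never runs the function body, so repeated hashtags across tweets cost essentially nothing regardless of how the pipeline components are arranged.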