Improve large dictionary matching performance #5532
Replies: 6 comments
-
You'd know for sure by doing more detailed profiling, but I suspect this may depend a lot on the speed of the tokenizer. Your example looks like it didn't get copied quite correctly? You can also just use:

```python
for label, terms in word_dict.items():
    matcher.add(label, [nlp.make_doc(term) for term in terms])
```

As a very rough estimate I'd expect the default English tokenizer to take about a minute to process 1 million 3-word texts, but it can depend a lot on the tokenizer settings and the contents of the texts (the tokenizer has a cache).
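If tokenization is the bottleneck, batching the terms can shave off some per-call overhead. Here is a minimal sketch, assuming a hypothetical `word_dict` mapping labels to lists of phrase strings and using `Tokenizer.pipe` to tokenize in batches:

```python
import spacy
from spacy.matcher import PhraseMatcher

# Hypothetical stand-in for the real ~1M entry dictionary: {label: [phrase, ...]}
word_dict = {
    "FRUIT": ["red apple", "green pear", "ripe banana"],
    "CITY": ["new york", "san francisco"],
}

nlp = spacy.blank("en")  # a blank pipeline, so tokenization is the only cost
matcher = PhraseMatcher(nlp.vocab)

for label, terms in word_dict.items():
    # Tokenizer.pipe tokenizes the terms as a stream, avoiding some of the
    # per-call overhead of calling nlp.make_doc() once per term.
    matcher.add(label, list(nlp.tokenizer.pipe(terms)))

doc = nlp("She bought a red apple in new york.")
matches = [(nlp.vocab.strings[mid], start, end) for mid, start, end in matcher(doc)]
print(matches)
```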
-
I am investigating the bottleneck, and the time spent on English is a good reference. I am actually using character-based tokens in this case.
-
The `PhraseMatcher` can be pickled, so you can build it once, save it, and reuse it in later runs. Reloading a pickled matcher is much faster than adding all of the patterns again from scratch.
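A minimal sketch of that save/reload workflow, assuming `matcher` has already been built as in the snippet above; the file name `matcher.pkl` is just an example:

```python
import pickle

# Save the compiled matcher once (this also pickles the shared vocab).
with open("matcher.pkl", "wb") as f:
    pickle.dump(matcher, f)

# On later runs, reload it instead of re-adding ~1 million patterns.
with open("matcher.pkl", "rb") as f:
    matcher = pickle.load(f)
```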
-
It actually took about 2.10 minutes. I had just eyeballed it and felt it took 10 minutes, but it's really only a bit more than 2 minutes. Not bad.
-
I build the dictionary in an initializer, once per run. At development time it's quite common to run the script repeatedly, so a pickled matcher is worth trying; thanks for the note.
-
You can use flashtext; there is a "plugin" for it on the spaCy universe page: https://spacy.io/universe/project/spacy-lookup
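If plain string matching is sufficient, flashtext can also be used directly (spacy-lookup wraps it as a pipeline component). A rough sketch with a hypothetical two-label dictionary:

```python
from flashtext import KeywordProcessor

# Hypothetical stand-in for the real ~1M entry dictionary: {label: [phrase, ...]}
word_dict = {
    "FRUIT": ["red apple", "green pear"],
    "CITY": ["new york"],
}

kp = KeywordProcessor()
# The dictionary keys become the "clean names" returned for each match.
kp.add_keywords_from_dict(word_dict)

text = "She bought a red apple in new york."
# span_info=True also returns the character offsets of each match.
print(kp.extract_keywords(text, span_info=True))
# e.g. [('FRUIT', 13, 22), ('CITY', 26, 34)]
```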
-
I have a dictionary which contains nearly 1 million text entries. I used the `PhraseMatcher` to compile all the entries into patterns, and it takes quite a while to complete the compiling process:
This piece of code alone takes about 10 minutes to complete. Is there a way to make it faster, given the size of the dictionary? Would the EntityRuler be faster than the PhraseMatcher?
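The snippet referred to above didn't survive the copy, but a compilation loop of roughly this shape, shown here with a hypothetical `word_dict` and an example English model, is the kind of code being described:

```python
import spacy
from spacy.matcher import PhraseMatcher

# Hypothetical stand-in for the real ~1M entry dictionary: {label: [phrase, ...]}
word_dict = {
    "FRUIT": ["red apple", "green pear", "ripe banana"],
    "CITY": ["new york", "san francisco"],
}

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)

for label, terms in word_dict.items():
    # Calling nlp() runs the whole pipeline (tagger, parser, ner) for every
    # term; with ~1 million terms this tends to dominate the runtime.
    # nlp.make_doc() or nlp.tokenizer.pipe() only tokenize and are much cheaper.
    patterns = [nlp(term) for term in terms]
    matcher.add(label, patterns)
```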