Improve large dictionary matching performance #5532
Replies: 6 comments
-
You'd know for sure by doing more detailed profiling, but I suspect this may depend a lot on the speed of the tokenizer. Your example looks like it didn't get copied quite correctly? You can also just use:

```python
for label, terms in word_dict.items():
    matcher.add(label, [nlp.make_doc(term) for term in terms])
```

As a very rough estimate I'd expect the default English tokenizer to take about a minute to process 1 million 3-word texts, but it can depend a lot on the tokenizer settings and the contents of the texts (the tokenizer has a cache).
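If tokenization is the bottleneck, batching the terms can shave off some per-call overhead. Here is a minimal sketch, assuming a hypothetical `word_dict` mapping labels to lists of phrase strings and using `Tokenizer.pipe` to tokenize in batches:

```python
import spacy
from spacy.matcher import PhraseMatcher

# Hypothetical stand-in for the real ~1M entry dictionary: {label: [phrase, ...]}
word_dict = {
    "FRUIT": ["red apple", "green pear", "ripe banana"],
    "CITY": ["new york", "san francisco"],
}

nlp = spacy.blank("en")  # a blank pipeline, so tokenization is the only cost
matcher = PhraseMatcher(nlp.vocab)

for label, terms in word_dict.items():
    # Tokenizer.pipe tokenizes the terms as a stream, avoiding some of the
    # per-call overhead of calling nlp.make_doc() once per term.
    matcher.add(label, list(nlp.tokenizer.pipe(terms)))

doc = nlp("She bought a red apple in new york.")
matches = [(nlp.vocab.strings[mid], start, end) for mid, start, end in matcher(doc)]
print(matches)
```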
-
I am investigating the bottleneck, and the time spent on English is a good reference. I am actually using character-based tokens in this case.
-
The `PhraseMatcher` can be pickled, so you can build it once, save it, and reuse it in later runs. Reloading a pickled matcher is much faster than adding all of the patterns again from scratch.
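A minimal sketch of that save/reload workflow, assuming `matcher` has already been built as in the snippet above; the file name `matcher.pkl` is just an example:

```python
import pickle

# Save the compiled matcher once (this also pickles the shared vocab).
with open("matcher.pkl", "wb") as f:
    pickle.dump(matcher, f)

# On later runs, reload it instead of re-adding ~1 million patterns.
with open("matcher.pkl", "rb") as f:
    matcher = pickle.load(f)
```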
-
It actually took about 2.10 minutes. I had just eyeballed it and felt it took 10 minutes, but it's really only a bit more than 2 minutes. Not bad.
-
I build the dictionary in an initializer, once per run. At development time it's quite common to run the script repeatedly, so a pickled matcher is worth trying; thanks for the note.
-
You can use flashtext; there is a "plugin" for it on the spaCy universe page: https://spacy.io/universe/project/spacy-lookup
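If plain string matching is sufficient, flashtext can also be used directly (spacy-lookup wraps it as a pipeline component). A rough sketch with a hypothetical two-label dictionary:

```python
from flashtext import KeywordProcessor

# Hypothetical stand-in for the real ~1M entry dictionary: {label: [phrase, ...]}
word_dict = {
    "FRUIT": ["red apple", "green pear"],
    "CITY": ["new york"],
}

kp = KeywordProcessor()
# The dictionary keys become the "clean names" returned for each match.
kp.add_keywords_from_dict(word_dict)

text = "She bought a red apple in new york."
# span_info=True also returns the character offsets of each match.
print(kp.extract_keywords(text, span_info=True))
# e.g. [('FRUIT', 13, 22), ('CITY', 26, 34)]
```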
-
I have a dictionary which contains nearly 1 million text entries. I used the `PhraseMatcher` to compile all the entries into patterns, and it takes quite a while to complete the compiling process:
This piece of code alone takes about 10 minutes to complete. Is there a way to make it faster, given the size of the dictionary? Would the EntityRuler be faster than the PhraseMatcher?
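The snippet referred to above didn't survive the copy, but a compilation loop of roughly this shape, shown here with a hypothetical `word_dict` and an example English model, is the kind of code being described:

```python
import spacy
from spacy.matcher import PhraseMatcher

# Hypothetical stand-in for the real ~1M entry dictionary: {label: [phrase, ...]}
word_dict = {
    "FRUIT": ["red apple", "green pear", "ripe banana"],
    "CITY": ["new york", "san francisco"],
}

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)

for label, terms in word_dict.items():
    # Calling nlp() runs the whole pipeline (tagger, parser, ner) for every
    # term; with ~1 million terms this tends to dominate the runtime.
    # nlp.make_doc() or nlp.tokenizer.pipe() only tokenize and are much cheaper.
    patterns = [nlp(term) for term in terms]
    matcher.add(label, patterns)
```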