Can I pre-compile a large dictionary created through PhraseMatcher? #10514

lingvisa · 2022-03-17T23:25:18Z

I have a large dictionary created through PhraseMatcher. Initialization takes around 30 seconds to load the terms into the dictionary.

def __init__(self, nlp, char_match):
    self.matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
    ...
    for label, terms in attribute_word_dict.items():
        self.matcher.add(label, list(self.nlp.tokenizer.pipe(terms)))

Can I create the dictionary object ahead of time and load it later without running the code above again? I am wondering whether, and how, I can speed up the initialization process.

Answered by polm

polm · 2022-03-18T03:47:39Z

There isn't a feature for pre-compilation, and serialization rebuilds the internal state from text. You might be able to make things faster using pickle, but that would have to pull in the Vocab/nlp object, so I'm not sure how cleanly it would work (it might be fine).

The way you are adding things is a little weird and could maybe be improved. This might be faster:

for label, terms in attribute_word_dict.items():
    for term in self.nlp.tokenizer.pipe(terms):
        self.matcher.add(label, [term])

Normally, reducing the number of calls to self.matcher.add would be faster, but if you have a lot of terms, building the whole list at once could be causing inefficient behavior.

What tokenizer are you using? How big is your dictionary (labels/terms)?
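To make the pickling idea above concrete, here is a minimal sketch of a build-once, load-later cache. It is deliberately generic: `load_or_build` and its `build` argument are hypothetical names, not part of spaCy, and `build` stands in for whatever code creates and fills the PhraseMatcher. Whether loading the pickle is actually faster than rebuilding has not been measured here, so treat it as something to try and time.

```python
import pickle
from pathlib import Path


def load_or_build(cache_path, build):
    """Return the object pickled at cache_path, or build it and cache it.

    `build` is a zero-argument callable (hypothetical; in this thread it
    would wrap the loop that adds the terms to the PhraseMatcher).
    """
    path = Path(cache_path)
    if path.exists():
        # Cache hit: skip the expensive construction entirely.
        with path.open("rb") as f:
            return pickle.load(f)
    # Cache miss: build once, then store the result for the next startup.
    obj = build()
    with path.open("wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
    return obj
```

Usage might look like `matcher = load_or_build("matcher.pkl", build_matcher)`, where `build_matcher` is a hypothetical function containing the loop above. The cache file has to be deleted whenever the term list changes, and because the pickle pulls in the Vocab, it is worth checking that the loaded matcher still behaves correctly with the live nlp object.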

12 replies

polm Mar 22, 2022

I'm not sure there's any way to use multiprocessing to speed up the loading process, since even if nothing breaks you still have to synchronize on the same data structure.

I would recommend you try pickling the Matcher, and if that doesn't work or doesn't help, profile it to see what's slow about the creation. Maybe it would help to serialize the Docs you are feeding to the Matcher in a DocBin and load from that? With that many Docs, deserializing will still take a bit, though.
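The DocBin idea could be sketched roughly as follows. Several things here are assumptions made for illustration: a blank English pipeline from `spacy.blank("en")`, a tiny stand-in for `attribute_word_dict`, and keeping each Doc's label in `doc.user_data["label"]` so the label survives serialization. The sketch only removes the tokenization step at startup; the calls to `matcher.add` still have to run.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # assumption: tokenizer-only English pipeline

# Stand-in for the attribute_word_dict from the question.
attribute_word_dict = {"COLOR": ["dark red", "blue"], "SIZE": ["extra large"]}

# One-time step: tokenize every term and store the Docs with their label.
doc_bin = DocBin(store_user_data=True)
for label, terms in attribute_word_dict.items():
    for doc in nlp.tokenizer.pipe(terms):
        doc.user_data["label"] = label  # assumption: label kept in user_data
        doc_bin.add(doc)
data = doc_bin.to_bytes()  # doc_bin.to_disk(path) would write a file instead

# At startup: skip tokenization and rebuild the matcher from the stored Docs.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
patterns = {}
for doc in DocBin(store_user_data=True).from_bytes(data).get_docs(nlp.vocab):
    patterns.setdefault(doc.user_data["label"], []).append(doc)
for label, docs in patterns.items():
    matcher.add(label, docs)

# Quick check that the rebuilt matcher finds the terms case-insensitively.
matches = matcher(nlp("An Extra Large shirt in Dark Red"))
found = sorted((nlp.vocab.strings[m_id], start, end) for m_id, start, end in matches)
```

Timing the startup half of this against the original loop on the real dictionary would show whether tokenization was the bottleneck in the first place.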

I am a little confused about why a 20s startup with that number of patterns is an issue though. Can you explain your use case, particularly your time constraints, a little more?

lingvisa Mar 22, 2022
Author

I will try pickling PhraseMatcher; if that works, it would be the most straightforward option. We have a huge knowledge base consisting of a large dictionary and a large set of patterns. During initialization, it takes around 20 ms to load the dictionary, another 20 ms to load the patterns, and another 10-15 ms to load BERT models. The whole thing takes around 1 minute to initialize and start up. This won't be an issue at all for deployment, but for daily development it would be great if the startup time could be cut in half, since developers have to start and stop the program frequently.

lingvisa Mar 22, 2022
Author

Sorry, not 'ms', but 'seconds'.

polm Mar 22, 2022

If this is only a problem during development I would suggest you use a smaller list of patterns for development.

lingvisa Mar 22, 2022
Author

Yes, I do have a 'test': true/false setting for loading either the full data or a small subset at startup, but often the real effect can only be seen, tested, and evaluated when the full KB is loaded. I will try serialization first. Personally, I have gotten used to this time cost, but for the convenience of others I hope to make it faster.
