PhraseMatcher memory consumption #9362

Pandalei97 · 2021-10-04T12:12:36Z

Hi,

I have several questions about the PhraseMatcher.

As far as I can see, the PhraseMatcher is adapted from FlashText, which suggests that it also uses a trie data structure. Is that right?

Since I found that the Matcher has been much slower since spaCy 2.0.18 (detailed description in this discussion), we have migrated all our simple patterns to the PhraseMatcher. However, we have noticed that the PhraseMatcher consumes much more memory than before.

For example, the patterns we add to the PhraseMatcher come from a 28 MB text file. Before adding these patterns, our model takes about 1 GB of RAM. After adding all the patterns, memory consumption goes up to 4 GB.

I understand that memory consumption may be hard to optimize, but do you have any suggestions?

This may be problematic for us because we will deploy our pipeline on several workers, so the difference in memory usage will be multiplied.

spaCy version: 3.0.6

Answered by polm

polm · 2021-10-10T04:50:17Z

Can you give us some more information about your patterns? The main thing that would affect size is the raw number of patterns and the complexity of each pattern.

For a large number of patterns, large memory use is kind of unavoidable, but 3 GB does sound like a whole lot. The Matcher does use a trie, but tries are usually more memory efficient than raw lists. On the other hand, we don't regularly test with anywhere near as many patterns as you're using.

It depends on your patterns, but if you have a bunch of patterns matching on literal terms, you might be able to reduce memory usage by compiling them to regex matches. For example, if you have three separate patterns matching the single tokens a, aa, and aaa, I think replacing them with one regex that matches a|aa|aaa would reduce memory usage.
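
A minimal sketch of that idea, using only Python's `re` module: it collapses a list of single-token literals into one anchored alternation. The helper name `literals_to_regex` and the sample terms are illustrative assumptions, not something from this thread. In spaCy, a regex like this could be used in a token pattern for the rule-based `Matcher` through its `REGEX` operator; the PhraseMatcher itself takes Doc objects rather than regexes, so this would only apply to single-token terms moved over to the `Matcher`. Whether it actually saves memory for a given pattern set would need to be measured.

```python
import re


def literals_to_regex(terms):
    """Build one anchored alternation that matches any of the literal terms."""
    # Deduplicate and sort (longest first) so the output is deterministic.
    unique = sorted(set(terms), key=lambda t: (-len(t), t))
    # re.escape keeps characters such as "." or "+" literal.
    return "^(?:" + "|".join(re.escape(t) for t in unique) + ")$"


terms = ["a", "aa", "aaa"]
regex = literals_to_regex(terms)

# Assumed shape of a single spaCy Matcher token pattern that replaces
# three separate literal patterns with one REGEX pattern.
token_pattern = [{"LOWER": {"REGEX": regex}}]

# Quick check of the regex itself, independent of spaCy.
compiled = re.compile(regex)
matches = [t for t in ["a", "aa", "aaa", "aaaa", "b"] if compiled.match(t)]
print(regex)
print(matches)
```

The anchors (`^` and `$`) matter here: without them, the pattern would also match longer tokens that merely contain one of the terms.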

5 replies

Pandalei97 Oct 29, 2021
Author

Hi polm,

Thanks for your response. I apologise for my late reply.

I profiled the following test code with memory_profiler:

from essential_generators import DocumentGenerator
import spacy
from memory_profiler import profile
from spacy.matcher import PhraseMatcher
import gc

@profile
def phrase_marcher_memory_test():
    # generate 100000 random sentences, about 10-20 MB
    gen = DocumentGenerator()
    sentences = [gen.sentence() for _ in range(100000)]
    nlp = spacy.load("fr_core_news_sm")
    phrase_matcher = PhraseMatcher(vocab=nlp.vocab)
    # Garbage collection before generating patterns
    gc.collect()
    # Create patterns and add them into phraseMatcher
    patterns = [nlp(s) for s in sentences]
    phrase_matcher.add("PATTERN", patterns)
    # Release patterns and redo a garbage collection
    del patterns
    gc.collect()

if __name__ == "__main__":
    phrase_marcher_memory_test()

Here's the result of the memory profiling:

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     7     62.7 MiB     62.7 MiB           1   @profile
     8                                         def phrase_marcher_memory_test():
     9                                             # generate 100000 random sentences, about 10-20 MB
    10    353.6 MiB    290.9 MiB           1       gen = DocumentGenerator()
    11    367.5 MiB   -170.6 MiB      100003       sentences = [gen.sentence() for _ in range(100000)]
    12    532.6 MiB    165.1 MiB           1       nlp = spacy.load("fr_core_news_sm")
    13    532.6 MiB      0.0 MiB           1       phrase_matcher = PhraseMatcher(vocab=nlp.vocab)
    14                                             # Garbage collection before generating patterns
    15    532.4 MiB     -0.2 MiB           1       gc.collect()
    16                                             # Create patterns and add them into phraseMatcher
    17   2002.4 MiB   1461.7 MiB      100003       patterns = [nlp(s) for s in sentences]
    18   2397.0 MiB    394.6 MiB           1       phrase_matcher.add("PATTERN", patterns)
    19                                             # Release patterns and redo a garbage collection
    20   1221.5 MiB  -1175.5 MiB           1       del patterns
    21   1221.5 MiB      0.0 MiB           1       gc.collect()

As you can see, once the patterns have been added to the PhraseMatcher, memory consumption ends up at about 1.2 GB, compared with about 530 MB before the patterns were created. It doesn't come down any further even after we release the generated patterns and run garbage collection.

The total memory consumed by the PhraseMatcher exceeds that of the whole pipeline.

Do you have any idea what optimizations we could try? We can try to reduce the number of patterns by using regexes, but I think the memory saved that way would be quite limited.

Sorry for unmarking the answer selected by svlandeg. I just think that this discussion hasn't reached the point where it can be archived yet.

polm Oct 31, 2021

Deleting the patterns won't free up memory because there are copies of them inside the Matcher.

I asked this before, but could you give us some more details about your patterns? For example are they PhraseMatcher style patterns, or are you using a lot of IN blocks, or are you using just one attribute, or... Depending on what features you're using there may be something that can help. In your code above you're using PhraseMatcher patterns but it's not clear if that's the case in your real code.

As a separate tactic, if you are memory bound with several workers but not CPU bound, it may make sense to use a single worker for your patterns and pass the results to pipelines in other workers.

Also, as an extra point, while we're happy to try to help you, I don't think we've ever had to explicitly focus on memory usage of the Matcher before, so we're going to have to figure this out as we go along.

Pandalei97 Oct 31, 2021
Author

Hi polm,

My patterns are PhraseMatcher style patterns, just like the example that I gave above.

All our patterns have just one attribute "LOWER". Thus, in our real implementation, we use phrase_matcher = PhraseMatcher(vocab=nlp.vocab, attr='LOWER') instead of phrase_matcher = PhraseMatcher(vocab=nlp.vocab). But it shouldn't make too much difference.

Sorry for not being clear. I thought it was sufficient to say that I was using the PhraseMatcher in the title and in my description.

In the discussions that I start, I always try to generalize and simplify the problem by writing test code like the example above, so that everyone can reproduce it if they have a similar issue.

> As a separate tactic, if you are memory bound with several workers but not CPU bound, it may make sense to use a single worker for your patterns and pass the results to pipelines in other workers.

Thanks for your advice! We will give it a try.

> Also, as an extra point, while we're happy to try to help you, I don't think we've ever had to explicitly focus on memory usage of the Matcher before, so we're going to have to figure this out as we go along.

I am happy to discuss this with you, too. As a spaCy user, I find this framework splendid. I do understand that memory issues are never easy to solve. Thank you for paying attention to this point! 😄

polm Nov 18, 2021

Hello, sorry for not following up on this again.

I experimented with adding an option to the Matcher to delete the internal patterns. Unfortunately just doing so didn't free a noticeable amount of memory - I didn't debug that in extensive detail, but it could be that most of the memory is just in the Vocab with the initial string allocations or something. That was the only potentially easy fix that came to mind, unfortunately.

Also, that was my mistake about the PhraseMatcher, you did clearly state that.

One thing to clarify is, is the memory used significantly larger than just storing the patterns in memory? If not (which I assume is the case) then that limits the options for reducing memory usage.

If having a master process to reduce memory usage doesn't help, one thing you might consider is taking a non-spaCy approach using a DAWG or DAFSA data structure. These are like tries, but they compress subtrees, which helps make them memory efficient. The PhraseMatcher should be more memory efficient since it operates in token space and not character space, but I don't think we've done a direct comparison, so it could be worth checking.
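
To make the trie versus DAWG/DAFSA comparison concrete, here is a small pure-Python sketch. It builds a token-level trie from a few phrases and then merges identical subtrees bottom-up, which is the basic idea behind a DAFSA. The sample phrases and all function names are made up for illustration, and this says nothing about how the PhraseMatcher is implemented internally; it only shows how much node sharing is possible on a toy input.

```python
class Node:
    """One state: outgoing edges keyed by token, plus an end-of-phrase flag."""
    __slots__ = ("children", "final")

    def __init__(self):
        self.children = {}
        self.final = False


def build_trie(phrases):
    """Insert each phrase token by token (token space, not character space)."""
    root = Node()
    for phrase in phrases:
        node = root
        for token in phrase.lower().split():
            node = node.children.setdefault(token, Node())
        node.final = True
    return root


def count_nodes(node):
    """Count the nodes of a plain (unshared) trie."""
    return 1 + sum(count_nodes(child) for child in node.children.values())


def minimize(node, registry):
    """Merge identical subtrees bottom-up.

    registry maps a subtree signature to its single canonical node, so
    len(registry) is the number of distinct nodes after merging.
    """
    for token, child in list(node.children.items()):
        node.children[token] = minimize(child, registry)
    signature = (
        node.final,
        tuple(sorted((tok, id(child)) for tok, child in node.children.items())),
    )
    return registry.setdefault(signature, node)


def contains(root, phrase):
    """Exact-phrase lookup; works on both the trie and the merged graph."""
    node = root
    for token in phrase.lower().split():
        node = node.children.get(token)
        if node is None:
            return False
    return node.final


phrases = ["new york city", "new york state", "kansas city", "kansas state"]
root = build_trie(phrases)
trie_nodes = count_nodes(root)  # must be counted before merging mutates the trie

registry = {}
dafsa_root = minimize(root, registry)
dafsa_nodes = len(registry)
print(trie_nodes, dafsa_nodes)
```

For these four phrases the trie has 8 nodes and the merged graph has 4, because the shared "city"/"state" endings collapse into one subtree. Real pattern lists may share far fewer suffixes, so the saving has to be measured on the actual data.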

Pandalei97 Nov 18, 2021
Author

Hi polm,

> One thing to clarify is, is the memory used significantly larger than just storing the patterns in memory? If not (which I assume is the case) then that limits the options for reducing memory usage

Yes, it is significantly larger than just storing the patterns. I worked around this by preprocessing the documents in advance and storing the patterns (which are lists of integers) in a file. Then, when I need the patterns, I just read them from the file and add them directly to the PhraseMatcher. That reduced memory consumption from 6 GB to 3.4 GB.
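
As a rough illustration of the "lists of integers in a file" part of this workaround, the sketch below writes each phrase as a length-prefixed run of 64-bit token IDs and reads them back without creating any Doc objects. The file layout, the function names, and the `token_id` stand-in hash are all assumptions made for this example; the thread does not describe the actual format, and the step that feeds the integers into the PhraseMatcher is not shown because it is not described either.

```python
import hashlib
import os
import struct
import tempfile
from array import array


def token_id(token):
    """Stand-in 64-bit ID for a lowercased token (NOT spaCy's own hash)."""
    digest = hashlib.blake2b(token.lower().encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def write_patterns(path, phrases):
    """Store each phrase as a uint32 length followed by that many uint64 IDs."""
    with open(path, "wb") as f:
        for phrase in phrases:
            ids = array("Q", (token_id(t) for t in phrase.split()))
            f.write(struct.pack("<I", len(ids)))
            # Native byte order: read the file back on the same kind of machine.
            f.write(ids.tobytes())


def read_patterns(path):
    """Yield one compact array of token IDs per stored phrase."""
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                break
            (length,) = struct.unpack("<I", header)
            ids = array("Q")
            ids.frombytes(f.read(8 * length))
            yield ids


phrases = ["New York City", "machine learning"]
path = os.path.join(tempfile.mkdtemp(), "patterns.bin")
write_patterns(path, phrases)
loaded = list(read_patterns(path))
print([len(ids) for ids in loaded])
```

An `array("Q")` holds its values as raw 8-byte integers, so it is much more compact than a Python list of int objects, and far smaller than a full Doc per pattern.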

I see in the source code that you already intend to store only the essential part of the Doc in the PhraseMatcher:

spaCy/spacy/matcher/phrasematcher.pyx

Lines 338 to 339 in 86fa37e

    
    def _convert_to_array(self, Doc doc):
        return [Token.get_struct_attr(&doc.c[i], self.attr) for i in range(len(doc))]

Since this part is written in Cython, would it be possible to copy the attribute values and explicitly free the Doc object, since it's no longer needed?

In addition, I noticed that in the EntityRuler, phrase_patterns stores the entire Doc object for every phrase pattern it has processed:

spaCy/spacy/pipeline/entityruler.py

Lines 314 to 323 in 86fa37e

    phrase_patterns = []
    for label, pattern, ent_id in zip(
        phrase_pattern_labels,
        self.nlp.pipe(phrase_pattern_texts),
        phrase_pattern_ids,
    ):
        phrase_pattern = {"label": label, "pattern": pattern}
        if ent_id:
            phrase_pattern["id"] = ent_id
        phrase_patterns.append(phrase_pattern)

These Doc objects seem to be kept only for lookup purposes, which may unnecessarily cause memory issues for other users in the future.
