PhraseMatcher memory consumption #9362
Hi, I have several questions about the PhraseMatcher. As I understand it, the PhraseMatcher is adapted from FlashText, which means it may also use a trie data structure? Since I found that the Matcher has been much slower since spaCy 2.0.18 (detailed description in this discussion), we have migrated all our simple patterns to the PhraseMatcher. However, we have noticed that the PhraseMatcher consumes much more memory than before. For example, the patterns we added to the PhraseMatcher come from a 28 MB text file. Before adding these patterns, our model takes about 1 GB of RAM. After adding all the patterns, memory consumption goes up to 4 GB. I understand that it may be hard to optimize memory consumption, but do you have any suggestions? This may be a problem for us because we will deploy our pipeline on several workers, so the difference in memory usage will be multiplied. spaCy version: 3.0.6
Replies: 1 comment 5 replies
Can you give us some more information about your patterns? The main things that would affect size are the raw number of patterns and the complexity of each pattern.
For a large number of patterns, large memory use is kind of unavoidable, but 3 GB does sound like a whole lot. The PhraseMatcher does use a trie, but tries are usually more memory efficient than raw lists. On the other hand, we don't regularly test with anywhere near as many patterns as you're using.
It depends on your patterns, but if you have a bunch of patterns matching on literal terms, you might be able to reduce memory usage by compiling them to regex matches. For example if you have matches for a single token like a, aa, aaa, and …
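As a rough illustration of the "compile literal terms to a regex" idea, here is a minimal sketch. The helper name `terms_to_token_regex` and the label `A_TERMS` are made up for this example, and the sketch assumes the resulting string would be used in a Matcher token pattern of the form `{"TEXT": {"REGEX": ...}}`. The regex is anchored on the assumption that the token text is searched rather than fully matched. Whether this actually saves memory (or costs too much speed) for a given pattern set would need to be measured.

```python
import re


def terms_to_token_regex(terms):
    """Collapse literal single-token terms into one anchored regex string.

    Hypothetical helper for illustration; not part of spaCy.
    """
    # Deduplicate and sort longest-first so the alternation is deterministic.
    unique = sorted(set(terms), key=lambda t: (-len(t), t))
    # Escape each term so characters like "." or "+" are matched literally.
    alternation = "|".join(re.escape(t) for t in unique)
    # Anchor both ends so "aaaa" does not match via a shorter alternative.
    return f"^(?:{alternation})$"


terms = ["a", "aa", "aaa"]
token_regex = terms_to_token_regex(terms)

# One token pattern instead of three separate PhraseMatcher entries.
# Assumption: with spaCy installed this could be registered roughly as
#   matcher = Matcher(nlp.vocab); matcher.add("A_TERMS", [pattern])
pattern = [{"TEXT": {"REGEX": token_regex}}]

# Stand-in for tokenized text, using plain whitespace splitting.
compiled = re.compile(token_regex)
matches = [tok for tok in "a aa aaaa b aaa".split() if compiled.search(tok)]
print(token_regex)
print(matches)
```

This only covers single-token literal terms. Multi-token phrases would need one token dict per position, so they do not collapse as neatly.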