How can I improve a Spacy matcher that uses too much memory? #11462
Unanswered
mahagilo
asked this question in
Help: Other Questions
Replies: 1 comment 2 replies
-
Hi @mahagilo! Please use code formatting; it makes it easier for us to read the code and help you.
-
I need the spaCy Matcher to detect keywords from a database in a text (including variants like singular/plural). I pre-build the matcher and use pickle to save both the matcher and nlp; see the code below:
Simplified version of the matcher build:

```python
import pickle
import tracemalloc

tracemalloc.start()
for term in keylist:
    # Matcher.add takes a list of patterns, each a list of token dicts
    matcher.add(term, [[{"LOWER": term.lower()}]])
with open(save_matcher, "wb") as f:
    pickle.dump(matcher, f)
with open(save_nlp, "wb") as f:
    pickle.dump(nlp, f)

current, peak = tracemalloc.get_traced_memory()
print(f"Memory use {current / 10**6}MB; Peak {peak / 10**6}MB")
# Number of keywords: 8961
# Time spent building matcher: 62.46
# Memory use 22.664226MB; Peak 261.314393MB
```
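The question mentions catching singular/plural variants, which a plain `LOWER` pattern on one surface form will not do. A minimal sketch of one way to cover both forms, using a blank English pipeline (no lemmatizer) and the `IN` extended pattern attribute; the keyword "apple" is just an illustration, and with a full pipeline a single `{"LEMMA": ...}` token pattern could cover both forms in one entry:

```python
import spacy
from spacy.matcher import Matcher

# Blank pipeline has no lemmatizer, so both surface forms are listed
# explicitly via the IN operator; a loaded model could use {"LEMMA": "apple"}.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("apple", [[{"LOWER": {"IN": ["apple", "apples"]}}]])

doc = nlp("She bought two apples and one apple.")
matches = [doc[start:end].text for _, start, end in matcher(doc)]
print(matches)  # ['apples', 'apple']
```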
Simplified version of loading the matcher and nlp:

```python
import pickle
import tracemalloc

tracemalloc.start()
nlp = pickle.load(open(save_nlp, "rb"))
matcher = pickle.load(open(save_matcher, "rb"))

current, peak = tracemalloc.get_traced_memory()
print(f"Memory after loading Spacy is {current / 10**6}MB; Peak was {peak / 10**6}MB")
# Memory after loading Spacy is 719.07097MB; Peak was 934.495292MB
```
It takes a long time to build the spaCy matcher with thousands of keywords, so I need to save it after building. Is pickle the only option for saving? When I load the matcher and nlp from pickle, they use a lot more memory and my cloud bills will bankrupt me ☹ Any thoughts on how to improve saving the spaCy matcher?
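One possible direction, sketched below under assumptions (this is not an official recommendation, and `keylist` plus the JSON path are placeholder names): since pickling the Matcher also serializes objects tied to the shared `Vocab`, persist only the pattern definitions as JSON and rebuild the matcher at load time, which is cheap for simple token patterns; the pipeline itself can be saved with `nlp.to_disk(...)` and reloaded with `spacy.load(...)` instead of pickle.

```python
import json
import os
import tempfile

import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
keylist = ["apple", "banana"]  # stand-in for the database keywords

# Save step: serialize only the pattern data, not the Matcher object.
patterns = {term: [[{"LOWER": term.lower()}]] for term in keylist}
path = os.path.join(tempfile.gettempdir(), "patterns.json")  # placeholder path
with open(path, "w") as f:
    json.dump(patterns, f)

# Load step: rebuild the matcher from JSON, reusing the pipeline's vocab.
matcher = Matcher(nlp.vocab)
with open(path) as f:
    for key, pats in json.load(f).items():
        matcher.add(key, pats)

doc = nlp("I like apple pie.")
hits = [nlp.vocab.strings[match_id] for match_id, start, end in matcher(doc)]
print(hits)  # ['apple']
```

For thousands of literal keyword strings, spaCy's `PhraseMatcher` may also be worth benchmarking against the token-based `Matcher`, since it is designed for large terminology lists.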