How to apply rule-based matcher on a dataframe? #11247
Hello everyone! I have the following dataframe:

    import pandas as pd

    details = {
        'Text_id': [23, 21, 22, 21],
        'Text': ['All roads lead to Rome',
                 'All work and no play makes Jack a dull buy',
                 'Any port in a storm',
                 'Avoid a questioner, for he is also a tattler'],
    }
    # creating a Dataframe object
    example_df = pd.DataFrame(details)

I want to apply rule-based matching to the "Text" column:

    import spacy
    # import the matcher
    from spacy.matcher import Matcher

    # load the pipeline and create the nlp object
    nlp = spacy.load("en_core_web_sm")

    # define rule-based matching function
    def rb_match(df_name, col_name, pattern):
        # initialize the matcher with the shared vocab
        matcher = Matcher(nlp.vocab)
        # add the pattern to the matcher using .add method
        pattern_name = "PATTERN_%s" % col_name
        matcher.add(pattern_name, [pattern])
        # process some text and store it in new column
        # use nlp.pipe for better performance
        df_name['Text_spacy'] = [d for d in nlp.pipe(df_name[col_name])]
        # call the matcher on the doc, the result is a list of tuples
        df_name['matches_tuples'] = df_name['Text_spacy'].apply(lambda x: matcher(x))
        # generate matches and store them in a new column
        df_name["matches"] = [doc[start:end].text for match_id, start, end in df_name['matches_tuples']]
        return df_name

Let's apply the function to the "Text" column in the example dataframe to extract verbs:

    rb_match(example_df, "Text", [{"POS": "VERB"}])

I get the following error message:

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    /tmp/ipykernel_33/185760541.py in <module>
    ----> 1 rb_match(example_df, "Text", [{"POS":"VERB"}] )

    /tmp/ipykernel_33/66914527.py in rb_match(df_name, col_name, pattern)
         13
         14     # generate matches
    ---> 15     df_name["matches"] = [doc[start:end].text for match_id, start, end in df_name['matches_tuples']]
         16
         17     return df_name

    /tmp/ipykernel_33/66914527.py in <listcomp>(.0)
         13
         14     # generate matches
    ---> 15     df_name["matches"] = [doc[start:end].text for match_id, start, end in df_name['matches_tuples']]
         16
         17     return df_name

    ValueError: not enough values to unpack (expected 3, got 1)

If we comment out the line that raises the error, the function basically returns a list of tuples, where each tuple has the form (match_id, start index of the matched span, end index of the matched span). However, it cannot iterate over the matches.

My question: How can I fix my function so that it returns a new column with the matches? Am I on the right track if I want to apply it to a large dataframe, or is there a more efficient method?

Thank you in advance!
Replies: 1 comment
Hi @afi1289,

If you want the Matcher to immediately return the matches, you can replace that line with:

    df_name['matches_tuples'] = df_name['Text_spacy'].apply(lambda x: [i.text for i in matcher(x, as_spans=True)])

You can check the spaCy Matcher documentation for more information. I can't help you with your other questions because they relate more to the pandas library than to spaCy. If you're worried about performance or efficiency, I highly recommend checking out the pandas user guide on scaling as a start.

Hope that helps!
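
A side note on the `ValueError` in the question: iterating over the `matches_tuples` column yields one list of matches per row, not one `(match_id, start, end)` tuple, so the three-way unpacking fails as soon as a row's list does not contain exactly three items. The comprehension also refers to `doc`, which is never defined inside it. Below is a minimal sketch in plain Python that reproduces the unpacking error and shows the nested iteration that avoids it. The `rows` data, the match ids, and both helper function names are made up for illustration; they only stand in for the real column.

```python
# Stand-in for df_name['matches_tuples']: one list of (match_id, start, end)
# tuples per row. The match_id values are invented for this illustration.
rows = [
    [(101, 2, 3)],               # a row with one match
    [(101, 3, 4), (101, 6, 7)],  # a row with two matches
    [],                          # a row with no matches
]


def unpack_rows_directly(rows):
    """Mimic the failing comprehension: treat each row as a single tuple."""
    return [(start, end) for match_id, start, end in rows]


def unpack_per_row(rows):
    """Iterate over rows first, then over the match tuples inside each row."""
    return [[(start, end) for match_id, start, end in row] for row in rows]


# The first row is a list holding ONE tuple, so unpacking it into three
# names raises the same kind of error as in the traceback.
try:
    unpack_rows_directly(rows)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# The nested version keeps one list of (start, end) pairs per row.
per_row = unpack_per_row(rows)

print(error_message)
print(per_row)
```

In the real function, the inner loop would also need the row's `Doc` (for example by iterating over `Text_spacy` and `matches_tuples` together) in order to slice `doc[start:end].text`. The `as_spans=True` approach above sidesteps that, because each match already comes back as a span.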