How to apply rule-based matcher on a dataframe? #11247

ahmad-alismail · 2022-07-29T11:27:56Z


Hello everyone!
I have the following dataframe:

import pandas as pd

details = {
    'Text_id': [23, 21, 22, 21],
    'Text': ['All roads lead to Rome',
             'All work and no play makes Jack a dull buy',
             'Any port in a storm',
             'Avoid a questioner, for he is also a tattler'],
}

# create a DataFrame object
example_df = pd.DataFrame(details)

I want to apply spaCy's rule-based Matcher to the Text column of the dataframe and create a new column containing the matches. Let's assume that only verbs should match.
I defined a function that takes a dataframe, a column name, and a pattern as follows:

# import spaCy and the matcher
import spacy
from spacy.matcher import Matcher

# load the pipeline and create the nlp object
nlp = spacy.load("en_core_web_sm")

# define rule-based matching function
def rb_match(df_name, col_name, pattern):

    # initialize the matcher with the shared vocab
    matcher = Matcher(nlp.vocab)
    # add the pattern to the matcher using .add method
    pattern_name = "PATTERN_%s" %col_name  
    matcher.add(pattern_name, [pattern])
    
    # process some text and store it in new column
    # use nlp.pipe for better performance 
    df_name['Text_spacy'] = [d for d in nlp.pipe(df_name[col_name])]
    
    # call the matcher on the doc, the result is a list of tuples
    df_name['matches_tuples'] = df_name['Text_spacy'].apply(lambda x: matcher(x))
    
    # generate matches and store them in a new column
    df_name["matches"] = [doc[start:end].text for match_id, start, end in df_name['matches_tuples']]
    
    return df_name

Let's apply the function to the "Text" column of the example dataframe to extract verbs:

rb_match(example_df, "Text", [{"POS":"VERB"}] )

I get the following error message:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_33/185760541.py in <module>
----> 1 rb_match(example_df, "Text", [{"POS":"VERB"}] )

/tmp/ipykernel_33/66914527.py in rb_match(df_name, col_name, pattern)
     13 
     14     # generate matches
---> 15     df_name["matches"] = [doc[start:end].text for match_id, start, end in df_name['matches_tuples']]
     16 
     17     return df_name

/tmp/ipykernel_33/66914527.py in <listcomp>(.0)
     13 
     14     # generate matches
---> 15     df_name["matches"] = [doc[start:end].text for match_id, start, end in df_name['matches_tuples']]
     16 
     17     return df_name

ValueError: not enough values to unpack (expected 3, got 1)

If we comment out the line df_name["matches"] = [doc[start:end].text for match_id, start, end in df_name['matches_tuples']] in the function and rerun it, we get this output:

 Text_id                                          Text                                                Text_spacy                  matches_tuples
0       23                        All roads lead to Rome                              (All, roads, lead, to, Rome)  [(12643752728212218961, 2, 3)]
1       21    All work and no play makes Jack a dull buy     (All, work, and, no, play, makes, Jack, a, dull, buy)  [(12643752728212218961, 5, 6)]
2       22                           Any port in a storm                                 (Any, port, in, a, storm)                              []
3       21  Avoid a questioner, for he is also a tattler  (Avoid, a, questioner, ,, for, he, is, also, a, tattler)  [(12643752728212218961, 0, 1)]

Basically, the matcher returns a list of tuples for each row, where each tuple has the form (match_id, start index of matched span, end index of matched span). However, my list comprehension cannot iterate over these matches.
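
The unpacking error can be reproduced without spaCy or pandas. The sketch below is only an illustration: it copies the per-row matcher output from the table above into a plain list (the names `matches_per_row` and `spans_per_row` are made up for this example) and shows that looping over the column yields whole lists rather than individual tuples.

```python
# Per-row matcher output copied from the table above (one list per row).
matches_per_row = [
    [(12643752728212218961, 2, 3)],
    [(12643752728212218961, 5, 6)],
    [],
    [(12643752728212218961, 0, 1)],
]

# Iterating over the column yields whole lists, not (match_id, start, end)
# tuples, so three-way unpacking fails on the first row (a list of length 1).
try:
    for match_id, start, end in matches_per_row:
        pass
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# A nested loop reaches the tuples themselves, one row at a time.
spans_per_row = [
    [(start, end) for _match_id, start, end in row]
    for row in matches_per_row
]
```

Each row therefore needs its own inner loop (or a per-row `apply`), because a row can hold zero, one, or several matches.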

My question: How can I fix my function so that it returns a new column with the matches? Am I heading in the right direction if I want to apply it to a large dataframe, or is there a more efficient method?

Thank you in advance!

Answered by ljvmiranda921


ljvmiranda921 · 2022-08-01T10:06:44Z


Hi @afi1289 ,

If you want the Matcher to return the matched text directly, you can replace the line that calls the matcher with:

df_name['matches_tuples'] = df_name['Text_spacy'].apply(lambda x: [i.text for i in matcher(x, as_spans=True)])

You can check the spaCy Matcher documentation for more information. I can't help with your other questions because they relate more to the pandas library than to spaCy. If you're worried about performance or efficiency, I highly recommend starting with the pandas scaling user guide.
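
For reference, here is one way the whole function could look with `as_spans=True`. This is a minimal sketch, not the original poster's final code: the function name `add_match_column`, the `nlp` parameter, and the demo pattern are assumptions made for illustration. With this approach the failing `df_name["matches"] = ...` line is no longer needed (it also referenced `doc`, which is never defined inside the function).

```python
import pandas as pd
import spacy
from spacy.matcher import Matcher


def add_match_column(df, col_name, pattern, nlp):
    """Return a copy of df with a 'matches' column of matched span texts."""
    matcher = Matcher(nlp.vocab)
    matcher.add(f"PATTERN_{col_name}", [pattern])

    out = df.copy()
    # nlp.pipe processes the texts as a stream; as_spans=True makes the
    # matcher return Span objects instead of (match_id, start, end) tuples.
    out["matches"] = [
        [span.text for span in matcher(doc, as_spans=True)]
        for doc in nlp.pipe(out[col_name].astype(str))
    ]
    return out


# Demo with a blank English pipeline (tokenizer only), so no model download
# is needed. A POS pattern such as [{"POS": "VERB"}] needs a pipeline with a
# tagger, for example spacy.load("en_core_web_sm").
nlp = spacy.blank("en")
demo_df = pd.DataFrame({"Text": ["All roads lead to Rome", "Any port in a storm"]})
result = add_match_column(demo_df, "Text", [{"LOWER": "all"}], nlp)
```

The sketch does not store the `Doc` objects in the dataframe, which keeps memory use lower on large inputs. `nlp.pipe` also accepts a `batch_size` argument if the default batching needs tuning.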

Hope that helps!
