Any way to improve this summary? #7517

the1gofer · 2021-03-21T15:59:24Z

I have found spaCy's text summary to be pretty good, but in some cases it creates assertions that are just wrong. Can I do anything to improve this?

Sample text:

She is survived by 4-daughters: Carla Melissa Todd (Chris), Angela Michelle Black (Chris), Toni Linder Carter (Chris) and Kimberly Ruth Sims (Josh); a son: Alpha Morris Solomon (Carrie); 15-grandchildren: Travis Black, Kristen Black, Breanna Black, Brooke Carter, Kevin Carter, Kaylynn Carter, Christopher Carter, Daniel Lane, Kendall Sims, Brian Wheat, Joseph Sims, Jeremy Sims, Abbigail Solomon, Logan Solomon and Emma Solomon; 2-great grandchildren: Braxton Black and Rose White; a sister: Karen Hodges; a brother: Lloyd Nowling and a number of nieces, nephews and other relatives.

Results:

Alpha Morris Solomon (Carrie); 15-grandchildren: Travis Black, Kristen Black, Breanna Black, Brooke Carter, Kevin Carter, Kaylynn Carter, Christopher Carter, Daniel Lane, Kendall Sims, Brian Wheat, Joseph Sims, Jeremy Sims, Abbigail Solomon, Logan Solomon and Emma Solomon; 2-great grandchildren: She is survived by 4-daughters: Carla Melissa Todd (Chris), Angela Michelle Black (Chris), Toni Linder Carter (Chris) and Kimberly Ruth Sims (Josh); a son: Braxton Black and Rose White; a sister: Karen Hodges; a brother: Lloyd Nowling and a number of nieces, nephews and other relatives.

Notice "a son: Braxton Black and Rose White;". That's just wrong. The son should be "Alpha Morris Solomon (Carrie)", and Braxton and Rose should be great grandchildren.

    from collections import Counter
    from heapq import nlargest
    from string import punctuation

    import spacy
    from spacy.lang.en.stop_words import STOP_WORDS

    nlp = spacy.load("en_core_web_lg")
    text = input("Enter text to be summarized: ")
    doc = nlp(text)

    # Collect content words by part of speech, skipping stop words and punctuation.
    keyword = []
    stopwords = list(STOP_WORDS)
    pos_tag = ['PROPN', 'ADJ', 'NOUN', 'VERB']
    for token in doc:
        if token.text in stopwords or token.text in punctuation:
            continue
        if token.pos_ in pos_tag:
            keyword.append(token.text)

    # Normalize every keyword count by the count of the most frequent keyword.
    freq_word = Counter(keyword)
    max_freq = freq_word.most_common(1)[0][1]
    for word in freq_word.keys():
        freq_word[word] = freq_word[word] / max_freq

    # Score each sentence as the sum of the normalized frequencies of its keywords.
    sent_strength = {}
    for sent in doc.sents:
        for word in sent:
            if word.text in freq_word.keys():
                if sent in sent_strength.keys():
                    sent_strength[sent] += freq_word[word.text]
                else:
                    sent_strength[sent] = freq_word[word.text]

    # Keep the three highest-scoring sentences (nlargest returns them in score order).
    summarized_sentences = nlargest(3, sent_strength, key=sent_strength.get)

    final_sentences = [w.text for w in summarized_sentences]
    summary = ' '.join(final_sentences)
    print("*******************")
    print("Summary: " + summary)

Answered by polm

polm · 2021-03-22T05:03:50Z

So it looks like you're applying extractive summarization on top of spaCy using the sentence tokenizer and POS tags. What's happening is you're running up against the limits of extractive summarization - if your basic unit is sentences, there's not really much you can do with your example document. In fact it would be reasonable to interpret it as a single sentence, in which case there's nothing you can do.
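One detail that may explain the scrambled pairing specifically: judging by the posted result, the text was cut into three fragments (after "a son:" and after "2-great grandchildren:"), and `heapq.nlargest` returns its picks ordered by score rather than by position. The sketch below illustrates that effect in plain Python; the fragments are abridged from the sample text and the scores are made-up placeholders, not values computed by spaCy.

```python
from heapq import nlargest

# Three fragments in document order, abridged from the sample text.
fragments = [
    "She is survived by 4-daughters: Carla Melissa Todd (Chris); a son:",
    "Alpha Morris Solomon (Carrie); 2-great grandchildren:",
    "Braxton Black and Rose White; a sister: Karen Hodges.",
]

# Hypothetical sentence strengths, standing in for the frequency-based scores.
scores = [0.6, 0.9, 0.3]

# nlargest orders the selected indices by descending score ...
by_score = nlargest(3, range(len(fragments)), key=scores.__getitem__)

# ... so sorting the selected indices restores document order before joining.
in_document_order = sorted(by_score)

scrambled = " ".join(fragments[i] for i in by_score)
restored = " ".join(fragments[i] for i in in_document_order)
```

Here `scrambled` ends with "a son: Braxton Black and Rose White", the same kind of false pairing as in the question, while `restored` keeps each label next to its own names. Sorting only helps when neighbouring fragments are all selected; it does not remove the underlying limit of picking whole sentences.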

If you want to improve your results with minimal changes, you might look at using a custom Sentencizer to change how sentence splits are detected, possibly treating all colons and semicolons as sentence dividers. If you're focused on summarizing obituaries like this, though, I would honestly just focus on extracting phrases like "4 daughters" and removing named entities marked as PERSON; this seems like a reasonable summary:

She is survived by 4-daughters, a son, 15-grandchildren, 2-great grandchildren, a sister, a brother, and a number of nieces, nephews and other relatives.
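A rough sketch of that phrase-extraction idea, assuming the "label: names; label: names" layout of the sample. To stay self-contained it uses a capitalized-word regex as a crude stand-in for spaCy's PERSON entities, so `NAME`, `LEADING_JUNK` and `summarize_survivors` are hypothetical helpers and not part of spaCy; with a loaded pipeline the entity spans from `doc.ents` would presumably be a better signal.

```python
import re

# Heuristic stand-in for PERSON entities: a run of capitalized words,
# optionally followed by a parenthesized name such as "(Chris)".
NAME = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*(?: \([A-Z][a-z]+\))?")

# Commas, whitespace and "and" left at the start once the names are removed.
LEADING_JUNK = re.compile(r"^(?:\s|,|and\b)+")


def summarize_survivors(text):
    """Keep relation phrases such as "4-daughters" and drop the listed names."""
    phrases = []
    for segment in text.split(";"):
        # The label is before the colon, the list of names after it.
        head, _, tail = segment.partition(":")
        if head.strip():
            phrases.append(head.strip())
        # Remove names from the tail and keep any non-name remainder,
        # e.g. "a number of nieces, nephews and other relatives".
        rest = LEADING_JUNK.sub("", NAME.sub("", tail)).strip(" .")
        if rest:
            phrases.append(rest)
    return ", ".join(phrases) + "."


# Abridged from the sample text in the question.
sample = (
    "She is survived by 4-daughters: Carla Melissa Todd (Chris) and "
    "Kimberly Ruth Sims (Josh); a son: Alpha Morris Solomon (Carrie); "
    "a brother: Lloyd Nowling and a number of nieces, nephews and other relatives."
)
summary = summarize_survivors(sample)
print(summary)
```

The regex will miss names that are hyphenated, lowercase or contain initials, which is where a real entity recognizer should do better.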

Also, does your original document actually write it like 4-daughters and 15-grandchildren? That's really weird.

the1gofer Mar 22, 2021
Author

Not all of them, but enough of them.
