Fix KeyError in SumBasicSummarizer #234

Open
bysiber wants to merge 1 commit into miso-belica:main from bysiber:fix-sumbasic-keyerror

@bysiber bysiber commented Feb 21, 2026

Fixes #176.

The SumBasicSummarizer crashes with a KeyError when processing certain texts because the word processing pipeline is inconsistent between two methods.

_get_content_words_in_sentence processes words in this order: normalize → filter stop words → stem

But _get_all_content_words_in_doc (via _get_all_words_in_doc) does it differently: stem → filter stop words → normalize

This means a word can end up in the per-sentence word list but be missing from the document frequency table (or vice versa), which causes the KeyError when looking up the word frequency.

The fix aligns _get_all_content_words_in_doc to follow the same processing order as _get_content_words_in_sentence: normalize → filter stop words → stem. _get_all_words_in_doc now returns raw (unstemmed) words so the caller can apply the pipeline consistently.
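
The ordering problem can be reproduced outside the library. The following is only a minimal sketch: `STOP_WORDS`, `normalize`, `stem`, and the three pipeline functions are hypothetical stand-ins, not the summarizer's actual helpers, and the stemmer is a toy that strips a single trailing "s". It shows one word that reaches the per-sentence list but never reaches the old document table, and one word that reaches the old document table but no sentence list.

```python
from collections import Counter

# Hypothetical stop-word list, chosen so that a stem ("new") is a stop word.
STOP_WORDS = {"the", "is", "new"}


def normalize(word):
    # Stand-in for normalization: lowercase only.
    return word.lower()


def stem(word):
    # Toy stemmer (assumption): strip one trailing "s" from longer words.
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def sentence_pipeline(words):
    # normalize -> filter stop words -> stem
    normalized = [normalize(w) for w in words]
    content = [w for w in normalized if w not in STOP_WORDS]
    return [stem(w) for w in content]


def doc_pipeline_old(words):
    # stem -> filter stop words -> normalize (the mismatched order)
    stemmed = [stem(w) for w in words]
    content = [w for w in stemmed if w not in STOP_WORDS]
    return [normalize(w) for w in content]


def doc_pipeline_fixed(words):
    # Same order as the per-sentence pipeline.
    return sentence_pipeline(words)


words = ["The", "news", "is", "good"]
sentence_words = sentence_pipeline(words)

# Plain dicts, so a missing key raises KeyError on lookup.
old_freq = dict(Counter(doc_pipeline_old(words)))
new_freq = dict(Counter(doc_pipeline_fixed(words)))

# "news" stems to the stop word "new": kept per sentence, dropped from the old table.
missing = [w for w in sentence_words if w not in old_freq]
# "The" slips past the case-sensitive stop-word filter in the old order.
extra = [w for w in old_freq if w not in sentence_words]

print(missing)  # ['new']
print(extra)    # ['the']
print(all(w in new_freq for w in sentence_words))  # True
```

With the old order, `old_freq["new"]` raises a KeyError, while the extra `"the"` entry does not crash but does skew the frequency table. With a shared order, every per-sentence word has an entry.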

All existing sum_basic tests pass.

The _get_all_content_words_in_doc method was processing words in a
different order (stem -> filter -> normalize) compared to
_get_content_words_in_sentence (normalize -> filter -> stem). This
mismatch meant some words would appear in the per-sentence word
lists but not in the document frequency table, causing a KeyError
during summarization.

Aligned both methods to use the same processing order:
normalize -> filter stop words -> stem.

Also fixed _get_all_words_in_doc to return raw words instead of
pre-stemmed words, since stemming is now handled consistently
in _get_all_content_words_in_doc.

Fixes miso-belica#176

@miso-belica (Owner) left a comment

Hello, thank you for the fix. In order to accept it, please add a test that reproduces the issue with the original code (red/green TDD approach). You mention that there are two cases in which this can happen, so please add two tests, one for each.

Also, please add an entry for this to the CHANGELOG.md file.

Thank you for your help 🙂

Owner

Please remove this file

Development

Successfully merging this pull request may close these issues.

sumbasic: KeyError
