Add faster version of KL-Sum using numpy #200
`sumy/summarizers/fast_kl.py` (new file, +162 lines; excerpt):

```python
# -*- coding: utf-8 -*-

from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals

try:
    import numpy as np
except ImportError:
    numpy = None


from sumy.summarizers._summarizer import AbstractSummarizer


class KLSummarizer(AbstractSummarizer):
    """
    Method that greedily adds sentences to a summary so long as it decreases the
    KL Divergence.
    Source: http://www.aclweb.org/anthology/N09-1041
    """
    MISSING_WORD_VAL = 42.0  # placeholder value used for missing words in document
```
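For orientation, the greedy selection loop the docstring describes can be sketched in plain Python. This is a simplified illustration, not the PR's implementation: the divergence direction, the smoothing constant, and whitespace tokenization are all assumptions made here.

```python
import math

def unigram_dist(words):
    # Maximum-likelihood unigram distribution over a list of word tokens
    freq = {}
    for w in words:
        freq[w] = freq.get(w, 0) + 1
    total = float(len(words))
    return {w: c / total for w, c in freq.items()}

def kl_divergence(p, q, eps=1e-12):
    # D(p || q); eps smooths words that are absent from q
    return sum(pw * math.log(pw / q.get(w, eps)) for w, pw in p.items())

def greedy_kl_summary(sentences, doc_words, max_sentences):
    # Greedily pick the sentence whose addition keeps the summary's
    # word distribution closest to the document's distribution.
    doc_dist = unigram_dist(doc_words)
    summary, summary_words = [], []
    candidates = list(sentences)
    while candidates and len(summary) < max_sentences:
        best = min(
            candidates,
            key=lambda s: kl_divergence(doc_dist, unigram_dist(summary_words + s.split())),
        )
        summary.append(best)
        summary_words += best.split()
        candidates.remove(best)
    return summary
```

The numpy version in this PR vectorizes exactly this inner loop: the per-sentence distributions become rows of a frequency array instead of dicts.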
Suggested change (document the `MISSING_WORD_VAL` sentinel):

```diff
-    MISSING_WORD_VAL = 42.0  # placeholder value used for missing words in document
+    # Sentinel value used for words that are missing from the document's frequency
+    # representation. We use a numeric value that is assumed to lie outside the
+    # range of any valid word frequency observed in typical documents so that it
+    # can be distinguished from real counts wherever it is used.
+    MISSING_WORD_VAL = 42.0
```
Copilot (AI), Dec 28, 2025:

Missing space in error message. There should be a space after the period and before "Please". The message should read "Fast KL-Sum summarizer requires NumPy. Please, install it by command 'pip install numpy'."

```diff
-                             "Please, install it by command 'pip install numpy'.")
+                             " Please, install it by command 'pip install numpy'.")
```
Copilot (AI), Dec 28, 2025:

Confusing method name. The name `_old_compute_word_freq` suggests deprecated code, but it's actively used in `compute_tf`. Consider renaming it to something more descriptive, such as `_compute_word_freq_dict`, to indicate that it returns a dictionary and to distinguish it from the array-based `_compute_word_freq` method.
Copilot (AI), Dec 28, 2025:

Missing docstring for the `_compute_word_freq` method. This method differs from the original implementation's signature and behavior (it takes arrays instead of returning a dict), so it should be documented to explain the parameters and the in-place modification of `word_freq_arr`.

```diff
 def _compute_word_freq(list_of_words, word_freq_arr, word_to_ind):
+    """
+    Update a word-frequency array in place based on the given list of words.
+
+    Unlike :meth:`_old_compute_word_freq`, which builds and returns a dict,
+    this variant increments counts directly in an existing numeric array.
+
+    :param list_of_words: Iterable of word tokens whose frequencies should be counted.
+    :param word_freq_arr: Mutable numeric array (e.g., NumPy array) where each
+        position corresponds to a word index; this array is modified in place.
+    :param word_to_ind: Mapping from word token to integer index into
+        ``word_freq_arr``.
+    :return: The same ``word_freq_arr`` instance, after in-place updates.
+    """
```
Copilot (AI), Dec 28, 2025:

The order of operations differs from the original KL summarizer. In the original, stop words are filtered before normalization, but here normalization happens before filtering. This could cause differences in behavior if the normalization process affects whether a word matches a stop word. Consider matching the original order: filter stop words first, then normalize.
Copilot (AI), Dec 28, 2025:

Missing docstrings for the `_joint_freq` and `_kl_divergence` methods. These are key algorithmic components that differ from the original implementation by using numpy arrays instead of dictionaries. They should be documented to explain their parameters, especially the `doc_missing_word_mask` parameter, which is not self-explanatory.
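One plausible shape for these two undocumented helpers, under the assumption that `doc_missing_word_mask` flags vocabulary positions holding the `MISSING_WORD_VAL` sentinel (a hedged sketch, not the PR's code):

```python
import numpy as np

def joint_freq(sent_freq, summary_freq, total_len):
    # Merge sentence and current-summary counts into one normalized distribution
    return (sent_freq + summary_freq) / float(total_len)

def kl_divergence(summary_dist, doc_dist, doc_missing_word_mask, eps=1e-12):
    # Positions flagged by the mask are excluded, so the sentinel value for
    # words missing from the document never contributes to the divergence.
    p = np.clip(summary_dist[~doc_missing_word_mask], eps, None)
    q = np.clip(doc_dist[~doc_missing_word_mask], eps, None)
    return float(np.sum(p * np.log(p / q)))
```

The masking step is why the parameter deserves a docstring: without it, a reader cannot tell whether masked entries are sentinel placeholders or legitimate zero counts.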
Copilot (AI), Dec 28, 2025:

Inconsistent naming in comment. The comment uses lowercase "kl_divergence", which appears to refer to the concept of KL divergence rather than the method name. For consistency with other parts of the codebase (e.g., the docstring's "KL Divergence"), consider using "KL divergence" or "KL-divergence".

```diff
-        the best sentence is the one with the smallest kl_divergence
+        the best sentence is the one with the smallest KL divergence
```
Copilot (AI), Dec 28, 2025:

Vocabulary construction mixes normalized content words with all raw words. The vocabulary is built from the union of all words in the document (via `_get_all_words_in_doc`, which returns raw, non-normalized words including stop words) and `word_to_freq.keys()` (which contains normalized content words without stop words). This inconsistency can lead to the vocabulary containing both the normalized and non-normalized versions of the same word, potentially causing incorrect frequency calculations.
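The concern is easy to reproduce with a toy example. The `normalize` function below is a hypothetical stand-in for whatever stemming or lowercasing the summarizer applies; the point is only that the raw and normalized forms of one word both land in the vocabulary.

```python
def normalize(word):
    # Hypothetical stand-in for the summarizer's stemming/lowercasing step
    return word.lower().rstrip("s")

raw_words = ["Cats", "chase", "the", "mice"]   # all words, unnormalized
stop_words = {"the"}
content_words = [normalize(w) for w in raw_words if w.lower() not in stop_words]

# Union of raw and normalized words: "Cats" and "cat" both survive, so
# counts for the same surface word are split across two vocabulary entries.
vocab = set(raw_words) | set(content_words)
```

Here `vocab` contains both `"Cats"` and `"cat"` (and the stop word `"the"`), which is exactly the double-counting hazard the review describes.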
Copilot (AI), Dec 28, 2025:

Unused variable. The `iterations` variable is initialized and incremented but never used. Consider removing it to improve code clarity.

```diff
-        iterations = 0
         indices = list(range(len(sentences_as_words)))
         # Removes one sentence per iteration by adding to summary
         while len(indices) > 0:
-            iterations += 1
```
Copilot (AI), Dec 28, 2025:

Inconsistent word counting in the total length calculation. The total length passed to `_joint_freq` combines `i_to_sent_len[i]` (a count of content words only) with `summary_word_list_len` (a count of all words including stop words). These should both count the same type of words for an accurate joint frequency computation.
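The effect of mixing the two counting schemes can be seen in a small numeric example (the numbers are illustrative, not taken from the PR):

```python
import numpy as np

# Sentence side: counts over content words only (3 tokens in the arrays)
sent_content_freq = np.array([1.0, 1.0, 1.0])
sent_content_len = 3

# Summary side: length counted over ALL words, including 2 stop-word tokens
# that have no slot in the frequency arrays
summary_freq = np.array([2.0, 1.0, 0.0])
summary_all_words_len = 5

total_len = sent_content_len + summary_all_words_len  # mixes counting schemes
joint = (sent_content_freq + summary_freq) / total_len
# joint.sum() is 0.75, not 1.0: the "distribution" leaks probability mass
# because the denominator counts tokens the numerator never sees.
```

A consistent scheme (content words on both sides, or all words on both sides) keeps the joint distribution properly normalized.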
Copilot (AI), Dec 28, 2025:

Inconsistency in word filtering between sentence and summary frequencies. The pre-computed sentence word frequencies use normalized content words only (via `_get_content_words_in_sentence`), but the summary word frequency is updated with all raw words including stop words (via `_get_all_words_in_doc`). This creates an inconsistency in the joint frequency calculation, where normalized content-word frequencies are combined with all-word frequencies. This should match the approach used for sentences.

```diff
-        best_sentence_word_list = self._get_all_words_in_doc([best_sentence])
-        # update summary length and word frequencies
-        summary_word_list_len += len(best_sentence_word_list)
-        summary_word_freq = self._compute_word_freq(best_sentence_word_list, summary_word_freq, word_to_ind)
+        best_sentence_content_words = self._get_content_words_in_sentence(best_sentence)
+        # update summary length and word frequencies using normalized content words
+        summary_word_list_len += len(best_sentence_content_words)
+        summary_word_freq = self._compute_word_freq(best_sentence_content_words, summary_word_freq, word_to_ind)
```
Test module (new file, +207 lines; excerpt):

```python
# -*- coding: utf-8 -*-

from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals

import pytest
import numpy as np

import sumy.summarizers.fast_kl as fast_kl_module

from sumy.models.dom._sentence import Sentence
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.fast_kl import KLSummarizer
from ..utils import build_document


@pytest.fixture
def empty_stop_words():
    return []


@pytest.fixture
def stop_words():
    return ["the", "and", "i"]


@pytest.fixture
def summarizer(stop_words):
    summarizer = KLSummarizer()
    summarizer.stop_words = stop_words
    return summarizer


def test_numpy_not_installed():
    summarizer = KLSummarizer()

    numpy = fast_kl_module.np
    fast_kl_module.np = None

    with pytest.raises(ValueError):
        summarizer(build_document(), 10)

    fast_kl_module.np = numpy


def test_empty_document(summarizer):
    document = build_document()
    returned = summarizer(document, 10)

    assert len(returned) == 0


def test_single_sentence(summarizer):
    s = Sentence("I am one slightly longer sentence.", Tokenizer("english"))
    document = build_document([s])

    returned = summarizer(document, 10)

    assert len(returned) == 1


def test_compute_word_freq(summarizer):
    words = ["one", "two", "three", "four"]
    word_freq = np.zeros(len(words))
    word_to_ind = {word: index for index, word in enumerate(words)}
    freq = summarizer._compute_word_freq(words, word_freq, word_to_ind)

    assert np.all(freq == 1)

    words = ["one", "one", "two", "two"]
    word_freq = np.zeros(len(set(words)))
    word_to_ind = {word: index for index, word in enumerate(set(words))}
    freq = summarizer._compute_word_freq(words, word_freq, word_to_ind)

    assert np.all(freq == 2)


def test_joint_freq(summarizer):
    w1 = ["one", "two", "three", "four"]
    w2 = ["one", "two", "three", "four"]

    word_freq1 = np.zeros(len(w1))
    word_freq2 = np.zeros_like(word_freq1)
    word_to_ind = {word: index for index, word in enumerate(w1)}
    freq1 = summarizer._compute_word_freq(w1, word_freq1, word_to_ind)
    freq2 = summarizer._compute_word_freq(w1, word_freq2, word_to_ind)
```
Suggested change (`freq2` should be computed from `w2`, not `w1`):

```diff
-    freq2 = summarizer._compute_word_freq(w1, word_freq2, word_to_ind)
+    freq2 = summarizer._compute_word_freq(w2, word_freq2, word_to_ind)
```
Variable naming inconsistency: The variable is named 'numpy' but the import statement uses 'np'. Since the import uses 'as np', this line should be 'np = None' instead of 'numpy = None'. This inconsistency will cause issues when checking if numpy is installed.