Add faster version of KL-Sum using numpy#200
mamei16 wants to merge 3 commits into miso-belica:main from
Conversation
Pull request overview
This PR introduces a faster implementation of the KL-Sum summarization algorithm using NumPy vectorized operations. The new implementation (fast_kl.py) aims to improve performance while maintaining compatibility with the original KL-Sum algorithm, though the author notes that results may slightly differ.
Key Changes:
- New vectorized KL-Sum implementation using NumPy arrays instead of dictionaries for word frequency calculations
- Comprehensive test suite mirroring the original KL-Sum tests
- Pre-computation and caching of sentence word frequencies to avoid repeated calculations
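The core of the speed-up can be sketched as follows. This is an illustrative pattern only, not the PR's actual code: tokens are mapped to integer indices and counted with a single vectorized `np.add.at` call instead of a Python-level dict loop (`count_frequencies` and `vocab` are hypothetical names):

```python
import numpy as np

def count_frequencies(words, word_to_ind):
    # Accumulate counts into a vector; np.add.at correctly handles
    # repeated indices, replacing a per-word dict update loop.
    counts = np.zeros(len(word_to_ind), dtype=np.float64)
    indices = [word_to_ind[w] for w in words]
    np.add.at(counts, indices, 1.0)
    return counts

vocab = {"cat": 0, "sat": 1, "mat": 2}
freq = count_frequencies(["cat", "sat", "cat"], vocab)  # array([2., 1., 0.])
```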
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| sumy/summarizers/fast_kl.py | New fast implementation of KLSummarizer using NumPy for vectorized operations and pre-computed word frequencies |
| tests/test_summarizers/test_fast_kl.py | Comprehensive test suite for the fast KL implementation, including tests for word frequency, joint frequency, KL divergence, and edge cases |
```python
def _ensure_dependencies_installed():
    if np is None:
        raise ValueError("Fast KL-Sum summarizer requires NumPy."
                         "Please, install it by command 'pip install numpy'.")
```
Missing space in error message. There should be a space after the period and before "Please". The message should read "Fast KL-Sum summarizer requires NumPy. Please, install it by command 'pip install numpy'."
Suggested change:

```diff
-                         "Please, install it by command 'pip install numpy'.")
+                         " Please, install it by command 'pip install numpy'.")
```
```python
best_sentence_word_list = self._get_all_words_in_doc([best_sentence])
# update summary length and word frequencies
summary_word_list_len += len(best_sentence_word_list)
summary_word_freq = self._compute_word_freq(best_sentence_word_list, summary_word_freq, word_to_ind)
```
Inconsistency in word filtering between sentence and summary frequencies. The pre-computed sentence word frequencies use normalized content words only (line 130 via _get_content_words_in_sentence), but the summary word frequency is updated with all raw words including stop words (line 154 via _get_all_words_in_doc). This creates an inconsistency in the joint frequency calculation at line 144-145, where normalized content word frequencies are combined with all-word frequencies. This should match the approach used for sentences.
Suggested change:

```diff
-best_sentence_word_list = self._get_all_words_in_doc([best_sentence])
-# update summary length and word frequencies
-summary_word_list_len += len(best_sentence_word_list)
-summary_word_freq = self._compute_word_freq(best_sentence_word_list, summary_word_freq, word_to_ind)
+best_sentence_content_words = self._get_content_words_in_sentence(best_sentence)
+# update summary length and word frequencies using normalized content words
+summary_word_list_len += len(best_sentence_content_words)
+summary_word_freq = self._compute_word_freq(best_sentence_content_words, summary_word_freq, word_to_ind)
```
```python
    return word_freq

@staticmethod
def _compute_word_freq(list_of_words, word_freq_arr, word_to_ind):
```
Missing docstring for the _compute_word_freq method. This method differs from the original implementation's signature and behavior (takes arrays instead of returning a dict), so it should be documented to explain the parameters and the in-place modification of word_freq_arr.
Suggested change:

```diff
 def _compute_word_freq(list_of_words, word_freq_arr, word_to_ind):
+    """
+    Update a word-frequency array in place based on the given list of words.
+
+    Unlike :meth:`_old_compute_word_freq`, which builds and returns a dict,
+    this variant increments counts directly in an existing numeric array.
+
+    :param list_of_words: Iterable of word tokens whose frequencies should be counted.
+    :param word_freq_arr: Mutable numeric array (e.g., NumPy array) where each
+        position corresponds to a word index; this array is modified in place.
+    :param word_to_ind: Mapping from word token to integer index into
+        ``word_freq_arr``.
+    :return: The same ``word_freq_arr`` instance, after in-place updates.
+    """
```
```python
@staticmethod
def _joint_freq(wc1, wc2, total_len):
    if total_len == 0:
        return np.zeros_like(wc1)
    joint_sum = wc1 + wc2
    return joint_sum / total_len

@staticmethod
def _kl_divergence(summary_freq, doc_freq, doc_missing_word_mask):
    summary_freq = np.where((summary_freq != 0.0) & doc_missing_word_mask, summary_freq, doc_freq)
    return (doc_freq * np.log(doc_freq / summary_freq)).sum()
```
Missing docstrings for _joint_freq and _kl_divergence methods. These are key algorithmic components that differ from the original implementation by using numpy arrays instead of dictionaries. They should be documented to explain parameters, especially the doc_missing_word_mask parameter which is not self-explanatory.
```python
    KL Divergence.

    Source: http://www.aclweb.org/anthology/N09-1041
    """
    MISSING_WORD_VAL = 42.0  # placeholder value used for missing words in document
```
Magic number used for MISSING_WORD_VAL. While the comment explains its purpose, the choice of 42.0 appears arbitrary and could potentially conflict with actual word frequencies. Consider using a more distinctive value like -1.0 or float('inf'), or adding a comment explaining why 42.0 was chosen specifically.
Suggested change:

```diff
-    MISSING_WORD_VAL = 42.0  # placeholder value used for missing words in document
+    # Sentinel value used for words that are missing from the document's frequency
+    # representation. We use a numeric value that is assumed to lie outside the
+    # range of any valid word frequency observed in typical documents so that it
+    # can be distinguished from real counts wherever it is used.
+    MISSING_WORD_VAL = 42.0
```
```python
def _get_all_content_words_in_doc(self, sentences):
    all_words = self._get_all_words_in_doc(sentences)
    normalized_words = self._normalize_words(all_words)
    normalized_content_words = self._filter_out_stop_words(normalized_words)
    return normalized_content_words
```
The order of operations differs from the original KL summarizer. In the original, stop words are filtered before normalization, but here normalization happens before filtering. This could cause differences in behavior if the normalization process affects whether a word matches a stop word. Consider matching the original order: filter stop words first, then normalize.
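A toy example (with stand-in helpers, not sumy's real ones) shows how the two orders can diverge when stop-word filtering is case-sensitive but normalization lower-cases:

```python
# Hypothetical stand-ins for sumy's helpers, for illustration only.
STOP_WORDS = {"the"}

def normalize(words):
    return [w.lower() for w in words]

def filter_stop_words(words):
    # Case-sensitive on purpose: this is what makes the order matter.
    return [w for w in words if w not in STOP_WORDS]

def content_words_pr_order(words):
    # Order in fast_kl.py: normalize first, then filter.
    return filter_stop_words(normalize(words))

def content_words_original_order(words):
    # Order the review attributes to the original summarizer: filter, then normalize.
    return normalize(filter_stop_words(words))

tokens = ["The", "cat", "sat"]
# content_words_pr_order(tokens) drops "The"; the other order keeps it as "the".
```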
```python
def _compute_ratings(self, sentences):
    word_to_freq = self.compute_tf(sentences)

    vocabulary = set(self._get_all_words_in_doc(sentences)).union(word_to_freq.keys())
```
Vocabulary construction mixes normalized content words with all raw words. The vocabulary is built from the union of all words in the document (via _get_all_words_in_doc which returns raw, non-normalized words including stop words) and word_to_freq.keys() (which contains normalized content words without stop words). This inconsistency can lead to the vocabulary containing both normalized and non-normalized versions of the same word, potentially causing incorrect frequency calculations.
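A minimal sketch of the problem, using hypothetical helper functions rather than sumy's API:

```python
def all_words(sentences):
    # Raw tokens: stop words included, case preserved.
    return [w for s in sentences for w in s.split()]

def normalized_content_words(words, stop_words):
    # Lower-cased tokens with stop words removed.
    return [w.lower() for w in words if w.lower() not in stop_words]

sentences = ["The Cat sat"]
stop_words = {"the"}
word_to_freq = {w: 1 for w in normalized_content_words(all_words(sentences), stop_words)}

# Taking the union of raw and normalized tokens yields duplicate entries
# ("Cat" and "cat") that each get their own frequency slot:
mixed_vocabulary = set(all_words(sentences)) | set(word_to_freq)
# Building the vocabulary from normalized content words alone stays consistent:
consistent_vocabulary = set(word_to_freq)
```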
```python
iterations = 0
indices = list(range(len(sentences_as_words)))
# Removes one sentence per iteration by adding to summary
while len(indices) > 0:
    iterations += 1
```
Unused variable. The 'iterations' variable is initialized and incremented but never used. Consider removing it to improve code clarity.
Suggested change:

```diff
-iterations = 0
 indices = list(range(len(sentences_as_words)))
 # Removes one sentence per iteration by adding to summary
 while len(indices) > 0:
-    iterations += 1
```
```python
@staticmethod
def _old_compute_word_freq(list_of_words, d=None):
    word_freq = {} if d is None else d
    for w in list_of_words:
        word_freq[w] = word_freq.get(w, 0) + 1
    return word_freq
```
Confusing method name. The name "_old_compute_word_freq" suggests deprecated code, but it's actively used in compute_tf. Consider renaming to something more descriptive like "_compute_word_freq_dict" to indicate it returns a dictionary, distinguishing it from the array-based _compute_word_freq method.
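As a sketch, the dict-based variant could also be expressed with the standard library's `collections.Counter` under the suggested name. Note this copy-based version does not mutate a passed-in dict, unlike the quoted code:

```python
from collections import Counter

def compute_word_freq_dict(list_of_words, d=None):
    # Same counting behavior via Counter; name follows the reviewer's
    # suggested "_compute_word_freq_dict". Copies d instead of mutating it.
    word_freq = Counter() if d is None else Counter(d)
    word_freq.update(list_of_words)
    return dict(word_freq)
```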
```python
import pytest
import numpy as np

import sumy.summarizers.fast_kl as fast_kl_module
```
Module 'sumy.summarizers.fast_kl' is imported with both 'import' and 'import from'.
Hi, thanks for creating this project!
I was looking for a summarization method for a project I'm working on and really liked KL-Sum. However, it was too slow for my specific case. Therefore, I rewrote it using vectorized operations. It is not a perfect replacement for the original, since the results can slightly differ. However, all original tests still pass. Feel free to reject this PR if you'd rather keep the original only.
Cheers