Add faster version of KL-Sum using numpy #200

Open
mamei16 wants to merge 3 commits into miso-belica:main from mamei16:main

Conversation

@mamei16 mamei16 commented Dec 2, 2023

Hi, thanks for creating this project!
I was looking for a summarization method for a project I'm working on and really liked KL-Sum. However, it was too slow for my specific case. Therefore, I rewrote it using vectorized operations. It is not a perfect replacement for the original, since the results can slightly differ. However, all original tests still pass. Feel free to reject this PR if you'd rather keep the original only.

Cheers

Copilot AI left a comment

Pull request overview

This PR introduces a faster implementation of the KL-Sum summarization algorithm using NumPy vectorized operations. The new implementation (fast_kl.py) aims to improve performance while maintaining compatibility with the original KL-Sum algorithm, though the author notes that results may slightly differ.

Key Changes:

  • New vectorized KL-Sum implementation using NumPy arrays instead of dictionaries for word frequency calculations
  • Comprehensive test suite mirroring the original KL-Sum tests
  • Pre-computation and caching of sentence word frequencies to avoid repeated calculations
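The pre-computation described above can be sketched roughly as follows (a minimal illustration, not the PR's actual code; the vocabulary mapping and array shapes here are assumptions):

```python
import numpy as np

# Hypothetical mini-document: tokens already normalized
sentences = [["cat", "sat", "mat"], ["cat", "ran"]]

# Build a word-to-index mapping once for the whole document
vocabulary = sorted({w for s in sentences for w in s})
word_to_ind = {w: i for i, w in enumerate(vocabulary)}

# Pre-compute one frequency vector per sentence (the "caching" step)
sentence_freqs = np.zeros((len(sentences), len(vocabulary)))
for row, sentence in enumerate(sentences):
    for w in sentence:
        sentence_freqs[row, word_to_ind[w]] += 1

# Document-level frequencies fall out of a single vectorized sum
doc_freq = sentence_freqs.sum(axis=0) / sentence_freqs.sum()
```

With the per-sentence vectors cached, candidate-summary frequencies can be combined by array addition instead of re-walking word lists on every iteration.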

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 14 comments.

| File | Description |
| --- | --- |
| sumy/summarizers/fast_kl.py | New fast implementation of KLSummarizer using NumPy for vectorized operations and pre-computed word frequencies |
| tests/test_summarizers/test_fast_kl.py | Comprehensive test suite for the fast KL implementation, including tests for word frequency, joint frequency, KL divergence, and edge cases |

```python
def _ensure_dependencies_installed():
    if np is None:
        raise ValueError("Fast KL-Sum summarizer requires NumPy."
                         "Please, install it by command 'pip install numpy'.")
```

Copilot AI Dec 28, 2025

Missing space in error message. There should be a space after the period and before "Please". The message should read "Fast KL-Sum summarizer requires NumPy. Please, install it by command 'pip install numpy'."

Suggested change

```diff
-                         "Please, install it by command 'pip install numpy'.")
+                         " Please, install it by command 'pip install numpy'.")
```

Comment on lines +154 to +157

```python
best_sentence_word_list = self._get_all_words_in_doc([best_sentence])
# update summary length and word frequencies
summary_word_list_len += len(best_sentence_word_list)
summary_word_freq = self._compute_word_freq(best_sentence_word_list, summary_word_freq, word_to_ind)
```

Copilot AI Dec 28, 2025

Inconsistency in word filtering between sentence and summary frequencies. The pre-computed sentence word frequencies use normalized content words only (line 130, via _get_content_words_in_sentence), but the summary word frequency is updated with all raw words, including stop words (line 154, via _get_all_words_in_doc). This creates an inconsistency in the joint frequency calculation at lines 144-145, where normalized content-word frequencies are combined with all-word frequencies. The summary update should match the approach used for sentences.

Suggested change

```diff
-best_sentence_word_list = self._get_all_words_in_doc([best_sentence])
-# update summary length and word frequencies
-summary_word_list_len += len(best_sentence_word_list)
-summary_word_freq = self._compute_word_freq(best_sentence_word_list, summary_word_freq, word_to_ind)
+best_sentence_content_words = self._get_content_words_in_sentence(best_sentence)
+# update summary length and word frequencies using normalized content words
+summary_word_list_len += len(best_sentence_content_words)
+summary_word_freq = self._compute_word_freq(best_sentence_content_words, summary_word_freq, word_to_ind)
```

```python
    return word_freq

@staticmethod
def _compute_word_freq(list_of_words, word_freq_arr, word_to_ind):
```

Copilot AI Dec 28, 2025

Missing docstring for the _compute_word_freq method. This method differs from the original implementation's signature and behavior (takes arrays instead of returning a dict), so it should be documented to explain the parameters and the in-place modification of word_freq_arr.

Suggested change

```diff
-def _compute_word_freq(list_of_words, word_freq_arr, word_to_ind):
+def _compute_word_freq(list_of_words, word_freq_arr, word_to_ind):
+    """
+    Update a word-frequency array in place based on the given list of words.
+
+    Unlike :meth:`_old_compute_word_freq`, which builds and returns a dict,
+    this variant increments counts directly in an existing numeric array.
+
+    :param list_of_words: Iterable of word tokens whose frequencies should be counted.
+    :param word_freq_arr: Mutable numeric array (e.g., NumPy array) where each
+        position corresponds to a word index; this array is modified in place.
+    :param word_to_ind: Mapping from word token to integer index into
+        ``word_freq_arr``.
+    :return: The same ``word_freq_arr`` instance, after in-place updates.
+    """
```

Comment on lines +83 to +93

```python
@staticmethod
def _joint_freq(wc1, wc2, total_len):
    if total_len == 0:
        return np.zeros_like(wc1)
    joint_sum = wc1 + wc2
    return joint_sum / total_len

@staticmethod
def _kl_divergence(summary_freq, doc_freq, doc_missing_word_mask):
    summary_freq = np.where((summary_freq != 0.0) & doc_missing_word_mask, summary_freq, doc_freq)
    return (doc_freq * np.log(doc_freq / summary_freq)).sum()
```

Copilot AI Dec 28, 2025

Missing docstrings for _joint_freq and _kl_divergence methods. These are key algorithmic components that differ from the original implementation by using numpy arrays instead of dictionaries. They should be documented to explain parameters, especially the doc_missing_word_mask parameter which is not self-explanatory.

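As a rough illustration of what that masking does (a sketch with made-up numbers, not the PR's code): wherever the summary frequency is zero or the word is flagged by the mask, `np.where` substitutes the document frequency, so `log(doc_freq / doc_freq) == 0` and that term contributes nothing to the divergence.

```python
import numpy as np

doc_freq = np.array([0.5, 0.25, 0.25])
summary_freq = np.array([0.5, 0.0, 0.5])
doc_missing_word_mask = np.array([True, True, True])

# The second word has zero summary frequency, so it falls back to
# doc_freq and drops out of the sum instead of producing log(0)
adjusted = np.where((summary_freq != 0.0) & doc_missing_word_mask,
                    summary_freq, doc_freq)
kl = (doc_freq * np.log(doc_freq / adjusted)).sum()
```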
```python
    KL Divergence.
    Source: http://www.aclweb.org/anthology/N09-1041
    """
    MISSING_WORD_VAL = 42.0  # placeholder value used for missing words in document
```

Copilot AI Dec 28, 2025

Magic number used for MISSING_WORD_VAL. While the comment explains its purpose, the choice of 42.0 appears arbitrary and could potentially conflict with actual word frequencies. Consider using a more distinctive value like -1.0 or float('inf'), or adding a comment explaining why 42.0 was chosen specifically.

Suggested change

```diff
-MISSING_WORD_VAL = 42.0  # placeholder value used for missing words in document
+# Sentinel value used for words that are missing from the document's frequency
+# representation. We use a numeric value that is assumed to lie outside the
+# range of any valid word frequency observed in typical documents so that it
+# can be distinguished from real counts wherever it is used.
+MISSING_WORD_VAL = 42.0
```

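If a truly non-colliding sentinel were wanted, one option (a sketch, not part of the PR) is `np.inf`, which no finite frequency can equal, so membership can be tested directly:

```python
import numpy as np

MISSING_WORD_VAL = np.inf  # cannot collide with any finite frequency
freqs = np.array([0.1, MISSING_WORD_VAL, 0.3])

# True where the word actually occurs in the document
doc_missing_word_mask = ~np.isinf(freqs)
```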
Comment on lines +65 to +69

```python
def _get_all_content_words_in_doc(self, sentences):
    all_words = self._get_all_words_in_doc(sentences)
    normalized_words = self._normalize_words(all_words)
    normalized_content_words = self._filter_out_stop_words(normalized_words)
    return normalized_content_words
```

Copilot AI Dec 28, 2025

The order of operations differs from the original KL summarizer. In the original, stop words are filtered before normalization, but here normalization happens before filtering. This could cause differences in behavior if the normalization process affects whether a word matches a stop word. Consider matching the original order: filter stop words first, then normalize.

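A toy example of why the order can matter (hypothetical normalizer and stop list, not the project's actual ones): a token like "This" only matches a lowercase stop list after normalization, so the two orders keep different words.

```python
stop_words = {"this", "is"}          # hypothetical lowercase stop list
words = ["This", "IS", "running"]

def normalize(word):                 # hypothetical normalizer
    return word.lower()

# fast_kl order: normalize first, then filter stop words
normalize_then_filter = [w for w in map(normalize, words) if w not in stop_words]

# original order: filter stop words first, then normalize
filter_then_normalize = [normalize(w) for w in words if w not in stop_words]
```

Under these assumptions the first order removes "This" and "IS" while the second keeps them, since the raw capitalized forms never match the stop list.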
```python
def _compute_ratings(self, sentences):
    word_to_freq = self.compute_tf(sentences)

    vocabulary = set(self._get_all_words_in_doc(sentences)).union(word_to_freq.keys())
```

Copilot AI Dec 28, 2025

Vocabulary construction mixes normalized content words with all raw words. The vocabulary is built from the union of all words in the document (via _get_all_words_in_doc which returns raw, non-normalized words including stop words) and word_to_freq.keys() (which contains normalized content words without stop words). This inconsistency can lead to the vocabulary containing both normalized and non-normalized versions of the same word, potentially causing incorrect frequency calculations.

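A toy illustration of the mixing (hypothetical tokens; the normalization behavior is an assumption): if normalization maps "Cats" to "cat", the union contains both forms, and a raw stop word like "The" survives as well.

```python
raw_words = ["Cats", "The", "cat"]   # raw tokens, as _get_all_words_in_doc would return
word_to_freq = {"cat": 2}            # normalized content words only, stop words removed

vocabulary = set(raw_words).union(word_to_freq.keys())
# Both the raw "Cats" and the normalized "cat" end up in the vocabulary,
# alongside the raw stop word "The"
```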
Comment on lines +134 to +138

```python
iterations = 0
indices = list(range(len(sentences_as_words)))
# Removes one sentence per iteration by adding to summary
while len(indices) > 0:
    iterations += 1
```

Copilot AI Dec 28, 2025

Unused variable. The 'iterations' variable is initialized and incremented but never used. Consider removing it to improve code clarity.

Suggested change

```diff
-iterations = 0
 indices = list(range(len(sentences_as_words)))
 # Removes one sentence per iteration by adding to summary
 while len(indices) > 0:
-    iterations += 1
```

Comment on lines +52 to +57

```python
@staticmethod
def _old_compute_word_freq(list_of_words, d=None):
    word_freq = {} if d is None else d
    for w in list_of_words:
        word_freq[w] = word_freq.get(w, 0) + 1
    return word_freq
```

Copilot AI Dec 28, 2025

Confusing method name. The name "_old_compute_word_freq" suggests deprecated code, but it's actively used in compute_tf. Consider renaming to something more descriptive like "_compute_word_freq_dict" to indicate it returns a dictionary, distinguishing it from the array-based _compute_word_freq method.

```python
import pytest
import numpy as np

import sumy.summarizers.fast_kl as fast_kl_module
```

Copilot AI Dec 28, 2025

Module 'sumy.summarizers.fast_kl' is imported with both 'import' and 'import from'.
