Add faster version of KL-Sum using numpy#200
mamei16 wants to merge 3 commits into miso-belica:main from
Conversation
Pull request overview
This PR introduces a faster implementation of the KL-Sum summarization algorithm using NumPy vectorized operations. The new implementation (fast_kl.py) aims to improve performance while maintaining compatibility with the original KL-Sum algorithm, though the author notes that results may slightly differ.
Key Changes:
- New vectorized KL-Sum implementation using NumPy arrays instead of dictionaries for word frequency calculations
- Comprehensive test suite mirroring the original KL-Sum tests
- Pre-computation and caching of sentence word frequencies to avoid repeated calculations
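The core of the speed-up can be sketched as follows. This is an illustrative pattern only, not the PR's actual code: tokens are mapped to integer indices and counted with a single vectorized `np.add.at` call instead of a Python-level dict loop (`count_frequencies` and `vocab` are hypothetical names):

```python
import numpy as np

def count_frequencies(words, word_to_ind):
    # Accumulate counts into a vector; np.add.at correctly handles
    # repeated indices, replacing a per-word dict update loop.
    counts = np.zeros(len(word_to_ind), dtype=np.float64)
    indices = [word_to_ind[w] for w in words]
    np.add.at(counts, indices, 1.0)
    return counts

vocab = {"cat": 0, "sat": 1, "mat": 2}
freq = count_frequencies(["cat", "sat", "cat"], vocab)  # array([2., 1., 0.])
```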
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| sumy/summarizers/fast_kl.py | New fast implementation of KLSummarizer using NumPy for vectorized operations and pre-computed word frequencies |
| tests/test_summarizers/test_fast_kl.py | Comprehensive test suite for the fast KL implementation, including tests for word frequency, joint frequency, KL divergence, and edge cases |
```python
def _ensure_dependencies_installed():
    if np is None:
        raise ValueError("Fast KL-Sum summarizer requires NumPy."
                         "Please, install it by command 'pip install numpy'.")
```
Missing space in error message. There should be a space after the period and before "Please". The message should read "Fast KL-Sum summarizer requires NumPy. Please, install it by command 'pip install numpy'."
Suggested change:

```diff
-                         "Please, install it by command 'pip install numpy'.")
+                         " Please, install it by command 'pip install numpy'.")
```
```python
best_sentence_word_list = self._get_all_words_in_doc([best_sentence])
# update summary length and word frequencies
summary_word_list_len += len(best_sentence_word_list)
summary_word_freq = self._compute_word_freq(best_sentence_word_list, summary_word_freq, word_to_ind)
```
Inconsistency in word filtering between sentence and summary frequencies. The pre-computed sentence word frequencies use normalized content words only (line 130 via _get_content_words_in_sentence), but the summary word frequency is updated with all raw words including stop words (line 154 via _get_all_words_in_doc). This creates an inconsistency in the joint frequency calculation at line 144-145, where normalized content word frequencies are combined with all-word frequencies. This should match the approach used for sentences.
Suggested change:

```diff
-best_sentence_word_list = self._get_all_words_in_doc([best_sentence])
-# update summary length and word frequencies
-summary_word_list_len += len(best_sentence_word_list)
-summary_word_freq = self._compute_word_freq(best_sentence_word_list, summary_word_freq, word_to_ind)
+best_sentence_content_words = self._get_content_words_in_sentence(best_sentence)
+# update summary length and word frequencies using normalized content words
+summary_word_list_len += len(best_sentence_content_words)
+summary_word_freq = self._compute_word_freq(best_sentence_content_words, summary_word_freq, word_to_ind)
```
```python
    return word_freq

@staticmethod
def _compute_word_freq(list_of_words, word_freq_arr, word_to_ind):
```
Missing docstring for the _compute_word_freq method. This method differs from the original implementation's signature and behavior (takes arrays instead of returning a dict), so it should be documented to explain the parameters and the in-place modification of word_freq_arr.
Suggested change:

```diff
 def _compute_word_freq(list_of_words, word_freq_arr, word_to_ind):
+    """
+    Update a word-frequency array in place based on the given list of words.
+
+    Unlike :meth:`_old_compute_word_freq`, which builds and returns a dict,
+    this variant increments counts directly in an existing numeric array.
+
+    :param list_of_words: Iterable of word tokens whose frequencies should be counted.
+    :param word_freq_arr: Mutable numeric array (e.g., NumPy array) where each
+        position corresponds to a word index; this array is modified in place.
+    :param word_to_ind: Mapping from word token to integer index into
+        ``word_freq_arr``.
+    :return: The same ``word_freq_arr`` instance, after in-place updates.
+    """
```
```python
@staticmethod
def _joint_freq(wc1, wc2, total_len):
    if total_len == 0:
        return np.zeros_like(wc1)
    joint_sum = wc1 + wc2
    return joint_sum / total_len

@staticmethod
def _kl_divergence(summary_freq, doc_freq, doc_missing_word_mask):
    summary_freq = np.where((summary_freq != 0.0) & doc_missing_word_mask, summary_freq, doc_freq)
    return (doc_freq * np.log(doc_freq / summary_freq)).sum()
```
Missing docstrings for _joint_freq and _kl_divergence methods. These are key algorithmic components that differ from the original implementation by using numpy arrays instead of dictionaries. They should be documented to explain parameters, especially the doc_missing_word_mask parameter which is not self-explanatory.
```python
    KL Divergence.

    Source: http://www.aclweb.org/anthology/N09-1041
    """
    MISSING_WORD_VAL = 42.0  # placeholder value used for missing words in document
```
Magic number used for MISSING_WORD_VAL. While the comment explains its purpose, the choice of 42.0 appears arbitrary and could potentially conflict with actual word frequencies. Consider using a more distinctive value like -1.0 or float('inf'), or adding a comment explaining why 42.0 was chosen specifically.
Suggested change:

```diff
-    MISSING_WORD_VAL = 42.0  # placeholder value used for missing words in document
+    # Sentinel value used for words that are missing from the document's frequency
+    # representation. We use a numeric value that is assumed to lie outside the
+    # range of any valid word frequency observed in typical documents so that it
+    # can be distinguished from real counts wherever it is used.
+    MISSING_WORD_VAL = 42.0
```
```python
def _get_all_content_words_in_doc(self, sentences):
    all_words = self._get_all_words_in_doc(sentences)
    normalized_words = self._normalize_words(all_words)
    normalized_content_words = self._filter_out_stop_words(normalized_words)
    return normalized_content_words
```
The order of operations differs from the original KL summarizer. In the original, stop words are filtered before normalization, but here normalization happens before filtering. This could cause differences in behavior if the normalization process affects whether a word matches a stop word. Consider matching the original order: filter stop words first, then normalize.
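A toy example (with stand-in helpers, not sumy's real ones) shows how the two orders can diverge when stop-word filtering is case-sensitive but normalization lower-cases:

```python
# Hypothetical stand-ins for sumy's helpers, for illustration only.
STOP_WORDS = {"the"}

def normalize(words):
    return [w.lower() for w in words]

def filter_stop_words(words):
    # Case-sensitive on purpose: this is what makes the order matter.
    return [w for w in words if w not in STOP_WORDS]

def content_words_pr_order(words):
    # Order in fast_kl.py: normalize first, then filter.
    return filter_stop_words(normalize(words))

def content_words_original_order(words):
    # Order the review attributes to the original summarizer: filter, then normalize.
    return normalize(filter_stop_words(words))

tokens = ["The", "cat", "sat"]
# content_words_pr_order(tokens) drops "The"; the other order keeps it as "the".
```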
```python
def _compute_ratings(self, sentences):
    word_to_freq = self.compute_tf(sentences)

    vocabulary = set(self._get_all_words_in_doc(sentences)).union(word_to_freq.keys())
```
Vocabulary construction mixes normalized content words with all raw words. The vocabulary is built from the union of all words in the document (via _get_all_words_in_doc which returns raw, non-normalized words including stop words) and word_to_freq.keys() (which contains normalized content words without stop words). This inconsistency can lead to the vocabulary containing both normalized and non-normalized versions of the same word, potentially causing incorrect frequency calculations.
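A minimal sketch of the problem, using hypothetical helper functions rather than sumy's API:

```python
def all_words(sentences):
    # Raw tokens: stop words included, case preserved.
    return [w for s in sentences for w in s.split()]

def normalized_content_words(words, stop_words):
    # Lower-cased tokens with stop words removed.
    return [w.lower() for w in words if w.lower() not in stop_words]

sentences = ["The Cat sat"]
stop_words = {"the"}
word_to_freq = {w: 1 for w in normalized_content_words(all_words(sentences), stop_words)}

# Taking the union of raw and normalized tokens yields duplicate entries
# ("Cat" and "cat") that each get their own frequency slot:
mixed_vocabulary = set(all_words(sentences)) | set(word_to_freq)
# Building the vocabulary from normalized content words alone stays consistent:
consistent_vocabulary = set(word_to_freq)
```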
```python
iterations = 0
indices = list(range(len(sentences_as_words)))
# Removes one sentence per iteration by adding to summary
while len(indices) > 0:
    iterations += 1
```
Unused variable. The 'iterations' variable is initialized and incremented but never used. Consider removing it to improve code clarity.
Suggested change:

```diff
-iterations = 0
 indices = list(range(len(sentences_as_words)))
 # Removes one sentence per iteration by adding to summary
 while len(indices) > 0:
-    iterations += 1
```
```python
@staticmethod
def _old_compute_word_freq(list_of_words, d=None):
    word_freq = {} if d is None else d
    for w in list_of_words:
        word_freq[w] = word_freq.get(w, 0) + 1
    return word_freq
```
Confusing method name. The name "_old_compute_word_freq" suggests deprecated code, but it's actively used in compute_tf. Consider renaming to something more descriptive like "_compute_word_freq_dict" to indicate it returns a dictionary, distinguishing it from the array-based _compute_word_freq method.
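As a sketch, the dict-based variant could also be expressed with the standard library's `collections.Counter` under the suggested name. Note this copy-based version does not mutate a passed-in dict, unlike the quoted code:

```python
from collections import Counter

def compute_word_freq_dict(list_of_words, d=None):
    # Same counting behavior via Counter; name follows the reviewer's
    # suggested "_compute_word_freq_dict". Copies d instead of mutating it.
    word_freq = Counter() if d is None else Counter(d)
    word_freq.update(list_of_words)
    return dict(word_freq)
```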
```python
import pytest
import numpy as np

import sumy.summarizers.fast_kl as fast_kl_module
```
Module 'sumy.summarizers.fast_kl' is imported with both 'import' and 'import from'.
Hi, thanks for creating this project!
I was looking for a summarization method for a project I'm working on and really liked KL-Sum. However, it was too slow for my specific case. Therefore, I rewrote it using vectorized operations. It is not a perfect replacement for the original, since the results can slightly differ. However, all original tests still pass. Feel free to reject this PR if you'd rather keep the original only.
Cheers