
Conversation

@radoslavralev (Collaborator) commented Dec 8, 2025

Note

Introduces reranking and a unified, provider-agnostic embedding stack, plus end-to-end benchmarking utilities.

  • Adds optional cross-encoder reranking (--cross_encoder_model, --rerank_k) across the local and Redis workflows; retrieval is upgraded to top-k candidates with cross-encoder selection (see the reranking sketch below)
  • Refactors embeddings to provider-agnostic interfaces: new EmbeddingModel, SimilarityMatcher, and embedding_providers (HuggingFace, OpenAI, Gemini); updates RedisVectorIndex to use providers and correct vector handling
  • Normalizes embeddings and fixes the dtype in the Redis/data paths; adjusts the threshold sweep to use a dynamic maximum
  • New benchmarking pipeline (run_benchmark.py, run_benchmark.sh) with bootstrap runs, model/CE sweeps, and organized outputs
  • Adds analysis/plot scripts: a precision@1 table and plots, multi-model precision-vs-CHR curves with AUC bars, and precision-over-threshold; expands run_chr_analysis.sh
  • Updates dependencies (google-genai, openai, peft, radon, pylint config) and removes large dataset CSVs from the repo

Written by Cursor Bugbot for commit fbfe4e7. This will update automatically on new commits.
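
For orientation, a minimal sketch of the retrieve-then-rerank flow this PR wires in. The model names, cache contents, and k value below are illustrative assumptions, not the repository's defaults.

import numpy as np
from sentence_transformers import CrossEncoder, SentenceTransformer

# Stage 1 model (bi-encoder) retrieves candidates; stage 2 model (cross-encoder) reranks them.
bi_encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

cache = ["how do I reset my password", "when does my billing cycle start", "delete my account"]
query = "I forgot my password, how can I change it?"

# Retrieve the top-k cache entries by cosine similarity of unit-normalized embeddings.
cache_emb = bi_encoder.encode(cache, normalize_embeddings=True)
query_emb = bi_encoder.encode([query], normalize_embeddings=True)
sims = (query_emb @ cache_emb.T)[0]
k = 2
top_k = np.argsort(-sims)[:k]

# Rerank the (query, candidate) pairs with the cross-encoder and keep the best one.
pairs = [(query, cache[i]) for i in top_k]
ce_scores = cross_encoder.predict(pairs)
best = int(top_k[int(np.argmax(ce_scores))])
print("best match:", cache[best], "| cross-encoder score:", float(ce_scores.max()))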

@radoslavralev radoslavralev self-assigned this Dec 8, 2025
@radoslavralev radoslavralev added the enhancement New feature or request label Dec 8, 2025
@radoslavralev radoslavralev marked this pull request as ready for review December 8, 2025 09:16

Copilot AI left a comment


Pull request overview

This PR adds optional cross-encoder reranking functionality to improve retrieval quality, implements top-k candidate retrieval in embedding matching, normalizes embeddings before Redis operations, enhances the Redis index handling with dimension probing, and improves visualization with dual-subplot charts. The changes span the core matching logic, benchmarking infrastructure, and result visualization.

Key Changes

  • Adds cross-encoder reranking with configurable top-k retrieval for both Redis-based and standard neural embedding matching
  • Implements top-k retrieval logic in blockwise embedding matching with support for variable k values
  • Normalizes embeddings to float32 and unit length before Redis operations, with dimension probing to handle models whose configured dimension is incorrect (a minimal normalization sketch follows below)
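
As context for the normalization point above, a minimal numpy sketch of the float32 / unit-length step. The helper name is hypothetical, and the commented probing line only illustrates the general idea of trusting an encoded vector's length over the model config.

import numpy as np

def normalize_for_vector_index(embeddings) -> np.ndarray:
    """Hypothetical helper: cast to float32 and scale each row to unit L2 norm,
    so inner-product search in the vector index is equivalent to cosine similarity."""
    emb = np.asarray(embeddings, dtype=np.float32)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # avoid dividing all-zero rows by zero
    return emb / norms

# Dimension probing, in spirit: encode one sample and measure the vector,
# instead of trusting a possibly wrong config value.
# probed_dim = len(model.encode(["dimension probe"])[0])

vectors = np.random.default_rng(0).normal(size=(4, 384))
unit = normalize_for_vector_index(vectors)
assert unit.dtype == np.float32
assert np.allclose(np.linalg.norm(unit, axis=1), 1.0, atol=1e-5)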

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 16 comments.

Summary per file:

  • src/customer_analysis/query_engine.py: Adds embedding dimension probing, enables trust_remote_code, and always overwrites the Redis index on initialization
  • src/customer_analysis/embedding_interface.py: Implements top-k retrieval in the blockwise matching methods with support for variable k; updates dimension inference to probe models first
  • src/customer_analysis/data_processing.py: Integrates cross-encoder reranking in both run_matching and run_matching_redis; adds embedding normalization before Redis operations
  • scripts/plot_multiple_precision_vs_cache_hit_ratio.py: Creates a dual-subplot visualization with precision-CHR curves and an AUC bar chart; adds numpy version compatibility for trapezoid/trapz
  • run_benchmark.sh: Provides an example shell script for running benchmarks with cross-encoder models
  • run_benchmark.py: Extends the benchmark loop to iterate over cross-encoder models and include them in output paths
  • evaluation.py: Adds command-line arguments for the cross-encoder model and the rerank_k parameter
Comments suppressed due to low confidence (2)

src/customer_analysis/embedding_interface.py:414

  • The docstring for calculate_best_matches_with_cache_large_dataset does not document the new k parameter or how it affects the return value shapes. The function returns arrays with shape (num_sentences,) when k=1 but (num_sentences, k) when k>1. This should be documented to avoid confusion.
        """Large-dataset variant: find best cache match for each sentence using memmaps.

        Writes two memmaps (rows for sentences, cols for cache), normalised, and
        performs blockwise dot-products. If `sentence_offset` is provided and the
        cache corresponds to the same corpus, the self-similarity diagonal is masked.
        """

src/customer_analysis/embedding_interface.py:490

  • The docstrings for calculate_best_matches_with_cache and calculate_best_matches_from_embeddings_with_cache do not document the new k parameter or how it affects return value shapes. When k=1, arrays have shape (N,), but when k>1, they have shape (N, k). This should be documented.
        """
        Calculate the best similarity match for each sentence against all other
        sentences using a neural embedding model.
        """


Comment on lines 300 to 333
else:
    # Top-k logic
    # If columns in this block < k, take all valid
    curr_block_size = col_end - col_start
    if curr_block_size <= k:
        top_k_in_block_idx = np.argsort(-sim, axis=1)  # Sort all
        top_k_in_block_val = np.take_along_axis(sim, top_k_in_block_idx, axis=1)
        # Might have fewer than k if block is small
    else:
        # Use argpartition for top k
        # We want largest k
        part_idx = np.argpartition(-sim, k, axis=1)[:, :k]
        top_k_in_block_val = np.take_along_axis(sim, part_idx, axis=1)

        # Sort them to have ordered top-k (optional but good for merging)
        sorted_sub_idx = np.argsort(-top_k_in_block_val, axis=1)
        top_k_in_block_val = np.take_along_axis(top_k_in_block_val, sorted_sub_idx, axis=1)
        top_k_in_block_idx = np.take_along_axis(part_idx, sorted_sub_idx, axis=1)

    # Merge with accumulated bests
    # chunk_best_scores: (batch, k)
    # top_k_in_block_val: (batch, min(block, k))

    # Adjust indices to global column indices
    top_k_in_block_idx_global = top_k_in_block_idx + col_start

    combined_vals = np.concatenate([chunk_best_scores, top_k_in_block_val], axis=1)
    combined_idxs = np.concatenate([chunk_best_indices, top_k_in_block_idx_global], axis=1)

    # Find top k in combined
    best_combined_args = np.argsort(-combined_vals, axis=1)[:, :k]

    chunk_best_scores = np.take_along_axis(combined_vals, best_combined_args, axis=1)
    chunk_best_indices = np.take_along_axis(combined_idxs, best_combined_args, axis=1)

Copilot AI Dec 9, 2025


The top-k retrieval logic in blockwise matching lacks test coverage. Since the repository has comprehensive tests and this is a significant new feature that changes the shape of return values and introduces complex merging logic, it should have tests covering: 1) k > 1 with various cache sizes, 2) edge cases where block size < k, 3) self-similarity masking with k > 1, and 4) correct sorting and merging across blocks.
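
As an illustration of what such a test could assert, a standalone numpy re-implementation of the merge (hypothetical function name, not wired to the repository's API) can be checked against a single global argsort, including blocks narrower than k:

import numpy as np

def blockwise_topk(sim: np.ndarray, k: int, col_block: int):
    """Standalone version of the merge above: scan columns in blocks, keep the
    running top-k per row, and merge each block into the accumulator."""
    n_rows, n_cols = sim.shape
    best_scores = np.full((n_rows, k), -np.inf)
    best_indices = np.zeros((n_rows, k), dtype=np.int64)
    for col_start in range(0, n_cols, col_block):
        block = sim[:, col_start:col_start + col_block]
        take = min(k, block.shape[1])  # block may be narrower than k
        idx = np.argsort(-block, axis=1)[:, :take]
        vals = np.take_along_axis(block, idx, axis=1)
        comb_vals = np.concatenate([best_scores, vals], axis=1)
        comb_idxs = np.concatenate([best_indices, idx + col_start], axis=1)
        order = np.argsort(-comb_vals, axis=1)[:, :k]
        best_scores = np.take_along_axis(comb_vals, order, axis=1)
        best_indices = np.take_along_axis(comb_idxs, order, axis=1)
    return best_scores, best_indices

rng = np.random.default_rng(0)
sim = rng.normal(size=(5, 17))
for k in (1, 3):
    for col_block in (2, 4, 17):  # includes blocks smaller than k
        scores, indices = blockwise_topk(sim, k, col_block)
        expected_idx = np.argsort(-sim, axis=1)[:, :k]
        assert np.array_equal(indices, expected_idx)
        assert np.allclose(scores, np.take_along_axis(sim, expected_idx, axis=1))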


[[tool.uv.index]]
url = "https://artifactory.dev.redislabs.com/artifactory/api/pypi/cloud-pypi-local/simple"
url = ""

Empty UV index URL breaks package installation

High Severity

The [[tool.uv.index]] URL is set to an empty string "" instead of being removed entirely. An empty URL is not a valid package index URL and will cause uv to fail when attempting to resolve or install packages from this index. The section should either be removed completely or contain a valid URL.



if all_pairs:
    all_scores = cross_encoder.predict(all_pairs, batch_size=128, show_progress_bar=True)
    all_scores = all_scores.reshape(N, k)

Cross-encoder reranking creates invalid pairs when cache is small

Medium Severity

When using cross-encoder reranking with k > 1 and the cache has fewer entries than k, the code iterates over all k indices in best_indices[i], but some of these indices are initialized to 0 with scores of -inf (from init_results). This causes cache_list[idx] to repeatedly access cache_list[0] for invalid entries, creating spurious query-cache pairs. The cross-encoder then scores these invalid pairs, which could lead to incorrect match selection.
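
One way to avoid scoring placeholder entries is to keep only candidates with finite retrieval scores before building the pairs. A per-query sketch with assumed variable names (best_scores/best_indices of shape (N, k), -inf marking unfilled slots), not the repository's exact code:

import numpy as np

def rerank_valid_pairs(queries, cache_list, best_scores, best_indices, cross_encoder):
    reranked = []
    for i, query in enumerate(queries):
        valid = np.isfinite(best_scores[i])  # drop -inf placeholder slots
        candidates = best_indices[i][valid]
        if candidates.size == 0:
            reranked.append(None)  # nothing retrievable for this query
            continue
        pairs = [(query, cache_list[int(idx)]) for idx in candidates]
        ce_scores = cross_encoder.predict(pairs)
        reranked.append(int(candidates[int(np.argmax(ce_scores))]))
    return reranked

Cross-query batching can be preserved by collecting only the valid pairs together with an offset map; the per-query form above just keeps the sketch short.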


            emb_path, len(sentences), embedding_dim, row_block, col_block, dtype, early_stop
        )

        self._cleanup_memmap(created, memmap_dir, emb_path)

Missing try/finally for memmap cleanup causes temp file leak

Medium Severity

In _calculate_best_matches_large_dataset and calculate_best_matches_with_cache_large_dataset, temporary memmap files are created via _write_embeddings_memmap, but the cleanup code that removes these files is not wrapped in a try/finally block. If an exception is raised during _compute_blockwise_best_matches or _compute_blockwise_best_matches_two_sets, the cleanup code is never executed and temporary files accumulate on disk. Over time with repeated failures, this could fill up disk space.
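
The usual shape of that fix is a try/finally around the blockwise computation; a generic, self-contained sketch of the pattern (not the repository's helpers):

import os
import tempfile

def blockwise_with_cleanup(compute):
    """Create the temporary memmap file, run the computation, and guarantee the
    file is removed even when the computation raises."""
    fd, emb_path = tempfile.mkstemp(suffix=".memmap")
    os.close(fd)
    try:
        return compute(emb_path)
    finally:
        if os.path.exists(emb_path):
            os.remove(emb_path)

# Example: the cleanup runs whether or not the callable raises.
print(blockwise_with_cleanup(lambda path: f"computed using {path}"))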


