feat: Make comparison metric user-specifiable #29

dguido · 2025-08-29T18:00:44Z

Summary

This PR implements user-specifiable comparison metrics to address issue #14, allowing users to control how detections are ranked and filtered.

Changes

New Features

- 6 comparison metrics via new metrics.py module:
  - sum (default): Sum of both similarity scores
  - average: Average of both similarity scores
  - min: Minimum similarity (most conservative)
  - max: Maximum similarity (most aggressive)
  - token_overlap: Raw token overlap count
  - weighted: Weighted combination of similarities and token overlap
- CLI enhancement: Added --metric argument to select comparison metric
- Smart filtering: --min-similarity now works as minimum token count for token_overlap metric
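
To make the six metric definitions above concrete, here is a minimal sketch of how they could be expressed as plain functions. It is not the code in metrics.py: the `Scores` container, its field names, the function names, and the weights and normalization used by `weighted_metric` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Scores:
    """Hypothetical container for the raw measurements of one detection."""

    similarity_a: float  # assumed: first of the two similarity scores
    similarity_b: float  # assumed: second of the two similarity scores
    token_overlap: int   # number of overlapping tokens


def sum_metric(s: Scores) -> float:
    return s.similarity_a + s.similarity_b


def average_metric(s: Scores) -> float:
    return (s.similarity_a + s.similarity_b) / 2


def min_metric(s: Scores) -> float:
    # Most conservative: both scores must be high for this to be high.
    return min(s.similarity_a, s.similarity_b)


def max_metric(s: Scores) -> float:
    # Most aggressive: one high score is enough.
    return max(s.similarity_a, s.similarity_b)


def token_overlap_metric(s: Scores) -> float:
    # Raw count, deliberately not normalized.
    return s.token_overlap


def weighted_metric(s: Scores, sim_weight: float = 0.7,
                    max_tokens: int = 10_000) -> float:
    # Assumed weighting: the PR does not state the actual weights or how
    # the token count is normalized, so both defaults here are invented.
    overlap = min(s.token_overlap / max_tokens, 1.0)
    return sim_weight * average_metric(s) + (1 - sim_weight) * overlap


# Lookup table keyed by the names accepted by --metric.
METRICS: Dict[str, Callable[[Scores], float]] = {
    "sum": sum_metric,
    "average": average_metric,
    "min": min_metric,
    "max": max_metric,
    "token_overlap": token_overlap_metric,
    "weighted": weighted_metric,
}
```

Because min and max look at opposite ends of the same pair of scores, a detection with one very high and one very low similarity ranks well under max and poorly under min, which is why the choice of metric can reorder results.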

Special Handling for Token Overlap Metric

- Shows raw token counts (e.g., "3568") instead of normalized values (0.2566)
- CSV header displays "Token Overlap" instead of "Similarity"
- JSON output uses "token_overlap": 3568 field
- Filtering by minimum token count (e.g., --min-similarity 1000 means at least 1000 tokens)
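
The special cases above could be handled by a few small helpers like the ones below. This is a sketch only: the function names are hypothetical, and the four-decimal formatting is inferred from the "0.2566" example rather than taken from the actual output code.

```python
def score_header(metric: str) -> str:
    """Column header for the CSV output."""
    return "Token Overlap" if metric == "token_overlap" else "Similarity"


def format_score(metric: str, value: float) -> str:
    """Render a score for display: raw count for token_overlap, else 4 decimals."""
    if metric == "token_overlap":
        return str(int(value))
    return f"{value:.4f}"  # assumed precision


def json_score_field(metric: str, value: float) -> dict:
    """Key/value pair to merge into a detection's JSON record."""
    if metric == "token_overlap":
        return {"token_overlap": int(value)}
    return {"similarity": value}


def passes_threshold(value: float, min_similarity: float) -> bool:
    """The threshold is in the metric's own units.

    For token_overlap that means a token count (e.g. 1000), for the other
    metrics a similarity value (e.g. 0.8).
    """
    return value >= min_similarity
```

Keeping the comparison itself identical for every metric means only the interpretation of --min-similarity changes, not the filtering logic.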

Bug Fix

Also includes a fix for the RepositoryCommit checkout issue discovered during testing (an incorrect commit hash was passed to the parent repository). Fixes #27

Usage Examples

# Use token overlap metric with minimum 1000 tokens
vendetect test_repo source_repo --metric token_overlap --min-similarity 1000

# Use conservative minimum similarity metric
vendetect test_repo source_repo --metric min --min-similarity 0.8

# Use weighted metric for balanced comparison
vendetect test_repo source_repo --metric weighted
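
For readers curious how an option like --metric is typically wired up, the sketch below uses argparse to accept the invocations shown above. It is not vendetect's actual CLI code: the parser name, the help strings, and the default for --min-similarity are assumptions. Only the option names, the metric names, and sum being the default metric come from this PR.

```python
import argparse

# Metric names as listed in this PR.
METRIC_NAMES = ("sum", "average", "min", "max", "token_overlap", "weighted")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="vendetect-sketch")  # hypothetical name
    parser.add_argument("test_repo")
    parser.add_argument("source_repo")
    parser.add_argument(
        "--metric",
        choices=METRIC_NAMES,
        default="sum",
        help="comparison metric used to rank and filter detections",
    )
    parser.add_argument(
        "--min-similarity",
        type=float,
        default=0.0,  # assumed default, not stated in the PR
        help="minimum score; a token count when --metric is token_overlap",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.metric, args.min_similarity)
```

Restricting --metric with `choices` makes argparse reject unknown metric names before any comparison work starts.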

Testing

All metrics have been tested and produce different rankings:

sum, average, min, weighted → prioritize wake/compiler/exceptions.py
max → prioritizes wake/detectors/template.py
token_overlap → prioritizes wake/ir/types.py (3568 tokens)

Motivation

This addresses the TODO comment about making the comparison metric user-specifiable and resolves issues with too many false positives from whitespace overlap when using the default metric.

Fixes #14

- Add new metrics module with 6 comparison metrics:
  - SumSimilarityMetric (default): sum of both similarity scores
  - AverageSimilarityMetric: average of both similarity scores
  - MinSimilarityMetric: minimum (most conservative)
  - MaxSimilarityMetric: maximum (most aggressive)
  - TokenOverlapMetric: raw token overlap count
  - WeightedSimilarityMetric: weighted combination
- Add --metric CLI argument to select comparison metric
- Update Detection class to use custom metrics
- Special handling for token_overlap metric:
  - Shows raw token counts instead of normalized values
  - CSV header shows 'Token Overlap' instead of 'Similarity'
  - JSON uses 'token_overlap' field instead of 'similarity'
  - --min-similarity interpreted as minimum token count
- Update help text to clarify min-similarity behavior

Also includes fix for RepositoryCommit checkout issue discovered during testing.

This allows users to control how detections are ranked and filtered, addressing the issue of too many false positives from whitespace overlap when using the default metric.


dguido added 2 commits August 29, 2025 13:54

fix: Add noqa comment for PLR0913 (too many arguments)

1ff06c5

The VenDetector.__init__ method has 6 parameters, which is appropriate for a configuration class with multiple optional settings.

ESultanik reviewed Aug 29, 2025

src/vendetect/repo.py (outdated, resolved)

ESultanik added 2 commits August 29, 2025 16:30

Fix repository commit regression

64bf195

Merge branch 'main' into feature/user-specifiable-comparison-metrics

c743864

ESultanik self-assigned this Aug 29, 2025

ESultanik added the enhancement ✨ New feature or request label Aug 29, 2025
