@dguido dguido commented Aug 29, 2025

Summary

This PR implements user-specifiable comparison metrics to address issue #14, allowing users to control how detections are ranked and filtered.

Changes

New Features

  • 6 comparison metrics via new metrics.py module:

    • sum (default): Sum of both similarity scores
    • average: Average of both similarity scores
    • min: Minimum similarity (most conservative)
    • max: Maximum similarity (most aggressive)
    • token_overlap: Raw token overlap count
    • weighted: Weighted combination of similarities and token overlap
  • CLI enhancement: Added --metric argument to select comparison metric

  • Smart filtering: when --metric token_overlap is selected, --min-similarity is interpreted as a minimum token count rather than a normalized similarity
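
As a rough illustration of how the six metrics listed above relate to each other, here is a minimal sketch. It is not the code in metrics.py: the `Comparison` container, its field names, and the `score` helper are hypothetical, and the PR does not state the weights or normalization used by `weighted`, so the values below are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Comparison:
    """Hypothetical container for one file-to-file comparison."""
    similarity_a: float  # assumed: normalized similarity from one side, 0..1
    similarity_b: float  # assumed: normalized similarity from the other side, 0..1
    token_overlap: int   # raw count of overlapping tokens


def average(c: Comparison) -> float:
    return (c.similarity_a + c.similarity_b) / 2


def weighted(c: Comparison, similarity_weight: float = 0.7,
             token_weight: float = 0.3, token_scale: int = 1000) -> float:
    # ASSUMPTION: the weights and the token normalization are illustrative
    # placeholders; the PR does not specify the real formula.
    token_term = min(c.token_overlap / token_scale, 1.0)
    return similarity_weight * average(c) + token_weight * token_term


# Metric names match the --metric values described in the PR.
METRICS = {
    "sum": lambda c: c.similarity_a + c.similarity_b,
    "average": average,
    "min": lambda c: min(c.similarity_a, c.similarity_b),
    "max": lambda c: max(c.similarity_a, c.similarity_b),
    "token_overlap": lambda c: float(c.token_overlap),
    "weighted": weighted,
}


def score(metric_name: str, c: Comparison) -> float:
    """Look up a metric by name and apply it to one comparison."""
    return METRICS[metric_name](c)
```

Note that only token_overlap returns a value outside the 0..1 (or 0..2 for sum) range, which is why it needs its own threshold semantics and output formatting.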

Special Handling for Token Overlap Metric

  • Shows raw token counts (e.g., "3568") instead of normalized values (0.2566)
  • CSV header displays "Token Overlap" instead of "Similarity"
  • JSON output uses "token_overlap": 3568 field
  • Filters by minimum token count (e.g., --min-similarity 1000 keeps only detections with at least 1000 overlapping tokens)
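
The special cases above could be sketched roughly as follows. All function names here are hypothetical and are not taken from the PR's diff; the four-decimal formatting for normalized values is an assumption based on the "0.2566" example.

```python
def passes_threshold(score: float, min_similarity: float) -> bool:
    # The comparison is the same for every metric; only the unit of
    # --min-similarity changes (token count for token_overlap, a
    # normalized value otherwise).
    return score >= min_similarity


def format_score(metric: str, score: float) -> str:
    if metric == "token_overlap":
        return str(int(score))      # raw token count, e.g. "3568"
    return f"{score:.4f}"           # ASSUMPTION: four decimals, e.g. "0.2566"


def csv_header(metric: str) -> str:
    return "Token Overlap" if metric == "token_overlap" else "Similarity"


def json_field(metric: str, score: float) -> dict:
    # token_overlap gets its own integer field instead of "similarity"
    if metric == "token_overlap":
        return {"token_overlap": int(score)}
    return {"similarity": score}
```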

Bug Fix

Also includes a fix for the RepositoryCommit checkout issue discovered during testing (RepositoryCommit incorrectly passed its commit hash to the parent repository, causing checkout failures). Fixes #27

Usage Examples

# Use token overlap metric with minimum 1000 tokens
vendetect test_repo source_repo --metric token_overlap --min-similarity 1000

# Use conservative minimum similarity metric
vendetect test_repo source_repo --metric min --min-similarity 0.8

# Use weighted metric for balanced comparison
vendetect test_repo source_repo --metric weighted
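
For reviewers who want to see how such options are typically declared, the following argparse sketch accepts the invocations above. It is an assumption about structure, not vendetect's actual CLI code: the positional argument names, the `None` default for --min-similarity, and the help strings are all hypothetical.

```python
import argparse

METRIC_NAMES = ["sum", "average", "min", "max", "token_overlap", "weighted"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="vendetect")
    # Positional names are hypothetical; they mirror the usage examples.
    parser.add_argument("test_repo")
    parser.add_argument("source_repo")
    parser.add_argument(
        "--metric",
        choices=METRIC_NAMES,
        default="sum",  # the PR states that sum is the default
        help="comparison metric used to rank and filter detections",
    )
    parser.add_argument(
        "--min-similarity",
        type=float,
        default=None,  # ASSUMPTION: the real default is not stated in the PR
        help="minimum similarity, or minimum token count "
             "when --metric token_overlap is used",
    )
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Restricting --metric with `choices` makes argparse reject unknown metric names before any comparison work starts.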

Testing

All six metrics have been tested. They do not all agree on the top-ranked file:

  • sum, average, min, weighted → prioritize wake/compiler/exceptions.py
  • max → prioritizes wake/detectors/template.py
  • token_overlap → prioritizes wake/ir/types.py (3568 tokens)
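
To see why the metrics can disagree on the top result, consider this small sketch. The file names and numbers are made up for illustration and are not the measurements from the test run above.

```python
# Each entry: (file name, similarity_a, similarity_b, token_overlap).
# All values are hypothetical.
detections = [
    ("lopsided.py", 0.9, 0.1, 50),    # one side matches almost entirely
    ("balanced.py", 0.6, 0.5, 40),    # both sides match moderately
    ("large.py",    0.2, 0.3, 900),   # low ratios but many shared tokens
]

rankers = {
    "sum": lambda d: d[1] + d[2],
    "min": lambda d: min(d[1], d[2]),
    "max": lambda d: max(d[1], d[2]),
    "token_overlap": lambda d: d[3],
}


def top_file(metric: str) -> str:
    """Return the file name that the given metric ranks first."""
    return max(detections, key=rankers[metric])[0]


if __name__ == "__main__":
    for name in rankers:
        print(f"{name}: {top_file(name)}")
```

In this toy data, max favors the lopsided match, sum and min favor the balanced one, and token_overlap favors the large file regardless of its low ratios.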

Motivation

This addresses the TODO comment about making the comparison metric user-specifiable and resolves issues with too many false positives from whitespace overlap when using the default metric.

Fixes #14

dguido added 2 commits August 29, 2025 13:54
- Add new metrics module with 6 comparison metrics:
  - SumSimilarityMetric (default): sum of both similarity scores
  - AverageSimilarityMetric: average of both similarity scores
  - MinSimilarityMetric: minimum (most conservative)
  - MaxSimilarityMetric: maximum (most aggressive)
  - TokenOverlapMetric: raw token overlap count
  - WeightedSimilarityMetric: weighted combination

- Add --metric CLI argument to select comparison metric
- Update Detection class to use custom metrics
- Special handling for token_overlap metric:
  - Shows raw token counts instead of normalized values
  - CSV header shows 'Token Overlap' instead of 'Similarity'
  - JSON uses 'token_overlap' field instead of 'similarity'
  - --min-similarity interpreted as minimum token count

- Update help text to clarify min-similarity behavior

Also includes fix for RepositoryCommit checkout issue discovered during testing.

This allows users to control how detections are ranked and filtered,
addressing the issue of too many false positives from whitespace overlap
when using the default metric.

The VenDetector.__init__ method has 6 parameters, which is appropriate
for a configuration class with multiple optional settings.
@ESultanik ESultanik self-assigned this Aug 29, 2025
@ESultanik ESultanik added the enhancement ✨ New feature or request label Aug 29, 2025
Development

Successfully merging this pull request may close these issues.

  • Make the comparison metric user-specifiable
  • RepositoryCommit incorrectly passes commit hash to parent repository, causing checkout failures