feat: Make comparison metric user-specifiable #29
Summary
This PR implements user-specifiable comparison metrics to address issue #14, allowing users to control how detections are ranked and filtered.
Changes
New Features
- 6 comparison metrics via a new `metrics.py` module (sketched below):
  - `sum` (default): Sum of both similarity scores
  - `average`: Average of both similarity scores
  - `min`: Minimum similarity (most conservative)
  - `max`: Maximum similarity (most aggressive)
  - `token_overlap`: Raw token overlap count
  - `weighted`: Weighted combination of similarities and token overlap
- CLI enhancement: Added a `--metric` argument to select the comparison metric
- Smart filtering: `--min-similarity` now works as a minimum token count for the `token_overlap` metric
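As a rough illustration of how such a metrics module could be laid out, here is a minimal sketch; the function names, signatures, and the weights in `weighted` are assumptions, not this PR's actual code:

```python
# Hypothetical sketch of a metrics.py-style module; names, signatures and
# the weights used in "weighted" are assumptions, not this PR's code.
from typing import Callable, Dict

def _sum(sim_a: float, sim_b: float, token_overlap: int) -> float:
    return sim_a + sim_b                # default: sum of both similarity scores

def _average(sim_a: float, sim_b: float, token_overlap: int) -> float:
    return (sim_a + sim_b) / 2

def _min(sim_a: float, sim_b: float, token_overlap: int) -> float:
    return min(sim_a, sim_b)            # conservative: both scores must be high

def _max(sim_a: float, sim_b: float, token_overlap: int) -> float:
    return max(sim_a, sim_b)            # aggressive: one high score suffices

def _token_overlap(sim_a: float, sim_b: float, token_overlap: int) -> float:
    return float(token_overlap)         # raw shared-token count

def _weighted(sim_a: float, sim_b: float, token_overlap: int) -> float:
    # Illustrative weights only: blend both similarities with a capped overlap term.
    return 0.4 * sim_a + 0.4 * sim_b + 0.2 * min(token_overlap / 1000, 1.0)

# Registry that a --metric argument could look names up in.
METRICS: Dict[str, Callable[[float, float, int], float]] = {
    "sum": _sum,
    "average": _average,
    "min": _min,
    "max": _max,
    "token_overlap": _token_overlap,
    "weighted": _weighted,
}
```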
"token_overlap": 3568field--min-similarity 1000means at least 1000 tokens)Bug Fix
Bug Fix
Also includes a fix for a `RepositoryCommit` checkout issue discovered during testing (an incorrect commit hash was passed to the parent repository). Fixes #27
Usage Examples
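The exact command name is not shown here, so the invocations below use a placeholder; only the `--metric` and `--min-similarity` flags come from this PR, and the values are illustrative:

```bash
# <tool> is a placeholder for the actual CLI entry point.
<tool> --metric weighted
<tool> --metric min --min-similarity 0.8
<tool> --metric token_overlap --min-similarity 1000   # at least 1000 shared tokens
```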
Testing
All metrics have been tested and produce different rankings:
- `sum`, `average`, `min`, `weighted` → prioritize `wake/compiler/exceptions.py`
- `max` → prioritizes `wake/detectors/template.py`
- `token_overlap` → prioritizes `wake/ir/types.py` (3568 tokens)
Motivation
This addresses the TODO comment about making the comparison metric user-specifiable and resolves issues with too many false positives from whitespace overlap when using the default metric.
Fixes #14