Commit 1bcf3fc

Update required phrase generation
* This update decouples the creation of is_required_phrase rules from updating existing rules, moving it into a separate CLI. This makes it easier to control which rules are used as required phrases.
* This now skips more rules when adding required phrases to existing rules: any rule that cannot be matched approximately is skipped, which covers not only tiny rules but also many other rules.
* This checks that no rule gets a required phrase added that would break in the middle of a URL, email, or copyright. This is done by checking that no required phrase injection changes the set of ignorables of a rule, since such a change could break a URL and make it no longer a proper URL. The same applies to emails and copyrights.
* This extends "skipping" the collection of required phrases so that a rule is skipped from both required phrases collection for generating new rules AND injection of new required phrases in rule text. This allows handling exceptions more easily.
* The "is_required_phrase" rules creation now creates rules using improved content: the case and punctuation of the phrase text are preserved; the rule is created as "is_license_reference", which is going to be correct in the vast majority of cases.
* When matched, the "is_required_phrase" rules are treated the same as continuous rules and can only be matched exactly.
* The "is_required_phrase" rules are now validated extensively to ensure that there is no conflict with other rule flags.
* The code to "trace" the source of a required phrase injection now uses the new standard "source" rule field, and the code related to handling this field has been simplified.
* Required phrases injection has not yet been tested as working.

Signed-off-by: Philippe Ombredanne <[email protected]>
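The ignorables check described in the commit message can be sketched roughly as follows. This is a minimal illustration and not the code in licensedcode.required_phrases: the regexes and the `find_ignorables` and `is_safe_injection` helpers are hypothetical names, the sketch assumes required phrases are marked with `{{` and `}}` in rule text, and it covers only URLs and emails (copyrights are left out).

```python
import re

# Hypothetical, simplified detectors. Real URL and email detection is more
# elaborate; these exist only to illustrate the comparison.
URL_RE = re.compile(r"https?://[^\s{}]+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def find_ignorables(text):
    """Return the set of URLs and emails found in ``text``."""
    return set(URL_RE.findall(text)) | set(EMAIL_RE.findall(text))


def is_safe_injection(original_text, injected_text):
    """
    Return True if adding required phrase markers to a rule text left its
    set of ignorables unchanged, i.e. no URL or email was cut in the middle.
    """
    return find_ignorables(original_text) == find_ignorables(injected_text)


original = "See https://example.com/licenses/mit for details"
good = "See {{https://example.com/licenses/mit}} for details"
bad = "See https://example.com/{{licenses/mit}} for details"

print(is_safe_injection(original, good))
print(is_safe_injection(original, bad))
```

In the second injection the marker lands inside the URL, so the detected URL is truncated and the set of ignorables no longer matches the original; an injection like that would be rejected.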
1 parent 8e16712 commit 1bcf3fc

4 files changed: +807 -705 lines changed

setup.cfg

Lines changed: 1 addition & 0 deletions
@@ -159,6 +159,7 @@ console_scripts =
 scancode-license-data = licensedcode.license_db:dump_scancode_license_data
 regen-package-docs = packagedcode.regen_package_docs:regen_package_docs
 add-required-phrases = licensedcode.required_phrases:add_required_phrases
+gen-new-required-phrases-rules = licensedcode.required_phrases:gen_required_phrases_rules
 
 # These are configurations for ScanCode plugins as setuptools entry points.
 # Each plugin entry hast this form:
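For readers unfamiliar with the `name = module:function` form used in `console_scripts`, the sketch below shows how the new entry resolves to a module and a callable name. It only parses the entry with the standard library (Python 3.9 or later) and does not import ScanCode or run the command.

```python
from importlib.metadata import EntryPoint

# The entry added by this commit, expressed as an EntryPoint object.
ep = EntryPoint(
    name="gen-new-required-phrases-rules",
    value="licensedcode.required_phrases:gen_required_phrases_rules",
    group="console_scripts",
)

# The part before the colon is the module to import,
# the part after it is the callable to invoke.
print(ep.module)
print(ep.attr)
```

Once the package is reinstalled, the new script name is expected to be available on the command line next to the existing `add-required-phrases`.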

src/licensedcode/match.py

Lines changed: 6 additions & 4 deletions
@@ -2129,12 +2129,14 @@ def filter_matches_missing_required_phrases(
     A required phrase must be matched exactly without gaps or unknown words.
 
     A rule with "is_continuous" set to True is the same as if its whole text
-    was defined as a keyphrase and is processed here too.
+    was defined as a required phrase and is processed here too.
+    Same for a rule with "is_required_phrase" set to True.
+
     """
-    # never discard a solo match, unless matched to "is_continuous" rule
+    # never discard a solo match, unless matched to "is_continuous" or "is_required_phrase" rule
     if len(matches) == 1:
         rule = matches[0]
-        if not rule.is_continuous:
+        if not (rule.is_continuous or rule.is_required_phrase):
             return matches, []
 
     kept = []
@@ -2149,7 +2151,7 @@ def filter_matches_missing_required_phrases(
         if trace:
             logger_debug(' CHECKING KEY PHRASES for:', match)
 
-        is_continuous = match.rule.is_continuous
+        is_continuous = match.rule.is_continuous or match.rule.is_required_phrase
         ikey_spans = match.rule.required_phrase_spans
 
         if not (ikey_spans or is_continuous):
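The effect of this change can be illustrated with a simplified, self-contained sketch. It is not the ScanCode implementation: the `Rule` and `Match` classes, `matched_positions`, and `filter_matches` are hypothetical stand-ins, positions are plain integers, and the real function also handles the solo-match special case and unknown words, which are omitted here. The point is only that a rule flagged `is_continuous` or `is_required_phrase` behaves as if its whole text were one required phrase.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    # Hypothetical stand-in for a license rule.
    length: int
    # Inclusive (start, end) token positions that must be matched.
    required_phrase_spans: list = field(default_factory=list)
    is_continuous: bool = False
    is_required_phrase: bool = False


@dataclass
class Match:
    # Hypothetical stand-in for a license match.
    rule: Rule
    matched_positions: set


def has_required_phrases(match):
    """Return True if every required position of the rule was matched."""
    rule = match.rule
    whole_rule = rule.is_continuous or rule.is_required_phrase
    # A continuous or required-phrase rule requires its entire text.
    spans = [(0, rule.length - 1)] if whole_rule else rule.required_phrase_spans
    if not spans:
        return True
    return all(
        set(range(start, end + 1)) <= match.matched_positions
        for start, end in spans
    )


def filter_matches(matches):
    """Split matches into (kept, discarded) lists."""
    kept, discarded = [], []
    for match in matches:
        (kept if has_required_phrases(match) else discarded).append(match)
    return kept, discarded


rule = Rule(length=5, is_required_phrase=True)
full = Match(rule, {0, 1, 2, 3, 4})
partial = Match(rule, {0, 1, 3, 4})  # position 2 is missing: a gap

kept, discarded = filter_matches([full, partial])
print(len(kept), len(discarded))
```

Under these assumptions the match with a gap is discarded, while the same match against a rule with no flags and no required phrase spans would be kept.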
