Commit eee8640

Merge pull request #2667 from nexB/2635-license-accuracy
Improve license detection accuracy
2 parents 0362454 + 67708d9 commit eee8640

7,397 files changed, +74253 -65714 lines changed
CHANGELOG.rst

Lines changed: 44 additions & 29 deletions
@@ -7,53 +7,69 @@ v21.x.x (next, future)
 Important API changes:
 ~~~~~~~~~~~~~~~~~~~~~~~~
 
-- The data structure of the JSON output is now versioned and the next version
-  is available with a new command line option. We are also documenting a new
-  and clear API policy and backward compatibility policy.
-
-- The data structure of the JSON output has changed for copyrights, authors
-  and holders: we now use proper name for attributes and not a generic "value".
-
-- The data structure of the JSON output has changed for licenses: we now
-  return match details once for each matched license expression rather than
-  once for each license in a matched expression. There is a new top-level
-  "licenses" attributes that contains the data details for each detected
-  licenses only once. This data can contain the reference license text
-  as an option.
-
-- The data structure of the JSON output has changed for packages: we now
-  return "package_manifests" package information at the manifest file-level
-  rather than "packages". There is a a new top-level "packages" attribute
-  that contains each package instance that can be aggregating data from
-  multiple manifests for a single package instance.
-
-- The data structure for HTML output has been changed to include emails and urls under the
-  "infos" object. Now HTML template will output holders, authors, emails, and
-  urls into separate tables like "licenses" and "copyrights".
+- The data structure of the JSON output is now versioned and the next version
+  is available with a new command line option. We are also documenting a new
+  and clear API policy and backward compatibility policy.
+
+- The data structure of the JSON output has changed for copyrights, authors
+  and holders: we now use proper name for attributes and not a generic "value".
+
+- The data structure of the JSON output has changed for licenses: we now
+  return match details once for each matched license expression rather than
+  once for each license in a matched expression. There is a new top-level
+  "licenses" attributes that contains the data details for each detected
+  licenses only once. This data can contain the reference license text
+  as an option.
+
+- The data structure of the JSON output has changed for packages: we now
+  return "package_manifests" package information at the manifest file-level
+  rather than "packages". There is a a new top-level "packages" attribute
+  that contains each package instance that can be aggregating data from
+  multiple manifests for a single package instance.
+
+- The data structure for HTML output has been changed to include emails and urls under the
+  "infos" object. Now HTML template will output holders, authors, emails, and
+  urls into separate tables like "licenses" and "copyrights".
 
 Copyright detection:
 ~~~~~~~~~~~~~~~~~~~~
 
-- The data structure in the JSON is now using consistently named attributes as
-  opposed to a plain value.
+- The data structure in the JSON is now using consistently named attributes as
+  opposed to a plain value.
+- Several copyright detection bugs have been fixed.
 
 
 Package detection:
 ~~~~~~~~~~~~~~~~~~
 
-- Add support for OpenWRT packages.
-- Add support for Yocto/BitBake .bb recipes.
-- Add support to track installed files for each Package type.
+- Add support for OpenWRT packages.
+- Add support for Yocto/BitBake .bb recipes.
+- Add support to track installed files for each Package type.
+- Debian copyright license detection has been significantly improved with new
+  license detection rules.
 
 
 License detection:
 ~~~~~~~~~~~~~~~~~~~
 
+- There have been XXX new licenses added, YYY new license detection rules added
+  and ZZZ updated license or rules.
+
+- Several license detection bugs have fixed.
+
+- The SPDX license list 3.14 is now supported. We also include the version
+  of the SPDX license list in the ScanCode JSON and SPDX outputs, as well as
+  display it with the --version command line option.
+
 - Unknown licenses have a new flag "is_unknown" to identify them
   beyond just the naming convention of having "unknown" as part of their name.
 
 - Rules that match at least one unknown license have a flag "has_unknown" set
   in the returned match results.
+
+- There is a new experimental command line option "--unknown-licenses" to
+  detect unknown licenses and follow license references such as "See license in
+  file COPYING". The actual data structure for this new option is evolving.
 
 
 Many thanks to every contributors that made this possible and in particular:
@@ -64,7 +80,6 @@ Many thanks to every contributors that made this possible and in particular:
 - Philippe Ombredanne @pombredanne
 
 
-
 v21.8.4
 ---------
 
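
The two new flags described under "License detection" could be consumed roughly as follows. This is a minimal sketch, not ScanCode's documented output: only the flag names "is_unknown" and "has_unknown" come from the changelog, while the surrounding layout (a "files" list, a per-file "licenses" list, a "matched_rule" mapping) and the sample values are assumptions made for illustration.

```python
# Hypothetical scan excerpt. Only the flag names "is_unknown" and
# "has_unknown" come from the changelog; the layout around them is assumed.
SAMPLE_SCAN = {
    "files": [
        {
            "path": "src/a.c",
            "licenses": [
                {
                    "key": "unknown-license-reference",
                    "is_unknown": True,
                    "matched_rule": {"has_unknown": True},
                },
            ],
        },
        {
            "path": "src/b.c",
            "licenses": [
                {
                    "key": "mit",
                    "is_unknown": False,
                    "matched_rule": {"has_unknown": False},
                },
            ],
        },
    ],
}


def find_unknown_matches(scan):
    """Return (path, license key) pairs for matches flagged as unknown."""
    found = []
    for resource in scan.get("files", []):
        for match in resource.get("licenses", []):
            rule = match.get("matched_rule", {})
            # Rely on the explicit flags rather than on "unknown" in the key.
            if match.get("is_unknown") or rule.get("has_unknown"):
                found.append((resource.get("path"), match.get("key")))
    return found


if __name__ == "__main__":
    for path, key in find_unknown_matches(SAMPLE_SCAN):
        print(path, key)
```

Checking the flags instead of the key name is the point of the change: a license can be treated as unknown without relying on the naming convention.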

etc/scripts/licenses/buildrules.py

Lines changed: 7 additions & 1 deletion
@@ -226,6 +226,9 @@ def cli(licenses_file):
 
         rulerec = models.Rule(**rd)
 
+        # force recomputing relevance to remove junk stored relevance for long rules
+        rulerec.compute_relevance(_threshold=18.0)
+
         rulerec.data_file = base_loc + '.yml'
         rulerec.text_file = base_loc + '.RULE'
 
@@ -234,10 +237,13 @@ def cli(licenses_file):
         if rule_tokens in rules_tokens:
             print('Skipping already added rule with text for:', base_name)
         else:
+            print('Adding new rule:')
+            print(' file://' + rulerec.data_file)
+            print(' file://' + rulerec.text_file,)
             rules_tokens.add(rule_tokens)
+            rulerec.dump()
             models.update_ignorables(rulerec, verbose=False)
             rulerec.dump()
-            print('Rule added:', 'file://' + rulerec.data_file, '\n', 'file://' + rulerec.text_file,)
 
 
 if __name__ == '__main__':
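
The diff does not show what `compute_relevance` does internally. One plausible reading of the comment, assumed here and not confirmed by this commit, is that relevance grows linearly with the rule length in tokens and is capped at 100 once the length reaches the threshold, so a long rule never needs a stored value. The function name `sketch_relevance` and its formula are hypothetical.

```python
def sketch_relevance(rule_length, threshold=18.0):
    """
    Return an integer relevance between 0 and 100 for a rule of
    `rule_length` tokens. Hypothetical formula: linear up to `threshold`
    tokens, then capped at 100. This is not ScanCode's actual code.
    """
    if rule_length >= threshold:
        # Long rules are always fully relevant under this assumption.
        return 100
    return int(rule_length / threshold * 100)


if __name__ == "__main__":
    for length in (1, 9, 18, 40):
        print(length, sketch_relevance(length))
```

Under this assumption, forcing a recomputation simply replaces whatever value was stored for a long rule with 100.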
Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
+# -*- coding: utf-8 -*-
+#
+# Copyright (c) nexB Inc. and others. All rights reserved.
+# ScanCode is a trademark of nexB Inc.
+# SPDX-License-Identifier: Apache-2.0
+# See http://www.apache.org/licenses/LICENSE-2.0 for the license text.
+# See https://github.com/nexB/scancode-toolkit for support or download.
+# See https://aboutcode.org for more information about nexB OSS projects.
+#
+
+import click
+
+from licensedcode import models
+
+"""
+A script to generate license detection rules from existing license rules by
+replacing strings.
+"""
+
+from buildrules import find_rule_base_loc
+from buildrules import rule_exists
+
+
+def get_rules(source, replacement):
+    """
+    Yield tuple of (rule, new text) for non-false positive existing Rules with a
+    text that contains source.
+    """
+    for rule in models.load_rules():
+        if rule.is_false_positive:
+            continue
+        text = rule.text()
+        if source in text:
+            yield rule, text.replace(source, replacement)
+
+
+@click.command()
+@click.option('--source', metavar='SOURCE', type=str, help='The source, old string to replace.')
+@click.option('--replacement', metavar='REPLACEMENT', type=str, help='The replacement string to use.')
+@click.help_option('-h', '--help')
+def cli(source, replacement):
+    """
+    Create new license detection rules from existing rules by replacing a SOURCE
+    string by a REPLACEMENT string in any rule text that contains this SOURCE string.
+    """
+
+    for rule, new_text in get_rules(source, replacement):
+        existing = rule_exists(new_text)
+        if existing:
+            continue
+
+        if rule.is_license_intro:
+            base_name = 'license-intro'
+        else:
+            base_name = rule.license_expression
+
+        base_loc = find_rule_base_loc(base_name)
+
+        rd = rule.to_dict()
+        rd['stored_text'] = new_text
+        rd['has_stored_relevance'] = rule.has_stored_relevance
+        rd['has_stored_minimum_coverage'] = rule.has_stored_minimum_coverage
+
+        rulerec = models.Rule(**rd)
+
+        # force recomputing relevance to remove junk stored relevance for long rules
+        rulerec.compute_relevance(_threshold=18.0)
+
+        rulerec.data_file = base_loc + '.yml'
+        rulerec.text_file = base_loc + '.RULE'
+
+        print('Adding new rule:')
+        print(' file://' + rulerec.data_file)
+        print(' file://' + rulerec.text_file,)
+        rulerec.dump()
+        models.update_ignorables(rulerec, verbose=False)
+        rulerec.dump()
+
+
+if __name__ == '__main__':
+    cli()
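
The core idea of the new script, deriving rule texts by string replacement and skipping any result that already exists, can be shown without the `licensedcode` package. The sketch below is a simplified stand-in: the name `derive_rule_texts` is hypothetical, and it compares exact strings, whereas the script delegates the existence check to `rule_exists`, whose behavior is not shown in this diff.

```python
def derive_rule_texts(existing_texts, source, replacement):
    """
    Yield (original text, new text) pairs for each text containing `source`,
    skipping new texts that are already known. Exact string comparison is a
    simplification chosen for this sketch.
    """
    known = set(existing_texts)
    for text in existing_texts:
        if source not in text:
            continue
        new_text = text.replace(source, replacement)
        if new_text in known:
            # An equivalent text already exists: do not create a duplicate.
            continue
        known.add(new_text)
        yield text, new_text


if __name__ == "__main__":
    texts = [
        "Released under the GPL",
        "Released under the GNU GPL",
        "Released under the GPL v2",
    ]
    for old, new in derive_rule_texts(texts, "the GPL", "the GNU GPL"):
        print(repr(old), "->", repr(new))
```

In the example, the first text is skipped because its replacement already exists, and only the "v2" variant produces a new text.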

etc/scripts/licenses/synclic.py

Lines changed: 41 additions & 15 deletions
@@ -406,16 +406,25 @@ def build_license(self, mapping, skip_oddities=True, scancode_licenses=None):
             return
 
         # these keys have a complicated history
-        if skip_oddities and key in set([
-            'gpl-1.0', 'gpl-2.0', 'gpl-3.0',
-            'lgpl-2.0', 'lgpl-2.1', 'lgpl-3.0',
-            'agpl-1.0', 'agpl-2.0', 'agpl-3.0',
-            'gfdl-1.1', 'gfdl-1.2', 'gfdl-1.3',
+        spdx_keys_with_complicated_past = set([
+            'gpl-1.0',
+            'gpl-2.0',
+            'gpl-3.0',
+            'lgpl-2.0',
+            'lgpl-2.1',
+            'lgpl-3.0',
+            'agpl-1.0',
+            'agpl-2.0',
+            'agpl-3.0',
+            'gfdl-1.1',
+            'gfdl-1.2',
+            'gfdl-1.3',
             'nokia-qt-exception-1.1',
             'bzip2-1.0.5',
             'bsd-2-clause-freebsd',
             'bsd-2-clause-netbsd',
-        ]):
+        ])
+        if skip_oddities and key in spdx_keys_with_complicated_past:
             return
 
         deprecated = mapping.get('isDeprecatedLicenseId', False)
@@ -505,8 +514,9 @@ def __init__(self, external_base_dir, api_base_url=None, api_key=None):
         self.api_base_url = api_base_url or os.getenv('DEJACODE_API_URL')
         self.api_key = api_key or os.getenv('DEJACODE_API_KEY')
         assert (self.api_key and self.api_base_url), (
-            'You must set the DEJACODE_API_URL and DEJACODE_API_KEY ' +
-            'environment variables before running this script.')
+            'You must set the DEJACODE_API_URL and DEJACODE_API_KEY '
+            'environment variables before running this script.'
+        )
 
         super(DejaSource, self).__init__(external_base_dir)
 
@@ -546,14 +556,30 @@ def build_license(self, mapping, scancode_licenses):
             return
 
         # these licenses are combos of many others and are ignored: we detect
-        # instead each part of the combo
+        # instead each part of the combos separately
         dejacode_special_composites = set([
-            'intel-bsd-special',
-            # 'newlib-subdirectory',
-        ])
-        is_component_license = mapping.get('is_component_license') or False
-
-        is_combo = is_component_license or key in dejacode_special_composites
+            'net-snmp',
+            'aes-128-3.0',
+            'agpl-3.0-bacula',
+            'bacula-exception',
+            'componentace-jcraft',
+            'nvidia-cuda-supplement-2020',
+            'dejacode',
+            'ibm-icu',
+            'unicode-icu-58',
+            'info-zip-1997-10',
+            'info-zip-2001-01',
+            'info-zip-2002-02',
+            'info-zip-2003-05',
+            'info-zip-2004-05',
+            'info-zip-2005-02',
+            'info-zip-2007-03',
+            'info-zip-2009-01',
+            'intel-bsd-special',
+            'lgpl-3.0-plus-openssl',
+            'newlib-subdirectory',
+        ])
+        is_combo = key in dejacode_special_composites
         if is_combo:
             if TRACE: print('Skipping DejaCode combo/component license', key)
             return
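
The last hunk changes which DejaCode licenses are skipped: the `is_component_license` flag is no longer consulted, and only membership in the explicit key set matters. The standalone sketch below contrasts the two predicates. The function names and sample mappings are hypothetical, and the key set is abbreviated to three entries taken from the diff.

```python
# Abbreviated copy of the key set from the diff, for illustration only.
SPECIAL_COMPOSITES = frozenset([
    'intel-bsd-special',
    'newlib-subdirectory',
    'net-snmp',
])


def is_combo_before(mapping):
    """Old predicate: skip on the component flag or on a listed key."""
    is_component_license = mapping.get('is_component_license') or False
    return is_component_license or mapping.get('key') in {'intel-bsd-special'}


def is_combo_after(mapping):
    """New predicate: skip only when the key is explicitly listed."""
    return mapping.get('key') in SPECIAL_COMPOSITES


if __name__ == "__main__":
    samples = [
        {'key': 'net-snmp'},
        {'key': 'some-component', 'is_component_license': True},
        {'key': 'mit'},
    ]
    for mapping in samples:
        print(mapping['key'], is_combo_before(mapping), is_combo_after(mapping))
```

A license flagged as a component but absent from the set was skipped before the change and is processed after it, while newly listed keys such as 'net-snmp' are now skipped.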

requirements.txt

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ colorama==0.4.4
 commoncode==21.8.31
 construct==2.10.67
 cryptography==3.4.7
-debian-inspector==21.5.25
+debian-inspector==30.0.0
 dparse==0.5.1
 extractcode==21.7.23
 extractcode-7z==16.5.210531

scancode-toolkit.ABOUT

File mode changed from 100755 to 100644.

setup-mini.cfg

Lines changed: 4 additions & 3 deletions
@@ -1,6 +1,6 @@
 [metadata]
 name = scancode-toolkit-mini
-version = 21.8.4
+version = 30.0.0
 license = Apache-2.0 AND CC-BY-4.0 AND LicenseRef-scancode-other-permissive AND LicenseRef-scancode-other-copyleft
 
 description = ScanCode is a tool to scan code for license, copyright, package and their documented dependencies and other interesting facts. scancode-toolkit-mini is a special build that does not come with pre-built binary dependencies by default. These are instead installed separately or with the extra_requires scancode-toolkit-mini[full]
@@ -59,8 +59,8 @@ install_requires =
     chardet >= 3.0.0
     click >= 6.7, !=7.0
     colorama >= 0.3.9
-    commoncode >= 21.8.27
-    debian-inspector >= 21.5.25
+    commoncode >= 21.8.31
+    debian-inspector >= 30.0.0
     dparse >= 0.5.1
     fasteners
     fingerprints >= 0.6.0
@@ -197,6 +197,7 @@ scancode_output =
     jsonlines = formattedcode.output_jsonlines:JsonLinesOutput
     template = formattedcode.output_html:CustomTemplateOutput
     debian = formattedcode.output_debian:DebianCopyrightOutput
+    yaml = formattedcode.output_yaml:YamlOutput
 
 
 [tool:pytest]
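
The added `yaml = formattedcode.output_yaml:YamlOutput` line follows the usual entry point convention of `name = module:attribute`. The sketch below shows how such a line maps to an importable object. It is a simplified illustration with hypothetical helper names: real installations resolve entry points through the installed package metadata, and the loading demo uses the standard library's `json:dumps` so that it runs without ScanCode installed.

```python
from importlib import import_module


def parse_entry_point(line):
    """Split 'name = module:attr' into (name, module name, attribute path)."""
    name, _, target = line.partition('=')
    module_name, _, attr = target.strip().partition(':')
    return name.strip(), module_name.strip(), attr.strip()


def load_entry_point(line):
    """Import the module named in the line and return (name, object)."""
    name, module_name, attr = parse_entry_point(line)
    obj = import_module(module_name)
    # Attribute paths may be dotted, so walk them one part at a time.
    for part in attr.split('.'):
        obj = getattr(obj, part)
    return name, obj


if __name__ == "__main__":
    print(parse_entry_point('yaml = formattedcode.output_yaml:YamlOutput'))
    name, func = load_entry_point('dumps = json:dumps')
    print(name, func({'ok': True}))
```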

setup.cfg

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
 [metadata]
 name = scancode-toolkit
-version = 21.8.4
+version = 30.0.0
 license = Apache-2.0 AND CC-BY-4.0 AND LicenseRef-scancode-other-permissive AND LicenseRef-scancode-other-copyleft
 
 description = ScanCode is a tool to scan code for license, copyright, package and their documented dependencies and other interesting facts.
@@ -60,7 +60,7 @@ install_requires =
     click >= 6.7, !=7.0
     colorama >= 0.3.9
     commoncode >= 21.8.31
-    debian-inspector >= 21.5.25
+    debian-inspector >= 30.0.0
     dparse >= 0.5.1
     fasteners
     fingerprints >= 0.6.0
