Overview
========

This `licensedcode` module provides utilities to accurately detect a vast
array of open-source and proprietary licenses. It manages a comprehensive
database of license texts, patterns, and rules, enabling ScanCode to
perform scans and provide precise license conclusions.

Key Functionality
-----------------

* License Rule Management: Stores and manages a large collection of
  license rules, including full texts, snippets, and regular expressions.

* Pattern Matching: Implements sophisticated algorithms for matching
  detected code against known license patterns and texts.

* License Detection Logic: Contains the core logic for processing scan
  input, applying rules, and determining the presence and type of
  licenses.

* Rule-based Detection: Utilizes a robust system of rules to identify
  licenses even when only fragments or variations of license texts are
  present.

* License Expression Parsing: Supports the parsing and interpretation of
  complex license expressions (e.g., "MIT AND Apache-2.0").
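
For illustration only (this is not the module's actual parser), a toy
parse of such expressions, assuming OR binds more loosely than AND and
ignoring parentheses, could look like:

```python
def parse_license_expression(text):
    # Toy parser: OR binds more loosely than AND; no parentheses support.
    or_parts = [part.strip() for part in text.split(" OR ")]
    return ("OR", [("AND", [lic.strip() for lic in part.split(" AND ")])
                   for part in or_parts])
```

Here `parse_license_expression("MIT AND Apache-2.0")` yields a small tree
grouping both licenses under a single AND.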


How It Works (High-Level)
-------------------------

At a high level, the `licensedcode` module operates by:

1. Loading License Data: It initializes by loading a curated set of
   license texts, short license identifiers, and detection rules from its
   internal data store.

2. Scanning Input: When ScanCode processes a file or directory, the
   content is converted into an internal representation (a "query").

3. Applying Rules: The module then applies its extensive set of rules and
   patterns to the input content through a multi-stage pipeline, looking
   for matches.

4. Reporting Detections: Upon successful matches, it reports the
   identified licenses, their confidence levels, and the exact locations
   (lines, characters) where they were found.

For a more in-depth understanding of the underlying technical principles
and the detection pipeline, please refer to the sections below.


ScanCode license detection overview and key design elements
-----------------------------------------------------------

License detection involves identifying commonalities between the text of a
scanned query file and the indexed license and rule texts. The process


Rules and licenses
^^^^^^^^^^^^^^^^^^

The detection uses an index of reference license texts along with a set
of "rules" that are common notices or mentions of these licenses. One
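
The resemblance and containment of matched texts used by the pipeline can
be sketched over sets of token ids (a simplified illustration; the actual
matching also works with token positions and sequences):

```python
def resemblance(query_set, rule_set):
    # Jaccard resemblance between two sets of token ids.
    union = query_set | rule_set
    return len(query_set & rule_set) / len(union) if union else 0.0

def containment(query_set, rule_set):
    # Fraction of the rule's distinct tokens that appear in the query.
    return len(query_set & rule_set) / len(rule_set) if rule_set else 0.0
```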


Words as integers
^^^^^^^^^^^^^^^^^

A dictionary that maps words to a unique integer is used to transform the
words of a scanned text "query", as well as the words in the indexed license
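
The word-to-integer transformation can be sketched as follows (a minimal
illustration; names like `build_dictionary` are assumptions, not the
module's actual API):

```python
def build_dictionary(texts):
    # Assign a unique integer id to each distinct word, in first-seen order.
    dictionary = {}
    for text in texts:
        for word in text.lower().split():
            dictionary.setdefault(word, len(dictionary))
    return dictionary

def tokenize(text, dictionary):
    # Map words to token ids; words unknown to the index become None here.
    return [dictionary.get(word) for word in text.lower().split()]
```

Comparing lists of small integers is much cheaper than comparing strings,
which is the point of this encoding.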


Common/junk tokens
^^^^^^^^^^^^^^^^^^

The quality and speed of detection are supported by classifying each word
as either good/discriminant or common/junk. Junk tokens are either very
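
A toy version of this classification treats the most frequent token ids as
common/junk and the rest as good/discriminant (the cutoff below is an
arbitrary assumption for illustration):

```python
from collections import Counter

def split_junk_and_good(token_streams, junk_count=2):
    # Count token ids across all indexed texts and mark the junk_count
    # most frequent ones as low (junk); the rest are high (good).
    counts = Counter(tid for stream in token_streams for tid in stream)
    junk = {tid for tid, _ in counts.most_common(junk_count)}
    good = set(counts) - junk
    return junk, good
```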


Query processing
^^^^^^^^^^^^^^^^

When a file is scanned, it is first converted to a query object which is a
list of integer token ids. A query is further broken down into slices
(a.k.a. query runs) based
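
The breakdown into query runs can be sketched as splitting the token
stream wherever a stretch of tokens is unknown to the index (the gap
threshold is an assumption for illustration):

```python
def query_runs(token_ids, max_gap=2):
    # Split a stream of token ids (None = unknown word) into runs,
    # breaking wherever at least max_gap consecutive tokens are unknown.
    runs, current, gap = [], [], 0
    for tid in token_ids:
        if tid is None:
            gap += 1
            if gap >= max_gap and current:
                runs.append(current)
                current = []
        else:
            gap = 0
            current.append(tid)
    if current:
        runs.append(current)
    return runs
```

Each run can then be matched independently, which keeps matching focused
on contiguous stretches of known text.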


Matching pipeline
^^^^^^^^^^^^^^^^^

The matching pipeline consists of:

  significantly. The number of multiple local sequence alignments that are
  required in this step is also made much smaller by the pre-matching done
  using sets.

- finally all the collected matches are merged, refined and filtered to
  yield the final results. The merging considers the resemblance,
  containment and overlap between the scanned texts and the matched texts,
  and several secondary factors. Filtering is based on the density and
  length of matches as well as the number of good or frequent tokens
  matched. Lastly, each match receives a score calculated based on the sum
  of the underlying match scores, weighted by the length of the match
  relative to the overall detection length. Optionally we can also collect
  the exact matched texts and identify which portions were not matched for
  each instance.
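
The set-based pre-matching mentioned above can be sketched as keeping only
rules whose distinct tokens are largely contained in the query before any
costly alignment (the rule mapping shape and threshold are assumptions):

```python
def candidate_rules(query_tokens, rules, min_containment=0.8):
    # rules: hypothetical mapping of rule id -> list of token ids.
    qset = set(query_tokens)
    candidates = []
    for rule_id, rule_tokens in rules.items():
        rset = set(rule_tokens)
        if rset and len(rset & qset) / len(rset) >= min_containment:
            candidates.append(rule_id)
    return candidates
```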


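The length-weighted scoring described in the last pipeline step can be
sketched as follows (the `(score, length)` pair shape is a hypothetical
simplification):

```python
def detection_score(matches):
    # matches: list of (score, matched_length) pairs, a hypothetical shape.
    # Each match's score is weighted by its share of the total matched length.
    total_length = sum(length for _, length in matches)
    if not total_length:
        return 0.0
    return sum(score * length for score, length in matches) / total_length
```
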
Comparison with other tools' approaches
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^