Overview
========

This `licensedcode` module provides utilities to accurately detect a vast
array of open-source and proprietary licenses. It manages a comprehensive
database of license texts, patterns, and rules, enabling ScanCode to
perform scans and provide precise license conclusions.

Key Functionality
-----------------

* License Rule Management: Stores and manages a large collection of
  license rules, including full texts, snippets, and regular expressions.

* Pattern Matching: Implements sophisticated algorithms for matching
  detected code against known license patterns and texts.

* License Detection Logic: Contains the core logic for processing scan
  input, applying rules, and determining the presence and type of
  licenses.

* Rule-based Detection: Utilizes a robust system of rules to identify
  licenses even when only fragments or variations of license texts are
  present.

* License Expression Parsing: Supports the parsing and interpretation of
  complex license expressions (e.g., "MIT AND Apache-2.0").
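
For illustration only (this is not the module's actual parser), a toy
parse of such expressions, assuming OR binds more loosely than AND and
ignoring parentheses, could look like:

```python
def parse_license_expression(text):
    # Toy parser: OR binds more loosely than AND; no parentheses support.
    or_parts = [part.strip() for part in text.split(" OR ")]
    return ("OR", [("AND", [lic.strip() for lic in part.split(" AND ")])
                   for part in or_parts])
```

Here `parse_license_expression("MIT AND Apache-2.0")` yields a small tree
grouping both licenses under a single AND.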


How It Works (High-Level)
-------------------------

At a high level, the `licensedcode` module operates by:

1. Loading License Data: It initializes by loading a curated set of
   license texts, short license identifiers, and detection rules from its
   internal data store.

2. Scanning Input: When ScanCode processes a file or directory, the
   content is converted into an internal representation (a "query").

3. Applying Rules: The module then applies its extensive set of rules and
   patterns to the input content through a multi-stage pipeline, looking
   for matches.

4. Reporting Detections: Upon successful matches, it reports the
   identified licenses, their confidence levels, and the exact locations
   (lines, characters) where they were found.

For a more in-depth understanding of the underlying technical principles
and the detection pipeline, please refer to the sections below.


ScanCode license detection overview and key design elements
-----------------------------------------------------------

License detection involves identifying commonalities between the text of a
scanned query file and the indexed license and rule texts. The process


Rules and licenses
^^^^^^^^^^^^^^^^^^

The detection uses an index of reference license texts along with a set
of "rules" that are common notices or mentions of these licenses. One
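
The resemblance and containment of matched texts used by the pipeline can
be sketched over sets of token ids (a simplified illustration; the actual
matching also works with token positions and sequences):

```python
def resemblance(query_set, rule_set):
    # Jaccard resemblance between two sets of token ids.
    union = query_set | rule_set
    return len(query_set & rule_set) / len(union) if union else 0.0

def containment(query_set, rule_set):
    # Fraction of the rule's distinct tokens that appear in the query.
    return len(query_set & rule_set) / len(rule_set) if rule_set else 0.0
```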


Words as integers
^^^^^^^^^^^^^^^^^

A dictionary that maps words to a unique integer is used to transform the
words of a scanned text "query", as well as the words in the indexed license
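
The word-to-integer transformation can be sketched as follows (a minimal
illustration; names like `build_dictionary` are assumptions, not the
module's actual API):

```python
def build_dictionary(texts):
    # Assign a unique integer id to each distinct word, in first-seen order.
    dictionary = {}
    for text in texts:
        for word in text.lower().split():
            dictionary.setdefault(word, len(dictionary))
    return dictionary

def tokenize(text, dictionary):
    # Map words to token ids; words unknown to the index become None here.
    return [dictionary.get(word) for word in text.lower().split()]
```

Comparing lists of small integers is much cheaper than comparing strings,
which is the point of this encoding.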


Common/junk tokens
^^^^^^^^^^^^^^^^^^

The quality and speed of detection are supported by classifying each word
as either good/discriminant or common/junk. Junk tokens are either very
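
A toy version of this classification treats the most frequent token ids as
common/junk and the rest as good/discriminant (the cutoff below is an
arbitrary assumption for illustration):

```python
from collections import Counter

def split_junk_and_good(token_streams, junk_count=2):
    # Count token ids across all indexed texts and mark the junk_count
    # most frequent ones as low (junk); the rest are high (good).
    counts = Counter(tid for stream in token_streams for tid in stream)
    junk = {tid for tid, _ in counts.most_common(junk_count)}
    good = set(counts) - junk
    return junk, good
```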


Query processing
^^^^^^^^^^^^^^^^

When a file is scanned, it is first converted to a query object which is a
list of integer token ids. A query is further broken down into slices
(a.k.a. query runs) based
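
The breakdown into query runs can be sketched as splitting the token
stream wherever a stretch of tokens is unknown to the index (the gap
threshold is an assumption for illustration):

```python
def query_runs(token_ids, max_gap=2):
    # Split a stream of token ids (None = unknown word) into runs,
    # breaking wherever at least max_gap consecutive tokens are unknown.
    runs, current, gap = [], [], 0
    for tid in token_ids:
        if tid is None:
            gap += 1
            if gap >= max_gap and current:
                runs.append(current)
                current = []
        else:
            gap = 0
            current.append(tid)
    if current:
        runs.append(current)
    return runs
```

Each run can then be matched independently, which keeps matching focused
on contiguous stretches of known text.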


Matching pipeline
^^^^^^^^^^^^^^^^^

The matching pipeline consists of:

  significantly. The number of multiple local sequence alignments that are
  required in this step is also made much smaller by the pre-matching done
  using sets.

- finally all the collected matches are merged, refined and filtered to
  yield the final results. The merging considers the resemblance,
  containment and overlap between the scanned texts and the matched texts,
  and several secondary factors. Filtering is based on the density and
  length of matches as well as the number of good or frequent tokens
  matched. Lastly, each match receives a score calculated based on the sum
  of the underlying match scores, weighted by the length of the match
  relative to the overall detection length. Optionally we can also collect
  the exact matched texts and identify which portions were not matched for
  each instance.
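
The set-based pre-matching mentioned above can be sketched as keeping only
rules whose distinct tokens are largely contained in the query before any
costly alignment (the rule mapping shape and threshold are assumptions):

```python
def candidate_rules(query_tokens, rules, min_containment=0.8):
    # rules: hypothetical mapping of rule id -> list of token ids.
    qset = set(query_tokens)
    candidates = []
    for rule_id, rule_tokens in rules.items():
        rset = set(rule_tokens)
        if rset and len(rset & qset) / len(rset) >= min_containment:
            candidates.append(rule_id)
    return candidates
```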


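The length-weighted scoring described in the last pipeline step can be
sketched as follows (the `(score, length)` pair shape is a hypothetical
simplification):

```python
def detection_score(matches):
    # matches: list of (score, matched_length) pairs, a hypothetical shape.
    # Each match's score is weighted by its share of the total matched length.
    total_length = sum(length for _, length in matches)
    if not total_length:
        return 0.0
    return sum(score * length for score, length in matches) / total_length
```
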
Comparison with other tools' approaches
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^