Add new FAQ entry on --license-text (#4476)

pombredanne · AyanSinhaMahapatra · web-flow · commit 0b60c639fd02 · 2025-07-22T20:33:07.000+05:30
* Add new FAQ entry on --license-text

Signed-off-by: Philippe Ombredanne <pombredanne@aboutcode.org>
Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Co-authored-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
diff --git a/docs/source/cli-reference/basic-options.rst b/docs/source/cli-reference/basic-options.rst
@@ -623,8 +623,8 @@
         The option ``--license-text-diagnostics`` is a sub-option of and requires the options
         ``--license`` and ``--license-text``.
 
-    In the matched license text, include diagnostic highlights surrounding with square brackets []
-    words that are not matched.
+    This adds a new ``matched_text_diagnostics`` attribute that mirrors the matched license text,
+    but surrounds the words that are not matched with square brackets ``[]``.
 
     In a normal scan, whole lines of text are included in the matched license text, including parts
     that are possibly unmatched.
@@ -645,9 +645,14 @@
         obtaining a copy of this software and associated documentation files (the \"Software\"),
         to deal in the Software without restriction
 
-    With Diagnostics on::
+    With Diagnostics on (new attribute with the matched text diagnostics)::
 
         "matched_text":
+        "License Copyright (c) 2000 - 2006 The Legion Of The Bouncy Castle
+        (http://www.bouncycastle.org) Permission is hereby granted, free of charge, to any person
+        obtaining a copy of this software and associated documentation files (the \"Software\"),
+        to deal in the Software without restriction
+        "matched_text_diagnostics":
         "License [Copyright] ([c]) [2000] - [2006] [The] [Legion] [Of] [The] [Bouncy] [Castle]
         ([http]://[www].[bouncycastle].[org]) Permission is hereby granted, free of charge, to any person
         obtaining a copy of this software and associated documentation files (the \"Software\"),
diff --git a/docs/source/misc/faq.rst b/docs/source/misc/faq.rst
@@ -82,3 +82,53 @@ When scanning binaries, the line numbers are just a relative indication of where
 a detection was found: there is no such thing as lines in a binary. The numbers
 reported are based on the strings extracted from the binaries, typically broken
 as new lines with each NULL character.
+
+
+How does ``--license-text`` work exactly in ScanCode?
+-------------------------------------------------------------
+
+Is the matched text included in the result exactly the lines of text
+from the input file that are covered by the ``start_line`` and ``end_line``
+fields of the result? That is, if I post-processed the input file and extracted
+lines ``start_line`` to ``end_line`` from it, would I get exactly the
+``matched_text`` contents? Or is there some more "magic" involved when
+populating the ``matched_text`` field?
+
+ScanCode is a bit smarter than just using the start and end lines: matching is based on
+words, not on whole lines of the scanned text, so a line may be only partially matched.
+
+For instance, with this command::
+
+    $ echo "Foo is a wonder piece of code. Licensed under the GPL. " \
+        "For support contact foo@bar.com " > tst
+    $ scancode --license --license-text --license-text-diagnostics --yaml - tst
+    ...
+        license_detections:
+            -   license_expression: gpl-1.0-plus
+                license_expression_spdx: GPL-1.0-or-later
+                matches:
+                    -   license_expression: gpl-1.0-plus
+                        license_expression_spdx: GPL-1.0-or-later
+                        from_file: tst
+                        start_line: 1
+                        end_line: 1
+                        matcher: 2-aho
+                        score: '100.0'
+                        matched_length: 4
+                        match_coverage: '100.0'
+                        rule_relevance: 100
+                        rule_identifier: gpl_85.RULE
+                        rule_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/gpl_85.RULE
+                        matched_text: Foo is a wonder piece of code. Licensed under the GPL.
+                            For support contact foo@bar.com
+                        matched_text_diagnostics: Licensed under the GPL.
+    ...
+
+then:
+
+- ``matched_text`` is based on ``start_line`` and ``end_line``
+- ``matched_text_diagnostics`` is based on the exact matched words
+
+Note that ``matched_text_diagnostics`` also includes any extra unmatched words found
+between the matched words, tagged with square brackets ``[]``.
+
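
The post-processing described in the FAQ question can be sketched in Python as follows. This is a minimal illustration, not ScanCode code: the helper names `lines_span`, `normalize_ws` and `unmatched_words` are hypothetical, and whitespace is collapsed before comparing on the assumption that the YAML rendering above folds the original spacing. The first helper extracts the whole lines from `start_line` to `end_line`, which is what `matched_text` is based on; the last one lists the bracketed words that `matched_text_diagnostics` tags as unmatched.

```python
import re


def lines_span(text, start_line, end_line):
    """Return lines start_line..end_line of text (1-based, inclusive)."""
    lines = text.splitlines()
    return "\n".join(lines[start_line - 1:end_line])


def normalize_ws(text):
    """Collapse every run of whitespace into a single space and trim the ends."""
    return " ".join(text.split())


def unmatched_words(diagnostics):
    """Return the words wrapped in square brackets in a diagnostics string."""
    return re.findall(r"\[([^\[\]]+)\]", diagnostics)


# Content of the "tst" file created by the echo command in the FAQ entry.
scanned = (
    "Foo is a wonder piece of code. Licensed under the GPL.  "
    "For support contact foo@bar.com \n"
)

# Values copied from the scan output shown in the FAQ entry.
matched_text = (
    "Foo is a wonder piece of code. Licensed under the GPL. "
    "For support contact foo@bar.com"
)
matched_text_diagnostics = "Licensed under the GPL."

# The whole line 1 corresponds to matched_text once whitespace is normalized.
whole_lines = normalize_ws(lines_span(scanned, 1, 1))
print(whole_lines == normalize_ws(matched_text))

# The diagnostics text is only a part of that line, with no bracketed gaps here.
print(matched_text_diagnostics in whole_lines)
print(unmatched_words(matched_text_diagnostics))
```

In this example the diagnostics text contains no bracketed words because all four words of the rule were matched contiguously; on the Bouncy Castle sample from `basic-options.rst`, `unmatched_words` would return the bracketed tokens such as `Copyright` and `2000`.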