Commit 99f0a0d

Add write up on packagedcode module in README.
* following discussion with @sschuberth in #421 (comment) Signed-off-by: Philippe Ombredanne <[email protected]>
1 parent d01aa0d commit 99f0a0d

1 file changed

+116
-0
lines changed
src/packagedcode/README.rst

Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
The purpose of `packagedcode` is to:

- detect a package,
- determine its dependencies,
- detect its asserted license (at the metadata level) vs. its actual licensing (as scanned).

1. **detect the presence of a package** in a codebase based on its manifest, its file
or archive type. Typically it is a third-party package, but it may be your own too.
Taking Python as the main example, a package can exist in multiple forms:

1.1. as a **source checkout** (or some source archive such as a source
distribution or `sdist`) where the presence of a `setup.py` or some
`requirements.txt` file is the key marker for Python. For Maven it would be a
`pom.xml` or a `build.gradle` file, for Ruby a `Gemfile` or `Gemfile.lock`, or the
presence of autotools files, and so on, with the goal of eventually covering all
the package formats/types that are out there and commonly used.

1.2. as an **installable archive or binary** such as a PyPI wheel `.whl` or
`.egg`, a Maven `.jar`, a Ruby `.gem`, a `.nupkg` for NuGet, a `.rpm` or `.deb`
Linux package, etc. Here the type, shape and name structure of an archive as
well as some of its file contents are the key markers for detection. The metadata
may also be included in that archive as a file or as some headers (e.g. RPMs).

1.3. as an **installed package** such as when you `pip install` a Python package
or `bundle install` Ruby gems or `npm install` node modules. Here the key markers
may be some combination of a typical or conventional directory layout and the presence of
specific files such as the metadata installed with a Python `wheel`, a `vendor`
directory for Ruby, some `node_modules` directory tree for npm packages, or a certain
file type with metadata such as Windows DLLs. Additional markers may also include
"namespaces" such as Java or Python imports or C/C++ namespace declarations.

2. **parse and collect the package manifest(s)** metadata. For Python, this means
extracting the name, version, authorship, asserted licensing and declared dependencies as
found in any of the package descriptor files (e.g. a `setup.py` file,
`requirements` file(s) or any of the `*.dist-info` or `*.egg-info` directory files such as
a `metadata.json`). Other package formats have their own metadata that may be more or
less comprehensive in the breadth and depth of information they offer (e.g.
`.nuspec`, `package.json`, `bower.json`, Godeps, etc.). These metadata include the
declared dependencies (and in some cases the fully resolved dependencies too, such as
with `Gemfile.lock`). Finally, all the different package formats and data are
normalized and stored in a common data structure that abstracts the small differences in
naming and semantics that may exist between the different package formats.

Once collected, these data are then injected into a `packages` section of the scan.
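
The two steps above (detect a package from its markers, then normalize what was collected into a common structure stored under `packages`) can be sketched roughly as follows. This is only an illustration and not the actual `packagedcode` API: the marker tables, the `Package` fields and the `detect_package_type` and `collect_packages` names are all assumptions made for this example, and no manifest content is parsed here.

```python
from dataclasses import dataclass, field, asdict
from pathlib import PurePosixPath
from typing import List, Optional

# Hypothetical marker tables (assumptions for this sketch, not ScanCode's own).
# Manifest file name -> package type.
MANIFEST_MARKERS = {
    "setup.py": "pypi",
    "requirements.txt": "pypi",
    "pom.xml": "maven",
    "build.gradle": "maven",
    "Gemfile": "gem",
    "Gemfile.lock": "gem",
    "package.json": "npm",
}

# Archive file extension -> package type.
ARCHIVE_MARKERS = {
    ".whl": "pypi",
    ".egg": "pypi",
    ".jar": "maven",
    ".gem": "gem",
    ".nupkg": "nuget",
    ".rpm": "rpm",
    ".deb": "deb",
}


@dataclass
class Package:
    """Hypothetical common structure shared by all package formats."""
    type: str
    path: str
    name: Optional[str] = None
    version: Optional[str] = None
    asserted_license: Optional[str] = None
    dependencies: List[str] = field(default_factory=list)


def detect_package_type(path: str) -> Optional[str]:
    """Return a package type based on the file name or extension, or None."""
    p = PurePosixPath(path)
    if p.name in MANIFEST_MARKERS:
        return MANIFEST_MARKERS[p.name]
    return ARCHIVE_MARKERS.get(p.suffix.lower())


def collect_packages(paths: List[str]) -> dict:
    """Build a scan-like mapping with a `packages` section."""
    packages = []
    for path in paths:
        package_type = detect_package_type(path)
        if package_type:
            packages.append(Package(type=package_type, path=path))
    return {"packages": [asdict(p) for p in packages]}


scan = collect_packages(["setup.py", "README.rst", "vendor/rake-13.0.gem"])
print([p["type"] for p in scan["packages"]])
```

A real detector would also look at archive contents and directory layouts, as described in 1.2 and 1.3, and would fill in the remaining fields by parsing each manifest.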

What code in `packagedcode` is not meant to do:

A. **download packages** from a third-party repository: there is upcoming code in
another tool that will deal specifically with this and will also handle collecting
the metadata as served by a package repository (which are in most cases, but not
always, the same as what is declared in the manifests).

B. **resolve dependencies**: the focus here is on a purely static analysis that, by design,
does not rely on any network access at runtime. To scan for the dependencies actually used,
the process is instead to scan an as-built, as-installed or as-deployed
codebase where the dependencies have already been provisioned and installed;
ScanCode would detect them there.
There is also an upcoming prototype for a dynamic multi-package dependency
resolver that actually runs the proper tool live to resolve and collect dependencies
(e.g. effectively running Maven, bundler, pip, npm, gradle, bower, go get/dep, etc.).
This will be a tool separate from ScanCode as it requires having several/all
package managers installed (and possibly multiple versions of each) and may run code
from the codebase (e.g. a `setup.py`) and access the network for fetching or resolving
dependencies. It could also be exposed as a web service that takes in a manifest
and package, safely runs the dependency resolution in an isolated environment (e.g. a
chroot jail or Docker container) and returns the collected dependencies.

C. **match packages** (and files) to actual repositories or registries: e.g. given a
scan that detected packages, matching would mean looking them up in a remote package
repository or a local index, possibly also using A. and/or B. if needed.
Here again there is some upcoming code and tool that will deal specifically with
this aspect and would also handle building an index of actual registries/repositories
and matching using hashes and fingerprints.
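
To make the idea of matching by hash in C. a little more concrete, here is a minimal sketch of an exact-hash lookup against a local index. It is not part of `packagedcode` and not the design of the upcoming tool: the index layout and the `build_index` and `match_file` helpers are hypothetical, and fuzzier fingerprint matching is left out entirely.

```python
import hashlib
from typing import Dict, Optional


def sha1_of(data: bytes) -> str:
    """Return the hex SHA1 digest of some file content."""
    return hashlib.sha1(data).hexdigest()


def build_index(known_files: Dict[str, bytes]) -> Dict[str, str]:
    """Build a hypothetical local index: SHA1 digest -> known package label."""
    return {sha1_of(content): label for label, content in known_files.items()}


def match_file(index: Dict[str, str], data: bytes) -> Optional[str]:
    """Return the label of the indexed package with identical content, if any."""
    return index.get(sha1_of(data))


# Placeholder content standing in for real archive bytes.
index = build_index({"example-1.0.tar.gz": b"archive bytes"})
print(match_file(index, b"archive bytes"))
print(match_file(index, b"something else"))
```

An exact hash only matches byte-identical files, which is presumably why fingerprints are mentioned alongside hashes.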

And now some answers to questions originally asked by @sschuberth:

> More concretely, this does not download the source code of a Python package to run
ScanCode over it.

Correct. The assumption with ScanCode proper (aside from the other in-progress tools
that I mentioned above) is that the deps have been fetched into the code you scan if
you want to scan for deps. Packages will be detected with their declared deps, but the
deps will be neither resolved nor fetched. As a second step, though, we could also
verify that all the declared deps are also present in the scanned code as detected
packages.
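
The "second step" mentioned above could look something like the following sketch, which compares declared dependency names with the names of the packages detected in the same scan. The shape of the package records (`name` and `dependencies` keys) and the `missing_dependencies` helper are assumptions for illustration only; version constraints are ignored.

```python
from typing import Dict, List


def missing_dependencies(packages: List[dict]) -> Dict[str, List[str]]:
    """Map each package name to its declared deps not detected in the scan."""
    detected = {p["name"] for p in packages}
    missing = {}
    for package in packages:
        absent = sorted(
            dep for dep in package.get("dependencies", []) if dep not in detected
        )
        if absent:
            missing[package["name"]] = absent
    return missing


# Hypothetical scan results: `app` declares two deps, only one was detected.
detected_packages = [
    {"name": "app", "dependencies": ["requests", "click"]},
    {"name": "requests", "dependencies": []},
]
print(missing_dependencies(detected_packages))
```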

> This should be made very clear as this means cases where the license from the
metadata is wrong compared to the LICENSE file in the source code will not get
detected.

Both the metadata and the file-level licenses (such as a header comment or a
`LICENSE` file of sorts) are detected by ScanCode here: the license scan detects the
licenses while the package scan collects the asserted licensing in the metadata. The
interesting thing about this combination is that possible conflicts (or incomplete
data) can then be analyzed and a deduction should be doable automatically: given a
scan for packages, licenses and copyrights, does the license asserted/declared in the
package metadata match the licenses actually detected? If not, this could be
reported as some "error" condition. Furthermore, this could be refined based on
classification of the files: a package may assert a top-level `MIT` license and use a
GPL-licensed build script. By knowing that the build script is indeed a build script,
we could then report that the GPL detected in such a script does not conflict with the
overall asserted MIT license of the package. The same could be done with test
scripts/code, or documentation code (such as doxygen-generated docs).
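
As a rough sketch of the deduction described above, the following compares the asserted license of a package with the licenses detected in its files, skipping files classified as build, test or documentation code. The record layout, the category names and the `license_conflicts` helper are hypothetical and chosen only for this example; real license comparison would need to handle license expressions and compatibility rather than plain string equality.

```python
from typing import List, Tuple

# Hypothetical file categories whose licenses do not affect the package license.
IGNORED_CATEGORIES = {"build", "test", "docs"}


def license_conflicts(
    asserted_license: str, file_scans: List[dict]
) -> List[Tuple[str, List[str]]]:
    """Return (path, unexpected licenses) for files that contradict the assertion."""
    conflicts = []
    for scanned in file_scans:
        if scanned.get("category") in IGNORED_CATEGORIES:
            # e.g. a GPL-licensed build script in an MIT-licensed package
            continue
        unexpected = sorted(set(scanned["licenses"]) - {asserted_license})
        if unexpected:
            conflicts.append((scanned["path"], unexpected))
    return conflicts


files = [
    {"path": "src/core.py", "licenses": ["MIT"], "category": "source"},
    {"path": "build.sh", "licenses": ["GPL-2.0"], "category": "build"},
    {"path": "src/vendored.py", "licenses": ["Apache-2.0"], "category": "source"},
]
print(license_conflicts("MIT", files))
```

Here only `src/vendored.py` would be reported, since the GPL-licensed `build.sh` is classified as a build script.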

> Moreover, licenses from transitive dependencies are not taking into account.

If the transitive dependencies have been resolved and their code is present in the
codebase, then they would be caught by a static ScanCode scan and scanned
for package metadata and/or licenses. There are some caveats that would
need to be dealt with of course, as some tools (e.g. Maven) may not store the
corresponding artifacts/JARs locally (e.g. side-by-side with a given checkout) and
instead use a "global" dot directory in the user's home (e.g. `~/.m2`) to store a cache.

Beyond this, actual dependency resolution of a single package or a complete manifest
would be the topic of another tool, as mentioned above.
