| 1 | +The purpose of `packagedcode` is to: |
| 2 | + |
| 3 | +- detect a package, |
| 4 | +- determine its dependencies, |
| 5 | +- detect its asserted license (at the metadata level) vs. its actual licensing (as scanned). |
| 6 | + |
| 7 | + |
| 8 | +1. **detect the presence of a package** in a codebase based on its manifest, its file |
| 9 | +or archive type. Typically it is a third-party package but it may be your own too. |
| 10 | +Taking Python as the main example, a package can exist in multiple forms: |
| 11 | + |
| 12 | + 1.1. as a **source checkout** (or some source archive such as a source |
| 13 | + distribution or an `sdist`) where the presence of a `setup.py` or some |
| 14 | + `requirements.txt` file is the key marker for Python. For Maven it would be a |
| 15 | + `pom.xml`, for Gradle a `build.gradle` file, for Ruby a `Gemfile` or `Gemfile.lock`, |
| 16 | + for others the presence of autotools files, and so on, with the goal of eventually |
| 17 | + covering all the package formats/types that are out there and commonly used. |
| 18 | + |
| 19 | + 1.2. as an **installable archive or binary** such as a PyPI wheel `.whl` or |
| 20 | + `.egg`, a Maven `.jar`, a Ruby `.gem`, a `.nupkg` for NuGet, a `.rpm` or `.deb` |
| 21 | + Linux package, etc. Here the type, shape and name structure of an archive as |
| 22 | + well as the content of some of its files are the key markers for detection. The metadata |
| 23 | + may also be included in that archive as a file or as some headers (e.g. RPMs). |
| 24 | + |
| 25 | + 1.3. as an **installed package** such as when you `pip install` a Python package |
| 26 | + or `bundle install` Ruby gems or `npm install` node modules. Here the key markers |
| 27 | + may be some combination of a typical or conventional directory layout and the presence of |
| 28 | + specific files such as the metadata installed with a Python `wheel`, a `vendor` |
| 29 | + directory for Ruby, some `node_modules` directory tree for npm packages, or a certain |
| 30 | + file type with metadata such as Windows DLLs. Additional markers may also include |
| 31 | + "namespaces" such as Java or Python imports or C/C++ namespace declarations. |
| 32 | + |
| 33 | +2. **parse and collect the package manifest(s)** metadata. For Python, this means |
| 34 | +extracting name, version, authorship, asserted licensing and declared dependencies as |
| 35 | +found in any of the package descriptor files (e.g. a `setup.py` file, |
| 36 | +`requirements` file(s) or any of the `*.dist-info` or `*.egg-info` dir files such as |
| 37 | +a `metadata.json`). Other package formats have their own metadata that may be more or |
| 38 | +less comprehensive in the breadth and depth of information they offer (e.g. |
| 39 | +`.nuspec`, `package.json`, `bower.json`, Godeps, etc.). These metadata include the |
| 40 | +declared dependencies (and in some cases the fully resolved dependencies too such as |
| 41 | +with `Gemfile.lock`). Finally, all the different package formats and data are |
| 42 | +normalized and stored in a common data structure abstracting the small differences of |
| 43 | +naming and semantics that may exist between all the different package formats. |
| 44 | + |
| 45 | +Once collected, these data are then injected in a `packages` section of the scan. |
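As a rough illustration of steps 1 and 2, here is a minimal sketch of marker-based detection followed by normalization into one common structure. The marker tables, the `Package` fields and the `detect_package_type` and `normalize_npm` helpers are hypothetical names made up for this example and are not the actual `packagedcode` API. It also assumes the `license` field of a `package.json` is a plain string.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from typing import List, Optional

# Hypothetical marker tables (illustrative only): manifest file names and
# archive extensions mapped to a package type.
MANIFEST_MARKERS = {
    "setup.py": "pypi",
    "requirements.txt": "pypi",
    "pom.xml": "maven",
    "build.gradle": "gradle",
    "Gemfile": "gem",
    "Gemfile.lock": "gem",
    "package.json": "npm",
    "bower.json": "bower",
}
ARCHIVE_MARKERS = {
    ".whl": "pypi",
    ".egg": "pypi",
    ".jar": "maven",
    ".gem": "gem",
    ".nupkg": "nuget",
    ".rpm": "rpm",
    ".deb": "deb",
}


@dataclass
class Package:
    """Hypothetical common structure shared by all package formats."""
    type: str
    path: str
    name: Optional[str] = None
    version: Optional[str] = None
    asserted_license: Optional[str] = None
    dependencies: List[str] = field(default_factory=list)


def detect_package_type(path: str) -> Optional[str]:
    """Return a package type based on the file name or extension, or None."""
    p = PurePosixPath(path)
    if p.name in MANIFEST_MARKERS:
        return MANIFEST_MARKERS[p.name]
    return ARCHIVE_MARKERS.get(p.suffix.lower())


def normalize_npm(path: str, data: dict) -> Package:
    """Map already-parsed package.json data to the common structure."""
    return Package(
        type="npm",
        path=path,
        name=data.get("name"),
        version=data.get("version"),
        asserted_license=data.get("license"),
        dependencies=sorted(data.get("dependencies", {})),
    )
```

Each format would get its own small normalizer like `normalize_npm`, so that the `packages` section of a scan can hold entries of a single shape whatever the original manifest looked like.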
| 46 | + |
| 47 | +What code in `packagedcode` is not meant to do: |
| 48 | + |
| 49 | +A. **download packages** from a third-party repository: there is upcoming code in |
| 50 | +another tool that will deal specifically with this and will also handle collecting |
| 51 | +the metadata as served by a package repository (which is in most cases --but not |
| 52 | +always-- the same as what is declared in the manifests). |
| 53 | + |
| 54 | +B. **resolve dependencies**: the focus here is on a purely static analysis that does not |
| 55 | +rely on any network access at runtime by design. To scan for actually used |
| 56 | +dependencies, the process is instead to scan an as-built, as-installed or |
| 57 | +as-deployed codebase where the dependencies have already been provisioned and installed; |
| 58 | +ScanCode would detect them there. |
| 59 | +There is also an upcoming prototype for a dynamic multi-package dependency |
| 60 | +resolver that actually runs the proper tool live to resolve and collect dependencies |
| 61 | +(e.g. effectively running Maven, bundler, pip, npm, gradle, bower, go get/dep, etc.). |
| 62 | +This will be a tool separate from ScanCode as this requires having several/all |
| 63 | +package managers installed (and possibly multiple versions of each) and may run code |
| 64 | +from the codebase (e.g. a setup.py) and access the network for fetching or resolving |
| 65 | +dependencies. It could also be exposed as a web service that can take in a manifest |
| 66 | +and package and safely run the dependency resolution in an isolated environment (e.g. a |
| 67 | +chroot jail or docker container) and return the collected dependencies. |
| 68 | + |
| 69 | +C. **match packages** (and files) to actual repositories or registries, e.g. given a |
| 70 | +scan that detected packages, matching would mean looking them up in a remote package |
| 71 | +repository or a local index, possibly using A. and/or B. additionally if needed. |
| 72 | +Here again there is some upcoming code and tooling that will deal specifically with |
| 73 | +this aspect and would also handle building an index of actual registries/repositories |
| 74 | +and matching using hashes and fingerprints. |
| 75 | + |
| 76 | +And now some answers to questions originally asked by @sschuberth: |
| 77 | + |
| 78 | +> More concretely, this does not download the source code of a Python package to run |
| 79 | +ScanCode over it. |
| 80 | + |
| 81 | +Correct. The assumption with ScanCode proper (aside from the other in-progress tools |
| 82 | +that I mentioned above) is that the deps have been fetched in the code you scan if |
| 83 | +you want to scan for deps. Packages will be detected with their declared deps but the |
| 84 | +deps will neither be resolved nor fetched. Though, as a second step we could also |
| 85 | +verify that all the declared deps are also present in the scanned code as detected |
| 86 | +packages. |
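The "second step" mentioned above could look roughly like the sketch below, which checks whether every declared dependency also shows up as a detected package in the same scan. The input shape (a list of dicts with `name` and `dependencies` keys) and the `missing_dependencies` helper are assumptions made for this illustration, not the actual scan output format. Matching is by name only and ignores version constraints.

```python
def missing_dependencies(packages):
    """Map each package name to its declared deps not detected in the scan.

    `packages` is a hypothetical list of dicts such as
    {"name": "app", "dependencies": ["requests"]}.
    """
    detected = {pkg["name"] for pkg in packages}
    missing = {}
    for pkg in packages:
        absent = sorted(
            dep for dep in pkg.get("dependencies", []) if dep not in detected
        )
        if absent:
            missing[pkg["name"]] = absent
    return missing
```

An empty result would suggest that the scanned codebase carries all of its declared dependencies. It says nothing about transitive dependencies that no manifest declares.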
| 87 | + |
| 88 | +> This should be made very clear as this means cases where the license from the |
| 89 | +metadata is wrong compared to the LICENSE file in the source code will not get |
| 90 | +detected. |
| 91 | + |
| 92 | +Both the metadata and the file level licenses (such as a header comment or a |
| 93 | +`LICENSE` file of sorts) are detected by ScanCode here: the license scan detects the |
| 94 | +licenses while the package scan collects the asserted licensing in the metadata. The |
| 95 | +interesting thing thanks to this combo is that possible conflicts (or incomplete |
| 96 | +data) can then be analyzed and a deduction should be doable automatically: given a |
| 97 | +scan for packages and licenses and copyrights, does the license asserted/declared in |
| 98 | +the package metadata match the actual detected licenses? If not, this could be |
| 99 | +reported as some "error" condition... Furthermore, this could be refined based on |
| 100 | +classification of the files: a package may assert a top level `MIT` license and use a |
| 101 | +GPL-licensed build script. By knowing that the build script is indeed a build script, |
| 102 | +we could then report that the GPL detected in such script is not conflicting with the |
| 103 | +overall asserted MIT license of the package. The same could be done with test |
| 104 | +scripts/code, or documentation code (such as doxygen-generated docs). |
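The asserted-versus-detected comparison described above can be sketched as follows. The record shape (`path`, `category`, `licenses`), the category names and the `license_conflicts` helper are all hypothetical and chosen only for this example. Real license comparison would also need to handle license expressions and compatibility rather than plain string equality.

```python
# Hypothetical file categories whose licenses are assumed not to affect the
# overall license of the package (build scripts, tests, documentation).
IGNORED_CATEGORIES = {"build", "test", "doc"}


def license_conflicts(asserted, file_scans):
    """Return (path, licenses) pairs where detected licenses differ from the
    asserted package license, skipping files in ignored categories.
    """
    conflicts = []
    for scanned in file_scans:
        if scanned.get("category") in IGNORED_CATEGORIES:
            continue
        extra = set(scanned.get("licenses", [])) - {asserted}
        if extra:
            conflicts.append((scanned["path"], sorted(extra)))
    return conflicts
```

With an asserted `mit` license, a GPL-licensed file classified as `build` would be skipped, while an unexpected license in a regular source file would be reported.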
| 105 | + |
| 106 | +> Moreover, licenses from transitive dependencies are not taking into account. |
| 107 | + |
| 108 | +If the transitive dependencies have been resolved and their code is present in the |
| 109 | +codebase, then they would be caught by a static ScanCode scan and then scanned |
| 110 | +for package metadata and/or licenses. There are some caveats that would |
| 111 | +need to be dealt with of course as some tools (e.g. Maven) may not store the |
| 112 | +corresponding artifacts/Jars locally (e.g. side-by-side with a given checkout) and |
| 113 | +instead use a `~/user` "global" dot directory to store a cache. |
| 114 | + |
| 115 | +Beyond this, actual dependency resolution of a single package or a complete manifest |
| 116 | +would be the topic of another tool as mentioned above. |