Add sbom generation tooling (#2232) #1
Conversation
Pull request overview
Adds Bazel-native SBOM generation for SCORE modules, including SPDX 2.3 + CycloneDX 1.6 emitters, metadata collection via a module extension/aspect, and supporting scripts + fixtures/tests.
Changes:
- Introduces Bazel rules/aspect/extension to collect dependency metadata and generate SPDX/CycloneDX SBOM outputs.
- Adds Python generators/formatters plus helper scripts (crate metadata cache, C++ metadata cache, SPDX→GitHub Dependency Submission snapshot).
- Adds a comprehensive test suite with real fixtures and expanded README/setup documentation.
Reviewed changes
Copilot reviewed 43 out of 45 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/test_spdx_to_github_snapshot.py | Unit tests for SPDX→GitHub snapshot conversion logic. |
| tests/test_spdx_formatter.py | Unit tests for SPDX 2.3 JSON generation and license normalization. |
| tests/test_real_sbom_integration.py | Integration tests generating SBOMs from real fixture inputs (includes online validator check). |
| tests/test_generate_crates_metadata_cache.py | Tests for parsing dash-license-scan output, MODULE.bazel.lock crate extraction, and synthetic Cargo.lock generation. |
| tests/test_generate_cpp_metadata_cache.py | Tests for converting cdxgen CycloneDX output into a C++ metadata cache. |
| tests/test_cyclonedx_formatter.py | Unit tests for CycloneDX 1.6 JSON generation and license encoding rules. |
| tests/test_cpp_enrich_checksum.py | Tests for C++ cache enrichment and enforcing no manual curation in cpp_metadata.json. |
| tests/test_bcr_known_licenses.py | Tests for BCR known-license fallback table and license-application priority behavior. |
| tests/fixtures/sbom_metadata.json | Fixture metadata input for integration tests. |
| tests/fixtures/reference_integration.MODULE.bazel.lock | Fixture lockfile slice used for module version/hash enrichment tests. |
| tests/fixtures/orchestrator_cdxgen.cdx.json | Fixture cdxgen output for orchestrator integration test path. |
| tests/fixtures/kyron_cdxgen.cdx.json | Fixture cdxgen output for kyron integration test path. |
| tests/fixtures/baselibs_input.json | Fixture Bazel aspect output for baselibs integration scenario. |
| tests/__init__.py | Declares tests as a Python package. |
| tests/BUILD | Bazel pytest targets for the SBOM test suite. |
| scripts/spdx_to_github_snapshot.py | Implements SPDX 2.3 → GitHub Dependency Submission snapshot conversion. |
| scripts/generate_crates_metadata_cache.py | Script to build Rust crate metadata cache via lockfiles + dash-license-scan + crates.io API. |
| scripts/generate_cpp_metadata_cache.py | Script to convert cdxgen CycloneDX output into cpp_metadata.json cache format. |
| scripts/BUILD.bazel | Bazel py_library targets for scripts. |
| npm_wrapper.sh | Shell wrapper intended to run npm/cdxgen from Bazel actions. |
| internal/rules.bzl | Core sbom_rule implementation wiring aspect outputs + generator actions (+ optional cache/cdxgen generation). |
| internal/providers.bzl | Defines SbomDepsInfo and SbomMetadataInfo providers. |
| internal/metadata_rule.bzl | Rule wrapper to expose metadata JSON produced by the extension. |
| internal/generator/utils.py | Shared utility for SPDX license operator normalization. |
| internal/generator/spdx_formatter.py | SPDX 2.3 JSON formatter implementation. |
| internal/generator/sbom_generator.py | Main SBOM generator entry point (resolving components, enrichment, writing outputs). |
| internal/generator/cyclonedx_formatter.py | CycloneDX 1.6 JSON formatter implementation (components + dependency graph). |
| internal/generator/__init__.py | Declares the generator package. |
| internal/generator/BUILD | Bazel targets for generator binaries/libraries. |
| internal/aspect.bzl | Aspect collecting transitive deps and external dependency edges. |
| internal/__init__.py | Declares the internal package. |
| internal/BUILD | Exports internal bzl implementation files. |
| extensions.bzl | Module extension collecting dependency metadata (modules/http_archives/git/crates/licenses). |
| defs.bzl | Public sbom() macro API wrapping the rule. |
| cpp_metadata.json | Initializes C++ metadata cache file (empty). |
| README.md | Expanded setup/usage/architecture documentation for SBOM tooling. |
| MODULE.bazel | Declares module metadata and Python toolchain deps for this repo. |
| BUILD.bazel | Exports public SBOM API files and provides npm wrapper sh_binary. |
| .gitignore | Ignores Bazel outputs, lockfile, and Python bytecode caches. |
| .bazelrc | Adds registries and Java toolchain settings for builds. |
fixed the issues
Pull request overview
Copilot reviewed 43 out of 45 changed files in this pull request and generated 7 comments.
```python
# source.json entry – carries the sha256 of the downloaded source
# tarball for this module@version. Use it as the component hash.
source_match = re.search(
    r"/modules/([^/]+)/([^/]+)/source\.json$",
    url,
```
The code/comment assumes the registryFileHashes hash for .../source.json is the SHA-256 of the module source tarball. In Bazel lockfiles, registryFileHashes hashes the registry files themselves, so exposing this as the component artifact checksum makes the SBOM hash misleading.
Consider omitting module checksums from this source, or fetching/parsing source.json to extract the archive integrity/sha256 (with an explicit networked step).
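The explicit networked step could look roughly like the sketch below. It assumes the Bazel Central Registry `source.json` layout, where the archive checksum is stored as an SRI-style `integrity` field (`"sha256-<base64 digest>"`); the function name is illustrative, not part of the PR.

```python
import base64
import json


def sha256_from_source_json(text):
    """Return the hex sha256 of the module source archive, or None.

    Assumes the BCR source.json layout: the archive checksum is an
    SRI-style "integrity" field such as "sha256-<base64 digest>".
    """
    data = json.loads(text)
    integrity = data.get("integrity", "")
    if not integrity.startswith("sha256-"):
        return None
    # SRI digests are base64; SPDX/CycloneDX checksums want hex.
    return base64.b64decode(integrity[len("sha256-"):]).hex()
```

Unlike the `registryFileHashes` entry, this hash genuinely describes the downloaded source tarball, so it is safe to expose as the component checksum.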
```starlark
cache_cmd = "set -euo pipefail\npython3 {} {}".format(
    ctx.file._crates_cache_script.path,
    crates_cache.path,
)
if ctx.file.cargo_lockfile:
    cache_inputs.append(ctx.file.cargo_lockfile)
    cache_cmd += " --cargo-lock {}".format(ctx.file.cargo_lockfile.path)
for lock in ctx.files.module_lockfiles:
    cache_inputs.append(lock)
    cache_cmd += " --module-lock {}".format(lock.path)
ctx.actions.run_shell(
    inputs = cache_inputs,
    outputs = [crates_cache],
    command = cache_cmd,
    mnemonic = "CratesCacheGenerate",
    progress_message = "Generating crates metadata cache for %s" % ctx.attr.name,
    execution_requirements = {"requires-network": ""},
    use_default_shell_env = True,
```
The crates cache generation action shells out to python3 ..., which bypasses Bazel's configured Python toolchain and can fail in hermetic/remote execution environments (or when python3 isn't on PATH). Prefer running a declared executable (e.g., a py_binary target) via ctx.actions.run, or use the Python toolchain runtime explicitly, so the action is portable and reproducible.
Suggested change (replace the `run_shell` block with a `ctx.actions.run` of the declared executable):

```starlark
cache_args = ctx.actions.args()
cache_args.add(crates_cache.path)
if ctx.file.cargo_lockfile:
    cache_inputs.append(ctx.file.cargo_lockfile)
    cache_args.add("--cargo-lock", ctx.file.cargo_lockfile.path)
for lock in ctx.files.module_lockfiles:
    cache_inputs.append(lock)
    cache_args.add("--module-lock", lock.path)
ctx.actions.run(
    inputs = cache_inputs,
    outputs = [crates_cache],
    executable = ctx.executable._crates_cache_script,
    arguments = [cache_args],
    mnemonic = "CratesCacheGenerate",
    progress_message = "Generating crates metadata cache for %s" % ctx.attr.name,
    execution_requirements = {"requires-network": ""},
```
```starlark
ctx.actions.run(
    outputs = [cdxgen_sbom],
    executable = ctx.executable._npm,
    arguments = [
        "exec",
```
The auto_cdxgen action expects npm/cdxgen to be discoverable via PATH (and optionally NVM via $NVM_DIR), but this ctx.actions.run() invocation doesn't opt into inheriting the host environment. Without use_default_shell_env = True (or an explicit env), this can fail to locate system-installed Node/npm/cdxgen in common setups (e.g., nvm installs under $HOME).
```text
License metadata is collected automatically:
- Rust crates: from crates_metadata.json cache (bundled with tooling)
- C++ deps: from cpp_metadata.json cache (bundled with tooling)
- Bazel modules: version/PURL auto-extracted from module graph
```
This docstring says crate/C++ license metadata is collected automatically from the bundled crates_metadata.json/cpp_metadata.json caches, but the current rule/generator flow only loads crates metadata when --crates-cache is explicitly passed (and there is no codepath that reads the bundled files by default). Either wire the bundled cache files into the rule/generator defaults, or update the documentation here to match the actual behavior.
Suggested change:

```text
License metadata can be collected automatically when the corresponding
inputs or caches are provided:
- Rust crates: from a crates metadata cache generated from Cargo.lock
  and/or MODULE.bazel.lock files (see cargo_lockfile/module_lockfiles and
  auto_crates_cache)
- C++ deps: optionally enriched from a CycloneDX SBOM produced by cdxgen
  (see cdxgen_sbom/auto_cdxgen)
- Bazel modules: version/PURL auto-extracted from the Bazel module graph
```
```text
SBOM generation. License metadata is collected automatically from
bundled caches (crates_metadata.json, cpp_metadata.json).
```
The module extension docstring claims license metadata is collected automatically from bundled caches (crates_metadata.json, cpp_metadata.json), but the extension implementation does not read those files. This is misleading for consumers; either implement the cache loading/merging here (or in the generator) or adjust this header comment to reflect the actual data sources.
Suggested change:

```text
SBOM generation. License and other dependency metadata are consumed
from pre-generated metadata (for example, JSON caches passed via
the metadata_content attribute).
```
```python
if cdx_data.get("bomFormat") != "CycloneDX":
    print("Error: Input is not a CycloneDX JSON file", file=sys.stderr)
    sys.exit(1)
```
convert_cdxgen_to_cache() calls sys.exit(1) on non-CycloneDX input. Since this function is part of the exported py_library (and is imported in tests), exiting the interpreter makes it hard to reuse safely as a library function. Prefer raising a typed exception (e.g., ValueError) or returning {} and letting main() handle the CLI exit code.
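The library-friendly variant could be sketched as follows. The function body here is a hypothetical reduction (only the format check is shown, the real conversion logic is elided); the point is the division of labor between the library function and the CLI entry point:

```python
import sys


def convert_cdxgen_to_cache(cdx_data):
    """Raise instead of exiting, so the function is safe to import."""
    if cdx_data.get("bomFormat") != "CycloneDX":
        raise ValueError("Input is not a CycloneDX JSON file")
    return {}  # real conversion elided in this sketch


def main():
    try:
        cache = convert_cdxgen_to_cache({"bomFormat": "SPDX"})
    except ValueError as err:
        # Only the CLI entry point owns the process exit code;
        # library callers (and tests) just catch the exception.
        print("Error: {}".format(err), file=sys.stderr)
        return 1
    return 0
```

With this shape, tests can assert on the raised `ValueError` directly instead of guarding against the interpreter exiting.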
```python
uvx,
"--from",
"dash-license-scan@git+https://github.com/eclipse-score/dash-license-scan",
"dash-license-scan",
"--summary",
```
run_dash_license_scan() installs/runs dash-license-scan from an unpinned Git URL (dash-license-scan@git+https://...). That makes results non-reproducible over time and increases supply-chain risk for a networked build-time action. Consider pinning to a specific tag/commit SHA, or using a versioned/published artifact if available.
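A pinned invocation could be built along these lines. `DASH_SCAN_REV` is a placeholder (a real tag or commit SHA must be substituted) and the helper name is illustrative; pip-style requirement specs accept `name@git+<url>@<rev>`, which `uvx --from` consumes:

```python
# Placeholder: substitute a real tag or commit SHA of
# eclipse-score/dash-license-scan before use.
DASH_SCAN_REV = "TAG_OR_COMMIT_SHA"


def dash_scan_cmd(uvx, *extra_args):
    """Build the uvx invocation with the tool pinned to a fixed ref."""
    spec = (
        "dash-license-scan@git+https://github.com/eclipse-score/"
        "dash-license-scan@" + DASH_SCAN_REV
    )
    return [uvx, "--from", spec, "dash-license-scan", "--summary", *extra_args]
```

Pinning to an immutable commit SHA (rather than a movable tag or branch) gives the strongest reproducibility guarantee for a networked build-time action.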
AlexanderLanin left a comment:
all copilot findings not critical, merging as-is. can be improved afterwards.
@Lukasz-Juranek, thank you for the contribution and congrats on the first PR merged in this repo
This PR adds SBOM Bazel rules for SCORE modules. For setup details, see the README: https://github.com/Lukasz-Juranek/sbom-tool/blob/feat/issue-2232-sbom-init/README.md
Old discussion is under eclipse-score/tooling#106
Example SBOMs generated by the tooling and validated with https://sbomgenerator.com/tools/validator:
sbom_feo.cdx.json
sbom_feo.spdx.json
sbom_orchestrator.cdx.json
sbom_orchestrator.spdx.json
sbom_baselibs.cdx.json
sbom_baselibs.spdx.json
Documentation added in eclipse-score/score#2672