Commit fd1b980

rogu-beta and tdruez authored
Prioritize hashes and download URL for PurlDB mapping (#430)
* Prioritize hashes and download URL for PurlDB mapping

To map a package in DejaCode accurately to PurlDB entries, the patched query prioritizes hashes. This is needed when the same PURL (without query parameters) has several different download URLs, as is the case with Python packages and with binaries built for different hardware architectures or interpreter versions.

Lookups for SHA-256 and MD5 are also added, since SHA-1 may not always be populated. Hashes from SBOM imports, generated by tools such as cdxgen, commonly no longer use SHA-1, which is a mostly deprecated hashing algorithm due to the risk of hash collisions. SHA-512 could not be added yet because PurlDB does not support a lookup for it.

The order of prioritization is as follows: hashes identify the content of the package most accurately; the download URL at least points to the download location, which still allows differentiating between target architectures; and the PURL itself is the fallback when no fully accurate match is found.

The results are then filtered by checking that the PURLs match. The query parameters are now also stripped from the PurlDB PURL, since it may contain them too, which previously caused matches to be missed.

For reference, see the following issues: #307 #383

Signed-off-by: Robert Guetzkow <[email protected]>

* Update component_catalog/models.py

Remove code duplication and reduce database queries to a single one

Signed-off-by: tdruez <[email protected]>

* Add details about matching order in docstring

Signed-off-by: tdruez <[email protected]>

---------

Signed-off-by: Robert Guetzkow <[email protected]>
Signed-off-by: tdruez <[email protected]>
Co-authored-by: tdruez <[email protected]>
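The PURL comparison described in the commit message is not part of the diff shown on this page, so the following is only a minimal sketch of the idea, not the code DejaCode uses. The helper names `strip_purl_qualifiers` and `same_package` are hypothetical. It assumes that the query parameters (qualifiers) start at the first `?` of the PURL string, and that comparing the remaining prefixes is enough to decide that two PURLs refer to the same package.

```python
def strip_purl_qualifiers(purl: str) -> str:
    """Return the PURL up to, but not including, the first '?'.

    Hypothetical helper for illustration. Everything after the '?'
    (the qualifiers, plus any subpath that follows them) is dropped.
    """
    return purl.split("?", 1)[0]


def same_package(local_purl: str, remote_purl: str) -> bool:
    """Compare two PURLs while ignoring query parameters on both sides."""
    return strip_purl_qualifiers(local_purl) == strip_purl_qualifiers(remote_purl)


if __name__ == "__main__":
    local = "pkg:pypi/django@5.0"
    remote = "pkg:pypi/django@5.0?file_name=django-5.0.tar.gz"
    print(same_package(local, remote))
```

Stripping the qualifiers on both sides means that a PurlDB entry carrying extra qualifiers can still match a local PURL that has none.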
1 parent 74e9e5e commit fd1b980

1 file changed: +20 -9 lines changed
component_catalog/models.py

Lines changed: 20 additions & 9 deletions
@@ -2506,9 +2506,16 @@ def create_from_url(cls, url, user):
         if download_url and not purldb_data:
             package_data = collect_package_data(download_url)
 
-        if sha1 := package_data.get("sha1"):
-            if sha1_match := scoped_packages_qs.filter(sha1=sha1):
-                package_link = sha1_match[0].get_absolute_link()
+        # Check for existing package by hash fields with a single database query
+        hash_fields = ["sha512", "sha256", "sha1", "md5"]
+        hash_filters = models.Q()
+        for hash_field in hash_fields:
+            if hash_value := package_data.get(hash_field):
+                hash_filters |= models.Q(**{hash_field: hash_value})
+
+        if hash_filters:
+            if package_match := scoped_packages_qs.filter(hash_filters).first():
+                package_link = package_match.get_absolute_link()
                 raise PackageAlreadyExistsWarning(
                     f"{url} already exists in your Dataspace as {package_link}"
                 )
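The hunk above replaces a SHA-1-only check with one OR-combined filter over every populated hash field. As a rough illustration of that matching rule outside Django, the sketch below applies the same logic to plain dictionaries. The function `find_existing_package` and the dict-based package records are assumptions made for this example; the actual change builds `models.Q` objects and runs a single database query.

```python
HASH_FIELDS = ["sha512", "sha256", "sha1", "md5"]


def find_existing_package(package_data, existing_packages):
    """Return the first existing package sharing any populated hash, else None.

    Hypothetical stand-in for the queryset filter: `package_data` and the
    items of `existing_packages` are plain dicts keyed by hash field name.
    """
    # Keep only the hash fields that actually carry a value.
    wanted = {
        field: package_data[field]
        for field in HASH_FIELDS
        if package_data.get(field)
    }
    if not wanted:
        # No hashes available: skip the lookup, as the empty-filter guard does.
        return None

    for package in existing_packages:
        # A match on any single hash field is enough (logical OR).
        if any(package.get(field) == value for field, value in wanted.items()):
            return package
    return None


if __name__ == "__main__":
    existing = [
        {"name": "a", "sha1": "111", "md5": "aaa"},
        {"name": "b", "sha256": "222"},
    ]
    print(find_existing_package({"sha256": "222", "sha1": ""}, existing))
```

Skipping the lookup when no hash is populated matters: an empty filter would otherwise match every package in the scope.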
@@ -2527,10 +2534,10 @@ def get_purldb_entries(self, user, max_request_call=0, timeout=10):
         """
         Return the PurlDB entries that correspond to this Package instance.
 
-        Matching on the following fields order:
-        - Package URL
-        - SHA1
-        - Download URL
+        Matching is performed in order of decreasing accuracy:
+        1. Hash - Most accurate, matches exact file content
+        2. Download URL - High accuracy, matches specific package source
+        3. Package URL - Broadest match, may return multiple versions/variants
 
         A `max_request_call` integer can be provided to limit the number of
         HTTP requests made to the PackageURL server.
@@ -2542,12 +2549,16 @@ def get_purldb_entries(self, user, max_request_call=0, timeout=10):
         purldb_entries = []
 
         package_url = self.package_url
-        if package_url:
-            payloads.append({"purl": package_url})
+        if self.sha256:
+            payloads.append({"sha256": self.sha256})
         if self.sha1:
             payloads.append({"sha1": self.sha1})
+        if self.md5:
+            payloads.append({"md5": self.md5})
         if self.download_url:
             payloads.append({"download_url": self.download_url})
+        if package_url:
+            payloads.append({"purl": package_url})
 
         purldb = PurlDB(user.dataspace)
         for index, payload in enumerate(payloads):
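The reordering in this hunk can be summarized as a small standalone function. This is a sketch under assumptions: `build_lookup_payloads` is a hypothetical name, and it takes the field values as keyword arguments rather than reading them from a `Package` instance as the patched method does. It only shows the resulting lookup order (hashes first, then download URL, then PURL) and that empty fields are skipped.

```python
def build_lookup_payloads(sha256="", sha1="", md5="", download_url="", package_url=""):
    """Return one lookup payload per populated field, most accurate first.

    Hypothetical helper mirroring the order in the diff:
    sha256, sha1, md5, download_url, purl.
    """
    candidates = [
        ("sha256", sha256),
        ("sha1", sha1),
        ("md5", md5),
        ("download_url", download_url),
        ("purl", package_url),
    ]
    # Empty values are dropped so that no lookup is issued for them.
    return [{field: value} for field, value in candidates if value]


if __name__ == "__main__":
    payloads = build_lookup_payloads(
        sha1="abc",
        download_url="https://example.com/pkg-1.0.tar.gz",
        package_url="pkg:pypi/pkg@1.0",
    )
    for payload in payloads:
        print(payload)
```

Because the payloads are tried in list order, a package with any populated hash is looked up by content before the broader download URL and PURL lookups are considered.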
