-
Notifications
You must be signed in to change notification settings - Fork 2
Description
We violate our internal assumptions when we encounter multiple Repos with the same repo_url.
SELECT
v.id,
v.canonical_id,
r.display_name,
r.repo_url,
r.project_id
FROM repo_vertices v
JOIN repos r ON v.id = r.id
WHERE r.repo_url = 'https://github.com/stellarcarbon/sc-website';Output after running the v0.1 reference graph bootstrap pipeline:
id | canonical_id | display_name | repo_url | project_id
------+-------------------------------------+--------------+---------------------------------------------+------------
1837 | pkg:github/stellarcarbon/sc-website | sc-website | https://github.com/stellarcarbon/sc-website | 38
1841 | pkg:npm/sc-website | sc-website | https://github.com/stellarcarbon/sc-website | 38
(2 rows)Where we should be recording a single Repo with a known release package, we currently have two Repos that represent the same entity.
This severely affects the current data quality:
SELECT repo_url, COUNT(*) as url_count
FROM repos
GROUP BY repo_url
HAVING COUNT(*) > 1
ORDER BY url_count DESC; repo_url | url_count
-----------------------------------------------------------------------------+-----------
https://github.com/subquery/subql | 37
https://github.com/calimero-network/core | 34
https://github.com/airgap-it/beacon-sdk | 18
https://github.com/bandprotocol/bothan | 14
https://github.com/airgap-it/airgap-coin-lib | 10
https://github.com/bandprotocol/bandchain | 10
https://github.com/loambuild/loam | 8
https://github.com/bandprotocol/owasm | 6
https://github.com/diadata-org/nfttracker | 6
https://github.com/idos-network/idos-sdk-kotlin | 6
https://github.com/bandprotocol/go-band-sdk | 6
https://github.com/airgap-it/airgap-sapling | 4
https://github.com/bandprotocol/chain | 4
https://github.com/getaxal/verified-signer | 4
...
(927 rows)I don't want to add a uniqueness constraint on repo_url yet, because it would interfere with our data quality improvement process. I want to be able to complete a full batch ingestion and subsequently assess the quality after the run. It's much easier to work with DB tables than to parse thousands of logged exceptions.
Since SBOM ingestion is guaranteed to tell us the owner and repo that is the subject, but doesn't surface any of its package identifiers, we must use this as the canonical_id for Repos at all times. In contrast: crawlers may not surface the GitHub PURL immediately, and we can initially create an ExternalRepo in that case, which can be promoted to Repo when we do learn the GitHub association. This is the bug: the queries above show that we have established the repo_url and the ExternalRepo has been promoted to Repo, yet the canonical_id has not been updated accordingly. It must always be changed to the correct form as part of a promotion. If the repo with the true canonical_id already exists, any new information needs to be added to this record.
I knew about this issue, but was blindsided by it while reviewing #6. It's not a blocker to merge that feature branch per se, but it's interfering with the way I want to test it.
After fixing this bug, I want to add the uniqueness constraint on repo_url to prevent regression.