Skip to content

All Repos must have a pkg:github/... PURL as canonical_id #9

@aolieman

Description

@aolieman

We violate our internal assumptions when we encounter multiple Repos with the same repo_url.

SELECT 
    v.id, 
    v.canonical_id,
    r.display_name, 
    r.repo_url, 
    r.project_id
FROM repo_vertices v
JOIN repos r ON v.id = r.id
WHERE r.repo_url = 'https://github.com/stellarcarbon/sc-website';

Output after running the v0.1 reference graph bootstrap pipeline:

  id  |            canonical_id             | display_name |                  repo_url                   | project_id 
------+-------------------------------------+--------------+---------------------------------------------+------------
 1837 | pkg:github/stellarcarbon/sc-website | sc-website   | https://github.com/stellarcarbon/sc-website |         38
 1841 | pkg:npm/sc-website                  | sc-website   | https://github.com/stellarcarbon/sc-website |         38
(2 rows)

Where we should be recording a single Repo with a known release package, we currently have two Repos that represent the same entity.

This severely affects the current data quality:

SELECT repo_url, COUNT(*) as url_count 
FROM repos 
GROUP BY repo_url 
HAVING COUNT(*) > 1
ORDER BY url_count DESC;
                                  repo_url                                   | url_count 
-----------------------------------------------------------------------------+-----------
 https://github.com/subquery/subql                                           |        37
 https://github.com/calimero-network/core                                    |        34
 https://github.com/airgap-it/beacon-sdk                                     |        18
 https://github.com/bandprotocol/bothan                                      |        14
 https://github.com/airgap-it/airgap-coin-lib                                |        10
 https://github.com/bandprotocol/bandchain                                   |        10
 https://github.com/loambuild/loam                                           |         8
 https://github.com/bandprotocol/owasm                                       |         6
 https://github.com/diadata-org/nfttracker                                   |         6
 https://github.com/idos-network/idos-sdk-kotlin                             |         6
 https://github.com/bandprotocol/go-band-sdk                                 |         6
 https://github.com/airgap-it/airgap-sapling                                 |         4
 https://github.com/bandprotocol/chain                                       |         4
 https://github.com/getaxal/verified-signer                                  |         4
...
(927 rows)

I don't want to add a uniqueness constraint on repo_url yet, because it would interfere with our data quality improvement process. I want to be able to complete a full batch ingestion and subsequently assess the quality after the run. It's much easier to work with DB tables than to parse thousands of logged exceptions.

Since SBOM ingestion is guaranteed to tell us the owner and repo that is the subject, but doesn't surface any of its package identifiers, we must use this as the canonical_id for Repos at all times. In contrast: crawlers may not surface the GitHub PURL immediately, and we can initially create an ExternalRepo in that case, which can be promoted to Repo when we do learn the GitHub association. This is the bug: the queries above show that we have established the repo_url and the ExternalRepo has been promoted to Repo, yet the canonical_id has not been updated accordingly. It must always be changed to the correct form as part of a promotion. If the repo with the true canonical_id already exists, any new information needs to be added to this record.

I knew about this issue, but was blindsided by it while reviewing #6. It's not a blocker to merge that feature branch per se, but it's interfering with the way I want to test it.

After fixing this bug, I want to add the uniqueness constraint on repo_url to prevent regression.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingdata-integrityData integrity issues: dataset does not match reality OR structural inconsistency

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions