Skip to content

current insert_facade_contributors logic not linking previously resolved contributors #3779

@MoralCode

Description

@MoralCode

SELECT DISTINCT
commits.cmt_author_name AS NAME,
commits.cmt_commit_hash AS hash,
commits.cmt_author_raw_email AS email_raw,
'not_unresolved' as resolution_status
FROM
commits
WHERE
commits.repo_id = :repo_id
AND (:since_date is NULL OR commits.data_collection_date > :since_date)
AND (NOT EXISTS ( SELECT contributors.cntrb_canonical FROM contributors WHERE contributors.cntrb_canonical = commits.cmt_author_raw_email )
or NOT EXISTS ( SELECT contributors_aliases.alias_email from contributors_aliases where contributors_aliases.alias_email = commits.cmt_author_raw_email)
AND ( commits.cmt_author_name ) IN ( SELECT C.cmt_author_name FROM commits AS C WHERE C.repo_id = :repo_id GROUP BY C.cmt_author_name ))
GROUP BY
commits.cmt_author_name,
commits.cmt_commit_hash,
commits.cmt_author_raw_email
UNION
SELECT DISTINCT
commits.cmt_author_name AS NAME,--commits.cmt_id AS id,
commits.cmt_commit_hash AS hash,
commits.cmt_author_raw_email AS email_raw,
'unresolved' as resolution_status
FROM
commits
WHERE
commits.repo_id = :repo_id
AND (:since_date is NULL OR commits.data_collection_date > :since_date)
AND EXISTS ( SELECT unresolved_commit_emails.email FROM unresolved_commit_emails WHERE unresolved_commit_emails.email = commits.cmt_author_raw_email )
AND ( commits.cmt_author_name ) IN ( SELECT C.cmt_author_name FROM commits AS C WHERE C.repo_id = :repo_id GROUP BY C.cmt_author_name )
GROUP BY
commits.cmt_author_name,
commits.cmt_commit_hash,
commits.cmt_author_raw_email
ORDER BY
hash
""").bindparams(repo_id=repo_id,since_date=last_collected_date)

Query error
this query, per the comments and what i understand of the implementation is getting all of the commit data's emails and names from the commit table that do not appear in the contributors table or the contributors_aliases table.

This works fine for new contributors that are not yet in the contributors table.

Where this fails is if a record slips through (maybe commits were made using an email that later got linked to a github account). In that case, the records with NULL cmt_ght_author_id never get revisited and properly linked to a contributor that IS resolved, just not for the commits that slipped through.

Last Collection Date

since we added a filter for last collection date (added by @IsaacMilarky added it in 8539825, PR #3253 ), this problem is worse, since anything that slipped through or didnt get properly linked will now be systematically ignored by its older last collection date.

When this PR was filed, I called out that the change it was making seemed unrelated to the analyze_commits_in_parallel problem that was being solved at the time (#3253 (comment)).

Since this query is being run as essentially a precondition check to establish what records we should attempt to run contributor resolution on.

currently our logic is to find all email addresses (looking up against two sources, the contributors table and the aliases table, see #3746) haven't been matched yet

That is way less simple than "we should run contributor resolution on all commits we don't currently have linked to a contributor" AKA "is the cmt_ght_author_id NULL?"

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugDocuments unexpected/wrong/buggy behaviorhigh priorityBlocking multiple other things, causing data loss, or other incredibly urgent thingstech debt

    Type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions