-
Notifications
You must be signed in to change notification settings - Fork 995
Description
augur/augur/tasks/github/facade_github/tasks.py
Lines 210 to 246 in 49a008a
| SELECT DISTINCT | |
| commits.cmt_author_name AS NAME, | |
| commits.cmt_commit_hash AS hash, | |
| commits.cmt_author_raw_email AS email_raw, | |
| 'not_unresolved' as resolution_status | |
| FROM | |
| commits | |
| WHERE | |
| commits.repo_id = :repo_id | |
| AND (:since_date is NULL OR commits.data_collection_date > :since_date) | |
| AND (NOT EXISTS ( SELECT contributors.cntrb_canonical FROM contributors WHERE contributors.cntrb_canonical = commits.cmt_author_raw_email ) | |
| or NOT EXISTS ( SELECT contributors_aliases.alias_email from contributors_aliases where contributors_aliases.alias_email = commits.cmt_author_raw_email) | |
| AND ( commits.cmt_author_name ) IN ( SELECT C.cmt_author_name FROM commits AS C WHERE C.repo_id = :repo_id GROUP BY C.cmt_author_name )) | |
| GROUP BY | |
| commits.cmt_author_name, | |
| commits.cmt_commit_hash, | |
| commits.cmt_author_raw_email | |
| UNION | |
| SELECT DISTINCT | |
| commits.cmt_author_name AS NAME,--commits.cmt_id AS id, | |
| commits.cmt_commit_hash AS hash, | |
| commits.cmt_author_raw_email AS email_raw, | |
| 'unresolved' as resolution_status | |
| FROM | |
| commits | |
| WHERE | |
| commits.repo_id = :repo_id | |
| AND (:since_date is NULL OR commits.data_collection_date > :since_date) | |
| AND EXISTS ( SELECT unresolved_commit_emails.email FROM unresolved_commit_emails WHERE unresolved_commit_emails.email = commits.cmt_author_raw_email ) | |
| AND ( commits.cmt_author_name ) IN ( SELECT C.cmt_author_name FROM commits AS C WHERE C.repo_id = :repo_id GROUP BY C.cmt_author_name ) | |
| GROUP BY | |
| commits.cmt_author_name, | |
| commits.cmt_commit_hash, | |
| commits.cmt_author_raw_email | |
| ORDER BY | |
| hash | |
| """).bindparams(repo_id=repo_id,since_date=last_collected_date) |
Query error
this query, per the comments and what i understand of the implementation is getting all of the commit data's emails and names from the commit table that do not appear in the contributors table or the contributors_aliases table.
This works fine for new contributors that are not yet in the contributors table.
Where this fails is if a record slips through (maybe commits were made using an email that later got linked to a github account). In that case, the records with NULL cmt_ght_author_id never get revisited and properly linked to a contributor that IS resolved, just not for the commits that slipped through.
Last Collection Date
since we added a filter for last collection date (added by @IsaacMilarky added it in 8539825, PR #3253 ), this problem is worse, since anything that slipped through or didnt get properly linked will now be systematically ignored by its older last collection date.
When this PR was filed, I called out that the change it was making seemed unrelated to the analyze_commits_in_parallel problem that was being solved at the time (#3253 (comment)).
Since this query is being run as essentially a precondition check to establish what records we should attempt to run contributor resolution on.
currently our logic is to find all email addresses (looking up against two sources, the contributors table and the aliases table, see #3746) haven't been matched yet
That is way less simple than "we should run contributor resolution on all commits we don't currently have linked to a contributor" AKA "is the cmt_ght_author_id NULL?"