-
-
Notifications
You must be signed in to change notification settings - Fork 6
Merge author by email #131
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
e1e7cc9 to
9f52392
Compare
This comment was marked as outdated.
This comment was marked as outdated.
WalkthroughThe pull request introduces author canonicalization to the gitstats tool. A new Changes
Sequence DiagramsequenceDiagram
participant Git as Git Commands
participant Parser as Parsing Logic
participant Canonicalizer as get_canonical_author()
participant Stats as Per-Author Stats
Git->>Parser: git shortlog -e output<br/>(Name <email>)
Parser->>Parser: extract name & email
Parser->>Canonicalizer: lookup canonical name<br/>for email
Canonicalizer->>Canonicalizer: count frequency of<br/>each name per email
Canonicalizer-->>Parser: return canonical name
Parser->>Stats: update author stats<br/>with canonical name
Git->>Parser: git rev-list output
Parser->>Canonicalizer: resolve author via<br/>canonical mapping
Canonicalizer-->>Parser: canonical author
Parser->>Stats: propagate into per-author<br/>& per-date aggregations
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
for more information, see https://pre-commit.ci
9f52392 to
7172d9b
Compare
|
@pre-commit-ci[bot] is attempting to deploy a commit to the Xianpeng Shen's projects Team on Vercel. A member of the Team first needs to authorize it. |
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (2)
gitstats/main.py (2)
50-52: Canonical author tracking looks sound; consider clarifying tie/empty-email behaviorThe
author_emails/author_name_countsstructures andget_canonical_author()logic are coherent, and using per-email frequency is a good fit for the PR’s goal.Two minor edge cases to consider (not blockers):
- In case of ties (two names with equal counts for the same email),
max()will return the first inserted name due to dict insertion order. If you care about deterministic policy here (e.g., newest name wins or lexicographically smallest), you might want to make that explicit in code or docstring.- If Git ever yields an empty email (e.g.
<>), all such commits will be merged into one pseudo-email. If that’s undesirable, you might want to special-caseemail == ""and skip canonicalization or key it by author instead.Otherwise this helper is straightforward and integrates well.
Also applies to: 139-162
167-171:total_authorsstill reflects raw shortlog names, not canonical-by-email authors
self.total_authorsis computed via:self.total_authors += int( get_pipe_output(["git shortlog -s %s" % get_log_range("HEAD", False), "wc -l"]) )and later exposed via
get_total_authors(). With the new canonicalization,self.authorsis keyed by canonical names per email, buttotal_authorsstill counts raw shortlog name entries, so “Total authors” may disagree with the canonical-author list when the same email appears under multiple names.If you want the UI/JSON to consistently reflect “merged by email” semantics, consider deriving
total_authorsfrom the canonical structures (e.g.,len(self.author_emails)orlen(self.authors)after canonicalization) instead of raw shortlog output. If backward compatibility with the previous behavior is more important, current code is fine as-is.Also applies to: 728-732
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
gitstats/main.py(8 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: test-macos (3.11)
- GitHub Check: test-macos (3.10)
🔇 Additional comments (2)
gitstats/main.py (2)
274-276: rev-list canonicalization and authors initialization are consistentApplying
get_canonical_author()in the rev-list loop and initializingself.authors[author]withlines_added,lines_removed, andcommitsset to0keeps author keys consistent with the canonical names and guarantees the expected fields exist before later aggregation.This meshes cleanly with the later per-author shortstat pass, which increments those fields.
Also applies to: 343-348
591-593: Per-author shortstat parsing with email-based canonicalization looks correctSwitching the pretty format to
%at %aN <%%aE>and then parsingstamp_strplus"Name <email>"before feeding intoget_canonical_author()is logically consistent with the earlier rev-list handling. The fallback branch when</>are missing is also reasonable as a defensive path.Note: since this loop and the rev-list loop both call
get_canonical_author()per commit, all names for a given email get their counts scaled by the same factor, so the “most frequent name per email” decision remains stable.Also applies to: 605-624
| # Modify command to only include commits within our range and include email | ||
| cmd = f'git shortlog -s -e "{tag}"' | ||
| if prev is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Canonicalization can undercount per-tag author commits when multiple names share an email
When git shortlog -s -e emits multiple lines for the same email (different author names), get_canonical_author() can map them to the same canonical author. In that case:
self.tags[tag]["authors"][author] = commitswill overwrite the previous entry instead of summing, so the per-author counts for that tag become incorrect even though self.tags[tag]["commits"] is correct.
Consider accumulating instead of assigning:
- self.tags[tag]["commits"] += commits
- self.tags[tag]["authors"][author] = commits
+ self.tags[tag]["commits"] += commits
+ self.tags[tag]["authors"][author] = (
+ self.tags[tag]["authors"].get(author, 0) + commits
+ )This preserves the intended “merge by email” semantics for tag statistics.
Also applies to: 239-251
🤖 Prompt for AI Agents
In gitstats/main.py around lines 222-224 (and similarly for 239-251), the code
assigns self.tags[tag]["authors"][author] = commits after canonicalizing by
email which overwrites counts when multiple names map to the same canonical
author; change the assignment to accumulate the commit counts instead (e.g.,
read current = self.tags[tag]["authors"].get(author, 0) and set to current +
commits) and apply the same accumulation logic in the other block so per-tag
author counts are summed rather than overwritten.
|
For some reason, the "Tags" page can not show authors for each tag. For "Authors" pages, take me as an example, "shenxianpeng" and "Xianpeng Shen" are both me, but they are not merged as either "shenxianpeng" or "Xianpeng Shen" to treat them are the same author, like .mailmap feature |







I know that there is a .mailmap feature for git that helps you merge commits by the same author with different
emailorname. But I'd like to automatically merge authors with the same email address. This PR does that :-)📚 Documentation preview 📚: https://gitstats--131.org.readthedocs.build/
Summary by CodeRabbit