@jackbravo
Contributor

@jackbravo jackbravo commented Nov 13, 2025

I know that Git has a .mailmap feature that lets you merge commits made by the same author under different emails or names. But I'd like to automatically merge authors who share the same email address. This PR does that :-)


📚 Documentation preview 📚: https://gitstats--131.org.readthedocs.build/

Summary by CodeRabbit

  • New Features
    • Enhanced author tracking to recognize and consolidate multiple name formats for the same contributor across commits.
    • Improved email-based author identification in statistics collection.

@coderabbitai

coderabbitai bot commented Nov 14, 2025

Walkthrough

The pull request introduces author canonicalization to the gitstats tool. A new get_canonical_author method derives the most frequently-used name per email address. Data structures track author-to-email mappings and name frequency counts. The parsing and data collection pipelines now use canonical names instead of raw commit author names throughout.
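A minimal sketch of the frequency-based canonicalization described in the walkthrough, assuming a dict-of-Counter layout. The class name AuthorCanonicalizer is hypothetical, and this is an illustration of the idea rather than the code in the PR:

```python
from collections import Counter, defaultdict


class AuthorCanonicalizer:
    """Hypothetical helper: tracks how often each name is seen per email."""

    def __init__(self):
        # email -> Counter of author names seen with that email
        self.author_name_counts = defaultdict(Counter)
        # email -> name that is currently the most frequent for that email
        self.author_emails = {}

    def get_canonical_author(self, author, email):
        counts = self.author_name_counts[email]
        counts[author] += 1
        # most_common(1) yields the highest count; on a tie the name
        # encountered first wins.
        canonical = counts.most_common(1)[0][0]
        self.author_emails[email] = canonical
        return canonical
```

In a single-pass sketch like this, the name returned for an email can change as more commits are seen, so statistics keyed by an early answer may need to be re-keyed afterwards.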

Changes

Cohort / File(s) Change Summary
Author Canonicalization System
gitstats/main.py
Added author_emails (email → canonical name mapping) and author_name_counts (email → name frequency counts) attributes to DataCollector. Introduced get_canonical_author(author, email) method to GitDataCollector that returns the most frequently-used author name for a given email. Updated shortlog collection to include emails (-e flag). Modified parsing of "Name <email>" entries to extract and normalize name/email pairs. Updated revision stats processing to resolve authors via canonical names and propagate them into per-author statistics. Enhanced per-author data structure initialization with lines_added, lines_removed, and commits fields.

Sequence Diagram

sequenceDiagram
    participant Git as Git Commands
    participant Parser as Parsing Logic
    participant Canonicalizer as get_canonical_author()
    participant Stats as Per-Author Stats

    Git->>Parser: git shortlog -e output<br/>(Name <email>)
    Parser->>Parser: extract name & email
    Parser->>Canonicalizer: lookup canonical name<br/>for email
    Canonicalizer->>Canonicalizer: count frequency of<br/>each name per email
    Canonicalizer-->>Parser: return canonical name
    Parser->>Stats: update author stats<br/>with canonical name
    
    Git->>Parser: git rev-list output
    Parser->>Canonicalizer: resolve author via<br/>canonical mapping
    Canonicalizer-->>Parser: canonical author
    Parser->>Stats: propagate into per-author<br/>& per-date aggregations

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Key areas for attention:
    • Logic density in get_canonical_author method—verify frequency-based canonicalization is correct and handles edge cases (ties, null values)
    • Data flow changes across parsing paths—ensure canonical names are consistently applied in tag-based aggregation, revision statistics, and shortlog collection
    • Initialization of per-author statistics fields—confirm lines_added, lines_removed, and commits are properly propagated across all collection paths
    • Email extraction and parsing in "Name <email>" entries: verify regex or parsing logic handles various name/email formats correctly

Poem

🐰 A hop, skip, and jump through the code so keen,
Where authors once tangled are now canonicalized clean!
Each email finds truth in a name most frequent,
No more duplicates making the stats obsequent!
With whiskers held high, we celebrate true,
One name per email—hooray, hooray, who's who! 🎉

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title 'Merge author by email' directly corresponds to the main objective of the pull request, which implements automatic merging of authors who share the same email address.

@vercel

vercel bot commented Nov 15, 2025

@pre-commit-ci[bot] is attempting to deploy a commit to the Xianpeng Shen's projects Team on Vercel.

A member of the Team first needs to authorize it.

@vercel

vercel bot commented Nov 15, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
gitstats Ready Ready Preview Comment Nov 15, 2025 5:46pm

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
gitstats/main.py (2)

50-52: Canonical author tracking looks sound; consider clarifying tie/empty-email behavior

The author_emails / author_name_counts structures and get_canonical_author() logic are coherent, and using per-email frequency is a good fit for the PR’s goal.

Two minor edge cases to consider (not blockers):

  • In case of ties (two names with equal counts for the same email), max() will return the first inserted name due to dict insertion order. If you care about deterministic policy here (e.g., newest name wins or lexicographically smallest), you might want to make that explicit in code or docstring.
  • If Git ever yields an empty email (e.g. <>), all such commits will be merged into one pseudo-email. If that’s undesirable, you might want to special-case email == "" and skip canonicalization or key it by author instead.

Otherwise this helper is straightforward and integrates well.

Also applies to: 139-162
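The two edge cases above can be made explicit with a small sketch. The function names pick_canonical and canonical_key and the "name:" prefix are hypothetical, and lexicographic tie-breaking is only one possible policy, not something the PR specifies:

```python
def pick_canonical(name_counts):
    """Return the most frequent name; ties go to the lexicographically smallest."""
    # Key: higher count first, then name ascending, so the result does not
    # depend on dict insertion order.
    return min(name_counts, key=lambda name: (-name_counts[name], name))


def canonical_key(author, email):
    """Key commits with an empty email by author name instead of merging them."""
    # Assumption: an empty string means Git reported no email (e.g. "<>").
    return email if email else f"name:{author}"


counts = {"shenxianpeng": 2, "Xianpeng Shen": 2}
first_inserted = max(counts, key=counts.get)  # insertion-order dependent
explicit_policy = pick_canonical(counts)      # independent of insertion order
```

With an explicit policy the canonical name no longer depends on which spelling happened to be parsed first.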


167-171: total_authors still reflects raw shortlog names, not canonical-by-email authors

self.total_authors is computed via:

self.total_authors += int(
    get_pipe_output(["git shortlog -s %s" % get_log_range("HEAD", False), "wc -l"])
)

and later exposed via get_total_authors(). With the new canonicalization, self.authors is keyed by canonical names per email, but total_authors still counts raw shortlog name entries, so “Total authors” may disagree with the canonical-author list when the same email appears under multiple names.

If you want the UI/JSON to consistently reflect “merged by email” semantics, consider deriving total_authors from the canonical structures (e.g., len(self.author_emails) or len(self.authors) after canonicalization) instead of raw shortlog output. If backward compatibility with the previous behavior is more important, current code is fine as-is.

Also applies to: 728-732
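To illustrate how the two totals can diverge, here is a small sketch that counts distinct raw names and distinct emails over the same commits. The function name author_totals and the sample data are hypothetical:

```python
def author_totals(commits):
    """commits: iterable of (author_name, email) pairs, one per commit.

    Returns (distinct raw names, distinct emails).
    """
    commits = list(commits)
    raw_names = {name for name, _email in commits}
    emails = {email for _name, email in commits}
    return len(raw_names), len(emails)


sample = [
    ("shenxianpeng", "x@example.org"),
    ("Xianpeng Shen", "x@example.org"),
    ("jackbravo", "j@example.org"),
]
raw_total, merged_total = author_totals(sample)
```

Here the raw count is 3 while the merged-by-email count is 2, which is the kind of mismatch the "Total authors" figure could show.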

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e20ae3a and 7172d9b.

📒 Files selected for processing (1)
  • gitstats/main.py (8 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: test-macos (3.11)
  • GitHub Check: test-macos (3.10)
🔇 Additional comments (2)
gitstats/main.py (2)

274-276: rev-list canonicalization and authors initialization are consistent

Applying get_canonical_author() in the rev-list loop and initializing self.authors[author] with lines_added, lines_removed, and commits set to 0 keeps author keys consistent with the canonical names and guarantees the expected fields exist before later aggregation.

This meshes cleanly with the later per-author shortstat pass, which increments those fields.

Also applies to: 343-348


591-593: Per-author shortstat parsing with email-based canonicalization looks correct

Switching the pretty format to %at %aN <%%aE> and then parsing stamp_str plus "Name <email>" before feeding into get_canonical_author() is logically consistent with the earlier rev-list handling. The fallback branch when < / > are missing is also reasonable as a defensive path.

Note: since this loop and the rev-list loop both call get_canonical_author() per commit, all names for a given email get their counts scaled by the same factor, so the “most frequent name per email” decision remains stable.

Also applies to: 605-624
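A hedged sketch of parsing one line of that format, including the fallback when the angle brackets are missing. The function name parse_author_line is hypothetical, and the sketch assumes each line starts with an integer timestamp followed by a space:

```python
def parse_author_line(line):
    """Split '<timestamp> <name> <email-in-angle-brackets>' into its parts."""
    stamp_str, _, rest = line.strip().partition(" ")
    stamp = int(stamp_str)
    lt = rest.rfind("<")
    gt = rest.rfind(">")
    if lt != -1 and gt > lt:
        name = rest[:lt].strip()
        email = rest[lt + 1:gt].strip()
    else:
        # Defensive fallback: no <email> part, treat the remainder as the name.
        name, email = rest.strip(), ""
    return stamp, name, email
```

The parsed name and email can then be passed to get_canonical_author() in the same way as in the rev-list loop.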

Comment on lines +222 to 224
# Modify command to only include commits within our range and include email
cmd = f'git shortlog -s -e "{tag}"'
if prev is not None:

⚠️ Potential issue | 🟠 Major

Canonicalization can undercount per-tag author commits when multiple names share an email

When git shortlog -s -e emits multiple lines for the same email (different author names), get_canonical_author() can map them to the same canonical author. In that case:

self.tags[tag]["authors"][author] = commits

will overwrite the previous entry instead of summing, so the per-author counts for that tag become incorrect even though self.tags[tag]["commits"] is correct.

Consider accumulating instead of assigning:

-                self.tags[tag]["commits"] += commits
-                self.tags[tag]["authors"][author] = commits
+                self.tags[tag]["commits"] += commits
+                self.tags[tag]["authors"][author] = (
+                    self.tags[tag]["authors"].get(author, 0) + commits
+                )

This preserves the intended “merge by email” semantics for tag statistics.

Also applies to: 239-251
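A standalone sketch of the accumulation, assuming rows have already been parsed from git shortlog -s -e into (commits, name, email) tuples. The function name per_tag_author_counts, the canonical_by_email mapping, and the sample data are hypothetical:

```python
def per_tag_author_counts(rows, canonical_by_email):
    """rows: (commits, name, email) tuples for one tag's commit range."""
    authors = {}
    for commits, name, email in rows:
        # Fall back to the raw name when the email has no canonical entry.
        author = canonical_by_email.get(email, name)
        # Accumulate rather than assign, so two spellings of one author add up.
        authors[author] = authors.get(author, 0) + commits
    return authors


rows = [
    (3, "alice", "a@example.com"),
    (5, "Alice A", "a@example.com"),
    (2, "bob", "b@example.com"),
]
result = per_tag_author_counts(rows, {"a@example.com": "Alice A"})
```

With plain assignment the entry for "Alice A" would end up as 5; with accumulation it is 8, which matches the tag's total of 10 commits.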

🤖 Prompt for AI Agents
In gitstats/main.py around lines 222-224 (and similarly for 239-251), the code
assigns self.tags[tag]["authors"][author] = commits after canonicalizing by
email which overwrites counts when multiple names map to the same canonical
author; change the assignment to accumulate the commit counts instead (e.g.,
read current = self.tags[tag]["authors"].get(author, 0) and set to current +
commits) and apply the same accumulation logic in the other block so per-tag
author counts are summed rather than overwritten.

@shenxianpeng
Owner

For some reason, the "Tags" page cannot show the authors for each tag.

Click to see screenshot

Before

image

After

image

On the "Authors" page, taking me as an example: "shenxianpeng" and "Xianpeng Shen" are both me, but they are not merged under either name and treated as the same author, the way the .mailmap feature would.

Click to see screenshot

Before

image

After (removed .mailmap and regenerated report)

image

@shenxianpeng shenxianpeng added the enhancement New feature or request label Nov 15, 2025