Merge author by email #131

jackbravo · 2025-11-13T15:46:24Z

I know that there is a .mailmap feature for git that helps you merge commits by the same author with different email or name. But I'd like to automatically merge authors with the same email address. This PR does that :-)

📚 Documentation preview 📚: https://gitstats--131.org.readthedocs.build/

Summary by CodeRabbit

New Features
- Enhanced author tracking to recognize and consolidate multiple name formats for the same contributor across commits.
- Improved email-based author identification in statistics collection.

coderabbitai · 2025-11-14T22:13:19Z

Walkthrough

The pull request introduces author canonicalization to the gitstats tool. A new get_canonical_author method derives the most frequently-used name per email address. Data structures track author-to-email mappings and name frequency counts. The parsing and data collection pipelines now use canonical names instead of raw commit author names throughout.
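To make the walkthrough concrete, here is a minimal sketch of frequency-based canonicalization. It is not the code from this PR: the `AuthorCanonicalizer` class and the example email address are hypothetical, and only the names `author_emails`, `author_name_counts`, and `get_canonical_author` come from the summary above.

```python
from collections import defaultdict


class AuthorCanonicalizer:
    """Hypothetical stand-in for the author-tracking state described above."""

    def __init__(self):
        # email -> {name: number of times that name was seen with this email}
        self.author_name_counts = defaultdict(dict)
        # email -> currently chosen canonical name
        self.author_emails = {}

    def get_canonical_author(self, author, email):
        counts = self.author_name_counts[email]
        counts[author] = counts.get(author, 0) + 1
        # Pick the name seen most often for this email so far.
        canonical = max(counts, key=counts.get)
        self.author_emails[email] = canonical
        return canonical


canon = AuthorCanonicalizer()
canon.get_canonical_author("Xianpeng Shen", "dev@example.com")
canon.get_canonical_author("shenxianpeng", "dev@example.com")
print(canon.get_canonical_author("shenxianpeng", "dev@example.com"))  # shenxianpeng
```

In this sketch the answer for an email can change as more commits are processed, so a caller that needs one stable name per email would have to count all commits before resolving any of them.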

Changes

Cohort / File(s)	Change Summary
Author Canonicalization System `gitstats/main.py`	Added `author_emails` (email → canonical name mapping) and `author_name_counts` (email → name frequency counts) attributes to `DataCollector`. Introduced `get_canonical_author(author, email)` method to `GitDataCollector` that returns the most frequently-used author name for a given email. Updated shortlog collection to include emails (-e flag). Modified parsing of "Name <email>" entries to extract and normalize (name, email) pairs. Updated revision stats processing to resolve authors via canonical names and propagate them into per-author statistics. Enhanced per-author data structure initialization with lines_added, lines_removed, and commits fields.

Sequence Diagram

sequenceDiagram
    participant Git as Git Commands
    participant Parser as Parsing Logic
    participant Canonicalizer as get_canonical_author()
    participant Stats as Per-Author Stats

    Git->>Parser: git shortlog -e output<br/>(Name <email>)
    Parser->>Parser: extract name & email
    Parser->>Canonicalizer: lookup canonical name<br/>for email
    Canonicalizer->>Canonicalizer: count frequency of<br/>each name per email
    Canonicalizer-->>Parser: return canonical name
    Parser->>Stats: update author stats<br/>with canonical name
    
    Git->>Parser: git rev-list output
    Parser->>Canonicalizer: resolve author via<br/>canonical mapping
    Canonicalizer-->>Parser: canonical author
    Parser->>Stats: propagate into per-author<br/>& per-date aggregations

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Key areas for attention:
- Logic density in the get_canonical_author method: verify that frequency-based canonicalization is correct and handles edge cases (ties, null values)
- Data flow changes across parsing paths: ensure canonical names are consistently applied in tag-based aggregation, revision statistics, and shortlog collection
- Initialization of per-author statistics fields: confirm lines_added, lines_removed, and commits are properly propagated across all collection paths
- Email extraction and parsing in "Name <email>" entries: verify that the regex or parsing logic handles various name/email formats correctly

Poem

🐰 A hop, skip, and jump through the code so keen,
Where authors once tangled are now canonicalized clean!
Each email finds truth in a name most frequent,
No more duplicates making the stats obsequent!
With whiskers held high, we celebrate true,
One name per email—hooray, hooray, who's who! 🎉

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Merge author by email' directly corresponds to the main objective of the pull request, which implements automatic merging of authors who share the same email address.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Comment `@coderabbitai help` to get the list of available commands and usage tips.

vercel · 2025-11-15T17:45:42Z

@pre-commit-ci[bot] is attempting to deploy a commit to the Xianpeng Shen's projects Team on Vercel.

A member of the Team first needs to authorize it.

vercel · 2025-11-15T17:45:59Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Preview	Comments	Updated (UTC)
gitstats	Ready	Preview	Comment	Nov 15, 2025 5:46pm

sonarqubecloud · 2025-11-15T17:46:12Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

gitstats/main.py (2)
50-52: Canonical author tracking looks sound; consider clarifying tie/empty-email behavior

The author_emails / author_name_counts structures and get_canonical_author() logic are coherent, and using per-email frequency is a good fit for the PR’s goal.

Two minor edge cases to consider (not blockers):

- In case of ties (two names with equal counts for the same email), max() will return the first inserted name due to dict insertion order. If you care about deterministic policy here (e.g., newest name wins or lexicographically smallest), you might want to make that explicit in code or docstring.

- If Git ever yields an empty email (e.g. <>), all such commits will be merged into one pseudo-email. If that’s undesirable, you might want to special-case email == "" and skip canonicalization or key it by author instead.

Otherwise this helper is straightforward and integrates well.
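One possible way to make both edge cases explicit is sketched below. The helper names `pick_canonical` and `canonical_key` are hypothetical and do not exist in the PR, and "lexicographically smallest wins" is just one of the tie policies mentioned above, chosen here for illustration.

```python
def pick_canonical(name_counts):
    """Return the most frequent name; ties go to the lexicographically smallest."""
    # Sort key: higher count first, then alphabetical order as the tie-breaker.
    return min(name_counts, key=lambda name: (-name_counts[name], name))


def canonical_key(author, email):
    """Key commits by email, falling back to the author name when email is empty."""
    if email:
        return ("email", email)
    # An empty email (e.g. "<>") would otherwise merge unrelated authors.
    return ("name", author)


print(pick_canonical({"Zed": 2, "Amy": 2, "Bob": 1}))  # Amy
print(canonical_key("Amy", ""))  # ('name', 'Amy')
```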

Also applies to: 139-162

167-171: total_authors still reflects raw shortlog names, not canonical-by-email authors

self.total_authors is computed via:
self.total_authors += int(
    get_pipe_output(["git shortlog -s %s" % get_log_range("HEAD", False), "wc -l"])
)
and later exposed via get_total_authors(). With the new canonicalization, self.authors is keyed by canonical names per email, but total_authors still counts raw shortlog name entries, so “Total authors” may disagree with the canonical-author list when the same email appears under multiple names.

If you want the UI/JSON to consistently reflect “merged by email” semantics, consider deriving total_authors from the canonical structures (e.g., len(self.author_emails) or len(self.authors) after canonicalization) instead of raw shortlog output. If backward compatibility with the previous behavior is more important, current code is fine as-is.
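The mismatch can be illustrated with a small standalone sketch. The function `count_canonical_authors` and the sample commits are hypothetical; the point is only that counting distinct names and counting distinct emails give different totals when one email appears under several names.

```python
def count_canonical_authors(commits):
    """Count distinct authors, treating commits that share an email as one author."""
    return len({email for _name, email in commits})


commits = [
    ("Xianpeng Shen", "dev@example.com"),
    ("shenxianpeng", "dev@example.com"),
    ("jackbravo", "jack@example.com"),
]

raw_name_count = len({name for name, _email in commits})
merged_count = count_canonical_authors(commits)
print(raw_name_count, merged_count)  # 3 2
```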

Also applies to: 728-732

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e20ae3a and 7172d9b.

📒 Files selected for processing (1)

gitstats/main.py (8 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: test-macos (3.11)
GitHub Check: test-macos (3.10)

🔇 Additional comments (2)

gitstats/main.py (2)

274-276: rev-list canonicalization and authors initialization are consistent

Applying get_canonical_author() in the rev-list loop and initializing self.authors[author] with lines_added, lines_removed, and commits set to 0 keeps author keys consistent with the canonical names and guarantees the expected fields exist before later aggregation.

This meshes cleanly with the later per-author shortstat pass, which increments those fields.

Also applies to: 343-348

591-593: Per-author shortstat parsing with email-based canonicalization looks correct

Switching the pretty format to %at %aN <%%aE> and then parsing stamp_str plus "Name <email>" before feeding into get_canonical_author() is logically consistent with the earlier rev-list handling. The fallback branch when < / > are missing is also reasonable as a defensive path.
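A rough sketch of this kind of parsing, including the defensive fallback, might look like the following. Both function names are hypothetical and the PR's actual parsing may differ; the sample line assumes the "timestamp Name <email>" layout produced by the pretty format above.

```python
def split_name_email(entry):
    """Split 'Name <email>' into (name, email); email is '' if brackets are missing."""
    entry = entry.strip()
    start = entry.rfind("<")
    end = entry.rfind(">")
    if start != -1 and end > start:
        return entry[:start].strip(), entry[start + 1:end].strip()
    # Defensive fallback: no angle brackets, so treat the whole entry as the name.
    return entry, ""


def parse_author_line(line):
    """Parse 'timestamp Name <email>' into (timestamp, name, email)."""
    stamp_str, _, rest = line.strip().partition(" ")
    name, email = split_name_email(rest)
    return int(stamp_str), name, email


print(parse_author_line("1763228742 Xianpeng Shen <dev@example.com>"))
# (1763228742, 'Xianpeng Shen', 'dev@example.com')
```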

Note: since this loop and the rev-list loop both call get_canonical_author() per commit, all names for a given email get their counts scaled by the same factor, so the “most frequent name per email” decision remains stable.
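The scaling argument can be checked with a tiny example, assuming each loop visits every commit exactly once so that all counts are multiplied by the same factor. The counts below are made up for illustration.

```python
def most_frequent(counts):
    """Return the key with the highest count."""
    return max(counts, key=counts.get)


# Name counts for one email after a single pass over the commits.
single_pass = {"shenxianpeng": 5, "Xianpeng Shen": 3}
# If two loops each count every commit, every entry is doubled.
double_pass = {name: 2 * n for name, n in single_pass.items()}

print(most_frequent(single_pass) == most_frequent(double_pass))  # True
```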

Also applies to: 605-624

coderabbitai · 2025-11-15T17:50:32Z

gitstats/main.py

+            # Modify command to only include commits within our range and include email
+            cmd = f'git shortlog -s -e "{tag}"'
            if prev is not None:


⚠️ Potential issue | 🟠 Major

Canonicalization can undercount per-tag author commits when multiple names share an email

When git shortlog -s -e emits multiple lines for the same email (different author names), get_canonical_author() can map them to the same canonical author. In that case:

self.tags[tag]["authors"][author] = commits

will overwrite the previous entry instead of summing, so the per-author counts for that tag become incorrect even though self.tags[tag]["commits"] is correct.

Consider accumulating instead of assigning:

-            self.tags[tag]["commits"] += commits
-            self.tags[tag]["authors"][author] = commits
+            self.tags[tag]["commits"] += commits
+            self.tags[tag]["authors"][author] = (
+                self.tags[tag]["authors"].get(author, 0) + commits
+            )

This preserves the intended “merge by email” semantics for tag statistics.
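The difference between overwriting and accumulating can be seen in isolation with a plain dict. The helper `add_tag_author_commits` is hypothetical and only mirrors the `.get(author, 0) + commits` pattern suggested above.

```python
def add_tag_author_commits(tag_authors, author, commits):
    """Add commits to an author's running total for one tag."""
    tag_authors[author] = tag_authors.get(author, 0) + commits


tag_authors = {}
# Two shortlog lines with different names that canonicalize to the same author.
add_tag_author_commits(tag_authors, "shenxianpeng", 7)
add_tag_author_commits(tag_authors, "shenxianpeng", 3)
print(tag_authors)  # {'shenxianpeng': 10}
```

With plain assignment the second line would leave the total at 3 instead of 10.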

Also applies to: 239-251

🤖 Prompt for AI Agents

In gitstats/main.py around lines 222-224 (and similarly for 239-251), the code assigns self.tags[tag]["authors"][author] = commits after canonicalizing by email which overwrites counts when multiple names map to the same canonical author; change the assignment to accumulate the commit counts instead (e.g., read current = self.tags[tag]["authors"].get(author, 0) and set to current + commits) and apply the same accumulation logic in the other block so per-tag author counts are summed rather than overwritten.

shenxianpeng · 2025-11-15T19:28:19Z

For some reason, the "Tags" page cannot show the authors for each tag.

Click to see screenshot

Before

After

For the "Authors" page, take me as an example: "shenxianpeng" and "Xianpeng Shen" are both me, but they are not merged into either "shenxianpeng" or "Xianpeng Shen" and treated as the same author, the way the .mailmap feature would do.

Click to see screenshot

Before

After (removed .mailmap and regenerated report)

shenxianpeng force-pushed the merge-author-mail branch from e1e7cc9 to 9f52392 Compare November 14, 2025 22:13

jackbravo and others added 2 commits November 15, 2025 19:45

feat: Merge author by email

42711b8

[pre-commit.ci] auto fixes from pre-commit.com hooks

7172d9b

for more information, see https://pre-commit.ci

shenxianpeng force-pushed the merge-author-mail branch from 9f52392 to 7172d9b Compare November 15, 2025 17:45

vercel bot temporarily deployed to Preview November 15, 2025 17:46 Inactive

coderabbitai bot reviewed Nov 15, 2025

shenxianpeng added the enhancement New feature or request label Nov 15, 2025
