blame function is running slowly, and the results do not match the git blame command #1401

@Sisyphus-wang

I want to use pygit2 to find out when a vulnerability was introduced, starting from the fix patches of open-source components.

On Ubuntu, checking out the commit and running git blame from the command line against the Linux repository takes about 13 seconds in total:

time (git checkout 2734d6c1b1a089fb593ef6a23d4b70903526fe0c && git blame -L 3883,3886 kernel/trace/ring_buffer.c)
Updating files: 100% (77303/77303), done.
Note: switching to '2734d6c1b1a089fb593ef6a23d4b70903526fe0c'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 2734d6c1b1a0 Linux 5.14-rc2
bf41a158cacba (Steven Rostedt 2008-10-04 02:00:59 -0400 3883)   return reader->read == rb_page_commit(reader) &&
bf41a158cacba (Steven Rostedt 2008-10-04 02:00:59 -0400 3884)           (commit == reader ||
bf41a158cacba (Steven Rostedt 2008-10-04 02:00:59 -0400 3885)            (commit == head &&
bf41a158cacba (Steven Rostedt 2008-10-04 02:00:59 -0400 3886)             head->read == rb_page_commit(commit)));

real    0m13.343s
user    0m6.960s
sys     0m5.266s

However, pygit2 is significantly slower than the command line for the same line range, and its results do not match the git blame output.

Here is my Python code:

from datetime import datetime, timezone

import pygit2


def show_commit_line_blame(repo_path, commit_hash, file_path, min_line_number, max_line_number):
    repo = pygit2.Repository(repo_path)
    # Blame only the requested line range, starting from commit_hash.
    blame = repo.blame(
        file_path,
        newest_commit=commit_hash,
        flags=pygit2.GIT_BLAME_FIRST_PARENT,
        min_line=min_line_number,
        max_line=max_line_number,
    )

    for line_number in range(min_line_number, max_line_number + 1):
        # Look up the hunk covering this line and the commit it is attributed to.
        hunk = blame.for_line(line_number)
        blamed_commit = repo.get(hunk.final_commit_id)
        # commit_time is the committer timestamp, in seconds since the epoch.
        utc_time = datetime.fromtimestamp(blamed_commit.commit_time, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        print(f"Line {line_number} in {file_path} last modified by commit {blamed_commit.id} on {utc_time}")

output:
Line 3883 in kernel/trace/ring_buffer.c last modified by commit 92b29b86fe2e183d44eb467e5e74a5f718ef2e43 on 2008-10-20T20:35:07Z
Line 3884 in kernel/trace/ring_buffer.c last modified by commit 92b29b86fe2e183d44eb467e5e74a5f718ef2e43 on 2008-10-20T20:35:07Z
Line 3885 in kernel/trace/ring_buffer.c last modified by commit 92b29b86fe2e183d44eb467e5e74a5f718ef2e43 on 2008-10-20T20:35:07Z
Line 3886 in kernel/trace/ring_buffer.c last modified by commit 92b29b86fe2e183d44eb467e5e74a5f718ef2e43 on 2008-10-20T20:35:07Z
used : 50.29366707801819s

environment:
Ubuntu 22.04 (6.8.0-58-generic #60~22.04.1-Ubuntu SMP)
Python 3.10.12
pygit2 1.18.1

Could you give me some suggestions to improve execution efficiency?
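
For comparison, here is a minimal sketch of one possible workaround: shelling out to `git blame --line-porcelain` and parsing the result, so that the attribution matches the command line. It passes the commit to `git blame` directly instead of checking it out first, which probably avoids much of the 13 seconds above (the "Updating files" step). The helper names `parse_line_porcelain` and `git_blame_range` are hypothetical and not part of pygit2. Note also that the pygit2 call above sets `GIT_BLAME_FIRST_PARENT`, while the plain `git blame` command follows all parents, so the flag may account for at least part of the mismatch; I have not verified this.

```python
import re
import subprocess
from datetime import datetime, timezone

# Entry header in porcelain output: "<sha> <orig_line> <final_line> [<count>]"
_HEADER = re.compile(r"^([0-9a-f]{40,64}) (\d+) (\d+)(?: \d+)?$")


def parse_line_porcelain(text):
    """Map final line number -> (commit sha, committer time as a UTC string)."""
    result = {}
    sha = line_no = None
    # Split on "\n" only, so control characters inside source lines are kept intact.
    for line in text.split("\n"):
        if line.startswith("\t"):
            # A TAB-prefixed line is the source text and ends the current entry.
            sha = line_no = None
            continue
        match = _HEADER.match(line)
        if match:
            sha, line_no = match.group(1), int(match.group(3))
        elif sha is not None and line.startswith("committer-time "):
            stamp = datetime.fromtimestamp(int(line.split()[1]), tz=timezone.utc)
            result[line_no] = (sha, stamp.strftime("%Y-%m-%dT%H:%M:%SZ"))
    return result


def git_blame_range(repo_path, commit_hash, file_path, min_line, max_line):
    """Run git blame at commit_hash without touching the working tree."""
    cmd = [
        "git", "-C", repo_path, "blame", "--line-porcelain",
        "-L", f"{min_line},{max_line}", commit_hash, "--", file_path,
    ]
    proc = subprocess.run(
        cmd, check=True, capture_output=True, text=True, errors="replace"
    )
    return parse_line_porcelain(proc.stdout)
```

Calling `git_blame_range(repo_path, "2734d6c1b1a089fb593ef6a23d4b70903526fe0c", "kernel/trace/ring_buffer.c", 3883, 3886)` should return one entry per line in the range. The sketch reads `committer-time` so that the timestamp corresponds to `commit_time` in the pygit2 code above.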
