-
Notifications
You must be signed in to change notification settings - Fork 2.6k
refactor:improve checkpoint and ensure gc to improve disk space #4494
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>
|
Awesome! I have been wondering for a long time if Do you have any idea how much this saves us in repo space 20, 30, 50%? We might be able to find people with really large repo's to be guinea pigs if you want a big stress test (unless you personally already have them and this is why you made the PR :-) ) 🦘 🤟 |
|
@adamhill i think the problems is we don't need to track all of them, we just need to track what file going to change, that will be better (you can see checkpoint use git add . ) we can change it to (git add ) |
|
@adamhill Running |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @ijin, thank you for taking over this PR, I left some questions and suggestions.
My main concern about this new implementation is finding a way to test it correctly, it might also be a good idea to separate the garbage collection functionality from the stageAll method of tracking file changes into a separate PR so it's easier to test.
I'm curious to know what you think of this!
| try { | ||
| await git.add(".") | ||
| // Use git status to find changed/new/deleted files and stage them specifically | ||
| const status = await git.status(["--porcelain=v1", "-uall"]) // -uall includes all untracked files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new stageAll implementation makes multiple git calls (status, add, rm) compared to the previous single git add . call. Have you measured the performance impact of this change? For large repositories with many changes, this could potentially be slower.
While the precision is valuable for avoiding unnecessary files, it might be worth benchmarking this against the old approach to ensure we're not introducing a performance regression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I know, the expensive full repo walk is done once by git status (just as it would in git add .). Subsequent git add/rm calls only does a lightweight stat(), making their overhead minimal.
That said, here are some generated benchmarks I ran on Roo-Code itself. I don't see it much as a performance regression.
- benchmark: https://gist.github.com/ijin/8946d83e77ddcb108d62c82073550eaf
- output:
=== stageAll() Performance Benchmark ===
Repository: /tmp/Roo-Code
Runs per test: 5
=== Creating Large Change Scenario ===
Created 2 nested git repositories
Found 80 TypeScript files to modify
Modified 60 TypeScript files
Created 12 new files
Successfully added and committed delete-me files
Created and deleted 7 files
Modified package.json
Large change scenario created successfully
=== Repository Statistics ===
Total tracked files: 1630
Modified files: 80
Staged files: 0
=== Benchmarking OLD Approach (Full stageAll with git add .) ===
Found 2 nested .git directories in 26.9ms
Run 1: 242.2ms (rename: 29.4ms, git: 212.5ms)
Run 2: 157.8ms (rename: 3.3ms, git: 154.5ms)
Run 3: 153.1ms (rename: 2.7ms, git: 150.3ms)
Run 4: 111.0ms (rename: 1.7ms, git: 109.2ms)
Run 5: 109.9ms (rename: 2.0ms, git: 107.9ms)
OLD Approach (Full stageAll with git add .) average: 154.8ms
- Rename operations: 7.8ms
- Git operations: 146.9ms
=== Benchmarking NEW Approach (Full stageAll with selective staging) ===
Run 1: 436.5ms (rename: 2.0ms, git: 434.3ms) [77 add, 7 rm]
Run 2: 457.8ms (rename: 2.2ms, git: 455.6ms) [77 add, 7 rm]
Run 3: 443.2ms (rename: 1.7ms, git: 441.4ms) [77 add, 7 rm]
Run 4: 486.8ms (rename: 2.0ms, git: 484.7ms) [77 add, 7 rm]
Run 5: 459.9ms (rename: 2.2ms, git: 457.7ms) [77 add, 7 rm]
NEW Approach (Full stageAll with selective staging) average: 456.8ms
- Rename operations: 2.0ms
- Git operations: 454.7ms
=== Performance Comparison ===
OLD approach average: 154.8ms
- Rename operations: 7.8ms
- Git operations: 146.9ms
NEW approach average: 456.8ms
- Rename operations: 2.0ms
- Git operations: 454.7ms
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new
stageAllimplementation makes multiple git calls (status, add, rm) compared to the previous singlegit add .call. Have you measured the performance impact of this change? For large repositories with many changes, this could potentially be slower.While the precision is valuable for avoiding unnecessary files, it might be worth benchmarking this against the old approach to ensure we're not introducing a performance regression.
Fwiw add all takes like 5-10 minutes on my repo.
| } | ||
| } | ||
|
|
||
| this.log(`[${this.constructor.name}#initShadowGit] Repository repack failed, continuing without optimization.`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the tryRepack method, errors are logged but the method continues silently. Should we consider:
- Tracking these failures (perhaps with a counter)?
- Alerting the user if repacking consistently fails?
- Adding telemetry to understand how often repacking fails in the wild?
This would help identify problematic Git installations or configurations that might prevent users from benefiting from the optimization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could for instance, keep track and alert if there are more than N consecutive failures.
But I'm not familiar enough with the project on how to implement user facing alerts or telemetry...
| const commands = isWindows | ||
| ? [ | ||
| // Try the more thorough repack command on Git 2.49.0 for Windows | ||
| { args: ["repack", "-a", "-d", "-f", "--path-walk", "--quiet"], desc: "advanced repack with --path-walk" }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code mentions using Git 2.49.0 on Windows for the --path-walk option, but:
- How do we ensure users have this version installed?
- Should we add a version check and provide a helpful message if they're on an older version?
- Is this documented in the PR description or user-facing documentation?
This could be confusing for Windows users who don't get the expected performance improvements due to their Git version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- No need to ensure, but it does have a fallback strategy
- If it fails, it logs and tries the fallback
- Yes, it is mentioned in the PR description
|
|
||
| // GC related properties | ||
| private gcCounter: number = 0 | ||
| private readonly GC_CHECKPOINT_THRESHOLD: number = Number(process.env.GC_CHECKPOINT_THRESHOLD) || 20 // Run gc every 20 checkpoints |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am curious about this number, could it make it hard for the gc to trigger on short tasks? would it be a good idea to lower it so it triggers more often? it would be a good idea to know how much of an impact running the gc causes in the system.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This number was what was suggested in the original PR. It would trigger less for shorter tasks, but this is already configurable with GC_CHECKPOINT_THRESHOLD.
We could try running gc on a large codebase and measure it if you have one in mind.
|
Kilo is fighting the git bloat as well. They are trying out a cron job to vacuum the checkpoint repo that deletes tasks after a period of time, if someone want to test it out. |
|
@daniel-lxs I've made some changes and comments. |
|
I've been looking at this PR for a while now, and something doesn't feel quite right. I think the best way to proceed is to split this implementation into two separate PRs: one for the garbage collection and another for the new staging method. This will make it much easier to test and identify potential issues. This doesn't mean the implementation is bad; we just need a better separation of concerns to properly test everything. It would also be helpful to have clear steps to reproduce the original issue to verify that these changes resolve it. I'll be closing this PR temporarily, but this PR can be used as base for any future implementations. Thank you @ijin for taking over the original PR and working on this one. |
Related GitHub Issue
Closes #3391
Closes #3348
Closes #3080
Closes #3695
Description
Builds upon #3695
Discussion: #3695 (comment)
Garbage collection and perf improvements
Upgrade to git 2.49.0 (Windows only) for additional benefits
Test Procedure
Unit tests added
Sanity test on local
Type of Change
srcor test files.Pre-Submission Checklist
npm run lint).console.log) has been removed.npm test).mainbranch.npm run changesetif this PR includes user-facing changes or dependency updates.Screenshots / Videos
Documentation Updates
Additional Notes
Get in Touch
Important
Refactor
ShadowCheckpointServiceto enhance garbage collection and checkpoint handling, improving disk space management and performance.ShadowCheckpointServiceto improve garbage collection and disk space management.tryRepack()to handle repository repacking with platform-specific commands.saveCheckpoint()and after branch deletion indeleteBranch().findAndCacheNestedGitRepoPaths()to avoid repeated scans.stageAll()to usegit statusfor precise file staging.ShadowCheckpointService.test.tsto verify new garbage collection and checkpoint behaviors.executeRipgrepto simulate nested git directory detection.ShadowCheckpointService.This description was created by
for 2d8d2d3. You can customize this summary. It will automatically update as commits are pushed.