-
Notifications
You must be signed in to change notification settings - Fork 24
fix: Avoid frequent writes to graph.json
in a loop
#738
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
fix: Avoid frequent writes to graph.json
in a loop
#738
Conversation
src/fromager/commands/bootstrap.py
Outdated
wkctx.write_to_graph_to_file() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you measured how large the performance improvement is going to be?
bootstrapping can fail for multiple reasons like missing sdist, conflict, or build error. We require to graph file to debug problems. Your change may speed up bootstrapping by a few seconds, maybe less.
Think about how you can write the file less often but still write it in case of an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't tested the performance difference, although I'd like to. Need to figure out how, and once I do I'll post my findings here.
We're also calling DependencyGraph.serialize()
for every time we write, which may or may not make an additional performance difference to the file writes.
As for the error case - good point, didn't know the graph file is used for debugging.
My thought is to wrap the section where we loop over the packages with a try
block, and then put wkctx.write_to_graph_to_file()
under a finally
block, so that we export the dependency graph whether there's an error or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tiran I've updated the code. It should save the graph.json
file even if an error is encountered and it's incomplete.
Let me know what you think.
Before this change, we would write the `graph.json` over and over again for every single dependency that's added while the dependency graph is being created. This commit changes this behavior to only write to the file once, only when all dependencies have been already added. Signed-off-by: Michael Yochpaz <[email protected]>
7c46f4e
to
ff39fb7
Compare
Signed-off-by: Michael Yochpaz <[email protected]>
ff39fb7
to
2670fb7
Compare
The code looks good to me but @tiran needs to review it as well. |
This introduces some risk of losing the graph if the reason for the bootstrap failure is a full disk, but I think we can live with that if it improves performance. I'll leave it to @tiran to approve, since he has been thinking about these performance-related changes. When you measured the time, how much difference did this make? |
try: | ||
for req in to_build: | ||
token = requirement_ctxvar.set(req) | ||
bt.bootstrap(req, requirements_file.RequirementType.TOP_LEVEL) | ||
progressbar.update() | ||
requirement_ctxvar.reset(token) | ||
|
||
finally: | ||
wkctx.write_to_graph_to_file() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about you write the graph file every time Fromager processes a top level requirement? A typical product has between one and a handful of top level requirement.
for req in to_build:
token = requirement_ctxvar.set(req)
try:
bt.bootstrap(req, requirements_file.RequirementType.TOP_LEVEL)
finally:
wkctx.write_to_graph_to_file()
progressbar.update()
requirement_ctxvar.reset(token)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the code you made will still write the file for all dependencies, as to my understanding, we call bootstrap
on every dependency recursively, so this loop runs for every dependency down the tree and not just the top level requirements. (this loop you quoted is part of the bootstrap
function, and we call bootstrap
on every requirement).
About the idea itself - it's possible to only write after each top level dependency is completed, and will likely still be a big performance boost (slightly lesser since we still do more writes, but it's marginal), but what benefit does this approach have over the approach of writing once at the end, inside a finally
block (which means we'll write the file even if there's an error)? I don't see any benefits to it.
Before this change, we would write the
graph.json
over and over again for every single dependency that's added while the dependency graph is being created.This PR changes this behavior to only write to the file once, only when all dependencies have been already added.
This PR resolves #735