-
Couldn't load subscription status.
- Fork 31
fix: Avoid frequent writes to graph.json in a loop
#738
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
fix: Avoid frequent writes to graph.json in a loop
#738
Conversation
src/fromager/commands/bootstrap.py
Outdated
| wkctx.write_to_graph_to_file() | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you measured how large the performance improvement is going to be?
bootstrapping can fail for multiple reasons like missing sdist, conflict, or build error. We require to graph file to debug problems. Your change may speed up bootstrapping by a few seconds, maybe less.
Think about how you can write the file less often but still write it in case of an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't tested the performance difference, although I'd like to. Need to figure out how, and once I do I'll post my findings here.
We're also calling DependencyGraph.serialize() for every time we write, which may or may not make an additional performance difference to the file writes.
As for the error case - good point, didn't know the graph file is used for debugging.
My thought is to wrap the section where we loop over the packages with a try block, and then put wkctx.write_to_graph_to_file() under a finally block, so that we export the dependency graph whether there's an error or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tiran I've updated the code. It should save the graph.json file even if an error is encountered and it's incomplete.
Let me know what you think.
Before this change, we would write the `graph.json` over and over again for every single dependency that's added while the dependency graph is being created. This commit changes this behavior to only write to the file once, only when all dependencies have been already added. Signed-off-by: Michael Yochpaz <[email protected]>
7c46f4e to
ff39fb7
Compare
Signed-off-by: Michael Yochpaz <[email protected]>
ff39fb7 to
2670fb7
Compare
|
The code looks good to me but @tiran needs to review it as well. |
|
This introduces some risk of losing the graph if the reason for the bootstrap failure is a full disk, but I think we can live with that if it improves performance. I'll leave it to @tiran to approve, since he has been thinking about these performance-related changes. When you measured the time, how much difference did this make? |
| try: | ||
| for req in to_build: | ||
| token = requirement_ctxvar.set(req) | ||
| bt.bootstrap(req, requirements_file.RequirementType.TOP_LEVEL) | ||
| progressbar.update() | ||
| requirement_ctxvar.reset(token) | ||
|
|
||
| finally: | ||
| wkctx.write_to_graph_to_file() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about you write the graph file every time Fromager processes a top level requirement? A typical product has between one and a handful of top level requirement.
for req in to_build:
token = requirement_ctxvar.set(req)
try:
bt.bootstrap(req, requirements_file.RequirementType.TOP_LEVEL)
finally:
wkctx.write_to_graph_to_file()
progressbar.update()
requirement_ctxvar.reset(token)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the code you made will still write the file for all dependencies, as to my understanding, we call bootstrap on every dependency recursively, so this loop runs for every dependency down the tree and not just the top level requirements. (this loop you quoted is part of the bootstrap function, and we call bootstrap on every requirement).
About the idea itself - it's possible to only write after each top level dependency is completed, and will likely still be a big performance boost (slightly lesser since we still do more writes, but it's marginal), but what benefit does this approach have over the approach of writing once at the end, inside a finally block (which means we'll write the file even if there's an error)? I don't see any benefits to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The graph file is critical for debugging build failures. It describes the entire dependency tree up to the point of the failure and helps us understand why packages are built in the order they are built. We MUST retain the ability to produce that file. Writing the file frequently is not overhead, it is a feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dhellmann I get your point that it's an important file, but fail to understand how the proposed code lessens the ability to produce this file.
We still write the file whether it succeeds or there's an error, we just do it once instead of doing it repeatedly.
I don't really see how an edge case of a full disk is relevant... if the disk is full, you won't be able to write to the graph.json file with the existing code either, and Fromager will break regardless of the graph.json file, since it won't be able to download sdists nor build packages.
In the very rare case that there's only a few KB / MB available, you might get a partial graph file before it breaks using the current code, compared to no file at all with the proposed changes. But I don't see how this file would be useful as Fromager (any maybe even other parts of the system like Python) will break, and the user will want to rerun it properly anyways (where a full, proper graph.json file will be created) after resolving the space issue.
If you still strongly believe it's necessary to rewrite the file following every change let me know and I'll close the PR. I made this change mainly because I think it's a better practice and cleaner code, but I agree it's probably not a meaningful performance improvement.
Before this change, we would write the
graph.jsonover and over again for every single dependency that's added while the dependency graph is being created.This PR changes this behavior to only write to the file once, only when all dependencies have been already added.
This PR resolves #735