Conversation

@HongboDu-at commented Jan 7, 2026

From internal usage: this reduces the time to fetch inputs (~4 GB) from ~60s to ~20s.

aspect-workflows bot commented Jan 7, 2026

Test

16 test targets passed

Targets
//pkg/blobstore:blobstore_test [k8-fastbuild]                                         233ms
//pkg/builder:builder_test [k8-fastbuild]                                             256ms
//pkg/cas:cas_test [k8-fastbuild]                                                     411ms
//pkg/cleaner:cleaner_test [k8-fastbuild]                                             180ms
//pkg/clock:clock_test [k8-fastbuild]                                                 197ms
//pkg/filesystem/pool:pool_test [k8-fastbuild]                                        310ms
//pkg/filesystem/virtual/fuse:fuse_test [k8-fastbuild]                                199ms
//pkg/filesystem/virtual/nfsv4:nfsv4_test [k8-fastbuild]                              333ms
//pkg/filesystem/virtual:virtual_test [k8-fastbuild]                                  529ms
//pkg/filesystem:filesystem_test [k8-fastbuild]                                       585ms
//pkg/runner:runner_test [k8-fastbuild]                                               156ms
//pkg/scheduler/initialsizeclass:initialsizeclass_test [k8-fastbuild]                 535ms
//pkg/scheduler/platform:platform_test [k8-fastbuild]                                 370ms
//pkg/scheduler/routing:routing_test [k8-fastbuild]                                   121ms
//pkg/scheduler:scheduler_test [k8-fastbuild]                                         439ms
//pkg/sync:sync_test [k8-fastbuild]                                                   164ms

Total test execution time was 5s. 3 tests (15.8%) were fully cached, saving 418ms.

select {
case <-d.wait:
	if d.err != nil {
		return d.err
Member

This is incorrect. A caller of GetFile() may receive an error that was produced by another invocation. This should not happen, as those errors may also include things like context cancellation.

Author

Updated to retry if it failed because of context cancellation.

Contributor

nit: bb_storage/pkg/auth/remoteAuthorizer.authorizeSingle also retries, but has the same kind of flaw. Consider the case where the timeout is 55s, downloading the file takes 50s, one call comes in every 30s, and the first call is cancelled after 35s. Then all retries for the later requests will also fail, because when each retry starts, its context has less than 50s left.

0s RPC 0
30s RPC 1
45s RPC 0 cancelled, RPC 1 retries
60s RPC 2
85s RPC 1 times out (5s left to download), RPC 2 retries
90s RPC 3
115s RPC 2 times out (15s left to download), RPC 3 retries
120s RPC 4
145s RPC 3 times out (15s left to download), RPC 4 retries
...

One solution is to manually cancel a context derived from context.Background() when there are no requests left waiting. I don't know how complex the implementation and tests for that kind of solution would be, or whether it's worth bothering with.
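
A minimal sketch of that idea, for illustration only (downloadState, sharedFetcher, and fetch are made-up names, not from this PR): the download runs on a context detached from any single caller's deadline, and the last waiter to give up cancels it. It glosses over at least one race: a caller joining just after the last waiter cancelled will wait on a doomed download.

package fetcher

import (
	"context"
	"sync"
)

type downloadState struct {
	cancel  context.CancelFunc
	waiters int
	done    chan struct{}
}

type sharedFetcher struct {
	lock      sync.Mutex
	downloads map[string]*downloadState
	fetch     func(ctx context.Context, key string) // Stand-in for the real CAS download.
}

func (f *sharedFetcher) getFile(ctx context.Context, key string) error {
	f.lock.Lock()
	d, ok := f.downloads[key]
	if !ok {
		// First caller: run the download on a context that is
		// detached from any single RPC's deadline.
		dctx, cancel := context.WithCancel(context.Background())
		d = &downloadState{cancel: cancel, done: make(chan struct{})}
		f.downloads[key] = d
		go func() {
			f.fetch(dctx, key)
			f.lock.Lock()
			delete(f.downloads, key)
			f.lock.Unlock()
			close(d.done)
		}()
	}
	d.waiters++
	f.lock.Unlock()

	defer func() {
		f.lock.Lock()
		d.waiters--
		if d.waiters == 0 {
			d.cancel() // No caller is interested anymore; abort the download.
		}
		f.lock.Unlock()
	}()

	select {
	case <-d.done:
		return nil // Callers re-check the cache for the actual outcome.
	case <-ctx.Done():
		return ctx.Err()
	}
}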

Member

Can't we just eliminate the error propagation logic entirely? Just change that downloads map to:

downloads map[string]<-chan struct{}

Only use that to wait for an existing download to complete.
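
A minimal sketch of what that channel-only shape could look like (fileFetcher, tryLinkFromCache, and fetch here are stand-ins, not the PR's exact code): a waiter never consumes another caller's error; when the channel closes it simply re-probes the cache under its own context and, if needed, becomes the downloader itself.

package fetcher

import (
	"context"
	"sync"
)

type fileFetcher struct {
	downloadsLock sync.Mutex
	downloads     map[string]<-chan struct{}
}

// Stand-ins for the real cache probe and CAS download.
func (ff *fileFetcher) tryLinkFromCache(key string) (linked bool, err error) { return false, nil }
func (ff *fileFetcher) fetch(ctx context.Context, key string) error          { return nil }

func (ff *fileFetcher) getFile(ctx context.Context, key string) error {
	for {
		if linked, err := ff.tryLinkFromCache(key); linked || err != nil {
			return err
		}
		ff.downloadsLock.Lock()
		wait, inFlight := ff.downloads[key]
		if !inFlight {
			ch := make(chan struct{})
			ff.downloads[key] = ch
			ff.downloadsLock.Unlock()
			err := ff.fetch(ctx, key)
			ff.downloadsLock.Lock()
			delete(ff.downloads, key)
			ff.downloadsLock.Unlock()
			close(ch)
			return err
		}
		ff.downloadsLock.Unlock()
		// Wait for the in-flight download to finish, then loop and
		// re-probe the cache. No error ever crosses between callers.
		select {
		case <-wait:
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}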

Author

Thanks for the suggestion. Updated to use the new map.

	ff.downloads[key] = d
	ff.downloadsLock.Unlock()

	downloadErr := ff.base.GetFile(ctx, blobDigest, directory, name, isExecutable)
Member

Am I correct that there is still a slight race here, where files get downloaded unnecessarily/redundantly? Shouldn't we do another tryLinkFromCache() at this specific point?

Author

Good catch. Added.
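
The added double-check presumably looks roughly like this (a guess based on the tryLinkFromCache signature quoted later in this thread; cleanup of the downloads entry on the early return is elided, and with the defer suggested further down it would happen automatically):

	ff.downloads[key] = d
	ff.downloadsLock.Unlock()

	// Double-check: another caller may have finished the download in
	// the window between our first cache probe and registering above.
	wasMissing, err := ff.tryLinkFromCache(key, directory, name)
	if !wasMissing {
		return err // Linked from cache (err == nil), or a real error.
	}

	downloadErr := ff.base.GetFile(ctx, blobDigest, directory, name, isExecutable)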

@HongboDu-at requested a review from EdSchouten on January 7, 2026 at 23:13

type download struct {
	wait chan struct{}
	err  error
Contributor

Suggested change:

-	err error
+	// err is available only when wait has been closed.
+	err error

}

type download struct {
	wait chan struct{}
Contributor

nit: It is possible to be more strict here with wait <-chan struct{} and further down use

wait := make(chan struct{})
d = &download{wait: wait}
close(wait)
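
The point of the stricter field type is that holders of the struct can then only receive; the ability to close stays with whoever made the channel. A compiler-checked sketch (newDownload is a made-up helper, not from this PR):

package fetcher

type download struct {
	wait <-chan struct{} // Receive-only: consumers can only wait on it.
	err  error
}

// newDownload keeps the bidirectional handle local, so only the
// returned finish function can close the channel. Writing
// close(d.wait) elsewhere would not compile:
// "invalid operation: cannot close receive-only channel".
func newDownload() (*download, func()) {
	wait := make(chan struct{})
	return &download{wait: wait}, func() { close(wait) }
}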

Author

Thanks for the suggestion.

Comment on lines 123 to 125
	d = &download{wait: make(chan struct{})}
	ff.downloads[key] = d
	ff.downloadsLock.Unlock()
Contributor

ff.downloadsLock does not necessarily need to be held when setting d.err, as it is only read after d.wait has been closed. This means that defer can be used.

Suggested change:

-	d = &download{wait: make(chan struct{})}
-	ff.downloads[key] = d
-	ff.downloadsLock.Unlock()
+	wait := make(chan struct{})
+	d = &download{wait: wait}
+	ff.downloads[key] = d
+	ff.downloadsLock.Unlock()
+	defer func() {
+		ff.downloadsLock.Lock()
+		delete(ff.downloads, key)
+		ff.downloadsLock.Unlock()
+		close(wait)
+	}()

Author

Thanks for the suggestion. Updated to use defer. The code is cleaner.
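
The reason the unlocked write is safe is the Go memory model's guarantee for channels: a close happens before any receive that observes the close, so a write to d.err sequenced before close(wait) is visible to every goroutine that only reads d.err after <-d.wait returns. A tiny runnable demonstration of the same pattern:

package main

import "errors"

type download struct {
	wait chan struct{}
	err  error
}

func main() {
	d := &download{wait: make(chan struct{})}
	go func() {
		d.err = errors.New("download failed") // Write before close...
		close(d.wait)
	}()
	<-d.wait // ...the receive observes the close...
	if d.err != nil {
		println(d.err.Error()) // ...so this read is race-free.
	}
}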

Comment on lines 129 to 145
	wasMissing, err = ff.tryLinkFromCache(key, directory, name)
	if err == nil {
		// File appeared in cache, no download needed.
		ff.downloadsLock.Lock()
		delete(ff.downloads, key)
		ff.downloadsLock.Unlock()
		close(d.wait)
		return nil
	} else if !wasMissing {
		// tryLinkFromCache had a real error (not just missing).
		ff.downloadsLock.Lock()
		d.err = err
		delete(ff.downloads, key)
		ff.downloadsLock.Unlock()
		close(d.wait)
		return err
	}
Contributor

See comment on line 125.

Suggested change:

-	wasMissing, err = ff.tryLinkFromCache(key, directory, name)
-	if err == nil {
-		// File appeared in cache, no download needed.
-		ff.downloadsLock.Lock()
-		delete(ff.downloads, key)
-		ff.downloadsLock.Unlock()
-		close(d.wait)
-		return nil
-	} else if !wasMissing {
-		// tryLinkFromCache had a real error (not just missing).
-		ff.downloadsLock.Lock()
-		d.err = err
-		delete(ff.downloads, key)
-		ff.downloadsLock.Unlock()
-		close(d.wait)
-		return err
-	}
+	wasMissing, d.err = ff.tryLinkFromCache(key, directory, name)
+	if !wasMissing {
+		// Either the file appeared in cache (no download needed),
+		// or tryLinkFromCache had a real error (not just missing).
+		return d.err
+	}

@HongboDu-at left a comment

Thank you for the review.
