
Conversation

@abderrahim
Contributor

This is a proposal towards #2020. It's probably not the most efficient way to do it, but at least it works.

What this does is:

  • ignore failures found in the action cache
  • ask the remote execution service to not look in the cache (a rough sketch of both follows the diff below)

```diff
 stub = self.exec_remote.exec_service
 request = remote_execution_pb2.ExecuteRequest(
-    instance_name=self.exec_remote.instance_name, action_digest=action_digest, skip_cache_lookup=False
+    instance_name=self.exec_remote.instance_name, action_digest=action_digest, skip_cache_lookup=True
```
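
For concreteness, the two parts amount to roughly the sketch below. This is a simplified illustration rather than the actual patch: the import path, the channel attribute and the helper names are assumptions, and error handling is kept to a minimum.

```python
import grpc

# Assumed import path for BuildStream's bundled REAPI protos.
from buildstream._protos.build.bazel.remote.execution.v2 import (
    remote_execution_pb2,
    remote_execution_pb2_grpc,
)


def lookup_action_cache(exec_remote, action_digest):
    """Part 1: query the action cache, treating a cached failure as a miss."""
    ac_stub = remote_execution_pb2_grpc.ActionCacheStub(exec_remote.channel)
    request = remote_execution_pb2.GetActionResultRequest(
        instance_name=exec_remote.instance_name, action_digest=action_digest
    )
    try:
        result = ac_stub.GetActionResult(request)
    except grpc.RpcError as e:
        if e.code() == grpc.StatusCode.NOT_FOUND:
            return None  # nothing cached for this action
        raise
    if result.exit_code != 0:
        return None  # ignore a cached failure, as if it were a cache miss
    return result


def execute_skipping_cache(exec_remote, action_digest):
    """Part 2: run the action, asking the server not to consult its own cache."""
    request = remote_execution_pb2.ExecuteRequest(
        instance_name=exec_remote.instance_name,
        action_digest=action_digest,
        skip_cache_lookup=True,
    )
    # Execute() returns a stream of long-running Operation messages.
    return exec_remote.exec_service.Execute(request)
```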

Contributor

We don't want to always skip cache lookup. If no action-cache-service is declared, the internal action cache lookup by the remote execution server should still be used.
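
For illustration, the rule being asked for could be as simple as the following sketch; the parameter name is an assumption, not an existing BuildStream attribute.

```python
def should_skip_server_cache_lookup(action_cache_remote) -> bool:
    # Only bypass the execution service's internal cache when a separate
    # action-cache-service is configured, i.e. when BuildStream performs
    # (and can filter) the action cache lookup itself.
    return action_cache_remote is not None
```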

Contributor Author

Yeah, even when the action-cache-service is defined, there is still value in having the execution service redo the cache lookup (in case the action is requested and built by someone else before our action reaches the top of the queue).

You're right that this is probably working around a "broken" remote execution server. I've only tested it with buildbox-casd, and wasn't aware it was different with other servers.

@juergbi
Contributor

juergbi commented Jul 11, 2025

I think the main issue is that failed actions are cached in the action cache in the first place. While there may be circumstances where caching failed actions is useful, I don't think we need this for BuildStream at all, given that we have a higher-level caching mechanism where we already support caching failures. Most remote execution servers default to not caching failed actions, mainly because failures may be spurious, e.g. due to a worker running out of RAM.

BuildGrid caches failures by default, but this can be disabled with cache-failed-actions: false. buildbox-casd currently caches failures unconditionally, but I think we should change that or at least add an option to disable it.

On the BuildStream side, might it be sufficient to skip action cache lookup (both the direct action cache query and the indirect lookup via Execute()) if context.build_retry_failed is set?
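
For concreteness, the suggestion would amount to something like the sketch below. Apart from context.build_retry_failed, the names are assumptions; lookup_action_cache refers to the hypothetical helper sketched earlier, and remote_execution_pb2 is the same assumed import.

```python
def run_remote_action(context, exec_remote, action_digest):
    """Sketch: bypass both cache paths when retrying failed builds."""
    retry_failed = context.build_retry_failed

    # Direct action cache query: skip it entirely when retrying failures.
    result = None if retry_failed else lookup_action_cache(exec_remote, action_digest)
    if result is not None:
        return result

    # Indirect lookup via Execute(): also ask the server to bypass its cache.
    request = remote_execution_pb2.ExecuteRequest(
        instance_name=exec_remote.instance_name,
        action_digest=action_digest,
        skip_cache_lookup=retry_failed,
    )
    return exec_remote.exec_service.Execute(request)
```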

@abderrahim
Contributor Author

> I think the main issue is that failed actions are cached in the action cache in the first place. While there may be circumstances where caching failed actions is useful, I don't think we need this for BuildStream at all, given that we have a higher-level caching mechanism where we already support caching failures. Most remote execution servers default to not caching failed actions, mainly because failures may be spurious, e.g. due to a worker running out of RAM.
>
> BuildGrid caches failures by default, but this can be disabled with cache-failed-actions: false. buildbox-casd currently caches failures unconditionally, but I think we should change that or at least add an option to disable it.

Yeah, I was using this with buildbox-casd as a server. The failures in question were due to a bug in buildbox-fuse.

> On the BuildStream side, might it be sufficient to skip action cache lookup (both the direct action cache query and the indirect lookup via Execute()) if context.build_retry_failed is set?

The thing is, context.build_retry_failed is only set when the option is passed on the command line or in the config file. It doesn't work when you choose "retry" at the interactive prompt.

@juergbi
Contributor

juergbi commented Jul 11, 2025

> The thing is, context.build_retry_failed is only set when the option is passed on the command line or in the config file. It doesn't work when you choose "retry" at the interactive prompt.

Good point, but maybe we can find a way to forward this information in the interactive retry case as well.
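
Purely as a hypothetical shape, and not an agreed design (none of these names exist in BuildStream today): the interactive prompt could record which elements were retried, and the remote execution client could treat that the same as build_retry_failed.

```python
class RetriedElements:
    """Hypothetical tracker for elements retried from the interactive prompt."""

    def __init__(self):
        self._retried = set()

    def mark(self, element_key):
        # Called when the user answers "retry" at the interactive failure prompt.
        self._retried.add(element_key)

    def should_skip_cache(self, element_key, context):
        # Treat an interactive retry the same as the build_retry_failed option.
        return context.build_retry_failed or element_key in self._retried
```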
