
Conversation

@abderrahim
Contributor

This is a proposal towards #2020. It's probably not the most efficient way to do it, but at least it works.

What this does is:

  • ignore failures found in the action cache
  • ask the remote execution service to not look in the cache (a rough sketch of both follows the diff below)

```diff
 stub = self.exec_remote.exec_service
 request = remote_execution_pb2.ExecuteRequest(
-    instance_name=self.exec_remote.instance_name, action_digest=action_digest, skip_cache_lookup=False
+    instance_name=self.exec_remote.instance_name, action_digest=action_digest, skip_cache_lookup=True
```
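
For concreteness, the two parts amount to roughly the sketch below. This is a simplified illustration rather than the actual patch: the import path, the channel attribute and the helper names are assumptions, and error handling is kept to a minimum.

```python
import grpc

# Assumed import path for BuildStream's bundled REAPI protos.
from buildstream._protos.build.bazel.remote.execution.v2 import (
    remote_execution_pb2,
    remote_execution_pb2_grpc,
)


def lookup_action_cache(exec_remote, action_digest):
    """Part 1: query the action cache, treating a cached failure as a miss."""
    ac_stub = remote_execution_pb2_grpc.ActionCacheStub(exec_remote.channel)
    request = remote_execution_pb2.GetActionResultRequest(
        instance_name=exec_remote.instance_name, action_digest=action_digest
    )
    try:
        result = ac_stub.GetActionResult(request)
    except grpc.RpcError as e:
        if e.code() == grpc.StatusCode.NOT_FOUND:
            return None  # nothing cached for this action
        raise
    if result.exit_code != 0:
        return None  # ignore a cached failure, as if it were a cache miss
    return result


def execute_skipping_cache(exec_remote, action_digest):
    """Part 2: run the action, asking the server not to consult its own cache."""
    request = remote_execution_pb2.ExecuteRequest(
        instance_name=exec_remote.instance_name,
        action_digest=action_digest,
        skip_cache_lookup=True,
    )
    # Execute() returns a stream of long-running Operation messages.
    return exec_remote.exec_service.Execute(request)
```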

Contributor

We don't want to always skip cache lookup. If no action-cache-service is declared, the internal action cache lookup by the remote execution server should still be used.
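
For illustration, the rule being asked for could be as simple as the following sketch; the parameter name is an assumption, not an existing BuildStream attribute.

```python
def should_skip_server_cache_lookup(action_cache_remote) -> bool:
    # Only bypass the execution service's internal cache when a separate
    # action-cache-service is configured, i.e. when BuildStream performs
    # (and can filter) the action cache lookup itself.
    return action_cache_remote is not None
```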

Contributor Author

Yeah, even when the action-cache-service is defined, there is still value in having the execution service redo the cache lookup (in case the action is requested and built by someone else before our action reaches the top of the queue).

You're right that this is probably working around a "broken" remote execution server. I've only tested it with buildbox-casd, and wasn't aware it was different with other servers.

@juergbi
Contributor

juergbi commented Jul 11, 2025

I think the main issue is that failed actions are cached in the action cache in the first place. While there may be circumstances where caching failed actions is useful, I don't think we need this for BuildStream at all, given that we have a higher-level caching mechanism where we already support caching failures. Most remote execution servers default to not caching failed actions, mainly because failures may be spurious, e.g. due to a worker running out of RAM.

BuildGrid caches failures by default, but this can be disabled with cache-failed-actions: false. buildbox-casd currently caches failures unconditionally, but I think we should change that or at least add an option to disable it.

On the BuildStream side, might it be sufficient to skip action cache lookup (both the direct action cache query and the indirect lookup via Execute()) if context.build_retry_failed is set?
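
For concreteness, the suggestion would amount to something like the sketch below. Apart from context.build_retry_failed, the names are assumptions; lookup_action_cache refers to the hypothetical helper sketched earlier, and remote_execution_pb2 is the same assumed import.

```python
def run_remote_action(context, exec_remote, action_digest):
    """Sketch: bypass both cache paths when retrying failed builds."""
    retry_failed = context.build_retry_failed

    # Direct action cache query: skip it entirely when retrying failures.
    result = None if retry_failed else lookup_action_cache(exec_remote, action_digest)
    if result is not None:
        return result

    # Indirect lookup via Execute(): also ask the server to bypass its cache.
    request = remote_execution_pb2.ExecuteRequest(
        instance_name=exec_remote.instance_name,
        action_digest=action_digest,
        skip_cache_lookup=retry_failed,
    )
    return exec_remote.exec_service.Execute(request)
```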

@abderrahim
Contributor Author

> I think the main issue is that failed actions are cached in the action cache in the first place. While there may be circumstances where caching failed actions is useful, I don't think we need this for BuildStream at all, given that we have a higher-level caching mechanism where we already support caching failures. Most remote execution servers default to not caching failed actions, mainly because failures may be spurious, e.g. due to a worker running out of RAM.
>
> BuildGrid caches failures by default, but this can be disabled with cache-failed-actions: false. buildbox-casd currently caches failures unconditionally, but I think we should change that or at least add an option to disable it.

Yeah, I was using this with buildbox-casd as a server. The failures in question were due to a bug in buildbox-fuse.

> On the BuildStream side, might it be sufficient to skip action cache lookup (both the direct action cache query and the indirect lookup via Execute()) if context.build_retry_failed is set?

The thing is, context.build_retry_failed is only set when the option is passed on the command line or in the config file. It doesn't work when you choose "retry" at the interactive prompt.

@juergbi
Contributor

juergbi commented Jul 11, 2025

> The thing is, context.build_retry_failed is only set when the option is passed on the command line or in the config file. It doesn't work when you choose "retry" at the interactive prompt.

Good point, but maybe we can find a way to forward this information in the interactive retry case as well.
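
Purely as a hypothetical shape, and not an agreed design (none of these names exist in BuildStream today): the interactive prompt could record which elements were retried, and the remote execution client could treat that the same as build_retry_failed.

```python
class RetriedElements:
    """Hypothetical tracker for elements retried from the interactive prompt."""

    def __init__(self):
        self._retried = set()

    def mark(self, element_key):
        # Called when the user answers "retry" at the interactive failure prompt.
        self._retried.add(element_key)

    def should_skip_cache(self, element_key, context):
        # Treat an interactive retry the same as the build_retry_failed option.
        return context.build_retry_failed or element_key in self._retried
```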
