Fix OSS RE client: FindMissingCache staleness, TTL, and silent upload skip#1273
Fix OSS RE client: FindMissingCache staleness, TTL, and silent upload skip#1273mattparks wants to merge 1 commit intofacebook:mainfrom
Conversation
|
Hi @mattparks! Thank you for your pull request and welcome to our community. Action RequiredIn order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you. ProcessIn order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks! |
|
@facebook-github-bot has imported this pull request. If you are a Meta employee, you can view this in D97149108. (Because this pull request was imported automatically, there will not be any future comments.) |
… skip Four bugs in the OSS RE gRPC client cause CasArtifactNotFound errors when concurrent actions share input digests. Bug 1: convert_action_result() hardcodes ttl: 0, causing the deferred materializer to treat every output as immediately expired. Fix: use the configured cas_ttl_secs. Bug 2: get_digests_ttl() caches ALL checked digests as ExistsOnRemote BEFORE processing the FindMissing response. A concurrent action that hits the cache between caching and response processing sees a premature ExistsOnRemote for a digest that is actually missing, skips upload, and its Execute fails. Fix: build the missing set first, then only cache digests confirmed to exist. Bug 3: get_digests_ttl() also caches DigestRemoteState::Missing. A "Missing" entry is transient — the blob may appear at any time. Caching it prevents future FindMissingBlobs RPCs from being sent. Fix: do not cache Missing entries. Bug 4: FindMissingCache global TTL is 12 hours, but cas_ttl_secs defaults to 3 hours (since PR facebook#1248). Cache entries outlive the blobs they describe. Fix: set cache TTL to cas_ttl_secs. Bug 5: Uploader silently skips RequiresCasDownload files reported missing by FindMissingBlobs. This leaves downstream actions with incomplete inputs. Fix: materialize locally and re-upload instead of skipping.
Three related bugs in the OSS RE gRPC client cause
remote_upload_errorwhen deferred-materialized outputs from one action are used as inputs to another action.Bug 1: Hardcoded TTL of 0 in OSS shim
convert_action_result()hardcodesttl: 0for all output files andaction_result_ttl: 0for execute responses. The standard REv2 protocol has no TTL field — TTL is a buck2 extension. With ttl=0, the deferred materializer treats every blob as immediately expired, soget_digests_ttl()always queries FindMissingBlobs instead of trusting that recently-produced outputs still exist.Fix: use
cas_ttl_secsfrom runtime options (configured viacas_ttl_secsbuckconfig key, default 3 hours) instead of hardcoding 0.Bug 2: FindMissingCache caches "Missing" state
get_digests_ttl()cachesDigestRemoteState::Missingin the FindMissingCache LRU (500K entries, 12-hour TTL). "Missing" is a transient state — the blob may be uploaded or produced by an RE action at any time. A stale "Missing" entry causesget_digests_ttl()to return ttl=0 without sending a FindMissingBlobs RPC, even though the blob now exists in CAS. Since deferred materialization never downloaded the blob, buck2 cannot upload it and fails withremote_upload_error.Fix: stop caching
DigestRemoteState::Missing. Only cacheExistsOnRemote. Additionally, mark top-level output file and directory digests asExistsOnRemotewhen processing action results from Execute, GetActionResult, and WriteActionResult responses.Bug 3: Silent skip of missing CAS artifacts during upload
When the uploader encounters a
RequiresCasDownloadfile that FindMissingBlobs reports as missing, it logs asoft_error!("cas_missing")and silently skips the file (continue). The downstream action then fails with incomplete inputs.Fix: instead of silently skipping, materialize the file locally (which downloads from CAS, bypassing FindMissingCache) and add it to the upload list. If CAS truly doesn't have the blob,
ensure_materializedfails with a clear error instead of a mysterious downstream action failure.