Commit 4a661d3
Disable per-RPC timeout by default to prevent large upload retry loops

The GrpcStore rpc_timeout_s defaulted to 120 seconds, which is too short for multi-GB uploads. This caused DeadlineExceeded errors that triggered retries, restarting the upload and compounding the problem.

Dead connections are already detected by HTTP/2 keepalive (30s ping, 20s timeout) and TCP keepalive (30s) on each endpoint, so the per-RPC total timeout is unnecessary for that purpose.

Setting rpc_timeout_s=0 now correctly disables the timeout instead of silently falling through to the 120s default.

Fixes #2185
1 parent 3ff25a7 commit 4a661d3

2 files changed: +10 -8 lines changed

nativelink-config/src/stores.rs

Lines changed: 9 additions & 3 deletions
@@ -1126,10 +1126,16 @@ pub struct GrpcSpec {
     pub connections_per_endpoint: usize,
 
     /// Maximum time (seconds) allowed for a single RPC request (e.g. a
-    /// ByteStream.Write call) before it is cancelled. This prevents
-    /// individual RPCs from hanging forever on dead connections.
+    /// ByteStream.Write call) before it is cancelled.
     ///
-    /// Default: 120 (seconds)
+    /// A value of 0 (the default) disables the per-RPC timeout. Dead
+    /// connections are still detected by the HTTP/2 and TCP keepalive
+    /// mechanisms configured on each endpoint.
+    ///
+    /// For large uploads (multi-GB), either leave this at 0 or set it
+    /// large enough to accommodate the full transfer time.
+    ///
+    /// Default: 0 (disabled)
     #[serde(default, deserialize_with = "convert_duration_with_shellexpand")]
     pub rpc_timeout_s: u64,
 }
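
The advice to set the timeout "large enough to accommodate the full transfer time" can be made concrete with a rough estimate. The sketch below is illustrative only: `min_rpc_timeout_s`, its parameters, and the safety factor are hypothetical names that are not part of NativeLink, and it assumes a sustained average throughput for the whole upload.

```rust
/// Hypothetical helper (not part of NativeLink): estimates the smallest
/// `rpc_timeout_s` that lets an upload of `size_bytes` finish at a sustained
/// `bytes_per_sec`, multiplied by `safety_factor` for headroom.
/// Returns `None` when the throughput is zero, since no finite timeout suffices.
fn min_rpc_timeout_s(size_bytes: u64, bytes_per_sec: u64, safety_factor: u64) -> Option<u64> {
    if bytes_per_sec == 0 {
        return None;
    }
    // Ceiling division: a partial final second still counts as a full second.
    let remainder_s = u64::from(size_bytes % bytes_per_sec != 0);
    let transfer_s = size_bytes / bytes_per_sec + remainder_s;
    // Saturate instead of overflowing for extreme inputs.
    Some(transfer_s.saturating_mul(safety_factor))
}
```

For instance, a 4 GiB upload at a sustained 25 MiB/s needs about 164 seconds, which already exceeds the previous 120-second default before any safety margin is applied.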

nativelink-store/src/grpc_store.rs

Lines changed: 1 addition & 5 deletions
@@ -90,11 +90,7 @@ impl GrpcStore {
             endpoints.push(endpoint);
         }
 
-        let rpc_timeout = if spec.rpc_timeout_s > 0 {
-            Duration::from_secs(spec.rpc_timeout_s)
-        } else {
-            Duration::from_secs(120)
-        };
+        let rpc_timeout = Duration::from_secs(spec.rpc_timeout_s);
 
         Ok(Arc::new(Self {
             instance_name: spec.instance_name.clone(),
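
This hunk does not show how a zero `rpc_timeout` is interpreted at the point where each RPC is issued. One common way to express "0 means disabled" is to map the configured seconds to an `Option<Duration>` and only apply a deadline when the value is `Some`. The following is a minimal sketch under that assumption; `effective_rpc_timeout` is a hypothetical name and this is not the code from the commit.

```rust
use std::time::Duration;

/// Hypothetical helper (not from this commit): treats a configured value of 0
/// as "no per-RPC deadline" and any positive value as a deadline in seconds.
fn effective_rpc_timeout(rpc_timeout_s: u64) -> Option<Duration> {
    if rpc_timeout_s == 0 {
        // Disabled: the caller should issue the RPC without a total deadline
        // and rely on keepalive to detect dead connections.
        None
    } else {
        Some(Duration::from_secs(rpc_timeout_s))
    }
}
```

A caller would then wrap the request in a timeout only when the result is `Some`, so that the default of 0 never produces an immediate deadline.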