Skip to content

Commit e29c97d

Browse files
armrugbartolini
andauthored
feat: set default tcp_user_timeout to 5 seconds for replicas (cloudnative-pg#9317)
The default `tcp_user_timeout` for standby replication connections has been changed from the system default to `5000ms` (5 seconds) for all replicas. This new default enhances the robustness of CloudNativePG clusters by enabling standby instances to detect and recover from network issues more quickly. Previously, silent network drops could cause standbys to wait up to ~127 seconds (due to TCP SYN retries) before detecting a failure. With the new 5-second timeout, standbys will close unresponsive connections sooner and promptly retry connecting to the primary. If this default does not meet your requirements, you can override it for all standbys managed by the operator using the `STANDBY_TCP_USER_TIMEOUT` configuration option. PRESERVATION GUIDE FOR EXISTING INSTALLATIONS: If you have an existing CloudNativePG installation where `STANDBY_TCP_USER_TIMEOUT` was not explicitly set (thus defaulting to `0`), and you wish to preserve that behaviour after upgrading, you must now explicitly set it to `0`. Example using a `ConfigMap`: ```yaml apiVersion: v1 kind: ConfigMap metadata: name: cnpg-controller-manager-config namespace: cnpg-system data: STANDY_TCP_USER_TIMEOUT: "0" ``` If the variable is not explicitly configured, the new default of 5 seconds will automatically apply after the next operator upgrade or pod restart. For more information on `tcp_user_timeout`, see the PostgreSQL documentation: https://www.postgresql.org/docs/current/runtime-config-connection.html#GUC-TCP-USER-TIMEOUT Closes cloudnative-pg#9229 Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com> Signed-off-by: Gabriele Bartolini <gabriele.bartolini@enterprisedb.com> Co-authored-by: Gabriele Bartolini <gabriele.bartolini@enterprisedb.com>
1 parent c0de7be commit e29c97d

File tree

9 files changed

+184
-20
lines changed

9 files changed

+184
-20
lines changed

docs/src/operator_conf.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -59,7 +59,7 @@ Name | Description
5959
`PGBOUNCER_IMAGE_NAME` | The name of the PgBouncer image used by default for new poolers. Defaults to the version specified in the operator.
6060
`POSTGRES_IMAGE_NAME` | The name of the PostgreSQL image used by default for new clusters. Defaults to the version specified in the operator.
6161
`PULL_SECRET_NAME` | Name of an additional pull secret to be defined in the operator's namespace and to be used to download images
62-
`STANDBY_TCP_USER_TIMEOUT` | Defines the [`TCP_USER_TIMEOUT` socket option](https://www.postgresql.org/docs/current/runtime-config-connection.html#GUC-TCP-USER-TIMEOUT) for replication connections from standby instances to the primary. Default is 0 (system's default).
62+
`STANDBY_TCP_USER_TIMEOUT` | Defines the [`TCP_USER_TIMEOUT` socket option](https://www.postgresql.org/docs/current/runtime-config-connection.html#GUC-TCP-USER-TIMEOUT) in milliseconds for replication connections from standby instances to the primary. Default is 5000 (5 seconds). Set to `0` to use the system's default.
6363
`DRAIN_TAINTS` | Specifies the taint keys that should be interpreted as indicators of node drain. By default, it includes the taints commonly applied by [kubectl](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/), [Cluster Autoscaler](https://github.com/kubernetes/autoscaler), and [Karpenter](https://github.com/aws/karpenter-provider-aws): `node.kubernetes.io/unschedulable`, `ToBeDeletedByClusterAutoscaler`, `karpenter.sh/disrupted`, `karpenter.sh/disruption`.
6464

6565
Values in `INHERITED_ANNOTATIONS` and `INHERITED_LABELS` support path-like wildcards. For example, the value `example.com/*` will match

docs/src/postgresql_conf.md

Lines changed: 11 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -168,19 +168,20 @@ role within the cluster. These parameters are effectively applied only when the
168168
instance is operating as a replica.
169169

170170
```text
171-
primary_conninfo = 'host=<PRIMARY> user=postgres dbname=postgres'
171+
primary_conninfo = 'host=<PRIMARY> user=postgres dbname=postgres tcp_user_timeout=5000'
172172
recovery_target_timeline = 'latest'
173173
```
174174

175-
The [`STANDBY_TCP_USER_TIMEOUT` operator configuration setting](operator_conf.md#available-options),
176-
if specified, sets the `tcp_user_timeout` parameter on all standby instances
177-
managed by the operator.
178-
179-
The `tcp_user_timeout` parameter determines how long transmitted data can
180-
remain unacknowledged before the TCP connection is forcibly closed. Adjusting
181-
this value allows you to fine-tune the responsiveness of standby instances to
182-
network disruptions. For more details, refer to the
183-
[PostgreSQL documentation](https://www.postgresql.org/docs/current/runtime-config-connection.html#GUC-TCP-USER-TIMEOUT).
175+
!!! Important
176+
By default, every standby sets `tcp_user_timeout` to **5 seconds**, as shown
177+
above. This parameter defines how long transmitted data may remain
178+
unacknowledged before the TCP connection is forcibly closed. Adjusting it lets
179+
you control how quickly a standby reacts to network issues.
180+
If the default value does not meet your requirements, you can override it
181+
for all standbys managed by the operator using the
182+
[`STANDBY_TCP_USER_TIMEOUT` operator configuration option](operator_conf.md#available-options).
183+
For additional details on `tcp_user_timeout`, refer to the
184+
[PostgreSQL documentation](https://www.postgresql.org/docs/current/runtime-config-connection.html#GUC-TCP-USER-TIMEOUT).
184185

185186
### Log control settings
186187

internal/configuration/configuration.go

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -165,7 +165,9 @@ type Data struct {
165165
// added as a tcp_user_timeout option to the primary_conninfo
166166
// string, which is used by the standby server to connect to the
167167
// primary server in CloudNativePG.
168-
StandbyTCPUserTimeout int `json:"standbyTcpUserTimeout" env:"STANDBY_TCP_USER_TIMEOUT"`
168+
// When nil, the instance manager will use a default value of 5000ms.
169+
// Set to 0 explicitly to use the system's default.
170+
StandbyTCPUserTimeout *int `json:"standbyTcpUserTimeout" env:"STANDBY_TCP_USER_TIMEOUT"`
169171

170172
// KubernetesClusterDomain defines the domain suffix for service FQDNs
171173
// within the Kubernetes cluster. If left unset, it defaults to `cluster.local`.
@@ -189,7 +191,7 @@ func newDefaultConfig() *Data {
189191
CreateAnyService: false,
190192
CertificateDuration: CertificateDuration,
191193
ExpiringCheckThreshold: ExpiringCheckThreshold,
192-
StandbyTCPUserTimeout: 0,
194+
StandbyTCPUserTimeout: nil,
193195
KubernetesClusterDomain: DefaultKubernetesClusterDomain,
194196
DrainTaints: DefaultDrainTaints,
195197
}

pkg/configparser/configparser.go

Lines changed: 39 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -64,7 +64,7 @@ var configparserLog = log.WithName("configparser")
6464

6565
// ReadConfigMap reads the configuration from the environment and the passed in data map.
6666
// Config and defaults are supposed to be pointers to structs of the same type
67-
func ReadConfigMap(target interface{}, defaults interface{}, data map[string]string) {
67+
func ReadConfigMap(target interface{}, defaults interface{}, data map[string]string) { //nolint: gocognit
6868
ensurePointerToCompatibleStruct("target", target, "default", defaults)
6969

7070
count := reflect.TypeOf(defaults).Elem().NumField()
@@ -89,6 +89,21 @@ func ReadConfigMap(target interface{}, defaults interface{}, data map[string]str
8989
case reflect.Int:
9090
value = fmt.Sprintf("%v", valueField.Int())
9191

92+
case reflect.Ptr:
93+
// Handle pointer types - if not nil, get the underlying value
94+
if !valueField.IsNil() {
95+
switch valueField.Elem().Kind() {
96+
case reflect.Int:
97+
value = fmt.Sprintf("%v", valueField.Elem().Int())
98+
default:
99+
configparserLog.Info(
100+
"Skipping unsupported pointer type while parsing default configuration",
101+
"field", field.Name, "kind", valueField.Elem().Kind())
102+
continue
103+
}
104+
}
105+
// If nil, value stays empty, which means we'll only set it if env/data provides a value
106+
92107
case reflect.Slice:
93108
if valueField.Type().Elem().Kind() != reflect.String {
94109
configparserLog.Info(
@@ -129,6 +144,29 @@ func ReadConfigMap(target interface{}, defaults interface{}, data map[string]str
129144
continue
130145
}
131146
reflect.ValueOf(target).Elem().FieldByName(field.Name).SetInt(intValue)
147+
case reflect.Ptr:
148+
// Handle pointer types
149+
if value == "" {
150+
// If no value is provided, leave the pointer as nil
151+
continue
152+
}
153+
switch t.Elem().Kind() {
154+
case reflect.Int:
155+
intValue, err := strconv.ParseInt(value, 10, 0)
156+
if err != nil {
157+
configparserLog.Info(
158+
"Skipping invalid integer pointer value parsing configuration",
159+
"field", field.Name, "value", value)
160+
continue
161+
}
162+
intVal := int(intValue)
163+
reflect.ValueOf(target).Elem().FieldByName(field.Name).Set(reflect.ValueOf(&intVal))
164+
default:
165+
configparserLog.Info(
166+
"Skipping unsupported pointer type while parsing configuration",
167+
"field", field.Name, "kind", t.Elem().Kind())
168+
continue
169+
}
132170
case reflect.String:
133171
reflect.ValueOf(target).Elem().FieldByName(field.Name).SetString(value)
134172
case reflect.Slice:

pkg/configparser/configparser_test.go

Lines changed: 54 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -44,6 +44,9 @@ type FakeData struct {
4444

4545
// Threshold to consider a certificate as expiring
4646
ExpiringCheckThreshold int `json:"expiringCheckThreshold" env:"EXPIRING_CHECK_THRESHOLD"`
47+
48+
// OptionalTimeout is an optional pointer to int for testing
49+
OptionalTimeout *int `json:"optionalTimeout" env:"OPTIONAL_TIMEOUT"`
4750
}
4851

4952
var defaultInheritedAnnotations = []string{
@@ -117,4 +120,55 @@ var _ = Describe("Data test suite", func() {
117120
Expect(config.InheritedAnnotations).To(Equal(defaultInheritedAnnotations))
118121
Expect(config.InheritedLabels).To(BeNil())
119122
})
123+
124+
Context("pointer types", func() {
125+
It("leaves pointer nil when no value is provided", func() {
126+
GinkgoT().Setenv("OPTIONAL_TIMEOUT", "")
127+
config := &FakeData{}
128+
ReadConfigMap(config, &FakeData{}, nil)
129+
Expect(config.OptionalTimeout).To(BeNil())
130+
})
131+
132+
It("sets pointer from environment value", func() {
133+
GinkgoT().Setenv("OPTIONAL_TIMEOUT", "5000")
134+
config := &FakeData{}
135+
ReadConfigMap(config, &FakeData{}, nil)
136+
Expect(config.OptionalTimeout).ToNot(BeNil())
137+
Expect(*config.OptionalTimeout).To(Equal(5000))
138+
})
139+
140+
It("sets pointer from map value", func() {
141+
GinkgoT().Setenv("OPTIONAL_TIMEOUT", "")
142+
config := &FakeData{}
143+
ReadConfigMap(config, &FakeData{}, map[string]string{
144+
"OPTIONAL_TIMEOUT": "3000",
145+
})
146+
Expect(config.OptionalTimeout).ToNot(BeNil())
147+
Expect(*config.OptionalTimeout).To(Equal(3000))
148+
})
149+
150+
It("allows setting pointer to zero value explicitly", func() {
151+
GinkgoT().Setenv("OPTIONAL_TIMEOUT", "0")
152+
config := &FakeData{}
153+
ReadConfigMap(config, &FakeData{}, nil)
154+
Expect(config.OptionalTimeout).ToNot(BeNil())
155+
Expect(*config.OptionalTimeout).To(Equal(0))
156+
})
157+
158+
It("uses default pointer value when set", func() {
159+
GinkgoT().Setenv("OPTIONAL_TIMEOUT", "")
160+
config := &FakeData{}
161+
defaultValue := 1000
162+
ReadConfigMap(config, &FakeData{OptionalTimeout: &defaultValue}, nil)
163+
Expect(config.OptionalTimeout).ToNot(BeNil())
164+
Expect(*config.OptionalTimeout).To(Equal(1000))
165+
})
166+
167+
It("skips invalid pointer value", func() {
168+
GinkgoT().Setenv("OPTIONAL_TIMEOUT", "invalid")
169+
config := &FakeData{}
170+
ReadConfigMap(config, &FakeData{}, nil)
171+
Expect(config.OptionalTimeout).To(BeNil())
172+
})
173+
})
120174
})

pkg/management/postgres/instance.go

Lines changed: 6 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1400,11 +1400,14 @@ func (instance *Instance) GetPrimaryConnInfo() string {
14001400
result := buildPrimaryConnInfo(instance.GetClusterName()+"-rw", instance.GetPodName()) + " dbname=postgres"
14011401

14021402
standbyTCPUserTimeout := os.Getenv("CNPG_STANDBY_TCP_USER_TIMEOUT")
1403-
if len(standbyTCPUserTimeout) > 0 {
1404-
result = fmt.Sprintf("%s tcp_user_timeout='%s'", result,
1405-
strings.ReplaceAll(strings.ReplaceAll(standbyTCPUserTimeout, `\`, `\\`), `'`, `\'`))
1403+
if len(standbyTCPUserTimeout) == 0 {
1404+
// Default to 5000ms (5 seconds) if not explicitly set
1405+
standbyTCPUserTimeout = "5000"
14061406
}
14071407

1408+
result = fmt.Sprintf("%s tcp_user_timeout='%s'", result,
1409+
strings.ReplaceAll(strings.ReplaceAll(standbyTCPUserTimeout, `\`, `\\`), `'`, `\'`))
1410+
14081411
return result
14091412
}
14101413

pkg/management/postgres/instance_test.go

Lines changed: 55 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -361,3 +361,58 @@ func getLibraryPathFromEnv(envs []string) string {
361361

362362
return ldLibraryPath
363363
}
364+
365+
var _ = Describe("GetPrimaryConnInfo", func() {
366+
var instance *Instance
367+
368+
BeforeEach(func() {
369+
instance = &Instance{
370+
Cluster: &apiv1.Cluster{
371+
ObjectMeta: metav1.ObjectMeta{
372+
Name: "test-cluster",
373+
},
374+
},
375+
}
376+
instance.WithPodName("test-cluster-1").WithClusterName("test-cluster")
377+
})
378+
379+
AfterEach(func() {
380+
err := os.Unsetenv("CNPG_STANDBY_TCP_USER_TIMEOUT")
381+
Expect(err).ToNot(HaveOccurred())
382+
})
383+
384+
It("should use default 5000ms tcp_user_timeout when env var is not set", func() {
385+
err := os.Unsetenv("CNPG_STANDBY_TCP_USER_TIMEOUT")
386+
Expect(err).ToNot(HaveOccurred())
387+
connInfo := instance.GetPrimaryConnInfo()
388+
Expect(connInfo).To(ContainSubstring("tcp_user_timeout='5000'"))
389+
})
390+
391+
It("should use custom tcp_user_timeout when env var is set", func() {
392+
err := os.Setenv("CNPG_STANDBY_TCP_USER_TIMEOUT", "10000")
393+
Expect(err).ToNot(HaveOccurred())
394+
connInfo := instance.GetPrimaryConnInfo()
395+
Expect(connInfo).To(ContainSubstring("tcp_user_timeout='10000'"))
396+
})
397+
398+
It("should allow setting tcp_user_timeout to 0 explicitly", func() {
399+
err := os.Setenv("CNPG_STANDBY_TCP_USER_TIMEOUT", "0")
400+
Expect(err).ToNot(HaveOccurred())
401+
connInfo := instance.GetPrimaryConnInfo()
402+
Expect(connInfo).To(ContainSubstring("tcp_user_timeout='0'"))
403+
})
404+
405+
It("should escape single quotes in tcp_user_timeout value", func() {
406+
err := os.Setenv("CNPG_STANDBY_TCP_USER_TIMEOUT", "5000'injection")
407+
Expect(err).ToNot(HaveOccurred())
408+
connInfo := instance.GetPrimaryConnInfo()
409+
Expect(connInfo).To(ContainSubstring("tcp_user_timeout='5000\\'injection'"))
410+
})
411+
412+
It("should escape backslashes in tcp_user_timeout value", func() {
413+
err := os.Setenv("CNPG_STANDBY_TCP_USER_TIMEOUT", "5000\\test")
414+
Expect(err).ToNot(HaveOccurred())
415+
connInfo := instance.GetPrimaryConnInfo()
416+
Expect(connInfo).To(ContainSubstring("tcp_user_timeout='5000\\\\test'"))
417+
})
418+
})

pkg/management/postgres/restore.go

Lines changed: 12 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -964,7 +964,18 @@ func (info InitInfo) ConfigureInstanceAfterRestore(ctx context.Context, cluster
964964

965965
// GetPrimaryConnInfo returns the DSN to reach the primary
966966
func (info InitInfo) GetPrimaryConnInfo() string {
967-
return buildPrimaryConnInfo(info.ClusterName+"-rw", info.PodName)
967+
result := buildPrimaryConnInfo(info.ClusterName+"-rw", info.PodName) + " dbname=postgres"
968+
969+
standbyTCPUserTimeout := os.Getenv("CNPG_STANDBY_TCP_USER_TIMEOUT")
970+
if len(standbyTCPUserTimeout) == 0 {
971+
// Default to 5000ms (5 seconds) if not explicitly set
972+
standbyTCPUserTimeout = "5000"
973+
}
974+
975+
result = fmt.Sprintf("%s tcp_user_timeout='%s'", result,
976+
strings.ReplaceAll(strings.ReplaceAll(standbyTCPUserTimeout, `\`, `\\`), `'`, `\'`))
977+
978+
return result
968979
}
969980

970981
func (info *InitInfo) checkBackupDestination(

pkg/specs/pods.go

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -162,12 +162,12 @@ func CreatePodEnvConfig(cluster apiv1.Cluster, podName string) EnvConfig {
162162
}
163163
config.EnvVars = append(config.EnvVars, cluster.Spec.Env...)
164164

165-
if configuration.Current.StandbyTCPUserTimeout != 0 {
165+
if configuration.Current.StandbyTCPUserTimeout != nil {
166166
config.EnvVars = append(
167167
config.EnvVars,
168168
corev1.EnvVar{
169169
Name: "CNPG_STANDBY_TCP_USER_TIMEOUT",
170-
Value: strconv.Itoa(configuration.Current.StandbyTCPUserTimeout),
170+
Value: strconv.Itoa(*configuration.Current.StandbyTCPUserTimeout),
171171
},
172172
)
173173
}

0 commit comments

Comments
 (0)