| 1 | += Fault-tolerant execution |
| 2 | +:description: Configure fault-tolerant execution in Trino clusters for improved query resilience and automatic retry capabilities. |
| 3 | +:keywords: fault-tolerant execution, retry policy, exchange manager, spooling, query resilience |
| 4 | + |
| 5 | +Fault-tolerant execution is a mechanism in Trino that enables a cluster to mitigate query failures by retrying queries or their component tasks in the event of failure. |
| 6 | +With fault-tolerant execution enabled, intermediate exchange data is spooled and can be re-used by another worker in the event of a worker outage or other fault during query execution. |
| 7 | + |
| 8 | +By default, if a Trino node lacks the resources to execute a task or otherwise fails during query execution, the query fails and must be run again manually. |
| 9 | +The longer a query runs, the more likely it is to encounter such a failure. |
| 10 | + |
| 11 | +NOTE: Fault tolerance does not apply to broken queries or other user error. |
| 12 | +For example, Trino does not spend resources retrying a query that fails because its SQL cannot be parsed. |
| 13 | + |
| 14 | +Take a look at the link:https://trino.io/docs/current/admin/fault-tolerant-execution.html[Trino documentation for fault-tolerant execution {external-link-icon}^] to learn more. |
| 15 | + |
| 16 | +== Configuration |
| 17 | + |
| 18 | +Fault-tolerant execution is turned off by default. |
| 19 | +To enable the feature, you need to configure it in your `TrinoCluster` resource by adding a `faultTolerantExecution` section to the cluster configuration: |
| 20 | + |
| 21 | +[source,yaml] |
| 22 | +---- |
| 23 | +spec: |
| 24 | +  clusterConfig: |
| 25 | +    faultTolerantExecution: |
| 26 | +      retryPolicy: QUERY # <1> |
| 27 | +      queryRetryAttempts: 3 # <2> |
| 28 | +---- |
| 29 | +<1> The retry policy: either `QUERY` or `TASK` |
| 30 | +<2> Maximum number of times Trino retries a query (applies to the `QUERY` policy only) |
| 31 | + |
| 32 | +== Retry policies |
| 33 | + |
| 34 | +The `retryPolicy` configuration property designates whether Trino retries entire queries or a query's individual tasks in the event of failure. |
| 35 | + |
| 36 | +=== QUERY retry policy |
| 37 | + |
| 38 | +A `QUERY` retry policy instructs Trino to automatically retry a query in the event of an error occurring on a worker node. |
| 39 | +A `QUERY` retry policy is recommended when the majority of the Trino cluster's workload consists of many small queries. |
| 40 | + |
| 41 | +By default, Trino does not implement fault tolerance for queries whose result set exceeds 32MB in size. |
| 42 | +This limit can be increased by modifying the `exchangeDeduplicationBufferSize` configuration property to be greater than the default value of `32MB`, but this results in higher memory usage on the coordinator. |
| 43 | + |
| 44 | +[source,yaml] |
| 45 | +---- |
| 46 | +... |
| 47 | +spec: |
| 48 | +  clusterConfig: |
| 49 | +    faultTolerantExecution: |
| 50 | +      retryPolicy: QUERY |
| 51 | +      queryRetryAttempts: 3 |
| 52 | +      exchangeDeduplicationBufferSize: 64MB # Increased from default 32MB |
| 53 | +... |
| 54 | +---- |
| 55 | + |
| 56 | +=== TASK retry policy |
| 57 | + |
| 58 | +A `TASK` retry policy instructs Trino to retry individual query tasks in the event of failure. |
| 59 | +You **must** configure an exchange manager to use the task retry policy. |
| 60 | +This policy is recommended when executing large batch queries, as the cluster can more efficiently retry smaller tasks within the query rather than retry the whole query. |
| 61 | + |
| 62 | +IMPORTANT: A `TASK` retry policy is best suited for long-running queries, but this policy can result in higher latency for short-running queries executed in high volume. |
| 63 | +As a best practice, run large batch queries on a dedicated cluster with a `TASK` retry policy, separate from the cluster that handles short queries. |
| 64 | + |
| 65 | +[source,yaml] |
| 66 | +---- |
| 67 | +spec: |
| 68 | +  clusterConfig: |
| 69 | +    faultTolerantExecution: |
| 70 | +      retryPolicy: TASK |
| 71 | +      taskRetryAttemptsPerTask: 4 |
| 72 | +      exchangeManager: |
| 73 | +        s3: |
| 74 | +          baseDirectories: |
| 75 | +            - "s3://trino-exchange-bucket/spooling" |
| 76 | +          connection: |
| 77 | +            reference: my-s3-connection # <1> |
| 78 | +---- |
| 79 | +<1> Reference to an xref:concepts:s3.adoc[S3Connection] resource |
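|  | + |
|  | +The recommended split between batch and short-query workloads could be sketched as two separate `TrinoCluster` resources. The names `trino-batch` and `trino-interactive` are hypothetical, and the other fields a cluster needs (image, catalogs, role groups) are left out for brevity, so treat this as an outline rather than a deployable manifest: |
|  | + |
|  | +[source,yaml] |
|  | +---- |
|  | +# Hypothetical cluster dedicated to large batch queries |
|  | +apiVersion: trino.stackable.tech/v1alpha1 |
|  | +kind: TrinoCluster |
|  | +metadata: |
|  | +  name: trino-batch |
|  | +spec: |
|  | +  clusterConfig: |
|  | +    faultTolerantExecution: |
|  | +      retryPolicy: TASK |
|  | +      exchangeManager: |
|  | +        s3: |
|  | +          baseDirectories: |
|  | +            - "s3://trino-exchange-bucket/spooling" |
|  | +          connection: |
|  | +            reference: my-s3-connection |
|  | +--- |
|  | +# Hypothetical cluster for short, high-volume queries |
|  | +apiVersion: trino.stackable.tech/v1alpha1 |
|  | +kind: TrinoCluster |
|  | +metadata: |
|  | +  name: trino-interactive |
|  | +spec: |
|  | +  clusterConfig: |
|  | +    faultTolerantExecution: |
|  | +      retryPolicy: QUERY |
|  | +      queryRetryAttempts: 3 |
|  | +---- |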
| 80 | + |
| 81 | +== Exchange manager |
| 82 | + |
| 83 | +The exchange manager stores and manages the spooled exchange data used by fault-tolerant execution. |
| 84 | +You can configure a filesystem-based exchange manager that stores spooled data in a specified location, such as AWS S3 and S3-compatible systems, Azure Blob Storage or HDFS. |
| 85 | + |
| 86 | +NOTE: An exchange manager is required when using the `TASK` retry policy and optional for the `QUERY` retry policy. |
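|  | + |
|  | +As a sketch of the optional case, a `QUERY` retry policy can be combined with an exchange manager using only the fields shown on this page. The bucket name and the `my-s3-connection` reference are placeholders, not values the operator provides: |
|  | + |
|  | +[source,yaml] |
|  | +---- |
|  | +spec: |
|  | +  clusterConfig: |
|  | +    faultTolerantExecution: |
|  | +      retryPolicy: QUERY |
|  | +      queryRetryAttempts: 3 |
|  | +      exchangeManager: # optional for the QUERY policy |
|  | +        s3: |
|  | +          baseDirectories: |
|  | +            - "s3://trino-exchange-bucket/spooling" # placeholder bucket |
|  | +          connection: |
|  | +            reference: my-s3-connection # placeholder S3Connection name |
|  | +---- |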
| 87 | + |
| 88 | +=== S3-compatible storage |
| 89 | + |
| 90 | +You can use S3-compatible storage systems for exchange spooling, including AWS S3, MinIO, and Google Cloud Storage. |
| 91 | + |
| 92 | +[source,yaml] |
| 93 | +---- |
| 94 | +spec: |
| 95 | +  clusterConfig: |
| 96 | +    faultTolerantExecution: |
| 97 | +      retryPolicy: TASK |
| 98 | +      exchangeManager: |
| 99 | +        s3: |
| 100 | +          baseDirectories: # <1> |
| 101 | +            - "s3://exchange-bucket-1/trino-spooling" |
| 102 | +          connection: |
| 103 | +            reference: minio-s3-connection # <2> |
| 104 | +--- |
| 105 | +apiVersion: s3.stackable.tech/v1alpha1 |
| 106 | +kind: S3Connection |
| 107 | +metadata: |
| 108 | +  name: minio-s3-connection |
| 109 | +spec: |
| 110 | +  host: minio.default.svc.cluster.local |
| 111 | +  port: 9000 |
| 112 | +  accessStyle: Path |
| 113 | +  credentials: |
| 114 | +    secretClass: minio-secret-class |
| 115 | +  tls: |
| 116 | +    verification: |
| 117 | +      server: |
| 118 | +        caCert: |
| 119 | +          secretClass: tls |
| 120 | +---- |
| 121 | +<1> Multiple S3 buckets can be specified to distribute I/O load |
| 122 | +<2> S3 connection defined as a reference to an xref:concepts:s3.adoc[S3Connection] resource |
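|  | + |
|  | +To spread I/O across several buckets, as the first callout describes, list more than one entry under `baseDirectories`. The second bucket name below is an assumption made for illustration: |
|  | + |
|  | +[source,yaml] |
|  | +---- |
|  | +spec: |
|  | +  clusterConfig: |
|  | +    faultTolerantExecution: |
|  | +      retryPolicy: TASK |
|  | +      exchangeManager: |
|  | +        s3: |
|  | +          baseDirectories: |
|  | +            - "s3://exchange-bucket-1/trino-spooling" |
|  | +            - "s3://exchange-bucket-2/trino-spooling" # hypothetical second bucket |
|  | +          connection: |
|  | +            reference: minio-s3-connection |
|  | +---- |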
| 123 | + |
| 124 | +For Google Cloud Storage, you can use GCS buckets with S3 compatibility: |
| 125 | + |
| 126 | +[source,yaml] |
| 127 | +---- |
| 128 | +spec: |
| 129 | +  clusterConfig: |
| 130 | +    faultTolerantExecution: |
| 131 | +      exchangeManager: |
| 132 | +        s3: |
| 133 | +          baseDirectories: |
| 134 | +            - "gs://my-gcs-bucket/trino-spooling" |
| 135 | +          connection: |
| 136 | +            inline: |
| 137 | +              host: storage.googleapis.com |
| 138 | +              port: 443 |
| 139 | +              accessStyle: Path |
| 140 | +              credentials: |
| 141 | +                secretClass: gcs-hmac-credentials |
| 142 | +              tls: |
| 143 | +                verification: |
| 144 | +                  server: |
| 145 | +                    caCert: |
| 146 | +                      webPki: {} |
| 147 | +          gcsServiceAccountKey: |
| 148 | +            secretClass: "gcs-service-account-secret-class" |
| 149 | +            key: "service-account.json" |
| 150 | +---- |
| 151 | + |
| 152 | +=== Azure Blob Storage |
| 153 | + |
| 154 | +You can configure Azure Blob Storage as the exchange spooling destination: |
| 155 | + |
| 156 | +[source,yaml] |
| 157 | +---- |
| 158 | +spec: |
| 159 | +  clusterConfig: |
| 160 | +    faultTolerantExecution: |
| 161 | +      retryPolicy: TASK |
| 162 | +      exchangeManager: |
| 163 | +        azure: |
| 164 | +          baseDirectories: |
| 165 | +            - "abfs://<container>@<storage-account>.dfs.core.windows.net/exchange-spooling" |
| 166 | +          secretClass: azure-credentials # <1> |
| 167 | +          key: connectionString # <2> |
| 168 | +---- |
| 169 | +<1> SecretClass providing the Azure connection string |
| 170 | +<2> Key name in the Secret that contains the connection string (defaults to `connectionString`) |
| 171 | + |
| 172 | +Provide the Azure connection string through a SecretClass that resolves to a Kubernetes Secret containing the storage account connection string, for example: |
| 173 | + |
| 174 | +[source,yaml] |
| 175 | +---- |
| 176 | +apiVersion: secrets.stackable.tech/v1alpha1 |
| 177 | +kind: SecretClass |
| 178 | +metadata: |
| 179 | +  name: azure-credentials |
| 180 | +spec: |
| 181 | +  backend: |
| 182 | +    k8sSearch: |
| 183 | +      searchNamespace: |
| 184 | +        pod: {} |
| 185 | +---- |
| 186 | + |
| 187 | +[source,yaml] |
| 188 | +---- |
| 189 | +apiVersion: v1 |
| 190 | +kind: Secret |
| 191 | +metadata: |
| 192 | +  name: azure-secret |
| 193 | +  labels: |
| 194 | +    secrets.stackable.tech/class: azure-credentials |
| 195 | +type: Opaque |
| 196 | +stringData: |
| 197 | +  connectionString: "DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=your_account_key;EndpointSuffix=core.windows.net" |
| 198 | +---- |
| 199 | + |
| 200 | +=== HDFS storage |
| 201 | + |
| 202 | +You can configure HDFS as the exchange spooling destination: |
| 203 | + |
| 204 | +[source,yaml] |
| 205 | +---- |
| 206 | +spec: |
| 207 | +  clusterConfig: |
| 208 | +    faultTolerantExecution: |
| 209 | +      retryPolicy: TASK |
| 210 | +      exchangeManager: |
| 211 | +        hdfs: |
| 212 | +          baseDirectories: |
| 213 | +            - "hdfs://simple-hdfs/exchange-spooling" |
| 214 | +          hdfs: |
| 215 | +            configMap: simple-hdfs # <1> |
| 216 | +---- |
| 217 | +<1> ConfigMap containing HDFS configuration files (created by the HDFS operator) |
| 218 | + |
| 219 | +=== Local filesystem storage |
| 220 | + |
| 221 | +A local filesystem directory can also serve as the exchange spooling destination. |
| 222 | + |
| 223 | +WARNING: It is only recommended to use a local filesystem for exchange in standalone, non-production clusters. |
| 224 | +A local directory can only be used for exchange in a distributed cluster if the exchange directory is shared and accessible from all nodes. |
| 225 | + |
| 226 | +[source,yaml] |
| 227 | +---- |
| 228 | +spec: |
| 229 | +  clusterConfig: |
| 230 | +    faultTolerantExecution: |
| 231 | +      retryPolicy: TASK |
| 232 | +      exchangeManager: |
| 233 | +        local: |
| 234 | +          baseDirectories: |
| 235 | +            - "/trino-exchange" |
| 236 | +  coordinators: |
| 237 | +    roleGroups: |
| 238 | +      default: |
| 239 | +        replicas: 1 |
| 240 | +        podOverrides: |
| 241 | +          spec: |
| 242 | +            volumes: |
| 243 | +              - name: trino-exchange |
| 244 | +                persistentVolumeClaim: |
| 245 | +                  claimName: trino-exchange-pvc |
| 246 | +            containers: |
| 247 | +              - name: trino |
| 248 | +                volumeMounts: |
| 249 | +                  - name: trino-exchange |
| 250 | +                    mountPath: /trino-exchange |
| 251 | +  workers: |
| 252 | +    roleGroups: |
| 253 | +      default: |
| 254 | +        replicas: 1 |
| 255 | +        podOverrides: |
| 256 | +          spec: |
| 257 | +            volumes: |
| 258 | +              - name: trino-exchange |
| 259 | +                persistentVolumeClaim: |
| 260 | +                  claimName: trino-exchange-pvc |
| 261 | +            containers: |
| 262 | +              - name: trino |
| 263 | +                volumeMounts: |
| 264 | +                  - name: trino-exchange |
| 265 | +                    mountPath: /trino-exchange |
| 266 | +--- |
| 267 | +kind: PersistentVolumeClaim |
| 268 | +apiVersion: v1 |
| 269 | +metadata: |
| 270 | +  name: trino-exchange-pvc |
| 271 | +spec: |
| 272 | +  accessModes: |
| 273 | +    - ReadWriteOnce |
| 274 | +  resources: |
| 275 | +    requests: |
| 276 | +      storage: 10Gi |
| 277 | +---- |
| 278 | + |
| 279 | +== Connector support |
| 280 | + |
| 281 | +Support for fault-tolerant execution of SQL statements varies on a per-connector basis. |
| 282 | +Take a look at the link:https://trino.io/docs/current/admin/fault-tolerant-execution.html#configuration[Trino documentation {external-link-icon}^] to see which connectors support fault-tolerant execution. |
| 283 | + |
| 284 | +When using connectors that do not explicitly support fault-tolerant execution, you may encounter a "This connector does not support query retries" error message. |
| 285 | + |
| 286 | +== Examples |
| 287 | + |
| 288 | +* link:https://github.com/stackabletech/trino-operator/blob/main/examples/simple-trino-cluster-fault-tolerant-execution.yaml[TrinoCluster with TASK retry policy and S3 exchange manager {external-link-icon}^] |