import { Steps } from "nextra/components";

# Improving Resilience

The first step we recommend is making sure that you have [observability](/spicedb/ops/observability) in place.
Once you've done that, this page will help you improve the resilience of your SpiceDB deployment.

## Retries

When making requests to SpiceDB, it's important to implement proper retry logic to handle transient failures.
The [SpiceDB Client Libraries](/spicedb/getting-started/client-libraries) use gRPC[^1],
which can experience various types of temporary failures that can be resolved through retries.

Retries are recommended for all gRPC methods.

[^1]: SpiceDB can also expose an [HTTP API](/spicedb/getting-started/client-libraries#http-clients); however, gRPC is recommended.

### Implementing Retry Policies

You can implement your own retry policies using the gRPC Service Config.
Below, you will find a recommended retry policy.
```json
"retryPolicy": {
  "maxAttempts": 3,
  "initialBackoff": "1s",
  "maxBackoff": "4s",
  "backoffMultiplier": 2,
  "retryableStatusCodes": [
    "UNAVAILABLE", "RESOURCE_EXHAUSTED", "DEADLINE_EXCEEDED", "ABORTED"
  ]
}
```

This retry policy configuration provides exponential backoff with the following behavior:

- **`maxAttempts: 3`** - Allows for a maximum of 3 total attempts (1 initial request + 2 retries).
  This prevents infinite retry loops while giving sufficient opportunity for transient issues to resolve.
- **`initialBackoff: "1s"`** - Sets the initial delay to 1 second before the first retry attempt.
  This gives the system time to recover from temporary issues.
- **`maxBackoff: "4s"`** - Caps the maximum delay between retries at 4 seconds to prevent excessively long waits that could impact user experience.
- **`backoffMultiplier: 2`** - Doubles the backoff time with each retry attempt.
  Combined with the other settings, this yields backoff delays of 1s, then 2s, between the 3 attempts; with a higher `maxAttempts`, subsequent delays would be capped at the 4-second `maxBackoff`.
- **`retryableStatusCodes`** - Only retries on specific gRPC status codes that indicate transient failures:
  - `UNAVAILABLE`: SpiceDB is temporarily unavailable
  - `RESOURCE_EXHAUSTED`: SpiceDB is overloaded
  - `DEADLINE_EXCEEDED`: Request timed out
  - `ABORTED`: Operation was aborted, often due to conflicts that may resolve on retry
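
As a sketch, the backoff schedule produced by these settings can be computed directly. Note that gRPC applies random jitter to each delay in practice, so this shows the deterministic caps rather than exact sleep times:

```python
def backoff_schedule(initial, multiplier, cap, max_attempts):
    """Deterministic backoff caps for a gRPC-style retry policy.

    With N total attempts there are N-1 delays, each the previous
    delay times `multiplier`, never exceeding `cap`.
    """
    delays = []
    delay = initial
    for _ in range(max_attempts - 1):
        delays.append(min(delay, cap))
        delay *= multiplier
    return delays

# The policy above: 3 attempts means only two delays are ever used.
backoff_schedule(1.0, 2, 4.0, 3)  # [1.0, 2.0]

# With more attempts, the 4s maxBackoff cap kicks in.
backoff_schedule(1.0, 2, 4.0, 5)  # [1.0, 2.0, 4.0, 4.0]
```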

You can find a Python retry example [here](https://github.com/authzed/examples/blob/main/data/retry/main.py).
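
As a sketch of how the policy is wired up, gRPC clients accept a service config JSON as the `grpc.service_config` channel option. The service name `authzed.api.v1.PermissionsService` and the commented channel setup are illustrative assumptions about your deployment:

```python
import json

# Build the service config programmatically so it stays valid JSON.
SERVICE_CONFIG = {
    "methodConfig": [{
        # Assumption: applying the policy to SpiceDB's PermissionsService;
        # adjust or broaden the name list for other services.
        "name": [{"service": "authzed.api.v1.PermissionsService"}],
        "retryPolicy": {
            "maxAttempts": 3,
            "initialBackoff": "1s",
            "maxBackoff": "4s",
            "backoffMultiplier": 2,
            "retryableStatusCodes": [
                "UNAVAILABLE", "RESOURCE_EXHAUSTED",
                "DEADLINE_EXCEEDED", "ABORTED",
            ],
        },
    }]
}

# With the grpcio package installed, pass it when creating the channel:
# channel = grpc.secure_channel(
#     "grpc.authzed.com:443",  # hypothetical endpoint
#     credentials,
#     options=[("grpc.service_config", json.dumps(SERVICE_CONFIG))],
# )
```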

## `ResourceExhausted` and its Causes

SpiceDB will return a [`ResourceExhausted`](https://grpc.io/docs/guides/status-codes/#the-full-list-of-status-codes) error
when it needs to protect its own resources.
Treat these as transient conditions: retry them with a backoff
to give SpiceDB time to recover whichever resource is unavailable.
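
If your client library doesn't apply a retry policy for you, a minimal retry-with-backoff loop looks like the following sketch. `TransientError` is a hypothetical stand-in for a retryable gRPC error such as `RESOURCE_EXHAUSTED`:

```python
import time

class TransientError(Exception):
    """Stand-in for a retryable gRPC error (e.g. RESOURCE_EXHAUSTED)."""

def call_with_backoff(call, max_attempts=3, initial=1.0, multiplier=2, cap=4.0):
    """Invoke `call`, retrying transient errors with exponential backoff."""
    delay = initial
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts; surface the error
            time.sleep(min(delay, cap))
            delay *= multiplier
```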

### Memory Pressure

SpiceDB implements a memory-protection middleware that rejects a request if the middleware determines that serving it would cause an out-of-memory
condition. Some potential causes:

- SpiceDB instances provisioned with too little memory
  - Fix: provision more memory to the instances
- Large `CheckBulk` or `LookupResources` requests collecting results in memory
  - Fix: identify the offending client/caller and add pagination or break up the request
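
Breaking up a large bulk request client-side can be as simple as chunking the items before issuing several smaller calls. The batch size of 100 and the commented `CheckBulkPermissions` usage are illustrative assumptions, not SpiceDB limits or exact API signatures:

```python
from itertools import islice

def chunked(items, size):
    """Yield successive lists of at most `size` items from `items`."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch

# Instead of one giant bulk check, issue several smaller ones:
# for batch in chunked(all_check_items, 100):
#     client.CheckBulkPermissions(CheckBulkPermissionsRequest(items=batch))
```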

### Connection Pool Contention

The [CockroachDB](/spicedb/concepts/datastores#cockroachdb) and [Postgres](/spicedb/concepts/datastores#postgresql) datastore
implementations use a [pgx connection pool](https://github.com/jackc/pgx/wiki/Getting-started-with-pgx#using-a-connection-pool),
since creating a new Postgres client connection is relatively expensive.
This creates a pool of available connections that can be acquired in order to open transactions and do work.
If this pool is exhausted, SpiceDB may return a `ResourceExhausted` rather than making the calling client wait for connection acquisition.
This can be diagnosed by checking the `pgxpool_empty_acquire` [Prometheus metric](/spicedb/ops/observability#prometheus) or
the `authzed_cloud.spicedb.datastore.pgx.waited_connections` Datadog metric.
If the metric is increasing, SpiceDB is waiting on database connections.
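
For example, a Prometheus query along these lines surfaces connection-acquisition waits; treat it as a sketch, since the exact metric labels depend on your scrape configuration:

```
rate(pgxpool_empty_acquire[5m]) > 0
```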

SpiceDB uses these four flags to configure how many connections it will attempt to create:

- `--datastore-conn-pool-read-max-open`
- `--datastore-conn-pool-read-min-open`
- `--datastore-conn-pool-write-max-open`
- `--datastore-conn-pool-write-min-open`

SpiceDB maintains separate read and write pools; these flags set the minimum and maximum number of connections it will open for each.
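
For example, a `spicedb serve` invocation setting all four flags might look like the following; the pool sizes shown are illustrative placeholders, not recommendations, and should be sized against your database's limits:

```
spicedb serve \
  --datastore-engine=postgres \
  --datastore-conn-uri="postgres://..." \
  --datastore-conn-pool-read-max-open=20 \
  --datastore-conn-pool-read-min-open=5 \
  --datastore-conn-pool-write-max-open=10 \
  --datastore-conn-pool-write-min-open=2
```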

To address database connection pool contention, take the following steps.

#### How To Fix Postgres Connection Pool Contention

<Steps>

##### Ensure that Postgres has enough available connections

Postgres connections are relatively expensive because each connection is a [separate process](https://www.postgresql.org/docs/current/connect-estab.html).
There's typically a maximum number of supported connections for a given size of Postgres instance.
If you see an error like:

```json
{
  "level": "error",
  "error": "failed to create datastore: failed to create primary datastore: failed to connect to `user=spicedbchULNkGtmeQPUFV database=thumper-pg-db`: 10.96.125.205:5432 (spicedb-dedicated.postgres.svc.cluster.local): server error: FATAL: remaining connection slots are reserved for non-replication superuser connections (SQLSTATE 53300)",
  "time": "2025-11-24T20:32:43Z",
  "message": "terminated with errors"
}
```

it indicates that there are no more connections available, and you'll need to raise `max_connections` or scale up your Postgres instance.
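
You can check the relevant limits directly in Postgres; these are standard Postgres settings, not SpiceDB-specific:

```sql
SHOW max_connections;
-- Slots reserved for superusers are not available to SpiceDB:
SHOW superuser_reserved_connections;
```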

##### Use a Connection Pooler

If your database load is relatively low compared to the number of connections being used, you might benefit from
a connection pooler like [pgbouncer](https://www.pgbouncer.org/).
This sits between a client like SpiceDB and your Postgres instance and multiplexes connections, helping to mitigate
the cost of Postgres connections.
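
A minimal pgbouncer configuration might look like the following sketch; the host, database name, pool mode, and sizes are all assumptions to adapt for your deployment, and you should verify that your chosen `pool_mode` is compatible with your client's use of prepared statements:

```ini
[databases]
; assumption: Postgres reachable at this host/port with this database name
spicedb = host=10.0.0.5 port=5432 dbname=spicedb

[pgbouncer]
listen_port = 6432
pool_mode = transaction
max_client_conn = 500
default_pool_size = 50
```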

##### Configure Connection Flags

Configure the SpiceDB connection flags so that the maximum number of connections requested fits within the number of connections available:

```
(read_max_open + write_max_open) * num_spicedb_instances < total_available_postgres_connections
```

You may want to leave additional headroom to allow a new instance to come into service without exhausting connections, depending on your deployment model
and how instances roll.
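
The inequality above can be checked mechanically, including rollout headroom. The numbers in the example are illustrative:

```python
def fits_connection_budget(read_max, write_max, instances,
                           available, headroom_instances=1):
    """Check that all SpiceDB instances' pools, plus headroom for extra
    instances that may come up during a rollout, fit within the
    database's connection limit."""
    needed = (read_max + write_max) * (instances + headroom_instances)
    return needed < available

# e.g. 3 instances, each with 20 read + 10 write max connections,
# against a limit of 150, leaving room for 1 extra instance during a roll:
fits_connection_budget(20, 10, 3, 150)  # (20+10)*4 = 120 < 150 -> True
```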

</Steps>

#### How To Fix CockroachDB Connection Pool Contention

<Steps>

##### Ensure that CockroachDB has enough available CPU

CockroachDB has [connection pool sizing recommendations](https://www.cockroachlabs.com/docs/stable/connection-pooling#size-connection-pools).
Note that the recommendations differ for Basic/Standard and Advanced deployments.
These heuristics are somewhat fuzzy, and finding the right connection pool size for your workload will require some trial and error.
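
As a rough starting point, CockroachDB's guidance ties total pool size to vCPU count (commonly around 4 connections per vCPU). The helper below encodes that heuristic as an assumption to tune from, not a rule:

```python
def starting_pool_size(vcpus, connections_per_vcpu=4, spicedb_instances=1):
    """Rough per-instance pool size from a connections-per-vCPU heuristic.

    Divides the cluster-wide budget (vcpus * connections_per_vcpu)
    evenly across SpiceDB instances.
    """
    total = vcpus * connections_per_vcpu
    return max(1, total // spicedb_instances)

# e.g. a 16-vCPU cluster shared by 4 SpiceDB instances:
starting_pool_size(16, spicedb_instances=4)  # 16 connections per instance
```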

##### Configure Connection Flags

Configure the SpiceDB connection flags so that the maximum number of connections requested fits within your target pool size:

```
(read_max_open + write_max_open) * num_spicedb_instances < total_available_cockroach_connections
```

</Steps>