Commit 3f554c8

chore: move retries to separate section and document ResourceExhausted
1 parent 3f22e0b commit 3f554c8

4 files changed

+130
-46
lines changed

pages/spicedb/ops/_meta.json

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,7 @@
44
"eks": "Deploying to AWS EKS",
55
"data": "Writing data to SpiceDB",
66
"performance": "Improving Performance",
7+
"resilience": "Improving Resilience",
78
"observability": "Observability Tooling",
89
"ai-agent-authorization": "Authorization for AI Agents",
910
"secure-rag-pipelines": "Secure Your RAG Pipelines with Fine Grained Authorization"

pages/spicedb/ops/data/writing-relationships.mdx

Lines changed: 0 additions & 46 deletions
@@ -6,52 +6,6 @@ import { Callout } from 'nextra/components'
66
This page will provide some practical recommendations for writing relationships to SpiceDB.
77
If you are interested in relationships as a concept, check out this [page](/spicedb/concepts/relationships).
88

9-
## Retries
10-
11-
When making requests to SpiceDB, it's important to implement proper retry logic to handle transient failures. [SpiceDB APIs use gRPC*](/spicedb/getting-started/client-libraries), which can experience various types of temporary failures that can be resolved through retries.
12-
13-
Retries are recommended for all gRPC methods, not just WriteRelationships.
14-
15-
*SpiceDB can also expose an [HTTP API](/spicedb/getting-started/client-libraries#http-clients); however, gRPC is recommended.
16-
17-
### Implementing Retry Policies
18-
19-
You can implement your own retry policies using the gRPC Service Config.
20-
Below, you will find a recommended Retry Policy.
21-
22-
```
23-
"retryPolicy": {
24-
"maxAttempts": 3,
25-
"initialBackoff": "1s",
26-
"maxBackoff": "4s",
27-
"backoffMultiplier": 2,
28-
"retryableStatusCodes": [
29-
'UNAVAILABLE', 'RESOURCE_EXHAUSTED', 'DEADLINE_EXCEEDED', 'ABORTED',
30-
]
31-
}
32-
```
33-
34-
This retry policy configuration provides exponential backoff with the following behavior:
35-
36-
**`maxAttempts: 3`** - Allows for a maximum of 3 total attempts (1 initial request + 2 retries).
37-
This prevents infinite retry loops while giving sufficient opportunity for transient issues to resolve.
38-
39-
**`initialBackoff: "1s"`** - Sets the initial delay to 1 second before the first retry attempt.
40-
This gives the system time to recover from temporary issues.
41-
42-
**`maxBackoff: "4s"`** - Caps the maximum delay between retries at 4 seconds to prevent excessively long waits that could impact user experience.
43-
44-
**`backoffMultiplier: 2`** - Doubles the backoff time with each retry attempt.
45-
Combined with the other settings, this creates a retry pattern of: 1s → 2s → 4s.
46-
47-
**`retryableStatusCodes`** - Only retries on specific gRPC status codes that indicate transient failures:
48-
-`UNAVAILABLE`: SpiceDB is temporarily unavailable
49-
-`RESOURCE_EXHAUSTED`: SpiceDB is overloaded
50-
-`DEADLINE_EXCEEDED`: Request timed out
51-
-`ABORTED`: Operation was aborted, often due to conflicts that may resolve on retry
52-
53-
You can find a python retry example [here](https://github.com/authzed/examples/blob/main/data/retry/main.py).
54-
559
## Writes: Touch vs Create
5610

5711
A SpiceDB [relationship update](https://buf.build/authzed/api/docs/main:authzed.api.v1#authzed.api.v1.RelationshipUpdate) can use one of three operation types: `CREATE`, `TOUCH`, or `DELETE`.

pages/spicedb/ops/resilience.mdx

Lines changed: 127 additions & 0 deletions
@@ -0,0 +1,127 @@
1+
# Improving Resilience
2+
3+
## Retries
4+
5+
When making requests to SpiceDB, it's important to implement proper retry logic to handle transient failures.
6+
The [SpiceDB Client Libraries](/spicedb/getting-started/client-libraries) use gRPC,
7+
which can experience various types of temporary failures that can be resolved through retries.
8+
9+
Retries are recommended for all gRPC methods.
10+
11+
{/*TODO: add footnote once footnotes are supported*/}
12+
[^1]: SpiceDB can also expose an [HTTP API](/spicedb/getting-started/client-libraries#http-clients); however, gRPC is recommended.
13+
14+
### Implementing Retry Policies
15+
16+
You can implement your own retry policies using the gRPC Service Config.
17+
Below, you will find a recommended Retry Policy.
18+
19+
```
20+
"retryPolicy": {
21+
"maxAttempts": 3,
22+
"initialBackoff": "1s",
23+
"maxBackoff": "4s",
24+
"backoffMultiplier": 2,
25+
"retryableStatusCodes": [
26+
"UNAVAILABLE", "RESOURCE_EXHAUSTED", "DEADLINE_EXCEEDED", "ABORTED"
27+
]
28+
}
29+
```
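
As a rough sketch of how the policy above might be wired into a client: gRPC expects `retryPolicy` to sit inside a `methodConfig` entry of the service config. The snippet below builds that JSON with the standard library only. The empty `name` entry (meaning "all methods") and the helper name `build_service_config` are assumptions made for illustration, not something this commit specifies.

```python
import json

# Values copied from the recommended retry policy above.
RETRY_POLICY = {
    "maxAttempts": 3,
    "initialBackoff": "1s",
    "maxBackoff": "4s",
    "backoffMultiplier": 2,
    "retryableStatusCodes": [
        "UNAVAILABLE",
        "RESOURCE_EXHAUSTED",
        "DEADLINE_EXCEEDED",
        "ABORTED",
    ],
}


def build_service_config(retry_policy: dict) -> str:
    """Wrap a retry policy in a gRPC service config (hypothetical helper)."""
    service_config = {
        "methodConfig": [
            {
                # Assumption: an empty name entry applies the policy to every method.
                "name": [{}],
                "retryPolicy": retry_policy,
            }
        ]
    }
    return json.dumps(service_config)


SERVICE_CONFIG_JSON = build_service_config(RETRY_POLICY)
```

With the Python `grpc` package, a string like this is typically supplied as the `grpc.service_config` channel option when the channel is created; check your client library's documentation for how it exposes channel options.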
30+
31+
This retry policy configuration provides exponential backoff with the following behavior:
32+
33+
* **`maxAttempts: 3`** - Allows for a maximum of 3 total attempts (1 initial request + 2 retries).
34+
This prevents infinite retry loops while giving sufficient opportunity for transient issues to resolve.
35+
* **`initialBackoff: "1s"`** - Sets the initial delay to 1 second before the first retry attempt.
36+
This gives the system time to recover from temporary issues.
37+
* **`maxBackoff: "4s"`** - Caps the maximum delay between retries at 4 seconds to prevent excessively long waits that could impact user experience.
38+
* **`backoffMultiplier: 2`** - Doubles the backoff time with each retry attempt.
39+
With `maxAttempts: 3` there are at most two retries, so the backoff ceilings are 1s → 2s; the 4s cap only takes effect if `maxAttempts` is raised.
40+
* **`retryableStatusCodes`** - Only retries on specific gRPC status codes that indicate transient failures:
41+
* `UNAVAILABLE`: SpiceDB is temporarily unavailable
42+
* `RESOURCE_EXHAUSTED`: SpiceDB is overloaded
43+
* `DEADLINE_EXCEEDED`: Request timed out
44+
* `ABORTED`: Operation was aborted, often due to conflicts that may resolve on retry
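
To make the interaction between these settings concrete, the sketch below computes the backoff ceiling before each retry. It assumes the usual exponential rule, `min(initialBackoff * backoffMultiplier^(n-1), maxBackoff)` for the n-th retry, and ignores the jitter that gRPC applies on top. `backoff_ceilings` is a hypothetical helper name.

```python
def backoff_ceilings(max_attempts, initial_backoff, max_backoff, multiplier):
    """Return the backoff ceiling (seconds) before each retry, ignoring jitter."""
    retries = max_attempts - 1  # the first attempt is not a retry
    return [
        min(initial_backoff * multiplier ** n, max_backoff)
        for n in range(retries)
    ]


# Recommended policy: 3 attempts means 2 retries.
print(backoff_ceilings(3, 1.0, 4.0, 2))  # [1.0, 2.0]
# Raising maxAttempts to 5 shows the 4s cap taking effect.
print(backoff_ceilings(5, 1.0, 4.0, 2))  # [1.0, 2.0, 4.0, 4.0]
```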
45+
46+
You can find a Python retry example [here](https://github.com/authzed/examples/blob/main/data/retry/main.py).
47+
48+
## `ResourceExhausted` and its Causes
49+
50+
SpiceDB will return a [`ResourceExhausted`](https://grpc.io/docs/guides/status-codes/#the-full-list-of-status-codes) error
51+
when it needs to protect its own resources.
52+
These should be treated as transient conditions that can be safely retried, and should be retried with a backoff
53+
in order to allow SpiceDB to recover whichever resource is unavailable.
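
For callers that cannot rely on a gRPC service config, retrying with backoff at the application level might look like the minimal sketch below. `ResourceExhaustedError` is a hypothetical stand-in for whatever exception your client raises for this status; with a real client you would instead inspect the status code of the raised error. The default timings mirror the retry policy recommended earlier and are not a requirement.

```python
import random
import time


class ResourceExhaustedError(Exception):
    """Stand-in for a client error carrying RESOURCE_EXHAUSTED (hypothetical)."""


def call_with_backoff(fn, max_attempts=3, initial_backoff=1.0, max_backoff=4.0,
                      multiplier=2.0, sleep=time.sleep):
    """Call fn(), retrying on ResourceExhaustedError with jittered exponential backoff."""
    backoff = initial_backoff
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ResourceExhaustedError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Full jitter: wait a random time up to the current ceiling.
            sleep(random.uniform(0, backoff))
            backoff = min(backoff * multiplier, max_backoff)
```

The `sleep` parameter is injectable so the helper can be exercised in tests without real delays.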
54+
55+
### Memory Pressure
56+
57+
SpiceDB implements a memory protection middleware that rejects requests if the middleware determines that a request would cause an Out Of Memory
58+
condition. Some potential causes:
59+
60+
* SpiceDB instances provisioned with too little memory
61+
* Fix: provision more memory to the instances
62+
* Large `CheckBulk` or `LookupResources` requests collecting results in memory
63+
* Fix: identify the offending client/caller and add pagination or break up the request
64+
65+
### Connection Pool Contention
66+
67+
The [CockroachDB](/spicedb/concepts/datastores#cockroachdb) and [Postgres](/spicedb/concepts/datastores#postgresql) datastore
68+
implementations use a [pgx connection pool](https://github.com/jackc/pgx/wiki/Getting-started-with-pgx#using-a-connection-pool),
69+
since creating a new Postgres client connection is relatively expensive.
70+
This creates a pool of available connections that can be acquired in order to open transactions and do work.
71+
If this pool is exhausted, SpiceDB may return a `ResourceExhausted` rather than making the calling client wait for connection acquisition.
72+
73+
This can be diagnosed by checking the `pgxpool_empty_acquire` [Prometheus metric](/spicedb/ops/observability#prometheus) or
74+
the `authzed_cloud.spicedb.datastore.pgx.waited_connections` Datadog metric.
75+
If the metric is positive, that indicates that SpiceDB is waiting on database connections.
76+
77+
SpiceDB uses these four flags to configure how many connections it will attempt to create:
78+
79+
* `--datastore-conn-pool-read-max-open`
80+
* `--datastore-conn-pool-read-min-open`
81+
* `--datastore-conn-pool-write-max-open`
82+
* `--datastore-conn-pool-write-min-open`
83+
84+
SpiceDB uses separate read and write pools and the flags describe the minimum and maximum number of connections that it will open.
85+
86+
To address database connection pool contention, take the following steps.
87+
88+
#### Postgres Fix
89+
90+
* Ensure that Postgres has enough available connections.
91+
* Postgres connections are relatively expensive because each connection is a [separate process](https://www.postgresql.org/docs/current/connect-estab.html).
92+
There's typically a maximum number of supported connections for a given size of Postgres instance.
93+
If you see an error like:
94+
95+
```json
96+
{
97+
"level": "error",
98+
"error": "failed to create datastore: failed to create primary datastore: failed to connect to `user=spicedbchULNkGtmeQPUFV database=thumper-pg-db`: 10.96.125.205:5432 (spicedb-dedicated.postgres.svc.cluster.local): server error: FATAL: remaining connection slots are reserved for non-replication superuser connections (SQLSTATE 53300)",
99+
"time": "2025-11-24T20:32:43Z",
100+
"message": "terminated with errors"
101+
}
102+
```
103+
104+
This indicates that there are no more connections to be had and you'll need to scale up your Postgres instance.
105+
* If your database load is relatively low compared to the number of connections being used, you might benefit from
106+
a connection pooler like [pgbouncer](https://www.pgbouncer.org/).
107+
This sits between a client like SpiceDB and your Postgres instance and multiplexes connections, helping to mitigate
108+
the cost of Postgres connections.
109+
* Configure the SpiceDB connection flags so that the maximum number of connections requested fits within the number of connections available:
110+
111+
```
112+
(read_max_open + write_max_open) * num_spicedb_instances < total_available_postgres_connections
113+
```
114+
115+
* You may want to leave additional headroom to allow a new instance to come into service without exhausting connections, depending on your deployment model
116+
and how instances roll.
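
The sizing rule above can be checked with a few lines of arithmetic. The sketch below is illustrative only: `read_max_open` and `write_max_open` stand for the values passed to `--datastore-conn-pool-read-max-open` and `--datastore-conn-pool-write-max-open`, while `fits_connection_budget` and `surge_instances` (extra instances that exist briefly during a rollout) are hypothetical names introduced here.

```python
def fits_connection_budget(read_max_open, write_max_open, num_instances,
                           total_available, surge_instances=0):
    """Check (read_max_open + write_max_open) * instances < total_available.

    surge_instances models extra instances that may exist briefly during a rollout.
    """
    per_instance = read_max_open + write_max_open
    peak_instances = num_instances + surge_instances
    return per_instance * peak_instances < total_available


# Illustrative numbers only: 3 instances, 20 read + 10 write connections each,
# against a database limit of 100 connections.
print(fits_connection_budget(20, 10, 3, 100))                     # True  (90 < 100)
print(fits_connection_budget(20, 10, 3, 100, surge_instances=1))  # False (120 > 100)
```

In this example the steady state fits, but a single surge instance during a rolling deploy would exceed the limit, which is the headroom concern described above.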
117+
118+
#### CockroachDB Fix
119+
120+
* Ensure that CockroachDB has enough available CPU
121+
* CockroachDB has [connection pool sizing recommendations](https://www.cockroachlabs.com/docs/stable/connection-pooling?#size-connection-pools).
122+
Note that the recommendations differ for Basic/Standard and Advanced deployments.
123+
These heuristics are somewhat fuzzy, and it will require some trial-and-error to find the right connection pool size for your workload.
124+
* Configure the SpiceDB connection flags so that the number of connections requested matches the desired number of connections:
125+
```
126+
(read_max_open + write_max_open) * num_spicedb_instances < total_available_cockroach_connections
127+
```

wordlist.txt

Lines changed: 2 additions & 0 deletions
@@ -242,6 +242,7 @@ Subnets
242242
TCP
243243
TLS
244244
Thumper
245+
TODO
245246
TrueTime
246247
Tupleset
247248
TypeScript
@@ -526,6 +527,7 @@ pb
526527
performant
527528
performantly
528529
permissionship
530+
pgbouncer
529531
pgx
530532
pluggable
531533
pnpm
