Commit 9bcf7a4

chore: move retries to separate section and document ResourceExhausted (#417)
1 parent c4c9e47 commit 9bcf7a4

6 files changed

+157
-44
lines changed

app/spicedb/ops/_meta.ts

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@ export default {
44
eks: "Deploying to AWS EKS",
55
data: "Writing data to SpiceDB",
66
performance: "Improving Performance",
7+
resilience: "Improving Resilience",
78
observability: "Observability Tooling",
89
"load-testing": "Load Testing",
910
"spicedb-langchain-langgraph-rag":

app/spicedb/ops/data/writing-relationships/page.mdx

Lines changed: 2 additions & 43 deletions
@@ -4,49 +4,8 @@ import { Callout } from "nextra/components";
44
# Writing relationships
55

66
This page will provide some practical recommendations for writing relationships to SpiceDB.
7-
If you are interested in relationships as a concept, check out this [page](/spicedb/concepts/relationships).
8-
9-
## Retries
10-
11-
When making requests to SpiceDB, it's important to implement proper retry logic to handle transient failures. [SpiceDB APIs use gRPC\*](/spicedb/getting-started/client-libraries), which can experience various types of temporary failures that can be resolved through retries.
12-
13-
Retries are recommended for all gRPC methods, not just WriteRelationships.
14-
15-
\*SpiceDB can also expose an [HTTP API](/spicedb/getting-started/client-libraries#http-clients); however, gRPC is recommended.
16-
17-
### Implementing Retry Policies
18-
19-
You can implement your own retry policies using the gRPC Service Config.
20-
Below, you will find a recommended Retry Policy.
21-
22-
```
23-
"retryPolicy": {
24-
"maxAttempts": 3,
25-
"initialBackoff": "1s",
26-
"maxBackoff": "4s",
27-
"backoffMultiplier": 2,
28-
"retryableStatusCodes": [
29-
'UNAVAILABLE', 'RESOURCE_EXHAUSTED', 'DEADLINE_EXCEEDED', 'ABORTED',
30-
]
31-
}
32-
```
33-
34-
This retry policy configuration provides exponential backoff with the following behavior:
35-
36-
**`maxAttempts: 3`** - Allows for a maximum of 3 total attempts (1 initial request + 2 retries).
37-
This prevents infinite retry loops while giving sufficient opportunity for transient issues to resolve.
38-
39-
**`initialBackoff: "1s"`** - Sets the initial delay to 1 second before the first retry attempt.
40-
This gives the system time to recover from temporary issues.
41-
42-
**`maxBackoff: "4s"`** - Caps the maximum delay between retries at 4 seconds to prevent excessively long waits that could impact user experience.
43-
44-
**`backoffMultiplier: 2`** - Doubles the backoff time with each retry attempt.
45-
Combined with the other settings, this creates a retry pattern of: 1s → 2s → 4s.
46-
47-
**`retryableStatusCodes`** - Only retries on specific gRPC status codes that indicate transient failures: -`UNAVAILABLE`: SpiceDB is temporarily unavailable -`RESOURCE_EXHAUSTED`: SpiceDB is overloaded -`DEADLINE_EXCEEDED`: Request timed out -`ABORTED`: Operation was aborted, often due to conflicts that may resolve on retry
48-
49-
You can find a python retry example [here](https://github.com/authzed/examples/blob/main/data/retry/main.py).
7+
If you are interested in relationships as a concept, check out [this page](/spicedb/concepts/relationships).
8+
If you are interested in improving the resilience of your writes, check out [this page](/spicedb/ops/resilience).
509

5110
## Writes: Touch vs Create
5211

Lines changed: 151 additions & 0 deletions
@@ -0,0 +1,151 @@
1+
import { Steps } from "nextra/components";
2+
3+
# Improving Resilience
4+
5+
The first step we recommend is making sure that you have [observability](/spicedb/ops/observability) in place.
6+
Once you've done that, this page will help you improve the resilience of your SpiceDB deployment.
7+
8+
## Retries
9+
10+
When making requests to SpiceDB, it's important to implement proper retry logic to handle transient failures.
11+
The [SpiceDB Client Libraries](/spicedb/getting-started/client-libraries) use gRPC[^1],
12+
which can experience various types of temporary failures that can be resolved through retries.
13+
14+
Retries are recommended for all gRPC methods.
15+
16+
[^1]: SpiceDB can also expose an [HTTP API](/spicedb/getting-started/client-libraries#http-clients); however, gRPC is recommended.
17+
18+
### Implementing Retry Policies
19+
20+
You can implement your own retry policies using the gRPC Service Config.
21+
Below, you will find a recommended Retry Policy.
22+
23+
```
24+
"retryPolicy": {
25+
"maxAttempts": 3,
26+
"initialBackoff": "1s",
27+
"maxBackoff": "4s",
28+
"backoffMultiplier": 2,
29+
"retryableStatusCodes": [
30+
"UNAVAILABLE", "RESOURCE_EXHAUSTED", "DEADLINE_EXCEEDED", "ABORTED"
31+
]
32+
}
33+
```
34+
35+
This retry policy configuration provides exponential backoff with the following behavior:
36+
37+
- **`maxAttempts: 3`** - Allows for a maximum of 3 total attempts (1 initial request + 2 retries).
38+
This prevents infinite retry loops while giving sufficient opportunity for transient issues to resolve.
39+
- **`initialBackoff: "1s"`** - Sets the initial delay to 1 second before the first retry attempt.
40+
This gives the system time to recover from temporary issues.
41+
- **`maxBackoff: "4s"`** - Caps the maximum delay between retries at 4 seconds to prevent excessively long waits that could impact user experience.
42+
- **`backoffMultiplier: 2`** - Doubles the backoff time with each retry attempt.
43+
Combined with the other settings, the backoff ceiling grows 1s → 2s → 4s; with `maxAttempts: 3` there are only two retries, so only the 1s and 2s delays are used.
44+
- **`retryableStatusCodes`** - Only retries on specific gRPC status codes that indicate transient failures:
45+
- `UNAVAILABLE`: SpiceDB is temporarily unavailable
46+
- `RESOURCE_EXHAUSTED`: SpiceDB is overloaded
47+
- `DEADLINE_EXCEEDED`: Request timed out
48+
- `ABORTED`: Operation was aborted, often due to conflicts that may resolve on retry
49+
50+
You can find a Python retry example [here](https://github.com/authzed/examples/blob/main/data/retry/main.py).
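
As a minimal sketch of how the recommended policy might be packaged, the Python below wraps it in a complete gRPC service config (valid JSON, so double-quoted strings and no trailing comma) and computes the backoff ceiling before each retry. The helper names are hypothetical, and using an empty `name` entry to match every method is an assumption about how you want the policy scoped.

```python
import json

# Recommended retry policy from this page, expressed as JSON-compatible data.
RETRY_POLICY = {
    "maxAttempts": 3,
    "initialBackoff": "1s",
    "maxBackoff": "4s",
    "backoffMultiplier": 2,
    "retryableStatusCodes": [
        "UNAVAILABLE", "RESOURCE_EXHAUSTED", "DEADLINE_EXCEEDED", "ABORTED",
    ],
}


def build_service_config(policy):
    """Wrap a retry policy in a service config (hypothetical helper)."""
    # Assumption: an empty name entry applies the policy to all methods.
    return json.dumps({"methodConfig": [{"name": [{}], "retryPolicy": policy}]})


def backoff_ceilings(policy):
    """Upper bound, in seconds, on the delay before each retry (jitter ignored)."""
    initial = float(policy["initialBackoff"].rstrip("s"))
    cap = float(policy["maxBackoff"].rstrip("s"))
    retries = policy["maxAttempts"] - 1  # the first attempt is not a retry
    multiplier = policy["backoffMultiplier"]
    return [min(initial * multiplier ** n, cap) for n in range(retries)]


SERVICE_CONFIG = build_service_config(RETRY_POLICY)
CEILINGS = backoff_ceilings(RETRY_POLICY)
```

In the Python gRPC library, a string like `SERVICE_CONFIG` would typically be passed through the `grpc.service_config` channel option. gRPC also applies jitter, so actual delays vary around the computed ceilings.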
51+
52+
## `ResourceExhausted` and its Causes
53+
54+
SpiceDB will return a [`ResourceExhausted`](https://grpc.io/docs/guides/status-codes/#the-full-list-of-status-codes) error
55+
when it needs to protect its own resources.
56+
These should be treated as transient conditions that can be safely retried, and should be retried with a backoff
57+
in order to allow SpiceDB to recover whichever resource is unavailable.
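
Where a service config is not an option, the same idea can be sketched as a small client-side wrapper. This is an illustration only: `ResourceExhaustedError` is a hypothetical stand-in for whatever error your client library raises for `RESOURCE_EXHAUSTED`, and the defaults mirror the retry policy earlier on this page.

```python
import random
import time


class ResourceExhaustedError(Exception):
    """Hypothetical stand-in for a gRPC RESOURCE_EXHAUSTED failure."""


def call_with_backoff(fn, max_attempts=3, initial=1.0, cap=4.0, multiplier=2.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(), retrying on ResourceExhaustedError with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ResourceExhaustedError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Exponential ceiling, capped so waits never exceed `cap` seconds.
            ceiling = min(initial * multiplier ** attempt, cap)
            # Full jitter: wait a random fraction of the ceiling so that
            # many clients do not all retry at the same instant.
            sleep(rng() * ceiling)
```

The `sleep` and `rng` parameters exist only so the wrapper can be exercised in tests without real delays.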
58+
59+
### Memory Pressure
60+
61+
SpiceDB implements a memory protection middleware that rejects requests if the middleware determines that a request would cause an out-of-memory
62+
condition. Some potential causes:
63+
64+
- SpiceDB instances provisioned with too little memory
65+
- Fix: provision more memory to the instances
66+
- Large `CheckBulk` or `LookupResources` requests collecting results in memory
67+
- Fix: identify the offending client/caller and add pagination or break up the request
68+
69+
### Connection Pool Contention
70+
71+
The [CockroachDB](/spicedb/concepts/datastores#cockroachdb) and [Postgres](/spicedb/concepts/datastores#postgresql) datastore
72+
implementations use a [pgx connection pool](https://github.com/jackc/pgx/wiki/Getting-started-with-pgx#using-a-connection-pool),
73+
since creating a new Postgres client connection is relatively expensive.
74+
This creates a pool of available connections that can be acquired in order to open transactions and do work.
75+
If this pool is exhausted, SpiceDB may return a `ResourceExhausted` error rather than making the calling client wait for connection acquisition.
76+
77+
This can be diagnosed by checking the `pgxpool_empty_acquire` [Prometheus metric](/spicedb/ops/observability#prometheus) or
78+
the `authzed_cloud.spicedb.datastore.pgx.waited_connections` Datadog metric.
79+
If the metric is positive, that indicates that SpiceDB is waiting on database connections.
80+
81+
SpiceDB uses these four flags to configure how many connections it will attempt to create:
82+
83+
- `--datastore-conn-pool-read-max-open`
84+
- `--datastore-conn-pool-read-min-open`
85+
- `--datastore-conn-pool-write-max-open`
86+
- `--datastore-conn-pool-write-min-open`
87+
88+
SpiceDB uses separate read and write pools and the flags describe the minimum and maximum number of connections that it will open.
89+
90+
To address database connection pool contention, take the following steps.
91+
92+
#### How To Fix Postgres Connection Pool Contention
93+
94+
<Steps>
95+
96+
##### Ensure that Postgres has enough available connections
97+
98+
Postgres connections are relatively expensive because each connection is a [separate process](https://www.postgresql.org/docs/current/connect-estab.html).
99+
There's typically a maximum number of supported connections for a given size of Postgres instance.
100+
If you see an error like:
101+
102+
```json
103+
{
104+
"level": "error",
105+
"error": "failed to create datastore: failed to create primary datastore: failed to connect to `user=spicedbchULNkGtmeQPUFV database=thumper-pg-db`: 10.96.125.205:5432 (spicedb-dedicated.postgres.svc.cluster.local): server error: FATAL: remaining connection slots are reserved for non-replication superuser connections (SQLSTATE 53300)",
106+
"time": "2025-11-24T20:32:43Z",
107+
"message": "terminated with errors"
108+
}
109+
```
110+
111+
This indicates that there are no more connections to be had and you'll need to scale up your Postgres instance.
112+
113+
##### Use a Connection Pooler
114+
115+
If your database load is relatively low compared to the number of connections being used, you might benefit from
116+
a connection pooler like [pgbouncer](https://www.pgbouncer.org/).
117+
This sits between a client like SpiceDB and your Postgres instance and multiplexes connections, helping to mitigate
118+
the cost of Postgres connections.
119+
120+
##### Configure Connection Flags
121+
122+
Configure the SpiceDB connection flags so that the maximum number of connections requested fits within the number of connections available:
123+
124+
```
125+
(read_max_open + write_max_open) * num_spicedb_instances < total_available_postgres_connections
126+
```
127+
128+
You may want to leave additional headroom to allow a new instance to come into service without exhausting connections, depending on your deployment model
129+
and how instances roll.
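
The sizing rule above can be checked with a few lines of Python. This is a sketch under stated assumptions: the function names and the `surge_instances` parameter (extra instances that exist briefly during a rollout) are hypothetical, and the numbers in the usage note are made up for illustration.

```python
def max_requested_connections(read_max_open, write_max_open, num_instances):
    """Worst-case connections SpiceDB may open across all instances."""
    return (read_max_open + write_max_open) * num_instances


def fits_budget(read_max_open, write_max_open, num_instances,
                available_connections, surge_instances=0):
    """True if the worst case, including rollout surge, stays under the limit."""
    requested = max_requested_connections(
        read_max_open, write_max_open, num_instances + surge_instances
    )
    return requested < available_connections
```

For example, with a read pool maximum of 20, a write pool maximum of 10, and 3 instances, the worst case is 90 connections, which fits under a limit of 100. Adding one surge instance raises the worst case to 120, which does not fit.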
130+
131+
</Steps>
132+
133+
#### How To Fix CockroachDB Connection Pool Contention
134+
135+
<Steps>
136+
137+
##### Ensure that CockroachDB has enough available CPU
138+
139+
CockroachDB has [connection pool sizing recommendations](https://www.cockroachlabs.com/docs/stable/connection-pooling?#size-connection-pools).
140+
Note that the recommendations differ for Basic/Standard and Advanced deployments.
141+
These heuristics are somewhat fuzzy, and it will require some trial-and-error to find the right connection pool size for your workload.
142+
143+
##### Configure Connection Flags
144+
145+
Configure the SpiceDB connection flags so that the maximum number of connections requested stays within the desired number of connections:
146+
147+
```
148+
(read_max_open + write_max_open) * num_spicedb_instances < total_available_cockroach_connections
149+
```
150+
151+
</Steps>

next-env.d.ts

Lines changed: 1 addition & 0 deletions
@@ -1,5 +1,6 @@
11
/// <reference types="next" />
22
/// <reference types="next/image-types/global" />
3+
/// <reference types="next/navigation-types/compat/navigation" />
34
/// <reference path="./.next/types/routes.d.ts" />
45

56
// NOTE: This file should not be edited

package.json

Lines changed: 1 addition & 1 deletion
@@ -57,5 +57,5 @@
5757
"typescript": "^5.9.3",
5858
"yaml-loader": "^0.8.1"
5959
},
60-
"packageManager": "pnpm@10.17.1"
60+
"packageManager": "pnpm@10.24.0"
6161
}

wordlist.txt

Lines changed: 1 addition & 0 deletions
@@ -526,6 +526,7 @@ pb
526526
performant
527527
performantly
528528
permissionship
529+
pgbouncer
529530
pgx
530531
pluggable
531532
pnpm
