Commit a507c7f

Merge #71
71: docs: Add monitoring section to node onboarding docs

r=jwolski2 a=jwolski2

## Motivation

Node operators have asked for advice/recommendations on which nilvm metrics to monitor and set alerts on.

## Solution

This commit adds a "Monitor Your Node" section to the onboarding documentation which provides the advice, taken from our (internal) Prometheus / Alert Manager rules.

Fixes #

Design discussion issue (if applicable) #

## Merge requirement checklist

* [ ] [CONTRIBUTING](https://github.com/NillionNetwork/nillion/blob/main/CONTRIBUTING.md) guidelines followed
* [ ] Unit tests added/updated (if applicable)
* [ ] Breaking change analysis completed (if applicable). "Will this change require all network cluster operators to update? Does it break public APIs?"
* [ ] For new features or breaking changes, created a documentation issue in [nillion-docs](https://github.com/NillionNetwork/nillion-docs/issues/new/choose)

Co-authored-by: Jeff Wolski <[email protected]>
2 parents 75fbe27 + 5281cde commit a507c7f

1 file changed

+26 -0 lines changed

docs/node_onboarding/README.md

Lines changed: 26 additions & 0 deletions
@@ -115,6 +115,32 @@ docker run -e CONFIG_PATH=/etc/nillion/node.yaml public.ecr.aws/k5d9x2g2/nilvm:$

All nodes present in `cluster.members` must be available before network functions can be validated.

## Monitor Your Node

Set alerts on the following metrics and Prometheus expressions; it is recommended to use a `5m` duration for alerts.

### Error Rate Metrics

| Metric                            | Prometheus Expression                                                            | Description                             |
| --------------------------------- | -------------------------------------------------------------------------------- | --------------------------------------- |
| High blob operation error rate    | `sum by (operation) (increase(blob_operation_errors_total[5m])) > 0`             | Blob operation error rate is above 0    |
| High error rate                   | `sum(rate(grpc_request_duration_seconds_count{status_code="Internal"}[1m])) > 0` | gRPC error rate is above 0              |
| High token price query error rate | `sum by (env) (increase(token_price_errors_total[5m])) > 0`                      | Token price query error rate is above 0 |
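
The error-rate expressions above could be wired into a Prometheus alerting rules file along the following lines. This is only a sketch: the group name, alert names, `severity` label, and annotation text layout are illustrative assumptions, not the internal rules this section was derived from, and `for: 5m` is one reading of the recommended `5m` alert duration. Only the `expr` values come from the table.

```yaml
# Hypothetical rules file, referenced from `rule_files` in prometheus.yml.
groups:
  - name: nilvm-error-rate                  # assumed group name
    rules:
      - alert: HighBlobOperationErrorRate   # assumed alert name
        expr: 'sum by (operation) (increase(blob_operation_errors_total[5m])) > 0'
        for: 5m                             # fire only if the condition holds for 5 minutes
        labels:
          severity: warning                 # assumed label
        annotations:
          summary: Blob operation error rate is above 0

      - alert: HighGrpcErrorRate            # assumed alert name
        expr: 'sum(rate(grpc_request_duration_seconds_count{status_code="Internal"}[1m])) > 0'
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: gRPC error rate is above 0

      - alert: HighTokenPriceQueryErrorRate # assumed alert name
        expr: 'sum by (env) (increase(token_price_errors_total[5m])) > 0'
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Token price query error rate is above 0
```

A file like this can be validated with `promtool check rules <file>` before Prometheus loads it.
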

### Latency Metrics

| Metric                         | Prometheus Expression                                                                                    | Description                                           |
| ------------------------------ | -------------------------------------------------------------------------------------------------------- | ----------------------------------------------------- |
| High blob latency              | `histogram_quantile(0.99, sum(rate(blob_operation_duration_seconds_bucket[5m])) by (operation, le)) > 1` | 99th percentile blob operation latency is above 1s    |
| High latency                   | `histogram_quantile(0.99, sum(rate(grpc_request_duration_seconds_bucket[1m])) by (le)) > 5`              | 99th percentile gRPC latency is above 5s              |
| High token price query latency | `histogram_quantile(0.99, sum(rate(token_price_duration_seconds_bucket[5m])) by (le)) > 5`               | 99th percentile token price query latency is above 5s |
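
As a sketch of how the latency expressions might look as alerting rules, the fragment below reuses the table's `expr` values unchanged. The group name, alert names, `severity` label, and annotation wording are assumptions made for illustration, as is the use of `for: 5m` for the recommended alert duration. The blob rule keeps the `operation` label after aggregation, so its annotation can reference it through Prometheus label templating.

```yaml
# Hypothetical rules file; names and labels are illustrative.
groups:
  - name: nilvm-latency                     # assumed group name
    rules:
      - alert: HighBlobLatency              # assumed alert name
        expr: 'histogram_quantile(0.99, sum(rate(blob_operation_duration_seconds_bucket[5m])) by (operation, le)) > 1'
        for: 5m
        labels:
          severity: warning                 # assumed label
        annotations:
          # `operation` survives the aggregation, so it is available here.
          summary: 'p99 latency of blob operation {{ $labels.operation }} is above 1s'

      - alert: HighGrpcLatency              # assumed alert name
        expr: 'histogram_quantile(0.99, sum(rate(grpc_request_duration_seconds_bucket[1m])) by (le)) > 5'
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 99th percentile gRPC latency is above 5s

      - alert: HighTokenPriceQueryLatency   # assumed alert name
        expr: 'histogram_quantile(0.99, sum(rate(token_price_duration_seconds_bucket[5m])) by (le)) > 5'
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 99th percentile token price query latency is above 5s
```
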

### Other Metrics

| Metric                     | Prometheus Expression                                                                                                   | Description                                   |
| -------------------------- | ----------------------------------------------------------------------------------------------------------------------- | --------------------------------------------- |
| Low preprocessing elements | `max(preprocessing_offsets{offset="latest"} - on(element) preprocessing_offsets{offset="committed"}) by (element) < 32` | Amount of preprocessing elements is below 32  |
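
The preprocessing check could be expressed as an alerting rule roughly as follows. The expression is copied from the table; the group name, alert name, `severity` label, and annotation text are hypothetical, and `for: 5m` is an assumed mapping of the recommended alert duration. Because the expression aggregates `by (element)`, each element type raises its own alert instance.

```yaml
# Hypothetical rules file; names and labels are illustrative.
groups:
  - name: nilvm-preprocessing               # assumed group name
    rules:
      - alert: LowPreprocessingElements     # assumed alert name
        expr: 'max(preprocessing_offsets{offset="latest"} - on(element) preprocessing_offsets{offset="committed"}) by (element) < 32'
        for: 5m
        labels:
          severity: warning                 # assumed label
        annotations:
          # $value is the difference between latest and committed offsets for this element.
          summary: 'Only {{ $value }} preprocessing elements of type {{ $labels.element }} remain (below 32)'
```
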

[nillion-sdk]: https://docs.nillion.com/nillion-sdk-and-tools
[node-yaml-mainnet-1]: ./networks/nilvm-mainnet-1.yaml
[node-yaml-testnet-1]: ./networks/nilvm-testnet-1.yaml
