Commit 63adae3: update readme for samples
Parent: f43585e

samples/alerts/README.md: 70 additions, 10 deletions

@@ -51,18 +51,78 @@ To customize an alert:
2. Modify the alert parameters as needed (thresholds, evaluation intervals, etc.)
3. Deploy the monitoring components to apply your custom alerts

### Required Customizations

The following elements need to be adjusted to match your specific environment:

#### 1. Namespace Specifications

- Change `namespace="viya"` to match your SAS Viya namespace in:
  - `platform/viya-pod-restart-count-high.yaml`
- Verify that the pattern `job=~"sas-.*"` in `platform/high-viya-api-latency.yaml` matches your service naming convention (see the sketch after this list)
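
A minimal sketch of what these two adjustments look like in practice. The exact expressions in the shipped files may differ, and `sasviya` is a hypothetical namespace used only for illustration:

```yaml
# Illustrative only: where the label matchers live in a rule expression.
# Before: targets the default "viya" namespace
expr: kube_pod_container_status_restarts_total{namespace="viya"}

# After: targets a hypothetical "sasviya" namespace instead
expr: kube_pod_container_status_restarts_total{namespace="sasviya"}

# The service-name pattern works the same way; widen or narrow the regex
# so it matches how your SAS services are actually named.
expr: http_server_requests_duration_seconds_bucket{job=~"sas-.*"}
```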
#### 2. Persistent Volume Claims

- Update PVC names (see the sketch after this list) in:
  - `other/nfs-share-high-usage.yaml`: `persistentvolumeclaim="cas-default-data"`
  - `database/crunchy-backrest-repo.yaml`: `persistentvolumeclaim=~"sas-crunchy-platform-postgres-repo1"`
  - `database/crunchy-pgdata-usage-high.yaml`: `persistentvolumeclaim=~"sas-crunchy-platform-postgres-00-.*"`
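
As a sketch, PVC-usage expressions typically filter on the `persistentvolumeclaim` label. The `kubelet_volume_stats_*` metrics below are an assumption for illustration; the shipped rules may compute usage differently:

```yaml
# Hypothetical sketch: percentage used for the Crunchy pgdata volumes.
# Substitute the PVC name or pattern from your own deployment.
expr: |
  100 * kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"sas-crunchy-platform-postgres-00-.*"}
    / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"sas-crunchy-platform-postgres-00-.*"}
```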
#### 3. Container Names

- Verify container names (see the query sketch after this list) in:
  - `database/catalog-db-connections-high.yaml`: `container="sas-catalog-services"`
  - `platform/viya-readiness-probe-failed.yaml`: `container="sas-readiness"`
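
One way to confirm the container names in your cluster is to group a kube-state metric by the `container` label. A minimal sketch, assuming `kube_pod_container_status_ready` is scraped and your namespace is `viya`; run it in the Prometheus expression browser or Grafana Explore:

```promql
# Lists every container name in the namespace, so you can confirm that
# "sas-catalog-services" and "sas-readiness" appear exactly as spelled
# in the sample alert rules.
count by (container) (kube_pod_container_status_ready{namespace="viya"})
```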
#### 4. Alert Thresholds

Adjust thresholds based on your environment size and requirements (an example of changing one follows this list):

- `cas/cas-thread-count-high.yaml`: > 400 threads
- `cas/cas-memory-usage-high.yaml`: > 300 GB
- `database/postgresql-connection-utilization-high.yaml`: > 85%
- `platform/rabbitmq-ready-queue-backlog.yaml`: > 10,000 messages
- `platform/rabbitmq-unacked-queue-backlog.yaml`: > 5,000 messages
- `platform/viya-pod-restart-count-high.yaml`: > 20 restarts
- `other/nfs-share-high-usage.yaml`: > 85% full
- `platform/high-viya-api-latency.yaml`: > 1 second (95th percentile)
- `database/crunchy-pgdata-usage-high.yaml` and `database/crunchy-backrest-repo.yaml`: > 50% full
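
In the multi-part format described under "Alert Expression Format" below, the threshold lives in the evaluator's `params`, so adjusting it is a one-line change. For example, raising the CAS thread limit from 400 to a hypothetical 600 for a larger environment:

```yaml
# Part C of cas/cas-thread-count-high.yaml (see "Alert Expression Format")
type: threshold
expression: B
evaluator:
  type: gt
  params:
    - 600  # raised from the shipped 400; pick a value that fits your sizing
```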
#### 5. Verify Metric Availability

Ensure the following metrics are available in your Prometheus instance (a quick check is sketched after this list):

- CAS metrics: `cas_thread_count`, `cas_grid_uptime_seconds_total`
- Database metrics: `sas_db_pool_connections`, `pg_stat_activity_count`, `pg_settings_max_connections`
- RabbitMQ metrics: `rabbitmq_queue_messages_ready`, `rabbitmq_queue_messages_unacknowledged`
- Kubernetes metrics: `kube_pod_container_status_restarts_total`, `kube_pod_container_status_ready`
- HTTP metrics: `http_server_requests_duration_seconds_bucket`
- SAS Job Launcher: `:sas_launcher_pod_status:` (recording rule)
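
A simple way to check is to query each metric name in the Prometheus expression browser (or Grafana Explore); an empty result means the metric is not being collected. For instance:

```promql
# Run one query at a time; each should return a non-empty result.
count(cas_thread_count)
count(pg_stat_activity_count)
count(rabbitmq_queue_messages_ready)
```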
### Alert Expression Format

Alert expressions in these samples use a multi-part approach for better compatibility with newer Grafana versions:
- **Part A**: Fetches the raw metric
- **Part B**: Reduces the result (using the "reduce" function)
- **Part C**: Applies the threshold using a dedicated threshold component

This approach addresses issues where direct threshold comparisons (e.g., `metric > threshold`) might not work properly in recent Grafana versions. If you experience "no data" results when the underlying metric has data, ensure your alert is using this multi-part approach.
For example, instead of:

```yaml
expr: cas_thread_count > 400
```

Use:

```yaml
# Part A: Fetch the metric
expr: cas_thread_count

# Part B: Reduce the result
type: reduce
expression: A

# Part C: Apply threshold
type: threshold
expression: B
evaluator:
  type: gt
  params:
    - 400
```

For more detailed information on Grafana alerting, see the [Grafana documentation](https://grafana.com/docs/grafana/latest/alerting/).
