diff --git a/docs/docs/how-tos/setup-monitoring.md b/docs/docs/how-tos/setup-monitoring.md
index 028f7b9ae..9a6abd120 100644
--- a/docs/docs/how-tos/setup-monitoring.md
+++ b/docs/docs/how-tos/setup-monitoring.md
@@ -136,7 +136,8 @@ End users viewing the logs in `Grafana` will create queries using `Loki` as the
 Loki's "labels" are used to filter collections of logs from the available [kubernetes_sd](https://grafana.com/docs/loki/latest/send-data/promtail/configuration/#kubernetes_sd_config) API endpoints, in a similar way as to how Prometheus handles metrics. These labels are configured through Promtail, which is the agent responsible for collecting and shipping logs to Loki, based on the defined [targets](https://grafana.com/docs/loki/latest/send-data/promtail/configuration/#scrape_configs) and scraping configurations.
 :::
 
-For details on how to view specific logs in Loki, check out the document ["How to access system logs (Loki) via Grafana"](access-logs-loki)
+For details on how to view specific logs in Loki, check out the document ["How to access
+system logs (Loki) via Grafana"](access-logs-loki)
 
 ## References
 
@@ -145,4 +146,4 @@ For details on how to view specific logs in Loki, check out the document ["How t
 
 
 
-[access-logs-loki]: /how-tos/access-logs-loki.md
+[access-logs-loki]: /docs/how-tos/access-logs-loki
diff --git a/docs/docs/tutorials/creating-grafana-alerts.md b/docs/docs/tutorials/creating-grafana-alerts.md
new file mode 100644
index 000000000..bdadece2c
--- /dev/null
+++ b/docs/docs/tutorials/creating-grafana-alerts.md
@@ -0,0 +1,200 @@
+---
+id: create-alerts
+title: Create and Manage Grafana Alerts
+description: Quickly set up Grafana alerts to monitor your Nebari deployment.
+---
+
+# Create and Manage Grafana Alerts
+
+Because a Nebari deployment is made up of many interacting components, monitoring its
+health and performance is crucial. Nebari integrates Grafana for monitoring and
+visualizing metrics, and while Grafana is best known for its dashboards, it also provides
+a powerful alerting system that lets you create, manage, and receive notifications when
+specific conditions in your data are met.
+
+This guide walks you through the steps to create and manage alerts in Grafana, with quick
+examples that monitor CPU usage and network performance on your Nebari deployment.
+
+## Accessing Grafana
+
+To access Grafana and start creating or managing alerts, first make sure you have the
+necessary permissions in Keycloak to use Grafana's admin features. Check out
+[Keycloak roles and permissions][keycloak-roles-permissions] for more information.
+
+Once you have the necessary permissions, you can access Grafana by navigating to
+`https://<your-nebari-domain>/monitoring` in your web browser:
+
+![Grafana Home Page](/img/tutorials/grafana_home_page_admin.png)
+
+:::note
+If your permissions are correct, you should see the "Administration" section in the left sidebar.
+:::
+
+Once logged in, you can navigate to the **Alerting** section to create and manage your
+alerts.
+
+## Creating Alerts
+
+![Grafana Alerting Section](/img/tutorials/grafana_alert_rules_page.png)
+
+Nebari ships with a set of pre-configured alert rules that monitor various aspects of the
+cluster, deployed as part of its Grafana Helm chart stack. You can view and manage these
+existing alerts, or create new ones tailored to your specific monitoring needs.
+
+### Steps to Create a New Alert Rule
+
+1. **Navigate to Alerting → Alert Rules** in the left sidebar.
+2. Click on **New alert rule**.
+3. Fill in the alert rule details:
+   - **Name**: Choose a descriptive name (e.g., `High CPU Usage`).
+   - **Query**: Select your **data source** (typically Prometheus in Nebari) and build a query.
+     Example:
+     ```promql
+     avg(rate(container_cpu_usage_seconds_total{namespace="default"}[5m])) * 100
+     ```
+     This calculates the average container CPU usage in the `default` namespace, as a
+     percentage of one core, over the last 5 minutes (a node-level variant is sketched
+     after this list).
+   - **Condition**: Define the threshold (e.g., `> 80%`).
+   - **Evaluation interval**: Set how often the rule should be evaluated (e.g., every 1m).
+   - **For**: Specify how long the condition must be true before the alert fires (e.g., 5m).
+   - **Labels/Annotations**: Add metadata such as severity (`warning`, `critical`) and a description.
+4. Under **Notifications**, attach the rule to a **contact point**.
+   - Contact points can be email, Slack, PagerDuty, etc. (configured in the "Contact points" tab).
+5. Save the alert rule.
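+
+If you prefer to alert on overall node CPU rather than on individual containers, a query
+along the following lines can be used in step 3 instead. This is a minimal sketch: it
+assumes the standard `node_cpu_seconds_total` metric exposed by node-exporter, which the
+Prometheus stack bundled with Nebari typically scrapes, so adjust the label filters to
+your environment.
+
+```promql
+# Per-node CPU utilization (%) over the last 5 minutes, from node-exporter metrics.
+100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))
+```
+
+Paired with a condition such as `IS ABOVE 80`, this fires when any node stays above 80%
+CPU for the configured **For** duration.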
+
+### Example: Network Performance Alert
+
+Here is a practical set of network-related alerts that monitor connectivity and
+performance, similar to what enterprise network monitoring tools like WhatsUp Gold track:
+
+#### Alert: High Network Error Rate
+
+This alert monitors for excessive network errors, which can indicate connectivity issues,
+hardware problems, or network congestion.
+
+**Alert Configuration:**
+
+- **Name**: `High Network Error Rate`
+- **Query A** (Network errors):
+  ```promql
+  rate(node_network_receive_errs_total[5m]) + rate(node_network_transmit_errs_total[5m])
+  ```
+- **Query B** (Total packets):
+  ```promql
+  rate(node_network_receive_packets_total[5m]) + rate(node_network_transmit_packets_total[5m])
+  ```
+- **Expression** (Error percentage, a Grafana math expression over queries A and B):
+  ```
+  (${A} / ${B}) * 100
+  ```
+- **Condition**: `IS ABOVE 1` (Alert when the error rate exceeds 1%)
+- **Evaluation interval**: `1m`
+- **For**: `3m` (Alert fires after the condition persists for 3 minutes)
+
+**Labels:**
+
+```
+severity: warning
+component: network
+team: infrastructure
+```
+
+**Annotations:**
+
+```
+summary: High network error rate detected on {{ $labels.instance }}
+description: Network error rate is {{ $value }}% on interface {{ $labels.device }} of node {{ $labels.instance }}, which exceeds the 1% threshold.
+```
+
+#### Alert: Network Interface Down
+
+This alert detects when network interfaces go offline, which is critical for maintaining
+connectivity.
+
+**Alert Configuration:**
+
+- **Name**: `Network Interface Down`
+- **Query**:
+  ```promql
+  up{job="node-exporter"} == 0 or node_network_up == 0
+  ```
+- **Condition**: `IS BELOW 1`
+- **Evaluation interval**: `30s`
+- **For**: `1m`
+
+**Labels:**
+
+```
+severity: critical
+component: network
+team: infrastructure
+```
+
+**Annotations:**
+
+```
+summary: Network interface down on {{ $labels.instance }}
+description: Network interface {{ $labels.device }} on node {{ $labels.instance }} is down or unreachable.
+```
+
+#### Alert: High Network Bandwidth Utilization
+
+This alert monitors for high bandwidth usage that could degrade application performance.
+
+**Alert Configuration:**
+
+- **Name**: `High Network Bandwidth Usage`
+- **Query**:
+  ```promql
+  (rate(node_network_receive_bytes_total[5m]) + rate(node_network_transmit_bytes_total[5m])) * 8 / 1e9
+  ```
+- **Condition**: `IS ABOVE 0.8` (Alert when usage exceeds 800 Mbps on a 1 Gbps link)
+- **Evaluation interval**: `1m`
+- **For**: `5m`
+
+**Labels:**
+
+```
+severity: warning
+component: network
+team: infrastructure
+```
+
+**Annotations:**
+
+```
+summary: High bandwidth utilization on {{ $labels.instance }}
+description: Network bandwidth usage is {{ $value }} Gbps on interface {{ $labels.device }} of node {{ $labels.instance }}, approaching capacity limits.
+```
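+
+The 1 Gbps assumption above is hard-coded into the threshold; if your nodes negotiate
+different link speeds, a relative query may be more robust. The sketch below assumes
+node-exporter's `node_network_speed_bytes` metric is available and reports the negotiated
+speed of each physical interface.
+
+```promql
+# Interface utilization as a fraction (0-1) of the negotiated link speed,
+# excluding loopback and virtual (veth) interfaces.
+(
+    rate(node_network_receive_bytes_total{device!~"lo|veth.*"}[5m])
+  + rate(node_network_transmit_bytes_total{device!~"lo|veth.*"}[5m])
+)
+/ on (instance, device) node_network_speed_bytes
+```
+
+With a condition such as `IS ABOVE 0.8`, this alerts at 80% utilization whether the link
+runs at 1 Gbps, 10 Gbps, or anything else.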
+
+Together, these network alerts provide monitoring comparable to enterprise tools and help
+ensure:
+
+- **Connectivity**: Early detection of interface failures
+- **Performance**: Monitoring for error rates and bandwidth saturation
+- **Reliability**: Proactive alerting before network issues impact users
+
+### Managing Notifications and Policies
+
+After creating rules, you need to configure **how and when alerts are sent**:
+
+- **Contact Points**: Define where alerts should be delivered (e.g., team email, Slack channel).
+- **Notification Policies**: Control routing, grouping, and silencing of alerts. This is particularly useful to:
+  - Prevent alert fatigue by grouping related alerts.
+  - Define escalation paths.
+  - Mute alerts during maintenance windows.
+
+For example, you can create a notification policy that routes all `critical` alerts to
+Slack and all `warning` alerts to email.
+
+---
+
+## Next Steps
+
+- Regularly review and tune your alert thresholds to match real-world workloads.
+- Use **silences** during maintenance windows to avoid noisy alerts.
+- Explore **alert dashboards** to visualize trends in triggered alerts.
+
+For more information on handling alerts in Grafana, check out the official Grafana
+documentation: [Create Grafana-managed alert rules](https://grafana.com/docs/grafana/latest/alerting/alerting-rules/create-grafana-managed-rule/)
+
+
+
+[keycloak-roles-permissions]: /docs/how-tos/configuring-keycloak#in-depth-look-at-roles-and-groups
diff --git a/docs/sidebars.js b/docs/sidebars.js
index 4c82e8c85..bf92d3359 100644
--- a/docs/sidebars.js
+++ b/docs/sidebars.js
@@ -39,6 +39,7 @@ module.exports = {
       "tutorials/using_dask",
       "tutorials/create-dashboard",
       "tutorials/creating-new-environments",
+      "tutorials/create-alerts",
       "tutorials/jupyter-scheduler",
       "tutorials/argo-workflows-walkthrough",
     ],
diff --git a/docs/static/img/tutorials/grafana_alert_rules_page.png b/docs/static/img/tutorials/grafana_alert_rules_page.png
new file mode 100644
index 000000000..23c4f697c
Binary files /dev/null and b/docs/static/img/tutorials/grafana_alert_rules_page.png differ
diff --git a/docs/static/img/tutorials/grafana_home_page_admin.png b/docs/static/img/tutorials/grafana_home_page_admin.png
new file mode 100644
index 000000000..4a68a6d42
Binary files /dev/null and b/docs/static/img/tutorials/grafana_home_page_admin.png differ