---
meta:
  title: How to add alerts to a job
  description: Learn how to add monitoring alerts to Serverless Jobs with Scaleway.
content:
  h1: How to add alerts to a job
  paragraph: Learn how to add monitoring alerts to Serverless Jobs with Scaleway.
tags: jobs alerts grafana threshold monitoring cockpit
dates:
  validation: 2025-02-10
  posted: 2025-02-10
categories:
  - serverless
---

This page shows you how to configure alerts for Scaleway Serverless Jobs using Scaleway Cockpit and Grafana.

<Macro id="requirements" />

- A Scaleway account logged into the [console](https://console.scaleway.com)
- [Owner](/iam/concepts/#owner) status or [IAM permissions](/iam/concepts/#permission) allowing you to perform actions in the intended Organization
- Scaleway resources you can monitor
- [Created Grafana credentials](/cockpit/how-to/retrieve-grafana-credentials/) with the **Editor** role
- [Enabled](/cockpit/how-to/enable-alert-manager/) the alert manager, and [activated preconfigured alerts](/cockpit/how-to/activate-managed-alerts/)
- [Created](/cockpit/how-to/add-contact-points/) at least one contact point
- Selected the **Scaleway Alerting** alert manager in Grafana

1. [Log in to Grafana](/cockpit/how-to/access-grafana-and-managed-dashboards/) using your credentials.
2. Click the **Toggle menu**, then click **Alerting**.
3. Click **Alert rules**, then **+ New alert rule**.
4. Scroll down to the **Define query and alert condition** section and click **Switch to data source-managed alert rule**.
    <Message type="important">
      This allows you to configure alert rules managed by the data source of your choice, instead of using Grafana's managed alert rules.
    </Message>

5. Type in a name for your alert.
6. Select the data source you want to configure alerts for. In this documentation, we use the **Scaleway Metrics** data source.
7. In the **Metrics browser** drop-down, select the metric you want to configure an alert for. Refer to the table below for details on each alert for Serverless Jobs.

| Alert | Pending period | Summary | Query and alert condition | Description |
|---|---|---|---|---|
| **AnyJobError** | 5s | Job run `{{ $labels.resource_id }}` is in error. | `(serverless_job_run:state_failed == 1) OR (serverless_job_run:state_internal_error == 1)` | Job run `{{ $labels.resource_id }}` from the job definition `{{ $labels.resource_name }}` finished in error. Check the console to find the error message. |
| **JobError** | 5s | Job run `{{ $labels.resource_id }}` is in error. | `(serverless_job_run:state_failed{resource_name="your-job-name-here"} == 1) OR (serverless_job_run:state_internal_error{resource_name="your-job-name-here"} == 1)` | Job run `{{ $labels.resource_id }}` from the job definition `{{ $labels.resource_name }}` finished in error. Check the console to find the error message. |
| **AnyJobHighCPUUsage** | 10s | High CPU usage for job run `{{ $labels.resource_id }}`. | `serverless_job_run:cpu_usage_seconds_total:rate30s / serverless_job_run:cpu_limit * 100 > 90` | Job run `{{ $labels.resource_id }}` from the job definition `{{ $labels.resource_name }}` has been using more than `{{ printf "%.0f" $value }}`% of its available CPU for 10 seconds. |
| **JobHighCPUUsage** | 10s | High CPU usage for job run `{{ $labels.resource_id }}`. | `serverless_job_run:cpu_usage_seconds_total:rate30s{resource_name="your-job-name-here"} / serverless_job_run:cpu_limit{resource_name="your-job-name-here"} * 100 > 90` | Job run `{{ $labels.resource_id }}` from the job definition `{{ $labels.resource_name }}` has been using more than `{{ printf "%.0f" $value }}`% of its available CPU for 10 seconds. |
| **AnyJobHighMemoryUsage** | 10s | High memory usage for job run `{{ $labels.resource_id }}`. | `(serverless_job_run:memory_usage_bytes / serverless_job_run:memory_limit_bytes) * 100 > 80` | Job run `{{ $labels.resource_id }}` from the job definition `{{ $labels.resource_name }}` has been using more than `{{ printf "%.0f" $value }}`% of its available RAM for 10 seconds. |
| **JobHighMemoryUsage** | 10s | High memory usage for job run `{{ $labels.resource_id }}`. | `(serverless_job_run:memory_usage_bytes{resource_name="your-job-name-here"} / serverless_job_run:memory_limit_bytes{resource_name="your-job-name-here"}) * 100 > 80` | Job run `{{ $labels.resource_id }}` from the job definition `{{ $labels.resource_name }}` has been using more than `{{ printf "%.0f" $value }}`% of its available RAM for 10 seconds. |

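If you keep your monitoring configuration in version control, the alerts in the table above can also be expressed in the standard Prometheus rule-file format, which data source-managed alert rules closely mirror. The snippet below is a minimal sketch rather than an official Scaleway template: the group name and `severity` label are illustrative, and you should verify that your own tooling can load rule files into the Scaleway Metrics data source before relying on this approach.

```yaml
groups:
  - name: serverless-jobs-alerts    # illustrative group name
    rules:
      - alert: AnyJobError
        # Fires when any job run ends in a failed or internal-error state.
        expr: (serverless_job_run:state_failed == 1) or (serverless_job_run:state_internal_error == 1)
        for: 5s                     # pending period from the table above
        labels:
          severity: critical        # example label, adapt to your own alert routing
        annotations:
          summary: "Job run {{ $labels.resource_id }} is in error."
          description: "Job run {{ $labels.resource_id }} from the job definition {{ $labels.resource_name }} finished in error. Check the console to find the error message."
```

Whether the rule is created in the Grafana UI, as in this procedure, or loaded from a file, the expression, pending period, and annotations carry the same meaning.
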
8. Select labels that apply to the metric you selected in the previous step, to target your desired resources and fine-tune your alert.
9. Select one or more values for your labels.
10. Click **Use query** to generate your alert based on the conditions you have defined.
11. Select a folder to store your rule, or create a new one. Folders allow you to easily manage your different rules.
12. Select an evaluation group to add your rule to. Rules within the same group are evaluated sequentially over the same time interval.
13. In the **Set alert evaluation behavior** field, configure how long the condition must remain in breach before the alert triggers.
    <Message type="note">
      For example, if you wish to be alerted only after your alert has been in breach of the condition for 2 minutes without interruption, type `2` and select `minutes` in the drop-down.
    </Message>
14. Optionally, add a summary and a description.
15. Click **Save rule** in the top right corner of your screen to save your alert. Once the conditions you have configured are met, you will receive an email informing you that your alert has triggered.
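
As a recap, the per-job alerts in the table only differ from the generic ones by a label matcher on the job definition name, and the pending period you configure in step 13 corresponds to the rule's `for` field. The sketch below combines both, scoping **JobHighMemoryUsage** to a hypothetical job definition named `nightly-export` with a 2-minute pending period; the job name and group name are placeholders for illustration only.

```yaml
groups:
  - name: nightly-export-alerts    # placeholder group name
    rules:
      - alert: JobHighMemoryUsage
        # Fires when this specific job run uses more than 80% of its memory limit.
        expr: (serverless_job_run:memory_usage_bytes{resource_name="nightly-export"} / serverless_job_run:memory_limit_bytes{resource_name="nightly-export"}) * 100 > 80
        for: 2m                      # trigger only after 2 minutes of uninterrupted breach
        annotations:
          summary: "High memory usage for job run {{ $labels.resource_id }}."
          description: 'Job run {{ $labels.resource_id }} from the job definition {{ $labels.resource_name }} is using {{ printf "%.0f" $value }}% of its available RAM.'
```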