159 changes: 84 additions & 75 deletions pages/serverless-jobs/how-to/configure-alerts-jobs.mdx
@@ -3,10 +3,12 @@ title: How to configure alerts for a job
description: Learn how to add monitoring alerts to Serverless Jobs with Scaleway.
tags: jobs alerts grafana threshold monitoring cockpit
dates:
validation: 2025-09-02
validation: 2025-09-19
posted: 2025-02-10
---
import Requirements from '@macros/iam/requirements.mdx'
import AdvancedOptionsGrafana from './assets/scaleway-advanced-options.webp'
import DataSourceManaged from './assets/scaleway-datasource-managed.webp'

This page shows you how to configure alerts for Scaleway Serverless Jobs using Scaleway Cockpit and Grafana.

@@ -17,137 +19,144 @@ This page shows you how to configure alerts for Scaleway Serverless Jobs using S
- Scaleway resources you can monitor
- [Created Grafana credentials](/cockpit/how-to/retrieve-grafana-credentials/) with the **Editor** role
- [Enabled](/cockpit/how-to/enable-alert-manager/) the alert manager
- [Created](/cockpit/how-to/add-contact-points/) at least one contact point
- [Added](/cockpit/how-to/add-contact-points/) at least one contact in the Scaleway console or contact points in Grafana
- Selected the **Scaleway Alerting** alert manager in Grafana

1. [Log in to Grafana](/cockpit/how-to/access-grafana-and-managed-dashboards/) using your credentials.
2. Click the **Toggle menu** then click **Alerting**.
3. Click **Alert rules** and **+ New alert rule**.
4. Scroll down to the **Define query and alert condition** section and click **Switch to data source-managed alert rule**.
2. Click the Grafana icon in the top left side of your screen to open the menu.
3. Click the arrow next to **Alerting** on the left-side menu, then click **Alert rules**.
4. Click **+ New alert rule**.
5. Enter a name for your alert.
6. In the **Define query and alert condition** section, toggle **Advanced options**.
<Lightbox image={AdvancedOptionsGrafana} alt="" />
7. Select the data source you want to configure alerts for. For the sake of this documentation, we are choosing the **Scaleway Metrics** data source.
8. In the **Rule type** subsection, click the **Data source-managed** tab.
<Lightbox image={DataSourceManaged} alt="" />

<Message type="important">
This allows you to configure alert rules managed by the data source of your choice, instead of using Grafana's managed alert rules.
Data source managed alert rules allow you to configure alerts managed by the data source of your choice, instead of using Grafana's managed alerting system **which is not supported by Cockpit**.
This step is **mandatory** because Cockpit does not support Grafana’s built-in alerting system, but only alerts configured and evaluated by the data source itself.
</Message>
9. In the query field next to the **Loading metrics... >** button, select the metric you want to configure an alert for. Refer to the table below for details on each alert for Serverless Jobs.

5. Type in a name for your alert.
6. Select the data source you want to configure alerts for. For the sake of this documentation, we are choosing the **Scaleway Metrics** data source.
7. In the Metrics browser drop-down, select the metric you want to configure an alert for. Refer to the table below for details on each alert for Serverless Jobs.

**AnyJobError**
**AnyJobError**

Pending period
Pending period

: 5s
: 5s

Summary
Summary

: Job run `{{ $labels.resource_id }}` is in error.
: Job run `{{ $labels.resource_id }}` is in error.

Query and alert condition
Query and alert condition

: `(serverless_job_run:state_failed == 1)` OR `(serverless_job_run:state_internal_error == 1)`
: `(serverless_job_run:state_failed == 1)` OR `(serverless_job_run:state_internal_error == 1)`

Description
Description

: Job run `{{ $labels.resource_id }}` from the job definition `{{ $labels.resource_name }}` finished in error. Check the console to find out the error message.
: Job run `{{ $labels.resource_id }}` from the job definition `{{ $labels.resource_name }}` finished in error. Check the console to find out the error message.

**JobError**
**JobError**

Pending period
Pending period

: 5s
: 5s

Summary
Summary

: Job run `{{ $labels.resource_id }}` is in error.
: Job run `{{ $labels.resource_id }}` is in error.

Query and alert condition
Query and alert condition

: `(serverless_job_run:state_failed{resource_name="your-job-name-here"} == 1)` OR `(serverless_job_run:state_internal_error{resource_name="your-job-name-here"} == 1)`
: `(serverless_job_run:state_failed{resource_name="your-job-name-here"} == 1)` OR `(serverless_job_run:state_internal_error{resource_name="your-job-name-here"} == 1)`

Description
Description

: Job run `{{ $labels.resource_id }}` from the job definition `{{ $labels.resource_name }}` finished in error. Check the console to find out the error message.
: Job run `{{ $labels.resource_id }}` from the job definition `{{ $labels.resource_name }}` finished in error. Check the console to find out the error message.

**AnyJobHighCPUUsage**
**AnyJobHighCPUUsage**

Pending period
Pending period

: 10s
: 10s

Summary
Summary

: High CPU usage for job run `{{ $labels.resource_id }}`.
: High CPU usage for job run `{{ $labels.resource_id }}`.

Query and alert condition
Query and alert condition

: `serverless_job_run:cpu_usage_seconds_total:rate30s / serverless_job_run:cpu_limit * 100 > 90`
: `serverless_job_run:cpu_usage_seconds_total:rate30s / serverless_job_run:cpu_limit * 100 > 90`

Description
Description

: Job run `{{ $labels.resource_id }}` from the job definition `{{ $labels.resource_name }}` has been using more than `{{ printf "%.0f" $value }}`% of its available CPU for the last 10s.
: Job run `{{ $labels.resource_id }}` from the job definition `{{ $labels.resource_name }}` has been using more than `{{ printf "%.0f" $value }}`% of its available CPU for the last 10s.

**JobHighCPUUsage**
**JobHighCPUUsage**

Pending period
Pending period

: 10s
: 10s

Summary
Summary

: High CPU usage for job run `{{ $labels.resource_id }}`.
: High CPU usage for job run `{{ $labels.resource_id }}`.

Query and alert condition
Query and alert condition

: `serverless_job_run:cpu_usage_seconds_total:rate30s{resource_name="your-job-name-here"} / serverless_job_run:cpu_limit{resource_name="your-job-name-here"} * 100 > 90`
: `serverless_job_run:cpu_usage_seconds_total:rate30s{resource_name="your-job-name-here"} / serverless_job_run:cpu_limit{resource_name="your-job-name-here"} * 100 > 90`

Description
Description

: Job run `{{ $labels.resource_id }}` from the job definition `{{ $labels.resource_name }}` has been using more than `{{ printf "%.0f" $value }}`% of its available CPU for the last 10s.
: Job run `{{ $labels.resource_id }}` from the job definition `{{ $labels.resource_name }}` has been using more than `{{ printf "%.0f" $value }}`% of its available CPU for the last 10s.

**AnyJobHighMemoryUsage**
**AnyJobHighMemoryUsage**

Pending period
Pending period

: 10s
: 10s

Summary
Summary

: High memory usage for job run `{{ $labels.resource_id }}`.
: High memory usage for job run `{{ $labels.resource_id }}`.

Query and alert condition
Query and alert condition

: `(serverless_job_run:memory_usage_bytes / serverless_job_run:memory_limit_bytes ) * 100 > 80`
: `(serverless_job_run:memory_usage_bytes / serverless_job_run:memory_limit_bytes ) * 100 > 80`

Description
Description

: Job run `{{ $labels.resource_id }}` from the job definition `{{ $labels.resource_name }}` has been using more than `{{ printf "%.0f" $value }}`% of its available RAM for the last 10s.
: Job run `{{ $labels.resource_id }}` from the job definition `{{ $labels.resource_name }}` has been using more than `{{ printf "%.0f" $value }}`% of its available RAM for the last 10s.

**JobHighMemoryUsage**
**JobHighMemoryUsage**

Pending period
Pending period

: 10s
: 10s

Summary
Summary

: High memory usage for job run `{{ $labels.resource_id }}`.
: High memory usage for job run `{{ $labels.resource_id }}`.

Query and alert condition
Query and alert condition

: `(serverless_job_run:memory_usage_bytes{resource_name="your-job-name-here"} / serverless_job_run:memory_limit_bytes{resource_name="your-job-name-here"}) * 100 > 80`
: `(serverless_job_run:memory_usage_bytes{resource_name="your-job-name-here"} / serverless_job_run:memory_limit_bytes{resource_name="your-job-name-here"}) * 100 > 80`

Description
Description

: Job run `{{ $labels.resource_id }}` from the job definition `{{ $labels.resource_name }}` has been using more than `{{ printf "%.0f" $value }}`% of its available RAM for the last 10s.
: Job run `{{ $labels.resource_id }}` from the job definition `{{ $labels.resource_name }}` has been using more than `{{ printf "%.0f" $value }}`% of its available RAM for the last 10s.

8. Select labels that apply to the metric you have selected in the previous step, to target your desired resources and fine-tune your alert.
9. Select one or more values for your labels.
10. Click **Use query** to generate your alert based on the conditions you have defined.
11. Select a folder to store your rule, or create a new one. Folders allow you to easily manage your different rules.
12. Select an evaluation group to add your rule to. Rules within the same group are evaluated sequentially over the same time interval.
13. In the **Set alert evaluation behavior** field, configure the amount of time during which the alert can be in breach of the condition(s) you have defined until it triggers.
<Message type="note">
For example, if you wish to be alerted after your alert has been in breach of the condition for 2 minutes without interruption, type `2` and select `minutes` in the drop-down.
</Message>
14. Optionally, add a summary and a description.
15. Click **Save rule** at the top right corner of your screen to save your alert. Once your alert meets the requirements you have configured, you will receive an email to inform you that your alert has been triggered.
10. Make sure that the values for the labels you have selected correspond to those of the target resource.
11. In the **Set alert evaluation behavior** section, specify how long the condition must be met before triggering the alert.
12. Enter a name in the **Namespace** and **Group** fields to categorize and manage your alert rules. Rules that share the same group will use the same configuration, including the evaluation interval which determines how often the rule is evaluated (by default: every 1 minute). You can modify this interval later in the group settings.
<Message type="note">
The evaluation interval is different from the pending period set in step 11. The evaluation interval controls how often the rule is checked, while the pending period defines how long the condition must be continuously met before the alert fires.
</Message>
13. In the **Configure labels and notifications** section, click **+ Add labels**. A pop-up appears.
14. Enter a label name and a value, then click **Save**. You can skip this step if you want your alerts to be sent to the contacts you may already have created in the Scaleway console.
<Message type="note">
In Grafana, notifications are sent by matching alerts to notification policies based on labels. This step is about deciding how alerts will reach you or your team (Slack, email, etc.) based on labels you attach to them. Then, you can set up rules that define who receives notifications in the **Notification policies** page.
Find out how to [configure notification policies in Grafana](/tutorials/configure-slack-alerting/#configuring-a-notification-policy).
</Message>
15. Click **Save rule and exit** in the top right corner of your screen to save and activate your alert. Once your alert meets the requirements you have configured, you will receive an email to inform you that your alert has been triggered.
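
For reference, the fields filled in during the steps above map closely onto a Prometheus-style rule definition, which is the general format used for data source-managed alert rules. The sketch below shows how the **AnyJobError** alert from the table might look in that format. It is an illustration under assumptions, not an export from Cockpit: the group name `serverless-jobs-alerts` and the `severity: warning` label are hypothetical, and the 1-minute interval is the default mentioned in step 12.

```yaml
groups:
  # Hypothetical group name; corresponds to the "Group" field in Grafana.
  - name: serverless-jobs-alerts
    # Evaluation interval: how often the rule is checked (default per step 12).
    interval: 1m
    rules:
      - alert: AnyJobError
        # Query and alert condition, taken from the table above.
        expr: (serverless_job_run:state_failed == 1) or (serverless_job_run:state_internal_error == 1)
        # Pending period: how long the condition must hold before the alert fires.
        for: 5s
        labels:
          # Hypothetical label, used only to match a notification policy.
          severity: warning
        annotations:
          summary: 'Job run {{ $labels.resource_id }} is in error.'
          description: >-
            Job run {{ $labels.resource_id }} from the job definition
            {{ $labels.resource_name }} finished in error.
            Check the console to find out the error message.
```

Note that a rule can only change state when it is evaluated. With a 1-minute interval and a 5-second pending period, as in this sketch, the alert would typically become pending at one evaluation and fire at the next one if the condition still holds, so the effective delay is closer to the evaluation interval than to the pending period.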