Commit 09b96f2

Merge pull request #247679 from JnHs/jh-arck8-extorg
add alert steps
2 parents 77eee8a + e7fe15d commit 09b96f2

2 files changed

+234
-8
lines changed

articles/azure-arc/kubernetes/monitor-gitops-flux-2.md

Lines changed: 234 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -1,13 +1,13 @@
11
---
22
title: Monitor GitOps (Flux v2) status and activity
3-
ms.date: 07/28/2023
3+
ms.date: 08/11/2023
44
ms.topic: how-to
55
description: Learn how to monitor status, compliance, resource consumption, and reconciliation activity for GitOps with Flux v2.
66
---
77

88
# Monitor GitOps (Flux v2) status and activity
99

10-
We provide dashboards to help you monitor status, compliance, resource consumption, and reconciliation activity for GitOps with Flux v2 in your Azure Arc-enabled Kubernetes clusters or Azure Kubernetes Service (AKS) clusters. These JSON dashboards can be imported to Grafana to help you view and analyze your data in real time.
10+
We provide dashboards to help you monitor status, compliance, resource consumption, and reconciliation activity for GitOps with Flux v2 in your Azure Arc-enabled Kubernetes clusters or Azure Kubernetes Service (AKS) clusters. These JSON dashboards can be imported to Grafana to help you view and analyze your data in real time. You can also set up alerts for this information.
1111

1212
## Prerequisites
1313

@@ -53,14 +53,78 @@ The **Flux Configuration Compliance Status** table lists all Flux configurations
5353

5454
:::image type="content" source="media/monitor-gitops-flux2/flux-configuration-compliance.png" alt-text="Screenshot showing the Flux Configuration Compliance Status table in the Application Deployments dashboard." lightbox="media/monitor-gitops-flux2/flux-configuration-compliance.png":::
5555

56-
The **Count of Flux Extension Deployments by Status** chart shows the count of clusters, based on their provisioning state.
56+
The **Count of Flux Extension Deployments by Status** chart shows the count of clusters, based on their provisioning state.
5757

5858
:::image type="content" source="media/monitor-gitops-flux2/flux-deployments-by-status.png" alt-text="Screenshot of the Flux Extension Deployments by Status pie chart in the Application Deployments dashboard.":::
5959

6060
The **Count of Flux Configurations by Compliance Status** chart shows the count of Flux configurations, based on their compliance status with respect to the source repository.
6161

6262
:::image type="content" source="media/monitor-gitops-flux2/flux-configurations-by-status.png" alt-text="Screenshot of the Flux Configuration by Compliance Status chart on the Application Deployments dashboard.":::
6363

64+
### Filter dashboard data to track application deployments
65+
66+
You can filter data in the **GitOps Flux - Application Deployments Dashboard** to change the information shown. For example, you can show data for only certain subscriptions or resource groups, or limit data to a particular cluster. To do so, select the filter option either from the top level dropdowns or from any column header in the tables.
67+
68+
For example, in the **Flux Configuration Compliance Status** table, you can select a specific commit from the **SourceLastSyncCommit** column. By doing so, you can track the status of a configuration deployment to all of the clusters affected by that commit.
69+
70+
### Create alerts for extension and configuration failures
71+
72+
After you've imported the dashboard as described in the previous section, you can set up alerts. These alerts notify you when Flux extensions or Flux configurations experience failures.
73+
74+
Follow the steps below to create an alert. Example queries are provided to detect extension provisioning or extension upgrade failures, or to detect compliance state failures.
75+
76+
1. In the left navigation menu of the dashboard, select **Alerting**.
77+
1. Select **Alert rules**.
78+
1. Select **+ Create alert rule**. The new alert rule page opens, with the **Grafana managed alerts** option selected by default.
79+
1. In **Rule name**, add a descriptive name. This name is displayed in the alert rule list, and it's used as the `alertname` label for every alert instance created from this rule.
80+
1. Under **Set a query and alert condition**:
81+
82+
- Select a data source. The same data source used for the dashboard may be used here.
83+
- For **Service**, select **Azure Resource Graph**.
84+
- Select the subscriptions from the dropdown list.
85+
- Enter the query you want to use. For example, for extension provisioning or upgrade failures, you can enter this query:
86+
87+
```kusto
88+
kubernetesconfigurationresources
89+
| where type == "microsoft.kubernetesconfiguration/extensions"
90+
| extend provisioningState = tostring(properties.ProvisioningState)
91+
| where provisioningState == "Failed"
92+
| summarize count() by provisioningState
93+
```
94+
95+
Or for compliance state failures, you can enter this query:
96+
97+
```kusto
98+
kubernetesconfigurationresources
99+
| where type == "microsoft.kubernetesconfiguration/fluxconfigurations"
100+
| extend complianceState=tostring(properties.complianceState)
101+
| where complianceState == "Non-Compliant"
102+
| summarize count() by complianceState
103+
```
104+
105+
- For **Threshold box**, select **A** for input type and set the threshold to **0** to receive alerts even if just one extension fails on the cluster. Mark this as the **Alert condition**.
106+
107+
:::image type="content" source="media/monitor-gitops-flux2/application-dashboard-set-alerts.png" alt-text="Screenshot showing the alert creation process." lightbox="media/monitor-gitops-flux2/application-dashboard-set-alerts.png":::
108+
109+
1. Specify the alert evaluation interval:
110+
111+
- For **Condition**, select the query or expression to trigger the alert rule.
112+
- For **Evaluate every**, enter the evaluation frequency as a multiple of 10 seconds.
113+
- For **Evaluate for**, specify how long the condition must be true before the alert is created.
114+
- In **Configure no data and error handling**, indicate what should happen when the alert rule returns no data or returns an error.
115+
- To check the results from running the query, select **Preview**.
116+
117+
1. Add the storage location, rule group, and any additional metadata that you want to associate with the rule.
118+
119+
- For **Folder**, select the folder where the rule should be stored.
120+
- For **Group**, specify a predefined group.
121+
- If desired, add a description and summary to customize alert messages.
122+
- Add Runbook URL, panel, dashboard, and alert IDs as needed.
123+
124+
1. If desired, add any custom labels. Then select **Save**.
125+
126+
You can also [configure contact points](https://grafana.com/docs/grafana/latest/alerting/alerting-rules/manage-contact-points/) and [configure notification policies](https://grafana.com/docs/grafana/latest/alerting/alerting-rules/create-notification-policy/) for your alerts.
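
To make the **Evaluate every** and **Evaluate for** settings more concrete, here is a simplified Python model of how an alert rule might move from pending to alerting once its condition has held for the configured duration. This is an illustration only, not Grafana's actual implementation: the function name, the state labels, and the exact boundary behavior are assumptions.

```python
def simulate_alert_states(breaches, evaluate_every_s=60, evaluate_for_s=120):
    """Return one state per evaluation: 'Normal', 'Pending', or 'Alerting'.

    breaches: list of booleans, True when the query result exceeds the threshold.
    Simplified, hypothetical model; real Grafana behavior may differ in edge cases.
    """
    if evaluate_every_s <= 0 or evaluate_every_s % 10 != 0:
        raise ValueError("evaluate_every_s must be a positive multiple of 10 seconds")
    states = []
    breached_for = None  # seconds the condition has been continuously true
    for breached in breaches:
        if not breached:
            # Condition cleared: reset the timer and return to normal.
            breached_for = None
            states.append("Normal")
            continue
        # First breach starts the timer at 0; later breaches add one interval.
        breached_for = 0 if breached_for is None else breached_for + evaluate_every_s
        states.append("Alerting" if breached_for >= evaluate_for_s else "Pending")
    return states


if __name__ == "__main__":
    # One evaluation per minute; the condition must hold for two minutes.
    print(simulate_alert_states([False, True, True, True, False]))
```

With a threshold of **0** on the sample queries, any evaluation that finds at least one failed extension or non-compliant configuration counts as a breach in this model.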
127+
64128
## Monitor resource consumption and reconciliations
65129
66130
Follow these steps to import dashboards that let you monitor Flux resource consumption, reconciliations, API requests, and reconciler status.
@@ -118,7 +182,7 @@ Follow these steps to import dashboards that let you monitor Flux resource consu
118182
1. [Link the Managed Prometheus workspace to the Managed Grafana instance](/azure/azure-monitor/essentials/azure-monitor-workspace-manage#link-a-grafana-workspace). This takes a few minutes to complete.
119183
1. Follow the steps to [import these JSON dashboards to Grafana](/azure/managed-grafana/how-to-create-dashboard#import-a-json-dashboard).
120184

121-
After you have imported the dashboards, they'll display information from the clusters that you're monitoring.
185+
After you have imported the dashboards, they'll display information from the clusters that you're monitoring. To show information only for a particular cluster or namespace, use the filters near the top of each dashboard.
122186

123187
The **Flux Control Plane** dashboard shows details about status, resource consumption, reconciliations at the cluster level, and Kubernetes API requests.
124188

@@ -128,11 +192,173 @@ The **Flux Cluster Stats** dashboard shows details about the number of reconcile
128192

129193
:::image type="content" source="media/monitor-gitops-flux2/flux-cluster-stats-dashboard.png" alt-text="Screenshot of the Flux Cluster Stats dashboard." lightbox="media/monitor-gitops-flux2/flux-cluster-stats-dashboard.png":::
130194

131-
## Filter dashboard data to track Application Deployments
195+
### Create alerts for resource consumption and reconciliation issues
196+
197+
After you've imported the dashboard as described in the previous section, you can set up alerts. These alerts notify you of resource consumption and reconciliation issues that may require attention.
198+
199+
To enable these alerts, you deploy a Bicep template similar to the one shown here. The alert rules in this template are samples that can be modified as needed.
200+
201+
Once you've downloaded the Bicep template and made your changes, [follow these steps to deploy the template](/azure/azure-resource-manager/bicep/template-specs).
202+
203+
```bicep
204+
param azureMonitorWorkspaceName string
205+
param alertReceiverEmailAddress string
206+
207+
param kustomizationLookbackPeriodInMinutes int = 5
208+
param helmReleaseLookbackPeriodInMinutes int = 5
209+
param gitRepositoryLookbackPeriodInMinutes int = 5
210+
param bucketLookbackPeriodInMinutes int = 5
211+
param helmRepoLookbackPeriodInMinutes int = 5
212+
param timeToResolveAlerts string = 'PT10M'
213+
param location string = resourceGroup().location
214+
215+
resource azureMonitorWorkspace 'Microsoft.Monitor/accounts@2023-04-03' = {
216+
name: azureMonitorWorkspaceName
217+
location: location
218+
}
219+
220+
resource fluxRuleActionGroup 'Microsoft.Insights/actionGroups@2023-01-01' = {
221+
name: 'fluxRuleActionGroup'
222+
location: 'global'
223+
properties: {
224+
enabled: true
225+
groupShortName: 'fluxGroup'
226+
emailReceivers: [
227+
{
228+
name: 'emailReceiver'
229+
emailAddress: alertReceiverEmailAddress
230+
}
231+
]
232+
}
233+
}
234+
235+
resource fluxRuleGroup 'Microsoft.AlertsManagement/prometheusRuleGroups@2023-03-01' = {
236+
name: 'fluxRuleGroup'
237+
location: location
238+
properties: {
239+
description: 'Flux Prometheus Rule Group'
240+
scopes: [
241+
azureMonitorWorkspace.id
242+
]
243+
enabled: true
244+
interval: 'PT1M'
245+
rules: [
246+
{
247+
alert: 'KustomizationNotReady'
248+
expression: 'sum by (cluster, namespace, name) (gotk_reconcile_condition{type="Ready", status="False", kind="Kustomization"}) > 0'
249+
for: 'PT${kustomizationLookbackPeriodInMinutes}M'
250+
labels: {
251+
description: 'Kustomization reconciliation failing for last ${kustomizationLookbackPeriodInMinutes} minutes.'
252+
}
253+
annotations: {
254+
description: 'Kustomization reconciliation failing for last ${kustomizationLookbackPeriodInMinutes} minutes.'
255+
}
256+
enabled: true
257+
severity: 3
258+
resolveConfiguration: {
259+
autoResolved: true
260+
timeToResolve: timeToResolveAlerts
261+
}
262+
actions: [
263+
{
264+
actionGroupId: fluxRuleActionGroup.id
265+
}
266+
]
267+
}
268+
{
269+
alert: 'HelmReleaseNotReady'
270+
expression: 'sum by (cluster, namespace, name) (gotk_reconcile_condition{type="Ready", status="False", kind="HelmRelease"}) > 0'
271+
for: 'PT${helmReleaseLookbackPeriodInMinutes}M'
272+
labels: {
273+
description: 'HelmRelease reconciliation failing for last ${helmReleaseLookbackPeriodInMinutes} minutes.'
274+
}
275+
annotations: {
276+
description: 'HelmRelease reconciliation failing for last ${helmReleaseLookbackPeriodInMinutes} minutes.'
277+
}
278+
enabled: true
279+
severity: 3
280+
resolveConfiguration: {
281+
autoResolved: true
282+
timeToResolve: timeToResolveAlerts
283+
}
284+
actions: [
285+
{
286+
actionGroupId: fluxRuleActionGroup.id
287+
}
288+
]
289+
}
290+
{
291+
alert: 'GitRepositoryNotReady'
292+
expression: 'sum by (cluster, namespace, name) (gotk_reconcile_condition{type="Ready", status="False", kind="GitRepository"}) > 0'
293+
for: 'PT${gitRepositoryLookbackPeriodInMinutes}M'
294+
labels: {
295+
description: 'GitRepository reconciliation failing for last ${gitRepositoryLookbackPeriodInMinutes} minutes.'
296+
}
297+
annotations: {
298+
description: 'GitRepository reconciliation failing for last ${gitRepositoryLookbackPeriodInMinutes} minutes.'
299+
}
300+
enabled: true
301+
severity: 3
302+
resolveConfiguration: {
303+
autoResolved: true
304+
timeToResolve: timeToResolveAlerts
305+
}
306+
actions: [
307+
{
308+
actionGroupId: fluxRuleActionGroup.id
309+
}
310+
]
311+
}
312+
{
313+
alert: 'BucketNotReady'
314+
expression: 'sum by (cluster, namespace, name) (gotk_reconcile_condition{type="Ready", status="False", kind="Bucket"}) > 0'
315+
for: 'PT${bucketLookbackPeriodInMinutes}M'
316+
labels: {
317+
description: 'Bucket reconciliation failing for last ${bucketLookbackPeriodInMinutes} minutes.'
318+
}
319+
annotations: {
320+
description: 'Bucket reconciliation failing for last ${bucketLookbackPeriodInMinutes} minutes.'
321+
}
322+
enabled: true
323+
severity: 3
324+
resolveConfiguration: {
325+
autoResolved: true
326+
timeToResolve: timeToResolveAlerts
327+
}
328+
actions: [
329+
{
330+
actionGroupId: fluxRuleActionGroup.id
331+
}
332+
]
333+
}
334+
{
335+
alert: 'HelmRepositoryNotReady'
336+
expression: 'sum by (cluster, namespace, name) (gotk_reconcile_condition{type="Ready", status="False", kind="HelmRepository"}) > 0'
337+
for: 'PT${helmRepoLookbackPeriodInMinutes}M'
338+
labels: {
339+
description: 'HelmRepository reconciliation failing for last ${helmRepoLookbackPeriodInMinutes} minutes.'
340+
}
341+
annotations: {
342+
description: 'HelmRepository reconciliation failing for last ${helmRepoLookbackPeriodInMinutes} minutes.'
343+
}
344+
enabled: true
345+
severity: 3
346+
resolveConfiguration: {
347+
autoResolved: true
348+
timeToResolve: timeToResolveAlerts
349+
}
350+
actions: [
351+
{
352+
actionGroupId: fluxRuleActionGroup.id
353+
}
354+
]
355+
}
356+
]
357+
}
358+
}
359+
360+
```
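
The five sample rules in the template differ only in the Flux resource kind and the lookback parameter. As a minimal sketch (written in Python, not part of the template; the helper name `build_flux_rule` is hypothetical), the following shows how each rule's PromQL expression and ISO 8601 `for` duration are composed. This may help when you adapt the template to add or change a rule.

```python
def build_flux_rule(kind: str, lookback_minutes: int = 5) -> dict:
    """Compose the fields that vary between the sample rules (hypothetical helper)."""
    if lookback_minutes <= 0:
        raise ValueError("lookback_minutes must be positive")
    # Same PromQL shape as the template: count resources whose Ready condition is False.
    expression = (
        "sum by (cluster, namespace, name) "
        f'(gotk_reconcile_condition{{type="Ready", status="False", kind="{kind}"}}) > 0'
    )
    description = f"{kind} reconciliation failing for last {lookback_minutes} minutes."
    return {
        "alert": f"{kind}NotReady",
        "expression": expression,
        "for": f"PT{lookback_minutes}M",  # ISO 8601 duration, for example PT5M
        "annotations": {"description": description},
    }


if __name__ == "__main__":
    kinds = ("Kustomization", "HelmRelease", "GitRepository", "Bucket", "HelmRepository")
    for kind in kinds:
        rule = build_flux_rule(kind)
        print(rule["alert"], rule["for"], rule["expression"])
```

The sketch leaves out the fields that are identical across rules in the template (`severity`, `resolveConfiguration`, and `actions`).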
132361

133-
You can filter data in the **GitOps Flux - Application Deployments Dashboard** to change the information shown. For example, you can show data for only certain subscriptions or resource groups, or limit data to a particular cluster. To do so, select the filter option either from the top level dropdowns or from any column header in the tables.
134-
135-
For example, in the **Flux Configuration Compliance Status** table, you can select a specific commit from the **SourceLastSyncCommit** column. By doing so, you can track the status of a configuration deployment to all of the clusters affected by that commit.
136362

137363
## Next steps
138364
