articles/sre-agent/incident-management-tools.md
+1 -18 Lines changed: 1 addition & 18 deletions
@@ -14,7 +14,7 @@ Azure SRE Agent leverages a comprehensive suite of specialized incident manageme
14
14
15
15
The tools are designed to handle various scenarios including:
16
16
17
-
- **Incident Management**: Acknowledge, manage, and track incidents across ICM and PagerDuty
17
+
- **Incident Management**: Acknowledge, manage, and track incidents in PagerDuty
18
18
- **Resource Monitoring**: Analyze performance metrics, health status, and configuration changes
19
19
- **Diagnostic Operations**: Collect logs, analyze memory dumps, and perform deep system analysis
20
20
- **Infrastructure Management**: Scale resources, modify configurations, and manage network settings
@@ -28,16 +28,12 @@ The following table provides a comprehensive list of all tools available to Azur
28
28
29
29
| Tool name | Description |
30
30
|---|---|
31
-
| AcknowledgeIncident | Acknowledges an ICM incident |
32
31
| AcknowledgePagerDutyIncident | Acknowledges a PagerDuty incident |
33
-
| AddDiscussionEntryRCAContainerApp |**Note: DO NOT CALL IT AUTOMATICALLY. ALWAYS ASK USER BEFORE CALLING IT** <br><br> Add a valid HTML-formatted message discussion entry or summary of final investigate to an ICM incident <br><br> This operation adds a discussion entry to the given IcM Incident. <br><br> input parameters: <br><br> - incidentId: The Id of the IcM incident. It is usually an integer number. <br><br> - text: A well HTML-formatted message to add as discussion to IcM. <br><br> NOTE: <br><br> - text MUST be always valid HTML formatted message <br><br> - Remove all emojis if any present. <br><br> The operation adds a discussion entry to the given incident. <br><br> The return value is a boolean value for indicating if the operation is successful. |
34
32
| AddIgnoreTagToResource | Adds a tag to a resource to prevent it from being flagged in a scan for a specified period of time. |
35
-
| AddKeywordToIncident | Add a keyword to an ICM incident |
36
33
| AddParentIncidentLink | Adds a parent incident link to the given incident id |
37
34
| AddRelatedIncidentLink | Adds a related incident link to the given incident id |
38
35
| AddRoleAssignment | Adds a role assignment for a user or managed identity on an Azure resource |
39
36
| AddSourceCodeNodeToContainerAppNode | Adds the GitHub repo url node and an edge from the container app node to it |
40
-
| AddTagToIncident | Add a tag to an ICM incident |
41
37
| AnalyzeDotnetAppMemoryInAKSContainer | Performs memory analysis for a .NET application running in a specific pod and container within an AKS cluster. <br><br> This analysis involves collecting a memory dump, running an analyzer tool inside the container, and returning the analysis results. <br><br> This tool can help identify memory leaks, high memory usage patterns, and other memory-related issues in .NET applications. <br><br> Use this tool when investigating memory problems for a .NET app in AKS. <br><br> Example: Analyze the memory of the .NET app in pod cart-service-pod-abc789 in the e-commerce namespace. <br><br> Example: My .NET app order-processor in pod proc-pod-123 seems to be using too much memory, can you analyze it? |
42
38
| AskUserForInput | Sends the specified message to the user and indicates that you require a response to proceed. Do not use this tool for any scenario where you just need to send the user an update in a fire and forget manner. If the user responds in a manner that does not satisfactorily answer your question, use this tool again. |
43
39
| AutoScaleApp | Creates autoscale settings so that an app scales automatically |
@@ -63,7 +59,6 @@ The following table provides a comprehensive list of all tools available to Azur
63
59
| DiscoverApplications | Analyzes an Azure subscription and returns a `List<ApplicationGraph>`, where each ApplicationGraph represents a distinct application. Each ApplicationGraph contains: id, name, entryPoint (main resource Node), nodes (`List<Node>` of related resources), and edges (`List<Edge>` showing relationships). Entry points are identified from Container Apps, App Services. The function maps out application topologies, including all connected resources and relationships. Returns an empty list if no applications are found. |
64
60
| DiscoverPrometheusMetrics | Discover available Prometheus metrics with optional filtering. |
65
61
| DnsServerManagerOperation | Retrieve DNS server manager operations to identify any DNS resolution issues that might affect outbound connections for a specific container app pod. <br><br> This query examines logs from DNS-related components to identify configuration issues or operational problems. <br><br> <br><br> What this metric measures: <br><br> - DNS server manager operations and their outcomes <br><br> - DNS listener manager activities <br><br> - CoreDNS manager operations <br><br> - Timing and trace information for DNS operations <br><br> <br><br> When it is applicable: <br><br> Useful for correlating connection issues with DNS problems, identifying DNS configuration changes, or troubleshooting name resolution failures. |
66
-
| DowngradeSeverity | Downgrade severity of ICM incident 2 to 3 |
67
62
| EventHubSetLocalAuthSupport | Sets the key based local auth setting on event hub accounts microsoft.eventhub/namespaces. This procedure forces callers to use non key based authentication methods such as managed identities or service principals. |
68
63
| ExecuteClusterKustoQuery | Executes a fully qualified Kusto query on a specific cluster and database, returning the result in JSON format. |
69
64
| ExtractTextFromImageInGitHubIssue | Extract text from an image in a GitHub issue body or comment. The image URL is of the form https://github.com/user-attachments/assets/GUID.|
@@ -147,12 +142,9 @@ The following table provides a comprehensive list of all tools available to Azur
147
142
| GetCustomContainerSessionEnvoyRequests | Get all failed envoy requests for a custom container session in the given time range. <br><br> This is useful to identify the issues with failed requests. <br><br> For each failed request, it returns the Status which is the status code and ResponseCodeDetails which is the envoy response code for the request. |
148
143
| GetCustomContainerSessionLegionPoolStatus | Get the status of a custom container session legion pool for a given subscription, resource group, and session pool name. <br><br> It returns the number of pods in pool which are ready, pending, allocated and inactive. |
149
144
| GetCustomDNSServers | Get list of custom DNS servers configured for the container app environment at start and end of time window. It also checks if custom DNS servers are configured or not. <br><br> If no data is returned then ask to validate inputs again as it should never be the case. |
150
-
| GetCustomFields | Get ICM incident custom fields |
151
145
| GetDeleteNetworkContainerOperation | Retrieves the delete operation details for a specific NetworkContainerID. <br><br>This tool will return all the DeleteNetworkContainer operations with detailed Message. <br><br>- If no results are returned, it means there is no delete operation for the specified NetworkContainerID within the given time range. You need to highlight it since no delete operation was found it may indicate that the NetworkContainerID is leaked. <br><br>- If the results are not empty, it means the delete operation was performed successfully or failed. The Message field will provide more details about the operation. Always show timestamp, NodeId, ContainerId, OperationName and Message fields in the result. |
152
146
| GetDeploymentActivity | Gets Deployment Activities on the specified app |
153
147
| GetDeploymentTimes | Get the deployment times of a Container App instance. |
154
-
| GetDiscussionEntries | Get ICM discussion entries |
155
-
| GetDiscussionEntriesRCAContainerApp | Get original ICM discussion entries <br><br> This operation will get all the discussion entries of the given IcM Incident. <br><br> Input parameters: <br><br> - IncidentId: The Id of the IcM incident. It is usually an integer number. <br><br> - QueryFrom: The timestamp for filter the discussion entries which are created after it. <br><br> The return value is a list of discussion entries of the given IcM Incident. Each discussion entry includes the following information: <br><br> - IncidentId: The Id of the IcM incident. <br><br> - TimeStamp: The timestamp of the discussion entry. <br><br> - ChangedBy: The user who created this discussion entry. |
156
148
| GetDNSConfigUpdateStatus | Get DNS config update status for the container app environment for a given time frame |
157
149
| GetEnvironmentQuota | Get Container App Environment Quota limit. <br><br> Input parameters: <br><br> - environmentResourceURL: The resource url of the container app environment. Format /subscriptions/[SubscriptionId]/resourceGroups/[resource group name]/providers/Microsoft.App/managedEnvironments/[environment name] <br><br> - region: The region of the quota needs to be set. example eastus <br><br> - quotaType: The quota type. example ManagedEnvironmentConsumptionCores <br><br> The return value is a string containing the quota limit value for the specified environment, region, and quota type. |
158
150
| GetEnvironmentQuotaOperationResult | Get the operation result of setting Managed Environment Quota limit. <br><br>Input parameters: <br><br>- operationId: The trace id of the operation, which can be used to track the operation in the Kusto table ContainerAppsAdminEvents. <br><br>- region: The region of the quota needs to be set. <br><br> <br><br>Output: <br><br>- PreciseTimeStamp: the time when the operation is completed. <br><br>- operationStatus: the status of the operation result. <br><br>- message: Describes the set Managed Environment Quota limit operation result. |
@@ -181,11 +173,7 @@ The following table provides a comprehensive list of all tools available to Azur
181
173
| GetFunctionAppSlotSwapHistory | Gets detailed Function App slot swap information to analyze swap operations. Retrieves chronological slot swap records, including source and target slots, operation status, and timestamps. Returns comprehensive slot swap timeline with success/failure information. |
182
174
| GetGeneralHealth | Retrieves dashboard metrics for a specific Azure resource and generates an AI-powered health summary. This function is useful when you need to: 1) Get a quick health assessment of a resource/general health of the resource for questions like 'how is my resource doing?', 2) Understand performance trends and potential issues, 3) View summarized metrics without accessing the Azure portal, or 4) Get actionable insights about resource behavior. The resources themselves also have a health score card; use this method for verbose analysis. The output is a text summary that describes the resource's health status, important metrics, and any anomalies or concerns. |
183
175
| GetHostRuntimeErrorEvents | Gets host runtime error events from the activity logs for an Azure Function App |
184
-
| GetIcmCorrelationAndLinkingRules | This tool identifies potential relationships between incidents. Invoke this tool whenever the user requests assistance with finding related, parent, or child incidents; especially when conditions such as time windows, title matching, or shared patterns are specified. The rules are applied internally to guide the agents actions without being returned to the user. |
185
176
| GetImageReferenceFromResourceId | Gets the container image reference from a resource ID |
186
-
| GetIncidentInfo | Get ICM incident details |
187
-
| GetIncidentInfoRCAContainerApp | Get original ICM incident information. |
188
-
| GetIncidentRepairItems | Get repair items associated with an ICM incident |
189
177
| GetInputPressureOnLogProcessor | Get Input Pressure on Log Processor for the managed Kubernetes cluster, segmented by node or VMSS over a specified time range. <br><br> <br><br> What this metric measures: The query calculates the total records input to log-processor. <br><br> <br><br> When it is applicable: Anomaly in this indicates high resource pressure on the log-processor. |
190
178
| GetIssueInvestigationTimeRangeRCAContainerApp | Calculates the effective time range for issue investigation based on the available input parameters. <br><br> At least one of the following must be provided: issueFirstOccurrence, issueLastOccurrence, or reportedIssueObservedOnTime. <br><br> **Important:** <br><br> - Do NOT use this function if none of the input parameters are available. |
191
179
| GetJobDefinition | Retrieve the Container Apps job definition (spec) for a given Container App Job <br><br>Projects: <br><br> - Timestamp: Timestamp of the job definition. More than 1 row indicates change in job definition (spec). <br><br> - Configuration: Configuration details for the job, like trigger type, retries, job deadlines, completion times <br><br> parallelism for the job, container registry, assigned identity details. <br><br> - Template: Job template containing job containers details, cpu, memory resource details. <br><br> - Labels: Labels for the job. It has the managed environment name and workload profile name for the job. <br><br> - Status: Status of the container app Job. It has jobRunningState and jobProvisioningState. <br><br> Possible values for jobRunningState: Running, Suspended. <br><br> Possible values for jobProvisioningState: Provisioned, Failed. |
@@ -298,7 +286,6 @@ The following table provides a comprehensive list of all tools available to Azur
298
286
| ListRevisions | List all revisions for a container app by its resource ID. |
299
287
| ListSubscriptions | Returns a list of all Azure subscription IDs present in the knowledge graph. This function is useful when you need to: 1) Discover available subscriptions, 2) Verify subscription visibility to the agent, 3) Get subscription IDs for use with other commands, or 4) Perform an inventory of monitored subscriptions. The output is a list of subscription IDs without additional details. |
300
288
| ListWorkloadRevisions | List all revisions for a specific Kubernetes workload (Deployment or StatefulSet) and sort by revision number. <br><br>For deployments, it fetches ReplicaSets owned by the deployment. <br><br>For StatefulSets, it fetches ControllerRevision objects. <br><br>Used whenever user wants to check the revision history of a workload. <br><br>eg: show me all revisions of the nginx deployment in the default namespace. |
301
-
| MitigateIncident | Mitigate ICM incident |
302
289
| ModifyContainerAppScaleRule | Adds a new scaling rule to a Container App. Use this to define custom scaling behavior based on CPU, HTTP traffic, Azure Queue length, or any scaler from the scaler list. |
303
290
| ModifyGrafanaDashboard | Modifies an existing Grafana dashboard based on user-requested changes or creates a new one from a template. Dashboard can be specified by name or UID. |
304
291
| NotifyUser | Sends the specified message to the user. Use this to send updates about your current task as you are working on it. Do not use this for asking questions to the user, only for status updates. |
@@ -311,7 +298,6 @@ The following table provides a comprehensive list of all tools available to Azur
311
298
| PlotPieChart | Generates a pie chart from the provided data and returns (or posts) it. <br><br>Parameters: <br><br>chartTitle: The title displayed at the top of the pie chart. <br><br>dataPoints: Semicolon-separated items in format sliceLabel|value, for example: Category A|45;Category B|30;Category C|25. <br><br>description: A short message to summarize the image. |
312
299
| PlotScatter | Generates a scatter plot from X-Y coordinate pairs and returns (or posts) it. <br><br>Parameters: <br><br>chartTitle: The title displayed at the top of the scatter plot. <br><br>xAxisLabel: Label for the X-axis. <br><br>yAxisLabel: Label for the Y-axis. <br><br>dataPoints: Semicolon-separated items in format x|y|label, <br><br>Example: 1.2|3.4|Point A;2.3|4.5|Point B;3.4|5.6|Point C <br><br>description: A short message to summarize the image. |
313
300
| PlotTimeSeriesData | Generates a base64-encoded chart from time-series data. <br><br>Used whenever giving a comparison to user. Example: how many of my total monitored apps basic auth enabled <br><br> <br><br>Arguments: <br><br>title: for example, Application Metrics Dashboard <br><br>yAxisLabel: for example, Usage (%) <br><br>yAxisMin: numeric, for example, 0 <br><br>yAxisMax: numeric, for example, 100 <br><br>dataPoints: semicolon-separated list of data points, each in the format: <br><br>2024-01-25T10:30:00|75.4|CPU Usage For multiple points, separate each with a semicolon: <br><br>2024-01-25T10:30:00|75.4|CPU Usage;2024-01-25T10:35:00|82.1|Memory Usage <br><br>description: text to accompany the chart when posting the image |
314
-
| PostDiscussionEntry | Post ICM discussion entry |
315
301
| PowerOnVirtualMachine | Power ON an Azure virtual machine |
316
302
| ProfileDotnetAppCpuInAKSContainer | Performs CPU profiling for a .NET application running in a specific pod and container. <br><br> The analysis (topN report) is also performed inside the container, and its result is returned. <br><br> Failures during tool installation or profiling will be reported in the output. <br><br> eg: Profile CPU of my-app-pod in default for 60s. |
317
303
| PublishDashboardWithPrometheusDataSource | Publishes a dashboard with a linked Prometheus data source in a single operation |
@@ -321,8 +307,6 @@ The following table provides a comprehensive list of all tools available to Azur
321
307
| RemoveParentIncidentLink | Removes a parent incident link from the given incident id |
322
308
| RemoveRelatedIncidentLink | Removes a related incident link from the given incident id |
323
309
| RemoveRoleAssignment | Removes a role assignment for a user or managed identity on an Azure resource |
324
-
| ResolveIncident | Resolve ICM incident |
325
-
| ResolveIncidentRCAContainerApp | Resolve an ICM incident. This operation will set the given IcM Incident to Resolved state. And you must give a reason of this resolve action. <br><br> **Note: Always confirm with the user before resolving the ICM incident, or proceed only if the user has already provided confirmation** <br><br> <br><br> Input parameters: <br><br> - incidentId: The Id of the IcM incident. It is usually an integer number. <br><br> - reason: Usually it is a reason why you can resolve this incident. <br><br> The operation will mark the given incident as resolved. The return value is a boolean value for indicating if the operation is successful. |
326
310
| ResolvePagerDutyIncident | Resolves a PagerDuty incident |
327
311
| RestartContainerApp | Restarts a container app. Use this to restart a container app to resolve transient issues that may be fixed by restarting the instance. |
328
312
| RestartWebApp | Restart an AppService app |
@@ -355,7 +339,6 @@ The following table provides a comprehensive list of all tools available to Azur
355
339
| SuggestNextSku | Given a current sku suggest a possible next sku |
356
340
| ToolName | Description |
357
341
| TrackSwiftILBGreKeyConflicts | This function queries the NetworkServiceManagerEvents table to identify Swift network container errors related to GRE key conflicts in environments using Internal Load Balancers (ILB). <br><br>This is particularly useful for diagnosing issues where internal traffic fails to route correctly due to overlapping GRE keys. |
358
-
| TransferIncident | Transfer ICM incident |
359
342
| TriggerFunctionAppSync | Triggers a sync operation on a Function Apps host to check for runtime errors or refresh the function app |
360
343
| UpdateAppSettings | Updates specific configuration values in the App Settings for a given Azure resource. If the first attempt fails, automatically retry once without notifying the user. |
361
344
| UpdateContainerImage | Updates the container image for a Container App. This enables changing to a new image version or completely different image. Returns detailed information about the update operation including success status, original image, new image, and reasons for failure if applicable. Note that this tool requires explicit users approval before it can be used. |
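
The PlotPieChart row above describes its `dataPoints` argument as semicolon-separated items in the form `sliceLabel|value`. The following is a minimal sketch of how such a string could be assembled and read back. The helper names `format_pie_data_points` and `parse_pie_data_points` are hypothetical and are not part of Azure SRE Agent; only the separator convention comes from the table, and the rejection of labels containing `|` or `;` is an assumption made for safety.

```python
def format_pie_data_points(slices):
    """Join (label, value) pairs as 'label|value' items separated by ';'.

    Hypothetical helper: only the 'sliceLabel|value;...' layout is taken
    from the PlotPieChart description.
    """
    parts = []
    for label, value in slices:
        # Assumption: labels must not contain the two separator characters.
        if "|" in label or ";" in label:
            raise ValueError(f"label contains a reserved separator: {label!r}")
        parts.append(f"{label}|{value}")
    return ";".join(parts)


def parse_pie_data_points(data_points):
    """Split a 'label|value;label|value' string into (label, float) pairs."""
    pairs = []
    for item in data_points.split(";"):
        label, value = item.split("|")
        pairs.append((label, float(value)))
    return pairs


if __name__ == "__main__":
    encoded = format_pie_data_points(
        [("Category A", 45), ("Category B", 30), ("Category C", 25)]
    )
    print(encoded)
    print(parse_pie_data_points(encoded))
```

The same pattern would extend to the three-field `x|y|label` items described for PlotScatter, with one more field per item.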