| 1 | +--- |
| 2 | +mode: 'agent' |
| 3 | +description: 'Analyze Azure resource health, diagnose issues from logs and telemetry, and create a remediation plan for identified problems.' |
| 4 | +--- |
| 5 | + |
| 6 | +# Azure Resource Health & Issue Diagnosis |
| 7 | + |
| 8 | +This workflow analyzes a specific Azure resource to assess its health status, diagnose potential issues using logs and telemetry data, and develop a comprehensive remediation plan for any problems discovered. |
| 9 | + |
| 10 | +## Prerequisites |
| 11 | +- Azure MCP server configured and authenticated |
| 12 | +- Target Azure resource identified (name and optionally resource group/subscription) |
| 13 | +- Resource must be deployed and running to generate logs/telemetry |
| 14 | +- Prefer Azure MCP tools (`azmcp-*`) over direct Azure CLI when available |
| 15 | + |
| 16 | +## Workflow Steps |
| 17 | + |
| 18 | +### Step 1: Get Azure Best Practices |
| 19 | +**Action**: Retrieve diagnostic and troubleshooting best practices |
| 20 | +**Tools**: Azure MCP best practices tool |
| 21 | +**Process**: |
| 22 | +1. **Load Best Practices**: |
| 23 | + - Execute Azure best practices tool to get diagnostic guidelines |
| 24 | + - Focus on health monitoring, log analysis, and issue resolution patterns |
| 25 | + - Use these practices to inform diagnostic approach and remediation recommendations |
| 26 | + |
| 27 | +### Step 2: Resource Discovery & Identification |
| 28 | +**Action**: Locate and identify the target Azure resource |
| 29 | +**Tools**: Azure MCP tools + Azure CLI fallback |
| 30 | +**Process**: |
| 31 | +1. **Resource Lookup**: |
| 32 | + - If only the resource name is provided, list the available subscriptions with `azmcp-subscription-list` |
| 33 | + - Run `az resource list --name <resource-name>` against each subscription to find matching resources |
| 34 | + - If multiple matches found, prompt user to specify subscription/resource group |
| 35 | + - Gather detailed resource information: |
| 36 | + - Resource type and current status |
| 37 | + - Location, tags, and configuration |
| 38 | + - Associated services and dependencies |
| 39 | + |
| 40 | +2. **Resource Type Detection**: |
| 41 | + - Identify resource type to determine appropriate diagnostic approach: |
| 42 | + - **Web Apps/Function Apps**: Application logs, performance metrics, dependency tracking |
| 43 | + - **Virtual Machines**: System logs, performance counters, boot diagnostics |
| 44 | + - **Cosmos DB**: Request metrics, throttling, partition statistics |
| 45 | + - **Storage Accounts**: Access logs, performance metrics, availability |
| 46 | + - **SQL Database**: Query performance, connection logs, resource utilization |
| 47 | + - **Application Insights**: Application telemetry, exceptions, dependencies |
| 48 | + - **Key Vault**: Access logs, certificate status, secret usage |
| 49 | + - **Service Bus**: Message metrics, dead letter queues, throughput |
| 50 | + |
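The resource-type detection above can be sketched as a small lookup. This is a minimal illustration, not part of the original workflow: the function name `diagnostic_focus` is hypothetical, the keys assume the standard ARM type strings that `az resource list` reports in its `type` field, and the focus text is copied from the list in Step 2.

```bash
# Hypothetical helper: map an ARM resource type to the diagnostic focus
# listed in Step 2. Input is lowercased first because ARM type names are
# case-insensitive.
diagnostic_focus() {
  type_lc=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$type_lc" in
    microsoft.web/sites)
      echo "Application logs, performance metrics, dependency tracking" ;;
    microsoft.compute/virtualmachines)
      echo "System logs, performance counters, boot diagnostics" ;;
    microsoft.documentdb/databaseaccounts)
      echo "Request metrics, throttling, partition statistics" ;;
    microsoft.storage/storageaccounts)
      echo "Access logs, performance metrics, availability" ;;
    microsoft.sql/servers/databases)
      echo "Query performance, connection logs, resource utilization" ;;
    microsoft.insights/components)
      echo "Application telemetry, exceptions, dependencies" ;;
    microsoft.keyvault/vaults)
      echo "Access logs, certificate status, secret usage" ;;
    microsoft.servicebus/namespaces)
      echo "Message metrics, dead letter queues, throughput" ;;
    *)
      # Unmapped types fall back to the generic assessment.
      echo "Generic health assessment (type not mapped)" ;;
  esac
}

# Example (requires an authenticated Azure CLI; <resource-name> is a placeholder):
#   diagnostic_focus "$(az resource list --name <resource-name> --query "[0].type" -o tsv)"
```

Keeping the mapping in one function makes it easy to extend when a new resource type needs its own diagnostic approach.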
| 51 | +### Step 3: Health Status Assessment |
| 52 | +**Action**: Evaluate current resource health and availability |
| 53 | +**Tools**: Azure MCP monitoring tools + Azure CLI |
| 54 | +**Process**: |
| 55 | +1. **Basic Health Check**: |
| 56 | + - Check resource provisioning state and operational status |
| 57 | + - Verify service availability and responsiveness |
| 58 | + - Review recent deployment or configuration changes |
| 59 | + - Assess current resource utilization (CPU, memory, storage, etc.) |
| 60 | + |
| 61 | +2. **Service-Specific Health Indicators**: |
| 62 | + - **Web Apps**: HTTP response codes, response times, uptime |
| 63 | + - **Databases**: Connection success rate, query performance, deadlocks |
| 64 | + - **Storage**: Availability percentage, request success rate, latency |
| 65 | + - **VMs**: Boot diagnostics, guest OS metrics, network connectivity |
| 66 | + - **Functions**: Execution success rate, duration, error frequency |
| 67 | + |
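As a rough sketch of the basic health check, the provisioning state can be translated into the Healthy/Warning/Critical labels used later in the report. The helper name `health_from_state` and the mapping itself are assumptions made for illustration; Azure does not define this mapping, and the provisioning state reflects only the last management operation, not runtime health.

```bash
# Hypothetical helper: translate an ARM provisioning state into a coarse
# health label. The mapping is an illustrative assumption.
health_from_state() {
  case "$1" in
    Succeeded)
      echo "Healthy" ;;
    Accepted|Creating|Updating|Deleting)
      # An operation is still in flight; treat as a transient warning.
      echo "Warning" ;;
    Failed|Canceled)
      echo "Critical" ;;
    *)
      echo "Unknown" ;;
  esac
}

# Example (requires an authenticated Azure CLI; <resource-id> is a placeholder):
#   state=$(az resource show --ids "<resource-id>" --query "properties.provisioningState" -o tsv)
#   health_from_state "$state"
```

A `Healthy` result here should still be confirmed with the service-specific indicators, since a successfully provisioned resource can be failing requests.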
| 68 | +### Step 4: Log & Telemetry Analysis |
| 69 | +**Action**: Analyze logs and telemetry to identify issues and patterns |
| 70 | +**Tools**: Azure MCP monitoring tools for Log Analytics queries |
| 71 | +**Process**: |
| 72 | +1. **Find Monitoring Sources**: |
| 73 | + - Use `azmcp-monitor-workspace-list` to identify Log Analytics workspaces |
| 74 | + - Locate Application Insights instances associated with the resource |
| 75 | + - Identify relevant log tables using `azmcp-monitor-table-list` |
| 76 | + |
| 77 | +2. **Execute Diagnostic Queries**: |
| 78 | + Use `azmcp-monitor-log-query` with targeted KQL queries based on resource type: |
| 79 | + |
| 80 | + **General Error Analysis**: |
| 81 | + ```kql |
| 82 | + // Recent errors and exceptions |
| 83 | + union isfuzzy=true |
| 84 | + AzureDiagnostics, |
| 85 | + AppServiceHTTPLogs, |
| 86 | + AppServiceAppLogs, |
| 87 | + AzureActivity |
| 88 | + | where TimeGenerated > ago(24h) |
| 89 | + | where Level == "Error" or (isnotempty(ResultType) and ResultType != "Success") |
| 90 | + | summarize ErrorCount=count() by Resource, ResultType, bin(TimeGenerated, 1h) |
| 91 | + | order by TimeGenerated desc |
| 92 | + ``` |
| 93 | + |
| 94 | + **Performance Analysis**: |
| 95 | + ```kql |
| 96 | + // Performance degradation patterns |
| 97 | + Perf |
| 98 | + | where TimeGenerated > ago(7d) |
| 99 | + | where ObjectName == "Processor" and CounterName == "% Processor Time" |
| 100 | + | summarize avg(CounterValue) by Computer, bin(TimeGenerated, 1h) |
| 101 | + | where avg_CounterValue > 80 |
| 102 | + ``` |
| 103 | + |
| 104 | + **Application-Specific Queries**: |
| 105 | + ```kql |
| 106 | + // Application Insights - Failed requests |
| 107 | + requests |
| 108 | + | where timestamp > ago(24h) |
| 109 | + | where success == false |
| 110 | + | summarize FailureCount=count() by resultCode, bin(timestamp, 1h) |
| 111 | + | order by timestamp desc |
| 112 | + |
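| 113 | + // Database - Connection failures (confirm the exact action_name_s value in your workspace's audit logs) |
| 114 | + AzureDiagnostics |
| 115 | + | where TimeGenerated > ago(24h) and ResourceProvider == "MICROSOFT.SQL" |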
| 116 | + | where Category == "SQLSecurityAuditEvents" |
| 117 | + | where action_name_s == "CONNECTION_FAILED" |
| 118 | + | summarize ConnectionFailures=count() by bin(TimeGenerated, 1h) |
| 119 | + ``` |
| 120 | + |
| 121 | +3. **Pattern Recognition**: |
| 122 | + - Identify recurring error patterns or anomalies |
| 123 | + - Correlate errors with deployment times or configuration changes |
| 124 | + - Analyze performance trends and degradation patterns |
| 125 | + - Look for dependency failures or external service issues |
| 126 | + |
| 127 | +### Step 5: Issue Classification & Root Cause Analysis |
| 128 | +**Action**: Categorize identified issues and determine root causes |
| 129 | +**Process**: |
| 130 | +1. **Issue Classification**: |
| 131 | + - **Critical**: Service unavailable, data loss, security breaches |
| 132 | + - **High**: Performance degradation, intermittent failures, high error rates |
| 133 | + - **Medium**: Warnings, suboptimal configuration, minor performance issues |
| 134 | + - **Low**: Informational alerts, optimization opportunities |
| 135 | + |
| 136 | +2. **Root Cause Analysis**: |
| 137 | + - **Configuration Issues**: Incorrect settings, missing dependencies |
| 138 | + - **Resource Constraints**: CPU/memory/disk limitations, throttling |
| 139 | + - **Network Issues**: Connectivity problems, DNS resolution, firewall rules |
| 140 | + - **Application Issues**: Code bugs, memory leaks, inefficient queries |
| 141 | + - **External Dependencies**: Third-party service failures, API limits |
| 142 | + - **Security Issues**: Authentication failures, certificate expiration |
| 143 | + |
| 144 | +3. **Impact Assessment**: |
| 145 | + - Determine business impact and affected users/systems |
| 146 | + - Evaluate data integrity and security implications |
| 147 | + - Assess recovery time objectives and priorities |
| 148 | + |
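The classification levels above can be made concrete with a small rule of thumb. The sketch below is only an illustration: `classify_severity` is a hypothetical helper, and every threshold in it is an assumption to be tuned per workload, not a value taken from Azure guidance.

```bash
# Hypothetical helper: classify severity from availability and error rate,
# both given as whole-number percentages over the analysis window.
# All thresholds are illustrative assumptions.
classify_severity() {
  availability=$1
  error_rate=$2
  if [ "$availability" -lt 95 ] || [ "$error_rate" -ge 25 ]; then
    echo "Critical"
  elif [ "$availability" -lt 99 ] || [ "$error_rate" -ge 5 ]; then
    echo "High"
  elif [ "$error_rate" -ge 1 ]; then
    echo "Medium"
  else
    echo "Low"
  fi
}

# Example: 98% availability with a 2% error rate
#   classify_severity 98 2
```

Security breaches and data loss are always Critical regardless of these numbers, so a metric-based rule like this covers only the availability and error-rate side of the classification.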
| 149 | +### Step 6: Generate Remediation Plan |
| 150 | +**Action**: Create a comprehensive plan to address identified issues |
| 151 | +**Process**: |
| 152 | +1. **Immediate Actions** (Critical issues): |
| 153 | + - Emergency fixes to restore service availability |
| 154 | + - Temporary workarounds to mitigate impact |
| 155 | + - Escalation procedures for complex issues |
| 156 | + |
| 157 | +2. **Short-term Fixes** (High/Medium issues): |
| 158 | + - Configuration adjustments and resource scaling |
| 159 | + - Application updates and patches |
| 160 | + - Monitoring and alerting improvements |
| 161 | + |
| 162 | +3. **Long-term Improvements** (All issues): |
| 163 | + - Architectural changes for better resilience |
| 164 | + - Preventive measures and monitoring enhancements |
| 165 | + - Documentation and process improvements |
| 166 | + |
| 167 | +4. **Implementation Steps**: |
| 168 | + - Prioritized action items with specific Azure CLI commands |
| 169 | + - Testing and validation procedures |
| 170 | + - Rollback plans for each change |
| 171 | + - Monitoring to verify issue resolution |
| 172 | + |
| 173 | +### Step 7: User Confirmation & Report Generation |
| 174 | +**Action**: Present findings and get approval for remediation actions |
| 175 | +**Process**: |
| 176 | +1. **Display Health Assessment Summary**: |
| 177 | + ``` |
| 178 | + 🏥 Azure Resource Health Assessment |
| 179 | + |
| 180 | + 📊 Resource Overview: |
| 181 | + • Resource: [Name] ([Type]) |
| 182 | + • Status: [Healthy/Warning/Critical] |
| 183 | + • Location: [Region] |
| 184 | + • Last Analyzed: [Timestamp] |
| 185 | + |
| 186 | + 🚨 Issues Identified: |
| 187 | + • Critical: X issues requiring immediate attention |
| 188 | + • High: Y issues affecting performance/reliability |
| 189 | + • Medium: Z issues for optimization |
| 190 | + • Low: N informational items |
| 191 | + |
| 192 | + 🔍 Top Issues: |
| 193 | + 1. [Issue Type]: [Description] - Impact: [High/Medium/Low] |
| 194 | + 2. [Issue Type]: [Description] - Impact: [High/Medium/Low] |
| 195 | + 3. [Issue Type]: [Description] - Impact: [High/Medium/Low] |
| 196 | + |
| 197 | + 🛠️ Remediation Plan: |
| 198 | + • Immediate Actions: [Count] items |
| 199 | + • Short-term Fixes: [Count] items |
| 200 | + • Long-term Improvements: [Count] items |
| 201 | + • Estimated Resolution Time: [Timeline] |
| 202 | + |
| 203 | + ❓ Proceed with detailed remediation plan? (y/n) |
| 204 | + ``` |
| 205 | + |
| 206 | +2. **Generate Detailed Report**: |
| 207 | + ```markdown |
| 208 | + # Azure Resource Health Report: [Resource Name] |
| 209 | + |
| 210 | + **Generated**: [Timestamp] |
| 211 | + **Resource**: [Full Resource ID] |
| 212 | + **Overall Health**: [Status with color indicator] |
| 213 | + |
| 214 | + ## 🔍 Executive Summary |
| 215 | + [Brief overview of health status and key findings] |
| 216 | + |
| 217 | + ## 📊 Health Metrics |
| 218 | + - **Availability**: [Availability]% over last 24h |
| 219 | + - **Performance**: [Average response time/throughput] |
| 220 | + - **Error Rate**: [Error rate]% over last 24h |
| 221 | + - **Resource Utilization**: [CPU/Memory/Storage percentages] |
| 222 | + |
| 223 | + ## 🚨 Issues Identified |
| 224 | + |
| 225 | + ### Critical Issues |
| 226 | + - **[Issue 1]**: [Description] |
| 227 | + - **Root Cause**: [Analysis] |
| 228 | + - **Impact**: [Business impact] |
| 229 | + - **Immediate Action**: [Required steps] |
| 230 | + |
| 231 | + ### High Priority Issues |
| 232 | + - **[Issue 2]**: [Description] |
| 233 | + - **Root Cause**: [Analysis] |
| 234 | + - **Impact**: [Performance/reliability impact] |
| 235 | + - **Recommended Fix**: [Solution steps] |
| 236 | + |
| 237 | + ## 🛠️ Remediation Plan |
| 238 | + |
| 239 | + ### Phase 1: Immediate Actions (0-2 hours) |
| 240 | + ```bash |
| 241 | + # Critical fixes to restore service |
| 242 | + [Azure CLI commands with explanations] |
| 243 | + ``` |
| 244 | + |
| 245 | + ### Phase 2: Short-term Fixes (2-24 hours) |
| 246 | + ```bash |
| 247 | + # Performance and reliability improvements |
| 248 | + [Azure CLI commands with explanations] |
| 249 | + ``` |
| 250 | + |
| 251 | + ### Phase 3: Long-term Improvements (1-4 weeks) |
| 252 | + ```bash |
| 253 | + # Architectural and preventive measures |
| 254 | + [Azure CLI commands and configuration changes] |
| 255 | + ``` |
| 256 | + |
| 257 | + ## 📈 Monitoring Recommendations |
| 258 | + - **Alerts to Configure**: [List of recommended alerts] |
| 259 | + - **Dashboards to Create**: [Monitoring dashboard suggestions] |
| 260 | + - **Regular Health Checks**: [Recommended frequency and scope] |
| 261 | + |
| 262 | + ## ✅ Validation Steps |
| 263 | + - [ ] Verify issue resolution through logs |
| 264 | + - [ ] Confirm performance improvements |
| 265 | + - [ ] Test application functionality |
| 266 | + - [ ] Update monitoring and alerting |
| 267 | + - [ ] Document lessons learned |
| 268 | + |
| 269 | + ## 📝 Prevention Measures |
| 270 | + - [Recommendations to prevent similar issues] |
| 271 | + - [Process improvements] |
| 272 | + - [Monitoring enhancements] |
| 273 | + ``` |
| 274 | + |
| 275 | +## Error Handling |
| 276 | +- **Resource Not Found**: Provide guidance on resource name/location specification |
| 277 | +- **Authentication Issues**: Guide user through Azure authentication setup |
| 278 | +- **Insufficient Permissions**: List required RBAC roles for resource access |
| 279 | +- **No Logs Available**: Suggest enabling diagnostic settings and waiting for data |
| 280 | +- **Query Timeouts**: Break down analysis into smaller time windows |
| 281 | +- **Service-Specific Issues**: Provide generic health assessment with limitations noted |
| 282 | + |
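For the query-timeout case, one way to break the analysis into smaller time windows is to generate a KQL time filter per window and run the same query once for each. The helper below is a hypothetical sketch: the name `time_window_filters` and the idea of appending its output to a query are assumptions, not part of the Azure MCP tooling.

```bash
# Hypothetical helper: split the last TOTAL hours into STEP-hour windows and
# print one KQL time filter per window, newest window first.
# Windows are half-open (start, end] so boundary rows are not counted twice.
time_window_filters() {
  total=$1
  step=$2
  end=0
  while [ "$end" -lt "$total" ]; do
    start=$((end + step))
    # Clamp the oldest window so it never reaches past TOTAL hours.
    if [ "$start" -gt "$total" ]; then
      start=$total
    fi
    echo "| where TimeGenerated > ago(${start}h) and TimeGenerated <= ago(${end}h)"
    end=$start
  done
}

# Example: four 6-hour windows covering the last 24 hours
#   time_window_filters 24 6
```

Each printed line can replace the `| where TimeGenerated > ago(24h)` filter in a diagnostic query, and the per-window results can then be combined.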
| 283 | +## Success Criteria |
| 284 | +- ✅ Resource health status accurately assessed |
| 285 | +- ✅ All significant issues identified and categorized |
| 286 | +- ✅ Root cause analysis completed for major problems |
| 287 | +- ✅ Actionable remediation plan with specific steps provided |
| 288 | +- ✅ Monitoring and prevention recommendations included |
| 289 | +- ✅ Clear prioritization of issues by business impact |
| 290 | +- ✅ Implementation steps include validation and rollback procedures |