Commit dfe3eb4

hchenxa and claude authored
add claude skills to create the maestro cluster (#458)
* add claude skills to create the maestro cluster

  Signed-off-by: hchenxa <huichen@redhat.com>

* fix: add timeouts for git clone and make commands in setup-maestro-cluster

  - Add 300s timeout for git clone to prevent indefinite hangs
  - Add 3600s timeout for make personal-dev-env deployment
  - Update error messages to indicate timeout conditions

  Addresses CodeRabbitAI feedback on PR #458

  🤖 Generated with [Claude Code](https://claude.com/claude-code)

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
  Signed-off-by: hchenxa <huichen@redhat.com>

* fix: replace real cluster names with placeholders in run-e2e-tests SKILL.md

  Replace specific cluster names (pers-usw3supa) with generic placeholders
  (<cluster-id>) to avoid exposing real cluster information in documentation.

  🤖 Generated with [Claude Code](https://claude.com/claude-code)

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* fix: replace Chinese labels with English in send-to-slack.sh

  Replace all Chinese labels in Slack diagnostic messages with English:
  - 失败原因 → Root Cause
  - 导致字段冲突 → Leading to field conflicts
  - 实际运行状态 → Actual Status
  - 结论 → Conclusion
  - 级联失败 → Cascading Failure
  - 影响 → Impact

  This ensures the diagnostic tool works properly in international environments.

  🤖 Generated with [Claude Code](https://claude.com/claude-code)

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* refactor: add intelligent log analysis to diagnostic tool

  Add new log analysis module (analyze-logs.sh) that:
  - Dynamically extracts failed Helm releases from logs
  - Identifies resource conflicts with detailed field information
  - Detects error patterns (timing, network, auth, etc.)
  - Analyzes deployment timeline

  Update diagnose.sh to:
  - Use log analysis results to dynamically determine what to check
  - Check resources and namespaces based on actual failures found
  - Include error patterns and timeline in diagnostic report
  - Remove hardcoded assumptions about specific components

  This makes the diagnostic tool more flexible and intelligent, capable of
  analyzing any deployment failure pattern rather than just known issues.

  🤖 Generated with [Claude Code](https://claude.com/claude-code)

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* refactor: make diagnostic report generation fully dynamic

  Replace hardcoded issue detection (hypershift, mce, maestro) with:
  - Dynamic root cause analysis based on error patterns from logs
  - Automatic severity assessment based on pod status
  - Generic issue reporting for any failed Helm release
  - Error pattern categorization (timing, timeout, auth, network, etc.)
  - Resource conflict details dynamically extracted

  Benefits:
  - Can diagnose any deployment failure, not just known issues
  - Adapts to different failure scenarios automatically
  - Provides context-aware recommendations
  - No assumptions about specific components

  🤖 Generated with [Claude Code](https://claude.com/claude-code)

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* fix: make diagnose.sh compatible with bash 3.x

  - Replace associative arrays (bash 4+) with standard text processing
  - Use cut/sort/uniq for pattern counting instead of declare -A
  - Ensure compatibility with macOS default bash 3.2
  - Tested successfully with deployment log analysis and Slack notification

* fix: address CodeRabbitAI critical review comments

  Critical fixes:
  - Fix IFS=':::' delimiter parsing to use proper parameter expansion
  - Fix IFS=: parsing for resource_conflicts.txt to handle colons in resource types
  - Add prerequisite checks for jq (required) and kubelogin (optional)
  - Handle kubelogin failures gracefully with || true
  - Track credential failures in CREDENTIAL_ISSUES and report as critical issues
  - Fix inconsistent severity/CRITICAL_ISSUES logic (remove increment from WARNING branches)
  - Remove unused variables: NAMESPACES_TO_CHECK, max_count
  - Replace single-iteration loop with direct namespace check
  - Sensitive data exposure already mitigated (no kubectl get -o yaml)

  Details:
  - Replace 'while IFS=":::"' with proper line reading and parameter expansion
  - Use awk to properly parse colon-delimited fields with embedded colons
  - Add jq check (exit if missing) and kubelogin check (warn if missing)
  - Initialize CREDENTIAL_ISSUES=0 and increment on az aks failures
  - Report credential failures as critical issues in diagnostic report
  - Remove CRITICAL_ISSUES increment from partial pod failure (WARNING only)
  - Remove CRITICAL_ISSUES increment from timeout pattern (WARNING only)
  - Clean up shellcheck warnings (SC2034, SC2043)

  All actionable CodeRabbitAI comments addressed.

  Signed-off-by: hchenxa <huichen@redhat.com>

---------

Signed-off-by: hchenxa <huichen@redhat.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
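
The commit message names two shell techniques whose code lives in diagnose.sh rather than in the file shown on this page: splitting on a multi-character `:::` delimiter with parameter expansion, and counting patterns with cut/sort/uniq instead of `declare -A`. The following is a minimal sketch of both, not the code from the commit. The function names `split_record` and `count_patterns`, the three-field record layout, and the `name=count` output format are assumptions made for illustration.

```shell
# Hypothetical helpers illustrating the techniques named in the commit message.
# They run on bash 3.2 (no associative arrays, no bash 4+ features).

# Split "a:::b:::c" on the literal ':::' delimiter.
# IFS=':::' would not work here: IFS is a set of single characters, so every
# lone ':' would split a field, breaking values with embedded colons.
split_record() {
    local line=$1
    local first rest second third
    first=${line%%:::*}     # everything before the first ':::'
    rest=${line#*:::}       # everything after the first ':::'
    second=${rest%%:::*}    # middle field
    third=${rest#*:::}      # remainder, may itself contain single colons
    printf '%s|%s|%s\n' "$first" "$second" "$third"
}

# Count how often each value of the first ':'-separated field occurs on stdin.
# Prints "name=count", most frequent first, without using declare -A.
count_patterns() {
    cut -d: -f1 | sort | uniq -c | sort -rn | awk '{print $2 "=" $1}'
}
```

For example, `split_record 'maestro:::ns:::Deployment: field a:b conflict'` keeps the colons inside the third field intact, and piping lines such as `timeout:...` and `auth:...` into `count_patterns` yields one `name=count` line per distinct pattern. The sketch assumes each record has exactly three fields; shorter records would need an explicit check.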
1 parent 0f527cd commit dfe3eb4

11 files changed

+2155
-0
lines changed

Lines changed: 271 additions & 0 deletions
@@ -0,0 +1,271 @@
#!/bin/bash
#
# Deployment Monitor Hook
# This hook monitors long-running deployment processes and notifies when complete
#
# Usage: Can be called after triggering a deployment to monitor its progress
#
# Dependencies:
# - Required: bash, wc, tail, sed, grep, cat, tr, date, sleep, mkdir
# - Optional: curl (for Slack notifications), jq or python3 (for JSON encoding of the Slack payload), osascript (for macOS notifications), notify-send (for Linux notifications)
#
# Configuration:
# Set SLACK_WEBHOOK_URL environment variable for Slack notifications

set -e

HOOK_NAME="deployment-monitor"

# Configuration is loaded from environment variables:
# - SLACK_WEBHOOK_URL: Optional Slack webhook URL for notifications

# Function to check if a command is available; returns non-zero whenever it is missing ("required" only selects ERROR vs WARNING)
check_command() {
    local cmd=$1
    local required=$2

    if ! command -v "$cmd" &> /dev/null; then
        if [ "$required" = "true" ]; then
            echo "[$HOOK_NAME] ERROR: Required command '$cmd' is not installed"
            return 1
        else
            echo "[$HOOK_NAME] WARNING: Optional command '$cmd' is not installed"
            return 1
        fi
    fi
    return 0
}

# Function to monitor deployment
monitor_deployment() {
    local task_id=$1

    if [ -z "$task_id" ]; then
        echo "[$HOOK_NAME] ERROR: Task ID required"
        echo "Usage: $0 monitor <task_id>"
        exit 1
    fi

    # Build task output paths dynamically based on current working directory
    local cwd_sanitized
    cwd_sanitized=$(pwd | tr '/' '-' | sed 's/^-//')
    local task_dir="/tmp/claude/-${cwd_sanitized}/tasks"

    # Ensure task directory exists
    if [ ! -d "$task_dir" ]; then
        echo "[$HOOK_NAME] WARNING: Task directory does not exist: $task_dir"
        echo "[$HOOK_NAME] Creating directory..."
        mkdir -p "$task_dir"
    fi

    local output_file="${task_dir}/${task_id}.output"
    local exit_code_file="${task_dir}/${task_id}.exit_code"
    local start_time
    start_time=$(date +%s)

    echo "[$HOOK_NAME] Monitoring deployment task: $task_id"
    echo "[$HOOK_NAME] Started at: $(date)"
    echo ""

    # Wait for the deployment to complete
    local last_line_count=0
    local max_wait_seconds=${MONITOR_TIMEOUT:-7200} # Default 2 hours
    while true; do
        # Check for timeout
        local elapsed=$(($(date +%s) - start_time))
        if [ "$elapsed" -ge "$max_wait_seconds" ]; then
            local minutes=$((max_wait_seconds / 60))
            echo ""
            echo "[$HOOK_NAME] ERROR: Maximum wait time (${minutes} minutes) reached"
            notify_completion "FAILED" "Deployment monitoring timed out after ${minutes} minutes without completion"
            return 2
        fi

        # Check if exit code file exists (task completed)
        if [ -f "$exit_code_file" ]; then
            local exit_code
            exit_code=$(cat "$exit_code_file")
            echo "[$HOOK_NAME] Deployment process finished with exit code: $exit_code"
            break
        fi

        # Show progress if output file exists
        if [ -f "$output_file" ]; then
            local current_lines
            current_lines=$(wc -l < "$output_file" | tr -d ' ')
            if [ "$current_lines" != "$last_line_count" ]; then
                local elapsed=$(($(date +%s) - start_time))
                local minutes=$((elapsed / 60))
                local seconds=$((elapsed % 60))
                echo "[$HOOK_NAME] Progress: $current_lines lines | Elapsed: ${minutes}m ${seconds}s | $(date +%H:%M:%S)"

                # Show latest activity (ANSI-C quoting passes a literal ESC to sed, since BSD sed on macOS does not understand \x1b)
                tail -3 "$output_file" | sed $'s/\x1b\\[[0-9;]*m//g' | grep -v "^$" | tail -1 | sed "s/^/[$HOOK_NAME] Latest: /"

                last_line_count=$current_lines
            fi
        fi

        # Sleep before next check
        sleep 15
    done

    # Calculate total time
    local end_time
    end_time=$(date +%s)
    local total_time=$((end_time - start_time))
    local minutes=$((total_time / 60))
    local seconds=$((total_time % 60))

    # Determine status and send notification
    if [ "$exit_code" -eq 0 ]; then
        notify_completion "COMPLETE" "Maestro cluster deployment completed successfully in ${minutes}m ${seconds}s!"
        echo ""
        echo "[$HOOK_NAME] Total deployment time: ${minutes}m ${seconds}s"
        echo "[$HOOK_NAME] Output file: $output_file"
        return 0
    else
        notify_completion "FAILED" "Deployment failed with exit code $exit_code after ${minutes}m ${seconds}s"
        echo ""
        echo "[$HOOK_NAME] Total deployment time: ${minutes}m ${seconds}s"
        echo "[$HOOK_NAME] Output file: $output_file"
        return 1
    fi
}

# Function to send Slack notification
send_slack_notification() {
    local status=$1
    local message=$2
    local webhook_url=$3

    if [ -z "$webhook_url" ]; then
        return 1
    fi

    # Check if curl is available
    if ! check_command "curl" "false"; then
        echo "[$HOOK_NAME] Skipping Slack notification - curl not available"
        return 1
    fi

    # Determine color based on status
    local color="good"
    local emoji=":white_check_mark:"
    if [[ "$status" == "FAILED" ]]; then
        color="danger"
        emoji=":x:"
    elif [[ "$status" == "COMPLETE" ]]; then
        color="good"
        emoji=":white_check_mark:"
    fi

    # Create JSON payload using jq for safe escaping
    local payload
    if command -v jq &> /dev/null; then
        # Use jq for safe JSON construction
        payload=$(jq -n \
            --arg color "$color" \
            --arg title "$emoji Maestro Deployment $status" \
            --arg text "$message" \
            --arg footer "Maestro Deployment Monitor" \
            --argjson ts "$(date +%s)" \
            '{attachments: [{color: $color, title: $title, text: $text, footer: $footer, ts: $ts}]}')
    elif command -v python3 &> /dev/null; then
        # Fallback: Use Python for proper JSON encoding
        payload=$(python3 -c "import json, sys; print(json.dumps({'attachments': [{'color': sys.argv[1], 'title': sys.argv[2] + ' Maestro Deployment ' + sys.argv[3], 'text': sys.argv[4], 'footer': 'Maestro Deployment Monitor', 'ts': int(sys.argv[5])}]}))" "$color" "$emoji" "$status" "$message" "$(date +%s)")
    else
        # Last resort: Extended manual escaping for all control characters
        local escaped_message="${message//\\/\\\\}" # Escape backslashes
        escaped_message="${escaped_message//\"/\\\"}" # Escape quotes
        escaped_message="${escaped_message//$'\n'/\\n}" # Escape newlines
        escaped_message="${escaped_message//$'\r'/\\r}" # Escape carriage returns
        escaped_message="${escaped_message//$'\t'/\\t}" # Escape tabs

        local escaped_status="${status//\\/\\\\}"
        escaped_status="${escaped_status//\"/\\\"}"
        escaped_status="${escaped_status//$'\n'/\\n}"
        escaped_status="${escaped_status//$'\r'/\\r}"
        escaped_status="${escaped_status//$'\t'/\\t}"

        payload=$(cat <<EOF
{
    "attachments": [
        {
            "color": "$color",
            "title": "$emoji Maestro Deployment $escaped_status",
            "text": "$escaped_message",
            "footer": "Maestro Deployment Monitor",
            "ts": $(date +%s)
        }
    ]
}
EOF
)
    fi

    # Send to Slack and capture exit status
    # --fail ensures curl returns non-zero on HTTP 4xx/5xx errors
    local curl_exit_code
    if curl -X POST -H 'Content-type: application/json' \
        --data "$payload" \
        "$webhook_url" \
        --silent --show-error --fail; then
        curl_exit_code=0
    else
        curl_exit_code=$?
        echo "[$HOOK_NAME] ERROR: Failed to send Slack notification (curl exit code: $curl_exit_code)"
    fi

    return $curl_exit_code
}

# Function to send notification
notify_completion() {
    local status=$1
    local message=$2

    echo ""
    echo "=========================================="
    echo "[$HOOK_NAME] DEPLOYMENT $status"
    echo "Message: $message"
    echo "Time: $(date)"
    echo "=========================================="
    echo ""

    # Send Slack notification if webhook is configured
    if [ -n "$SLACK_WEBHOOK_URL" ]; then
        echo "[$HOOK_NAME] Sending Slack notification..."
        if send_slack_notification "$status" "$message" "$SLACK_WEBHOOK_URL"; then
            echo "[$HOOK_NAME] Slack notification sent successfully"
        else
            echo "[$HOOK_NAME] Failed to send Slack notification"
        fi
    fi

    # Also send system notification if available (best effort: a failure here must not abort the script under set -e)
    if command -v osascript &> /dev/null; then
        # macOS notification - escape message for AppleScript
        local safe_message="${message//\\/\\\\}"
        safe_message="${safe_message//\"/\\\"}"
        osascript -e "display notification \"$safe_message\" with title \"Maestro Deployment $status\"" || true
    elif command -v notify-send &> /dev/null; then
        # Linux notification - use safe argument passing
        notify-send -- "Maestro Deployment $status" "$message" || true
    fi
}

# Main execution
case "${1:-notify}" in
    monitor)
        monitor_deployment "$2"
        exit $?
        ;;
    notify)
        notify_completion "${2:-COMPLETE}" "${3:-Deployment finished}"
        ;;
    *)
        echo "Usage: $0 {monitor <task_id>|notify <status> <message>}"
        exit 1
        ;;
esac
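
The script above only reads `<task_id>.output` and `<task_id>.exit_code`. This diff does not show what writes them. As a rough sketch of the producer side, assuming nothing beyond the two file names the monitor polls, a wrapper could look like the following. `run_task` and its arguments are hypothetical names, not part of the commit.

```shell
# Hypothetical producer for the files that monitor_deployment polls.
# Usage: run_task <task_dir> <task_id> <command> [args...]
run_task() {
    local task_dir=$1
    local task_id=$2
    shift 2

    mkdir -p "$task_dir"

    # Capture stdout and stderr in <task_id>.output; remember the exit status
    # without letting a failure abort the caller under set -e.
    local status=0
    "$@" > "${task_dir}/${task_id}.output" 2>&1 || status=$?

    # Write the exit code to a temporary file and rename it into place, so a
    # poller never observes an existing but still empty .exit_code file.
    printf '%s\n' "$status" > "${task_dir}/${task_id}.exit_code.tmp"
    mv "${task_dir}/${task_id}.exit_code.tmp" "${task_dir}/${task_id}.exit_code"

    return "$status"
}
```

Writing the exit code last matters because the monitor treats the mere existence of the `.exit_code` file as completion. The rename step is a design choice in this sketch, intended to avoid the window in which the file exists but has no content yet.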
