
one2nc/sre-ai-demo

Introduction

This repo contains programs that use LLMs to demonstrate:

  • RCA quality checks, e.g. classifying an RCA document as clear, vague, or none.
  • Incident correlation, e.g. determining that a current database issue is similar to a previous issue from a given list.
  • Tying root cause analysis to a specific category of known entities, e.g. users, vendors, internal services.

Environment Setup

Set OpenRouter API key (default):

export OPENROUTER_API_KEY=your-key-here

Or use AWS Bedrock:

export AWS_DEFAULT_REGION=us-east-1
export BEDROCK_MODEL_ID=anthropic.claude-3-haiku-20240307-v1:0

Test your API key:

./test_openrouter_api.py

Note: If the default model (deepseek/deepseek-chat) is not returning results, try using an alternative free model:

./check_rca_quality.py --model "x-ai/grok-beta:free" ...
./correlate_incidents.py --model "x-ai/grok-beta:free" ...
./find_root_cause_entity_type.py --model "x-ai/grok-beta:free" ...
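The scripts above support either backend, so provider selection presumably happens from the environment variables set earlier. A minimal sketch of that fallback logic; `choose_provider` is a hypothetical helper for illustration, not a function in this repo:

```python
def choose_provider(env: dict) -> str:
    """Pick an LLM backend from environment variables.

    OPENROUTER_API_KEY takes precedence (the default); otherwise
    fall back to AWS Bedrock when BEDROCK_MODEL_ID is set.
    """
    if env.get("OPENROUTER_API_KEY"):
        return "openrouter"
    if env.get("BEDROCK_MODEL_ID"):
        return "bedrock"
    raise RuntimeError("No LLM provider configured")

# Example: only the Bedrock variables are set.
provider = choose_provider({
    "AWS_DEFAULT_REGION": "us-east-1",
    "BEDROCK_MODEL_ID": "anthropic.claude-3-haiku-20240307-v1:0",
})
print(provider)  # bedrock
```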

Feature - RCA Quality Checker (check_rca_quality.py)

Analyzes RCA document quality and categorizes it as clear, vague, or none.

Good RCA Example

Comprehensive RCA with 5 Whys, timeline, and preventive measures:

./check_rca_quality.py \
  --rca-file sample_data/rca_01_good.md \
  --prompt-file prompts/find_rca_type_prompt.md \
  --incident-title "SSL Certificate Expiration"

Expected output: "rca_type": "clear"

Vague RCA Example

Brief RCA lacking detail and structure:

./check_rca_quality.py \
  --rca-file sample_data/rca_02_vague.md \
  --prompt-file prompts/find_rca_type_prompt.md \
  --incident-title "Database Connection Pool"

Expected output: "rca_type": "vague"

No RCA Example

Incident notes without proper RCA:

./check_rca_quality.py \
  --rca-file sample_data/rca_03_none.md \
  --prompt-file prompts/find_rca_type_prompt.md \
  --incident-title "High Memory Usage"

Expected output: "rca_type": "none" or "rca_type": "vague"
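Downstream tooling can branch on the JSON these runs emit. A sketch of such a consumer; the `rca_type` field is taken from the expected outputs above, while `triage_rca` and its follow-up actions are illustrative, not part of this repo:

```python
import json

def triage_rca(output: str) -> str:
    """Map the checker's JSON output to a hypothetical follow-up action."""
    rca_type = json.loads(output)["rca_type"]
    if rca_type == "clear":
        return "close incident"
    if rca_type == "vague":
        return "request more detail from the author"
    if rca_type == "none":
        return "schedule an RCA review"
    raise ValueError(f"unexpected rca_type: {rca_type}")

print(triage_rca('{"rca_type": "vague"}'))  # request more detail from the author
```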

Feature - Incident Correlation (correlate_incidents.py)

Finds similar historical incidents using semantic matching.

Query With Similar Incidents

Query for memory/CPU issues will find matches:

./correlate_incidents.py \
  --query "High memory usage on service causing performance issues" \
  --incidents-file sample_data/incidents_04_with_similar.json \
  --prompt-file prompts/incident_similarity_search_prompt.md \
  --min-similarity 0.6

Expected: Multiple similar incidents found (INC-001, INC-003, INC-006)

Query Without Similar Incidents

Query for Redis issues won't find matches in this dataset:

./correlate_incidents.py \
  --query "Redis cluster failover causing connection drops" \
  --incidents-file sample_data/incidents_05_without_similar.json \
  --prompt-file prompts/incident_similarity_search_prompt.md \
  --min-similarity 0.6

Expected: No or very few similar incidents found
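The `--min-similarity` flag acts as a cutoff on the per-incident similarity score. A sketch of that filter (the `similarity_score` values mirror this README's sample outputs; the function itself is an illustration, not the script's actual code):

```python
def filter_by_similarity(incidents, min_similarity=0.6):
    """Keep only incidents at or above the similarity cutoff."""
    return [i for i in incidents if i["similarity_score"] >= min_similarity]

candidates = [
    {"incident_id": "INC-103", "similarity_score": 0.75},
    {"incident_id": "INC-105", "similarity_score": 0.65},
    {"incident_id": "INC-104", "similarity_score": 0.40},
]
# Keeps INC-103 and INC-105; INC-104 falls below the 0.6 cutoff.
print(filter_by_similarity(candidates))
```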

Feature - Root Cause Entity Type (find_root_cause_entity_type.py)

Categorizes incidents as internal/vendor/end_user/vague based on the source of the problem.

Internal Incident - Cloud Provider Issue

Cloud provider (AWS/Azure/GCP) issues are INTERNAL because they support YOUR infrastructure:

./find_root_cause_entity_type.py \
  --incident-file sample_data/incident_01.json \
  --prompt-file prompts/find_root_cause_entity_type_prompt_02.md

Expected: "root_cause_entity_type": "internal" (AWS RDS is infrastructure supporting your database)

Vendor Incident - Payment Gateway Issue

Third-party service providers (Stripe, PayPal, etc.) are VENDOR:

./find_root_cause_entity_type.py \
  --incident-file sample_data/incident_02.json \
  --prompt-file prompts/find_root_cause_entity_type_prompt_02.md

Expected: "root_cause_entity_type": "vendor" (Stripe is a third-party service provider)

End User Incident - Device/Browser Issue

Client-side issues on user devices/browsers are END_USER:

./find_root_cause_entity_type.py \
  --incident-file sample_data/incident_03.json \
  --prompt-file prompts/find_root_cause_entity_type_prompt_02.md

Expected: "root_cause_entity_type": "end_user" (Safari iOS browser issue on user device)

Gotcha - Vendor Differentiation: Importance of Clear Prompts

The same AWS RDS incident can be categorized differently based on prompt clarity:

Vague Prompt (Incorrect):

./find_root_cause_entity_type.py \
  --incident-file sample_data/incident_01.json \
  --prompt-file prompts/find_root_cause_entity_type_prompt.md

Output:

{
  "incident_id": "incident_01",
  "incident_title": "AWS RDS Database Connection Failures",
  "root_cause_entity_type": "vendor",
  "reason": "The incident is attributed to AWS RDS, a third-party service provider, experiencing connection timeouts in the us-east-1 region, which directly impacts database connectivity."
}

INCORRECT: Categorizes AWS as vendor

Clear Prompt (Correct):

./find_root_cause_entity_type.py \
  --incident-file sample_data/incident_01.json \
  --prompt-file prompts/find_root_cause_entity_type_prompt_02.md

Output:

{
  "incident_id": "incident_01",
  "incident_title": "AWS RDS Database Connection Failures",
  "root_cause_entity_type": "internal",
  "reason": "The incident is caused by AWS RDS database connection failures in the us-east-1 region, which is part of the cloud provider infrastructure supporting the application. Since cloud provider infrastructure issues are categorized as internal (as they support your infrastructure), this incident falls under the internal category."
}

CORRECT: Categorizes AWS as infrastructure supporting your systems

Key Distinction:

  • Infrastructure Providers (AWS, Azure, GCP) = INTERNAL (they support YOUR infrastructure)
  • Business Service Providers (Stripe, SendGrid, Twilio) = VENDOR (they provide business functionality)

This demonstrates how prompt engineering directly impacts categorization accuracy.
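The key distinction can also be enforced deterministically in post-processing if you maintain allow-lists of known providers. A sketch under that assumption; the provider sets here are illustrative examples from this README, not data shipped with the repo:

```python
# Infrastructure providers support YOUR infrastructure -> internal.
INFRASTRUCTURE_PROVIDERS = {"aws", "azure", "gcp"}
# Business service providers supply business functionality -> vendor.
BUSINESS_PROVIDERS = {"stripe", "sendgrid", "twilio", "paypal"}

def entity_type(provider: str) -> str:
    """Classify a provider name using the allow-lists above."""
    p = provider.lower()
    if p in INFRASTRUCTURE_PROVIDERS:
        return "internal"
    if p in BUSINESS_PROVIDERS:
        return "vendor"
    return "vague"

print(entity_type("AWS"))     # internal
print(entity_type("Stripe"))  # vendor
```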

Gotcha - Understanding LLM Behavior

LLM Inconsistency with Default Temperature

Running the same query multiple times can produce different results due to LLM non-determinism:

First Run:

./correlate_incidents.py \
  --query "Redis cluster failover causing connection drops" \
  --incidents-file sample_data/incidents_05_without_similar.json \
  --prompt-file prompts/incident_similarity_search_prompt.md \
  --min-similarity 0.6

Output:

{
  "similar_incidents": [
    {"incident_id": "INC-103", "similarity_score": 0.75},
    {"incident_id": "INC-105", "similarity_score": 0.65}
  ]
}

Second Run (same command):

Output:

{
  "similar_incidents": []
}

This inconsistency is expected LLM behavior, not a bug.
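One mitigation for this non-determinism, sketched here as a suggestion rather than something the repo implements, is to run the correlation several times and keep only incidents that appear in a majority of the runs:

```python
from collections import Counter

def majority_vote(runs, threshold=0.5):
    """Keep incident IDs that appear in more than `threshold` of the runs."""
    counts = Counter(inc["incident_id"] for run in runs for inc in run)
    needed = len(runs) * threshold
    return sorted(id_ for id_, n in counts.items() if n > needed)

# Three runs of the same query: INC-103 appears twice, INC-105 once.
runs = [
    [{"incident_id": "INC-103"}, {"incident_id": "INC-105"}],
    [{"incident_id": "INC-103"}],
    [],
]
print(majority_vote(runs))  # ['INC-103']
```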

Using Temperature 0.0 for Better Consistency

Setting temperature to 0.0 improves consistency (but doesn't guarantee it):

./correlate_incidents.py \
  --query "Redis cluster failover causing connection drops" \
  --incidents-file sample_data/incidents_05_without_similar.json \
  --prompt-file prompts/incident_similarity_search_prompt.md \
  --min-similarity 0.6 \
  --temperature 0.0

Output:

{
  "similar_incidents": [
    {"incident_id": "INC-103", "similarity_score": 0.75}
  ]
}

More consistent across runs, but temperature 0.0 is NOT a guarantee of determinism, especially with free models. It's a best-effort approach to consistency.

High Temperature Creates Chaos

Setting temperature too high (2.0) produces gibberish and parse errors:

./correlate_incidents.py \
  --query "Redis cluster failover causing connection drops" \
  --incidents-file sample_data/incidents_05_without_similar.json \
  --prompt-file prompts/incident_similarity_search_prompt.md \
  --min-similarity 0.6 \
  --temperature 2.0

Output: JSON parse errors with random text fragments

Recommendation: Use temperature 0.0-0.2 for structured tasks. With free models, accept some variability as a tradeoff. See TEMPERATURE_GUIDE.md for details.
