bluearchio/aws-misconfig-db: LLM-formatted AWS misconfiguration library

╔═══════════════════════════════════════════════════════════════════════════╗
║                                                                           ║
║      █████╗ ██╗    ██╗███████╗                                            ║
║     ██╔══██╗██║    ██║██╔════╝                                            ║
║     ███████║██║ █╗ ██║███████╗                                            ║
║     ██╔══██║██║███╗██║╚════██║                                            ║
║     ██║  ██║╚███╔███╔╝███████║                                            ║
║     ╚═╝  ╚═╝ ╚══╝╚══╝ ╚══════╝                                            ║
║                                                                           ║
║     ███╗   ███╗██╗███████╗ ██████╗ ██████╗ ███╗   ██╗███████╗██╗ ██████╗  ║
║     ████╗ ████║██║██╔════╝██╔════╝██╔═══██╗████╗  ██║██╔════╝██║██╔════╝  ║
║     ██╔████╔██║██║███████╗██║     ██║   ██║██╔██╗ ██║█████╗  ██║██║  ███╗ ║
║     ██║╚██╔╝██║██║╚════██║██║     ██║   ██║██║╚██╗██║██╔══╝  ██║██║   ██║ ║
║     ██║ ╚═╝ ██║██║███████║╚██████╗╚██████╔╝██║ ╚████║██║     ██║╚██████╔╝ ║
║     ╚═╝     ╚═╝╚═╝╚══════╝ ╚═════╝ ╚═════╝ ╚═╝  ╚═══╝╚═╝     ╚═╝ ╚═════╝  ║
║                                                                           ║
║     ██████╗  █████╗ ████████╗ █████╗ ██████╗  █████╗ ███████╗███████╗     ║
║     ██╔══██╗██╔══██╗╚══██╔══╝██╔══██╗██╔══██╗██╔══██╗██╔════╝██╔════╝     ║
║     ██║  ██║███████║   ██║   ███████║██████╔╝███████║███████╗█████╗       ║
║     ██║  ██║██╔══██║   ██║   ██╔══██║██╔══██╗██╔══██║╚════██║██╔══╝       ║
║     ██████╔╝██║  ██║   ██║   ██║  ██║██████╔╝██║  ██║███████║███████╗     ║
║     ╚═════╝ ╚═╝  ╚═╝   ╚═╝   ╚═╝  ╚═╝╚═════╝ ╚═╝  ╚═╝╚══════╝╚══════╝     ║
║                                                                           ║
║                 🔥 323 Recommendations • 46 Services 🔥                   ║
║                                                                           ║
╚═══════════════════════════════════════════════════════════════════════════╝

Production-ready AWS misconfiguration detection & remediation

💰 Cost • 🛠️ Operations • ⚡ Performance • 🔐 Security • 🔄 Reliability

What Is This?

A structured, queryable database of AWS misconfigurations and best practices. Use it to:

Power LLM-based AWS advisors - Feed recommendations to Claude, GPT, or your own models
Extend cloud management tools - Integrate with Vantage, Cloud Custodian, Steampipe
Build custom scanners - Create detection rules for your infrastructure
Train teams - Reference material for AWS best practices

Quick Start (2 minutes)

1. Clone and Initialize

git clone https://github.com/bluearchio/aws-misconfig-db.git
cd aws-misconfig-db

# Install all dependencies (includes DuckDB, ingest pipeline, and test tools)
pip install -r requirements.txt

# Build the queryable database
python3 scripts/db-init.py

2. Explore Recommendations

# View full summary
python3 scripts/db-query.py summary

# List recommendations for a service
python3 scripts/db-query.py service ec2
python3 scripts/db-query.py service s3
python3 scripts/db-query.py service lambda

# Search across all recommendations
python3 scripts/db-query.py search "encryption"
python3 scripts/db-query.py search "cost"
python3 scripts/db-query.py search "idle"

# Interactive SQL mode
python3 scripts/db-query.py interactive

3. Query with SQL

import duckdb

conn = duckdb.connect('db/recommendations.duckdb')

# Top cost optimization opportunities
conn.execute("""
    SELECT service_name, scenario, recommendation_action
    FROM recommendations
    WHERE risk_detail LIKE '%cost%'
    AND build_priority = 0
    ORDER BY service_name
""").fetchdf()

# Security issues by service
conn.execute("""
    SELECT service_name, COUNT(*) as issues
    FROM recommendations
    WHERE risk_detail LIKE '%security%'
    GROUP BY service_name
    ORDER BY issues DESC
""").fetchdf()

Ingest Pipeline

The ingest pipeline automatically discovers new AWS misconfigurations from RSS feeds, HTML docs, and GitHub repositories. It deduplicates against the existing database using TF-IDF similarity, converts findings into schema-compliant recommendations via Claude, and stages them for human review.
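
To illustrate the deduplication idea, here is a minimal pure-Python sketch of TF-IDF cosine similarity. It is not the code in `scripts/ingest/dedup.py`; the function names, the tokenizer, the smoothed IDF formula, and the rule "drop a candidate when its best match reaches the threshold" are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF so terms present in every document keep a positive weight
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def novel_items(existing, candidates, threshold=0.7):
    """Keep candidates whose best similarity to any existing text is below threshold."""
    vectors = tfidf_vectors(existing + candidates)
    old, new = vectors[:len(existing)], vectors[len(existing):]
    return [
        text for text, vec in zip(candidates, new)
        if max((cosine(vec, o) for o in old), default=0.0) < threshold
    ]


existing = ["Enable default encryption on S3 buckets"]
candidates = [
    "Enable default encryption on S3 buckets",      # exact duplicate, dropped
    "Release unassociated Elastic IP addresses",    # no shared terms, kept
]
print(novel_items(existing, candidates))
```

Under this sketch, raising the threshold lets more near-duplicates through, and lowering it filters more aggressively.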

1. Add New Sources

Sources live in data/ingest/sources.json. Each source needs an id, type, url, and categories:

{
  "id": "my-new-source",
  "name": "My New Source",
  "type": "rss",
  "url": "https://example.com/feed/",
  "categories": ["security", "cost"],
  "enabled": true,
  "fetch_config": { "max_items": 50 }
}

| Type | Use for | Key config |
|------|---------|------------|
| `rss` | RSS/Atom feeds | `max_items` |
| `html` | AWS doc pages | `follow_links`, `link_pattern`, `item_selector` |
| `github` | Repo rule files | `branch`, `rules_path`, `file_pattern`, `max_files` |
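
Before adding an entry to `sources.json`, it can help to check it against the requirements listed above. This is a minimal sketch, assuming only the four required fields and the three source types named in this section; `source_problems` is a hypothetical helper, and the pipeline's own config loader may apply different or stricter checks.

```python
import json

REQUIRED_FIELDS = ("id", "type", "url", "categories")
KNOWN_TYPES = ("rss", "html", "github")


def source_problems(source):
    """Return a list of human-readable problems; an empty list means the entry looks usable."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in source]
    if "type" in source and source["type"] not in KNOWN_TYPES:
        problems.append(f"unknown type: {source['type']}")
    if "categories" in source and not isinstance(source["categories"], list):
        problems.append("categories must be a list")
    return problems


entry = json.loads("""
{
  "id": "my-new-source",
  "name": "My New Source",
  "type": "rss",
  "url": "https://example.com/feed/",
  "categories": ["security", "cost"],
  "enabled": true,
  "fetch_config": { "max_items": 50 }
}
""")
print(source_problems(entry))                            # no problems for the example entry
print(source_problems({"id": "broken", "type": "ftp"}))  # reports missing fields and the unknown type
```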

# See all 51 configured sources
python3 scripts/ingest/cli.py list-sources

# See only enabled sources
python3 scripts/ingest/cli.py list-sources --enabled-only

2. Run the Pipeline

# Dry run — fetch and deduplicate without converting or staging
python3 scripts/ingest/cli.py fetch --dry-run

# Fetch from specific sources only
python3 scripts/ingest/cli.py fetch --sources aws-security-blog aws-database-blog --dry-run

# Fetch only RSS sources
python3 scripts/ingest/cli.py fetch --source-type rss --dry-run

# Full pipeline — fetch, dedup, convert via Claude, validate, stage
# (requires ANTHROPIC_API_KEY in environment)
export ANTHROPIC_API_KEY=sk-ant-...
python3 scripts/ingest/cli.py fetch

# Skip LLM — fetch and dedup only, no conversion
python3 scripts/ingest/cli.py fetch --skip-llm

# Tune dedup sensitivity and limit items per source
python3 scripts/ingest/cli.py fetch --max-items 10 --similarity-threshold 0.80

The CLI shows real-time progress with labeled progress bars and a summary panel:

╭─ AWS Misconfig DB · Ingest Pipeline v1.0.0 ──╮
│ Mode:       dry-run                            │
│ Sources:    7 enabled                          │
│ Threshold:  0.7                                │
╰────────────────────────────────────────────────╯
  Loaded 313 existing recommendations for dedup

    ✓ AWS Security Blog               RSS   20 items → 20 novel
    ✓ AWS Architecture Blog           RSS   20 items → 20 novel
    ✗ AWS Cost Management Blog        RSS   XML parse error
    ✓ AWS Database Blog               RSS   20 items → 20 novel
    ✓ Security Hub Controls           HTM  172 items → 172 novel

╭─ Summary ─────────────────────────────────────╮
│ Sources      5 processed · 2 errors            │
│ Fetched      252 items                         │
│ Time         12.3s                             │
╰────────────────────────────────────────────────╯

3. Review Output and Promote

New recommendations land in data/staging/. Review before promoting to the main database:

# List staged recommendations (table, detail, or json)
python3 scripts/ingest/cli.py show-staged
python3 scripts/ingest/cli.py show-staged --format detail
python3 scripts/ingest/cli.py show-staged --filter-service rds

# Promote a recommendation into data/by-service/<service>.json
python3 scripts/ingest/cli.py promote <uuid>

# Reject a recommendation (removes from staging)
python3 scripts/ingest/cli.py reject <uuid> --reason "Duplicate"

After promoting, rebuild the database and aggregates:

python3 scripts/generate.py                       # Regenerate SUMMARY.md and stats
python3 scripts/db-init.py                        # Rebuild DuckDB
python3 scripts/validate.py data/by-service/      # Verify schema compliance
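
The three post-promotion steps can be chained so that a failure stops the sequence. The following is a minimal sketch, assuming it is run from the repository root; the wrapper `run_post_promote` is hypothetical and not a script shipped with the repository. The `run` argument exists only so the function can be exercised without executing anything.

```python
import subprocess
import sys

# Commands taken from the steps above, in the same order
POST_PROMOTE_STEPS = [
    ["scripts/generate.py"],
    ["scripts/db-init.py"],
    ["scripts/validate.py", "data/by-service/"],
]


def run_post_promote(run=subprocess.run):
    """Run each step with the current interpreter; stop at the first failure.

    Returns the path of the script that failed, or None if every step succeeded.
    """
    for step in POST_PROMOTE_STEPS:
        result = run([sys.executable, *step])
        if result.returncode != 0:
            return step[0]
    return None
```

Calling `run_post_promote()` after a batch of `promote` commands returns `None` when all three scripts exit with status 0.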

4. Monitor Health

# Run health checks (stale sources, staging overflow, state corruption)
python3 scripts/ingest/cli.py health

# View pipeline run history
python3 scripts/ingest/cli.py history

Integration Examples

LLM Integration (Claude/GPT)

Use the database as context for an AWS infrastructure advisor:

import duckdb
import anthropic  # or openai

# Load relevant recommendations
conn = duckdb.connect('db/recommendations.duckdb')
recommendations = conn.execute("""
    SELECT service_name, scenario, recommendation_action,
           recommendation_description_detailed, risk_detail
    FROM recommendations
    WHERE service_name IN ('ec2', 's3', 'iam', 'rds')
    AND build_priority <= 1
""").fetchdf().to_dict('records')

# Build context for LLM
context = "You are an AWS infrastructure advisor. Use these recommendations:\n\n"
for rec in recommendations:
    context += f"**{rec['service_name'].upper()}**: {rec['scenario']}\n"
    context += f"Action: {rec['recommendation_action']}\n"
    context += f"Risk: {rec['risk_detail']}\n\n"

# Query with Claude
client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=context,
    messages=[{"role": "user", "content": "Review my EC2 setup: I have 50 instances, 20 are t2.micro running 24/7, no auto-scaling, and EBS volumes are unencrypted."}]
)
print(response.content[0].text)
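
With several services selected, the concatenated context can grow large. One option is to cap it before sending. This is a minimal sketch, assuming a character budget as a rough stand-in for a token limit; `build_context` and `max_chars` are hypothetical names, and the sample records are invented placeholders for rows returned by the query above.

```python
def build_context(recommendations, max_chars=20000):
    """Concatenate recommendation summaries until the character budget is reached."""
    header = "You are an AWS infrastructure advisor. Use these recommendations:\n\n"
    parts, used = [header], len(header)
    for rec in recommendations:
        block = (
            f"**{rec['service_name'].upper()}**: {rec['scenario']}\n"
            f"Action: {rec['recommendation_action']}\n"
            f"Risk: {rec['risk_detail']}\n\n"
        )
        if used + len(block) > max_chars:
            break  # stop before exceeding the budget; remaining records are dropped
        parts.append(block)
        used += len(block)
    return "".join(parts)


# Hypothetical records standing in for query results
sample = [
    {"service_name": "ec2", "scenario": "Unencrypted EBS volume",
     "recommendation_action": "Enable encryption", "risk_detail": "security"},
    {"service_name": "s3", "scenario": "Bucket without lifecycle policy",
     "recommendation_action": "Add lifecycle rules", "risk_detail": "cost"},
]
context = build_context(sample, max_chars=160)
print(context)
```

Because records are added in query order, sorting the query by `build_priority` first would keep the most critical items when the budget is tight.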

Vantage Integration

Export recommendations as Vantage-compatible cost insights:

import duckdb
import json

conn = duckdb.connect('db/recommendations.duckdb')

# Get cost recommendations in Vantage-friendly format
cost_recs = conn.execute("""
    SELECT
        id,
        service_name as resource_type,
        scenario as title,
        recommendation_action as recommendation,
        recommendation_description_detailed as description,
        CASE build_priority
            WHEN 0 THEN 'critical'
            WHEN 1 THEN 'high'
            WHEN 2 THEN 'medium'
            ELSE 'low'
        END as priority
    FROM recommendations
    WHERE risk_detail LIKE '%cost%'
""").fetchdf()

# Export for Vantage custom reports
vantage_insights = []
for _, rec in cost_recs.iterrows():
    vantage_insights.append({
        "category": "cost_optimization",
        "resource_type": f"aws:{rec['resource_type']}",
        "title": rec['title'],
        "recommendation": rec['recommendation'],
        "priority": rec['priority'],
        "source": "aws-misconfig-db"
    })

with open('vantage-insights.json', 'w') as f:
    json.dump(vantage_insights, f, indent=2)

print(f"Exported {len(vantage_insights)} cost insights for Vantage")

Cloud Custodian Policies

Generate Cloud Custodian policies from recommendations:

import duckdb
import yaml

conn = duckdb.connect('db/recommendations.duckdb')

# Get EC2 security recommendations
ec2_security = conn.execute("""
    SELECT scenario, alert_criteria, recommendation_action
    FROM recommendations
    WHERE service_name = 'ec2'
    AND risk_detail LIKE '%security%'
""").fetchdf()

# Generate Custodian policies
policies = {"policies": []}

# Example: Unencrypted EBS volumes
policies["policies"].append({
    "name": "ec2-unencrypted-volumes",
    "resource": "ebs",
    "description": "Flag unencrypted EBS volumes (from aws-misconfig-db)",
    "filters": [
        {"Encrypted": False}
    ],
    "actions": [
        {"type": "notify",
         "template": "Unencrypted EBS volume detected",
         "transport": {"type": "sns", "topic": "arn:aws:sns:us-east-1:123456789:alerts"}}
    ]
})

# Example: Unused Elastic IPs
policies["policies"].append({
    "name": "ec2-unused-elastic-ips",
    "resource": "network-addr",
    "description": "Find unassociated Elastic IPs (from aws-misconfig-db)",
    "filters": [
        {"AssociationId": "absent"}
    ],
    "actions": [
        {"type": "notify",
         "template": "Unassociated Elastic IP found - wasting $3.65/month",
         "transport": {"type": "sns", "topic": "arn:aws:sns:us-east-1:123456789:alerts"}}
    ]
})

with open('custodian-policies.yml', 'w') as f:
    yaml.dump(policies, f, default_flow_style=False)

print("Generated Cloud Custodian policies")

Steampipe Integration

Query recommendations alongside live AWS data:

-- In Steampipe, create a foreign table from the DuckDB export
-- First, export to CSV:
-- python3 -c "import duckdb; duckdb.connect('db/recommendations.duckdb').execute('COPY recommendations TO \"recommendations.csv\" (HEADER, DELIMITER \",\")').fetchall()"

-- Then in Steampipe:
CREATE FOREIGN TABLE aws_recommendations (
    id text,
    service_name text,
    scenario text,
    recommendation_action text,
    risk_detail text,
    build_priority int
) SERVER steampipe OPTIONS (filename '/path/to/recommendations.csv', format 'csv', header 'true');

-- Join with live EC2 data
SELECT
    i.instance_id,
    i.instance_type,
    r.scenario,
    r.recommendation_action
FROM aws_ec2_instance i
CROSS JOIN aws_recommendations r
WHERE r.service_name = 'ec2'
AND r.scenario LIKE '%idle%'
AND i.cpu_utilization_average < 5;

AWS Config Rules

Generate AWS Config custom rules:

import duckdb
import json

conn = duckdb.connect('db/recommendations.duckdb')

# Get recommendations with detection methods
detectable = conn.execute("""
    SELECT id, service_name, scenario, alert_criteria, detection_methods
    FROM recommendations
    WHERE detection_methods != '[]'
    AND alert_criteria != ''
""").fetchdf()

# Generate Config rule skeletons
config_rules = []
for _, rec in detectable.iterrows():
    methods = json.loads(rec['detection_methods'])
    for index, method in enumerate(methods):
        if method.get('method') == 'CloudWatch Metric':
            config_rules.append({
                # Rule names must be unique, so include part of the
                # recommendation id and the method index in the name
                "ConfigRuleName": f"misconfig-{rec['service_name']}-{str(rec['id'])[:8]}-{index}",
                "Description": rec['scenario'][:256],
                "Source": {
                    "Owner": "CUSTOM_LAMBDA",
                    "SourceIdentifier": "arn:aws:lambda:REGION:ACCOUNT:function:config-rule-checker"
                },
                "InputParameters": json.dumps({
                    "alert_criteria": rec['alert_criteria'],
                    "detection_details": method.get('details', '')
                })
            })

print(f"Generated {len(config_rules)} AWS Config rule templates")

Database Schema

Each recommendation contains:

| Field | Description |
|-------|-------------|
| `id` | Unique UUID |
| `service_name` | AWS service (ec2, s3, lambda, etc.) |
| `scenario` | What the misconfiguration is |
| `alert_criteria` | When to trigger an alert |
| `recommendation_action` | What to do about it |
| `risk_detail` | Risk type(s): cost, security, operations, performance, reliability |
| `build_priority` | 0 (critical) to 3 (low) |
| `recommendation_description_detailed` | Full explanation |
| `category` | Resource category (compute, storage, database, etc.) |
| `references` | AWS documentation links |
| `architectural_patterns` | Related design patterns (Circuit Breaker, Cache-Aside, etc.) |
| `detection_methods` | How to detect (CloudWatch, CLI, API) |
| `remediation_examples` | Code examples (Python, Terraform, AWS CLI) |
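
As a quick illustration of the field list, the sketch below checks that a record carries every field and that `build_priority` falls in the documented 0 to 3 range. It is not a substitute for `schema/misconfig-schema.json` or `scripts/validate.py`; `record_problems` is a hypothetical helper, and the sample record's values and value types are invented for the example.

```python
SCHEMA_FIELDS = (
    "id", "service_name", "scenario", "alert_criteria", "recommendation_action",
    "risk_detail", "build_priority", "recommendation_description_detailed",
    "category", "references", "architectural_patterns", "detection_methods",
    "remediation_examples",
)


def record_problems(record):
    """Return a list of problems found in one recommendation record."""
    problems = [f"missing field: {f}" for f in SCHEMA_FIELDS if f not in record]
    priority = record.get("build_priority")
    if priority is not None and priority not in (0, 1, 2, 3):
        problems.append(f"build_priority out of range: {priority}")
    return problems


# Hypothetical record; every value here is a placeholder
sample_record = {
    "id": "00000000-0000-0000-0000-000000000000",
    "service_name": "ec2",
    "scenario": "EBS volume is not encrypted",
    "alert_criteria": "Encrypted attribute is false",
    "recommendation_action": "Enable EBS encryption",
    "risk_detail": "security",
    "build_priority": 0,
    "recommendation_description_detailed": "Placeholder explanation.",
    "category": "storage",
    "references": [],
    "architectural_patterns": [],
    "detection_methods": [],
    "remediation_examples": [],
}
print(record_problems(sample_record))
print(record_problems({**sample_record, "build_priority": 7}))
```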

Repository Structure

aws-misconfig-db/
├── data/
│   ├── by-service/            # Source of truth (46 JSON files)
│   │   ├── ec2.json           # 49 recommendations
│   │   ├── s3.json            # 24 recommendations
│   │   ├── lambda.json        # 21 recommendations
│   │   └── ...
│   ├── staging/               # Candidate recommendations awaiting review
│   └── ingest/
│       └── sources.json       # Source configuration (51 sources)
├── scripts/
│   ├── ingest/                # Ingest pipeline
│   │   ├── cli.py             # CLI entrypoint
│   │   ├── orchestrator.py    # Pipeline runner
│   │   ├── progress.py        # Rich terminal progress display
│   │   ├── config.py          # Source config loader
│   │   ├── dedup.py           # TF-IDF deduplication
│   │   ├── convert.py         # Claude API conversion
│   │   ├── stage.py           # Staging/promote/reject
│   │   ├── state.py           # State persistence
│   │   ├── health.py          # Health checks
│   │   ├── validate_entry.py  # Schema validation wrapper
│   │   ├── fetchers/          # RSS, HTML, GitHub fetchers
│   │   └── parsers/           # RSS, HTML, GitHub parsers
│   ├── db-init.py             # Build the DuckDB database
│   ├── db-query.py            # Query helper CLI
│   ├── validate.py            # Schema validation
│   └── generate.py            # Generate SUMMARY.md
├── tests/                     # 106 tests
├── db/                        # Generated DuckDB database
└── schema/
    └── misconfig-schema.json

Common Queries

-- All critical (build_priority = 0) cost issues
SELECT service_name, scenario, recommendation_action
FROM recommendations
WHERE risk_detail LIKE '%cost%' AND build_priority = 0;

-- Security issues with remediation code
SELECT service_name, scenario, remediation_examples
FROM recommendations
WHERE risk_detail LIKE '%security%' AND remediation_examples != '[]';

-- Recommendations by architectural pattern
SELECT r.service_name, r.scenario,
       json_extract_string(p.pattern, '$.pattern_name') as pattern
FROM recommendations r,
     LATERAL unnest(json_extract(r.architectural_patterns, '$[*]')) as p(pattern)
WHERE r.architectural_patterns != '[]';

-- Services with most recommendations
SELECT service_name, COUNT(*) as count
FROM recommendations
GROUP BY service_name
ORDER BY count DESC
LIMIT 10;

Contributing

# Run the ingest pipeline to discover new recommendations
python3 scripts/ingest/cli.py fetch --dry-run

# Run tests
python3 -m pytest tests/ -v

# Validate your changes
python3 scripts/validate.py data/by-service/

# Rebuild database and docs
python3 scripts/db-init.py
python3 scripts/generate.py

See CONTRIBUTING.md for guidelines.

License

MIT License - see LICENSE

🔥 323 recommendations • 46 services • Query with SQL • Integrate anywhere 🔥
