Commit bcdf8d3

Rowusuduah and claude committed
v0.2.0: Add CI/CD quality gate, launch drafts, GitHub Actions
- CI gate: `llm-sentry-gate --config guardrail.json` blocks deploys if AI quality drops
- GitHub Actions CI workflow for Python 3.10-3.12
- Example config for CI integration
- HN, Reddit, Twitter launch post drafts
- 41 tests passing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent f9b6fcf commit bcdf8d3

File tree

8 files changed: +400 -3 lines


.github/workflows/ci.yml

Lines changed: 27 additions & 0 deletions
```yaml
name: CI

on:
  push:
    branches: [master, main]
  pull_request:
    branches: [master, main]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.10", "3.11", "3.12"]

    steps:
      - uses: actions/checkout@v4
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install dependencies
        run: pip install -e ".[dev]"
      - name: Run tests
        run: pytest tests/ -v --tb=short
      - name: Lint
        run: ruff check llmguardrail/
```
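Since this commit also adds the `llm-sentry-gate` console script, the same workflow could gain a deploy-blocking step. A sketch under stated assumptions: the `quality-gate` job name and the `needs: test` wiring are illustrative, not part of this commit.

```yaml
  quality-gate:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -e .
      # Exits non-zero when the scan score falls below the threshold,
      # which fails the job and blocks the deploy.
      - run: llm-sentry-gate --config examples/guardrail.json --threshold 0.7
```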

LAUNCH.md

Lines changed: 179 additions & 0 deletions
# Launch Plan for llm-sentry

## Hacker News Post (Show HN)

**Title:** Show HN: LLM Sentry – 12 diagnostic engines for AI pipelines, zero dependencies, no API keys

**Text:**

I built llm-sentry because I was tired of debugging LLM apps by staring at RAGAS scores.

RAGAS tells you your RAG pipeline scores 0.6. Now what? Is it retrieval? Generation? Context assembly? Is your agent stuck in a loop? Did your last prompt change break something? RAGAS won't tell you. Neither will DeepEval, TruLens, or Promptfoo.

llm-sentry runs 12 specialized diagnostic engines across your entire AI stack in a single scan:

- RAG Pathology: Classifies failures into 4 types (retrieval miss, poor grounding, noisy context, healthy) — tells you exactly WHERE your RAG pipeline breaks
- Chain Probe: CASCADE fault analysis for multi-step pipelines — finds the root cause, not just the symptom
- Agent Patrol: Detects 5 agent pathologies (futile cycles, oscillation, stall, drift, abandonment)
- CoT Coherence: Catches reasoning gaps, contradictions, and unsupported conclusions
- Prompt Brittleness: Stress-tests prompts under paraphrase — finds fragile prompts before production does
- Plus 7 more: injection detection, mutation testing, model swap parity, output drift, contracts, context recall

Key differentiators:

1. Zero dependencies. No OpenAI key required. No LLM calls to evaluate LLMs.
2. Works completely offline.
3. One install, one API: `pip install llm-sentry` gives you everything.
4. Diagnosis, not just scores. Every check tells you what's wrong AND what to fix.

```python
import llmguardrail as lg

report = lg.scan(
    pipeline_name="my_app",
    checks=["rag", "coherence", "agents"],
    rag_queries=[("What is the return policy?", [("Returns within 30 days", 0.95)], "Returns within 30 days")],
)
print(report.summary())
# Pipeline: my_app
# Health: HEALTHY (92%)
# Checks: 3 run
```

GitHub: https://github.com/Rowusuduah/llm-sentry
PyPI: https://pypi.org/project/llm-sentry/
License: MIT

I'd love feedback on what checks you wish existed, or what's missing from your current AI debugging workflow.

---

## Reddit r/MachineLearning Post

**Title:** [P] I built a zero-dependency diagnostic toolkit for LLM pipelines — 12 engines, no API keys, works offline

**Body:**

Every LLM eval tool I've tried (RAGAS, DeepEval, TruLens) gives me a score. A score doesn't help when production is broken at 2 AM.

I built llm-sentry — a unified platform with 12 diagnostic engines that tell you WHAT is wrong and WHERE in your pipeline:

**What it catches:**
- RAG failures: Is it retrieval, generation, or noisy context? (Four Soils classification)
- Agent loops: Futile cycles, oscillation, stall, drift, abandonment
- Reasoning breaks: CoT gaps, contradictions, unsupported conclusions
- Prompt fragility: Which prompts break under paraphrase?
- Pipeline faults: Root cause analysis across multi-step chains
- Output drift: Schema violations, behavioral changes after model swaps

**Why it's different:**
- No API keys. No OpenAI calls. Zero external dependencies.
- Works offline. Runs in CI/CD.
- Diagnoses, not just scores. Every check gives you a fix.

```bash
pip install llm-sentry
```

```python
import llmguardrail as lg

report = lg.scan(pipeline_name="prod", checks=["rag", "coherence", "agents"])
print(report.summary())
```

GitHub: https://github.com/Rowusuduah/llm-sentry

Looking for feedback — what diagnostic would you add? What's the hardest part of debugging your LLM apps?

---

## Reddit r/Python Post

**Title:** [Project] llm-sentry: Unified diagnostic platform for LLM pipelines — 12 engines, pure Python, zero deps

**Body:**

I built llm-sentry, a pure-Python toolkit for diagnosing failures in LLM-powered applications.

The problem: You have an LLM app in production. Something breaks. Existing tools (RAGAS, DeepEval) give you a score but don't tell you what's wrong.

llm-sentry gives you 12 diagnostic engines under one API:

| Engine | What it catches |
|--------|-----------------|
| RAG Pathology | Retrieval miss vs. grounding failure vs. context noise |
| Chain Probe | Root cause in multi-step pipelines |
| Agent Patrol | 5 agent pathologies (loops, stall, drift, etc.) |
| CoT Coherence | Reasoning gaps and contradictions |
| Prompt Brittleness | Prompts that break under paraphrase |
| + 7 more | Injection, mutation, model parity, drift, contracts, context |

Design decisions:
- Zero dependencies (pure Python, stdlib only)
- No LLM calls needed to evaluate LLMs
- Every engine has a SQLite store built in for history/trends
- Unified `scan()` API runs any combination of checks
- Extensible: register your own custom checks

```bash
pip install llm-sentry
```

Built with hatchling, tested with pytest (37+ tests), MIT licensed.

GitHub: https://github.com/Rowusuduah/llm-sentry
PyPI: https://pypi.org/project/llm-sentry/

---

## Twitter/X Thread

**Tweet 1:**
I just shipped llm-sentry — 12 diagnostic engines for LLM pipelines, zero dependencies, no API keys.

RAGAS gives you a score. llm-sentry tells you what's broken and how to fix it.

pip install llm-sentry

🧵 What's inside:

**Tweet 2:**
1/ RAG Pathology — "Four Soils" classification

Your RAG scores 0.6. But WHY?

- PATH: Retrieval missed entirely → fix embeddings
- ROCKY: Good retrieval, bad generation → fix grounding prompt
- THORNY: Noisy context → add reranking
- GOOD: Working as intended

**Tweet 3:**
2/ Agent Patrol — detects 5 agent pathologies

Your agent is "thinking" for 5 minutes. Is it:
- Futile cycling (same actions over and over)?
- Oscillating between two states?
- Stalled on a subtask?
- Drifted from the original goal?
- Abandoned the task entirely?

Now you know.

**Tweet 4:**
3/ Chain Probe — CASCADE fault analysis

Multi-step pipeline fails at step 5. But the REAL problem was step 2.

Chain Probe traces the cascade: ROOT_CAUSE → INHERITED → INHERITED → INHERITED → symptom.

Fix the root, fix everything.

**Tweet 5:**
The key insight: you don't need GPT-4 to evaluate GPT-4.

Every engine in llm-sentry works with zero LLM calls. Pure algorithmic diagnosis. Works offline. Runs in CI/CD.

GitHub: https://github.com/Rowusuduah/llm-sentry
PyPI: https://pypi.org/project/llm-sentry/

MIT licensed. Feedback welcome.
examples/guardrail.json

Lines changed: 16 additions & 0 deletions
```json
{
  "pipeline_name": "my_rag_app",
  "checks": ["rag", "agents"],
  "threshold": 0.7,
  "rag_queries": [
    ["What is the return policy?", [["Returns accepted within 30 days of purchase", 0.95]], "Our return policy allows returns within 30 days."],
    ["How do I contact support?", [["Email support@example.com or call 1-800-HELP", 0.9]], "You can email support@example.com or call 1-800-HELP."],
    ["What are shipping costs?", [["Random unrelated content about cooking", 0.1]], "I'm not sure about shipping costs."]
  ],
  "agent_task": "answer customer questions",
  "agent_actions": [
    ["search_knowledge_base", "found 5 relevant articles"],
    ["generate_response", "response generated successfully"],
    ["verify_response", "response verified against sources"]
  ]
}
```
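JSON has no tuple type, so each `rag_queries` entry is a nested list that the gate rebuilds into the `(query, [(chunk, score), ...], answer)` tuples `scan()` expects. A minimal sketch of that conversion, using a trimmed copy of the config above (the variable names are mine):

```python
import json

# A trimmed copy of the rag_queries section from examples/guardrail.json.
raw = json.loads("""
[
  ["What is the return policy?",
   [["Returns accepted within 30 days of purchase", 0.95]],
   "Our return policy allows returns within 30 days."]
]
""")

# Rebuild (query, [(chunk, score), ...], answer) tuples from the nested lists.
rag_queries = [
    (query, [tuple(chunk) for chunk in chunks], answer)
    for query, chunks, answer in raw
]

print(rag_queries[0][0])  # the query string
print(rag_queries[0][1])  # list of (chunk, score) tuples
```

The same pattern applies to `agent_actions`, where each two-element list becomes an `(action, result)` pair.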

llmguardrail/__init__.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
```diff
@@ -31,7 +31,7 @@
 from __future__ import annotations

-__version__ = "0.1.0"
+__version__ = "0.2.0"
 __all__ = [
     # Top-level API
     "scan",
```

llmguardrail/ci_gate.py

Lines changed: 108 additions & 0 deletions
```python
"""CI/CD quality gate for LLM pipelines.

Run llm-sentry checks as part of your CI/CD pipeline.
Fails the build if AI quality drops below threshold.

Usage:
    python -m llmguardrail.ci_gate --config guardrail.json --threshold 0.7

Config file (guardrail.json):
    {
        "pipeline_name": "my_app",
        "checks": ["rag", "coherence"],
        "threshold": 0.7,
        "rag_queries": [
            ["What is X?", [["X is Y", 0.9]], "X is Y"]
        ]
    }
"""

from __future__ import annotations

import argparse
import json
import sys
from pathlib import Path

from llmguardrail import HealthStatus, ScanStore, scan


def run_gate(config_path: str, threshold: float | None = None, db_path: str | None = None) -> int:
    """Run CI gate check. Returns 0 for pass, 1 for fail."""
    config = json.loads(Path(config_path).read_text())

    pipeline_name = config.get("pipeline_name", "ci_check")
    checks = config.get("checks", [])
    # Compare against None so an explicit --threshold 0.0 is not silently
    # replaced by the config value.
    gate_threshold = threshold if threshold is not None else config.get("threshold", 0.7)

    # Build kwargs from config
    kwargs: dict = {}
    if "rag_queries" in config:
        kwargs["rag_queries"] = [
            (q, [(c, s) for c, s in chunks], answer)
            for q, chunks, answer in config["rag_queries"]
        ]
    if "coherence_traces" in config:
        kwargs["coherence_traces"] = [
            (steps, conclusion)
            for steps, conclusion in config["coherence_traces"]
        ]
    if "agent_task" in config:
        kwargs["agent_task"] = config["agent_task"]
    if "agent_actions" in config:
        kwargs["agent_actions"] = [
            (action, result) for action, result in config["agent_actions"]
        ]

    report = scan(pipeline_name=pipeline_name, checks=checks, **kwargs)

    # Print summary
    print("=" * 60)
    print("LLM SENTRY CI GATE")
    print("=" * 60)
    print(report.summary())
    print("=" * 60)
    print(f"Threshold: {gate_threshold:.0%}")
    print(f"Score: {report.overall_score:.0%}")

    # Save to store if db_path provided
    if db_path:
        store = ScanStore(db_path)
        scan_id = store.save(report)
        trend = store.trend(pipeline_name, last_n=5)
        if len(trend) > 1:
            delta = trend[-1] - trend[-2]
            direction = "+" if delta >= 0 else ""
            print(f"Trend: {direction}{delta:.0%} vs last run")
        print(f"Scan ID: {scan_id}")
        store.close()

    passed = report.overall_score >= gate_threshold
    if passed:
        print("\nRESULT: PASS")
    else:
        print(f"\nRESULT: FAIL (score {report.overall_score:.0%} < threshold {gate_threshold:.0%})")
        if report.recommendations:
            print("\nRecommendations:")
            for r in report.recommendations[:5]:
                print(f"  - {r}")

    print("=" * 60)
    return 0 if passed else 1


def main():
    parser = argparse.ArgumentParser(
        prog="llm-sentry-gate",
        description="LLM Sentry CI/CD Quality Gate",
    )
    parser.add_argument("--config", required=True, help="Path to gate config JSON")
    parser.add_argument("--threshold", type=float, default=None, help="Override score threshold (0.0-1.0)")
    parser.add_argument("--db", default=None, help="SQLite db path for tracking history")

    args = parser.parse_args()
    sys.exit(run_gate(args.config, args.threshold, args.db))


if __name__ == "__main__":
    main()
```

pyproject.toml

Lines changed: 2 additions & 1 deletion
```diff
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"

 [project]
 name = "llm-sentry"
-version = "0.1.0"
+version = "0.2.0"
 description = "Unified AI reliability platform. One install, 12 diagnostic engines. Continuous monitoring, fault diagnosis, and compliance for LLM pipelines."
 readme = "README.md"
 license = { text = "MIT" }
@@ -47,6 +47,7 @@ dev = ["pytest>=7.0", "pytest-cov", "ruff"]

 [project.scripts]
 llm-sentry = "llmguardrail:_cli_main"
+llm-sentry-gate = "llmguardrail.ci_gate:main"

 [project.urls]
 Homepage = "https://github.com/Rowusuduah/llm-sentry"
```
