Commit bcdf8d3

Rowusuduah and claude committed
v0.2.0: Add CI/CD quality gate, launch drafts, GitHub Actions
- CI gate: `llm-sentry-gate --config guardrail.json` blocks deploys if AI quality drops
- GitHub Actions CI workflow for Python 3.10-3.12
- Example config for CI integration
- HN, Reddit, Twitter launch post drafts
- 41 tests passing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent f9b6fcf commit bcdf8d3

File tree

8 files changed: +400 -3 lines


.github/workflows/ci.yml

Lines changed: 27 additions & 0 deletions
```yaml
name: CI

on:
  push:
    branches: [master, main]
  pull_request:
    branches: [master, main]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.10", "3.11", "3.12"]

    steps:
      - uses: actions/checkout@v4
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install dependencies
        run: pip install -e ".[dev]"
      - name: Run tests
        run: pytest tests/ -v --tb=short
      - name: Lint
        run: ruff check llmguardrail/
```
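Since this commit also adds the `llm-sentry-gate` console script, the same workflow could gain a deploy-blocking step. A sketch under stated assumptions: the `quality-gate` job name and the `needs: test` wiring are illustrative, not part of this commit.

```yaml
  quality-gate:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -e .
      # Exits non-zero when the scan score falls below the threshold,
      # which fails the job and blocks the deploy.
      - run: llm-sentry-gate --config examples/guardrail.json --threshold 0.7
```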

LAUNCH.md

Lines changed: 179 additions & 0 deletions
# Launch Plan for llm-sentry

## Hacker News Post (Show HN)

**Title:** Show HN: LLM Sentry – 12 diagnostic engines for AI pipelines, zero dependencies, no API keys

**Text:**

I built llm-sentry because I was tired of debugging LLM apps by staring at RAGAS scores.

RAGAS tells you your RAG pipeline scores 0.6. Now what? Is it retrieval? Generation? Context assembly? Is your agent stuck in a loop? Did your last prompt change break something? RAGAS won't tell you. Neither will DeepEval, TruLens, or Promptfoo.

llm-sentry runs 12 specialized diagnostic engines across your entire AI stack in a single scan:

- RAG Pathology: Classifies failures into 4 types (retrieval miss, poor grounding, noisy context, healthy) — tells you exactly WHERE your RAG pipeline breaks
- Chain Probe: CASCADE fault analysis for multi-step pipelines — finds the root cause, not just the symptom
- Agent Patrol: Detects 5 agent pathologies (futile cycles, oscillation, stall, drift, abandonment)
- CoT Coherence: Catches reasoning gaps, contradictions, and unsupported conclusions
- Prompt Brittleness: Stress-tests prompts under paraphrase — finds fragile prompts before production does
- Plus 7 more: injection detection, mutation testing, model swap parity, output drift, contracts, context recall

Key differentiators:

1. Zero dependencies. No OpenAI key required. No LLM calls to evaluate LLMs.
2. Works completely offline.
3. One install, one API: `pip install llm-sentry` gives you everything.
4. Diagnosis, not just scores. Every check tells you what's wrong AND what to fix.

```python
import llmguardrail as lg

report = lg.scan(
    pipeline_name="my_app",
    checks=["rag", "coherence", "agents"],
    rag_queries=[("What is the return policy?", [("Returns within 30 days", 0.95)], "Returns within 30 days")],
)
print(report.summary())
# Pipeline: my_app
# Health: HEALTHY (92%)
# Checks: 3 run
```

GitHub: https://github.com/Rowusuduah/llm-sentry
PyPI: https://pypi.org/project/llm-sentry/
License: MIT

I'd love feedback on what checks you wish existed, or what's missing from your current AI debugging workflow.

---

## Reddit r/MachineLearning Post

**Title:** [P] I built a zero-dependency diagnostic toolkit for LLM pipelines — 12 engines, no API keys, works offline

**Body:**

Every LLM eval tool I've tried (RAGAS, DeepEval, TruLens) gives me a score. A score doesn't help when production is broken at 2 AM.

I built llm-sentry — a unified platform with 12 diagnostic engines that tell you WHAT is wrong and WHERE in your pipeline:

**What it catches:**
- RAG failures: Is it retrieval, generation, or noisy context? (Four Soils classification)
- Agent loops: Futile cycles, oscillation, stall, drift, abandonment
- Reasoning breaks: CoT gaps, contradictions, unsupported conclusions
- Prompt fragility: Which prompts break under paraphrase?
- Pipeline faults: Root cause analysis across multi-step chains
- Output drift: Schema violations, behavioral changes after model swaps

**Why it's different:**
- No API keys. No OpenAI calls. Zero external dependencies.
- Works offline. Runs in CI/CD.
- Diagnoses, not just scores. Every check gives you a fix.

```bash
pip install llm-sentry
```

```python
import llmguardrail as lg

report = lg.scan(pipeline_name="prod", checks=["rag", "coherence", "agents"])
print(report.summary())
```

GitHub: https://github.com/Rowusuduah/llm-sentry

Looking for feedback — what diagnostic would you add? What's the hardest part of debugging your LLM apps?

---

## Reddit r/Python Post

**Title:** [Project] llm-sentry: Unified diagnostic platform for LLM pipelines — 12 engines, pure Python, zero deps

**Body:**

I built llm-sentry, a pure-Python toolkit for diagnosing failures in LLM-powered applications.

The problem: You have an LLM app in production. Something breaks. Existing tools (RAGAS, DeepEval) give you a score but don't tell you what's wrong.

llm-sentry gives you 12 diagnostic engines under one API:

| Engine | What it catches |
|--------|-----------------|
| RAG Pathology | Retrieval miss vs. grounding failure vs. context noise |
| Chain Probe | Root cause in multi-step pipelines |
| Agent Patrol | 5 agent pathologies (loops, stall, drift, etc.) |
| CoT Coherence | Reasoning gaps and contradictions |
| Prompt Brittleness | Prompts that break under paraphrase |
| + 7 more | Injection, mutation, model parity, drift, contracts, context |

Design decisions:
- Zero dependencies (pure Python, stdlib only)
- No LLM calls needed to evaluate LLMs
- Every engine has a SQLite store built in for history/trends
- Unified `scan()` API runs any combination of checks
- Extensible: register your own custom checks

```bash
pip install llm-sentry
```

Built with hatchling, tested with pytest (37+ tests), MIT licensed.

GitHub: https://github.com/Rowusuduah/llm-sentry
PyPI: https://pypi.org/project/llm-sentry/

---

## Twitter/X Thread

**Tweet 1:**
I just shipped llm-sentry — 12 diagnostic engines for LLM pipelines, zero dependencies, no API keys.

RAGAS gives you a score. llm-sentry tells you what's broken and how to fix it.

pip install llm-sentry

🧵 What's inside:

**Tweet 2:**
1/ RAG Pathology — "Four Soils" classification

Your RAG scores 0.6. But WHY?

- PATH: Retrieval missed entirely → fix embeddings
- ROCKY: Good retrieval, bad generation → fix grounding prompt
- THORNY: Noisy context → add reranking
- GOOD: Working as intended

**Tweet 3:**
2/ Agent Patrol — detects 5 agent pathologies

Your agent is "thinking" for 5 minutes. Is it:
- Futile cycling (same actions over and over)?
- Oscillating between two states?
- Stalled on a subtask?
- Drifted from the original goal?
- Abandoned the task entirely?

Now you know.

**Tweet 4:**
3/ Chain Probe — CASCADE fault analysis

Multi-step pipeline fails at step 5. But the REAL problem was step 2.

Chain Probe traces the cascade: ROOT_CAUSE → INHERITED → INHERITED → INHERITED → symptom.

Fix the root, fix everything.

**Tweet 5:**
The key insight: you don't need GPT-4 to evaluate GPT-4.

Every engine in llm-sentry works with zero LLM calls. Pure algorithmic diagnosis. Works offline. Runs in CI/CD.

GitHub: https://github.com/Rowusuduah/llm-sentry
PyPI: https://pypi.org/project/llm-sentry/

MIT licensed. Feedback welcome.
examples/guardrail.json

Lines changed: 16 additions & 0 deletions
```json
{
  "pipeline_name": "my_rag_app",
  "checks": ["rag", "agents"],
  "threshold": 0.7,
  "rag_queries": [
    ["What is the return policy?", [["Returns accepted within 30 days of purchase", 0.95]], "Our return policy allows returns within 30 days."],
    ["How do I contact support?", [["Email support@example.com or call 1-800-HELP", 0.9]], "You can email support@example.com or call 1-800-HELP."],
    ["What are shipping costs?", [["Random unrelated content about cooking", 0.1]], "I'm not sure about shipping costs."]
  ],
  "agent_task": "answer customer questions",
  "agent_actions": [
    ["search_knowledge_base", "found 5 relevant articles"],
    ["generate_response", "response generated successfully"],
    ["verify_response", "response verified against sources"]
  ]
}
```
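JSON has no tuple type, so each `rag_queries` entry is a nested list that the gate rebuilds into the `(query, [(chunk, score), ...], answer)` tuples `scan()` expects. A minimal sketch of that conversion, using a trimmed copy of the config above (the variable names are mine):

```python
import json

# A trimmed copy of the rag_queries section from examples/guardrail.json.
raw = json.loads("""
[
  ["What is the return policy?",
   [["Returns accepted within 30 days of purchase", 0.95]],
   "Our return policy allows returns within 30 days."]
]
""")

# Rebuild (query, [(chunk, score), ...], answer) tuples from the nested lists.
rag_queries = [
    (query, [tuple(chunk) for chunk in chunks], answer)
    for query, chunks, answer in raw
]

print(rag_queries[0][0])  # the query string
print(rag_queries[0][1])  # list of (chunk, score) tuples
```

The same pattern applies to `agent_actions`, where each two-element list becomes an `(action, result)` pair.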

llmguardrail/__init__.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
```diff
@@ -31,7 +31,7 @@
 from __future__ import annotations

-__version__ = "0.1.0"
+__version__ = "0.2.0"
 __all__ = [
     # Top-level API
     "scan",
```

llmguardrail/ci_gate.py

Lines changed: 108 additions & 0 deletions
```python
"""CI/CD quality gate for LLM pipelines.

Run llm-sentry checks as part of your CI/CD pipeline.
Fails the build if AI quality drops below threshold.

Usage:
    python -m llmguardrail.ci_gate --config guardrail.json --threshold 0.7

Config file (guardrail.json):
    {
        "pipeline_name": "my_app",
        "checks": ["rag", "coherence"],
        "threshold": 0.7,
        "rag_queries": [
            ["What is X?", [["X is Y", 0.9]], "X is Y"]
        ]
    }
"""

from __future__ import annotations

import argparse
import json
import sys
from pathlib import Path

from llmguardrail import HealthStatus, ScanStore, scan


def run_gate(config_path: str, threshold: float | None = None, db_path: str | None = None) -> int:
    """Run CI gate check. Returns 0 for pass, 1 for fail."""
    config = json.loads(Path(config_path).read_text())

    pipeline_name = config.get("pipeline_name", "ci_check")
    checks = config.get("checks", [])
    # Compare against None so an explicit --threshold 0.0 is not silently
    # replaced by the config value.
    gate_threshold = threshold if threshold is not None else config.get("threshold", 0.7)

    # Build kwargs from config
    kwargs: dict = {}
    if "rag_queries" in config:
        kwargs["rag_queries"] = [
            (q, [(c, s) for c, s in chunks], answer)
            for q, chunks, answer in config["rag_queries"]
        ]
    if "coherence_traces" in config:
        kwargs["coherence_traces"] = [
            (steps, conclusion)
            for steps, conclusion in config["coherence_traces"]
        ]
    if "agent_task" in config:
        kwargs["agent_task"] = config["agent_task"]
    if "agent_actions" in config:
        kwargs["agent_actions"] = [
            (action, result) for action, result in config["agent_actions"]
        ]

    report = scan(pipeline_name=pipeline_name, checks=checks, **kwargs)

    # Print summary
    print("=" * 60)
    print("LLM SENTRY CI GATE")
    print("=" * 60)
    print(report.summary())
    print("=" * 60)
    print(f"Threshold: {gate_threshold:.0%}")
    print(f"Score: {report.overall_score:.0%}")

    # Save to store if db_path provided
    if db_path:
        store = ScanStore(db_path)
        scan_id = store.save(report)
        trend = store.trend(pipeline_name, last_n=5)
        if len(trend) > 1:
            delta = trend[-1] - trend[-2]
            direction = "+" if delta >= 0 else ""
            print(f"Trend: {direction}{delta:.0%} vs last run")
        print(f"Scan ID: {scan_id}")
        store.close()

    passed = report.overall_score >= gate_threshold
    if passed:
        print("\nRESULT: PASS")
    else:
        print(f"\nRESULT: FAIL (score {report.overall_score:.0%} < threshold {gate_threshold:.0%})")
        if report.recommendations:
            print("\nRecommendations:")
            for r in report.recommendations[:5]:
                print(f"  - {r}")

    print("=" * 60)
    return 0 if passed else 1


def main():
    parser = argparse.ArgumentParser(
        prog="llm-sentry-gate",
        description="LLM Sentry CI/CD Quality Gate",
    )
    parser.add_argument("--config", required=True, help="Path to gate config JSON")
    parser.add_argument("--threshold", type=float, default=None, help="Override score threshold (0.0-1.0)")
    parser.add_argument("--db", default=None, help="SQLite db path for tracking history")

    args = parser.parse_args()
    sys.exit(run_gate(args.config, args.threshold, args.db))


if __name__ == "__main__":
    main()
```

pyproject.toml

Lines changed: 2 additions & 1 deletion
```diff
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"

 [project]
 name = "llm-sentry"
-version = "0.1.0"
+version = "0.2.0"
 description = "Unified AI reliability platform. One install, 12 diagnostic engines. Continuous monitoring, fault diagnosis, and compliance for LLM pipelines."
 readme = "README.md"
 license = { text = "MIT" }
@@ -47,6 +47,7 @@ dev = ["pytest>=7.0", "pytest-cov", "ruff"]

 [project.scripts]
 llm-sentry = "llmguardrail:_cli_main"
+llm-sentry-gate = "llmguardrail.ci_gate:main"

 [project.urls]
 Homepage = "https://github.com/Rowusuduah/llm-sentry"
```
