# llm-sentry

**One install. 12 diagnostic engines. Your AI pipeline's immune system.**

Stop guessing why your LLM app is broken. `llm-sentry` runs 12 specialized diagnostic engines across your entire AI stack — RAG pipelines, agent loops, chain-of-thought reasoning, prompt stability, model migrations, and output drift — in a single scan.

```bash
pip install llm-sentry
```

## Quick Start

```python
import llm_sentry as lg

# Run a full diagnostic scan
report = lg.scan(
    pipeline_name="my_app",
    checks=["rag", "coherence", "agents", "prompts"],
    # ... plus the check-specific inputs for each selected engine
)
print(report.summary())
```

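Each engine returns a normalized check result that the scan report rolls up into one health score. Here is a self-contained sketch of that shape, simplified from the pre-rename `llmguardrail` API (`CheckResult`, `HealthStatus.from_score`); the thresholds and aggregation below are illustrative, not the library's actual cutoffs:

```python
from dataclasses import dataclass, field
from enum import Enum

class HealthStatus(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    CRITICAL = "critical"

    @classmethod
    def from_score(cls, score: float) -> "HealthStatus":
        # Illustrative thresholds; the real cutoffs are the library's choice.
        if score >= 0.7:
            return cls.HEALTHY
        if score >= 0.4:
            return cls.DEGRADED
        return cls.CRITICAL

@dataclass
class CheckResult:
    check_name: str
    score: float
    status: HealthStatus
    recommendations: list = field(default_factory=list)

# One result per engine that ran, rolled up into a pipeline-level score.
results = [
    CheckResult("rag", 0.90, HealthStatus.from_score(0.90)),
    CheckResult("coherence", 0.88, HealthStatus.from_score(0.88)),
    CheckResult("agents", 1.00, HealthStatus.from_score(1.00)),
]
overall = sum(r.score for r in results) / len(results)
print(f"Health: {HealthStatus.from_score(overall).name} ({overall:.0%})")
# -> Health: HEALTHY (93%)
```
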
---

## The 12 Diagnostic Engines

| # | Engine | What It Detects | Module |
|---|--------|----------------|--------|
| 1 | **RAG Pathology** | Retrieval failures by type and location (Four Soils classification) | `rag_pathology` |
| 2 | **Agent Patrol** | Agent loops, stalls, oscillation, drift, and abandonment | `agent_patrol` |
| 3 | **Chain Probe** | Root-cause step in multi-step pipeline failures (CASCADE analysis) | `chain_probe` |
| 4 | **Context Lens** | Lost-in-the-middle — the LLM failing to recall facts from certain context positions | `context_lens` |
| 5 | **LLM Mutation** | Gaps in prompt test coverage via semantic mutation testing | `llm_mutation` |
| 6 | **Prompt Shield** | Brittle prompts that break under paraphrase stress testing | `prompt_shield` |
| 7 | **LLM Contract** | Behavioral contract violations on LLM function calls | `llm_contract` |
| 8 | **Drift Guard** | PR intent drift — code changes that don't match their stated purpose | `drift_guard` |
| 9 | **Spec Drift** | Semantic specification drift even when structural validation passes | `spec_drift` |
| 10 | **Prompt Lock** | Prompt regression detection with judge calibration and a CI gate | `prompt_lock` |
| 11 | **Model Parity** | Behavioral divergence when swapping LLM providers (7 dimensions) | `model_parity` |
| 12 | **CoT Coherence** | Silent incoherence between chain-of-thought reasoning steps | `cot_coherence` |

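To make engine #1 concrete: the Four Soils idea triages a RAG failure by where it happened (the retriever, the context, or the generator). The labels below (retrieval miss, context noise, poor grounding) come from the project's own vocabulary, but the boolean-based logic is only an illustrative sketch, not the library's implementation:

```python
def classify_soil(retrieved: bool, grounded: bool, noisy: bool) -> str:
    """Toy triage of a single RAG query by failure location.

    The real engine derives these signals from the query, the
    retrieved chunks, and the generated answer, not from booleans.
    """
    if not retrieved:
        return "retrieval_miss"   # the right evidence never came back
    if noisy:
        return "context_noise"    # evidence retrieved, but buried in junk
    if not grounded:
        return "poor_grounding"   # evidence present, answer ignores it
    return "good"                 # healthy soil: retrieved, clean, grounded

print(classify_soil(retrieved=True, grounded=False, noisy=False))  # -> poor_grounding
```
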
---

## Why One Platform?

Most teams discover LLM failures in production, then stitch together 5+ tools with different APIs, install processes, and report formats.

**llm-sentry** gives you:
- **One install** — `pip install llm-sentry`
- **One API** — `lg.scan()` with check selection
- **One report** — unified diagnostics across all failure modes
- **One CI gate** — `llm-sentry ci` blocks merges on regressions

---

## Use Cases

- **RAG apps**: retrieval quality + generation faithfulness + context window coverage
- **Agent systems**: loop detection + drift monitoring + abandonment alerts
- **Prompt engineering**: brittleness testing + regression gating + mutation coverage
- **Model migrations**: behavioral parity certification across 7 dimensions
- **Production monitoring**: continuous semantic drift detection + contract enforcement

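The agent checks above can be pictured with a toy heuristic: flag a run when the same (action, result) pair keeps recurring inside a short window. This is only a sketch of the futile-cycle idea, not Agent Patrol's actual detector:

```python
from collections import Counter

def detect_futile_cycle(trace, window=6, threshold=3):
    """Flag an agent trace when the same (action, result) pair recurs
    `threshold` times within the last `window` steps.

    Illustrative heuristic only; the real detector is more involved.
    """
    counts = Counter(trace[-window:])
    repeats = {pair: n for pair, n in counts.items() if n >= threshold}
    return (len(repeats) > 0), repeats

trace = [
    ("search_docs", "no results"),
    ("rephrase_query", "ok"),
    ("search_docs", "no results"),
    ("search_docs", "no results"),
]
stuck, repeats = detect_futile_cycle(trace, window=4, threshold=3)
print(stuck)  # -> True: ("search_docs", "no results") repeated 3 times
```
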
---

## Requirements

- Python 3.10+
- Zero required dependencies (LLM-powered checks optional)

## License
