Warning
This is an experimental/learning project only, generated entirely by Claude Code.
An LLM-powered compliance evaluation framework that automatically parses structured policy documents (in markdown) and evaluates any text against the extracted criteria. The system uses AI to intelligently extract requirements, sub-criteria, and logical relationships from policies, then builds dynamic evaluation workflows that provide granular pass/fail results with confidence scores and reasoning for each criterion.
Ideal for financial regulation compliance, content moderation, contract analysis, or any domain requiring automated policy enforcement with explainable, auditable results.
- Generic: Works with any policy document in markdown format
- Two-Step Parsing: Normalizes policy then generates workflow for auditability
- Explainable: Node IDs match clause numbers for full traceability
- Model-agnostic: Uses LiteLLM to support 100+ LLM providers
- Configurable: Environment-based configuration with `.env` support
PolicyFlow recently underwent significant architectural improvements to reduce boilerplate and enhance maintainability:
- Reduced Boilerplate: @node_schema decorator eliminates ~85% of node definition boilerplate
- Improved Abstractions: Extracted CacheManager and RateLimiter for better separation of concerns
- Cleaner Configuration: Migrated to pydantic-settings with cross-field validation
- Enhanced Testing: Added 100 new tests (496 total), all passing
- Better Developer Experience: DeterministicNode base class simplifies node creation
Together, these changes removed roughly 372-450 lines of code while improving quality and maintainability. See plans/CODEBASE_IMPROVEMENTS.md for full details.
```bash
uv sync
```

Copy `.env.example` to `.env` and configure:
```bash
# Required: API key for your LLM provider
ANTHROPIC_API_KEY=sk-ant-...

# Optional: Model selection (default: anthropic/claude-sonnet-4-20250514)
POLICY_EVAL_MODEL=anthropic/claude-sonnet-4-20250514

# Optional: Confidence thresholds
POLICY_EVAL_CONFIDENCE_HIGH=0.8  # Above this = high confidence
POLICY_EVAL_CONFIDENCE_LOW=0.5   # Below this = needs review
```

| Variable | Default | Description |
|---|---|---|
| `POLICY_EVAL_MODEL` | `anthropic/claude-sonnet-4-20250514` | LiteLLM model identifier |
| `POLICY_EVAL_TEMPERATURE` | `0.0` | LLM temperature for evaluation |
| `POLICY_EVAL_CONFIDENCE_HIGH` | `0.8` | High confidence threshold |
| `POLICY_EVAL_CONFIDENCE_LOW` | `0.5` | Low confidence threshold (below = needs review) |
| `POLICY_EVAL_MAX_RETRIES` | `3` | Max retry attempts per LLM call |
| `POLICY_EVAL_RETRY_WAIT` | `2` | Seconds between retries |
| `POLICY_EVAL_CACHE_ENABLED` | `true` | Enable LLM response caching |
| `POLICY_EVAL_CACHE_TTL` | `3600` | Cache TTL in seconds (0 = no expiration) |
| `POLICY_EVAL_CACHE_DIR` | `.cache` | Directory for cache files |
| `POLICY_EVAL_THROTTLE_ENABLED` | `false` | Enable rate limiting |
| `POLICY_EVAL_THROTTLE_RPM` | `60` | Max requests per minute |
| `PHOENIX_ENABLED` | `false` | Enable Arize Phoenix tracing |
| `PHOENIX_COLLECTOR_ENDPOINT` | `http://localhost:6007` | Phoenix collector URL |
| `PHOENIX_PROJECT_NAME` | `policyflow` | Project name in Phoenix UI |
| `CLASSIFIER_MODEL` | `POLICY_EVAL_MODEL` | Default model for ClassifierNode |
| `DATA_EXTRACTOR_MODEL` | `POLICY_EVAL_MODEL` | Default model for DataExtractorNode |
| `SENTIMENT_MODEL` | `POLICY_EVAL_MODEL` | Default model for SentimentNode |
| `SAMPLER_MODEL` | `POLICY_EVAL_MODEL` | Default model for SamplerNode |
| `GENERATE_MODEL` | `POLICY_EVAL_MODEL` | Model for generate-dataset command |
| `ANALYZE_MODEL` | `POLICY_EVAL_MODEL` | Model for analyze command |
| `HYPOTHESIZE_MODEL` | `POLICY_EVAL_MODEL` | Model for hypothesize command |
| `OPTIMIZE_MODEL` | `POLICY_EVAL_MODEL` | Model for optimize command |
| `OPENAI_API_BASE` | - | OpenAI-compatible endpoint (for LMStudio) |
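To make the two confidence thresholds concrete, here is a small illustrative helper (not part of the package; the exact handling of scores equal to a threshold is an assumption) showing how they partition a score into bands:

```python
def classify_confidence(score: float, high: float = 0.8, low: float = 0.5) -> str:
    """Map a 0.0-1.0 confidence score onto the three bands implied by the
    two thresholds (boundary handling here is an assumption)."""
    if score >= high:
        return "high"
    if score >= low:
        return "medium"
    return "low"  # below POLICY_EVAL_CONFIDENCE_LOW -> flag for human review

print(classify_confidence(0.92))  # high
print(classify_confidence(0.65))  # medium
print(classify_confidence(0.30))  # low
```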
PolicyFlow supports configuring different models at multiple levels:

Node Type Defaults: Configure different models for different node types

```bash
CLASSIFIER_MODEL=anthropic/claude-sonnet-4-20250514
SENTIMENT_MODEL=anthropic/claude-haiku-3-5-20250318       # Use faster model for sentiment
DATA_EXTRACTOR_MODEL=anthropic/claude-opus-4-5-20251101   # Use powerful model for extraction
```

CLI Task Defaults: Configure different models for benchmark operations

```bash
GENERATE_MODEL=anthropic/claude-opus-4-5-20251101   # Use powerful model for generation
ANALYZE_MODEL=anthropic/claude-sonnet-4-20250514    # Use balanced model for analysis
```

Local Models (LMStudio): Use OpenAI-compatible local models

```bash
OPENAI_API_BASE=http://localhost:1234/v1
CLASSIFIER_MODEL=openai/llama-3-8b
SENTIMENT_MODEL=openai/mistral-7b
```

Model Selection Priority (highest to lowest):

1. Explicit parameter in workflow.yaml or CLI `--model` flag
2. Type-specific env var (e.g., `CLASSIFIER_MODEL`, `GENERATE_MODEL`)
3. Global default (`POLICY_EVAL_MODEL`)
4. Hardcoded fallback (`anthropic/claude-sonnet-4-20250514`)
```bash
uv run policyflow [COMMAND] [OPTIONS]
```

```bash
uv run policyflow parse [OPTIONS]
```

| Option | Short | Description |
|---|---|---|
| `--policy PATH` | `-p` | Path to policy markdown file (required) |
| `--model TEXT` | `-m` | LiteLLM model identifier |
| `--save-workflow PATH` | | Save parsed workflow to YAML file |
| `--save-normalized PATH` | | Save intermediate normalized policy to YAML |
| `--format TEXT` | | Output format: pretty or yaml (default: pretty) |
Examples:

```bash
# Display policy structure
uv run policyflow parse -p policy.md

# Save workflow for later use
uv run policyflow parse -p policy.md --save-workflow workflow.yaml

# Save both normalized and workflow files
uv run policyflow parse -p policy.md --save-normalized norm.yaml --save-workflow workflow.yaml

# Output as YAML
uv run policyflow parse -p policy.md --format yaml
```

```bash
uv run policyflow eval [OPTIONS]
```

| Option | Short | Description |
|---|---|---|
| `--policy PATH` | `-p` | Path to policy markdown file |
| `--workflow PATH` | `-w` | Path to pre-parsed workflow YAML (alternative to `--policy`) |
| `--input TEXT` | `-i` | Text to evaluate |
| `--input-file PATH` | `-f` | File containing text to evaluate |
| `--model TEXT` | `-m` | LiteLLM model identifier (e.g., openai/gpt-4o) |
| `--format TEXT` | | Output format: pretty, yaml, or minimal (default: pretty) |
| `--save-workflow PATH` | | Save parsed workflow to YAML file for reuse |
Examples:

```bash
# Evaluate inline text
uv run policyflow eval -p policy.md -i "text to evaluate"

# Evaluate from file
uv run policyflow eval -p policy.md -f input.txt

# Use a pre-parsed workflow (faster for repeated evaluations)
uv run policyflow eval -w workflow.yaml -i "text to evaluate"

# Use a different model and save the workflow
uv run policyflow eval -p policy.md -i "text" -m openai/gpt-4o --save-workflow workflow.yaml

# Get minimal output (just pass/fail and confidence)
uv run policyflow eval -p policy.md -i "text" --format minimal
```

```bash
uv run policyflow batch [OPTIONS]
```

| Option | Short | Description |
|---|---|---|
| `--policy PATH` | `-p` | Path to policy markdown file |
| `--workflow PATH` | `-w` | Path to pre-parsed workflow YAML |
| `--inputs PATH` | | YAML file with inputs list (required) |
| `--output PATH` | `-o` | Output YAML file (required) |
| `--model TEXT` | `-m` | LiteLLM model identifier |
Input file format (YAML):

```yaml
# List of strings
- "First text to evaluate"
- "Second text to evaluate"

# Or list of objects
- text: "First text to evaluate"
- input: "Second text to evaluate"
```

Examples:

```bash
# Batch evaluate from YAML
uv run policyflow batch -p policy.md --inputs texts.yaml -o results.yaml

# Use pre-parsed workflow for speed
uv run policyflow batch -w workflow.yaml --inputs texts.yaml -o results.yaml
```

Run with `uv run python your_script.py` or in a `uv run python` REPL:
```python
from policyflow import evaluate

result = evaluate(
    input_text="Based on your risk profile, I recommend buying XYZ",
    policy_path="policy.md"
)

# Overall result
print(f"Policy satisfied: {result.policy_satisfied}")
print(f"Confidence: {result.overall_confidence:.0%}")
print(f"Needs review: {result.needs_review}")

# Per-clause breakdown
for cr in result.clause_results:
    status = "MET" if cr.met else "NOT MET"
    print(f"{cr.clause_name}: {status} ({cr.confidence:.0%})")
```

```python
from policyflow import evaluate, WorkflowConfig, ConfidenceGateConfig

config = WorkflowConfig(
    model="openai/gpt-4o",
    temperature=0.0,
    confidence_gate=ConfidenceGateConfig(
        high_threshold=0.9,  # Stricter high confidence
        low_threshold=0.6    # More lenient low threshold
    )
)

result = evaluate(
    input_text="...",
    policy_path="policy.md",
    config=config
)
```

The main package exports the following:
| Function | Description |
|---|---|
| `evaluate()` | Main entry point - evaluate text against a policy |
| `parse_policy()` | Parse policy markdown into a `ParsedWorkflowPolicy` object |
| `normalize_policy()` | Parse policy into normalized structure (step 1) |
| `generate_workflow_from_normalized()` | Generate workflow from normalized policy (step 2) |
| `get_config()` | Get current `WorkflowConfig` from environment |

| Class | Description |
|---|---|
| `DynamicWorkflowBuilder` | Workflow runner for evaluating text against a parsed workflow |
| `WorkflowConfig` | Configuration for evaluation (model, retries, cache, etc.) |
| `ConfidenceGateConfig` | Confidence threshold configuration |

| Model | Description |
|---|---|
| `NormalizedPolicy` | Normalized policy with sections and clauses |
| `ParsedWorkflowPolicy` | Parsed workflow with hierarchy |
| `EvaluationResult` | Complete evaluation result |
| `ClauseResult` | Result for a single clause |
| `Clause` | A single clause from a policy |
| `Section` | A section containing clauses |

| Enum | Values | Description |
|---|---|---|
| `LogicOperator` | `ALL`, `ANY` | How criteria combine (AND/OR) |
| `ConfidenceLevel` | `HIGH`, `MEDIUM`, `LOW` | Confidence classification |
| `ClauseType` | `REQUIREMENT`, `DEFINITION`, `CONDITION`, `EXCEPTION`, `REFERENCE` | Clause type |

| Utility | Description |
|---|---|
| `YAMLMixin` | Mixin providing `to_yaml()`, `from_yaml()`, `save_yaml()`, `load_yaml()` |
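As a rough sketch of the pattern behind `YAMLMixin` (illustrative only; the real implementation is built on Pydantic models and may differ), using PyYAML and a toy dataclass:

```python
import yaml  # PyYAML
from dataclasses import dataclass, asdict

class YAMLMixinSketch:
    """Illustrative YAML round-trip mixin; not policyflow's actual code."""
    def to_yaml(self) -> str:
        return yaml.safe_dump(asdict(self), sort_keys=False)

    def save_yaml(self, path: str) -> None:
        with open(path, "w") as f:
            f.write(self.to_yaml())

    @classmethod
    def from_yaml(cls, text: str):
        return cls(**yaml.safe_load(text))

    @classmethod
    def load_yaml(cls, path: str):
        with open(path) as f:
            return cls.from_yaml(f.read())

@dataclass
class ToyClause(YAMLMixinSketch):  # toy model, not the package's Clause
    clause_id: str
    text: str

roundtrip = ToyClause.from_yaml(ToyClause("clause_1_1", "Must disclose fees").to_yaml())
print(roundtrip)
```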
The evaluator uses a two-step parsing process:

```
Policy.md → Normalize → NormalizedPolicy (YAML)
                              │
                              ▼
              Generate Workflow from Normalized
                              │
                              ▼
               ParsedWorkflowPolicy (YAML)
               - nodes with clause_X_X IDs
               - hierarchy mapping
                              │
                              ▼
               ┌──────────────────────┐
               │   DynamicWorkflow    │
               │                      │
               │   Executes nodes     │
               │   based on routes    │
               └──────────────────────┘
                              │
                              ▼
                    EvaluationResult
```
```
EvaluationResult:
  policy_satisfied: bool       # Overall pass/fail
  overall_confidence: float    # 0.0-1.0
  confidence_level: str        # "high", "medium", "low"
  needs_review: bool           # Human review recommended?
  low_confidence_clauses: []   # IDs needing attention
  clause_results: [
    ClauseResult:
      clause_id: str
      clause_name: str
      met: bool
      reasoning: str
      confidence: float
      sub_results: [...]       # Nested clause results
  ]
```

For more control over the evaluation process:
```python
from policyflow import parse_policy, DynamicWorkflowBuilder, WorkflowConfig

# Parse policy (uses two-step process internally)
policy_text = open("policy.md").read()
parsed_workflow = parse_policy(policy_text)

# Create workflow builder and run evaluations
config = WorkflowConfig()
builder = DynamicWorkflowBuilder(parsed_workflow, config)

texts = ["First text to evaluate", "Second text to evaluate"]
results = [builder.run(text) for text in texts]
```

```python
from policyflow.parser import normalize_policy, generate_workflow_from_normalized
from policyflow.models import NormalizedPolicy

# Step 1: Normalize
normalized = normalize_policy(open("policy.md").read())
normalized.save_yaml("normalized.yaml")

# Review/edit normalized.yaml if needed...

# Step 2: Generate workflow
normalized = NormalizedPolicy.load_yaml("normalized.yaml")
workflow = generate_workflow_from_normalized(normalized)
workflow.save_yaml("workflow.yaml")
```

All data models support YAML serialization via `YAMLMixin`:
```python
from policyflow import parse_policy, evaluate
from policyflow.models import ParsedWorkflowPolicy, EvaluationResult

# Save parsed workflow for reuse
workflow = parse_policy(open("policy.md").read())
workflow.save_yaml("workflow.yaml")

# Save evaluation results
result = evaluate(input_text="...", policy_path="policy.md")
result.save_yaml("evaluation_result.yaml")

# Load from YAML
workflow = ParsedWorkflowPolicy.load_yaml("workflow.yaml")
result = EvaluationResult.load_yaml("evaluation_result.yaml")
```

The workflow system includes node types for building evaluation pipelines:
| Node | Description |
|---|---|
| `LLMNode` | Base node for LLM-powered evaluation |
| `ConfidenceGateNode` | Routes based on confidence thresholds |
| `TransformNode` | Transforms input text (lowercase, truncate, etc.) |
| `LengthGateNode` | Routes based on text length |
| `KeywordScorerNode` | Scores text based on keyword presence |
| `PatternMatchNode` | Matches text against regex patterns |
| `DataExtractorNode` | Extracts structured data from text |
| `SamplerNode` | Runs multiple evaluations for consensus |
| `ClassifierNode` | Classifies text into categories |
| `SentimentNode` | Analyzes text sentiment |
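To give a feel for the deterministic nodes, here is an illustrative sketch of keyword scoring in the spirit of `KeywordScorerNode` (the node's real options and behavior may differ):

```python
import re

def keyword_score(text: str, keywords: list[str]) -> float:
    """Fraction of keywords present as whole words, case-insensitive."""
    hits = sum(
        bool(re.search(rf"\b{re.escape(k)}\b", text, re.IGNORECASE))
        for k in keywords
    )
    return hits / len(keywords) if keywords else 0.0

score = keyword_score("Risk disclosure provided to the client",
                      ["risk", "disclosure", "fees"])
print(f"{score:.2f}")  # 0.67 (2 of 3 keywords found)
```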
Access nodes via:

```python
from policyflow.nodes import (
    PatternMatchNode,
    ClassifierNode,
    # ... etc
)
```

Enable LLM tracing with Arize Phoenix:
```bash
# Install tracing dependencies
uv pip install -e ".[tracing]"

# Start Phoenix
docker-compose up -d phoenix

# Enable tracing and run
PHOENIX_ENABLED=true uv run policyflow eval -p policy.md -i "text"

# View traces at http://localhost:6007
```

See plans/ARIZE_PHOENIX.md for full documentation.
PolicyFlow includes a comprehensive benchmarking system for measuring and improving workflow accuracy.
```bash
# Generate test dataset from normalized policy
uv run policyflow generate-dataset --policy normalized.yaml --output golden_dataset.yaml

# Run benchmark against the dataset
uv run policyflow benchmark --workflow workflow.yaml --dataset golden_dataset.yaml --output report.yaml

# Analyze failures and get improvement recommendations
uv run policyflow analyze --report report.yaml --workflow workflow.yaml --output analysis.yaml

# Generate hypotheses for improvement
uv run policyflow hypothesize --analysis analysis.yaml --workflow workflow.yaml --output hypotheses.yaml

# Or run the full improvement loop at once
uv run policyflow improve --workflow workflow.yaml --dataset golden_dataset.yaml

# Quick test with limited data (1 test case, 1 iteration)
uv run policyflow improve --workflow workflow.yaml --dataset golden_dataset.yaml --limit 1 --max-iterations 1
```

```bash
# Run optimization with budget constraints
uv run policyflow optimize --workflow workflow.yaml --dataset golden_dataset.yaml \
  --max-iterations 10 \
  --target 0.95 \
  --output optimized_workflow.yaml

# Quick test with subset of test cases
uv run policyflow optimize --workflow workflow.yaml --dataset golden_dataset.yaml \
  --limit 5 --max-iterations 1
```

```python
from policyflow.benchmark import (
    load_golden_dataset,
    SimpleBenchmarkRunner,
    BenchmarkConfig,
    create_analyzer,
    create_hypothesis_generator,
    HillClimbingOptimizer,
    OptimizationBudget,
)

# Load dataset and workflow
dataset = load_golden_dataset("golden_dataset.yaml")
workflow = load_workflow("workflow.yaml")

# Run benchmark
runner = SimpleBenchmarkRunner(BenchmarkConfig())
report = runner.run(workflow, dataset.test_cases)
print(f"Accuracy: {report.metrics.overall_accuracy:.2%}")

# Analyze failures (with optional LLM enhancement)
analyzer = create_analyzer(mode="hybrid", model="anthropic/claude-sonnet-4-20250514")
analysis = analyzer.analyze(report, workflow)

# Generate improvement hypotheses
generator = create_hypothesis_generator(mode="hybrid", model="anthropic/claude-sonnet-4-20250514")
hypotheses = generator.generate(analysis, workflow)
for h in hypotheses:
    print(f"- [{h.change_type}] {h.description}")
```

- Golden Dataset Generation: Template-based and LLM-enhanced test case generation
- Comprehensive Metrics: Per-criterion accuracy, precision, recall, F1, and confidence calibration
- Failure Analysis: Rule-based and LLM-enhanced pattern detection
- Hypothesis Generation: Actionable improvement suggestions with template and LLM modes
- Automated Optimization: Hill-climbing optimizer with configurable budget constraints
- Experiment Tracking: YAML-based tracking with history and comparison
See plans/BENCHMARK_SYSTEM.md for full documentation.
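The hill-climbing strategy behind the optimizer can be sketched generically; this toy (with a scalar threshold standing in for a workflow, and made-up score/mutate functions) is an illustration of the technique, not the actual `HillClimbingOptimizer`:

```python
import random

def hill_climb(initial, score, mutate, max_iterations=10, target=0.95):
    """Keep a candidate only when it improves the score; stop once the
    target accuracy is reached or the iteration budget runs out."""
    best, best_score = initial, score(initial)
    for _ in range(max_iterations):
        if best_score >= target:
            break
        trial = mutate(best)
        trial_score = score(trial)
        if trial_score > best_score:  # accept only improvements
            best, best_score = trial, trial_score
    return best, best_score

# Toy example: tune a single threshold toward an optimum at 0.7
random.seed(0)
score = lambda t: 1 - abs(t - 0.7)
mutate = lambda t: min(1.0, max(0.0, t + random.uniform(-0.1, 0.1)))
best, best_score = hill_climb(0.5, score, mutate, max_iterations=50)
print(f"best={best:.3f} score={best_score:.3f}")
```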
Install dev dependencies:

```bash
uv sync --extra dev
```

Run tests:
```bash
# Run all tests
uv run pytest

# Run with verbose output
uv run pytest -v

# Run a specific test file
uv run pytest tests/test_workflow_builder.py

# Run tests matching a pattern
uv run pytest -k "confidence"
```

The test suite covers:
- Node types: Confidence gating, pattern matching, classification, etc.
- Workflow builder: Validation, max iterations, routing
Tests use mocked LLM responses to run quickly without API calls.
- PocketFlow - LLM workflow framework
- LiteLLM - Model-agnostic LLM calls
- Jinja2 - Prompt template management
- Pydantic - Data validation
- Typer + Rich - CLI
- python-dotenv - Environment configuration
- Arize Phoenix - LLM observability (optional)