Policyflow

Warning

This is an experimental/learning project only, generated entirely by Claude Code.

An LLM-powered compliance evaluation framework that automatically parses structured policy documents (in markdown) and evaluates any text against the extracted criteria. The system uses AI to intelligently extract requirements, sub-criteria, and logical relationships from policies, then builds dynamic evaluation workflows that provide granular pass/fail results with confidence scores and reasoning for each criterion.

Ideal for financial regulation compliance, content moderation, contract analysis, or any domain requiring automated policy enforcement with explainable, auditable results.

Features

  • Generic: Works with any policy document in markdown format
  • Two-Step Parsing: Normalizes the policy first, then generates the workflow, so each step can be audited
  • Explainable: Node IDs match clause numbers for full traceability
  • Model-agnostic: Uses LiteLLM to support 100+ LLM providers
  • Configurable: Environment-based configuration with .env support

Recent Improvements

Policyflow recently underwent significant architectural improvements to reduce boilerplate and enhance maintainability:

  • Reduced Boilerplate: @node_schema decorator eliminates ~85% of node definition boilerplate
  • Improved Abstractions: Extracted CacheManager and RateLimiter for better separation of concerns
  • Cleaner Configuration: Migrated to pydantic-settings with cross-field validation
  • Enhanced Testing: Added 100 new tests (496 total), all passing
  • Better Developer Experience: DeterministicNode base class simplifies node creation

These changes removed roughly 372-450 lines of code while improving code quality and maintainability. See plans/CODEBASE_IMPROVEMENTS.md for full details.

Installation

uv sync

Configuration

Copy .env.example to .env and configure:

# Required: API key for your LLM provider
ANTHROPIC_API_KEY=sk-ant-...

# Optional: Model selection (default: anthropic/claude-sonnet-4-20250514)
POLICY_EVAL_MODEL=anthropic/claude-sonnet-4-20250514

# Optional: Confidence thresholds
POLICY_EVAL_CONFIDENCE_HIGH=0.8   # Above this = high confidence
POLICY_EVAL_CONFIDENCE_LOW=0.5    # Below this = needs review

All Environment Variables

Variable Default Description
POLICY_EVAL_MODEL anthropic/claude-sonnet-4-20250514 LiteLLM model identifier
POLICY_EVAL_TEMPERATURE 0.0 LLM temperature for evaluation
POLICY_EVAL_CONFIDENCE_HIGH 0.8 High confidence threshold
POLICY_EVAL_CONFIDENCE_LOW 0.5 Low confidence threshold (below = needs review)
POLICY_EVAL_MAX_RETRIES 3 Max retry attempts per LLM call
POLICY_EVAL_RETRY_WAIT 2 Seconds between retries
POLICY_EVAL_CACHE_ENABLED true Enable LLM response caching
POLICY_EVAL_CACHE_TTL 3600 Cache TTL in seconds (0 = no expiration)
POLICY_EVAL_CACHE_DIR .cache Directory for cache files
POLICY_EVAL_THROTTLE_ENABLED false Enable rate limiting
POLICY_EVAL_THROTTLE_RPM 60 Max requests per minute
PHOENIX_ENABLED false Enable Arize Phoenix tracing
PHOENIX_COLLECTOR_ENDPOINT http://localhost:6007 Phoenix collector URL
PHOENIX_PROJECT_NAME policyflow Project name in Phoenix UI
CLASSIFIER_MODEL POLICY_EVAL_MODEL Default model for ClassifierNode
DATA_EXTRACTOR_MODEL POLICY_EVAL_MODEL Default model for DataExtractorNode
SENTIMENT_MODEL POLICY_EVAL_MODEL Default model for SentimentNode
SAMPLER_MODEL POLICY_EVAL_MODEL Default model for SamplerNode
GENERATE_MODEL POLICY_EVAL_MODEL Model for generate-dataset command
ANALYZE_MODEL POLICY_EVAL_MODEL Model for analyze command
HYPOTHESIZE_MODEL POLICY_EVAL_MODEL Model for hypothesize command
OPTIMIZE_MODEL POLICY_EVAL_MODEL Model for optimize command
OPENAI_API_BASE - OpenAI-compatible endpoint (for LMStudio)
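
These settings are also available programmatically via get_config() (listed in the API reference below). A short sketch: the model and temperature fields match the WorkflowConfig constructor shown later; any other field names would need checking against the code:

from policyflow import get_config

config = get_config()  # WorkflowConfig populated from the environment / .env
print(config.model)        # e.g. anthropic/claude-sonnet-4-20250514
print(config.temperature)  # 0.0 by default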

Multi-Level Model Configuration

Policyflow supports configuring different models at multiple levels:

Node Type Defaults: Configure different models for different node types

CLASSIFIER_MODEL=anthropic/claude-sonnet-4-20250514
SENTIMENT_MODEL=anthropic/claude-haiku-3-5-20250318  # Use faster model for sentiment
DATA_EXTRACTOR_MODEL=anthropic/claude-opus-4-5-20251101  # Use powerful model for extraction

CLI Task Defaults: Configure different models for benchmark operations

GENERATE_MODEL=anthropic/claude-opus-4-5-20251101  # Use powerful model for generation
ANALYZE_MODEL=anthropic/claude-sonnet-4-20250514   # Use balanced model for analysis

Local Models (LMStudio): Use OpenAI-compatible local models

OPENAI_API_BASE=http://localhost:1234/v1
CLASSIFIER_MODEL=openai/llama-3-8b
SENTIMENT_MODEL=openai/mistral-7b

Model Selection Priority (highest to lowest):

  1. Explicit parameter in workflow.yaml or CLI --model flag
  2. Type-specific env var (e.g., CLASSIFIER_MODEL, GENERATE_MODEL)
  3. Global default (POLICY_EVAL_MODEL)
  4. Hardcoded fallback (anthropic/claude-sonnet-4-20250514)
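
A minimal sketch of that resolution order (an illustrative helper, not the library's internal code):

import os

FALLBACK = "anthropic/claude-sonnet-4-20250514"

def resolve_model(explicit: str | None, type_var: str) -> str:
    # 1. explicit workflow.yaml / --model value
    # 2. type-specific env var (e.g. CLASSIFIER_MODEL)
    # 3. global POLICY_EVAL_MODEL
    # 4. hardcoded fallback
    return explicit or os.getenv(type_var) or os.getenv("POLICY_EVAL_MODEL") or FALLBACK

resolve_model(None, "CLASSIFIER_MODEL")  # falls through to POLICY_EVAL_MODEL or the fallback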

Usage

CLI

uv run policyflow [COMMAND] [OPTIONS]

Commands

parse - Parse policy into executable workflow
uv run policyflow parse [OPTIONS]
Option Short Description
--policy PATH -p Path to policy markdown file (required)
--model TEXT -m LiteLLM model identifier
--save-workflow PATH Save parsed workflow to YAML file
--save-normalized PATH Save intermediate normalized policy to YAML
--format TEXT Output format: pretty or yaml (default: pretty)

Examples:

# Display policy structure
uv run policyflow parse -p policy.md

# Save workflow for later use
uv run policyflow parse -p policy.md --save-workflow workflow.yaml

# Save both normalized and workflow files
uv run policyflow parse -p policy.md --save-normalized norm.yaml --save-workflow workflow.yaml

# Output as YAML
uv run policyflow parse -p policy.md --format yaml

eval - Evaluate text against a policy

uv run policyflow eval [OPTIONS]
Option Short Description
--policy PATH -p Path to policy markdown file
--workflow PATH -w Path to pre-parsed workflow YAML (alternative to --policy)
--input TEXT -i Text to evaluate
--input-file PATH -f File containing text to evaluate
--model TEXT -m LiteLLM model identifier (e.g., openai/gpt-4o)
--format TEXT Output format: pretty, yaml, or minimal (default: pretty)
--save-workflow PATH Save parsed workflow to YAML file for reuse

Examples:

# Evaluate inline text
uv run policyflow eval -p policy.md -i "text to evaluate"

# Evaluate from file
uv run policyflow eval -p policy.md -f input.txt

# Use a pre-parsed workflow (faster for repeated evaluations)
uv run policyflow eval -w workflow.yaml -i "text to evaluate"

# Use a different model and save the workflow
uv run policyflow eval -p policy.md -i "text" -m openai/gpt-4o --save-workflow workflow.yaml

# Get minimal output (just pass/fail and confidence)
uv run policyflow eval -p policy.md -i "text" --format minimal

batch - Batch evaluate multiple inputs

uv run policyflow batch [OPTIONS]
Option Short Description
--policy PATH -p Path to policy markdown file
--workflow PATH -w Path to pre-parsed workflow YAML
--inputs PATH YAML file with inputs list (required)
--output PATH -o Output YAML file (required)
--model TEXT -m LiteLLM model identifier

Input file format (YAML):

# List of strings
- "First text to evaluate"
- "Second text to evaluate"

# Or list of objects
- text: "First text to evaluate"
- input: "Second text to evaluate"
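
The list-of-strings form can be generated with PyYAML (assuming pyyaml is available in your environment):

import yaml

texts = ["First text to evaluate", "Second text to evaluate"]
with open("texts.yaml", "w") as f:
    yaml.safe_dump(texts, f, allow_unicode=True)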

Examples:

# Batch evaluate from YAML
uv run policyflow batch -p policy.md --inputs texts.yaml -o results.yaml

# Use pre-parsed workflow for speed
uv run policyflow batch -w workflow.yaml --inputs texts.yaml -o results.yaml

Python API

Run scripts with uv run python your_script.py, or start a REPL with uv run python:

from policyflow import evaluate

result = evaluate(
    input_text="Based on your risk profile, I recommend buying XYZ",
    policy_path="policy.md"
)

# Overall result
print(f"Policy satisfied: {result.policy_satisfied}")
print(f"Confidence: {result.overall_confidence:.0%}")
print(f"Needs review: {result.needs_review}")

# Per-clause breakdown
for cr in result.clause_results:
    status = "MET" if cr.met else "NOT MET"
    print(f"{cr.clause_name}: {status} ({cr.confidence:.0%})")

Custom Configuration

from policyflow import evaluate, WorkflowConfig, ConfidenceGateConfig

config = WorkflowConfig(
    model="openai/gpt-4o",
    temperature=0.0,
    confidence_gate=ConfidenceGateConfig(
        high_threshold=0.9,  # Stricter high confidence
        low_threshold=0.6    # More lenient low threshold
    )
)

result = evaluate(
    input_text="...",
    policy_path="policy.md",
    config=config
)

API Reference

The main package exports the following:

Functions

Function Description
evaluate() Main entry point - evaluate text against a policy
parse_policy() Parse policy markdown into a ParsedWorkflowPolicy object
normalize_policy() Parse policy into normalized structure (step 1)
generate_workflow_from_normalized() Generate workflow from normalized policy (step 2)
get_config() Get current WorkflowConfig from environment

Classes

Class Description
DynamicWorkflowBuilder Workflow runner for evaluating text against a parsed workflow
WorkflowConfig Configuration for evaluation (model, retries, cache, etc.)
ConfidenceGateConfig Confidence threshold configuration

Data Models

Model Description
NormalizedPolicy Normalized policy with sections and clauses
ParsedWorkflowPolicy Parsed workflow with hierarchy
EvaluationResult Complete evaluation result
ClauseResult Result for a single clause
Clause A single clause from a policy
Section A section containing clauses

Enums

Enum Values Description
LogicOperator ALL, ANY How criteria combine (AND/OR)
ConfidenceLevel HIGH, MEDIUM, LOW Confidence classification
ClauseType REQUIREMENT, DEFINITION, CONDITION, EXCEPTION, REFERENCE Clause type
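
All three are plain Python enums exported from the main package (per "The main package exports the following" above), so the standard enum idioms apply:

from policyflow import ClauseType, ConfidenceLevel, LogicOperator

# member names follow the Values column above
print(LogicOperator.ALL)             # criteria combine with AND
print([c.name for c in ClauseType])  # REQUIREMENT, DEFINITION, CONDITION, EXCEPTION, REFERENCE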

Utilities

Utility Description
YAMLMixin Mixin providing to_yaml(), from_yaml(), save_yaml(), load_yaml()

Architecture

The evaluator uses a two-step parsing process:

Policy.md → Normalize → NormalizedPolicy (YAML)
                              │
                              ▼
              Generate Workflow from Normalized
                              │
                              ▼
              ParsedWorkflowPolicy (YAML)
               - nodes with clause_X_X IDs
               - hierarchy mapping
                              │
                              ▼
              ┌──────────────────────┐
              │  DynamicWorkflow     │
              │                      │
              │  Executes nodes      │
              │  based on routes     │
              └──────────────────────┘
                              │
                              ▼
                EvaluationResult

Output Structure

EvaluationResult:
  policy_satisfied: bool      # Overall pass/fail
  overall_confidence: float   # 0.0-1.0
  confidence_level: str       # "high", "medium", "low"
  needs_review: bool          # Human review recommended?
  low_confidence_clauses: []  # IDs needing attention
  clause_results: [
    ClauseResult:
      clause_id: str
      clause_name: str
      met: bool
      reasoning: str
      confidence: float
      sub_results: [...]      # Nested clause results
  ]
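
Because sub_results nests further ClauseResult entries, a small recursive helper prints the full tree (a sketch against the fields shown above; result comes from evaluate() as in the Python API example):

def print_clause_tree(cr, indent=0):
    # clause_id, met, confidence and sub_results are the fields listed above
    status = "MET" if cr.met else "NOT MET"
    print(f"{'  ' * indent}{cr.clause_id}: {status} ({cr.confidence:.0%})")
    for sub in cr.sub_results:
        print_clause_tree(sub, indent + 1)

for cr in result.clause_results:
    print_clause_tree(cr)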

Advanced Usage

Direct Workflow Control

For more control over the evaluation process:

from policyflow import parse_policy, DynamicWorkflowBuilder, WorkflowConfig

# Parse policy (uses two-step process internally)
policy_text = open("policy.md").read()
parsed_workflow = parse_policy(policy_text)

# Create workflow builder and run evaluations
config = WorkflowConfig()
builder = DynamicWorkflowBuilder(parsed_workflow, config)

texts = ["First text to evaluate", "Second text to evaluate"]
results = [builder.run(text) for text in texts]

Working with Normalized Policies

from policyflow.parser import normalize_policy, generate_workflow_from_normalized
from policyflow.models import NormalizedPolicy

# Step 1: Normalize
normalized = normalize_policy(open("policy.md").read())
normalized.save_yaml("normalized.yaml")

# Review/edit normalized.yaml if needed...

# Step 2: Generate workflow
normalized = NormalizedPolicy.load_yaml("normalized.yaml")
workflow = generate_workflow_from_normalized(normalized)
workflow.save_yaml("workflow.yaml")

YAML Serialization

All data models support YAML serialization via YAMLMixin:

from policyflow import parse_policy, evaluate

# Save parsed workflow for reuse
workflow = parse_policy(open("policy.md").read())
workflow.save_yaml("workflow.yaml")

# Save evaluation results
result = evaluate(input_text="...", policy_path="policy.md")
result.save_yaml("evaluation_result.yaml")

# Load from YAML
from policyflow.models import ParsedWorkflowPolicy, EvaluationResult
workflow = ParsedWorkflowPolicy.load_yaml("workflow.yaml")
result = EvaluationResult.load_yaml("evaluation_result.yaml")

Available Node Types

The workflow system includes node types for building evaluation pipelines:

Node Description
LLMNode Base node for LLM-powered evaluation
ConfidenceGateNode Routes based on confidence thresholds
TransformNode Transforms input text (lowercase, truncate, etc.)
LengthGateNode Routes based on text length
KeywordScorerNode Scores text based on keyword presence
PatternMatchNode Matches text against regex patterns
DataExtractorNode Extracts structured data from text
SamplerNode Runs multiple evaluations for consensus
ClassifierNode Classifies text into categories
SentimentNode Analyzes text sentiment

Access nodes via:

from policyflow.nodes import (
    PatternMatchNode,
    ClassifierNode,
    # ... etc
)

Observability (Optional)

Enable LLM tracing with Arize Phoenix:

# Install tracing dependencies
uv pip install -e ".[tracing]"

# Start Phoenix
docker-compose up -d phoenix

# Enable tracing and run
PHOENIX_ENABLED=true uv run policyflow eval -p policy.md -i "text"

# View traces at http://localhost:6007

See plans/ARIZE_PHOENIX.md for full documentation.

Benchmarking & Self-Improvement

Policyflow includes a comprehensive benchmarking system for measuring and improving workflow accuracy.

Quick Start

# Generate test dataset from normalized policy
uv run policyflow generate-dataset --policy normalized.yaml --output golden_dataset.yaml

# Run benchmark against the dataset
uv run policyflow benchmark --workflow workflow.yaml --dataset golden_dataset.yaml --output report.yaml

# Analyze failures and get improvement recommendations
uv run policyflow analyze --report report.yaml --workflow workflow.yaml --output analysis.yaml

# Generate hypotheses for improvement
uv run policyflow hypothesize --analysis analysis.yaml --workflow workflow.yaml --output hypotheses.yaml

# Or run the full improvement loop at once
uv run policyflow improve --workflow workflow.yaml --dataset golden_dataset.yaml

# Quick test with limited data (1 test case, 1 iteration)
uv run policyflow improve --workflow workflow.yaml --dataset golden_dataset.yaml --limit 1 --max-iterations 1

Automated Optimization

# Run optimization with budget constraints
uv run policyflow optimize --workflow workflow.yaml --dataset golden_dataset.yaml \
    --max-iterations 10 \
    --target 0.95 \
    --output optimized_workflow.yaml

# Quick test with subset of test cases
uv run policyflow optimize --workflow workflow.yaml --dataset golden_dataset.yaml \
    --limit 5 --max-iterations 1

Python API

from policyflow.benchmark import (
    load_golden_dataset,
    SimpleBenchmarkRunner,
    BenchmarkConfig,
    create_analyzer,
    create_hypothesis_generator,
    HillClimbingOptimizer,
    OptimizationBudget,
)
from policyflow.models import ParsedWorkflowPolicy

# Load dataset and workflow (the workflow loads via the documented YAMLMixin helper)
dataset = load_golden_dataset("golden_dataset.yaml")
workflow = ParsedWorkflowPolicy.load_yaml("workflow.yaml")

# Run benchmark
runner = SimpleBenchmarkRunner(BenchmarkConfig())
report = runner.run(workflow, dataset.test_cases)
print(f"Accuracy: {report.metrics.overall_accuracy:.2%}")

# Analyze failures (with optional LLM enhancement)
analyzer = create_analyzer(mode="hybrid", model="anthropic/claude-sonnet-4-20250514")
analysis = analyzer.analyze(report, workflow)

# Generate improvement hypotheses
generator = create_hypothesis_generator(mode="hybrid", model="anthropic/claude-sonnet-4-20250514")
hypotheses = generator.generate(analysis, workflow)

for h in hypotheses:
    print(f"- [{h.change_type}] {h.description}")

Features

  • Golden Dataset Generation: Template-based and LLM-enhanced test case generation
  • Comprehensive Metrics: Per-criterion accuracy, precision, recall, F1, and confidence calibration
  • Failure Analysis: Rule-based and LLM-enhanced pattern detection
  • Hypothesis Generation: Actionable improvement suggestions with template and LLM modes
  • Automated Optimization: Hill-climbing optimizer with configurable budget constraints
  • Experiment Tracking: YAML-based tracking with history and comparison

See plans/BENCHMARK_SYSTEM.md for full documentation.

Testing

Install dev dependencies:

uv sync --extra dev

Run tests:

# Run all tests
uv run pytest

# Run with verbose output
uv run pytest -v

# Run a specific test file
uv run pytest tests/test_workflow_builder.py

# Run tests matching a pattern
uv run pytest -k "confidence"

The test suite covers:

  • Node types: Confidence gating, pattern matching, classification, etc.
  • Workflow builder: Validation, max iterations, routing

Tests use mocked LLM responses to run quickly without API calls.
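
The same pattern works in your own tests. A minimal sketch, assuming LLM calls go through litellm.completion (Policyflow is built on LiteLLM) and that an OpenAI-style response object is enough for the code under test; adapt the canned content to the schema your node expects:

from unittest.mock import MagicMock

import pytest

@pytest.fixture
def fake_llm(monkeypatch):
    # canned OpenAI-style response: choices[0].message.content holds the text
    response = MagicMock()
    response.choices[0].message.content = '{"met": true, "confidence": 0.9, "reasoning": "stub"}'
    monkeypatch.setattr("litellm.completion", lambda *args, **kwargs: response)
    return response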
