docs: add configuration and governance guides

haasonsaas · haasonsaas · commit 34314f94e8ff · 2025-10-16T19:48:18.000-07:00
diff --git a/README.md b/README.md
@@ -222,6 +222,9 @@ The same process works against forks or sandboxes—helpful when validating new
 ## Additional Documentation
 
 - [CI Integration Guide](docs/ci-integration.md) – Configure GitHub Actions, upload SARIF, archive decision bundles, and adapt the workflow to other CI systems.
+- [Governance & Risk Model](docs/governance-and-risk-model.md) – Understand decision flow, thresholds, and tuning guidance.
+- [Configuration Reference](docs/configuration.md) – Environment variables grouped by subsystem with defaults and usage tips.
+- [Detector Authoring Guide](docs/detector-authoring.md) – Build custom detectors, register modules, and manage rule packs.
 - [SARIF Reporting](docs/sarif-reporting.md) – Understand the SARIF 2.1.0 output, severity mapping, and customization hooks.
 - [DSSE Decision Bundles](docs/dsse-decision-bundles.md) – Inspect the envelope schema, verify signatures, and integrate with transparency logs.
 
diff --git a/docs/README.md b/docs/README.md
@@ -3,5 +3,8 @@
 This directory contains task-focused guides that go deeper than the root `README.md`.
 
 - [CI Integration Guide](ci-integration.md) — Automate Provenance evaluations in GitHub Actions and other CI pipelines, upload SARIF findings, and archive DSSE bundles.
+- [Governance & Risk Model](governance-and-risk-model.md) — Learn how policy decisions are made and how to tune thresholds.
+- [Configuration Reference](configuration.md) — Environment variables grouped by subsystem with defaults and usage tips.
+- [Detector Authoring Guide](detector-authoring.md) — Extend Provenance with custom detectors and rule packs.
 - [SARIF Reporting](sarif-reporting.md) — Understand the SARIF 2.1.0 output and tailor it for downstream scanners.
 - [DSSE Decision Bundles](dsse-decision-bundles.md) — Inspect the DSSE envelope, verify signatures, and extend transparency workflows.
diff --git a/docs/configuration.md b/docs/configuration.md
@@ -0,0 +1,86 @@
+# Configuration Reference
+
+The service is configured via environment variables named in uppercase snake case and loaded through `pydantic` settings. This reference groups related settings, documents defaults, and explains how they influence behaviour.
+
+## Core Service
+
+| Variable | Default | Description |
+| --- | --- | --- |
+| `PROVENANCE_SERVICE_HOST` | `0.0.0.0` | Bind address for the API server. |
+| `PROVENANCE_SERVICE_PORT` | `8000` | HTTP port for FastAPI. |
+| `PROVENANCE_SERVICE_BASE_URL` | `http://localhost:8000` | External URL used when generating links in API responses. |
+| `PROVENANCE_API_V1_PREFIX` | `/v1` | Prefix applied to API routes. |
+| `PROVENANCE_API_TOKEN` | unset | Shared secret for simple token auth on ingestion endpoints. Use a stronger mechanism (e.g., OAuth) in production. |
+
+## Data Stores
+
+| Variable | Default | Description |
+| --- | --- | --- |
+| `PROVENANCE_REDIS_URL` | `redis://localhost:6379/0` | Primary datastore for analyses, findings, and decisions. |
+| `PROVENANCE_REDIS_PASSWORD` | unset | Password for secured Redis deployments. |
+| `PROVENANCE_TIMESERIES_BACKEND` | `file` | Destination for analytics events: `file`, `clickhouse`, `snowflake`, `bigquery`, or `off`. |
+| `PROVENANCE_TIMESERIES_PATH` | `data/timeseries_events.jsonl` | File path used when backend is `file`. |
+| `PROVENANCE_CLICKHOUSE_URL` | unset | HTTP endpoint for ClickHouse when selected as backend. |
+| `PROVENANCE_SNOWFLAKE_ACCOUNT` | unset | Snowflake account identifier (when backend is `snowflake`). |
+| `PROVENANCE_BIGQUERY_DATASET` | unset | Dataset name for BigQuery backend. |
+
+## Governance & Risk
+
+See [Governance & Risk Model](governance-and-risk-model.md) for detailed context.
+
+| Variable | Default | Description |
+| --- | --- | --- |
+| `PROVENANCE_BLOCK_ON_UNKNOWN` | `false` | Block analyses with unattributed lines. |
+| `PROVENANCE_RISK_HIGH_SEVERITY_THRESHOLD` | `1` | Number of `high` severity findings that triggers a `warn`. |
+| `PROVENANCE_POLICY_WARN_THRESHOLDS` | `{}` | JSON mapping of category → warn threshold. |
+| `PROVENANCE_POLICY_BLOCK_THRESHOLDS` | `{}` | JSON mapping of category → block threshold. |
+| `PROVENANCE_DEFAULT_POLICY_VERSION` | `2024-06-01` | Version string embedded in decisions. |
+| `PROVENANCE_DECISION_SIGNING_KEY` | unset | Base64 Ed25519 private key for DSSE signing. |
+| `PROVENANCE_DECISION_KEY_ID` | `decision-key` | Label for the signing key. |
+
+## Detectors & Provenance
+
+| Variable | Default | Description |
+| --- | --- | --- |
+| `PROVENANCE_DETECTOR_MODULE_PATHS` | unset | Comma-separated list of Python modules that register additional detectors. |
+| `PROVENANCE_SEMGREP_RULES_PATH` | `app/detection_rules/semgrep_rules.yml` | Default Semgrep ruleset used by the built-in detector. |
+| `PROVENANCE_AGENT_PUBLIC_KEYS` | `{}` | Mapping of agent IDs to Ed25519 public keys (JSON). Enables cryptographic attribution of changed lines. |
+| `PROVENANCE_PROVENANCE_MARKERS` | `{}` | Optional hints for matching agent markers in commit messages. |
+
+## GitHub Integration
+
+| Variable | Default | Description |
+| --- | --- | --- |
+| `PROVENANCE_GITHUB_TOKEN` | unset | Personal access token or GitHub App installation token for enrichment. |
+| `PROVENANCE_GITHUB_APP_ID` | unset | GitHub App identifier (when using app-based auth). |
+| `PROVENANCE_GITHUB_APP_PRIVATE_KEY` | unset | Base64 encoded private key for the GitHub App. |
+| `PROVENANCE_GITHUB_WEBHOOK_SECRET` | unset | Shared secret for webhook verification if you extend the service to receive GitHub events. |
+
+## Observability
+
+| Variable | Default | Description |
+| --- | --- | --- |
+| `PROVENANCE_OTEL_ENABLED` | `false` | Enable OpenTelemetry metrics/exporters. |
+| `PROVENANCE_OTEL_EXPORTER` | `console` | Exporter target (`console`, `prometheus`, etc.). Additional dependencies might be required. |
+| `PROVENANCE_OTEL_ENDPOINT` | unset | Collector endpoint for OTLP exporters. |
+| `PROVENANCE_PROMETHEUS_PORT` | `9000` | Port to expose Prometheus metrics when exporter is `prometheus`. |
+
+## Analytics Windows & Defaults
+
+| Variable | Default | Description |
+| --- | --- | --- |
+| `PROVENANCE_ANALYTICS_DEFAULT_WINDOW` | `7d` | Default rolling window for analytics endpoints. |
+| `PROVENANCE_ANALYTICS_DEFAULT_METRIC` | `code_volume` | Fallback metric when none provided. |
+
+## CI / GitHub Action
+
+| Variable | Default | Description |
+| --- | --- | --- |
+| `PROVENANCE_WRITE_RESPONSE_PATH` | unset | When set in CI, the GitHub Action writes the decision payload to this path for downstream steps. |
+| `PROVENANCE_TRACE` | `0` | Enable verbose logging from the composite action’s HTTP client. |
+
+## Secrets Handling Tips
+
+- Store sensitive values (API tokens, signing keys) in secret managers or CI secrets, not in plaintext environment files.
+- When using JSON-based settings (e.g., threshold mappings), prefer compact JSON strings: `{"secrets":1,"code_execution":2}` to avoid parsing surprises.
+- Supply configuration through Kubernetes secrets or Docker Compose `.env` files; the app uses `pydantic` settings, so environment variables are parsed automatically.
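
A minimal sketch of how the JSON-based threshold settings described in the tips above could be sanity-checked before deployment. The helper name `load_threshold_mapping` is hypothetical and this is not the service's own parser (the app relies on `pydantic` settings); it only illustrates the compact-JSON shape and the mistakes worth catching early, assuming thresholds are non-negative integers.

```python
import json
import os


def load_threshold_mapping(var_name: str, environ=os.environ) -> dict[str, int]:
    """Parse a JSON object of category -> integer threshold; unset or empty means {}."""
    raw = environ.get(var_name, "").strip()
    if not raw:
        return {}
    # json.JSONDecodeError is a ValueError subclass, so callers can catch ValueError.
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError(f"{var_name} must be a JSON object, got {type(parsed).__name__}")
    for category, limit in parsed.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(limit, bool) or not isinstance(limit, int) or limit < 0:
            raise ValueError(f"{var_name}[{category!r}] must be a non-negative integer")
    return parsed
```

Calling `load_threshold_mapping("PROVENANCE_POLICY_WARN_THRESHOLDS")` in a pre-deploy check would surface malformed JSON or non-integer limits as a `ValueError` instead of a startup failure.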
diff --git a/docs/detector-authoring.md b/docs/detector-authoring.md
@@ -0,0 +1,96 @@
+# Detector Authoring Guide
+
+Provenance ships with a Semgrep-based detector and a few built-in heuristics. This guide explains how to add custom detectors, package rule packs, and test them locally.
+
+## Detector Anatomy
+
+Detectors inherit from `BaseDetector` (`app/services/detection.py`). Each detector implements:
+
+- `detect(lines: list[ChangedLine]) -> list[Finding]`
+- `capabilities() -> dict` (metadata describing rule packs, digests, etc.)
+
+Detectors operate on the normalized `ChangedLine` models extracted from pull-request diffs. Each `ChangedLine` carries the file path, line number, language, attribution data, and more.
+
+## Registering Detectors
+
+Add your modules to `PROVENANCE_DETECTOR_MODULE_PATHS`. Each module must expose a `register_detectors()` function that returns a list of `BaseDetector` instances.
+
+Example module (`my_detectors.py`):
+
+```python
+from app.services.detection import BaseDetector
+
+class BanEvalDetector(BaseDetector):
+    def detect(self, lines):
+        findings = []
+        for line in lines:
+            if "eval(" in (line.content or ""):  # plain substring match, so names like literal_eval( also hit
+                findings.append(
+                    self.build_finding(
+                        line=line,
+                        rule_id="custom.eval.ban",
+                        message="Avoid eval; it executes arbitrary code.",
+                        category="code_execution",
+                        severity="high",
+                    )
+                )
+        return findings
+
+    def capabilities(self):
+        return {
+            "rule_id": "custom.eval.ban",
+            "description": "Flags eval usage in any language",
+        }
+
+def register_detectors() -> list[BaseDetector]:
+    return [BanEvalDetector()]
+```
+
+Export the module path via environment variable:
+
+```bash
+export PROVENANCE_DETECTOR_MODULE_PATHS="my_detectors"
+```
+
+On startup, the detection service imports each module and registers the returned detectors.
+
+## Semgrep Rule Packs
+
+- Default rules live at `app/detection_rules/semgrep_rules.yml`.
+- Replace or extend the pack by setting `PROVENANCE_SEMGREP_RULES_PATH` to a different file or directory (Semgrep understands directories and URLs).
+- To depend on Semgrep-managed registries (`semgrep --config p/somepack`), mount the `.semgrep` auth config and update the detector initialization logic if additional authentication is needed.
+
+### Adding Custom Rules
+
+1. Write rules in YAML (either inline or separate files).
+2. Run `semgrep --config app/detection_rules/semgrep_rules.yml --json` (substituting the path to your own rules file if it lives elsewhere) to preview findings.
+3. Ensure rule IDs follow a namespaced convention (`org.package.rule`) to avoid collisions.
+4. Document rules with `message`, `metadata`, and `severity` to enrich findings and SARIF output.
+
+## Testing Detectors
+
+- Unit tests: Add cases to `tests/test_detection.py` or create new suites to cover specific detectors. Use `ChangedLine` instances to simulate diffs.
+- Integration tests: Extend `tests/test_api_endpoints.py` to submit payloads that exercise new rules and assert on findings and governance outcomes.
+- Local runs: Use `scripts/provenance_client.py` (if available) or the GitHub Action script to submit a diff to a dev instance.
+
+## Capabilities Endpoint
+
+`GET /v1/detectors/capabilities` aggregates metadata from all registered detectors. Ensure `capabilities()` returns informative fields:
+
+```python
+return {
+    "rule_id": "custom.eval.ban",
+    "display_name": "Ban Eval in Python",
+    "sha256": "<rule pack digest>",
+    "config_path": "detectors/my_rules.yml",
+    "last_updated": "2024-07-20T15:00:00Z",
+}
+```
+
+This helps auditors confirm which rule packs were active during an analysis.
+
+## Performance Considerations
+
+- Detectors run synchronously today. For heavy workloads, consider batching queries or offloading to subprocesses.
+- Avoid network calls during detection; enrichments should happen before ingestion or after governance to keep evaluation latency predictable.
+- Use caching if rules require expensive initialization (e.g., loading ML models). Store the cache on the detector instance.
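
A minimal sketch of the caching advice above, storing an expensive resource on the detector instance so it is built once. `SlowStartDetector`, `banned_calls`, and `load_count` are hypothetical names; the class deliberately does not inherit from `BaseDetector` and treats lines as plain strings so the example stays self-contained.

```python
from functools import cached_property


class SlowStartDetector:
    """Stand-in detector that defers expensive setup until the first detect() call."""

    load_count = 0  # only here to make the caching observable

    @cached_property
    def banned_calls(self):
        # Stand-in for expensive work such as loading an ML model or compiling rules.
        # cached_property stores the result on the instance after the first access.
        self.load_count += 1
        return ("eval(", "exec(")

    def detect(self, lines):
        # Simplification: lines are strings here, not ChangedLine models.
        return [text for text in lines if any(call in text for call in self.banned_calls)]
```

Because the cache lives on the instance, a detector returned once from `register_detectors()` would pay the setup cost only on its first evaluation.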
diff --git a/docs/governance-and-risk-model.md b/docs/governance-and-risk-model.md
@@ -0,0 +1,79 @@
+# Governance & Risk Model
+
+This document explains how Provenance evaluates risk, determines allow/warn/block decisions, and which configuration knobs influence the outcome.
+
+## Evaluation Flow
+
+1. **Provenance Coverage** – Each changed line submitted with the analysis is examined for an agent attribution. Coverage metrics (total, attributed, unknown) inform enforcement when `PROVENANCE_BLOCK_ON_UNKNOWN=true`.
+2. **Finding Aggregation** – Detectors produce findings with categories and severity levels. Governance summarizes totals, per-category counts, and severity buckets.
+3. **Policy Thresholds** – Outcomes are derived by comparing summaries against configured thresholds and default heuristics.
+4. **Review & Commit Signals** – When GitHub enrichment is enabled, governance inspects review overrides and force-push activity to adjust the outcome and emit alerts.
+5. **Decision Bundling** – The final decision, risk summary, and inputs digest are wrapped in a DSSE envelope (optionally signed).
+
+## Policy Outcomes
+
+The decision pipeline evaluates conditions in priority order:
+
+1. **Unknown Provenance** – If `PROVENANCE_BLOCK_ON_UNKNOWN=true` and any line lacks attribution, the outcome is `block` with rationale “Unknown agents detected…”.
+2. **Critical Findings** – Any `critical` severity finding forces `block`.
+3. **High Severity Threshold** – If the number of `high` findings reaches `PROVENANCE_RISK_HIGH_SEVERITY_THRESHOLD` (default: `1`), the outcome escalates to `warn`.
+4. **Category Thresholds** – `PROVENANCE_POLICY_BLOCK_THRESHOLDS` and `PROVENANCE_POLICY_WARN_THRESHOLDS` map finding categories to numeric limits (e.g., `{"secrets": 1}`). Limits are inclusive: when a category's count reaches its limit, the outcome escalates to `block` or `warn`.
+5. **Review Overrides / Force Pushes** – GitHub metadata can escalate to `warn` or `block` if bot reviews were bypassed or force-pushes landed after approval.
+6. **Default Allow** – If none of the above are triggered, the outcome is `allow`.
+
+The rationale captures the first trigger encountered to keep explanations concise.
+
+## Configuration Reference
+
+| Setting | Default | Description |
+| --- | --- | --- |
+| `PROVENANCE_BLOCK_ON_UNKNOWN` | `false` | Block analyses when any changed line lacks agent attribution. |
+| `PROVENANCE_RISK_HIGH_SEVERITY_THRESHOLD` | `1` | Number of `high` findings that trigger a `warn`. |
+| `PROVENANCE_POLICY_WARN_THRESHOLDS` | `{}` | JSON mapping of finding category → warn threshold (inclusive). |
+| `PROVENANCE_POLICY_BLOCK_THRESHOLDS` | `{}` | JSON mapping of finding category → block threshold (inclusive). |
+| `PROVENANCE_DECISION_SIGNING_KEY` | unset | Base64 Ed25519 private key. Enables signing of DSSE bundles. |
+| `PROVENANCE_DECISION_KEY_ID` | `decision-key` | Optional key identifier embedded in signature records. |
+| `PROVENANCE_DEFAULT_POLICY_VERSION` | `2024-06-01` | Version stamp included in decisions for audit tracking. |
+
+> The full environment variable list lives in [Configuration Reference](configuration.md). This table highlights the governance-specific controls.
+
+## Risk Summary Schema
+
+Every decision exports a `risk_summary` block:
+
+```json
+{
+  "findings_total": 3,
+  "findings_by_category": {"code_execution": 2, "secrets": 1},
+  "findings_by_severity": {"high": 2, "critical": 1},
+  "coverage": {
+    "total_lines": 22,
+    "attributed_lines": 18,
+    "unknown_line_count": 4,
+    "coverage_percent": 81.82
+  },
+  "bot_block_overrides": 1,
+  "bot_block_resolved": 1,
+  "force_push_after_approval": true
+}
+```
+
+- `coverage` quantifies attribution confidence and feeds both alerting and DSSE payloads.
+- Optional GitHub metadata fields (`bot_block_overrides`, etc.) appear when enrichment is enabled.
+
+## Weighted Risk Score (Planned)
+
+The roadmap includes a composite risk score that blends severity, coverage, and review heuristics. Upcoming changes will add:
+
+- `risk_score` – Numeric index (0–100) aggregating weighted factors.
+- `score_breakdown` – Component contributions (e.g., `{"coverage": 20, "severity": 50, "review": 10}`).
+- Configurable weights via `PROVENANCE_RISK_WEIGHTS`.
+
+Once implemented, governance decisions will still rely on hard thresholds for blocking, but the score will enrich analytics views and downstream automation.
+
+## Tuning Guidance
+
+1. **Start Conservative** – Block on critical findings, warn on bursts of high-severity findings, and observe review overrides before enforcing attribution coverage.
+2. **Iterate on Categories** – Align category thresholds with detector packs (e.g., treat “secrets” differently from “lint”).
+3. **Use DSSE Bundles** – Signatures provide a tamper-evident record of enforcement logic. Verify bundles in CI to ensure configuration drift doesn’t silently relax policies.
+4. **Monitor Analytics** – `/v1/analytics/summary` and `/v1/analytics/agents/behavior` reveal whether thresholds are too aggressive or lenient.
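
When iterating on category thresholds as suggested in the tuning guidance, it can help to replay candidate limits against a saved `risk_summary` before changing the environment. The sketch below is an offline approximation, not the service's governance code: the function names are hypothetical, it assumes the inclusive comparison stated in the configuration table, and it ignores every other trigger (unknown provenance, critical findings, the high-severity threshold, review signals).

```python
def categories_at_limit(counts, thresholds):
    """Return categories whose finding count has reached the configured limit (inclusive)."""
    return sorted(cat for cat, limit in thresholds.items() if counts.get(cat, 0) >= limit)


def dry_run_category_outcome(risk_summary, warn_thresholds, block_thresholds):
    """Approximate the category-threshold step only; block limits are checked before warn limits."""
    counts = risk_summary.get("findings_by_category", {})
    blocked = categories_at_limit(counts, block_thresholds)
    if blocked:
        return "block", blocked
    warned = categories_at_limit(counts, warn_thresholds)
    if warned:
        return "warn", warned
    return "allow", []


# Category counts taken from the sample risk_summary in this guide.
sample_summary = {"findings_by_category": {"code_execution": 2, "secrets": 1}}
outcome, triggered = dry_run_category_outcome(
    sample_summary,
    warn_thresholds={"code_execution": 2},
    block_thresholds={"secrets": 2},
)
```

With these candidate limits the sample summary would only warn on `code_execution`; lowering the `secrets` block limit to `1` would turn the same summary into a block.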