Commit 34314f9

docs: add configuration and governance guides
1 parent 5560cc9 commit 34314f9

5 files changed

+267
-0
lines changed

README.md

Lines changed: 3 additions & 0 deletions
@@ -222,6 +222,9 @@ The same process works against forks or sandboxes—helpful when validating new
222222
## Additional Documentation
223223

224224
- [CI Integration Guide](docs/ci-integration.md) – Configure GitHub Actions, upload SARIF, archive decision bundles, and adapt the workflow to other CI systems.
225+
- [Governance & Risk Model](docs/governance-and-risk-model.md) – Understand decision flow, thresholds, and tuning guidance.
226+
- [Configuration Reference](docs/configuration.md) – Environment variables grouped by subsystem with defaults and usage tips.
227+
- [Detector Authoring Guide](docs/detector-authoring.md) – Build custom detectors, register modules, and manage rule packs.
225228
- [SARIF Reporting](docs/sarif-reporting.md) – Understand the SARIF 2.1.0 output, severity mapping, and customization hooks.
226229
- [DSSE Decision Bundles](docs/dsse-decision-bundles.md) – Inspect the envelope schema, verify signatures, and integrate with transparency logs.
227230

docs/README.md

Lines changed: 3 additions & 0 deletions
@@ -3,5 +3,8 @@
33
This directory contains task-focused guides that go deeper than the root `README.md`.
44

55
- [CI Integration Guide](ci-integration.md) — Automate Provenance evaluations in GitHub Actions and other CI pipelines, upload SARIF findings, and archive DSSE bundles.
6+
- [Governance & Risk Model](governance-and-risk-model.md) — Learn how policy decisions are made and how to tune thresholds.
7+
- [Configuration Reference](configuration.md) — Environment variables grouped by subsystem with defaults and usage tips.
8+
- [Detector Authoring Guide](detector-authoring.md) — Extend Provenance with custom detectors and rule packs.
69
- [SARIF Reporting](sarif-reporting.md) — Understand the SARIF 2.1.0 output and tailor it for downstream scanners.
710
- [DSSE Decision Bundles](dsse-decision-bundles.md) — Inspect the DSSE envelope, verify signatures, and extend transparency workflows.

docs/configuration.md

Lines changed: 86 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,86 @@
1+
# Configuration Reference
2+
3+
The service is configured via environment variables named in uppercase snake case with a `PROVENANCE_` prefix. This reference groups related settings, documents their defaults, and explains how they influence behaviour.
4+
5+
## Core Service
6+
7+
| Variable | Default | Description |
| --- | --- | --- |
| `PROVENANCE_SERVICE_HOST` | `0.0.0.0` | Bind address for the API server. |
| `PROVENANCE_SERVICE_PORT` | `8000` | HTTP port for FastAPI. |
| `PROVENANCE_SERVICE_BASE_URL` | `http://localhost:8000` | External URL used when generating links in API responses. |
| `PROVENANCE_API_V1_PREFIX` | `/v1` | Prefix applied to API routes. |
| `PROVENANCE_API_TOKEN` | unset | Shared secret for simple token auth on ingestion endpoints. Use a stronger mechanism (e.g., OAuth) in production. |
14+
15+
## Data Stores
16+
17+
| Variable | Default | Description |
| --- | --- | --- |
| `PROVENANCE_REDIS_URL` | `redis://localhost:6379/0` | Primary datastore for analyses, findings, and decisions. |
| `PROVENANCE_REDIS_PASSWORD` | unset | Password for secured Redis deployments. |
| `PROVENANCE_TIMESERIES_BACKEND` | `file` | Destination for analytics events: `file`, `clickhouse`, `snowflake`, `bigquery`, or `off`. |
| `PROVENANCE_TIMESERIES_PATH` | `data/timeseries_events.jsonl` | File path used when backend is `file`. |
| `PROVENANCE_CLICKHOUSE_URL` | unset | HTTP endpoint for ClickHouse when selected as backend. |
| `PROVENANCE_SNOWFLAKE_ACCOUNT` | unset | Snowflake account identifier (when backend is `snowflake`). |
| `PROVENANCE_BIGQUERY_DATASET` | unset | Dataset name for BigQuery backend. |
26+
27+
## Governance & Risk
28+
29+
See [Governance & Risk Model](governance-and-risk-model.md) for detailed context.
30+
31+
| Variable | Default | Description |
| --- | --- | --- |
| `PROVENANCE_BLOCK_ON_UNKNOWN` | `false` | Block analyses with unattributed lines. |
| `PROVENANCE_RISK_HIGH_SEVERITY_THRESHOLD` | `1` | Warn threshold for high severity findings. |
| `PROVENANCE_POLICY_WARN_THRESHOLDS` | `{}` | JSON mapping of category → warn threshold. |
| `PROVENANCE_POLICY_BLOCK_THRESHOLDS` | `{}` | JSON mapping of category → block threshold. |
| `PROVENANCE_DEFAULT_POLICY_VERSION` | `2024-06-01` | Version string embedded in decisions. |
| `PROVENANCE_DECISION_SIGNING_KEY` | unset | Base64 Ed25519 private key for DSSE signing. |
| `PROVENANCE_DECISION_KEY_ID` | `"decision-key"` | Label for the signing key. |
40+
41+
## Detectors & Provenance
42+
43+
| Variable | Default | Description |
| --- | --- | --- |
| `PROVENANCE_DETECTOR_MODULE_PATHS` | unset | Comma-separated list of Python modules that register additional detectors. |
| `PROVENANCE_SEMGREP_RULES_PATH` | `app/detection_rules/semgrep_rules.yml` | Default Semgrep ruleset used by the built-in detector. |
| `PROVENANCE_AGENT_PUBLIC_KEYS` | `{}` | Mapping of agent IDs to Ed25519 public keys (JSON). Enables cryptographic attribution of changed lines. |
| `PROVENANCE_PROVENANCE_MARKERS` | `{}` | Optional hints for matching agent markers in commit messages. |
49+
50+
## GitHub Integration
51+
52+
| Variable | Default | Description |
| --- | --- | --- |
| `PROVENANCE_GITHUB_TOKEN` | unset | Personal access token or GitHub App installation token for enrichment. |
| `PROVENANCE_GITHUB_APP_ID` | unset | GitHub App identifier (when using app-based auth). |
| `PROVENANCE_GITHUB_APP_PRIVATE_KEY` | unset | Base64 encoded private key for the GitHub App. |
| `PROVENANCE_GITHUB_WEBHOOK_SECRET` | unset | Shared secret for webhook verification if you extend the service to receive GitHub events. |
58+
59+
## Observability
60+
61+
| Variable | Default | Description |
| --- | --- | --- |
| `PROVENANCE_OTEL_ENABLED` | `false` | Enable OpenTelemetry metrics/exporters. |
| `PROVENANCE_OTEL_EXPORTER` | `console` | Exporter target (`console`, `prometheus`, etc.). Additional dependencies might be required. |
| `PROVENANCE_OTEL_ENDPOINT` | unset | Collector endpoint for OTLP exporters. |
| `PROVENANCE_PROMETHEUS_PORT` | `9000` | Port to expose Prometheus metrics when exporter is `prometheus`. |
67+
68+
## Analytics Windows & Defaults
69+
70+
| Variable | Default | Description |
| --- | --- | --- |
| `PROVENANCE_ANALYTICS_DEFAULT_WINDOW` | `7d` | Default rolling window for analytics endpoints. |
| `PROVENANCE_ANALYTICS_DEFAULT_METRIC` | `code_volume` | Fallback metric when none provided. |
74+
75+
## CI / GitHub Action
76+
77+
| Variable | Default | Description |
| --- | --- | --- |
| `PROVENANCE_WRITE_RESPONSE_PATH` | unset | When set in CI, the GitHub Action writes the decision payload to this path for downstream steps. |
| `PROVENANCE_TRACE` | `0` | Enable verbose logging from the composite action’s HTTP client. |
81+
82+
## Secrets Handling Tips
83+
84+
- Store sensitive values (API tokens, signing keys) in secret managers or CI secrets, not in plaintext environment files.
- When using JSON-based settings (e.g., threshold mappings), prefer compact JSON strings such as `{"secrets":1,"code_execution":2}` to avoid parsing surprises.
- Inject configuration through Kubernetes secrets or Docker Compose `.env` files, keeping any `.env` file that holds sensitive values out of version control. The app uses `pydantic` settings, so environment variables are parsed automatically.
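
To illustrate the JSON-based settings mentioned above, here is a minimal sketch of how a threshold mapping could be validated before it reaches the service. It uses only the standard library rather than the app's own `pydantic` settings class, and the helper name `load_thresholds` is an assumption made for this example, not part of the codebase.

```python
import json
from typing import Mapping


def load_thresholds(env: Mapping[str, str], name: str) -> dict[str, int]:
    """Parse a JSON object of category -> threshold from an environment mapping."""
    raw = env.get(name, "").strip()
    if not raw:
        return {}  # unset or empty behaves like the documented default `{}`
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError(f"{name} must be a JSON object, got {type(parsed).__name__}")
    thresholds = {}
    for category, limit in parsed.items():
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(limit, bool) or not isinstance(limit, int) or limit < 0:
            raise ValueError(f"{name}[{category!r}] must be a non-negative integer")
        thresholds[category] = limit
    return thresholds


if __name__ == "__main__":
    env = {"PROVENANCE_POLICY_WARN_THRESHOLDS": '{"secrets":1,"code_execution":2}'}
    print(load_thresholds(env, "PROVENANCE_POLICY_WARN_THRESHOLDS"))
```

Running a check like this in CI, before deployment, is one way to catch a malformed mapping early; whether the service itself rejects or ignores invalid values is not stated in this guide.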

docs/detector-authoring.md

Lines changed: 96 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,96 @@
1+
# Detector Authoring Guide
2+
3+
Provenance ships with a Semgrep-based detector and a few built-in heuristics. This guide explains how to add custom detectors, package rule packs, and test them locally.
4+
5+
## Detector Anatomy
6+
7+
Detectors inherit from `BaseDetector` (`app/services/detection.py`). Each detector implements:
8+
9+
- `detect(lines: list[ChangedLine]) -> list[Finding]`
10+
- `capabilities() -> dict` (metadata describing rule packs, digests, etc.)
11+
12+
Detectors operate on the normalized `ChangedLine` models extracted from pull-request diffs, so they include file path, line number, language, attribution data, and more.
13+
14+
## Registering Detectors
15+
16+
Add your modules to `PROVENANCE_DETECTOR_MODULE_PATHS`. Each module must expose a `register_detectors()` function returning `BaseDetector` instances.
17+
18+
Example module (`my_detectors.py`):
19+
20+
```python
from app.services.detection import BaseDetector


class BanEvalDetector(BaseDetector):
    """Flags any changed line that contains a call to eval()."""

    def detect(self, lines):
        findings = []
        for line in lines:
            # Guard against lines that carry no content.
            if "eval(" in (line.content or ""):
                findings.append(
                    self.build_finding(
                        line=line,
                        rule_id="custom.eval.ban",
                        message="Avoid eval; it executes arbitrary code.",
                        category="code_execution",
                        severity="high",
                    )
                )
        return findings

    def capabilities(self):
        return {
            "rule_id": "custom.eval.ban",
            "description": "Flags eval usage in any language",
        }


def register_detectors() -> list[BaseDetector]:
    # Called by the detection service for each configured module.
    return [BanEvalDetector()]
```
48+
49+
Export the module path via environment variable:
50+
51+
```bash
export PROVENANCE_DETECTOR_MODULE_PATHS="my_detectors"
```
54+
55+
On startup, the detection service imports each module and registers the returned detectors.
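
The startup behaviour described above can be sketched roughly as follows. This is not the service's actual loader: `load_detectors` and the in-memory `demo_detectors` module are hypothetical names used so that the example runs on its own, and the real implementation may handle errors differently.

```python
import importlib
import sys
import types


def load_detectors(module_paths: str) -> list:
    """Import each comma-separated module and collect its registered detectors."""
    detectors = []
    for name in (part.strip() for part in module_paths.split(",")):
        if not name:
            continue  # tolerate empty entries such as a trailing comma
        module = importlib.import_module(name)
        register = getattr(module, "register_detectors", None)
        if not callable(register):
            raise ImportError(f"{name} does not define register_detectors()")
        detectors.extend(register())
    return detectors


# Demo with an in-memory stand-in module so the sketch runs without the service.
_demo = types.ModuleType("demo_detectors")
_demo.register_detectors = lambda: ["stand-in detector"]
sys.modules["demo_detectors"] = _demo

print(load_detectors("demo_detectors, "))
```

In a real deployment the listed modules must be importable from the service's Python path, for example by installing them as a package or extending `PYTHONPATH`.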
56+
57+
## Semgrep Rule Packs
58+
59+
- Default rules live at `app/detection_rules/semgrep_rules.yml`.
60+
- Replace or extend the pack by setting `PROVENANCE_SEMGREP_RULES_PATH` to a different file or directory (Semgrep understands directories and URLs).
61+
- To depend on Semgrep-managed registries (`semgrep --config p/somepack`), mount the `.semgrep` auth config and update the detector initialization logic if additional authentication is needed.
62+
63+
### Adding Custom Rules
64+
65+
1. Write rules in YAML (either inline or separate files).
66+
2. Run `semgrep --config app/detection_rules/semgrep_rules.yml --json` to preview findings.
67+
3. Ensure rule IDs follow a namespaced convention (`org.package.rule`) to avoid collisions.
68+
4. Document rules with `message`, `metadata`, and `severity` to enrich findings and SARIF output.
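
As a small illustration of step 3, a check like the one below could run in CI to catch rule IDs that are not namespaced. The exact convention is an assumption here (at least three dot-separated lowercase segments); adjust the pattern to whatever your organization settles on.

```python
import re

# Assumed convention: three or more dot-separated segments, each starting with a
# lowercase letter or digit and containing only lowercase letters, digits,
# hyphens, or underscores (for example, "custom.eval.ban").
RULE_ID_PATTERN = re.compile(r"[a-z0-9][a-z0-9_-]*(\.[a-z0-9][a-z0-9_-]*){2,}")


def is_namespaced(rule_id: str) -> bool:
    """Return True when the rule ID follows the assumed org.package.rule shape."""
    return RULE_ID_PATTERN.fullmatch(rule_id) is not None
```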
69+
70+
## Testing Detectors
71+
72+
- Unit tests: Add fixtures under `tests/test_detection.py` or create new suites to cover specific detectors. Use `ChangedLine` instances to simulate diffs.
73+
- Integration tests: Extend `tests/test_api_endpoints.py` to submit payloads that exercise new rules and assert on findings and governance outcomes.
74+
- Local runs: Use `scripts/provenance_client.py` (if available) or the GitHub Action script to submit a diff to a dev instance.
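
A unit test for the `BanEvalDetector` matching rule might look like the sketch below. Because the real `ChangedLine` model and `BaseDetector.build_finding` live in the service, the example substitutes a hypothetical `FakeChangedLine` dataclass and tests only the matching logic; the field names are assumptions and should be replaced with the actual model in a real suite.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeChangedLine:
    """Hypothetical stand-in for the service's ChangedLine model."""
    file_path: str
    line_number: int
    content: Optional[str]


def lines_with_eval(lines):
    """Same matching rule as BanEvalDetector.detect, minus finding construction."""
    return [line for line in lines if "eval(" in (line.content or "")]


def test_flags_only_eval_lines():
    lines = [
        FakeChangedLine("app.py", 10, "result = eval(user_input)"),
        FakeChangedLine("app.py", 11, "result = int(user_input)"),
        FakeChangedLine("app.py", 12, None),  # missing content must not crash
    ]
    flagged = lines_with_eval(lines)
    assert [line.line_number for line in flagged] == [10]


test_flags_only_eval_lines()
```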
75+
76+
## Capabilities Endpoint
77+
78+
`GET /v1/detectors/capabilities` aggregates metadata from all registered detectors. Ensure `capabilities()` returns informative fields:
79+
80+
```python
return {
    "rule_id": "custom.eval.ban",
    "display_name": "Ban Eval in Python",
    "sha256": "<rule pack digest>",
    "config_path": "detectors/my_rules.yml",
    "last_updated": "2024-07-20T15:00:00Z",
}
```
89+
90+
This helps auditors confirm which rule packs were active during an analysis.
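
One way to produce the `sha256` value is to hash the rule pack file itself. The helper below is a sketch under that assumption; the guide does not specify how the built-in detector computes its digest, and `rule_pack_digest` is a name chosen for this example.

```python
import hashlib
from pathlib import Path


def rule_pack_digest(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a rule pack file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Computing the digest once at detector initialization, rather than on every `capabilities()` call, keeps the endpoint cheap.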
91+
92+
## Performance Considerations
93+
94+
- Detectors run synchronously today. For heavy workloads, consider batching queries or offloading to subprocesses.
95+
- Avoid network calls during detection; enrichments should happen before ingestion or after governance to keep evaluation latency predictable.
96+
- Use caching if rules require expensive initialization (e.g., loading ML models). Store the cache on the detector instance.
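
The caching advice in the last bullet can be sketched with `functools.cached_property`, which stores the computed value on the instance after first access. `ExpensiveDetector` is a stand-in class for illustration; a real detector would subclass `BaseDetector` and return findings rather than raw strings.

```python
from functools import cached_property


class ExpensiveDetector:
    """Stand-in detector that loads a costly resource at most once per instance."""

    load_count = 0  # counts loads so the caching behaviour is observable

    @cached_property
    def model(self):
        # Placeholder for slow work such as reading a model file from disk.
        type(self).load_count += 1
        return {"blocked_calls": ("eval(", "exec(")}

    def detect(self, lines):
        patterns = self.model["blocked_calls"]  # loaded once, then reused
        return [text for text in lines if any(p in text for p in patterns)]
```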

docs/governance-and-risk-model.md

Lines changed: 79 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,79 @@
1+
# Governance & Risk Model
2+
3+
This document explains how Provenance evaluates risk, determines allow/warn/block decisions, and which configuration knobs influence the outcome.
4+
5+
## Evaluation Flow
6+
7+
1. **Provenance Coverage** – Each changed line submitted with the analysis is examined for an agent attribution. Coverage metrics (total, attributed, unknown) inform enforcement when `PROVENANCE_BLOCK_ON_UNKNOWN=true`.
8+
2. **Finding Aggregation** – Detectors produce findings with categories and severity levels. Governance summarizes totals, per-category counts, and severity buckets.
9+
3. **Policy Thresholds** – Outcomes are derived by comparing summaries against configured thresholds and default heuristics.
10+
4. **Review & Commit Signals** – When GitHub enrichment is enabled, governance inspects review overrides and force-push activity to adjust the outcome and emit alerts.
11+
5. **Decision Bundling** – The final decision, risk summary, and inputs digest are wrapped in a DSSE envelope (optionally signed).
12+
13+
## Policy Outcomes
14+
15+
The decision pipeline evaluates conditions in priority order:
16+
17+
1. **Unknown Provenance** – If `PROVENANCE_BLOCK_ON_UNKNOWN=true` and any line lacks attribution, the outcome is `block` with rationale “Unknown agents detected…”.
18+
2. **Critical Findings** – Any `critical` severity finding forces `block`.
19+
3. **High Severity Threshold** – If the number of `high` findings reaches `PROVENANCE_RISK_HIGH_SEVERITY_THRESHOLD` (default: `1`), the outcome becomes `warn`.
20+
4. **Category Thresholds** – `PROVENANCE_POLICY_BLOCK_THRESHOLDS` and `PROVENANCE_POLICY_WARN_THRESHOLDS` map finding categories to numeric limits (e.g., `{"secrets": 1}`). When a category's count reaches its limit (thresholds are inclusive), the outcome escalates to `block` or `warn`, respectively.
21+
5. **Review Overrides / Force Pushes** – GitHub metadata can escalate to `warn` or `block` if bot reviews were bypassed or force-pushes landed after approval.
22+
6. **Default Allow** – If none of the above are triggered, the analysis is `allow`.
23+
24+
The rationale captures the first trigger encountered to keep explanations concise.
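
The priority order above can be sketched as a small function. This is an illustration, not the service's governance code: `RiskInputs`, `decide`, and the rationale strings are hypothetical, and the sketch assumes that the most severe triggered outcome wins while the rationale comes from the first trigger. A first-match-wins reading of the list is also plausible.

```python
from dataclasses import dataclass, field
from typing import Optional

SEVERITY_ORDER = {"allow": 0, "warn": 1, "block": 2}


@dataclass
class RiskInputs:
    """Hypothetical container for the summaries governance compares against policy."""
    unknown_lines: int = 0
    by_severity: dict = field(default_factory=dict)
    by_category: dict = field(default_factory=dict)
    review_signal: Optional[str] = None  # "warn" or "block" from GitHub metadata


def decide(inputs, block_on_unknown=False, high_threshold=1,
           block_thresholds=None, warn_thresholds=None):
    """Return (outcome, rationale) after checking triggers in the documented order."""
    triggers = []  # (outcome, rationale) pairs, appended in priority order
    if block_on_unknown and inputs.unknown_lines > 0:
        triggers.append(("block", "unknown provenance"))
    if inputs.by_severity.get("critical", 0) > 0:
        triggers.append(("block", "critical finding"))
    if inputs.by_severity.get("high", 0) >= high_threshold:
        triggers.append(("warn", "high severity threshold"))
    for outcome, limits in (("block", block_thresholds or {}),
                            ("warn", warn_thresholds or {})):
        for category, limit in limits.items():
            if inputs.by_category.get(category, 0) >= limit:  # inclusive limit
                triggers.append((outcome, f"{category} threshold"))
    if inputs.review_signal in ("warn", "block"):
        triggers.append((inputs.review_signal, "review override or force push"))
    if not triggers:
        return "allow", "no triggers"
    # Assumption: the most severe outcome wins; the rationale is the first trigger.
    outcome = max((o for o, _ in triggers), key=SEVERITY_ORDER.__getitem__)
    return outcome, triggers[0][1]
```

For example, one `high` finding in the `secrets` category with a block threshold of `{"secrets": 1}` yields `block` under this sketch, with the high severity threshold recorded as the rationale because it fired first.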
25+
26+
## Configuration Reference
27+
28+
| Setting | Default | Description |
| --- | --- | --- |
| `PROVENANCE_BLOCK_ON_UNKNOWN` | `false` | Block analyses when any changed line lacks agent attribution. |
| `PROVENANCE_RISK_HIGH_SEVERITY_THRESHOLD` | `1` | Number of `high` findings that trigger a `warn`. |
| `PROVENANCE_POLICY_WARN_THRESHOLDS` | `{}` | JSON mapping of finding category → warn threshold (inclusive). |
| `PROVENANCE_POLICY_BLOCK_THRESHOLDS` | `{}` | JSON mapping of finding category → block threshold (inclusive). |
| `PROVENANCE_DECISION_SIGNING_KEY` | unset | Base64 Ed25519 private key. Enables signing of DSSE bundles. |
| `PROVENANCE_DECISION_KEY_ID` | `"decision-key"` | Optional key identifier embedded in signature records. |
| `PROVENANCE_DEFAULT_POLICY_VERSION` | `2024-06-01` | Version stamp included in decisions for audit tracking. |
37+
38+
> The full environment variable list lives in [Configuration Reference](configuration.md). This table highlights the governance-specific controls.
39+
40+
## Risk Summary Schema
41+
42+
Every decision exports a `risk_summary` block:
43+
44+
```json
{
  "findings_total": 3,
  "findings_by_category": {"code_execution": 2, "secrets": 1},
  "findings_by_severity": {"high": 2, "critical": 1},
  "coverage": {
    "total_lines": 22,
    "attributed_lines": 18,
    "unknown_line_count": 4,
    "coverage_percent": 81.82
  },
  "bot_block_overrides": 1,
  "bot_block_resolved": 1,
  "force_push_after_approval": true
}
```
60+
61+
- `coverage` quantifies attribution confidence and feeds both alerting and DSSE payloads.
62+
- Optional GitHub metadata fields (`bot_block_overrides`, etc.) appear when enrichment is enabled.
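
The `coverage` block in the example is consistent with simple arithmetic: 22 total lines minus 18 attributed lines leaves 4 unknown, and 18 / 22 rounds to 81.82 percent. The sketch below reproduces that calculation; the function name and the choice to report `0.0` for an empty diff are assumptions, not documented behaviour.

```python
def coverage_summary(total_lines: int, attributed_lines: int) -> dict:
    """Build a coverage block shaped like the one in the risk summary example."""
    if total_lines < 0 or not 0 <= attributed_lines <= total_lines:
        raise ValueError("attributed_lines must be between 0 and total_lines")
    # Assumption: an empty diff reports 0.0 rather than dividing by zero.
    percent = round(100 * attributed_lines / total_lines, 2) if total_lines else 0.0
    return {
        "total_lines": total_lines,
        "attributed_lines": attributed_lines,
        "unknown_line_count": total_lines - attributed_lines,
        "coverage_percent": percent,
    }


if __name__ == "__main__":
    print(coverage_summary(22, 18))
```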
63+
64+
## Weighted Risk Score (Planned)
65+
66+
The roadmap includes a composite risk score that blends severity, coverage, and review heuristics. Upcoming changes will add:
67+
68+
- `risk_score` – Numeric index (0–100) aggregating weighted factors.
69+
- `score_breakdown` – Component contributions (e.g., `{"coverage": 20, "severity": 50, "review": 10}`).
70+
- Configurable weights via `PROVENANCE_RISK_WEIGHTS`.
71+
72+
Once implemented, governance decisions will still rely on hard thresholds for blocking, but the score will enrich analytics views and downstream automation.
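
Since the score is only planned, the following is a purely hypothetical sketch of how weighted contributions could add up to a `risk_score` and `score_breakdown`. The formula, the normalized 0 to 1 factor inputs, and the weight values are all assumptions chosen so that the result matches the breakdown example above; the eventual implementation may differ entirely.

```python
def risk_score(factors: dict, weights: dict) -> tuple:
    """Combine normalized factors (0..1) with weights into a 0..100 score.

    Hypothetical formula: each contribution is weight * factor, and the total
    is the sum of contributions capped at 100.
    """
    breakdown = {}
    for name, weight in weights.items():
        value = min(max(factors.get(name, 0.0), 0.0), 1.0)  # clamp to [0, 1]
        breakdown[name] = round(weight * value, 2)
    total = min(round(sum(breakdown.values()), 2), 100.0)
    return total, breakdown


if __name__ == "__main__":
    # Illustrative weights only; PROVENANCE_RISK_WEIGHTS has no documented format yet.
    weights = {"coverage": 40, "severity": 50, "review": 20}
    factors = {"coverage": 0.5, "severity": 1.0, "review": 0.5}
    print(risk_score(factors, weights))
```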
73+
74+
## Tuning Guidance
75+
76+
1. **Start Conservative** – Block on critical findings, warn on high severity bursts, and observe review overrides before enforcing attribution coverage.
77+
2. **Iterate on Categories** – Align category thresholds with detector packs (e.g., treat “secrets” differently from “lint”).
78+
3. **Use DSSE Bundles** – Signatures provide a tamper-evident record of enforcement logic. Verify bundles in CI to ensure configuration drift doesn’t silently relax policies.
79+
4. **Monitor Analytics** – `/v1/analytics/summary` and `/v1/analytics/agents/behavior` reveal whether thresholds are too aggressive or lenient.
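
For the bundle verification mentioned in step 3, it helps to know that a DSSE signature covers the pre-authentication encoding (PAE) of the payload type and payload, not the raw payload. The sketch below computes that encoding as defined by the DSSE specification; the actual Ed25519 check needs a cryptography library and is not shown, and the payload type Provenance uses is not stated in this guide, so the example value is the one from the DSSE specification.

```python
def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE pre-authentication encoding: the exact bytes a signer signs."""
    type_bytes = payload_type.encode("utf-8")
    # Format: "DSSEv1" SP LEN(type) SP type SP LEN(payload) SP payload,
    # where LEN is the ASCII decimal byte length.
    return b"DSSEv1 %d %b %d %b" % (len(type_bytes), type_bytes, len(payload), payload)


if __name__ == "__main__":
    print(pae("http://example.com/HelloWorld", b"hello world"))
```

A verifier would base64-decode the envelope's `payload`, rebuild these bytes with the envelope's `payloadType`, and check each signature against the public key that matches the signature's `keyid`.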
