A minimal standalone Python CLI that runs the same evaluation pipeline as the AuditAgent benchmark, but can also work against an external data repo. It:

- Targets an external data root with folders such as `auditagent/`, `baseline/`, `repos/`, `source_of_truth/`
- Reads scan results from `<data_root>/<scan_source>/<repo>_results.json` (e.g., `auditagent/` or `baseline/`)
- Reads source-of-truth findings from `<data_root>/source_of_truth/<repo>.json`
- Evaluates per batch with the same prompt, running `ITERATIONS` LLM calls per batch (default: 3, via `config.py`)
- Post-processes partial matches and appends false positives
- Writes results to `<output_root>/<repo>_results.json` (configured in `config.py`)
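For orientation, here is a minimal sketch of how the input and output paths described above could be assembled with `pathlib`. The folder names come from the layout listed above; the concrete `data/`, `benchmarks/`, and repo values are placeholders, not defaults from the tool.

```python
from pathlib import Path

# Placeholder locations; in the real tool these come from the configuration.
data_root = Path("./data")
output_root = Path("./benchmarks")
scan_source = "auditagent"   # or "baseline"
repo = "example_repo"        # hypothetical repo name

scan_results = data_root / scan_source / f"{repo}_results.json"    # junior scan findings
truth_findings = data_root / "source_of_truth" / f"{repo}.json"    # source-of-truth findings
output_file = output_root / f"{repo}_results.json"                 # evaluated results
```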
- Python 3.12+ recommended
- API key as an environment variable: `OPENAI_API_KEY`
- Optional (telemetry): `LANGFUSE_HOST`, `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, `LANGFUSE_USER_ID`
```bash
python -m venv .venv
. .venv/Scripts/activate    # Windows Git Bash; use the PowerShell equivalent if needed
pip install -e .            # or: pip install -e .[dev]
```
All runtime options are set in `scoring_algo/settings.py` (env prefix `SCORING_`):

- `REPOS_TO_RUN`: list of repo names (without `.json`) to evaluate
- `MODEL`: OpenAI model name (must be in `SUPPORTED_MODELS`)
- `ITERATIONS`: number of LLM runs per batch prompt (default: 3)
- `BATCH_SIZE`: number of scan findings per batch (default: 10)
- `SCAN_SOURCE`: which folder under the data root to read scan results from (`auditagent` or `baseline`)
- `DATA_ROOT`: base directory containing `auditagent/`, `baseline/`, `repos/`, `source_of_truth/`
- `OUTPUT_ROOT`: directory where `<repo>_results.json` will be written
- `DEBUG_PROMPT`: whether to write the rendered prompt beside the results
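Since the actual definitions live in `scoring_algo/settings.py`, the following is only a rough sketch of what such a settings class could look like, assuming a pydantic-settings model; field types and any defaults beyond those stated above are assumptions.

```python
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    """Rough sketch of the runtime options; not the actual class."""

    model_config = SettingsConfigDict(env_prefix="SCORING_")

    REPOS_TO_RUN: list[str] = []          # repo names without ".json"
    MODEL: str = "gpt-4o"                 # assumed default; must be in SUPPORTED_MODELS
    ITERATIONS: int = 3                   # LLM runs per batch prompt
    BATCH_SIZE: int = 10                  # scan findings per batch
    SCAN_SOURCE: str = "auditagent"       # or "baseline"
    DATA_ROOT: str = "./data"             # assumed default
    OUTPUT_ROOT: str = "./benchmarks"     # assumed default
    DEBUG_PROMPT: bool = False


# Environment overrides use the SCORING_ prefix, e.g. SCORING_ITERATIONS=5.
```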
Notes on paths:

- If `DATA_ROOT` or `OUTPUT_ROOT` is relative, it resolves relative to the `scoring_algo/` package directory.
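A minimal sketch of that resolution rule (the helper name and implementation details are assumptions, not the actual code):

```python
from pathlib import Path

PACKAGE_DIR = Path(__file__).resolve().parent  # the scoring_algo/ package directory


def resolve_root(raw: str) -> Path:
    """Absolute paths are kept as-is; relative paths are anchored at the package directory."""
    path = Path(raw)
    return path if path.is_absolute() else (PACKAGE_DIR / path).resolve()
```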
Subcommands are available via the Typer CLI:

```bash
scoring-algo evaluate [--no-telemetry] [--log-level INFO]
```

The runner validates the presence of `<DATA_ROOT>/<SCAN_SOURCE>/<repo>_results.json` and `<DATA_ROOT>/source_of_truth/<repo>.json`. Results are written to `<OUTPUT_ROOT>/<repo>_results.json`.
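A sketch of that validation step (purely illustrative; the function and variable names are assumptions):

```python
from pathlib import Path


def validate_inputs(data_root: Path, scan_source: str, repo: str) -> None:
    """Fail fast if the required input files for a repo are missing."""
    required = [
        data_root / scan_source / f"{repo}_results.json",
        data_root / "source_of_truth" / f"{repo}.json",
    ]
    missing = [str(p) for p in required if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"Missing required input file(s): {missing}")
```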
Generate a Markdown report from existing results in `OUTPUT_ROOT` (or any benchmarks folder) without re-running evaluation. When `--out` is relative, it is written inside `--benchmarks`:

```bash
scoring-algo-report --benchmarks ./benchmarks --scan-root ./data/baseline --out REPORT.md
```

Module alternative:

```bash
python -m scoring_algo.generate_report --benchmarks ./benchmarks --scan-root ./data/baseline --out REPORT.md
```
```bash
python -m venv .venv
. .venv/Scripts/activate
pip install -e .
setx OPENAI_API_KEY YOUR_KEY_HERE   # Windows permanent env; or set it in .env
scoring-algo evaluate --no-telemetry --log-level INFO
scoring-algo report --benchmarks ./benchmarks --scan-root ./data/baseline --out REPORT.md
```
For each truth finding (`source_of_truth/<repo>.json`):

- The junior report is split into batches of `BATCH_SIZE` in original order.
- For each batch:
  - The prompt includes the single truth finding and the current batch of junior findings.
  - The LLM is called `ITERATIONS` times and the responses are aggregated by majority (see the sketch after this list):
    - 2-of-3 exact matches → select a matching response
    - 2-of-3 partial matches → select a partial response
    - 2-of-3 false (neither match nor partial) → select a false response
    - With 3 iterations, a 1 exact + 1 partial + 1 false tie resolves to partial
    - Otherwise the first response is used as a fallback
  - If the consensus is a true match, it is returned immediately for this truth finding. The matched junior finding is removed from future comparisons (one-to-one mapping).
  - Otherwise, the algorithm keeps the first partial found (if any) as the current best for this truth finding.
- After all batches, if no exact match was found, the best partial (if any) is used; otherwise, a representative non-match is recorded.
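A minimal sketch of the majority-vote aggregation described above. The response shape and function name are assumptions; only the decision rules come from the description.

```python
from dataclasses import dataclass


@dataclass
class BatchResponse:
    """Hypothetical shape of one LLM response for a (truth finding, batch) pair."""
    is_match: bool
    is_partial_match: bool


def aggregate_by_majority(responses: list[BatchResponse]) -> BatchResponse:
    """Pick a representative response from the ITERATIONS runs on the same batch."""
    exact = [r for r in responses if r.is_match]
    partial = [r for r in responses if not r.is_match and r.is_partial_match]
    false = [r for r in responses if not r.is_match and not r.is_partial_match]

    majority = len(responses) // 2 + 1   # 2 with the default of 3 iterations
    if len(exact) >= majority:
        return exact[0]
    if len(partial) >= majority:
        return partial[0]
    if len(false) >= majority:
        return false[0]
    # With 3 iterations, a 1 exact + 1 partial + 1 false tie resolves to partial.
    if len(responses) == 3 and len(exact) == 1 and len(partial) == 1:
        return partial[0]
    # Otherwise fall back to the first response.
    return responses[0]
```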
Post-processing and false positives:

- Partials reusing a junior index already used by a true match are suppressed. Multiple partials pointing to the same junior index are de-duplicated (only the first remains).
- All remaining junior findings not used by matches or partials are appended as false positives, except those with severities `Info` and `Best Practices`.
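A sketch of that post-processing pass. The shape of the junior findings, the field access, and the choice to drop suppressed partials outright are assumptions; the suppression, de-duplication, and severity exclusions follow the rules above.

```python
EXCLUDED_SEVERITIES = {"Info", "Best Practices"}  # never reported as false positives


def append_false_positives(evaluated: list[dict], junior_findings: list[dict]) -> list[dict]:
    """Suppress duplicate/conflicting partials, then append unmatched junior findings as FPs."""
    matched = {r["index_of_finding_from_junior_auditor"] for r in evaluated if r["is_match"]}

    kept, seen_partial = [], set()
    for r in evaluated:
        if r["is_partial_match"]:
            idx = r["index_of_finding_from_junior_auditor"]
            # Drop partials that reuse an index taken by a true match or by an earlier partial.
            if idx in matched or idx in seen_partial:
                continue
            seen_partial.add(idx)
        kept.append(r)

    used = matched | seen_partial
    for i, finding in enumerate(junior_findings):          # index convention is assumed
        if i in used or finding.get("severity") in EXCLUDED_SEVERITIES:
            continue
        kept.append({
            "is_match": False,
            "is_partial_match": False,
            "is_fp": True,
            "index_of_finding_from_junior_auditor": i,
            "severity_from_junior_auditor": finding.get("severity"),
        })
    return kept
```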
Output format:

- Results are an array of `EvaluatedFinding` objects with the fields `is_match`, `is_partial_match`, `is_fp`, `explanation`, `severity_from_junior_auditor`, `severity_from_truth`, `index_of_finding_from_junior_auditor`, and `finding_description_from_junior_auditor`.
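For illustration, a single `EvaluatedFinding` entry could look roughly like the dictionary below; all values are made up, only the field names come from the list above.

```python
example_evaluated_finding = {
    "is_match": True,
    "is_partial_match": False,
    "is_fp": False,
    "explanation": "The junior finding describes the same issue as the truth finding.",
    "severity_from_junior_auditor": "High",
    "severity_from_truth": "High",
    "index_of_finding_from_junior_auditor": 4,
    "finding_description_from_junior_auditor": "Example junior finding description.",
}
```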