paul-rl/godot-issue-triager

Milestones

Sprint 5 - Reproducibility + Code Freeze
## Goal
Harden the repo for reproducibility and freeze code.

## Deliverables
- Reproducible runbook
  - README documents exact steps to regenerate results
  - Minimal-command pipeline to produce final tables from cached artifacts
- Versioning + determinism
  - Dataset split artifacts committed or checksummed and referenced
  - Config files captured for every reported run
  - Seeds pinned where applicable, dependencies pinned
- Sanity checks
  - Split integrity checks
  - Label mapping verification
  - Schema compliance tests
- Repo cleanup
  - Remove dead code paths, unify config names, ensure consistent logging

## Done when
- A fresh environment can reproduce the final results using the runbook
- All report figures regenerate from scripts and match stored artifacts
- Code is frozen at end of sprint
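One way the "committed or checksummed and referenced" deliverable could look is a small manifest of SHA-256 digests that is written once and re-verified before any results are regenerated. This is a minimal sketch, not the project's actual tooling: the function names and the manifest layout (a JSON object mapping path to digest) are assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(paths, manifest_path):
    """Record a digest for every artifact (hypothetical manifest format)."""
    manifest = {str(p): sha256_of(p) for p in paths}
    Path(manifest_path).write_text(json.dumps(manifest, indent=2, sort_keys=True))


def verify_manifest(manifest_path):
    """Return the artifact paths that are missing or whose digest changed."""
    manifest = json.loads(Path(manifest_path).read_text())
    return [
        path
        for path, expected in sorted(manifest.items())
        if not Path(path).exists() or sha256_of(path) != expected
    ]
```

An empty list from `verify_manifest` would mean the split artifacts on disk are byte-identical to the ones the reported runs used; anything else is a signal to stop before regenerating tables.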
Due by April 10, 2026
0% complete, 0 open, 0 closed
Sprint 4 - Robustness + Final Experiment Matrix on Test Split
## Goal
Lock the final results on the time-based test split and demonstrate robustness to noisy/partial issue text.

## Deliverables
- Final experiment matrix
  - TF-IDF baseline evaluated on the final split
  - Structured LLM evaluated with the selected operating point
  - Comparison table generated directly from run artifacts
- Robustness suite
  - 2-4 perturbations implemented, for example:
    - remove title
    - truncate body
    - strip code blocks/logs
    - drop first/last K tokens
  - For each perturbation, measure performance change and prediction stability
- Report-ready outputs
  - Final results table (baseline vs structured LLM)
  - One robustness plot/table showing degradation/stability

## Done when
- A single command regenerates:
  - baseline metrics
  - final structured LLM metrics
  - final tables/plots
- Robustness artifacts exist for each perturbation with clear metrics and a short interpretation note
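The example perturbations listed above could be sketched as pure functions over an issue record, together with a simple stability measure. The assumptions here are mine: issues are dicts with `title` and `body` keys, code blocks are delimited by triple backticks, tokens are whitespace-separated, and "prediction stability" is taken to mean the fraction of issues whose predicted label set is unchanged after perturbation. The milestone does not fix any of these definitions.

```python
import re

# Matches a triple-backtick fenced block, including its contents.
FENCED_BLOCK = re.compile(r"`{3}.*?`{3}", re.DOTALL)


def remove_title(issue):
    """Return a copy of the issue with an empty title."""
    return {**issue, "title": ""}


def truncate_body(issue, max_chars=500):
    """Return a copy of the issue with the body cut to max_chars characters."""
    return {**issue, "body": issue["body"][:max_chars]}


def strip_code_blocks(issue):
    """Return a copy of the issue with fenced code blocks/logs removed."""
    return {**issue, "body": FENCED_BLOCK.sub("", issue["body"])}


def drop_first_tokens(issue, k=50):
    """Return a copy of the issue without the first k whitespace tokens."""
    tokens = issue["body"].split()
    return {**issue, "body": " ".join(tokens[k:])}


def prediction_stability(original_preds, perturbed_preds):
    """Fraction of issues whose predicted label set did not change."""
    if not original_preds:
        return 0.0
    unchanged = sum(
        set(before) == set(after)
        for before, after in zip(original_preds, perturbed_preds)
    )
    return unchanged / len(original_preds)
```

Keeping each perturbation as a function that returns a new record (rather than mutating in place) makes it straightforward to run the same evaluator over the original and perturbed sets and store one robustness artifact per perturbation.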
Due by April 3, 2026
0% complete, 0 open, 0 closed
Sprint 3 - Coverage-accuracy tradeoff and operating point
## Goal
Quantify abstention behavior and select a justified operating point (N, aggregation policy, confidence threshold) using coverage-accuracy tradeoff curves.

## Deliverables
- Threshold sweep experiment
  - Automated sweep over confidence thresholds
  - For each threshold: compute non-abstained % and performance (micro/macro F1)
  - Outputs saved as a sweep artifact and plotted
- Operating point selection
  - Choose final settings based on sweep results
- Error analysis set
  - Curated set of 10-20 representative errors with:
    - issue excerpt (sanitized)
    - predicted vs true labels
    - brief failure type categorization

## Done when
- A sweep command reproduces the coverage-accuracy curve and writes artifacts
- A single chosen config run reproduces the selected operating point metrics
- Error analysis file exists with categorized examples suitable for inclusion in the final report
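The threshold sweep could be sketched roughly as follows. This is an illustration under stated assumptions: each record is a dict with a `confidence` float and a `correct` boolean (both hypothetical field names), an issue is treated as non-abstained when its confidence is at or above the threshold, and plain accuracy stands in for the micro/macro F1 named in the deliverables.

```python
def sweep_thresholds(records, thresholds):
    """Compute coverage and accuracy on non-abstained records per threshold."""
    rows = []
    for threshold in thresholds:
        kept = [r for r in records if r["confidence"] >= threshold]
        coverage = len(kept) / len(records) if records else 0.0
        # Accuracy is undefined when every record is abstained on.
        accuracy = sum(r["correct"] for r in kept) / len(kept) if kept else None
        rows.append(
            {"threshold": threshold, "coverage": coverage, "accuracy": accuracy}
        )
    return rows
```

Each returned row is one point on the coverage-accuracy curve, so the list can be dumped to JSON as the sweep artifact and plotted directly.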
Due by March 20, 2026
0% complete, 0 open, 0 closed
Sprint 2 - Structured LLM v1 (N-sampling, aggregation, confidence, repair, abstention)
## Goal
Extend the structured LLM triager to match the proposal: multi-sample generation, aggregation, confidence estimation, schema repair/retry, and abstention signaling.

## Deliverables
- N-sampled structured generations
  - Configurable N generations per issue
  - All generations stored as JSON artifacts
- Aggregation strategy
  - Deterministic aggregation from N samples into a single triage record:
    - categorical fields: majority vote
    - multi-label fields: vote threshold, top-k, union strategy (UNSURE)
  - Document aggregation policy
- Confidence computation
  - Field-level and overall confidence derived from sample agreement
  - Confidence included in the final triage record according to schema
- Schema validation + repair/retry
  - JSON schema validation for every sample
  - Automatic repair path for invalid JSON
  - Metrics: first-pass valid %, post-repair valid %, repair success %
- Abstention
  - `needs_human_triage` set based on confidence thresholds
  - Threshold policy documented
- Evaluation integration
  - Structured LLM v1 evaluated with the same harness as TF-IDF
  - Results saved with run configs captured

## Done when
- Single command produces:
  - aggregated triage JSON for the eval set
  - first-pass valid %, post-repair valid %, repair success %
  - routing metrics via shared evaluator
- All persisted artifacts are JSON
- Single run config fully reproduces reported metrics from cached generations
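A deterministic aggregation step along these lines might look like the sketch below. Several choices are assumptions rather than the project's documented policy: the sample fields `issue_type` and `topics` are hypothetical, ties in the majority vote are broken alphabetically, multi-label fields use the vote-threshold option only, confidence is the agreement fraction, and overall confidence is the minimum of the field-level values.

```python
from collections import Counter


def aggregate_samples(samples, vote_threshold=0.5, abstain_below=0.6):
    """Collapse N sampled triage records into one record with confidences."""
    if not samples:
        raise ValueError("at least one sample is required")
    n = len(samples)

    # Categorical field: majority vote, ties broken alphabetically.
    type_counts = Counter(s["issue_type"] for s in samples)
    issue_type, votes = min(type_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    type_conf = votes / n

    # Multi-label field: keep labels proposed by enough of the samples.
    topic_counts = Counter(t for s in samples for t in set(s["topics"]))
    topics = sorted(t for t, c in topic_counts.items() if c / n >= vote_threshold)
    topic_conf = min((topic_counts[t] / n for t in topics), default=0.0)

    overall = min(type_conf, topic_conf)
    return {
        "issue_type": issue_type,
        "topics": topics,
        "confidence": {
            "issue_type": type_conf,
            "topics": topic_conf,
            "overall": overall,
        },
        # Abstain (route to a human) when agreement is too low.
        "needs_human_triage": overall < abstain_below,
    }
```

Sorting the kept topics and fixing the tie-break rule means the same cached generations always yield the same aggregated record, which is what the "reproduces reported metrics from cached generations" criterion relies on.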
Overdue by 11 day(s) • Due by March 6, 2026
0% complete, 0 open, 0 closed
Sprint 1 - Baselines + metric harness
## Goal:
Deliver a reproducible experiment loop that produces preliminary results for:
- TF-IDF baseline router
- a structured LLM triager that always emits schema-valid JSON

## Deliverables:
- Dataset + label contract
  - Issue ingestion pipeline produces a normalized dataset artifact
  - Routing label space defined, remaining labels mapped to `other`
  - Leakage-prone labels excluded
  - Time-based split artifacts generated (oldest 80% tune, newest 20% test), with counts written to a metadata file
- TF-IDF baseline (multi-label routing)
  - TF-IDF vectorizer over title + body
  - Linear classifier configured for the routing label format
  - Baseline training + evaluation integrated into experiment harness
- Structured LLM v0
  - Prompt requires a single JSON object matching the project schema
  - JSON schema validation enforced
  - Normalization/mapping layer converts model outputs to canonical labels
  - Cached generations saved (JSON) with per-issue metadata
- Evaluation harness
  - Metric computation on TF-IDF and structured LLM outputs
  - Metrics saved as machine-readable artifacts, plus human-readable summary
  - Reproducible runs supported via pinned split files and fixed seeds where possible

## Done when:
- A single command runs end-to-end to produce:
  - TF-IDF metrics on the frozen split (micro/macro F1, top-k accuracy)
  - Structured LLM v0 metrics on a defined evaluation set
  - Structured LLM first-pass schema validity %
- All reported results are reproducible from a clean checkout using the frozen split artifacts and cached outputs
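Because both the TF-IDF baseline and the structured LLM are scored by the same harness, the micro/macro F1 computation can be written once over label sets. The following is a dependency-free sketch, not the project's evaluator: the function names are hypothetical, and a label with no true and no predicted occurrences is assumed to score an F1 of 0.0.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from true-positive, false-positive, false-negative counts."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def micro_macro_f1(y_true, y_pred, labels):
    """Multi-label micro and macro F1 over parallel lists of label sets."""
    counts = {label: [0, 0, 0] for label in labels}  # tp, fp, fn per label
    for true_labels, pred_labels in zip(y_true, y_pred):
        true_set, pred_set = set(true_labels), set(pred_labels)
        for label in labels:
            if label in pred_set and label in true_set:
                counts[label][0] += 1
            elif label in pred_set:
                counts[label][1] += 1
            elif label in true_set:
                counts[label][2] += 1
    # Micro: pool counts across labels. Macro: average per-label F1.
    micro = f1_from_counts(*(sum(c[i] for c in counts.values()) for i in range(3)))
    macro = (
        sum(f1_from_counts(*c) for c in counts.values()) / len(labels)
        if labels
        else 0.0
    )
    return {"micro_f1": micro, "macro_f1": macro}
```

Returning a plain dict keeps the result trivially serializable as the machine-readable metrics artifact.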
Overdue by 25 day(s) • Due by February 20, 2026 • 0/3 issues closed
0% complete, 3 open, 0 closed
Sprint 0: Setup & Data Contract
Deliverables:
- Data fetcher for ~10k issues from GitHub -> one normalized dataset format
- Label processing rules implemented:
  - choose top N (15-25) `topic:*` labels by frequency + min support (>=200)
  - map low-frequency `topic:*` -> other
  - extract:
    - `platform:*`
    - issue type
    - impact flags
- Time-based split (oldest 80% tune / newest 20% test) produced and versioned

Completion criteria: A single command produces {train, val, test}.json + label vocab file
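The label rules and the time-based split above could be sketched as follows. This is illustrative only: it assumes each issue is a dict with a `created_at` timestamp in a uniformly formatted ISO 8601 string (so string order equals time order) and a `labels` list, and the function names are hypothetical.

```python
from collections import Counter


def time_based_split(issues, tune_fraction=0.8):
    """Oldest tune_fraction of issues go to tune, the newest rest to test."""
    ordered = sorted(issues, key=lambda issue: issue["created_at"])
    cut = int(len(ordered) * tune_fraction)
    return ordered[:cut], ordered[cut:]


def build_topic_vocab(issues, max_topics=25, min_support=200):
    """Keep the most frequent topic:* labels that meet the support floor."""
    counts = Counter(
        label
        for issue in issues
        for label in set(issue["labels"])
        if label.startswith("topic:")
    )
    return [
        label
        for label, count in counts.most_common(max_topics)
        if count >= min_support
    ]


def map_topics(labels, vocab):
    """Map topic:* labels outside the vocab to 'other', without duplicates."""
    mapped = []
    for label in labels:
        if not label.startswith("topic:"):
            continue  # platform, type, and impact labels are handled elsewhere
        target = label if label in vocab else "other"
        if target not in mapped:
            mapped.append(target)
    return mapped
```

Sorting before cutting guarantees every tune issue is no newer than any test issue, which is the property a later split integrity check would assert.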
Overdue by 1 month(s) • Due by February 6, 2026 • 1/4 issues closed
25% complete, 3 open, 1 closed
