The Tool That Turned on Itself: AI-Slop-Detector v2.9.0 → v2.9.1 #37
flamehaven01 announced in Announcements
1. v2.9.0 — Just one more thing
After shipping v2.8.0, I looked at the codebase and had the thought that's never good:
"This is almost there. Just one more thing."
Three "one more things" later, v2.9.0 was done.
1) Problem 1: The tool had no memory
Every run produced a score. That score disappeared.
You had no way to know if a file was getting better or worse. No way to know if the AI had been touching the same file repeatedly, each time nudging the deficit score up a little further.
Most linters (static analysis tools that check code for problems without running it) work this way — scan, report, forget. For tracking AI-generated code quality over time, that's not enough. The direction of change matters as much as the score itself.
The fix: SQLite auto-recording on every run. (SQLite is a lightweight file-based database — no server required, just a single file on disk.)
Two things worth noting:
- `file_hash` means we only write a new row when the file content actually changes. You're not logging every invocation — you're logging every meaningful change.
- `git_commit` and `git_branch` are captured automatically. So when you look at a quality regression (a point where the score got worse), you can tie it to the exact commit that introduced it.

The DB also auto-migrates on first run using safe `ALTER TABLE` (a standard SQL command that adds new columns without destroying existing data) — no manual schema management if you're upgrading from an older version.

```
$ slop-detector my_module.py --show-history

my_module.py — Last 10 runs
──────────────────────────────────────────────
2026-03-08 09:12  deficit=18.4  grade=B  ← today (worse)
2026-03-07 22:41  deficit=12.1  grade=A
2026-03-07 14:03  deficit=11.8  grade=A
──────────────────────────────────────────────
[!] Regression detected: +6.3 over last 3 runs
```

You're no longer looking at a score. You're looking at a direction.
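For the curious, the write-on-change logic amounts to a few lines. This is a minimal sketch, not the actual implementation — the table name, column set, and function name are assumptions; only `file_hash` comes from the notes above (the `git_commit`/`git_branch` columns are omitted for brevity):

```python
import hashlib
import sqlite3

def record_run(conn, path, content, deficit, grade):
    """Append a history row only when the file content actually changed."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS history (
               path TEXT, file_hash TEXT, deficit REAL, grade TEXT,
               ts TEXT DEFAULT CURRENT_TIMESTAMP)"""
    )
    file_hash = hashlib.sha256(content.encode()).hexdigest()
    last = conn.execute(
        "SELECT file_hash FROM history WHERE path = ? ORDER BY rowid DESC LIMIT 1",
        (path,),
    ).fetchone()
    if last and last[0] == file_hash:
        return False  # same content as the previous run — skip the write
    conn.execute(
        "INSERT INTO history (path, file_hash, deficit, grade) VALUES (?, ?, ?, ?)",
        (path, file_hash, deficit, grade),
    )
    conn.commit()
    return True

conn = sqlite3.connect(":memory:")
print(record_run(conn, "my_module.py", "def f(): pass", 12.1, "A"))  # True
print(record_run(conn, "my_module.py", "def f(): pass", 12.1, "A"))  # False
```

Hashing the content rather than timestamping every invocation is what keeps the history meaningful: rerunning the tool in a loop adds nothing until the file itself moves.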
2) Problem 2: AI confidently imports packages that don't exist
This one surprised me more than it should have.
Syntactically valid. Plausible-looking. Broken at runtime.
AI models generate package names that sound right. They don't verify the packages exist. This is one of the more insidious failure modes — code review usually focuses on logic, not on whether the imports resolve.
Existing unused-import detectors won't catch this either. They check whether the import is used in the file. We check whether the import exists in the environment.
The fix: a four-layer resolution index.
The resolver checks each layer in order:

1. `sys.builtin_module_names` — modules compiled into the interpreter
2. `sys.stdlib_module_names` — the standard library, including internals like `_thread` and `_collections_abc`
3. `importlib.metadata.packages_distributions()` — installed third-party packages, handling import-name mismatches like `PIL` ≠ `Pillow` and `cv2` ≠ `opencv-python`
4. `importlib.util.find_spec` fallback — anything still unresolved (`find_spec` returns `None`) is reported as phantom

One important detail: relative imports are excluded by design.
A relative import (like `from . import utils`) references another file within the same project — it's not a third-party package and can't be phantom. Regex would accidentally flag these as missing packages. AST parsing knows the difference because it reads the actual syntax (`node.level > 0` means "this import is relative"), not just the text.

The detector errs toward false negatives (missing a real problem) on resolution errors rather than false positives (wrongly flagging valid code). In plain terms: if the tool isn't sure whether a package exists, it assumes it does. A false alarm on a legitimate import would erode trust in the tool faster than a missed phantom.
3) Problem 3: ML accuracy of 1.000 is a bug, not a result
After integrating the ML pipeline in v2.8.0, the evaluation came back with Accuracy, Precision, Recall, and F1 all at 1.000.
In machine learning, Accuracy measures how often the model is right overall, Precision measures how rarely it raises false alarms, Recall measures how rarely it misses real problems, and F1 is a combined score. Getting 1.000 on all four simultaneously is not impressive — it is a red flag.
Not a success. A data leakage problem. Data leakage means the model was accidentally given information during training that it wouldn't have in real use — making the results look far better than they actually are.
Labels (the "correct answers" used to train the model) were generated from `deficit_score >= 30`. Features (the input signals the model learns from) were `ldr_score`, `inflation_score`, and `ddc_score` — the exact components that sum to produce `deficit_score`. The model learned to reproduce an addition formula. We trained a calculator.

2. The v2.9.0 fixes
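The leakage is easy to reproduce in miniature. A sketch with made-up numbers — when the label is computed from the sum of the features, a "model" that simply re-adds the features scores a perfect 1.000 without learning anything:

```python
import random

random.seed(0)

# Hypothetical feature triples standing in for the ldr/inflation/ddc sub-scores.
rows = [(random.uniform(0, 20), random.uniform(0, 20), random.uniform(0, 20))
        for _ in range(1000)]

# Leaky labels: derived directly from the sum of the features themselves.
labels = [(ldr + inf + ddc) >= 30 for ldr, inf, ddc in rows]

# A "model" that just re-adds its inputs reproduces the labels exactly —
# no learning happened; the answer was baked into the features.
predictions = [sum(row) >= 30 for row in rows]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(accuracy)  # 1.0
```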
Real codebase data. CodeSearchNet (a public dataset of 500k Python functions scraped from open-source repositories) has structure and patterns that our synthetic generator can't produce. Even with self-supervised labels — meaning we let our own math engine label the data rather than humans — a genuinely different distribution (variety of code styles and patterns) forces the model to generalize beyond simply memorizing our formula.
History data as training signal. Files that go from `deficit=42` to `deficit=0` across real development runs represent longitudinal change that no generator can produce. We added `--export-history` specifically so this data can feed back into the ML pipeline.

```
slop-detector --export-history training_data.jsonl
# → JSONL ready for DatasetLoader.load_jsonl()
```

One other change: `RandomForestClassifier` now trains with `class_weight="balanced"`. Real slop rate in public codebases is around 4%. Without balancing, the model just predicts "clean" for everything and gets 96% accuracy — which is useless.

3. v2.9.1 — We ran it on ourselves
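The imbalance argument is just arithmetic, and worth seeing once. A sketch using the ~4% rate mentioned above:

```python
# With a 4% slop rate, always predicting "clean" already scores 96% accuracy —
# and 0% recall on the class we actually care about.
n = 1000
labels = [True] * 40 + [False] * 960   # 4% slop, 96% clean
predictions = [False] * n              # degenerate majority-class "model"

accuracy = sum(p == y for p, y in zip(predictions, labels)) / n
recall = sum(p and y for p, y in zip(predictions, labels)) / sum(labels)
print(accuracy, recall)  # 0.96 0.0
```

`class_weight="balanced"` counters this by weighting each training sample inversely to its class frequency, so the rare "slop" class can't simply be ignored.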
A few hours after shipping v2.9.0, we did the obvious thing: ran `slop-detector --project src/` on the detector's own source. It found three files above the threshold.
We had to fix our own code. Here's what that looked like.
We also updated the README during this process to reflect what the self-inspection made obvious:
> It didn't flag those functions because AI wrote them. It flagged them because they were too complex. That's the whole idea.
1) cli.py (53.5 → 29.1): five god functions decomposed
`print_rich_report`, `main`, `generate_markdown_report`, `generate_text_report`, and `_handle_output` each exceeded the complexity-10 / 50-line threshold that triggers the `god_function` pattern. A "god function" is a function that tries to do everything — too long, too complex, impossible to test or reason about in isolation.

Extracted 13 single-responsibility helpers:

- `print_rich_report()` → `_build_rich_summary_tables`, `_build_rich_files_table`, `_render_rich_project`, `_build_single_file_content`, `_append_pattern_issues_rich`, `_render_rich_single_file`
- `main()` → `_build_arg_parser`, `_evaluate_ci_gate`, `_run_optional_features`
- `_handle_output()` → `_write_file`
- `generate_markdown_report()` → `_md_summary_section`, `_md_test_evidence_section`, `_md_findings_section`

`main()` complexity dropped from 25 → 14. Every function now fits within limits. The irony of a slop detector with god functions wasn't subtle.

2) registry.py (39.5 → clean): `global` statement + DDC false positive

Two separate issues flagged here.
**The `global` statement.** `registry.py` used a lazy-init singleton pattern — meaning: a single shared instance of an object, created only on first use, stored in a global variable. The pattern detector flags `global` statements as structural anti-patterns (they create hidden shared state that makes code harder to test and reason about) — and correctly so. We replaced it with eager module-level initialization (creating the object once when the module loads), which also happens to be thread-safe.

**DDC false positive.** DDC (Dependency-to-Code ratio) measures what fraction of imported packages are actually used. `BasePattern` was imported only for type annotations — hints that tell other developers and type-checking tools what data types a function expects, but which are never executed at runtime. The usage checker skips annotation-only usage when calculating DDC, which meant it scored the import as 0% used — a false alarm.

The fix: move annotation-only imports under `if TYPE_CHECKING:` — a Python convention that says "only process this import when a type checker is running, not at runtime." DDC now recognizes them as type-checking imports, excluded from the usage ratio by design.
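Both fixes fit in one small sketch. The module path `patterns.base` and the registry shape are assumptions — only `BasePattern`, `TYPE_CHECKING`, and the eager-init idea come from the text above:

```python
from __future__ import annotations  # annotations become lazy strings (PEP 563)

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Hypothetical path — read only by type checkers, never imported at runtime,
    # so DDC excludes it from the usage ratio by design.
    from patterns.base import BasePattern

# Before: lazy-init singleton rebound via a `global` statement (flagged).
# After: eager module-level initialization — created once at import time,
# which is also thread-safe because module import is guarded by a lock.
REGISTRY: dict = {}

def register(name: str, pattern: "BasePattern") -> None:
    """No `global` needed: we mutate, rather than rebind, the module-level dict."""
    REGISTRY[name] = pattern

register("god_function", object())  # stand-in for a real pattern instance
print("god_function" in REGISTRY)  # True
```

The subtle point: `global` is only required to *rebind* a module-level name. Mutating an eagerly created object needs no such statement, which is why the anti-pattern disappears along with the lazy init.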
3) question_generator.py (30.0 → clean): same DDC fix + Python 3.8 compatibility
Same `TYPE_CHECKING` fix for `FileAnalysis`.

Additionally: the file had drifted to Python 3.10+ union syntax — writing `int | None` instead of `Optional[int]` to mean "this can be an integer or nothing." The newer syntax is cleaner, but it breaks on Python 3.8 and 3.9. Converted back to `Optional[int]`, `Optional[str]`, `List[Q]`. The project targets `python >= 3.8` and that constraint has to hold.

4) The final numbers
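For illustration, the two spellings side by side — `pick_question` is a hypothetical function, not one from the codebase:

```python
from typing import List, Optional

# Works on every supported version (3.8+):
def pick_question(seed: Optional[int] = None) -> List[str]:
    """Hypothetical helper; the signature is what matters here."""
    return ["q1"] if seed is None else [f"q{seed}"]

# The 3.10+ spelling of the same signature fails at import time on 3.8/3.9
# (unless the file opts in via `from __future__ import annotations`):
#   def pick_question(seed: int | None = None) -> list[str]: ...

print(pick_question())   # ['q1']
print(pick_question(7))  # ['q7']
```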
188 tests. Zero regressions. Under 5 seconds.
4. What this looks like in practice
5. Honest limitations
Flamehaven's working principle is straightforward: honesty first, then trust. So before the final thoughts, here is what this tool cannot honestly claim yet.
1) This is a one-person project, and the default weights reflect that.
The core scoring formula:
These numbers were not derived statistically. They were tuned against Flamehaven's internal codebase and hardcoded. They work well for what we build. For a framework like Django or Spring — where boilerplate (repetitive structural code that the framework requires, not that the developer chose) is a given — the `ldr` weight of 0.40 will likely generate false positives (incorrectly flagging legitimate code as problematic). This is a known and honest limitation of a tool built by one developer against one architecture.

The mitigation exists: `.slopconfig.yaml` lets you adjust weights per project. But the burden is currently on the user to know they need to. That's not good enough for a general-purpose gate, and we know it.

2) The ML pipeline took a different path — and we think it's the right one.
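What such an override might look like — note that the key names and values in this sketch are assumptions; only the `ldr` weight's 0.40 default and the `.slopconfig.yaml` filename come from the text above:

```yaml
# .slopconfig.yaml — hypothetical sketch of a per-project weight override
weights:
  ldr: 0.25        # lowered from the 0.40 default for boilerplate-heavy frameworks
  inflation: 0.35
  ddc: 0.40
```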
The tool includes an optional machine learning layer that can score files alongside the main math-based engine. The honest problem: the training labels were generated by the math engine itself (`deficit_score >= 30` = slop). The features fed to the ML model are the exact components that produce that score. In practice, the model learned to approximate the formula it was trained on — not to detect anything independently. We trained a calculator.

The obvious fix would be: pull real-world code from a large public dataset (we tried CodeSearchNet — 500k Python functions), label it "AI-generated" vs "human-written," and train a classifier. Accuracy goes up. But here's why we didn't go that route:
The entire premise of this tool is that authorship doesn't matter. A god function written by a senior engineer at 2am is just as problematic as one generated by Copilot. If we train a model to detect "AI code," we're working directly against our own stated philosophy:
So instead, we redirected the ML toward a different question entirely — not "who wrote this?" but "is this getting worse?"
A file that goes from `deficit=12` to `deficit=28` across three editing sessions is a signal worth catching — regardless of whether a human or an AI made those edits. That's the signal we want the ML to learn from: real usage history, not authorship labels.

The ML module stays `EXPERIMENTAL` until enough history data accumulates from real usage to train against. We are not there yet. We will say so plainly.
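A trend signal of that shape is cheap to compute even before any ML is involved. A minimal sketch — the function name and the alert threshold are assumptions:

```python
def regression_delta(history, window=3):
    """Change in deficit over the last `window` runs (positive = getting worse)."""
    recent = history[-window:]
    return recent[-1] - recent[0]

runs = [11.8, 12.1, 18.4]   # deficit scores, oldest -> newest
delta = regression_delta(runs)
if delta > 5.0:             # hypothetical alert threshold
    print(f"[!] Regression detected: +{delta:.1f} over last {len(runs)} runs")
```

The point of feeding this history to the ML layer is that the *trajectory* carries information a single-run score cannot: the same deficit of 18 means something different on a file that was at 12 yesterday than on one that has sat at 18 for a year.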
The tool's core strength is that it reads code structure, not code text. For Python, this means parsing the actual Abstract Syntax Tree (AST) — the logical skeleton of the code — to measure things like how deeply nested a function is, or how complex its control flow is. Text-based tools (like standard linters) can miss these structural problems entirely.
For JavaScript and TypeScript, we achieve the same depth only when an optional library called `tree-sitter` is installed. Without it, the analyzer falls back to regex pattern matching.

Regex can find suspicious text patterns. It cannot accurately measure nesting depth or cognitive complexity (how hard a function is for a human brain to follow — the number of branches, loops, and conditions that must be held in mind simultaneously). This means a JS/TS file with deeply nested god functions may score clean if `tree-sitter` is absent — which directly contradicts the promise that this engine reads structure, not appearance.

It is the most technically inconsistent part of the current codebase. Patch is targeted for v2.9.2.
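To make the contrast concrete, here is roughly what "reading structure" means, sketched with Python's own AST (the real analyzer's metrics are more involved, and `max_nesting` is a hypothetical name): a parser measures nesting depth directly from the tree, which no regex over the raw text can do reliably:

```python
import ast

def max_nesting(source: str) -> int:
    """Depth of nested control-flow blocks — a structural metric regex can't see."""
    nesting_nodes = (ast.If, ast.For, ast.While, ast.With, ast.Try)

    def depth(node, current=0):
        best = current
        for child in ast.iter_child_nodes(node):
            bump = current + 1 if isinstance(child, nesting_nodes) else current
            best = max(best, depth(child, bump))
        return best

    return depth(ast.parse(source))

flat = "if a:\n    pass\nif b:\n    pass\n"
nested = "if a:\n    if b:\n        if c:\n            pass\n"
print(max_nesting(flat), max_nesting(nested))  # 1 3
```

Both snippets contain the same number of `if` tokens, so a text-level scan sees them as equivalent; only the tree reveals that one is three levels deep. `tree-sitter` plays the same role for JS/TS that `ast` plays here.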
4) There is no wild benchmark.
188 tests pass at 100%. Every one of them was written against fixtures we created — hand-crafted worst-case examples designed specifically to be caught. There is no published result of running this tool against Django, FastAPI, NumPy, or any large external codebase we did not build.
Until that exists, "this tool works" means "this tool works on code that looks like ours." That is a meaningful claim. It is not a universal one.
Wild benchmark against at least one major open-source Python project is the first prerequisite before v3.0.
6. Final thoughts
We shipped v2.9.0 to give the tool memory. We shipped v2.9.1 because the tool used that memory against us.
And somewhere in the middle of patching our own god functions, the README update started to feel less like a tagline and more like a description of what actually happened:
A lot of "AI code quality" tooling frames the question as did a human write this or did an AI? We spent two releases proving that's the wrong question.
`cli.py` had five god functions. The tool found them. Whether Copilot wrote them or we did at 2am is beside the point. The score is the score.
Over time, the history database becomes a longitudinal record — a timeline of how code quality actually moves, not just what it scores at a single point — of how code evolves under AI-assisted development. The ML pipeline now has a path to train on real signals — not synthetic ones we constructed ourselves. And the self-inspection result is the clearest proof of concept we've had: the math found the problems, and we fixed them.
That's the whole idea. It was always the whole idea.
7. Repository & Documentation
- VS Code Extension — install directly from the marketplace
- GitHub — open source, zero core dependencies
- `pip install ai-slop-detector`
What part of your codebase would you least want a structural analyzer to look at? That's probably where you should start.