Full log of all 8 autonomous iteration cycles for ground-truth.
Execution date: 2026-03-23 Operator: TechKnowMad Labs (admin@techknowmad.ai) Protocol: Edgecraft v1.0
Timestamp: 2026-03-23T00:01 Duration: ~5 minutes
Findings:
- 4 modules at <100% coverage:
models.py(88%),aggregator.py(95%),consistency.py(92%),overlap.py(97%) - Missing:
DetectionResultandAggregatedResultvalidation error paths - Missing: Aggregator zero-weights branch, negative add_detector weight
- Missing: Jaccard both-empty and one-empty edge cases
- Missing:
_ngram_recallwith empty claim n-grams
Actions:
- Created
tests/conftest.pywith sharedFixedNLI,ClampingNLI, detector fixtures - Created
tests/test_models.py— 10 tests - Created
tests/test_coverage_gaps.py— 15 tests
Result: Coverage 95% → 100% | Tests: 31 → 57
Timestamp: 2026-03-23T00:06 Duration: ~5 minutes
Findings:
- No
TypeErrorguard ondetect()—Noneinputs produce confusingAttributeError - No validation for
detect_batch()element types
Actions:
- Added
BaseDetector._validate_inputs()static method with clearTypeErrormessages - Added validation calls in all 3 detector
detect()methods - Added element-type checks in
GroundTruthDetector.detect_batch() - Created
tests/test_error_hardening.py— 18 tests (None, empty, huge, unicode, punctuation)
Result: Tests: 57 → 75 | 0 failures
Timestamp: 2026-03-23T00:11 Duration: ~7 minutes
Conjecture: Parallelizing N detect() calls in detect_batch() will yield ~Nx speedup for NLI-heavy workloads.
Actions:
- Refactored
detect_batch()to useThreadPoolExecutorwith index-based result ordering - Added
max_workersparameter toGroundTruthDetector.__init__() - Added
@lru_cache(maxsize=1024)toOverlapDetector._tokenize() - Added
cached_tokenize()module-level utility with@lru_cache(maxsize=2048) - Created
tests/test_performance.py— 9 timing and correctness tests
Measured results:
- Single detect: <1ms for short texts (<100ms threshold)
- 50-pair batch: <10ms (well under 2s threshold)
- Parallel correctness: identical results to sequential
Flywheel: lru_cache pattern applicable to _jaccard and _key_terms functions.
Timestamp: 2026-03-23T00:18 Duration: ~5 minutes
Scan results:
- Hardcoded secrets: 0 real findings
- False positives filtered: 2 (word "token" in tokenize function name; lru_cache label)
- Subprocess/eval/exec: 0 found
- File open() calls: 0 found
- Path traversal: 0 risk (no file I/O)
Actions:
- Created
tests/test_security.py— 13 tests:- SQL injection, shell metacharacters, path traversal, null bytes, format strings → all safe
- Integer, bytes, list as claim/context →
TypeError - ReDoS safety:
\b\w+\bdoes not backtrack catastrophically - Score integrity:
[0,1]on all adversarial inputs
Result: Tests: 75 → 97 | 0 security vulnerabilities
Timestamp: 2026-03-23T00:23 Duration: ~3 minutes
Actions:
- Created
.github/workflows/ci.yml: checkout → setup-python 3.12 → uv install → ruff check → pytest with--cov-fail-under=95 - Created
.pre-commit-config.yaml: ruff (format + check) + mypy hooks - Fixed 13 ruff lint issues in test files (unused imports, import ordering)
Result: CI pipeline active on every push and PR
Timestamp: 2026-03-23T00:26 Duration: ~5 minutes
Actions:
- Created
tests/test_property_based.py— 15 Hypothesis tests across 8 strategies:- Score in
[0,1]for all detectors — 200+ examples each is_hallucination == score > thresholdfor any threshold- No crashes on any random valid string — 300+ examples
DetectionResultasdict() round-trip fidelity- Batch length always == input length
- Jaccard symmetry
J(a,b) == J(b,a) - Jaccard self-similarity
J(a,a) == 1.0 - OverlapDetector threshold monotonicity
- Score in
Hypothesis findings: 0 failures — all invariants hold
Result: Tests: 97 → 112 | 15 property tests green
Timestamp: 2026-03-23T00:31 Duration: ~5 minutes
Actions:
- Created
examples/basic_detection.py: basic grounded vs. hallucinated detection - Created
examples/batch_detection.py: 8-pair batch with timing output - Created
examples/custom_detector.py: custom NLI, weighted aggregation, runtime add_detector - All 3 examples tested — execute without errors
- Added full docstrings to:
_ngram_recall,OverlapDetector.detect,_jaccard,ConsistencyDetector.detect,BaseDetector._validate_inputs
Result: Complete API documentation; 3 runnable examples
Timestamp: 2026-03-23T00:36 Duration: ~5 minutes
Actions:
pyproject.toml: addedauthors,keywords, PyPI classifiers; expanded dev deps- Created
CHANGELOG.md: full change history with metrics table - Created
Makefile:test,test-fast,test-property,lint,format,security,clean,coverage-html,examples - Created
AGENTS.md: protocol documentation - Created
EVOLUTION.md: this file - Tagged
v0.1.0
Final metrics:
| Metric | Value |
|---|---|
| Total commits | 16 |
| Total tests | 112 |
| Test coverage | 100% |
| Property-based tests | 15 |
| Security findings | 0 |
| Examples | 3 |
| Source files modified | 6 |
| New files created | 14 |