- Rust-accelerated scorer backend wired into
CoherenceScorer(scorer_backend="rust") - WebSocket multiplexed streaming (concurrent sessions per connection, cancel, backpressure)
- VectorBackend entry-point registry (
register_vector_backend,get_vector_backend,list_vector_backends) - Tenant-isolated VectorStores via
TenantRouter.get_vector_store() /v1/tenants/{tenant_id}/vector-factsREST endpoint- ONNX INT8/FP16 quantization shipped in v2.3.0 (retired from planned)
- StreamingKernel wired into
CoherenceAgent.stream()for unified token-level oversight - gRPC incremental streaming with per-chunk coherence scores
- CLI multi-worker config propagation (
--workers) - ONNX batch config wiring end-to-end (
onnx_pathin DirectorConfig) - Prompt content removed from logs — HMAC audit hashing
guard()duck-type detection for OpenAI/Anthropic SDK interceptorstrict_modereject inCoherenceScorer.review()- Domain profiles: medical, finance, legal, creative, customer_support, summarization
- Discord bot + CI/release webhook automation
- E2E benchmark context leakage fix (per-sample store isolation)
- InputSanitizer additive scoring model fix
- Domain-specific scoring profiles (medical, finance, legal, creative, customer_support, summarization)
strict_modereject in CoherenceScorerguard()SDK interceptor with duck-type provider detection- Persistent stats backend (SQLite)
- AGPL §13
/v1/sourcecompliance endpoint
- Lite scorer backend (
scorer_backend="lite") — word overlap + negation heuristics, ~0.5ms/pair - Multi-turn conversation tracking (
ConversationSession) with cross-turn divergence blending - ONNX GPU batch optimization (
OnnxDynamicBatcher) with IO binding for zero-copy transfers - Plugin architecture for scorer backends (
ScorerBackendABC + entry-point registry) - gRPC transport (
proto/director.proto,--transport grpcon CLI) - Multi-GPU sharding (
ShardedNLIScorer) with round-robin device routing - Security audit preparation: threat model, SBOM generation, Hypothesis fuzz tests,
InputSanitizerhardening - Public API freeze:
__all__on all modules, deprecated aliases emitDeprecationWarning
- API autodoc pages for DirectorConfig, Enterprise, InputSanitizer
- Troubleshooting guide, enterprise guide, streaming cadence examples
- Validation rules section in scorer reference
score_every_n,adaptive,max_cadenceon StreamingKernel + AsyncStreamingKernel- Runtime validation on threshold, soft_limit, w_logic, w_fact
- Streaming overhead benchmark (tokens/sec by cadence)
- Enterprise modules lazy-loaded via
__getattr__ [enterprise]optional dependency group + pytest marker
director-ai benchCLI subcommand (--dataset, --seed, --output)scorer_backend="hybrid"mode (NLI + LLM judge)- Architecture deep-dive, production checklist, threshold tuning docs
- PineconeBackend, WeaviateBackend, QdrantBackend
- Bandit + Semgrep SAST in CI
- Case-sensitivity fix in GroundTruthStore
- LLM judge error handling hardened
- SafetyKernel hard_limit validation
- Thread-safe OTel setup
- Histogram bucket_counts O(n log n) optimization
- Simplified public API:
guard()as the primary interface; enterprise behinddirector_ai.enterprise - Adaptive threshold calibration:
director-ai tunewith labeled data → optimal threshold + weights - Remove deprecated 1.x aliases: all 6 deprecated methods removed; 1.x class name aliases already removed in 2.x
- Drop Python 3.10: minimum Python 3.11 for
ExceptionGroupandTaskGroupsupport
- Fix NLI confidence margin calculation —
nli_marginnever computed, hybrid escalation broken - LLM judge verdict caching (LRU keyed on prompt+response hash) to avoid redundant API calls
- Retry with exponential back-off on transient LLM API failures
- Escalation-rate telemetry via
metrics.counter("llm_judge_escalations") - Run hybrid-mode E2E benchmark on HaluEval (300 traces) and publish numbers
PostgresAuditSink.log()implementation with async connection pooling (asyncpg)- Schema migration framework (version-tracked DDL with forward-only migrations)
RedisGroundTruthStore.retrieve_context()implementation with Redis Vector Search (RediSearch)- Redis connection pooling, TTL management, batch
add_many()/retrieve_batch()
- CI pipeline:
wasm-pack buildin wheels.yml, publish.wasm+ JS glue to npm - Browser integration tutorial (vanilla JS + webpack example)
- Benchmark script exists (
benchmarks/wasm_overhead_bench.py) but no production WASM code ships
- PyO3 0.23 → 0.24 upgrade (unblocks Python 3.14 wheels)
- SIMD vectorization of backfire-ssgf micro-cycle inner loop (
std::simdorpacked_simd2)
- Run RAGTruth + FreshQA full-scale GPU benchmark, publish results in BENCHMARK_REPORT.md
- Cross-platform latency profiling (Windows/macOS/Linux) with memory + GC overhead
- Quantify PyO3 FFI overhead (Rust-native vs Python-via-FFI round-trip)
- FAISS backend (in-process dense search for edge/offline deployments)
- Elasticsearch backend (hybrid BM25 + dense retrieval)
- Fix
quickstartCLI scaffolding brokenasyncio.run()on sync methods - Implement
BatchProcessor.process_batch_async()— docstring-advertised method missing
- Add
__aiter__to Bedrock/Gemini/Cohere guarded stream wrappers (parity with OpenAI/Anthropic) - Add
async aadd()/aquery()defaults onVectorBackendABC for non-blocking server use - Parallelize
AnthropicProvider/HuggingFaceProvidermulti-candidate requests
- Add
LiteScorer.review()returning(bool, CoherenceScore)to matchCoherenceScorerinterface
- Validate
reranker_model/embedding_modelnon-empty when feature enabled - Warn on unknown YAML keys in
DirectorConfig.from_yaml()
- End-to-end
scorer.review(session=...)cross-turn divergence test review_batch()ordering, partial failure, and timeout testsbuild_store()withvector_backend="sentence-transformer"branch test
- Version bump to 3.3.0 in pyproject.toml,
__init__.py, CITATION.cff - CHANGELOG.md entries for v3.1.0, v3.2.0, v3.3.0
- Deprecated 1.x alias table removed from PUBLIC_API.md
- Generated
director_pb2.py/director_pb2_grpc.pyfrom proto/director.proto - Removed SimpleNamespace fallback; fail-fast if protobuf stubs missing
CoherenceAgent.aprocess()async counterpart- CLI
--chunk-sizevalidation (reject <= 0) cors_originsdefault changed from"*"to""(require explicit config)--cors-originsflag ondirector-ai serve- 8 new tests in test_v330_hardening.py (1927 total, 0 failures)
- H_logical and H_factual parallelised via
ThreadPoolExecutor(~40% latency reduction) CoherenceScorer.review_batch()— coalesced batch NLI (2 GPU calls when NLI available)BatchProcessor.review_batch()delegates to scorer with serial fallbackReviewQueue— server-level continuous batching for/v1/reviewwith flush window- Config fields:
review_queue_enabled,review_queue_max_batch,review_queue_flush_timeout_ms - TensorRT path verified deployment-ready (no code changes needed)
- Async hygiene: 5 sync→async fixes in server.py, sessions lock, OTel lazy init
- 1966 tests, 0 failures
- Local DeBERTa-v3-base binary judge replaces LLM judge for borderline NLI escalation (F1=0.915, latency ~15ms vs 1.3–14.2s, zero API cost)
- Summarization FPR reduced from 95% to 25.5% (three-phase fix):
- Phase 1: MIN inner aggregation (95% → 60%)
- Phase 2:
premise_ratio=0.85+ logic aggregation bug fix (60% → 42.5%) - Phase 3:
w_logic=0(eliminate h_logic==h_fact duplication),_use_prompt_as_premise=True(bypass lossy vector store),trimmed_meanouter aggregation (42.5% → 25.5%)
- Summarization profile:
w_logic=0.0, w_fact=1.0,coherence_threshold=0.15,nli_fact_outer_agg="trimmed_mean",nli_use_prompt_as_premise=True _heuristic_coherenceshort-circuits logical divergence whenW_LOGIC < 1e-9- Configurable
nli_fact_retrieval_top_kandnli_use_prompt_as_premiseconfig fields - Summarization FPR diagnostic benchmark (
benchmarks/summarization_fpr_diag.py) workflow_dispatchadded topublish.ymlanddocker.yml(fix GITHUB_TOKEN anti-loop)- 2038 tests, 0 failures
- Dialogue FPR: 97.5% → 4.5% via bidirectional NLI + baseline calibration
_detect_task_type()classifies dialogue via speaker-turn regex_dialogue_factual_divergence()scores both directions, applies baseline calibration- Logical divergence skipped for dialogue (entailment is meaningless)
- Diagnostic benchmark:
benchmarks/dialogue_fpr_diag.py(4 baseline configs)
- Summarization FPR: 25.5% → 10.5% via bidirectional NLI + baseline=0.20
_summarization_factual_divergence()scores source→summary and summary→source, takes min- Baseline calibration:
adjusted = max(0, (raw - 0.20) / 0.80) nli_summarization_baselineconfig field (default 0.20)- Bidirectional FPR diagnostic benchmark (
benchmarks/summarization_fpr_diag.py) - 13 new tests (
tests/test_summarization_bidir.py) - 2065 tests, 0 failures
- Summarization FPR: 10.5% → 2.0% via Layer C (claim decomposition + coverage scoring)
NLIScorer.score_claim_coverage()decomposes summaries into atomic claims- Config:
nli_claim_coverage_enabled,nli_claim_support_threshold(0.6),nli_claim_coverage_alpha(0.4) ScoringEvidenceincludesclaim_coverage,per_claim_divergences,claims- 21 new tests, 2084 total
- Sentence-level attribution:
ClaimAttributionmaps claims to source sentences - Cost transparency:
ScoringEvidence.token_count,estimated_cost_usd - Domain benchmarks: medical_eval (MedNLI + PubMedQA), legal_eval (ContractNLI + CUAD), finance_eval (FinanceBench)
- Fine-tuning pipeline:
finetune_nli(),FinetuneConfig,FinetuneResult, CLIdirector-ai finetune - TensorRT export:
export_tensorrt(), CLIdirector-ai export --format tensorrt - ONNX CUDA: 4.5ms/pair median (2.4x faster than PyTorch)
- Distill smaller NLI model (DeBERTa-base from FactCG-Large teacher + hybrid labels) — deferred, 22/23 fine-tunes hurt (catastrophic forgetting)
- ReviewQueue adaptive flushing (dynamic max_batch based on request rate) — deferred to v3.10+
- Per-task-type adaptive thresholds (+0.86pp balanced accuracy over global baseline)
- AggreFact evaluation pipeline with cached scoring + per-dataset sweep
- GPU benchmarking infrastructure (UpCloud L40S)
- 32 fine-tuning experiments (LoRA, distillation) — all negative, documented
- Dataset-type classifier (RF-20-d6, 77.08% BA, +1.22pp over global)
- Classifier auto-discovery from bundled model when
adaptive_threshold_enabled=True - Threshold hierarchy: global → per-task-type → per-dataset (most specific wins)
- Security: proxy HMAC auth, HTTPS enforcement, tenant-to-API-key binding, session ownership, WebSocket info leak fix
- Documentation: why-director-ai, migration-v2-v3, glossary, runbooks
- Secret removal from git history (expired HF token)
- B615 HF revision pinning: pin
revision="<sha>"on all 11from_pretrained()calls to prevent supply-chain risk - SPDX
AGPL-3.0-or-laterheaders on all source files - ModernBERT-large (8192 tokens) as alternate NLI backend — only path to >78% BA
- Stripe Checkout page + HMAC-SHA256 license key generation
- B608 ChromaDB parameterised filters (low priority, no user input reaches these)