feat(semantic): [alpha build] provider-aware typed embeddings, reranking, diagnostics, and eval harness #87
Add scripts, docs, Dockerfile, and package.json scripts for Docker-based Rust validation (fmt/check/clippy/test) so Windows users without MSVC Build Tools can still validate Rust code.
- scripts/docker-rust.ps1: PowerShell script supporting fmt/check/clippy/test/validate/shell tasks with persistent Docker volumes
- Dockerfile.rust: minimal Rust image with rustfmt + clippy pre-installed
- docs/docker-rust-validation.md: full usage and design documentation
- package.json: 6 new docker:rust:* convenience scripts
Design: Linux-target validation via rust:1-bookworm, persistent cargo volumes for caching, fail-fast sequential validation.
…rough, fingerprint upgrade
…or pruning, write-lock sync
…pgrade, invalidation tests
- SemanticFilePolicy config struct with include_code/include_docs/include_configs/binary_detection/generated_file_detection/globs
- parse_semantic_files_config handler in configure.rs
- File policy evaluation: should_index_file(), is_generated_file(), is_config_file(), is_docs_file()
- Docs chunker: collect_docs_chunks() with heading-based splitting for markdown, splitting by file for other doc types
- collect_chunks routes doc files through docs chunker, skips binary/generated/config files per policy
- SemanticIndexFingerprint extended with file_policy_hash and docs_chunker_version; diff() triggers rebuild on policy change
- build_with_progress/refresh_stale_files accept &SemanticFilePolicy
- compute_file_policy_hash() deterministic hash of policy fields
- Re-export SemanticFilePolicy from semantic_index module
- All test callers updated with &SemanticFilePolicy::default()
…iority ordering, backoff
- CancellationToken (Arc<AtomicU64> generation counter) for cooperative build cancellation on reconfigure
- Cancel old semantic index builds instead of detaching when config changes
- Priority file ordering: README/docs first, then core source, then tests, then rest
- Embedding backoff: exponential retry with jitter for remote provider rate limits
- SemanticIndexStatus::Partial variant with completeness percentage for partial builds
- Search reports partial index state during cold start
- Phase-boundary cancellation checks between model init, disk read, incremental refresh, and full rebuild
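A minimal sketch of the generation-counter idea behind cooperative cancellation, assuming a build remembers the generation it started under and checks it at each phase boundary. `CancelToken` and `run_phases` are hypothetical names for illustration, not the types in this PR.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Hypothetical token: a shared generation counter (illustrative, not AFT's API).
#[derive(Clone)]
pub struct CancelToken {
    generation: Arc<AtomicU64>,
}

impl CancelToken {
    pub fn new() -> Self {
        Self { generation: Arc::new(AtomicU64::new(0)) }
    }

    /// A build calls this once at start and keeps the returned value.
    pub fn current(&self) -> u64 {
        self.generation.load(Ordering::Acquire)
    }

    /// Reconfigure bumps the counter, which invalidates every older build.
    pub fn cancel_older(&self) -> u64 {
        self.generation.fetch_add(1, Ordering::AcqRel) + 1
    }

    pub fn is_cancelled(&self, my_generation: u64) -> bool {
        self.generation.load(Ordering::Acquire) != my_generation
    }
}

/// Runs `phases` units of work, checking for cancellation before each one.
/// Returns how many phases actually ran.
pub fn run_phases(
    token: &CancelToken,
    my_generation: u64,
    phases: usize,
    mut work: impl FnMut(usize),
) -> usize {
    for phase in 0..phases {
        if token.is_cancelled(my_generation) {
            return phase;
        }
        work(phase);
    }
    phases
}
```

A build that is cancelled mid-phase still finishes that phase; only the next boundary check stops it, which is what makes the scheme cooperative rather than preemptive.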
Add Perplexity backend with InputMode::DocumentChunks support for contextualized embedding where chunks carry document-level context.
- SemanticBackend::Perplexity variant with config, profile, engine
- DocumentChunks/PerDocumentChunks/DocumentEmbeddings structs
- embed_document_chunks() routes Perplexity to grouped embedding API
- build_with_progress_contextualized() groups chunks by document
- Wire configure.rs to branch on input_mode: DocumentChunks
- SemanticEmbeddingModel::input_mode() public accessor
- EmbeddingModelProfile with contextualized_supported guard
- Response validation: index continuity, missing documents, dimension
…to trait-backed module
Bead: aft-t6p.12
Extracts Vec<EmbeddingEntry> storage and search from SemanticIndexSnapshot into a VectorStore trait with FlatF32VectorStore implementation. This decouples the storage layer from the lifecycle logic and prepares for alternative backends (binary Hamming, approximate ANN).
Key changes:
- vector_store.rs: VectorStore trait + ScoredChunk/PruneStats types
- FlatF32VectorStore: flat scan with cosine similarity (preserves existing behaviour exactly)
- FlatBinaryHammingVectorStore: forward-looking Hamming-search impl
- SemanticIndexSnapshot delegates search/len/prune/entries to store
- Fixed dimension-sync bug where set_dimension updated the snapshot dimension but not the store dimension, causing search to return 0
- EmbeddingEntry and IndexedFileMetadata made pub for trait compatibility
On Windows, use copyFileSync for the binary replacement (which overwrites the target — renameSync fails with EEXIST). If it fails, the original binary at binaryPath is preserved. The temp file cleanup is now wrapped in its own try/catch so a cleanup failure does NOT propagate as a download failure — the binary was already successfully placed at binaryPath. Addresses PR cortexkit#69 cubic review finding P2.
Implement bead aft-t6p.24: file identity manifest + vector ownership records.
Changes:
- **FileRecord struct**: identity record with content_hash, size_bytes, mtime, language, document_kind, inclusion_policy_hash, indexed_at
- **file_manifest on SemanticIndexSnapshot**: HashMap<PathBuf, FileRecord> tracking which files produced which vectors, enabling precise stale-vector pruning when files are edited, deleted, or excluded
- **V8 serialization format**: extends V7 with per-entry chunk_hash (after each vector) and file manifest block (after all entry vectors). Full backward compatibility with V1-V7 reads.
- **chunk_hash on EmbeddingEntry**: deterministic hash of chunk content fields for tracing which version of a chunk produced a stored vector
- **compute_chunk_hash**: blake3-based deterministic hash
- **build_manifest_from_store helper**: populates file_manifest from store's file_metadata, called in all builder functions (build_from_chunks, build_with_progress_contextualized, refresh_stale_files) and from_bytes for V1-V7 cache migration
- **next_chunk_id, fingerprint_string**: forward-looking fields on snapshot for future unique ID assignment and fingerprint tracking
…rmalization, and model profiles
Adds aft-t6p.20 (Typed embedding vector representation + storage-strategy resolution):
- TypedVector (source-side) and StoredVector (persisted) enums with DenseF32, DenseInt8, BinaryPacked, and Quantized variants
- StorageStrategy (NativeF32, DecodeNormalizeF32, BinaryPacked)
- VectorKind enum for runtime type tagging
- DistanceMetric (Cosine, DotProduct, Euclidean, Hamming)
- NormalizationPolicy (AlreadyNormalized, NormalizeOnInsertQuery, NotApplicable)
- EmbeddingModelProfile fields: source_vector_kind, stored_vector_kind, metric, normalization
- convert_vector() / validate_compatible() on EmbeddingModelProfile
- blake3 dependency for chunk hashing
… + dummy base_url for Perplexity profile test
Two fixes for `fingerprint_invalidation_tests`:
- Mock HTTP server now lowercases header names before matching Content-Length (reqwest/hyper sends lowercase `content-length:`).
- `base64_int8_profile_from_config_selects_correctly` test provides a dummy `base_url` for the Perplexity backend (required by `from_config`).
Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
- Add StorageStrategy::BinaryPacked variant for packed-bit vector storage
- Add EmbeddingModelProfile::perplexity_binary() with BinaryPacked → Hamming path
- Wire from_config to select perplexity_binary profile when Base64Binary encoding
- Implement parse_embedding_value for Base64Binary (decode → 0.0/1.0 f32 vec)
- Implement into_stored for TypedVector::BinaryPacked (requires BinaryPacked strategy)
- Update validate_config and validate_compatible to accept Base64Binary+BinaryPacked
- Replace old "not yet supported" test with parse_embedding_value_base64_binary_succeeds
- 886/893 tests pass (7 pre-existing Docker failures)
Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
Add semantic_diagnostics module with SearchDiagnostics, SearchPipelineType, SearchWarning, SearchMetricsCollector, PhaseTimer, score_statistics, top1_margin. Instrument handle_semantic_search with per-phase timing and warning collection. Wire SearchMetricsCollector into AppContext. 17 new tests, 902/910 lib tests pass (8 pre-existing Docker failures). Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
- Add SemanticDiagnosticsLogger with file append, rotation (50 MB), and retention cleanup (file-deletion based on mtime)
- Add SearchDiagnosticsEvent struct for JSONL serialization with raw_query redaction (opt-in via include_raw_queries) and snippet placeholder (include_snippets)
- Add config fields: jsonl_logging, jsonl_path, include_raw_queries, include_snippets, retention_days to SemanticBackendConfig
- Add lazy-init diagnostics_logger on AppContext with resolve_diagnostics_log_path helper (env var → project root → ~/.cache)
- Wire JSONL record into handle_semantic_search diagnostics block
- 4 new tests: raw query redaction, raw query inclusion, disk write verification, missing-file recovery
- 907/914 lib tests pass (7 pre-existing Docker failures)
Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
…rch output Add DiagnosticsOutputMode enum (Off/Minimal/Verbose) and output_mode field to SemanticBackendConfig. Implement format_diagnostics_prefix() for Minimal (warnings only) and Verbose (scores + latency + warnings) output modes. Wire into handle_semantic_search response text. 4 new tests, 25 diagnostics tests total. 910/918 lib tests pass (8 pre-existing Docker failures). Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
Add optional reranking via OpenAI-compatible chat endpoint. When enabled, aft_search overfetches candidates, sends them to a reranker model, and re-sorts by relevance. Falls back gracefully on any error.
- Add RerankConfig fields to SemanticBackendConfig (rerank_enabled, rerank_model, rerank_base_url, rerank_api_key_env, rerank_timeout_ms, rerank_max_candidates)
- Create semantic_rerank.rs with RerankerClient, RerankOutcome enum, and rerank_candidates function
- Add RerankerFailure warning variant to SearchWarning
- Wire reranking into handle_semantic_search (overfetch → rerank → re-sort)
- Add rerank_latency_ms to SearchDiagnostics and SearchDiagnosticsEvent
- Include rerank latency in verbose diagnostics output
- 6 unit tests for reranker parsing, skip conditions, and failure handling
All 25 diagnostics + 6 reranker tests pass. 917/924 total tests pass (7 pre-existing Docker infrastructure failures).
Add 40+ unit tests to fingerprint_invalidation_tests covering:
- SemanticBackendConfig deserialization (minimal, all-fields, defaults)
- EmbeddingModelProfile validation for all encoding types
- TypedVector conversion and StoredVector roundtrip
- convert_vector and validate_compatible rejection paths
- Distance metric auto-resolution for f32/int8/binary
- base64_int8 signed int8 decode correctness
- Template hashing, enum roundtrips, resolve helpers
Minor: add #[derive(Debug)] to StoredVector for test ergonomics.
Closes aft-t6p.6.1
Add 6 new tests to fingerprint_invalidation_tests covering:
- file_policy_hash mismatch triggers rebuild
- docs_chunker_version mismatch triggers rebuild
- multi-field changes still trigger rebuild
- rebuild+query_prompt: rebuild wins
- only query_prompt change: ClearQueryCache
- non-fingerprint field changes: NoChange
Total: 22 fingerprint tests.
Closes aft-t6p.6.2
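The precedence these tests describe (a structural change wins over a prompt-only change) can be sketched as below. This is a reduced, hypothetical fingerprint with four fields; the real `SemanticIndexFingerprint` carries more, and the function and enum names here are illustrative.

```rust
/// Hypothetical, reduced fingerprint (the real struct has more fields).
#[derive(Clone, Debug, PartialEq)]
pub struct Fingerprint {
    pub model: String,
    pub dimension: usize,
    pub file_policy_hash: u64,
    pub query_prompt_hash: u64,
}

#[derive(Debug, PartialEq)]
pub enum FingerprintChange {
    /// Stored vectors are no longer valid: re-embed everything.
    Rebuild,
    /// Only the query side changed: drop cached query results.
    ClearQueryCache,
    NoChange,
}

pub fn diff(old: &Fingerprint, new: &Fingerprint) -> FingerprintChange {
    // Fields that affect what is stored in the index.
    let structural = old.model != new.model
        || old.dimension != new.dimension
        || old.file_policy_hash != new.file_policy_hash;

    if structural {
        // Checked first, so "rebuild + query prompt" resolves to Rebuild.
        FingerprintChange::Rebuild
    } else if old.query_prompt_hash != new.query_prompt_hash {
        FingerprintChange::ClearQueryCache
    } else {
        FingerprintChange::NoChange
    }
}
```

Checking the structural fields first is the design choice that makes the result explainable: a rebuild already discards everything a cache clear would.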
Add 29 tests covering:
- is_generated_file: protobuf, minified, dist, build, generated, dart
- is_doc_extension and is_config_extension validation
- classify_semantic_file for code/doc/config
- collect_docs_chunks markdown heading splitting
- SemanticFilePolicy defaults and builtin globs
- FileRecord field population
- build_manifest_from_store construction and cleanup
Closes aft-t6p.6.3
… tests
Add 23 tests covering:
- FlatF32VectorStore: search, empty, dimension mismatch, CRUD, prune, stats
- FlatBinaryHammingVectorStore: search, ranking, prune, delete, stats
- hamming_distance and popcount64 correctness
- Binary decode: byte-aligned, non-byte-aligned, padding, error
Closes aft-t6p.6.4
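For readers unfamiliar with Hamming search over packed bits, here is a small sketch of XOR plus popcount ranking. It assumes vectors are packed into `u64` words, least-significant bit first; the function names are illustrative and may not match the store's actual helpers.

```rust
/// Packs bits into u64 words, least-significant bit first (an assumed layout).
pub fn pack_bits(bits: &[bool]) -> Vec<u64> {
    let mut words = vec![0u64; (bits.len() + 63) / 64];
    for (i, &bit) in bits.iter().enumerate() {
        if bit {
            words[i / 64] |= 1u64 << (i % 64);
        }
    }
    words
}

/// Number of differing bits, or None when the word counts differ.
pub fn hamming_distance(a: &[u64], b: &[u64]) -> Option<u32> {
    if a.len() != b.len() {
        return None;
    }
    // XOR leaves a 1 wherever the inputs disagree; count_ones is the popcount.
    Some(a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum())
}

/// Flat scan: (index, distance) pairs sorted by ascending distance.
/// Vectors whose length does not match the query are skipped.
pub fn rank_by_hamming(query: &[u64], stored: &[Vec<u64>]) -> Vec<(usize, u32)> {
    let mut scored: Vec<(usize, u32)> = stored
        .iter()
        .enumerate()
        .filter_map(|(i, v)| hamming_distance(query, v).map(|d| (i, d)))
        .collect();
    scored.sort_by_key(|&(_, d)| d);
    scored
}
```

Unlike cosine, a smaller value means a closer match here, so the sort is ascending.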
Add 8 tests covering:
- SemanticIndexLifecycle: cold start, set/get, failed+error, all variants
- SemanticIndexSnapshot: search ranking, immutability after clone
- VectorStore: prune_stale_vectors, prune_orphans
Closes aft-t6p.6.5
Add 10 tests covering:
- HybridRerank pipeline type display
- Metrics collector: window size 1, cache hit rate, zero result rate, low confidence rate, latency percentiles
- Diagnostics output mode defaults
- Warning formatting: minimal (all variants, verifies suppressed), verbose (all 9 variants)
- SearchWarning serde roundtrip for all 8 variants
Closes aft-t6p.6.6
Add 4 tests covering:
- Concurrent snapshot clones produce independent results
- Concurrent read threads see identical data via Arc
- Mutex contention across 10 threads does not deadlock
- Arc strong_count tracks clone/drop correctly
Closes aft-t6p.6.7
Add 6 tests covering:
- Trust file atomic write (no tmp files left behind)
- Multiple projects trusted independently
- Untrust is idempotent
- Trust state survives reload (serde roundtrip)
- Nonexistent project path is untrusted (fail-closed)
Closes aft-t6p.6.8
The validate_compatible_rejects_binary_stored_with_cosine_metric test was missing source_vector_kind: BinaryPacked, causing the first match block to fail with 'unsupported source→stored vector conversion' instead of reaching the metric compatibility check.
Add local retrieval evaluation harness for measuring semantic search quality.
New files:
- crates/aft/src/semantic_eval.rs: pure-logic module with:
  - EvalCase, EvalResult, EvalSummary structs
  - JSONL parser (tolerates blank lines and comments)
  - path_matches(): cross-platform suffix matching
  - symbol_matches(): Rust/other-language symbol normalization
  - score_case(): per-case recall@k and MRR scoring
  - score_suite(): aggregate metrics across a suite
- crates/aft/src/commands/semantic_eval.rs: handler wiring:
  - Reads .aft/semantic-eval.jsonl, returns EvalSummary as JSON
  - Supports top_k override and include_per_case toggle
  - Returns tri-state response per AFT honest reporting convention
Wiring:
- crates/aft/src/lib.rs: pub mod semantic_eval
- crates/aft/src/commands/mod.rs: pub mod semantic_eval
- crates/aft/src/main.rs: dispatch semantic_eval command
Tests: 44 tests passing (parser, matcher, scorer, handler)
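As a reference for what recall@k and MRR measure, here is a minimal sketch. It uses exact string equality between ranked and expected paths, which is a simplification; the harness described above matches by path suffix and normalized symbol, and the function names below are illustrative.

```rust
/// Reciprocal rank of the first relevant hit within the top k (0.0 if none).
pub fn reciprocal_rank(ranked: &[&str], expected: &[&str], k: usize) -> f64 {
    ranked
        .iter()
        .take(k)
        .position(|r| expected.contains(r))
        .map_or(0.0, |i| 1.0 / (i as f64 + 1.0))
}

/// Fraction of expected items that appear anywhere in the top k.
pub fn recall_at_k(ranked: &[&str], expected: &[&str], k: usize) -> f64 {
    if expected.is_empty() {
        return 0.0;
    }
    let hits = expected
        .iter()
        .filter(|e| ranked.iter().take(k).any(|r| r == *e))
        .count();
    hits as f64 / expected.len() as f64
}

/// Mean reciprocal rank over (ranked, expected) cases; 0.0 for an empty suite.
pub fn mean_reciprocal_rank(cases: &[(Vec<&str>, Vec<&str>)], k: usize) -> f64 {
    if cases.is_empty() {
        return 0.0;
    }
    let total: f64 = cases.iter().map(|(r, e)| reciprocal_rank(r, e, k)).sum();
    total / cases.len() as f64
}
```

Recall@k rewards finding every expected item, while MRR only looks at how early the first correct item appears, so the two can move independently across config changes.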
Add semantic_doctor command that produces a SemanticHealthReport gathering:
- Config summary (backend, model, dimensions, metric, prompts, rerank)
- Index state (lifecycle, entry count, dimension, fingerprint freshness)
- Search quality metrics (p50/p95 latency, zero-result/low-confidence rates)
- Provider connectivity (optional probe)
- Active warnings and actionable suggestions
New files:
- crates/aft/src/semantic_doctor.rs: HealthStatus, ConfigSummary, IndexSummary, MetricsSummary, ProviderSummary, Suggestion, SemanticHealthReport structs with Serialize and Display impls
- crates/aft/src/commands/semantic_doctor.rs: command handler with optional probe_provider param, suggestion generation for disabled/building/failed/ready states, 7 handler tests + 6 model tests
Wiring:
- crates/aft/src/lib.rs: pub mod semantic_doctor
- crates/aft/src/commands/mod.rs: pub mod semantic_doctor
- crates/aft/src/main.rs: dispatch "semantic_doctor" command
Also: fix semantic_eval temp directory race condition (atomic counter).
Tests: 14 semantic_doctor + 44 semantic_eval passing, check+clippy+fmt clean.
Extend the semantic_index_info section of the status command to include:
- Search quality metrics (total_queries, p50/p95 latency, zero_result_rate, low_confidence_rate, embedding_failure_rate, lexical_failure_rate)
- Rerank status (rerank_enabled, rerank_model)
- Diagnostics state (diagnostics_enabled, prompt_active)
The TUI/status surfaces can now show pipeline health without a separate semantic_doctor call. Metrics are zero when no queries have been recorded.
Tests: status + semantic_doctor tests passing, check+clippy+fmt clean.
- Add 3 new tests: markdown-fence parsing, snippet truncation, max_candidates limit
- Fix missing-ID append: semantic_search now appends missing indices in original order
- Add rerank_max_candidate_chars config field (default 2500) to SemanticBackendConfig
- Use config.rerank_max_candidate_chars instead of hardcoded 200 in reranker
- Update all test configs with new field
Bead: aft-t6p.2.1
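The missing-ID append can be illustrated with a short sketch: take the indices a reranker returned, drop duplicates and out-of-range values, then append whatever it left out in original order. `apply_rerank_order` is a hypothetical helper, not the function in semantic_search.rs.

```rust
/// Turns reranker output into a full permutation of 0..n.
/// Hypothetical helper for illustration only.
pub fn apply_rerank_order(n: usize, reranked: &[usize]) -> Vec<usize> {
    let mut used = vec![false; n];
    let mut order = Vec::with_capacity(n);

    for &i in reranked {
        // Skip indices the model invented (out of range) or repeated.
        if i < n && !used[i] {
            used[i] = true;
            order.push(i);
        }
    }

    // Candidates the reranker omitted keep their original relative order.
    order.extend((0..n).filter(|&i| !used[i]));
    order
}
```

Because the result is always a permutation of the input, a partial or sloppy reranker response can reorder candidates but never lose or duplicate one.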
4 issues found across 107 files
Note: This PR contains a large number of files. cubic only reviews up to 100 files per PR, so some files may not have been reviewed. cubic prioritizes the most important files to review.
Remove .beads/, .qartez/, .claude/, .omo/, .kiro/, .lean-ctx/ from the branch. These are local agent working directories that should not be distributed. Add them to .gitignore to prevent future accidents. Addresses cubic review comments on PR cortexkit#87.
1 issue found across 69 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name=".gitignore">
<violation number="1" location=".gitignore:95">
P2: Inconsistent .gitignore pattern: `omo/` should likely be `.omo/` to match the hidden tooling directory convention used by all other entries in this block.</violation>
</file>
P2: Inconsistent .gitignore pattern: omo/ should likely be .omo/ to match the hidden tooling directory convention used by all other entries in this block.
<file context>
@@ -87,3 +87,11 @@ benchmarks/aft-search/.bench/
+.beads/
+.qartez/
+.claude/
+omo/
+.kiro/
+.lean-ctx/
</file context>
Remove .alfonso/, agents.md, beads-data-*.jsonl, magic-context-*.md, biome.json_ from the branch. Add them to .gitignore to prevent future inclusion in PRs.
Restore .alfonso/ from main (it exists upstream). Keep agents.md, beads-data-*.jsonl, magic-context-*.md, biome.json_ removed and gitignored since they don't exist on main.
Source code for the semantic search functionality, for public preview. Here are the implementation plans for the sprints under this epic (in gastown beads format):
1. Fix duplicate entries in reranked output (greptile P1)
   - Add !used[i] check in filter_map to prevent duplicate indices
   - File: crates/aft/src/commands/semantic_search.rs
2. Strip markdown fences from LLM reranker responses (greptile P1)
   - Many chat models wrap JSON in code fences
   - Add strip_markdown_fences() helper applied before parsing
   - File: crates/aft/src/semantic_rerank.rs
3. Align TypeScript enum values with Rust serde (cubic P1)
   - SemanticBackendEnum: add perplexity variant
   - SemanticOutputEncodingEnum: float, base64_int8, base64_binary
   - SemanticStorageStrategyEnum: native_f32, decode_normalize_f32, binary_packed
   - SemanticInputModeEnum: flat_texts, document_chunks
   - SemanticDistanceMetricEnum: auto, cosine, dot_product, euclidean, hamming
   - File: packages/opencode-plugin/src/config.ts
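A hedged sketch of what a fence-stripping helper for item 2 might look like. It assumes the response is wrapped in at most one triple-backtick fence with an optional language tag, and returns the input unchanged (trimmed) when no closing fence is found. This is an illustration, not the implementation in semantic_rerank.rs; the `FENCE` constant is written with escapes only to keep literal backticks out of the example.

```rust
/// Three backticks, written with escapes (0x60 is the backtick character).
pub const FENCE: &str = "\x60\x60\x60";

/// Returns the content inside a single surrounding code fence,
/// or the trimmed input when it is not fenced or the fence is unclosed.
pub fn strip_markdown_fences(text: &str) -> &str {
    let trimmed = text.trim();

    // No opening fence: nothing to strip.
    let Some(rest) = trimmed.strip_prefix(FENCE) else {
        return trimmed;
    };

    // Everything up to the first newline is the optional language tag.
    let Some((_tag, body)) = rest.split_once('\n') else {
        return trimmed;
    };

    // Only strip when the closing fence is the last thing in the response.
    match body.trim_end().strip_suffix(FENCE) {
        Some(inner) => inner.trim(),
        None => trimmed,
    }
}
```

Falling back to the original text on an unclosed fence leaves the decision to the JSON parser, which will then fail and trigger the normal reranker fallback path.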
1 issue found across 4 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/opencode-plugin/src/config.ts">
<violation number="1" location="packages/opencode-plugin/src/config.ts:40">
P2: Semantic enum literals were renamed without backward-compatibility aliases or migration, breaking existing configs that use old values.</violation>
</file>
const SemanticBackendEnum = z.enum(["fastembed", "openai_compatible", "ollama", "perplexity"]);

/** Output encoding mode for embeddings. */
const SemanticOutputEncodingEnum = z.enum(["float", "base64_int8", "base64_binary"]);
P2: Semantic enum literals were renamed without backward-compatibility aliases or migration, breaking existing configs that use old values.
<file context>
@@ -34,19 +34,19 @@ const CheckerEnum = z.enum([
/** Output encoding mode for embeddings. */
-const SemanticOutputEncodingEnum = z.enum(["float", "binary", "ubinary", "int8", "uint8"]);
+const SemanticOutputEncodingEnum = z.enum(["float", "base64_int8", "base64_binary"]);
/** Storage strategy for embedding vectors. */
</file context>
…s, retry, diagnostics
Add three features to build_with_progress_contextualized:
1. Oversized document handling: split_oversized_document() partitions documents exceeding DEFAULT_MAX_CHUNKS_PER_DOCUMENT (100) into sub-groups, preserving chunk order with synthetic '(part N)' titles.
2. Retry logic: embed_document_group_with_retry() wraps each document group with exponential backoff (3 retries, 1s base, 8s cap), only retrying transient errors (rate limits, timeouts, server errors). Failed groups are skipped with a warning instead of aborting the entire build.
3. Diagnostics: ContextualizedBuildDiagnostics struct tracks documents_processed, chunks_embedded, rejected_oversized, retried_groups, failed_groups, and max_chunks_in_document. Summary logged via slog_info! at build completion.
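The retry policy in item 2 (3 retries, 1s base, 8s cap, transient errors only) can be sketched as follows. The sleep function is injected so the schedule can be observed without waiting, and jitter is left out for determinism. `backoff_delay` and `retry_with_backoff` are hypothetical names, not the PR's functions.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
pub fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(cap)
}

/// Calls `op` until it succeeds, fails with a non-transient error,
/// or `max_retries` retries have been used.
pub fn retry_with_backoff<T, E>(
    max_retries: u32,
    mut op: impl FnMut() -> Result<T, E>,
    is_transient: impl Fn(&E) -> bool,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Retry only transient failures, and only while budget remains.
            Err(e) if attempt < max_retries && is_transient(&e) => {
                sleep(backoff_delay(
                    attempt,
                    Duration::from_secs(1),
                    Duration::from_secs(8),
                ));
                attempt += 1;
            }
            Err(e) => return Err(e),
        }
    }
}
```

In production code the `sleep` argument would typically be `std::thread::sleep`; passing a closure that records durations is what makes the schedule testable.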
Coverage:
- chunks grouped by source document (multi-file)
- chunk order preserved within each document
- wrong chunk count in response fails loudly
- unknown file path in response fails
- dimension mismatch fails with specific error
- stale-vector pruning after contextualized index + refresh
- Perplexity backend defaults to DocumentChunks input mode
- Fastembed backend verifies FlatTexts for contrast
- oversized document is split into sub-groups (>100 chunks)
- empty file set produces empty index
- retry on transient errors (429 rate limit)
- non-transient errors are NOT retried
- progress callback reports correct done/total counts
CRITICAL fixes:
- cosine_similarity: guard NaN from zero-norm vectors + clamp to [-1,1]
- semantic_search: remove unconditional Ready status overwrite (search must not change lifecycle state)
- reranker: add out-of-bounds index warning when LLM returns indices exceeding candidate count
HIGH fixes:
- build_embed_text: remove duplicate name: field in embed text format
- split_large_chunk: fix end_line for final sub-chunk (was using chunk.start_line + total_lines instead of chunk_start + current_lines)
- strip_markdown_fences: robust fence stripping with language tag handling and proper closing-fence detection
- rejected_oversized: actually increment counter when documents are split
MEDIUM fixes:
- SemanticCancellationToken: use Acquire/Release ordering instead of Relaxed for cross-thread generation counter
- semantic_search: validate non-empty query before processing
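To make the first CRITICAL fix concrete, here is a minimal sketch of a cosine similarity that cannot return NaN for zero-norm input and stays inside [-1, 1] despite floating-point rounding. Returning 0.0 for degenerate input is an assumption made for this example; the PR's function may choose a different sentinel.

```rust
/// Cosine similarity with guards for degenerate input.
/// Returns 0.0 for mismatched lengths, zero-norm vectors, or NaN results
/// (an assumed convention for this sketch).
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return 0.0;
    }

    let (mut dot, mut norm_a, mut norm_b) = (0.0f32, 0.0f32, 0.0f32);
    for (x, y) in a.iter().zip(b) {
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }

    let denom = norm_a.sqrt() * norm_b.sqrt();
    // A zero-norm vector would make this 0/0, which is NaN.
    if denom == 0.0 || !denom.is_finite() {
        return 0.0;
    }

    let score = dot / denom;
    if score.is_nan() {
        return 0.0;
    }
    // Rounding can push the ratio slightly outside the valid range.
    score.clamp(-1.0, 1.0)
}
```

The NaN guard matters for ranking: NaN compares as neither greater nor less than any score, so a single zero vector can otherwise produce an unstable sort order.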
parse_semantic_config previously only handled 6 fields (backend, model, base_url, api_key_env, timeout_ms, max_batch_size). Now it also parses: dimensions, output_encoding, input_mode, storage_strategy, distance_metric, query_prompt_template, document_prompt_template, diagnostics_enabled, low_confidence_threshold, output_mode, rerank_enabled, rerank_model, rerank_base_url, rerank_api_key_env, rerank_timeout_ms, rerank_max_candidates, rerank_max_candidate_chars. Note: the TS plugin's getStrippedSemanticKeys() intentionally strips these fields from PROJECT config (untrusted) as a security boundary. They can still be set from USER config (trusted). The Rust side now correctly accepts all fields when the plugin sends them.
1 issue found across 5 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="crates/aft/src/commands/configure.rs">
<violation number="1" location="crates/aft/src/commands/configure.rs:342">
P1: `rerank_base_url` is parsed without SSRF validation, unlike `base_url`</violation>
</file>
P1: rerank_base_url is parsed without SSRF validation, unlike base_url
<file context>
@@ -230,6 +230,150 @@ fn parse_semantic_config(
+ .to_string()
+ .into();
+ }
+ if let Some(raw) = obj.get("rerank_base_url") {
+ semantic.rerank_base_url = raw
+ .as_str()
</file context>
Also parse: jsonl_logging, jsonl_path, include_raw_queries, include_snippets, retention_days, metrics_window_size.
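Because `jsonl_path` ends up as a write target, one hedged way to validate it at parse time is to require an absolute path and reject any `..` component. The helper name and error strings below are hypothetical; the existing validators in configure.rs may apply different or stricter rules.

```rust
use std::path::{Component, Path, PathBuf};

/// Accepts only absolute paths without `..` components.
/// Hypothetical helper; not the validator used in configure.rs.
pub fn validate_log_path(raw: &str) -> Result<PathBuf, String> {
    let path = Path::new(raw);

    if !path.is_absolute() {
        return Err("configure: semantic.jsonl_path must be an absolute path".to_string());
    }

    // Reject traversal segments instead of trying to resolve them.
    if path.components().any(|c| matches!(c, Component::ParentDir)) {
        return Err("configure: semantic.jsonl_path must not contain '..'".to_string());
    }

    Ok(path.to_path_buf())
}
```

Note that `Path::is_absolute` is platform-dependent: a path such as `/var/log/x` is absolute on Unix but not on Windows, where a drive prefix is also required.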
2 issues found across 1 file (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="crates/aft/src/commands/configure.rs">
<violation number="1" location="crates/aft/src/commands/configure.rs:382">
P2: `semantic.jsonl_path` lacks path validation/normalization, unlike other path configs in this file (`validate_storage_dir`, `parse_lsp_paths_extra`) which enforce absolute paths and reject `..` traversal. This creates path-injection risk for downstream JSONL diagnostics writes.</violation>
<violation number="2" location="crates/aft/src/commands/configure.rs:402">
P2: `semantic.retention_days` uses lossy `u64 -> u32` cast with silent overflow instead of explicit validation.</violation>
</file>
P2: semantic.jsonl_path lacks path validation/normalization, unlike other path configs in this file (validate_storage_dir, parse_lsp_paths_extra) which enforce absolute paths and reject .. traversal. This creates path-injection risk for downstream JSONL diagnostics writes.
<file context>
@@ -374,6 +374,42 @@ fn parse_semantic_config(
+ "configure: semantic.jsonl_logging must be a boolean".to_string()
+ })?;
+ }
+ if let Some(raw) = obj.get("jsonl_path") {
+ semantic.jsonl_path = if raw.is_null() {
+ None
</file context>
P2: semantic.retention_days uses lossy u64 -> u32 cast with silent overflow instead of explicit validation.
<file context>
@@ -374,6 +374,42 @@ fn parse_semantic_config(
+ })?;
+ }
+ if let Some(raw) = obj.get("retention_days") {
+ semantic.retention_days = raw.as_u64().ok_or_else(|| {
+ "configure: semantic.retention_days must be an unsigned integer".to_string()
+ })? as u32;
</file context>
Suggested change:
- semantic.retention_days = raw.as_u64().ok_or_else(|| {
-     "configure: semantic.retention_days must be an unsigned integer".to_string()
- })? as u32;
+ let v = raw.as_u64().ok_or_else(|| {
+     "configure: semantic.retention_days must be an unsigned integer".to_string()
+ })?;
+ semantic.retention_days = u32::try_from(v)
+     .map_err(|_| "configure: semantic.retention_days is too large".to_string())?;
Summary
Semantic search in AFT moves from a minimal embedding-and-cosine prototype to a provider-capability-aware retrieval subsystem with typed vectors, optional reranking, background lifecycle management, diagnostics, and evaluation tooling. This is a public preview — the feature is functional and tested (~93 new tests) but expects iteration based on real-world feedback.
What changed
The upgrade touches the full semantic pipeline (config, indexing, retrieval, diagnostics, and observability) without breaking the default `fastembed` experience.
Typed vector representations
Vectors are no longer opaque f32 blobs. Every stored vector carries explicit type metadata (`DenseF32`, `Int8SourceDecoded`, `BinaryPacked`) and is paired with its source kind so the correct distance metric is selected automatically. Binary packed vectors use Hamming search (native bitwise XOR + popcount) instead of cosine, which is both faster and semantically correct for quantized embeddings. This unlocks Perplexity's `base64_binary` and `base64_int8` output modes alongside standard dense providers.
Provider capability profiles
Each embedding backend (fastembed, OpenAI-compatible, Ollama, Perplexity) declares what it supports: output encoding, distance metric, dimension range, max batch size. The config layer validates combinations at configure time — you cannot accidentally request binary vectors through a cosine-only provider. Profiles also carry fingerprint fields so switching providers triggers a clean index rebuild rather than silent corruption.
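Configure-time validation of this kind can be sketched as a lookup against a declared capability set. Everything below (type names, the two-metric enum, the batch limits) is hypothetical and only illustrates the shape of the check; the actual profiles in this PR carry more fields.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Encoding {
    Float,
    Base64Int8,
    Base64Binary,
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Metric {
    Cosine,
    Hamming,
}

/// Hypothetical capability profile; names and limits are illustrative.
pub struct ProviderProfile {
    pub name: &'static str,
    pub encodings: &'static [Encoding],
    pub max_batch_size: usize,
}

/// Rejects combinations the provider cannot serve, before any request is made.
pub fn validate(
    profile: &ProviderProfile,
    encoding: Encoding,
    metric: Metric,
    batch_size: usize,
) -> Result<(), String> {
    if !profile.encodings.contains(&encoding) {
        return Err(format!("{} does not support {:?} output", profile.name, encoding));
    }
    // In this sketch, binary output pairs with Hamming and nothing else does.
    let binary = encoding == Encoding::Base64Binary;
    if binary != (metric == Metric::Hamming) {
        return Err(format!("{:?} output is incompatible with {:?} distance", encoding, metric));
    }
    if batch_size == 0 || batch_size > profile.max_batch_size {
        return Err(format!(
            "batch size {} is outside 1..={} for {}",
            batch_size, profile.max_batch_size, profile.name
        ));
    }
    Ok(())
}
```

Failing at configure time keeps an unsupported combination from ever producing vectors that would later be compared with the wrong metric.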
Fingerprint-driven index lifecycle
A `SemanticIndexFingerprint` captures every dimension that affects index correctness: backend, model, base_url, dimension, chunking_version, output_encoding, storage_strategy, vector kinds, normalization, and prompt hashes. `diff()` classifies changes as `Rebuild` (structural: re-embed everything), `ClearQueryCache` (query prompts changed: invalidate cached results only), or `None`. This replaces the previous "delete and hope" invalidation with precise, explainable rebuild decisions.
Non-blocking cold start
Index builds run in a background thread with cooperative cancellation (`SemanticCancellationToken` via `AtomicU64` generation counter). The build checks the generation before each embedding batch and exits early when a reconfigure arrives. Priority ordering ensures high-value files (recently edited, high PageRank) get embedded first. Exponential backoff handles transient provider failures without blocking the session.
Stale-vector pruning
When files are edited, deleted, moved, excluded, or re-included, the index tracks which vectors are stale and prunes them during the next refresh cycle. Every vector record carries file/chunk ownership metadata (file path, version, chunk hash, index fingerprint) so pruning is traceable and deterministic.
File policy and docs chunking
A configurable file policy controls which files enter the index (include globs, exclude globs, max file size, max chunk count). The docs chunker splits Markdown and documentation files into semantic sections before embedding, improving recall for documentation-shaped queries.
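Heading-based splitting can be sketched in a few lines. This version treats any line starting with one to six `#` characters followed by a space as a section boundary and does not special-case `#` lines inside fenced code blocks, which a production chunker would need to handle. `DocChunk` and `split_by_headings` are illustrative names, not the PR's types.

```rust
#[derive(Debug, PartialEq)]
pub struct DocChunk {
    pub heading: String,
    pub body: String,
}

/// True for ATX-style headings: one to six '#' followed by a space.
fn is_heading(line: &str) -> bool {
    let hashes = line.chars().take_while(|c| *c == '#').count();
    (1..=6).contains(&hashes) && line[hashes..].starts_with(' ')
}

/// Emits the section collected so far, unless it is completely empty.
fn flush(chunks: &mut Vec<DocChunk>, heading: &mut String, body: &mut String) {
    let text = body.trim().to_string();
    body.clear();
    let title = std::mem::take(heading);
    if !title.is_empty() || !text.is_empty() {
        chunks.push(DocChunk { heading: title, body: text });
    }
}

/// Splits Markdown into one chunk per heading; text before the first
/// heading becomes a chunk with an empty heading.
pub fn split_by_headings(markdown: &str) -> Vec<DocChunk> {
    let mut chunks = Vec::new();
    let (mut heading, mut body) = (String::new(), String::new());
    for line in markdown.lines() {
        if is_heading(line) {
            flush(&mut chunks, &mut heading, &mut body);
            heading = line.trim_start_matches('#').trim().to_string();
        } else {
            body.push_str(line);
            body.push('\n');
        }
    }
    flush(&mut chunks, &mut heading, &mut body);
    chunks
}
```

Keeping the heading alongside the body lets the embedding text include the section title, which tends to help documentation-shaped queries that name a section rather than quote its content.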
Reranking pipeline
Optional reranking via any OpenAI-compatible `/v1/rerank` or chat-completion endpoint. The pipeline sends initial retrieval candidates to a reranker, parses the response (supporting multiple JSON shapes), and reorders results with safe fallback: if the reranker fails, the original cosine-similarity order is returned unchanged. Config fields: `rerank.enabled`, `rerank.model`, `rerank.base_url`, `rerank.api_key_env`, `rerank.max_candidates`.
Search pipeline metrics and diagnostics
Every `aft_search` call records timing, cache hits/misses, result counts, and reranker fallback events. Metrics are exposed through the `status` command and through JSONL diagnostic logs for offline analysis. The `DiagnosticsOutputMode` config controls verbosity in tool output (`compact` | `verbose` | `off`).
Semantic doctor
`semantic_doctor` is a health-check command that reports config summary, index summary, metrics summary, provider summary, and actionable suggestions. Use it to verify that the index is healthy, the provider is reachable, and the configuration is consistent.
Semantic eval harness
`semantic_eval` runs a JSONL-defined evaluation suite against the semantic index. Each case specifies a query, expected paths, expected symbols, and top-k. The harness computes recall@k and MRR (Mean Reciprocal Rank) to quantify retrieval quality across config changes.
Status integration
The `status` command now includes semantic health metrics: lifecycle state, entry count, dimension, total queries, cache hit ratio, average query time, and provider info. The OpenCode TUI sidebar surfaces these alongside the existing index state.
Config trust boundary
`backend`, `base_url`, and `api_key_env` are user-only fields: project-level `aft.jsonc` cannot inject these. A hostile repository therefore cannot redirect embeddings to an attacker-controlled endpoint or exfiltrate API keys. The plugin logs a warning when it strips a project-level setting.
Contextualized document-chunk embedding (partial)
Initial support for Perplexity-style document/chunk grouped embedding — chunks from the same source document are batched together rather than flattened. Oversized document handling and retry logic are still in progress (see roadmap).
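The difference between grouped and flattened batching can be sketched as follows. `Chunk` and `group_by_document` are hypothetical names used only for illustration; the request shape that a provider actually expects is not shown in this description and is not modeled here.

```rust
use std::collections::HashMap;

/// Hypothetical chunk: the source document it came from and its text.
#[derive(Debug, Clone, PartialEq)]
pub struct Chunk {
    pub doc_path: String,
    pub text: String,
}

/// Groups chunks by source document, preserving first-seen document order
/// and chunk order within each document. A flattened batch would instead
/// send every chunk text as an independent input.
pub fn group_by_document(chunks: &[Chunk]) -> Vec<(String, Vec<String>)> {
    let mut groups: Vec<(String, Vec<String>)> = Vec::new();
    let mut index: HashMap<&str, usize> = HashMap::new();
    for c in chunks {
        // Look up the group slot for this document, creating it on first sight.
        let slot = *index.entry(c.doc_path.as_str()).or_insert_with(|| {
            groups.push((c.doc_path.clone(), Vec::new()));
            groups.len() - 1
        });
        groups[slot].1.push(c.text.clone());
    }
    groups
}
```

Each inner `Vec<String>` would then become one document entry in a grouped embedding request, so the provider can embed a chunk with knowledge of its siblings.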
How to test
Default fastembed (zero-config)
Verify: results appear with `source: semantic` or `source: hybrid` tags. Status shows `[index: ready]` after build completes.
Provider switching
Verify: index rebuilds automatically on next session start. Status shows new provider/model.
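The automatic rebuild on a provider switch follows from the fingerprint comparison described earlier. The sketch below is a simplified stand-in, assuming a reduced `Fingerprint` struct with only five of the listed dimensions; the real `SemanticIndexFingerprint` has more fields and its exact classification rules may differ.

```rust
/// Simplified stand-in for `SemanticIndexFingerprint`; only a few of the
/// dimensions listed in the description are modeled here.
#[derive(Debug, Clone, PartialEq)]
pub struct Fingerprint {
    pub backend: String,
    pub model: String,
    pub dimension: usize,
    pub chunking_version: u32,
    pub query_prompt_hash: u64,
}

/// The three outcomes named in the description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FingerprintDiff {
    /// Structural change: stored vectors are no longer comparable.
    Rebuild,
    /// Only the query side changed: stored vectors stay valid.
    ClearQueryCache,
    /// Nothing relevant changed.
    None,
}

impl Fingerprint {
    /// Classifies the change from `self` (stored index) to `new` (current config).
    pub fn diff(&self, new: &Fingerprint) -> FingerprintDiff {
        let structural = self.backend != new.backend
            || self.model != new.model
            || self.dimension != new.dimension
            || self.chunking_version != new.chunking_version;
        if structural {
            FingerprintDiff::Rebuild
        } else if self.query_prompt_hash != new.query_prompt_hash {
            FingerprintDiff::ClearQueryCache
        } else {
            FingerprintDiff::None
        }
    }
}
```

Under this model, changing `backend` or `model` in the config yields `Rebuild`, which is the behavior the verification step above checks for.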
Reranking
    {
      "semantic_search": true,
      "semantic": {
        "backend": "openai_compatible",
        "model": "text-embedding-3-small",
        "base_url": "https://api.openai.com/v1",
        "api_key_env": "OPENAI_API_KEY"
      },
      "rerank": {
        "enabled": true,
        "model": "rerank-english-v3.0",
        "base_url": "https://api.cohere.com",
        "api_key_env": "COHERE_API_KEY"
      }
    }

Verify: search results show reranker-sorted order. Disable the reranker: results fall back to cosine order.
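The fallback behavior being verified here can be sketched without any network code. In the sketch below, `rerank_with_fallback` and the `rerank` callback are hypothetical: the callback stands in for the HTTP call and response parsing, and the rule that a valid response must be a full permutation of the candidates is an assumption made for the example, not something the PR states.

```rust
/// A retrieval candidate with its cosine-similarity score.
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub path: String,
    pub cosine: f32,
}

/// Applies a reranker to `candidates`, returning the input order on any failure.
///
/// `rerank` is a hypothetical callback standing in for the provider call; it
/// returns indices into `candidates` in the new order, or an error string.
pub fn rerank_with_fallback<F>(candidates: Vec<Candidate>, rerank: F) -> Vec<Candidate>
where
    F: FnOnce(&[Candidate]) -> Result<Vec<usize>, String>,
{
    let order = match rerank(&candidates) {
        Ok(order) => order,
        Err(_) => return candidates, // provider failure: keep cosine order
    };
    // Assumption: a wrong length, an out-of-range index, or a duplicate index
    // marks the response as malformed, and malformed counts as a failure.
    let mut seen = vec![false; candidates.len()];
    let valid = order.len() == candidates.len()
        && order
            .iter()
            .all(|&i| i < seen.len() && !std::mem::replace(&mut seen[i], true));
    if !valid {
        return candidates;
    }
    order.into_iter().map(|i| candidates[i].clone()).collect()
}
```

The caller never sees an error from this function, which matches the "safe fallback" wording: a broken reranker degrades ranking quality but not availability.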
Semantic doctor
    aft_search({ "query": "test" })  # trigger index build if cold
    # Then check health via status command or semantic_doctor

Verify: health report shows ConfigSummary, IndexSummary, MetricsSummary, ProviderSummary.
Eval harness
Verify: returns recall@k and MRR scores.
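For reference, the two scores can be computed as in the sketch below. It uses the standard definitions of recall@k and MRR over expected paths only; the `EvalCase` struct and function names are illustrative, and how the actual harness treats expected symbols or cases with no expectations is not specified here.

```rust
/// One evaluation case: ranked result paths returned by search, plus expectations.
pub struct EvalCase<'a> {
    pub ranked: &'a [&'a str],
    pub expected: &'a [&'a str],
    pub top_k: usize,
}

/// Fraction of expected paths that appear in the first `top_k` results.
/// Assumption: a case with no expected paths scores 0.0.
pub fn recall_at_k(case: &EvalCase<'_>) -> f64 {
    if case.expected.is_empty() {
        return 0.0;
    }
    let top = &case.ranked[..case.top_k.min(case.ranked.len())];
    let hits = case.expected.iter().filter(|e| top.contains(*e)).count();
    hits as f64 / case.expected.len() as f64
}

/// 1 / rank of the first expected path within the top `top_k`, or 0.0 if none.
pub fn reciprocal_rank(case: &EvalCase<'_>) -> f64 {
    case.ranked
        .iter()
        .take(case.top_k)
        .position(|r| case.expected.contains(r))
        .map_or(0.0, |i| 1.0 / (i as f64 + 1.0))
}

/// Mean Reciprocal Rank across all cases (0.0 for an empty suite).
pub fn mrr(cases: &[EvalCase<'_>]) -> f64 {
    if cases.is_empty() {
        return 0.0;
    }
    cases.iter().map(reciprocal_rank).sum::<f64>() / cases.len() as f64
}
```

Recall@k rewards finding all expected files anywhere in the top k, while MRR rewards placing the first relevant file near the top, so tracking both across config changes catches different regressions.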
Test coverage
~93 tests across 8 test sub-tasks covering:
Roadmap
Still in progress or planned for follow-up:
Architecture notes
Key new modules:
- `crates/aft/src/semantic_rerank.rs`: reranking pipeline with safe fallback
- `crates/aft/src/semantic_diagnostics.rs`: JSONL diagnostic logging
- `crates/aft/src/semantic_doctor.rs`: health-check report generation
- `crates/aft/src/semantic_eval.rs`: evaluation harness (JSONL parser, scoring)
- `crates/aft/src/vector_store.rs`: VectorStore trait with DenseF32 and BinaryPacked implementations
- `crates/aft/src/commands/semantic_doctor.rs`: doctor command handler
- `crates/aft/src/commands/semantic_eval.rs`: eval command handler
Modified significantly:
- `crates/aft/src/semantic_index.rs`: lifecycle management, fingerprint-driven invalidation, non-blocking build, stale pruning, typed vectors
- `crates/aft/src/config.rs`: provider profiles, rerank config, trust boundary fields
- `crates/aft/src/commands/status.rs`: semantic health metrics
- `crates/aft/src/commands/semantic_search.rs`: reranking integration, diagnostics output mode
Summary by cubic
Upgrades semantic search to a provider-aware pipeline with typed vectors, reranking, contextualized document-chunk embedding, partial-ready querying, and built-in diagnostics/eval. Adds Perplexity support and Hamming search for binary/int8, and hardens lifecycle, metrics, and config.
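As background for the Hamming search mentioned above, the sketch below shows the usual way to compare packed binary vectors: XOR the words and count the set bits. The sign-based packing in `pack_signs` is an assumption for illustration only; providers that return binary encodings deliver already-packed bytes, and the PR's actual storage layout is not shown in this summary.

```rust
/// Packs the sign of each f32 component into u64 words (bit set = positive).
/// Illustrative only: real binary embeddings arrive pre-packed from the provider.
pub fn pack_signs(v: &[f32]) -> Vec<u64> {
    let mut words = vec![0u64; (v.len() + 63) / 64];
    for (i, x) in v.iter().enumerate() {
        if *x > 0.0 {
            words[i / 64] |= 1u64 << (i % 64);
        }
    }
    words
}

/// Hamming distance between two packed vectors: the number of differing bits.
pub fn hamming(a: &[u64], b: &[u64]) -> u32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same packed length");
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}
```

A smaller distance means more similar vectors, so a search over binary vectors ranks candidates by ascending `hamming` rather than descending cosine.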
New Features
- `base64_binary`/`base64_int8`.
- `status`, `semantic_doctor`, and `semantic_eval` surface semantic health.
Bug Fixes
- `max_candidate_chars` support.
- `jsonl_logging`, `jsonl_path`, `include_raw_queries`, `include_snippets`, `retention_days`, and `metrics_window_size`; TypeScript enums aligned with Rust.
Written for commit d204e2d. Summary will update on new commits.
Greptile Summary
This PR upgrades the semantic search subsystem from a minimal prototype to a full provider-aware retrieval pipeline, introducing typed vectors, fingerprint-driven index lifecycle management, optional reranking, background build with cooperative cancellation, and diagnostic tooling.
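The cooperative cancellation mentioned in this summary can be illustrated with a generation counter. The sketch below is an assumption-laden stand-in: `GenerationToken` and `run_build` are hypothetical names, and the real token in `semantic_index.rs` may expose a different API.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Hypothetical generation-counter token shared between the session and a build thread.
#[derive(Clone, Default)]
pub struct GenerationToken {
    generation: Arc<AtomicU64>,
}

impl GenerationToken {
    pub fn new() -> Self {
        Self::default()
    }

    /// Snapshot taken by a build when it starts.
    pub fn current(&self) -> u64 {
        self.generation.load(Ordering::Acquire)
    }

    /// Called on reconfigure: invalidates every build started earlier.
    pub fn cancel_all(&self) {
        self.generation.fetch_add(1, Ordering::Release);
    }

    /// True once the generation has moved past the snapshot.
    pub fn is_cancelled(&self, started_at: u64) -> bool {
        self.current() != started_at
    }
}

/// Processes batches until done or cancelled; returns the number of batches embedded.
/// The check happens before each batch, so a batch in flight always completes.
pub fn run_build<F: FnMut(usize)>(token: &GenerationToken, batches: usize, mut embed: F) -> usize {
    let started_at = token.current();
    let mut done = 0;
    for i in 0..batches {
        if token.is_cancelled(started_at) {
            break;
        }
        embed(i);
        done += 1;
    }
    done
}
```

A counter rather than a boolean flag means a new build started after a reconfigure takes a fresh snapshot and is not affected by the cancellation that stopped its predecessor.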
- `semantic_index.rs`, `vector_store.rs`: adds `VectorStore` abstraction, `EmbeddingModelProfile` for provider capability validation, `SemanticIndexFingerprint` with `diff()` for precise rebuild decisions, and stale-vector pruning. Cancellation token uses correct `Acquire`/`Release` ordering; `cosine_similarity` guards zero-norm vectors and clamps output.
- `semantic_rerank.rs`, `semantic_search.rs`: optional OpenAI-compatible reranker with safe fallback and markdown-fence stripping. A field/method naming collision on `diagnostics_enabled` silently disables JSONL logging when `jsonl_logging: true` is set without `diagnostics_enabled: true`.
- `config.ts`: new enum schemas and trust-boundary `mergeSemanticConfig`; rerank and diagnostics fields are absent from `SemanticConfigSchema` and stripped by Zod before reaching Rust.
Confidence Score: 3/5
Safe to merge with the `diagnostics_enabled` field/method fix applied; without it, any user who enables JSONL logging alone gets silence.
The search handler reads the raw bool field instead of the `diagnostics_enabled()` method that unifies `diagnostics_enabled || jsonl_logging`, silently breaking JSONL-only logging configs.
`crates/aft/src/commands/semantic_search.rs` (line 51 field vs. method) and `packages/opencode-plugin/src/config.ts` (rerank/diagnostics fields absent from `SemanticConfigSchema`).
Important Files Changed
- `diagnostics_enabled` silently disables JSONL logging when only `jsonl_logging: true` is set.
- `diagnostics_enabled()` method that ORs the field with `jsonl_logging`, but callers bypass this method and read the field directly.
- `init_diagnostics_logger` is correct, gated on `jsonl_logging` field.
Comments Outside Diff (1)
`packages/opencode-plugin/src/config.ts`, line 37-54
Several new enum schemas use values that don't align with the Rust serde representation:
- `SemanticOutputEncodingEnum` allows `"binary"`, `"ubinary"`, `"int8"`, `"uint8"` but Rust `OutputEncoding` deserializes from `"base64_binary"` and `"base64_int8"`.
- `SemanticStorageStrategyEnum` allows `"flat"` and `"binary_pack"` but Rust `StorageStrategy` expects `"native_f32"` and `"binary_packed"`.
- `SemanticInputModeEnum` includes `"chunk_extracts"` and `"contextualized"` but Rust `InputMode` only has `"flat_texts"` and `"document_chunks"`.
- `SemanticDistanceMetricEnum` uses `"dot"` but Rust `DistanceMetric` expects `"dot_product"`.
- `SemanticBackendEnum` is missing the new `"perplexity"` variant added to Rust.
A user who follows the TypeScript autocomplete and picks `output_encoding: "int8"` will pass TypeScript validation but receive a deserialization error (or silent fallback to default) from the Rust binary at runtime.
Reviews (5): Last reviewed commit: "fix(configure): add missing JSONL/metric..."