feat(semantic-cache): two-level cache with on-chain quality trust layer #859
Mayveskii wants to merge 9 commits into gonka-ai:upgrade-v0.2.11 from feature/gip-semantic-cache-trust-layer
Conversation
…, nonce validation

Builds on the continuous PoC foundation (GiP gonka-ai#821) with three critical missing components:

Pruning:
- Extend PruningState with ContinuousPoCCommitsPrunedEpoch and ContinuousPoCChallengePrunedEpoch fields (pruning_state.proto + .pb.go)
- Add GetContinuousPoCCommitsPruner and GetContinuousPoCChallengesPruner to pruning.go using the same Pruner[K,V] pattern as existing collections
- Wire into Keeper.Prune(), called from EndBlock every block

Weight integration:
- Add ContinuousPoCEpochSummaries map to WeightCalculator
- GetAllContinuousPoCEpochSummariesForEpoch loads all summaries at settlement
- calculateParticipantWeight adds effective_poc_weight to baseCount before applying combinedFactor — disabled if PenaltyApplied is true

Challenges:
- ContinuousPoCChallenge type with full Marshal/Unmarshal/Size
- IssueContinuousPoCChallenges: called each block, samples commits by ValidationSampleRateBps using app_hash as deterministic entropy
- RespondContinuousPoCChallenge: verifies sha256-based Merkle proof; invalid proof or expired challenge zeros the epoch EffectivePocWeight
- ExpireContinuousPoCChallenges: zeroes weight for unanswered challenges

PR gonka-ai#845 adds ContinuousPocParams to Params but omits MarshalToSizedBuffer, Size, and Unmarshal changes in params.pb.go, so the field would never be persisted. This PR adds the full hand-written codec for ContinuousPoCParams, ContinuousPoCCommit, ContinuousPoCEpochSummary, and ContinuousPoCChallenge in types/continuous_poc.go, and wires field 14 into params.pb.go.

Closes gonka-ai#821

Made-with: Cursor
…ing state

Made-with: Cursor
- Rename proto field 5 from continuous_poc_summaries_pruned_epoch to continuous_poc_challenges_pruned_epoch (naming now matches its use)
- Add proto field 6 continuous_poc_summaries_pruned_epoch for the ContinuousPoCEpochSummaries collection
- Add GetContinuousPoCEpochSummariesPruner keyed on Pair[uint64, AccAddress] and wire it into Keeper.Prune() to prevent summaries accumulating forever
- Update PruningState round-trip and backward-compat tests for field 6

Made-with: Cursor
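The Pruner[K,V] pattern mentioned above might be sketched roughly like this (all names, fields, and signatures here are illustrative assumptions, not the actual gonka keeper code):

```go
package main

import "fmt"

// Pruner deletes all entries of one collection that belong to epochs
// at or below a target, recording progress so each EndBlock call only
// does incremental work. Illustrative sketch only.
type Pruner[K comparable, V any] struct {
	store      map[K]V        // stand-in for the on-chain collection
	epochOf    func(K) uint64 // extracts the epoch from a key
	lastPruned uint64         // mirrors e.g. ContinuousPoCCommitsPrunedEpoch
}

// Prune removes every entry with epoch <= target and returns the count.
func (p *Pruner[K, V]) Prune(target uint64) int {
	if target <= p.lastPruned {
		return 0 // nothing new to prune this block
	}
	removed := 0
	for k := range p.store {
		if p.epochOf(k) <= target {
			delete(p.store, k)
			removed++
		}
	}
	p.lastPruned = target
	return removed
}

func main() {
	// Composite key, akin to Pair[uint64, AccAddress] in the PR.
	type key struct {
		epoch uint64
		addr  string
	}
	p := &Pruner[key, string]{
		store: map[key]string{
			{1, "addr1"}: "summary-a",
			{2, "addr1"}: "summary-b",
			{5, "addr2"}: "summary-c",
		},
		epochOf: func(k key) uint64 { return k.epoch },
	}
	fmt.Println(p.Prune(2)) // prunes epochs 1 and 2
	fmt.Println(len(p.store))
}
```

Keeping the last-pruned epoch in state is what lets the prune stay O(new work) per block instead of rescanning history.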
Nodes that produce high-quality inference results and serve them from cache earn CacheQualityWeight — an additive bonus on top of standard PoC weight. This creates an economic feedback loop: better GPU → better results → more reuse → higher weight → more rewards.

Two lookup levels:
- L1 — PromptHash exact-match (sha256 of canonical JSON), O(1), 100% certain.
- L2 — cosine similarity via all-MiniLM-L6-v2, governance-controlled threshold.

MsgFinishInference is sent on every HIT so the node closes the on-chain cycle. Feature disabled by default; activated via CacheQualityParams governance.

inference-chain:
- cache_quality.proto: proto source for CacheQualityEpochSummary (new)
- params.proto: CacheQualityParams added (field 14)
- tx.proto: MsgSubmitCacheQualitySummary RPC added
- types/cache_quality.go: params and summary types with serialisation
- keeper/msg_server_cache_quality.go: submission handler with bounds checks
- keeper/pruning.go: CacheQualityEpochSummaries pruner
- module/chainvalidation.go: CacheQualityWeight integration at epoch settlement
- app/upgrades/v0_2_11: seeds CacheQualityParams defaults on upgrade

decentralized-api:
- semanticcache/cache.go: SemanticCache with LookupByPromptHash (L1) + Lookup (L2)
- semanticcache/memory_store.go: InMemoryCacheStore, zero external dependencies
- semanticcache/embedder.go: MLNodeEmbedder + StubEmbedder
- semanticcache/quality_reporter.go: QualityReporter submits per-epoch summary
- semanticcache/cache_test.go + cache_http_test.go: 20 tests, full matrix
- post_chat_handler.go: L1/L2 integration in executor path
- main.go: cache initialisation wired to governance params and epoch events
- mlnodeclient: Embed() method added to interface and client

mlnode:
- packages/api/src/api/embed_routes.py: /api/v1/embed endpoint (CPU, fastembed)

docs:
- docs/specs/semantic-cache.md: two-level architecture, developer simulation

Depends on continuous PoC (gonka-ai#856). Closes gonka-ai#821.
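A minimal sketch of the two-level lookup described above — L1 exact PromptHash match, then L2 cosine similarity against a basis-point threshold. All type and function names here are illustrative; the real implementation lives in semanticcache/cache.go:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"math"
)

type entry struct {
	promptHash string
	embedding  []float64
	response   string
}

type cache struct {
	byHash       map[string]*entry // L1: O(1) exact match
	entries      []*entry          // L2: scanned for similarity
	thresholdBps int               // governance-controlled, e.g. 9700 = 97%
}

// promptHash is sha256 over the canonical JSON of the request.
func promptHash(canonicalJSON []byte) string {
	h := sha256.Sum256(canonicalJSON)
	return hex.EncodeToString(h[:])
}

func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// Lookup tries L1 first; only on a hash miss does it pay for the
// L2 similarity scan over stored embeddings.
func (c *cache) Lookup(canonicalJSON []byte, emb []float64) (string, string) {
	if e, ok := c.byHash[promptHash(canonicalJSON)]; ok {
		return e.response, "L1"
	}
	threshold := float64(c.thresholdBps) / 10000.0
	for _, e := range c.entries {
		if cosine(emb, e.embedding) >= threshold {
			return e.response, "L2"
		}
	}
	return "", "MISS"
}

func main() {
	e := &entry{promptHash: promptHash([]byte(`{"p":"hi"}`)), embedding: []float64{1, 0}, response: "cached"}
	c := &cache{byHash: map[string]*entry{e.promptHash: e}, entries: []*entry{e}, thresholdBps: 9700}
	_, level := c.Lookup([]byte(`{"p":"hi"}`), nil)
	fmt.Println(level) // exact hash match, no embedding needed
	_, level = c.Lookup([]byte(`{"p":"hello"}`), []float64{0.99, 0.01})
	fmt.Println(level) // near-identical embedding clears the 97% threshold
}
```

The real L2 path would batch-embed via the mlnode rather than receive a vector directly, but the threshold comparison in basis points is the same idea.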
Quality of computation as an economic incentive: what I had in mind building this PR
The overall idea of keeping a prompt cache is redundant given the KV-cache implementation of vLLM.
A few clarifications so we're aligned:

1. Where this implementation stands
The cache is off by default and gated by CacheQualityParams governance.

2. KV-cache (vLLM) vs this cache — different layers
They're not redundant; they sit at different levels: KV-cache reuses attention key/value tensors at the token level during generation, while this cache reuses entire responses across requests. So KV-cache optimises "how much work we do per inference"; this cache optimises "whether we run inference at all" for repeat or semantically similar prompts. Both can coexist: KV-cache for requests that do hit the GPU, this cache for requests that don't.

3. Asymmetric advantage
The concern that "more traffic → more cache hits → more CacheQualityWeight" favours already-busy nodes is addressed by the operator verification layer described below.

The main points as I understand them: computations that are identical at a given level should not be recomputed; the freed resources should be allocated elsewhere instead. A deeper view is outlined in GiP #860, which proposes classifying computations by quality and incentivizing nodes and users to use them more efficiently overall, without extra financial load.
Read-only semantic cache counters on the operator-only admin port (:9200). Stats() and HitRate() use atomic.LoadInt64 — zero locks, zero side effects, safe at any poll frequency. Nil-safe when no inference nodes are configured.

Intended consumers: DAG epoch-boundary tasks, Prometheus scraper (GiP gonka-ai#840), k8s liveness probes. Not exposed on the public port (:9000).

Also removes three residual "Qdrant" references from comments — the default backend is InMemoryCacheStore; Qdrant is not part of this PR.

Made-with: Cursor
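The lock-free counter pattern described above can be sketched as follows (type and method names are illustrative, not the actual semanticcache API):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// cacheStats holds hit/miss counters updated and read with atomics:
// no locks, no side effects, safe to poll at any frequency.
type cacheStats struct {
	hits   int64
	misses int64
}

func (s *cacheStats) RecordHit()  { atomic.AddInt64(&s.hits, 1) }
func (s *cacheStats) RecordMiss() { atomic.AddInt64(&s.misses, 1) }

// Stats returns a snapshot without locking; the two loads are not a
// single atomic transaction, which is fine for monitoring counters.
func (s *cacheStats) Stats() (hits, misses int64) {
	return atomic.LoadInt64(&s.hits), atomic.LoadInt64(&s.misses)
}

// HitRate is nil-safe, mirroring the behaviour described above when
// no inference nodes (and hence no cache) are configured.
func (s *cacheStats) HitRate() float64 {
	if s == nil {
		return 0
	}
	h, m := s.Stats()
	if h+m == 0 {
		return 0
	}
	return float64(h) / float64(h+m)
}

func main() {
	s := &cacheStats{}
	for i := 0; i < 3; i++ {
		s.RecordHit()
	}
	s.RecordMiss()
	fmt.Println(s.HitRate()) // 3 hits out of 4 lookups
}
```

Because readers only ever load, a scraper hitting the admin endpoint can never block or perturb the request path.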
This PR closes the open gap named in tokenomics-v2: "No incentive for model diversity or utilization quality."

What it adds
How it works — k8s hosts vs bare-metal

The cache is a DAPI-level feature — it works on any deployment: k8s pod, Docker Compose, bare-metal. The node operator does not need to change anything; the feature activates via CacheQualityParams governance.

What changes with k8s (GiP #816): nodes deployed with one model per node receive 100% of that model's traffic, which raises their expected cache hit rate.

Economic scenario matrix (live network data)

All calculations use the measured live-network baseline.
Formula:

Scenarios:
Network impact at scale:
Assumptions and how we verify them:
Operator verification layer — DAG + Prometheus (GiP #816, #840)

Self-reported cache stats are cross-checked against on-chain summaries and Prometheus metrics.
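A hedged sketch of how a three-source cross-check could work — Source A (node self-reported stats), B (on-chain summary), C (Prometheus). The tolerance value and function names are my assumptions, not the actual DAG task:

```go
package main

import "fmt"

// consistent reports whether three independently-sourced hit counts
// agree within a tolerance expressed in basis points. A node whose
// self-report diverges from chain state and Prometheus gets flagged.
func consistent(a, b, c int64, tolBps int64) bool {
	max, min := a, a
	for _, v := range []int64{b, c} {
		if v > max {
			max = v
		}
		if v < min {
			min = v
		}
	}
	if max == 0 {
		return true // nothing reported anywhere: trivially consistent
	}
	// deviation between highest and lowest source, in basis points
	return (max-min)*10000/max <= tolBps
}

func main() {
	// Sources agree within 1% (100 bps): honest reporting.
	fmt.Println(consistent(1000, 995, 1002, 100))
	// Chain shows far fewer hits than the self-report: flag it.
	fmt.Println(consistent(1000, 600, 1002, 100))
}
```

The point of requiring three sources is that an operator controls only Source A; inflating it without also forging chain transactions and scrape metrics makes the discrepancy visible.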
This directly addresses blizko's asymmetric advantage concern: nodes cannot inflate CacheQualityWeight without the discrepancy surfacing in the cross-check.

Economic flow end-to-end

Streaming (…)

Scientist-Validator summary
Next step (GiP #860)
@tcharchian could you add this to the v0.2.11 milestone? It builds directly on the continuous PoC foundation from #856.
… layer, k8s specialization

- GET /admin/v1/cache/stats: atomic read-only counters for DAG/Prometheus
- Operator verification layer: Source A (stats) vs B (chain) vs C (Prometheus)
- k8s node specialization: hit_rate formula, economic self-reinforcement
- L2 text-only limitation documented with multimodal upgrade path
- Known Limitations updated with DAG cross-check mitigation

Made-with: Cursor
…ance) Merge origin/upgrade-v0.2.11 into feature/gip-semantic-cache-trust-layer.

Resolved conflicts:
- server.go: keep both semanticCache/qualityReporter and statsStorage fields
- main.go: add WithStatsStorage to publicServerOpts alongside semantic cache init
- keys.go: upstream 45-46 (ModelLoad/InferenceCount rolling windows) + ours 48-51 (ContinuousPoC + CacheQuality)

Made-with: Cursor
Summary
Nodes that produce high-quality inference results and serve them from cache earn CacheQualityWeight — an additive bonus on top of standard PoC weight. This creates an economic feedback loop: better GPU → better results → more reuse → higher weight → more rewards.

The feature is disabled by default and activated via CacheQualityParams governance.

Architecture — Two-Level Cache

- L1: exact match on sha256(canonical_JSON) — O(1), 100% cryptographically certain
- L2: cosine similarity above SimilarityThresholdBps (default 9700 bps = 97%) via all-MiniLM-L6-v2
- MsgFinishInference is sent on every HIT so the node closes the on-chain cycle and earns CacheQualityWeight

Changes
inference-chain:
- cache_quality.proto — proto source for CacheQualityEpochSummary (new)
- params.proto — CacheQualityParams added (field 14)
- tx.proto — MsgSubmitCacheQualitySummary RPC added
- types/cache_quality.go — params + summary types with full serialisation
- keeper/msg_server_cache_quality.go — submission handler with security bounds
- module/chainvalidation.go — CacheQualityWeight integration at epoch settlement
- app/upgrades/v0_2_11 — seeds CacheQualityParams defaults on upgrade

decentralized-api:
- semanticcache/cache.go — SemanticCache with LookupByPromptHash (L1) + Lookup (L2)
- semanticcache/memory_store.go — InMemoryCacheStore, zero external dependencies
- semanticcache/embedder.go — MLNodeEmbedder + StubEmbedder
- semanticcache/quality_reporter.go — submits MsgSubmitCacheQualitySummary per epoch
- semanticcache/cache_test.go + cache_http_test.go — 20 tests, full validation matrix
- post_chat_handler.go — L1/L2 integration in executor path
- main.go — cache initialisation wired to governance params and epoch events

mlnode:
- embed_routes.py — /api/v1/embed endpoint (CPU-only, all-MiniLM-L6-v2, fastembed)

Validation
All 20 tests reproducible without GPU, ML-node, or live chain:
- TestMatrix_L1_ExactMatch
- TestMatrix_L1_WrongHash
- TestMatrix_L2_SemanticHit
- TestMatrix_L2_BelowThreshold
- TestMatrix_TTL_Eviction
- TestMatrix_ModelVersion_Invalidation
- TestHTTP_L1_HIT_XCacheHeader
- TestHTTP_L1_MISS_NoXCacheHeader
- TestHTTP_L1_VerifyFail_FallThrough
- TestHTTP_TTL_Expired_FallThrough
- TestHTTP_ModelVersion_FallThrough
- TestHTTP_PublicAPIResponseFormat

Protocol Compliance
- InMemoryCacheStore works on every gonka node out of the box
- MsgSubmitCacheQualitySummary registered in InferenceOperationKeyPerms — supports Grant→Exec→Revoke authz delegation for automated reporting keys
- RevokeMLOperationalKeyPermissionsFromAccount added as Revoke counterpart to the existing Grant function
- Upgrade handler seeds CacheQualityParams defaults when nil (safe for existing chain state after binary swap)

Contributes to mlnode optimization (#654) and missed inference reduction (#629).
Addresses on-chain transaction load identified in Inference Scaling discussion (#801).
Depends on continuous PoC (#856).