
feat(semantic-cache): two-level cache with on-chain quality trust layer #859

Open

Mayveskii wants to merge 9 commits into gonka-ai:upgrade-v0.2.11 from Mayveskii:feature/gip-semantic-cache-trust-layer

Conversation

@Mayveskii

Summary

Nodes that produce high-quality inference results and serve them from cache earn CacheQualityWeight — an additive bonus on top of standard PoC weight. This creates an economic feedback loop: better GPU → better results → more reuse → higher weight → more rewards.

The feature is disabled by default and activated via CacheQualityParams governance.

Architecture — Two-Level Cache

User Request
│
▼ handleTransferRequest
  MsgStartInference (async) ──► chain (fee paid, cycle open)
  │
  ▼ handleExecutorRequest
  [L1: PromptHash exact-match] ── HIT ──► verify ResponseHash → MsgFinishInference → CacheQualityWeight
  │ MISS
  [L2: cosine similarity ≥ threshold] ── HIT ──► verify ResponseHash → MsgFinishInference → CacheQualityWeight
  │ MISS
  ▼
  GPU Inference → MsgFinishInference → StoreResult
  • L1: sha256(canonical_JSON) — O(1), 100% cryptographically certain
  • L2: cosine similarity ≥ SimilarityThresholdBps (default 9700 bps = 97%) via all-MiniLM-L6-v2
  • MsgFinishInference is sent on every HIT so the node closes the on-chain cycle and earns CacheQualityWeight
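The two lookup levels above can be sketched in Go. This is an illustrative sketch, not the PR's actual code: the canonical-JSON hashing and the bps-scaled cosine check are as described in the bullets, but all names and signatures here are assumptions.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"math"
)

// promptHash sketches the L1 key: sha256 over a canonical JSON encoding
// of the request (encoding/json sorts map keys, giving a stable byte form).
func promptHash(req map[string]any) string {
	canonical, _ := json.Marshal(req)
	sum := sha256.Sum256(canonical)
	return hex.EncodeToString(sum[:])
}

// similarityBps sketches the L2 check: cosine similarity scaled to basis
// points, to be compared against SimilarityThresholdBps (default 9700 = 97%).
func similarityBps(a, b []float64) int {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return int(10000 * dot / (math.Sqrt(na) * math.Sqrt(nb)))
}

func main() {
	h := promptHash(map[string]any{"model": "m", "prompt": "hello"})
	fmt.Println(len(h)) // 64 hex characters
	// Identical embeddings score 10000 bps, clearing the 9700 threshold.
	fmt.Println(similarityBps([]float64{1, 0}, []float64{1, 0}) >= 9700)
}
```

In the real flow the L1 key is computed before the L2 embedding lookup, so the O(1) hash path short-circuits the more expensive similarity search.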

Changes

inference-chain:

  • cache_quality.proto — proto source for CacheQualityEpochSummary (new)
  • params.proto — CacheQualityParams added (field 14)
  • tx.proto — MsgSubmitCacheQualitySummary RPC added
  • types/cache_quality.go — params + summary types with full serialisation
  • keeper/msg_server_cache_quality.go — submission handler with security bounds
  • module/chainvalidation.go — CacheQualityWeight integration at epoch settlement
  • app/upgrades/v0_2_11 — seeds CacheQualityParams defaults on upgrade

decentralized-api:

  • semanticcache/cache.go — SemanticCache with LookupByPromptHash (L1) + Lookup (L2)
  • semanticcache/memory_store.go — InMemoryCacheStore, zero external dependencies
  • semanticcache/embedder.go — MLNodeEmbedder + StubEmbedder
  • semanticcache/quality_reporter.go — submits MsgSubmitCacheQualitySummary per epoch
  • semanticcache/cache_test.go + cache_http_test.go — 20 tests, full validation matrix
  • post_chat_handler.go — L1/L2 integration in executor path
  • main.go — cache initialisation wired to governance params and epoch events

mlnode:

  • embed_routes.py — /api/v1/embed endpoint (CPU-only, all-MiniLM-L6-v2, fastembed)

Validation

All 20 tests reproducible without GPU, ML-node, or live chain:

cd decentralized-api && go test ./semanticcache/... -v -count=1
Test                                  Result
TestMatrix_L1_ExactMatch              L1 HIT, SimilarityBps=10000
TestMatrix_L1_WrongHash               Different PromptHash → MISS
TestMatrix_L2_SemanticHit             Cosine ≥ 9700 bps → HIT
TestMatrix_L2_BelowThreshold          Orthogonal vector → MISS
TestMatrix_TTL_Eviction               Expired entry → MISS after EvictExpired
TestMatrix_ModelVersion_Invalidation  Governance upgrade → MISS
TestHTTP_L1_HIT_XCacheHeader          X-Cache: HIT, X-Cache-Level: 1, correct body
TestHTTP_L1_MISS_NoXCacheHeader       MISS → no X-Cache header
TestHTTP_L1_VerifyFail_FallThrough    Tampered ResponseHash → fall through to GPU
TestHTTP_TTL_Expired_FallThrough      Epoch 101 > ValidUntilEpoch 100 → MISS
TestHTTP_ModelVersion_FallThrough     v1 entry rejected under v2 governance
TestHTTP_PublicAPIResponseFormat      Response parseable, sha256 non-empty
8 more unit tests                     All PASS

Protocol Compliance

  • No external dependencies — InMemoryCacheStore works on every gonka node out of the box
  • Collection prefix collision resolved: ContinuousPoC 48–50, CacheQuality 51
  • Error codes: 1177–1181 (upstream range, does not overlap with 1166–1169)
  • MsgSubmitCacheQualitySummary registered in InferenceOperationKeyPerms — supports Grant→Exec→Revoke authz delegation for automated reporting keys
  • RevokeMLOperationalKeyPermissionsFromAccount added as Revoke counterpart to the existing Grant function
  • Upgrade handler seeds CacheQualityParams defaults when nil (safe for existing chain state after binary swap)
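The nil-guarded seeding described in the last bullet can be sketched as follows. This is a simplified illustration under stated assumptions: the struct layouts and field names beyond those the PR mentions (Enabled, SimilarityThresholdBps, MaxWeightFractionBps) are hypothetical, and the real handler operates on keeper state rather than an in-memory struct.

```go
package main

import "fmt"

// CacheQualityParams sketches the governance parameter set; field names
// beyond those named in the PR summary are illustrative.
type CacheQualityParams struct {
	Enabled                bool
	SimilarityThresholdBps uint32
	MaxWeightFractionBps   uint32
}

type Params struct {
	CacheQualityParams *CacheQualityParams // field 14; nil in pre-upgrade state
}

// seedCacheQualityDefaults mirrors the upgrade-handler behaviour: after the
// binary swap, existing chain state has a nil CacheQualityParams, so defaults
// are seeded only when the field is nil, and the feature stays disabled.
func seedCacheQualityDefaults(p *Params) {
	if p.CacheQualityParams != nil {
		return // already set; never overwrite live governance values
	}
	p.CacheQualityParams = &CacheQualityParams{
		Enabled:                false, // disabled by default, per the summary
		SimilarityThresholdBps: 9700,  // 97% default from the summary
		MaxWeightFractionBps:   3000,  // 30% cap from the summary
	}
}

func main() {
	var p Params
	seedCacheQualityDefaults(&p)
	fmt.Println(p.CacheQualityParams.Enabled, p.CacheQualityParams.SimilarityThresholdBps)
}
```

The nil check is what makes the handler safe to run on state written by an older binary: it only fills the gap, it never resets parameters that governance has already changed.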

Contributes to mlnode optimization (#654) and missed inference reduction (#629).
Addresses on-chain transaction load identified in Inference Scaling discussion (#801).
Depends on continuous PoC (#856).

cisco added 6 commits March 4, 2026 13:08
…, nonce validation

Builds on the continuous PoC foundation (GiP gonka-ai#821) with three critical
missing components:

- Extend PruningState with ContinuousPoCCommitsPrunedEpoch and
  ContinuousPoCChallengePrunedEpoch fields (pruning_state.proto + .pb.go)
- Add GetContinuousPoCCommitsPruner and GetContinuousPoCChallengesPruner
  to pruning.go using the same Pruner[K,V] pattern as existing collections
- Wire into Keeper.Prune(), called from EndBlock every block

- Add ContinuousPoCEpochSummaries map to WeightCalculator
- GetAllContinuousPoCEpochSummariesForEpoch loads all summaries at settlement
- calculateParticipantWeight adds effective_poc_weight to baseCount before
  applying combinedFactor — disabled if PenaltyApplied is true

- ContinuousPoCChallenge type with full Marshal/Unmarshal/Size
- IssueContinuousPoCChallenges: called each block, samples commits by
  ValidationSampleRateBps using app_hash as deterministic entropy
- RespondContinuousPoCChallenge: verifies sha256-based Merkle proof;
  invalid proof or expired challenge zeros the epoch EffectivePocWeight
- ExpireContinuousPoCChallenges: zeroes weight for unanswered challenges

PR gonka-ai#845 adds ContinuousPocParams to Params but omits MarshalToSizedBuffer,
Size, and Unmarshal changes in params.pb.go, so the field would never be
persisted. This PR adds the full hand-written codec for ContinuousPoCParams,
ContinuousPoCCommit, ContinuousPoCEpochSummary, and ContinuousPoCChallenge
in types/continuous_poc.go, and wires field 14 into params.pb.go.

Closes gonka-ai#821

Made-with: Cursor
- Rename proto field 5 from continuous_poc_summaries_pruned_epoch
  to continuous_poc_challenges_pruned_epoch (naming matched its use)
- Add proto field 6 continuous_poc_summaries_pruned_epoch for the
  ContinuousPoCEpochSummaries collection
- Add GetContinuousPoCEpochSummariesPruner keyed on Pair[uint64, AccAddress]
  and wire it into Keeper.Prune() to prevent summaries accumulating forever
- Update PruningState round-trip and backward-compat tests for field 6

Made-with: Cursor
Nodes that produce high-quality inference results and serve them from cache
earn CacheQualityWeight — an additive bonus on top of standard PoC weight.
This creates an economic feedback loop: better GPU → better results → more
reuse → higher weight → more rewards.

Two lookup levels:
  L1 — PromptHash exact-match (sha256 of canonical JSON), O(1), 100% certain.
  L2 — cosine similarity via all-MiniLM-L6-v2, governance-controlled threshold.

MsgFinishInference is sent on every HIT so the node closes the on-chain cycle.
Feature disabled by default; activated via CacheQualityParams governance.

inference-chain:
- cache_quality.proto: proto source for CacheQualityEpochSummary (new)
- params.proto: CacheQualityParams added (field 14)
- tx.proto: MsgSubmitCacheQualitySummary RPC added
- types/cache_quality.go: params and summary types with serialisation
- keeper/msg_server_cache_quality.go: submission handler with bounds checks
- keeper/pruning.go: CacheQualityEpochSummaries pruner
- module/chainvalidation.go: CacheQualityWeight integration at epoch settlement
- app/upgrades/v0_2_11: seeds CacheQualityParams defaults on upgrade

decentralized-api:
- semanticcache/cache.go: SemanticCache with LookupByPromptHash (L1) + Lookup (L2)
- semanticcache/memory_store.go: InMemoryCacheStore, zero external dependencies
- semanticcache/embedder.go: MLNodeEmbedder + StubEmbedder
- semanticcache/quality_reporter.go: QualityReporter submits per-epoch summary
- semanticcache/cache_test.go + cache_http_test.go: 20 tests, full matrix
- post_chat_handler.go: L1/L2 integration in executor path
- main.go: cache initialisation wired to governance params and epoch events
- mlnodeclient: Embed() method added to interface and client

mlnode:
- packages/api/src/api/embed_routes.py: /api/v1/embed endpoint (CPU, fastembed)

docs:
- docs/specs/semantic-cache.md: two-level architecture, developer simulation

Depends on continuous PoC (gonka-ai#856). Closes gonka-ai#821.

Mayveskii commented Mar 4, 2026

Quality of computation as an economic incentive: what I had in mind building this PR
When I built this, the cache was the mechanism — but the original intent was broader. tokenomics-v2/bitcoin-reward.md explicitly named it as an open gap: "No incentive for model diversity or utilization quality." CacheQualityWeight is the first concrete implementation of that direction.
The logic is straightforward: a node that earns more for the quality of its result starts optimizing for it. Better inference → more reuse → higher epoch weight → more rewards → better inference. A closed economic loop, not a declaration.
For this loop to work, participants need to be able to send quality signals without friction. The architecture here already supports this, or with minimal additions:
API — inference_id is in every response; feedback comes back as a single tag on the next request through the same OpenAI-compatible contract
SDK — library for popular stacks, wraps the call and collects implicit signals automatically
Widget — embeddable UI component, zero code required from the developer
Implicit signals — DAPI already sees session depth and semantic repetition with no integration needed
Webhook — node pushes an event to the developer app when inference closes, app returns outcome automatically
CLI / authz — MsgSubmitCacheQualitySummary is already registered in InferenceOperationKeyPerms with full Grant→Exec→Revoke flow for node operators
CacheQualityParams + epoch settlement are generic enough to carry any quality signal beyond cache hits. This is the foundation for the utilization bonuses tokenomics-v2 outlined as the next step but didn't implement.
Would love to hear whether extending CacheQualityParams toward a pluggable quality axis registry makes sense as a follow-on GiP, or whether that's better scoped separately.

@blizko
Collaborator

blizko commented Mar 5, 2026

The overall idea of keeping a Prompt cache is redundant to the kv-cache implementation of vllm.
Rewarding validators for keeping a cache results in asymmetric advantage for validators with higher GPU capacity. Higher GPU capacity => Higher probability of getting requests => Higher cache utilisation => Even higher reward potential

@Mayveskii

The overall idea of keeping a Prompt cache is redundant to the kv-cache implementation of vllm. Rewarding validators for keeping a cache results in asymmetric advantage for validators with higher GPU capacity. Higher GPU capacity => Higher probability of getting requests => Higher cache utilisation => Even higher reward potential

A few clarifications so we're aligned:

1. Where this implementation stands

The cache is off by default and gated by CacheQualityParams in governance. What's in this PR is a building block for the broader direction: improving quality of computation inside the network and rewarding it. The "final" design (e.g. pluggable quality axes, other signals beyond cache hits) will build on this. Right now the goal is community feedback on the idea and the direction, not a full production code review. The broader vision (quality axis registry, measurable useful work) is in GiP discussion #860.

2. KV-cache (vLLM) vs this cache — different layers

They're not redundant; they sit at different levels:

  • vLLM KV-cache: caches key–value states inside the model during the forward pass. It speeds up inference when prompts share a prefix (e.g. same system prompt, batched requests). The model still runs; you're reusing intermediate activations.
  • This cache: request/response level. On an L1 (exact PromptHash) or L2 (semantic similarity) hit we return a stored response and do not start inference. No forward pass, no GPU work for that request — just a lookup (hash or embedding index) and a response.

So KV-cache optimises "how much work we do per inference"; this cache optimises "whether we run inference at all" for repeat or semantically similar prompts. Both can coexist: KV-cache for requests that do hit the GPU, this cache for requests that don't.

3. Asymmetric advantage

The concern that "more traffic → more cache hits → more CacheQualityWeight" is valid. That's a reward-distribution effect, not a compute-load effect: serving a cache hit is cheap (no inference). The design already allows mitigating it (feature off by default, governance params, bounds in the submission handler). If we move forward, we can add caps or normalisation so the bonus doesn't dominate. I'd rather lock the mechanism and then tune parameters with governance than block the direction on that.

The main point, as I understand it: computations that are identical at a given level should not be recomputed, and the resources they would have consumed should go elsewhere. A deeper view is outlined in GiP #860, which proposes classifying computations by quality and incentivizing nodes and users to use them more efficiently overall, without extra financial load.

@Mayveskii Mayveskii mentioned this pull request Mar 6, 2026
Read-only semantic cache counters on the operator-only admin port (:9200).
Stats() and HitRate() use atomic.LoadInt64 — zero locks, zero side effects,
safe at any poll frequency. Nil-safe when no inference nodes are configured.

Intended consumers: DAG epoch-boundary tasks, Prometheus scraper (GiP gonka-ai#840),
k8s liveness probes. Not exposed on the public port (:9000).

Also removes three residual "Qdrant" references from comments — the default
backend is InMemoryCacheStore; Qdrant is not part of this PR.

Made-with: Cursor

Mayveskii commented Mar 6, 2026

This PR closes the open gap named in tokenomics-v2: "No incentive for model diversity or utilization quality."

What it adds

  • Two-level semantic cache in DAPI: L1 exact-match on PromptHash (SHA-256, cryptographically certain), L2 cosine similarity via MLNodeEmbedder (CPU-only all-MiniLM-L6-v2, 384 dims — no GPU lock)
  • CacheQualityWeight — additive PoC bonus at epoch settlement, governance-controlled via CacheQualityParams (cap: MaxWeightFractionBps = 30%)
  • QualityReporter — submits MsgSubmitCacheQualitySummary once per epoch with CacheReuseCount, OriginalComputeCount, AvgSimilarityBps, EmbeddingModelVersion
  • GET /admin/v1/cache/stats — atomic read-only hit/miss counters on operator-only admin port :9200. Stats() / HitRate() use atomic.LoadInt64 — zero locks, zero side effects, nil-safe.
  • InMemoryCacheStore — default backend, zero external dependencies, no new services
  • Authz delegation — MsgSubmitCacheQualitySummary added to InferenceOperationKeyPerms, enabling Grant → Exec → Revoke for operational keys (Unified Permissions #760, tested in Test voting delegation #857)
  • docs/specs/semantic-cache.md — full spec: architecture, chain/API components, operator verification layer, k8s specialization economics, developer simulation, known limitations
  • 20 tests, no GPU, no ML-node, no chain required — per Developer Simulation in docs/specs/semantic-cache.md
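The lock-free counters behind the /admin/v1/cache/stats bullet can be sketched as below. Type and method names here are illustrative (the PR only names Stats() and HitRate()); the point is the atomic.LoadInt64 reads and the nil-safe receiver described above.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// CacheStats sketches the read-only counters served on the admin port.
type CacheStats struct {
	hits, misses int64
}

func (s *CacheStats) RecordHit()  { atomic.AddInt64(&s.hits, 1) }
func (s *CacheStats) RecordMiss() { atomic.AddInt64(&s.misses, 1) }

// Stats is nil-safe and side-effect-free: a nil receiver (no inference
// nodes configured) simply reports zeros, so pollers never crash.
func (s *CacheStats) Stats() (hits, misses int64) {
	if s == nil {
		return 0, 0
	}
	return atomic.LoadInt64(&s.hits), atomic.LoadInt64(&s.misses)
}

func (s *CacheStats) HitRate() float64 {
	hits, misses := s.Stats()
	if hits+misses == 0 {
		return 0
	}
	return float64(hits) / float64(hits+misses)
}

func main() {
	var s CacheStats
	s.RecordHit()
	s.RecordHit()
	s.RecordMiss()
	fmt.Printf("%.2f\n", s.HitRate()) // 2 hits of 3 requests
}
```

Because reads take no locks, a Prometheus scraper or DAG task can poll at any frequency without perturbing the serving path.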

How it works — k8s hosts vs bare-metal

The cache is a DAPI-level feature — it works on any deployment: k8s pod, Docker Compose, bare-metal. The node operator does not need to change anything; the feature activates via governance (CacheQualityParams.Enabled).

What changes with k8s (GiP #816): nodes deployed with one model per node receive 100% of that model's traffic via GetRandomExecutor routing (M=1). This maximises cache hit rate organically — no configuration, no tuning. Bare-metal hosts with shared models still benefit, but with proportionally lower hit rates (M nodes sharing traffic → hit_rate/M).

Economic scenario matrix (live network data)

All calculations use measured baseline from gonka.gg/api/public (epoch 190):

  • 75,016 inferences/epoch avg (2,325,522 over 31 epochs)
  • 129 participants, 3,188 GPUs (H100-eq: 3,235)

Formula:

effective_hit_rate = repeat_fraction × (1/M) × (1 - stream_fraction)
reuseCount/epoch  = effective_hit_rate × (75,016 / M)

CacheQualityWeight: additive bonus to baseCount, cap = MaxWeightFractionBps = 30%
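The hit-rate formula above can be expressed directly in Go; this is a sketch of the arithmetic only, with function and parameter names of my own choosing.

```go
package main

import "fmt"

// effectiveHitRate applies the formula above:
//   effective_hit_rate = repeat_fraction × (1/M) × (1 − stream_fraction)
// where M is the number of nodes sharing the model's traffic.
func effectiveHitRate(repeatFraction float64, m int, streamFraction float64) float64 {
	return repeatFraction * (1 / float64(m)) * (1 - streamFraction)
}

func main() {
	// Scenario 4 from the matrix below: shared model across 10 nodes,
	// 30% repeat traffic, 10% streaming.
	fmt.Printf("%.3f\n", effectiveHitRate(0.30, 10, 0.10)) // 0.027
}
```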

Scenarios:

#  Scenario                                M     repeat%  stream%  hit_rate  reuseCount/ep  GPU saves/ep  +Weight
0  Cold start (epoch 1)                    —     0%       —        0         0              0             +0%
1  Baseline, no cache                      —     —        —        0         0              0             +0%
2  Testnet genesis (2 nodes)               2     30%      0%       0.15      11,252         11,252        partial
3  Testnet + streaming 30%                 2     30%      30%      0.105     7,876          7,876         lower
4  Shared model (10 nodes)                 10    30%      10%      0.027     2,025          20,253        minimal
5  Multi-model node (3 models, M=1 each)   1     30%      0%       0.30×3    67,515         67,515        +30% cap
6  Specialized (1 node, unique model)      1     30%      0%       0.30      22,505         22,505        +30% cap
7  Specialized + high repeat demand        1     60%      0%       0.60      45,010         45,010        +30% cap
8  20% of network specialized (33 nodes)   1 ea  40%      5%       0.38      28,506         940,698       +30% cap
9  50% of network specialized (81 nodes)   1 ea  40%      5%       0.38      28,506         2,308,986     +30% cap

Network impact at scale:

Parameter               Scenario 8 (20% specialized)  Scenario 9 (50% specialized)
GPU saves/epoch         940,698                       2,308,986
GPU saves/year (52 ep)  ~48.9M                        ~120.1M
Operator income delta   +30% cap per node             +30% cap per node

Assumptions and how we verify them:

Assumption                  Risk                                       Verification
repeat_fraction = 30-60%    Real % unknown — may be lower              Live inference with X-Cache headers
stream% = 5-30%             If stream > 50% → hit_rate ≈ 0             DAPI logs stream:true count
M=1 for specialization      Node may not be unique → M > 1             node-config check on testnet
reuseCount self-reported    Community concern (@blizko)                stats(A) vs chain(B) ± 5% via DAG
MaxWeightFractionBps = 30%  Confirmed in docs/specs/semantic-cache.md  governance default

Operator verification layer — DAG + Prometheus (GiP #816, #840)

Self-reported CacheReuseCount can be cross-checked at each epoch boundary:

[Epoch boundary] → DAG / k8s CronJob / Go ticker
  ├── Source A: GET /admin/v1/cache/stats       → {hits, misses, hit_rate}
  ├── Source B: chain CacheQualityEpochSummary  → reuseCount (self-reported)
  └── Source C: Prometheus scraper (GiP #840)   → independent time-series

  stats(A).hits ≈ chain(B).reuseCount ± 5%   →  self-report validated
  divergence > threshold                      →  operator alert

/admin/v1/cache/stats makes Source A possible. Prometheus exporter (GiP #840) already reads admin API on :9200 — integrates without new infrastructure.

This directly addresses blizko's asymmetric advantage concern: nodes cannot inflate reuseCount without the divergence appearing in Source A vs Source B cross-check, while MaxWeightFractionBps (30%) bounds the maximum gain even without verification. The verification layer is fully reproducible on k8s infrastructure (GiP #816); bare-metal operator verification remains an open question for community discussion.
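The ±5% cross-check above reduces to a small tolerance comparison; the sketch below shows one way to express it. Function and variable names are mine, and the real DAG task would fetch the two values over HTTP and gRPC rather than take them as arguments.

```go
package main

import (
	"fmt"
	"math"
)

// withinTolerance sketches the Source A vs Source B cross-check: the locally
// observed hit counter must agree with the self-reported on-chain reuseCount
// to within the given relative tolerance, otherwise the operator is alerted.
func withinTolerance(statsHits, chainReuseCount int64, tolerance float64) bool {
	if chainReuseCount == 0 {
		return statsHits == 0 // nothing reported: only a zero counter agrees
	}
	diff := math.Abs(float64(statsHits - chainReuseCount))
	return diff/float64(chainReuseCount) <= tolerance
}

func main() {
	fmt.Println(withinTolerance(10_300, 10_000, 0.05)) // 3% divergence: validated
	fmt.Println(withinTolerance(12_000, 10_000, 0.05)) // 20% divergence: alert
}
```

Anchoring the tolerance to the self-reported chain value means an operator cannot widen their own acceptance window by inflating the local counter.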

Economic flow end-to-end

User request → DAPI
  ├── [L1/L2 HIT] → verify ResponseHash (sha256) → serve cached result
  │     └── MsgFinishInference → QualityReporter.RecordReuse
  │           └── epoch boundary → MsgSubmitCacheQualitySummary → chain
  │                 └── chainvalidation.go: CacheQualityWeight added to baseCount
  │                       └── +EpochGroup power → model assignments + Reward Coins
  └── [MISS] → GPU inference → MsgFinishInference → StoreResult(embedding, payload)

Streaming (stream: true) bypasses cache entirely — SSE format cannot be replayed.

Scientist-Validator summary

Layer                                                 Status      Evidence
Mechanism correctness                                 PROVEN      20/20 tests PASS (0.103s), all fail-paths verified
ResponseHash integrity                                PROVEN      TestHTTP_L1_VerifyFail_FallThrough — tamper → fallthrough, not crash
Governance live update                                PROVEN      TestUpdateCacheParams_ModelVersionChange — immediate invalidation
Admin stats endpoint                                  PROVEN      go build ./... clean, nil-safe, atomic reads
Network baseline                                      REAL        75,016 inferences/ep (epoch 190, live API data)
Economic impact projection                            CALCULATED  9 scenarios with measured baseline, honest assumptions table
Live X-Cache / reuseCount / CacheQualityWeight delta  PENDING     Requires CacheQualityParams.Enabled = true + tokens for live inference

Next step (GiP #860)

CacheQualityParams + epoch settlement pattern is generic enough to carry additional quality axes (session continuity, explicit feedback, verifiable outcome) without rebuilding the reward layer. See #860.

@tcharchian could you add this to the v0.2.11 milestone? It builds directly on the continuous PoC foundation from #856.

cisco added 2 commits March 6, 2026 04:06
… layer, k8s specialization

- GET /admin/v1/cache/stats: atomic read-only counters for DAG/Prometheus
- Operator verification layer: Source A (stats) vs B (chain) vs C (Prometheus)
- k8s node specialization: hit_rate formula, economic self-reinforcement
- L2 text-only limitation documented with multimodal upgrade path
- Known Limitations updated with DAG cross-check mitigation

Made-with: Cursor
…ance)

Merge origin/upgrade-v0.2.11 into feature/gip-semantic-cache-trust-layer.

Resolved conflicts:
- server.go: keep both semanticCache/qualityReporter and statsStorage fields
- main.go: add WithStatsStorage to publicServerOpts alongside semantic cache init
- keys.go: upstream 45-46 (ModelLoad/InferenceCount rolling windows) + ours 48-51 (ContinuousPoC + CacheQuality)

Made-with: Cursor