
feat(semantic-cache): two-level cache with on-chain quality trust layer #859

Open

Mayveskii wants to merge 9 commits into gonka-ai:upgrade-v0.2.11 from Mayveskii:feature/gip-semantic-cache-trust-layer

Conversation

@Mayveskii

Summary

Nodes that produce high-quality inference results and serve them from cache earn CacheQualityWeight — an additive bonus on top of standard PoC weight. This creates an economic feedback loop: better GPU → better results → more reuse → higher weight → more rewards.

The feature is disabled by default and activated via CacheQualityParams governance.

Architecture — Two-Level Cache

User Request
│
▼ handleTransferRequest
  MsgStartInference (async) ──► chain (fee paid, cycle open)
  │
  ▼ handleExecutorRequest
  [L1: PromptHash exact-match] ── HIT ──► verify ResponseHash → MsgFinishInference → CacheQualityWeight
  │ MISS
  [L2: cosine similarity ≥ threshold] ── HIT ──► verify ResponseHash → MsgFinishInference → CacheQualityWeight
  │ MISS
  ▼
  GPU Inference → MsgFinishInference → StoreResult
  • L1: sha256(canonical_JSON) — O(1), 100% cryptographically certain
  • L2: cosine similarity ≥ SimilarityThresholdBps (default 9700 bps = 97%) via all-MiniLM-L6-v2
  • MsgFinishInference is sent on every HIT so the node closes the on-chain cycle and earns CacheQualityWeight
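The two lookup levels above can be sketched in Go. This is an illustrative sketch, not the PR's actual code: the canonical-JSON hashing and the bps-scaled cosine check are as described in the bullets, but all names and signatures here are assumptions.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"math"
)

// promptHash sketches the L1 key: sha256 over a canonical JSON encoding
// of the request (encoding/json sorts map keys, giving a stable byte form).
func promptHash(req map[string]any) string {
	canonical, _ := json.Marshal(req)
	sum := sha256.Sum256(canonical)
	return hex.EncodeToString(sum[:])
}

// similarityBps sketches the L2 check: cosine similarity scaled to basis
// points, to be compared against SimilarityThresholdBps (default 9700 = 97%).
func similarityBps(a, b []float64) int {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return int(10000 * dot / (math.Sqrt(na) * math.Sqrt(nb)))
}

func main() {
	h := promptHash(map[string]any{"model": "m", "prompt": "hello"})
	fmt.Println(len(h)) // 64 hex characters
	// Identical embeddings score 10000 bps, clearing the 9700 threshold.
	fmt.Println(similarityBps([]float64{1, 0}, []float64{1, 0}) >= 9700)
}
```

In the real flow the L1 key is computed before the L2 embedding lookup, so the O(1) hash path short-circuits the more expensive similarity search.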

Changes

inference-chain:

  • cache_quality.proto — proto source for CacheQualityEpochSummary (new)
  • params.proto — CacheQualityParams added (field 14)
  • tx.proto — MsgSubmitCacheQualitySummary RPC added
  • types/cache_quality.go — params + summary types with full serialisation
  • keeper/msg_server_cache_quality.go — submission handler with security bounds
  • module/chainvalidation.go — CacheQualityWeight integration at epoch settlement
  • app/upgrades/v0_2_11 — seeds CacheQualityParams defaults on upgrade

decentralized-api:

  • semanticcache/cache.go — SemanticCache with LookupByPromptHash (L1) + Lookup (L2)
  • semanticcache/memory_store.go — InMemoryCacheStore, zero external dependencies
  • semanticcache/embedder.go — MLNodeEmbedder + StubEmbedder
  • semanticcache/quality_reporter.go — submits MsgSubmitCacheQualitySummary per epoch
  • semanticcache/cache_test.go + cache_http_test.go — 20 tests, full validation matrix
  • post_chat_handler.go — L1/L2 integration in executor path
  • main.go — cache initialisation wired to governance params and epoch events

mlnode:

  • embed_routes.py — /api/v1/embed endpoint (CPU-only, all-MiniLM-L6-v2, fastembed)

Validation

All 20 tests reproducible without GPU, ML-node, or live chain:

cd decentralized-api && go test ./semanticcache/... -v -count=1
Test                                  Result
TestMatrix_L1_ExactMatch              L1 HIT, SimilarityBps=10000
TestMatrix_L1_WrongHash               Different PromptHash → MISS
TestMatrix_L2_SemanticHit             Cosine ≥ 9700 bps → HIT
TestMatrix_L2_BelowThreshold          Orthogonal vector → MISS
TestMatrix_TTL_Eviction               Expired entry → MISS after EvictExpired
TestMatrix_ModelVersion_Invalidation  Governance upgrade → MISS
TestHTTP_L1_HIT_XCacheHeader          X-Cache: HIT, X-Cache-Level: 1, correct body
TestHTTP_L1_MISS_NoXCacheHeader       MISS → no X-Cache header
TestHTTP_L1_VerifyFail_FallThrough    Tampered ResponseHash → fall through to GPU
TestHTTP_TTL_Expired_FallThrough      Epoch 101 > ValidUntilEpoch 100 → MISS
TestHTTP_ModelVersion_FallThrough     v1 entry rejected under v2 governance
TestHTTP_PublicAPIResponseFormat      Response parseable, sha256 non-empty
8 more unit tests                     All PASS

Protocol Compliance

  • No external dependencies — InMemoryCacheStore works on every gonka node out of the box
  • Collection prefix collision resolved: ContinuousPoC 48–50, CacheQuality 51
  • Error codes: 1177–1181 (upstream range, does not overlap with 1166–1169)
  • MsgSubmitCacheQualitySummary registered in InferenceOperationKeyPerms — supports Grant→Exec→Revoke authz delegation for automated reporting keys
  • RevokeMLOperationalKeyPermissionsFromAccount added as Revoke counterpart to the existing Grant function
  • Upgrade handler seeds CacheQualityParams defaults when nil (safe for existing chain state after binary swap)
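The nil-guarded seeding described in the last bullet can be sketched as follows. This is a simplified illustration under stated assumptions: the struct layouts and field names beyond those the PR mentions (Enabled, SimilarityThresholdBps, MaxWeightFractionBps) are hypothetical, and the real handler operates on keeper state rather than an in-memory struct.

```go
package main

import "fmt"

// CacheQualityParams sketches the governance parameter set; field names
// beyond those named in the PR summary are illustrative.
type CacheQualityParams struct {
	Enabled                bool
	SimilarityThresholdBps uint32
	MaxWeightFractionBps   uint32
}

type Params struct {
	CacheQualityParams *CacheQualityParams // field 14; nil in pre-upgrade state
}

// seedCacheQualityDefaults mirrors the upgrade-handler behaviour: after the
// binary swap, existing chain state has a nil CacheQualityParams, so defaults
// are seeded only when the field is nil, and the feature stays disabled.
func seedCacheQualityDefaults(p *Params) {
	if p.CacheQualityParams != nil {
		return // already set; never overwrite live governance values
	}
	p.CacheQualityParams = &CacheQualityParams{
		Enabled:                false, // disabled by default, per the summary
		SimilarityThresholdBps: 9700,  // 97% default from the summary
		MaxWeightFractionBps:   3000,  // 30% cap from the summary
	}
}

func main() {
	var p Params
	seedCacheQualityDefaults(&p)
	fmt.Println(p.CacheQualityParams.Enabled, p.CacheQualityParams.SimilarityThresholdBps)
}
```

The nil check is what makes the handler safe to run on state written by an older binary: it only fills the gap, it never resets parameters that governance has already changed.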

Contributes to mlnode optimization (#654) and missed inference reduction (#629).
Addresses on-chain transaction load identified in Inference Scaling discussion (#801).
Depends on continuous PoC (#856).

cisco added 6 commits March 4, 2026 13:08
…, nonce validation

Builds on the continuous PoC foundation (GiP gonka-ai#821) with three critical
missing components:

- Extend PruningState with ContinuousPoCCommitsPrunedEpoch and
  ContinuousPoCChallengePrunedEpoch fields (pruning_state.proto + .pb.go)
- Add GetContinuousPoCCommitsPruner and GetContinuousPoCChallengesPruner
  to pruning.go using the same Pruner[K,V] pattern as existing collections
- Wire into Keeper.Prune(), called from EndBlock every block

- Add ContinuousPoCEpochSummaries map to WeightCalculator
- GetAllContinuousPoCEpochSummariesForEpoch loads all summaries at settlement
- calculateParticipantWeight adds effective_poc_weight to baseCount before
  applying combinedFactor — disabled if PenaltyApplied is true

- ContinuousPoCChallenge type with full Marshal/Unmarshal/Size
- IssueContinuousPoCChallenges: called each block, samples commits by
  ValidationSampleRateBps using app_hash as deterministic entropy
- RespondContinuousPoCChallenge: verifies sha256-based Merkle proof;
  invalid proof or expired challenge zeros the epoch EffectivePocWeight
- ExpireContinuousPoCChallenges: zeroes weight for unanswered challenges

PR gonka-ai#845 adds ContinuousPocParams to Params but omits MarshalToSizedBuffer,
Size, and Unmarshal changes in params.pb.go, so the field would never be
persisted. This PR adds the full hand-written codec for ContinuousPoCParams,
ContinuousPoCCommit, ContinuousPoCEpochSummary, and ContinuousPoCChallenge
in types/continuous_poc.go, and wires field 14 into params.pb.go.

Closes gonka-ai#821

Made-with: Cursor
- Rename proto field 5 from continuous_poc_summaries_pruned_epoch
  to continuous_poc_challenges_pruned_epoch (naming matched its use)
- Add proto field 6 continuous_poc_summaries_pruned_epoch for the
  ContinuousPoCEpochSummaries collection
- Add GetContinuousPoCEpochSummariesPruner keyed on Pair[uint64, AccAddress]
  and wire it into Keeper.Prune() to prevent summaries accumulating forever
- Update PruningState round-trip and backward-compat tests for field 6

Made-with: Cursor
Nodes that produce high-quality inference results and serve them from cache
earn CacheQualityWeight — an additive bonus on top of standard PoC weight.
This creates an economic feedback loop: better GPU → better results → more
reuse → higher weight → more rewards.

Two lookup levels:
  L1 — PromptHash exact-match (sha256 of canonical JSON), O(1), 100% certain.
  L2 — cosine similarity via all-MiniLM-L6-v2, governance-controlled threshold.

MsgFinishInference is sent on every HIT so the node closes the on-chain cycle.
Feature disabled by default; activated via CacheQualityParams governance.

inference-chain:
- cache_quality.proto: proto source for CacheQualityEpochSummary (new)
- params.proto: CacheQualityParams added (field 14)
- tx.proto: MsgSubmitCacheQualitySummary RPC added
- types/cache_quality.go: params and summary types with serialisation
- keeper/msg_server_cache_quality.go: submission handler with bounds checks
- keeper/pruning.go: CacheQualityEpochSummaries pruner
- module/chainvalidation.go: CacheQualityWeight integration at epoch settlement
- app/upgrades/v0_2_11: seeds CacheQualityParams defaults on upgrade

decentralized-api:
- semanticcache/cache.go: SemanticCache with LookupByPromptHash (L1) + Lookup (L2)
- semanticcache/memory_store.go: InMemoryCacheStore, zero external dependencies
- semanticcache/embedder.go: MLNodeEmbedder + StubEmbedder
- semanticcache/quality_reporter.go: QualityReporter submits per-epoch summary
- semanticcache/cache_test.go + cache_http_test.go: 20 tests, full matrix
- post_chat_handler.go: L1/L2 integration in executor path
- main.go: cache initialisation wired to governance params and epoch events
- mlnodeclient: Embed() method added to interface and client

mlnode:
- packages/api/src/api/embed_routes.py: /api/v1/embed endpoint (CPU, fastembed)

docs:
- docs/specs/semantic-cache.md: two-level architecture, developer simulation

Depends on continuous PoC (gonka-ai#856). Closes gonka-ai#821.

Mayveskii commented Mar 4, 2026

Quality of computation as an economic incentive: what I had in mind building this PR
When I built this, the cache was the mechanism — but the original intent was broader. tokenomics-v2/bitcoin-reward.md explicitly named it as an open gap: "No incentive for model diversity or utilization quality." CacheQualityWeight is the first concrete implementation of that direction.
The logic is straightforward: a node that earns more for the quality of its result starts optimizing for it. Better inference → more reuse → higher epoch weight → more rewards → better inference. A closed economic loop, not a declaration.
For this loop to work, participants need to be able to send quality signals without friction. The architecture here already supports this, or with minimal additions:
API — inference_id is in every response; feedback comes back as a single tag on the next request through the same OpenAI-compatible contract
SDK — library for popular stacks, wraps the call and collects implicit signals automatically
Widget — embeddable UI component, zero code required from the developer
Implicit signals — DAPI already sees session depth and semantic repetition with no integration needed
Webhook — node pushes an event to the developer app when inference closes, app returns outcome automatically
CLI / authz — MsgSubmitCacheQualitySummary is already registered in InferenceOperationKeyPerms with full Grant→Exec→Revoke flow for node operators
CacheQualityParams + epoch settlement are generic enough to carry any quality signal beyond cache hits. This is the foundation for the utilization bonuses tokenomics-v2 outlined as the next step but didn't implement.
Would love to hear whether extending CacheQualityParams toward a pluggable quality axis registry makes sense as a follow-on GiP, or whether that's better scoped separately.

@blizko
Collaborator

blizko commented Mar 5, 2026

The overall idea of keeping a Prompt cache is redundant to the kv-cache implementation of vllm.
Rewarding validators for keeping a cache results in asymmetric advantage for validators with higher GPU capacity. Higher GPU capacity => Higher probability of getting requests => Higher cache utilisation => Even higher reward potential

@Mayveskii

The overall idea of keeping a Prompt cache is redundant to the kv-cache implementation of vllm. Rewarding validators for keeping a cache results in asymmetric advantage for validators with higher GPU capacity. Higher GPU capacity => Higher probability of getting requests => Higher cache utilisation => Even higher reward potential

A few clarifications so we're aligned:

1. Where this implementation stands

The cache is off by default and gated by CacheQualityParams in governance. What's in this PR is a building block for the broader direction: improving quality of computation inside the network and rewarding it. The "final" design (e.g. pluggable quality axes, other signals beyond cache hits) will build on this. Right now the goal is community feedback on the idea and the direction, not a full production code review. The broader vision (quality axis registry, measurable useful work) is in GiP discussion #860.

2. KV-cache (vLLM) vs this cache — different layers

They're not redundant; they sit at different levels:

  • vLLM KV-cache: caches key–value states inside the model during the forward pass. It speeds up inference when prompts share a prefix (e.g. same system prompt, batched requests). The model still runs; you're reusing intermediate activations.
  • This cache: request/response level. On an L1 (exact PromptHash) or L2 (semantic similarity) hit we return a stored response and do not start inference. No forward pass, no GPU work for that request — just a lookup (hash or embedding index) and a response.

So KV-cache optimises "how much work we do per inference"; this cache optimises "whether we run inference at all" for repeat or semantically similar prompts. Both can coexist: KV-cache for requests that do hit the GPU, this cache for requests that don't.

3. Asymmetric advantage

The concern that "more traffic → more cache hits → more CacheQualityWeight" is valid. That's a reward-distribution effect, not a compute-load effect: serving a cache hit is cheap (no inference). The design already allows mitigating it (feature off by default, governance params, bounds in the submission handler). If we move forward, we can add caps or normalisation so the bonus doesn't dominate. I'd rather lock the mechanism and then tune parameters with governance than block the direction on that.

The main point, as I understand it: computations that are identical at a given level should not be recomputed, and the resources they would have consumed should go elsewhere. A deeper view is outlined in GiP #860, which proposes classifying computations by quality and incentivizing nodes and users to use them more efficiently overall, without extra financial load.

@Mayveskii Mayveskii mentioned this pull request Mar 6, 2026
Read-only semantic cache counters on the operator-only admin port (:9200).
Stats() and HitRate() use atomic.LoadInt64 — zero locks, zero side effects,
safe at any poll frequency. Nil-safe when no inference nodes are configured.

Intended consumers: DAG epoch-boundary tasks, Prometheus scraper (GiP gonka-ai#840),
k8s liveness probes. Not exposed on the public port (:9000).

Also removes three residual "Qdrant" references from comments — the default
backend is InMemoryCacheStore; Qdrant is not part of this PR.

Made-with: Cursor

Mayveskii commented Mar 6, 2026

This PR closes the open gap named in tokenomics-v2: "No incentive for model diversity or utilization quality."

What it adds

  • Two-level semantic cache in DAPI: L1 exact-match on PromptHash (SHA-256, cryptographically certain), L2 cosine similarity via MLNodeEmbedder (CPU-only all-MiniLM-L6-v2, 384 dims — no GPU lock)
  • CacheQualityWeight — additive PoC bonus at epoch settlement, governance-controlled via CacheQualityParams (cap: MaxWeightFractionBps = 30%)
  • QualityReporter — submits MsgSubmitCacheQualitySummary once per epoch with CacheReuseCount, OriginalComputeCount, AvgSimilarityBps, EmbeddingModelVersion
  • GET /admin/v1/cache/stats — atomic read-only hit/miss counters on operator-only admin port :9200. Stats() / HitRate() use atomic.LoadInt64 — zero locks, zero side effects, nil-safe.
  • InMemoryCacheStore — default backend, zero external dependencies, no new services
  • Authz delegation — MsgSubmitCacheQualitySummary added to InferenceOperationKeyPerms, enabling Grant → Exec → Revoke for operational keys (Unified Permissions #760, tested in Test voting delegation #857)
  • docs/specs/semantic-cache.md — full spec: architecture, chain/API components, operator verification layer, k8s specialization economics, developer simulation, known limitations
  • 20 tests, no GPU, no ML-node, no chain required — per Developer Simulation in docs/specs/semantic-cache.md
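The lock-free counters behind the /admin/v1/cache/stats bullet can be sketched as below. Type and method names here are illustrative (the PR only names Stats() and HitRate()); the point is the atomic.LoadInt64 reads and the nil-safe receiver described above.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// CacheStats sketches the read-only counters served on the admin port.
type CacheStats struct {
	hits, misses int64
}

func (s *CacheStats) RecordHit()  { atomic.AddInt64(&s.hits, 1) }
func (s *CacheStats) RecordMiss() { atomic.AddInt64(&s.misses, 1) }

// Stats is nil-safe and side-effect-free: a nil receiver (no inference
// nodes configured) simply reports zeros, so pollers never crash.
func (s *CacheStats) Stats() (hits, misses int64) {
	if s == nil {
		return 0, 0
	}
	return atomic.LoadInt64(&s.hits), atomic.LoadInt64(&s.misses)
}

func (s *CacheStats) HitRate() float64 {
	hits, misses := s.Stats()
	if hits+misses == 0 {
		return 0
	}
	return float64(hits) / float64(hits+misses)
}

func main() {
	var s CacheStats
	s.RecordHit()
	s.RecordHit()
	s.RecordMiss()
	fmt.Printf("%.2f\n", s.HitRate()) // 2 hits of 3 requests
}
```

Because reads take no locks, a Prometheus scraper or DAG task can poll at any frequency without perturbing the serving path.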

How it works — k8s hosts vs bare-metal

The cache is a DAPI-level feature — it works on any deployment: k8s pod, Docker Compose, bare-metal. The node operator does not need to change anything; the feature activates via governance (CacheQualityParams.Enabled).

What changes with k8s (GiP #816): nodes deployed with one model per node receive 100% of that model's traffic via GetRandomExecutor routing (M=1). This maximises cache hit rate organically — no configuration, no tuning. Bare-metal hosts with shared models still benefit, but with proportionally lower hit rates (M nodes sharing traffic → hit_rate/M).

Economic scenario matrix (live network data)

All calculations use measured baseline from gonka.gg/api/public (epoch 190):

  • 75,016 inferences/epoch avg (2,325,522 over 31 epochs)
  • 129 participants, 3,188 GPUs (H100-eq: 3,235)

Formula:

effective_hit_rate = repeat_fraction × (1/M) × (1 - stream_fraction)
reuseCount/epoch  = effective_hit_rate × (75,016 / M)

CacheQualityWeight: additive bonus to baseCount, cap = MaxWeightFractionBps = 30%
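The hit-rate formula above can be expressed directly in Go; this is a sketch of the arithmetic only, with function and parameter names of my own choosing.

```go
package main

import "fmt"

// effectiveHitRate applies the formula above:
//   effective_hit_rate = repeat_fraction × (1/M) × (1 − stream_fraction)
// where M is the number of nodes sharing the model's traffic.
func effectiveHitRate(repeatFraction float64, m int, streamFraction float64) float64 {
	return repeatFraction * (1 / float64(m)) * (1 - streamFraction)
}

func main() {
	// Scenario 4 from the matrix below: shared model across 10 nodes,
	// 30% repeat traffic, 10% streaming.
	fmt.Printf("%.3f\n", effectiveHitRate(0.30, 10, 0.10)) // 0.027
}
```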

Scenarios:

#  Scenario                                M     repeat%  stream%  hit_rate  reuseCount/ep  GPU saves/ep  +Weight
0  Cold start (epoch 1)                    —     0%       —        0         0              0             +0%
1  Baseline, no cache                      —     —        —        0         0              0             +0%
2  Testnet genesis (2 nodes)               2     30%      0%       0.15      11,252         11,252        partial
3  Testnet + streaming 30%                 2     30%      30%      0.105     7,876          7,876         lower
4  Shared model (10 nodes)                 10    30%      10%      0.027     2,025          20,253        minimal
5  Multi-model node (3 models, M=1 each)   1     30%      0%       0.30×3    67,515         67,515        +30% cap
6  Specialized (1 node, unique model)      1     30%      0%       0.30      22,505         22,505        +30% cap
7  Specialized + high repeat demand        1     60%      0%       0.60      45,010         45,010        +30% cap
8  20% of network specialized (33 nodes)   1 ea  40%      5%       0.38      28,506         940,698       +30% cap
9  50% of network specialized (81 nodes)   1 ea  40%      5%       0.38      28,506         2,308,986     +30% cap

Network impact at scale:

Parameter               Scenario 8 (20% specialized)  Scenario 9 (50% specialized)
GPU saves/epoch         940,698                       2,308,986
GPU saves/year (52 ep)  ~48.9M                        ~120.1M
Operator income delta   +30% cap per node             +30% cap per node

Assumptions and how we verify them:

Assumption                  Risk                                       Verification
repeat_fraction = 30-60%    Real % unknown — may be lower              Live inference with X-Cache headers
stream% = 5-30%             If stream > 50% → hit_rate ≈ 0             DAPI logs stream:true count
M=1 for specialization      Node may not be unique → M > 1             node-config check on testnet
reuseCount self-reported    Community concern (@blizko)                stats(A) vs chain(B) ± 5% via DAG
MaxWeightFractionBps = 30%  Confirmed in docs/specs/semantic-cache.md  governance default

Operator verification layer — DAG + Prometheus (GiP #816, #840)

Self-reported CacheReuseCount can be cross-checked at each epoch boundary:

[Epoch boundary] → DAG / k8s CronJob / Go ticker
  ├── Source A: GET /admin/v1/cache/stats       → {hits, misses, hit_rate}
  ├── Source B: chain CacheQualityEpochSummary  → reuseCount (self-reported)
  └── Source C: Prometheus scraper (GiP #840)   → independent time-series

  stats(A).hits ≈ chain(B).reuseCount ± 5%   →  self-report validated
  divergence > threshold                      →  operator alert

/admin/v1/cache/stats makes Source A possible. Prometheus exporter (GiP #840) already reads admin API on :9200 — integrates without new infrastructure.

This directly addresses blizko's asymmetric advantage concern: nodes cannot inflate reuseCount without the divergence appearing in Source A vs Source B cross-check, while MaxWeightFractionBps (30%) bounds the maximum gain even without verification. The verification layer is fully reproducible on k8s infrastructure (GiP #816); bare-metal operator verification remains an open question for community discussion.
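The ±5% cross-check above reduces to a small tolerance comparison; the sketch below shows one way to express it. Function and variable names are mine, and the real DAG task would fetch the two values over HTTP and gRPC rather than take them as arguments.

```go
package main

import (
	"fmt"
	"math"
)

// withinTolerance sketches the Source A vs Source B cross-check: the locally
// observed hit counter must agree with the self-reported on-chain reuseCount
// to within the given relative tolerance, otherwise the operator is alerted.
func withinTolerance(statsHits, chainReuseCount int64, tolerance float64) bool {
	if chainReuseCount == 0 {
		return statsHits == 0 // nothing reported: only a zero counter agrees
	}
	diff := math.Abs(float64(statsHits - chainReuseCount))
	return diff/float64(chainReuseCount) <= tolerance
}

func main() {
	fmt.Println(withinTolerance(10_300, 10_000, 0.05)) // 3% divergence: validated
	fmt.Println(withinTolerance(12_000, 10_000, 0.05)) // 20% divergence: alert
}
```

Anchoring the tolerance to the self-reported chain value means an operator cannot widen their own acceptance window by inflating the local counter.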

Economic flow end-to-end

User request → DAPI
  ├── [L1/L2 HIT] → verify ResponseHash (sha256) → serve cached result
  │     └── MsgFinishInference → QualityReporter.RecordReuse
  │           └── epoch boundary → MsgSubmitCacheQualitySummary → chain
  │                 └── chainvalidation.go: CacheQualityWeight added to baseCount
  │                       └── +EpochGroup power → model assignments + Reward Coins
  └── [MISS] → GPU inference → MsgFinishInference → StoreResult(embedding, payload)

Streaming (stream: true) bypasses cache entirely — SSE format cannot be replayed.

Scientist-Validator summary

Layer                                                 Status      Evidence
Mechanism correctness                                 PROVEN      20/20 tests PASS (0.103s), all fail-paths verified
ResponseHash integrity                                PROVEN      TestHTTP_L1_VerifyFail_FallThrough — tamper → fallthrough, not crash
Governance live update                                PROVEN      TestUpdateCacheParams_ModelVersionChange — immediate invalidation
Admin stats endpoint                                  PROVEN      go build ./... clean, nil-safe, atomic reads
Network baseline                                      REAL        75,016 inferences/ep (epoch 190, live API data)
Economic impact projection                            CALCULATED  9 scenarios with measured baseline, honest assumptions table
Live X-Cache / reuseCount / CacheQualityWeight delta  PENDING     Requires CacheQualityParams.Enabled = true + tokens for live inference

Next step (GiP #860)

CacheQualityParams + epoch settlement pattern is generic enough to carry additional quality axes (session continuity, explicit feedback, verifiable outcome) without rebuilding the reward layer. See #860.

@tcharchian could you add this to the v0.2.11 milestone? It builds directly on the continuous PoC foundation from #856.

cisco added 2 commits March 6, 2026 04:06
… layer, k8s specialization

- GET /admin/v1/cache/stats: atomic read-only counters for DAG/Prometheus
- Operator verification layer: Source A (stats) vs B (chain) vs C (Prometheus)
- k8s node specialization: hit_rate formula, economic self-reinforcement
- L2 text-only limitation documented with multimodal upgrade path
- Known Limitations updated with DAG cross-check mitigation

Made-with: Cursor
…ance)

Merge origin/upgrade-v0.2.11 into feature/gip-semantic-cache-trust-layer.

Resolved conflicts:
- server.go: keep both semanticCache/qualityReporter and statsStorage fields
- main.go: add WithStatsStorage to publicServerOpts alongside semantic cache init
- keys.go: upstream 45-46 (ModelLoad/InferenceCount rolling windows) + ours 48-51 (ContinuousPoC + CacheQuality)

Made-with: Cursor