
perf: Implement single storage and retrieval of schemas#2892

Closed
josephschorr wants to merge 22 commits into authzed:main from josephschorr:unified-schema-hash

Conversation

@josephschorr
Member

Summary: Unified Schema Storage Implementation

Overview

This branch implements a fundamental architectural change to SpiceDB's schema storage system, transitioning from namespace-based schema storage to a unified single-storage model with hash-based caching. The implementation provides a gradual migration path supporting four operational modes, ensuring zero-downtime deployment.

Core Architecture Changes

Before: Schemas were stored as individual namespace and caveat definitions across multiple rows. Each read required querying and reconstructing the complete schema from these distributed components. Caching used complex revision-based ranges with checkpoint tracking, watchers, and time-based cleanup.

After: Complete schemas are stored as single versioned entities with content-addressable retrieval via SHA256 hashes. The new hash-based LRU cache provides O(1) lookups with single-flight deduplication, eliminating the complexity of revision tracking.

Key Components

1. StoredSchema Protocol Buffer

New proto definition (core.StoredSchema) for versioned schema storage:

  • Version field for format evolution
  • V1 format containing schema text, SHA256 hash, and timestamp
  • Support for both legacy and unified storage modes
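The fields above can be pictured with a small Go sketch. The real definition is the core.StoredSchema protobuf message; these Go type and field names are illustrative only:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"time"
)

// StoredSchemaV1 mirrors the described V1 format: schema text, its SHA256
// hash, and a write timestamp. Illustrative stand-in for the proto message.
type StoredSchemaV1 struct {
	SchemaText string
	SHA256Hash string
	WrittenAt  time.Time
}

// StoredSchema carries a version field so the serialized format can evolve.
type StoredSchema struct {
	Version uint32
	V1      *StoredSchemaV1
}

// hashSchema computes the content address used for hash-based cache lookups.
func hashSchema(schemaText string) string {
	sum := sha256.Sum256([]byte(schemaText))
	return hex.EncodeToString(sum[:])
}

func main() {
	text := "definition user {}"
	stored := StoredSchema{
		Version: 1,
		V1: &StoredSchemaV1{
			SchemaText: text,
			SHA256Hash: hashSchema(text),
			WrittenAt:  time.Now(),
		},
	}
	fmt.Println(len(stored.V1.SHA256Hash)) // 64: SHA256 digest as hex
}
```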

2. Database Schema

Two new tables added across all SQL datastores (PostgreSQL, MySQL, CockroachDB, Spanner):

schema_revision: Lightweight metadata table for fast change detection

  • name (always 'current'): index key
  • hash: SHA256 of current schema
  • created_xid/timestamp: revision tracking

schema: Content storage with chunking support

  • name: schema identifier
  • chunk_index: sequential chunk number
  • chunk_data: data segment (1MB default)
  • Handles schemas exceeding database column size limits

3. Migration Modes

Four operational modes enable gradual rollout:

  • read-legacy-write-legacy: Default mode, no migration impact
  • read-legacy-write-both: Begin migration, dual-write to populate new tables
  • read-new-write-both: Switch reads to new storage, maintain dual-write
  • read-new-write-new: Complete migration, unified storage only

Configuration via --datastore-experimental-schema-mode flag. Cache size is configurable in code via WithSchemaCacheOptions() (default: 100 entries).
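The routing each mode implies can be sketched as follows. Only the four flag values come from the PR; the helper names are hypothetical:

```go
package main

import "fmt"

// SchemaMode values match the --datastore-experimental-schema-mode flag.
type SchemaMode string

const (
	ReadLegacyWriteLegacy SchemaMode = "read-legacy-write-legacy"
	ReadLegacyWriteBoth   SchemaMode = "read-legacy-write-both"
	ReadNewWriteBoth      SchemaMode = "read-new-write-both"
	ReadNewWriteNew       SchemaMode = "read-new-write-new"
)

// readsNew reports whether reads are served from the unified schema tables.
func readsNew(m SchemaMode) bool {
	return m == ReadNewWriteBoth || m == ReadNewWriteNew
}

// writesLegacy reports whether the legacy namespace/caveat rows are written.
func writesLegacy(m SchemaMode) bool {
	return m != ReadNewWriteNew
}

// writesNew reports whether the unified schema tables are written.
func writesNew(m SchemaMode) bool {
	return m != ReadLegacyWriteLegacy
}

func main() {
	for _, m := range []SchemaMode{ReadLegacyWriteLegacy, ReadLegacyWriteBoth, ReadNewWriteBoth, ReadNewWriteNew} {
		fmt.Printf("%s: readsNew=%t writesLegacy=%t writesNew=%t\n",
			m, readsNew(m), writesLegacy(m), writesNew(m))
	}
}
```

The dual-write modes are what make the rollout reversible: both storages stay populated until reads are fully switched over.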

4. Hash-Based Cache (HashCache)

Replaced complex RevisionedCache with simplified LRU implementation:

Design:

  • Key: Schema hash (string)
  • Value: Cached *core.StoredSchema
  • Size: Configurable, default 100 entries
  • Eviction: LRU with single-flight deduplication
  • Thread safety: Built-in concurrent access protection

Performance characteristics:

  • Cache hit: O(1) hash map lookup
  • Cache miss: DB load with deduplication
  • Overhead: 9-62µs per HeadRevision() call
  • Sentinel optimization: O(1) map lookup for bypass checks

5. Schema Hash Integration

All datastore methods updated to return schema hash atomically with revision:

HeadRevision() → (Revision, SchemaHash, error)
OptimizedRevision() → (Revision, SchemaHash, error)
SnapshotReader(rev, hash) → Reader with hash for cache lookups

SQL datastores load hash via COALESCE subquery selecting current schema_revision. Memory datastore stores hash with each snapshot.

@github-actions github-actions bot added area/cli Affects the command line area/api v1 Affects the v1 API area/datastore Affects the storage system area/tooling Affects the dev or user toolchain (e.g. tests, ci, build tools) area/dispatch Affects dispatching of requests labels Feb 10, 2026
@codecov

codecov bot commented Feb 10, 2026

Codecov Report

❌ Patch coverage is 57.92313% with 1248 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.81%. Comparing base (cd8f1ce) to head (bbbf8b0).
⚠️ Report is 27 commits behind head on main.

Files with missing lines Patch % Lines
internal/datastore/schema/singlestore.go 26.20% 244 Missing and 4 partials ⚠️
internal/datastore/mysql/schema_chunker.go 24.04% 92 Missing and 6 partials ⚠️
internal/datastore/crdb/schema_chunker.go 27.11% 74 Missing and 4 partials ⚠️
internal/datastore/spanner/readwrite.go 58.23% 52 Missing and 14 partials ⚠️
internal/datastore/mysql/readwrite.go 52.42% 46 Missing and 13 partials ⚠️
internal/datastore/postgres/readwrite.go 52.42% 46 Missing and 13 partials ⚠️
internal/datastore/postgres/schema_chunker.go 30.96% 54 Missing and 4 partials ⚠️
internal/datastore/crdb/readwrite.go 50.51% 40 Missing and 9 partials ⚠️
internal/datastore/common/sqlschema.go 61.67% 33 Missing and 13 partials ⚠️
internal/datastore/memdb/memdb.go 65.05% 33 Missing and 10 partials ⚠️
... and 38 more

❌ Your project status has failed because the head coverage (73.81%) is below the target coverage (75.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2892      +/-   ##
==========================================
- Coverage   74.61%   73.81%   -0.79%     
==========================================
  Files         484      491       +7     
  Lines       58538    61053    +2515     
==========================================
+ Hits        43672    45061    +1389     
- Misses      11800    12798     +998     
- Partials     3066     3194     +128     


@josephschorr josephschorr force-pushed the unified-schema-hash branch 4 times, most recently from baaee98 to e28cc5f Compare February 12, 2026 17:24
josephschorr and others added 19 commits February 12, 2026 12:29
Add StoredSchema message to support storing complete schema definitions
as a single versioned entity in datastores. The schema includes:
- Version field for future schema format evolution
- V1 format with schema text, hash, and timestamp
- Support for both namespace-based and unified storage modes

This enables datastores to store and retrieve schemas atomically,
providing better consistency and performance for schema operations.

Add configuration options for schema storage mode with four modes:
- ReadLegacyWriteLegacy: use legacy namespace/caveat tables (default)
- ReadLegacyWriteBoth: write to both legacy and new schema tables
- ReadNewWriteBoth: read from new schema table, write to both
- ReadNewWriteNew: fully use new schema table

Also add schema cache configuration options:
- MaximumCacheMemoryBytes: memory limit for schema caching
- QuantizationWindow: time window for cache entry consolidation

These options enable gradual migration from legacy to unified schema
storage without downtime.

Add interfaces for reading and writing schemas in datastores:
- SingleStoreSchemaReader/Writer: for new unified schema storage
- SchemaReader/Writer: mode-aware adapters that delegate to either
  legacy namespace/caveat operations or unified schema operations

The adapter pattern (schemaadapter.NewSchemaWriter) enables smooth
migration by routing operations based on the configured schema mode.
Datastores implement SingleStore interfaces, and the adapter handles
the complexity of dual-write scenarios and mode switching.

Add context helpers for tracking schema mode across operations.

Add migrations for new schema_revision and schema_chunks tables:
- schema_revision: stores current schema hash and timestamp for fast
  schema change detection
- schema_chunks: stores schema content in chunks to work around
  database column size limits

Tables added to:
- PostgreSQL (migration 0024)
- MySQL (migration 0011)
- CockroachDB (migration 0010)
- Spanner (migration 0012)

Each datastore uses appropriate column types and indexing strategies
optimized for their storage engines.

Add SQLSchemaReaderWriter to handle storage of large schemas by:
- Chunking schemas into configurable chunk sizes (default 1MB)
- Storing chunks in schema_chunks table with sequential indexing
- Reassembling chunks on read with validation

Add database-specific chunk executors for:
- PostgreSQL: uses pgx transactions
- MySQL: uses sqlx transactions
- CockroachDB: uses pgx with CRDB-specific features
- Spanner: uses Cloud Spanner mutations and queries

The chunking approach handles schemas larger than database column
size limits while maintaining transactional consistency.
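The chunk/reassemble cycle can be sketched as follows. The helper names are hypothetical; only the 1MB default comes from the commit:

```go
package main

import "fmt"

// defaultChunkSize mirrors the 1MB default described in the commit.
const defaultChunkSize = 1 << 20

// chunkBytes splits serialized schema bytes into sequentially indexed
// chunks of at most size bytes (the slice index plays the role of the
// chunk_index column).
func chunkBytes(data []byte, size int) [][]byte {
	var chunks [][]byte
	for len(data) > 0 {
		n := size
		if n > len(data) {
			n = len(data)
		}
		chunks = append(chunks, data[:n])
		data = data[n:]
	}
	return chunks
}

// reassemble concatenates chunks back into the original bytes on read.
func reassemble(chunks [][]byte) []byte {
	var out []byte
	for _, c := range chunks {
		out = append(out, c...)
	}
	return out
}

func main() {
	data := []byte("definition user {}")
	chunks := chunkBytes(data, 5) // tiny chunk size for the demo
	fmt.Println(len(chunks), string(reassemble(chunks)))
}
```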
Implement SingleStoreSchemaHashWatcher for each datastore to detect
schema changes in real-time:

- PostgreSQL: polls schema_revision table at configurable intervals
- MySQL: polls schema_revision table at configurable intervals
- CockroachDB: uses CHANGEFEED with resolved timestamps for checkpoints
- Spanner: polls schema_revision table at configurable intervals

Watchers invoke callbacks with (schemaHash, revision) on changes,
enabling caches and watches to stay synchronized with schema updates.
Polling is optimized to query only the lightweight schema_revision
table rather than the full schema_chunks table.

Wire up unified schema support in all datastores:

- Add SchemaWriter() method to ReadWriteTransaction returning mode-aware
  schema writers that handle legacy/new/both read and write patterns
- Add WithTimestampRevision interface to revisions for time-based cache
  cleanup (implemented by HLCRevision and TimestampRevision)
- Initialize SQLSchemaReaderWriter with schema hash watchers in
  PostgreSQL, MySQL datastores
- Add SchemaHashWatcherForTesting() interfaces for test access
- Update legacy caveat/namespace write operations to also update
  schema_revision table with computed schema hash
- Add support for reading schemas at specific revisions where supported

Each datastore now supports all four schema modes (ReadLegacyWriteLegacy,
ReadLegacyWriteBoth, ReadNewWriteBoth, ReadNewWriteNew) for gradual
migration without downtime.

Add RevisionedSchemaCache to cache schemas by revision with safety
guarantees:

- Lock-free reads using atomic snapshots for high performance
- Async writes via background goroutine to avoid blocking
- Checkpoint-based validity for all ranges (current and historical)
- Ranges track startRevision and checkpointRevision (no endRevision)
- Ranges are only valid from start up to and including checkpoint
- Separate currentRange and historicalRanges list
- Time-based cleanup for WithTimestampRevision types using quantization
  window (deletes ranges older than 2x window)
- Memory-based eviction (keeps minimum 2 ranges)
- Schema hash watcher integration for automatic cache updates

Safety design: Both current and historical ranges trust only their
checkpoint revision. This prevents returning cached data for revisions
that may have had undetected schema changes between watcher checkpoints,
preferring safe cache misses over incorrect data.

Update all datastore proxy implementations to support SchemaReader and
SchemaWriter interfaces:

- observable: track schema read/write operations in metrics
- counting: count schema operations
- schemacaching: cache schema reads (both legacy and unified)
- relationshipintegrity: pass through schema operations
- checkingreplicated: verify schema consistency across replicas
- strictreplicated: enforce schema replication
- indexcheck: validate schema integrity

Proxies delegate to underlying datastore's schema methods while adding
their layer-specific functionality (caching, metrics, validation, etc.).

Add UnifiedSchemaTest to verify all four schema modes across all
datastores:

- Tests schema write and read operations in all modes
- Verifies correct routing between legacy and unified storage
- Tests schema versioning and revision tracking
- Validates concurrent schema updates
- Tests schema caching behavior
- Verifies gradual migration path works correctly

Test suite runs for:
- PostgreSQL (all modes)
- MySQL (all modes)
- CockroachDB (all modes)
- Spanner (all modes)

Each test creates multiple schema versions and verifies they can be
read back at the correct revisions, ensuring proper isolation and
consistency guarantees.

Replace the complex revision-based cache with a simpler hash-based LRU
cache for schema storage. This improves performance and simplifies the
caching logic significantly.

Key changes:
- Replace RevisionedCache with HashCache using schema hash as key
- Remove complex checkpoint tracking, watchers, and time-based cleanup
- Add schema hash to HeadRevision() and OptimizedRevision() returns
- Update all datastores to load schema hash atomically with revision
- Add migrations to populate schema/schema_revision tables from existing data
- Add NoSchemaHash sentinel for transaction reads and testing
- Include schema_hash in dispatch ResolverMeta for distributed caching
- Add Prometheus metrics for cache hits/misses
- Remove unused revision-based cache files (revisionedcache.go, rangelist.go)

Performance improvements:
- O(1) hash lookup vs O(log n) binary search
- Single-flight deduplication prevents duplicate schema loads
- Configurable LRU size (default: 100 entries)
- Overhead: 9-62µs per HeadRevision call (benchmarked on PostgreSQL)

Schema hash loading per datastore:
- PostgreSQL/MySQL: COALESCE subquery for current hash
- CockroachDB: Same pattern with bytea type
- Spanner: Similar with bytes type
- Memory: Hash stored with each snapshot

Migration details:
- Each datastore migration reads legacy namespaces/caveats
- Generates canonical schema (alphabetically sorted)
- Computes SHA256 hash and stores in schema_revision table
- Stores serialized schema in chunked format (1MB chunks)

Add support for testing with both legacy and new unified schema modes
in the steelthread test framework. Each test now runs with both modes
to ensure functional equivalence and enable performance comparison.

Changes:
- Modified TestNonMemdbSteelThreads to run with LegacySchema and NewSchema modes
- Added BenchmarkSteelThreadSchemaMode for performance comparison
- Memdb tests remain unchanged (schema mode not applicable)

Benchmark Results (PostgreSQL on Apple M1 Max):

Overall Performance:
- Legacy schema: 10.119s
- New schema: 8.701s
- Improvement: 14.0% faster

Top Performance Gains:
- Cursored reads by subject (page_size=10): 38.1% faster
- Cursored lookup resources (page_size=5): 33.7% faster
- Cursored reads by subject (page_size=2): 31.3% faster
- Bulk checks with traits: 28.3% faster
- Lookup subjects for document: 27.4% faster
- Cursored reads by resource (page_size=2): 25.9% faster

Performance by Operation Type:
- Cursored read operations: 15-38% faster
- Bulk check operations: 12-28% faster
- Lookup subject operations: 10-27% faster
- Indirect lookup operations: 4-20% faster
- Schema write operations: 4-10% faster

Technical Improvements:
- O(1) hash lookup vs O(log n) binary search
- LRU cache with singleflight deduplication
- HeadRevision overhead: 9-62µs per call
- Unified schema storage reduces read paths

All tests pass with identical results in both modes, confirming
functional equivalence while delivering significant performance
improvements.

The datastore proxy layers (schemacaching, observable, counting) were
hardcoded to always return LegacySchemaReaderAdapter, regardless of the
configured schema mode. This forced all schema reads through legacy
methods even when the datastore was configured for unified schema mode
(SchemaModeReadNewWriteNew), resulting in:

- 16.71% of memory allocations (1234.74MB) in legacy schema methods
- Lost opportunity for hash-based schema cache benefits
- Bypassed O(1) unified schema lookups in favor of O(log n) legacy reads

Fix:
All proxy SchemaReader() methods now:
1. Query the underlying reader to determine its configured schema mode
2. For unified schema mode: pass through directly (hash cache is efficient)
3. For legacy mode: wrap with LegacySchemaReaderAdapter (maintains caching)

This ensures the proxy layers preserve caching behavior while respecting
the datastore's schema mode configuration.

Impact:
- Eliminates 17% memory overhead when using unified schema mode
- Enables full benefits of hash-based schema caching (O(1) lookups)
- Expected to improve new schema mode performance from 14% to 22-27% faster
  than legacy mode

Test updates:
- Added helper functions for consistent mock setup
- Updated all test mocks to expect SchemaReader() calls
- All schemacaching proxy tests pass

- Add NewSQLByteChunker function that returns errors instead of panicking
- Update sqlschema to use NewSQLByteChunker in error-returning functions
- Fix test mocks to include SchemaHash parameter in datastore method calls
- Handle Prometheus metrics AlreadyRegisteredError in test environments
- Pre-allocate slice capacity in spanner readwrite
- Add nolint directive for safe SQL formatting in migrations
- Explicitly ignore error returns in benchmark and panic tests

Spanner does not allow mixing mutations (BufferWrite) and DML statements
in the same transaction. The migration framework's WriteVersion method uses
DML, so all upTx callbacks must also use DML to avoid conflicts.

Converted two migrations to use DML:
- add-metadata-and-counters: Changed BufferWrite to INSERT DML
- populate-schema-tables: Changed BufferWrite to INSERT DML with loop

This fixes the 'Previously received a different request with this seqno' error.

This change ensures that the legacy schema hash is ONLY written when
in "write to both" modes (SchemaModeReadLegacyWriteBoth or
SchemaModeReadNewWriteBoth), not in "write to legacy only" mode
(SchemaModeReadLegacyWriteLegacy).

Key changes:

1. Removed writeLegacySchemaHash() from individual legacy operations
   (LegacyWriteNamespaces, LegacyWriteCaveats, etc.) across all
   datastores to prevent write-write conflicts on schema_revision
   table in CRDB's serializable isolation.

2. Added WriteLegacySchemaHashFromDefinitions implementation to all
   datastores (CRDB, Postgres, MySQL, Spanner, MemDB) to support
   writing the hash from buffered definitions without requiring reads.

3. Moved hash-writing logic from LegacySchemaWriterAdapter.WriteSchema()
   to dualSchemaWriter.WriteSchema() to ensure the hash is only written
   when using the dual writer (i.e., in "write to both" modes).

This fixes the CRDB test failure in TestCaveatSnapshotReads where
write conflicts were occurring due to concurrent schema_revision
updates, while ensuring the hash is still written correctly for
unified schema testing.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

… schemas

MemDB supports both legacy and unified schema storage, but was only
using LegacySchemaWriterAdapter which after our changes no longer writes
the legacy schema hash. This caused TestUnifiedSchemaHashWatch to fail
with a timeout waiting for hash updates.

Changes:
1. Updated SchemaWriter() to return a dualSchemaWriter that writes to
   both legacy and unified storage (SchemaModeReadNewWriteBoth)

2. Refactored schema writing methods to avoid deadlocks:
   - Added writeStoredSchemaNoLock() that uses the transaction's
     db handle without acquiring additional locks
   - Added writeSchemaHashNoLock() for transaction-scoped hash writes
   - Added writeLegacySchemaHashFromDefinitionsNoLock() for legacy hash

3. Updated WriteStoredSchema() and WriteLegacySchemaHashFromDefinitions()
   to use the transaction's db handle via txSource()

This ensures MemDB writes to both storages and properly updates the
schema hash for watchers, fixing the unified schema hash watch test.

The SingleStoreSchemaHashWatcher interface and all its implementations
were only used for testing and not in production code. This commit
removes:

- SingleStoreSchemaHashWatcher interface and SchemaWatchCallback type
- All datastore schemawatch.go implementation files (postgres, mysql,
  crdb, spanner, memdb)
- SchemaHashWatcherForTesting methods from all datastores and proxies
- UnifiedSchemaHashWatchTest test function

Changes made:
- Inlined readSchemaHash logic into SchemaHashReaderForTesting methods
- Added missing SchemaModeForTesting to observable proxy and validating
  datastore wrapper
- Fixed test files to pass schema hash parameter to SnapshotReader
- Fixed linter issues (perfsprint, testifylint)

All datastore tests pass (postgres, mysql, crdb).

We therefore need to read them ourselves, manually, in the new schema writer
Comment on lines +10 to +15
schemaCacheLookups = promauto.NewCounterVec(prometheus.CounterOpts{
Namespace: "spicedb",
Subsystem: "datastore",
Name: "schema_cache_lookups_total",
Help: "total number of schema cache Get() calls",
}, []string{"result"}) // result: "hit" or "miss"
Contributor
@miparnisari miparnisari Feb 13, 2026

can we keep this consistent with the existing metrics?

descCacheHitsTotal = prometheus.NewDesc(
stringz.Join("_", promNamespace, promSubsystem, "hits_total"),
"Number of cache hits",
[]string{"cache"},
nil,
)
descCacheMissesTotal = prometheus.NewDesc(
stringz.Join("_", promNamespace, promSubsystem, "misses_total"),
"Number of cache misses",
[]string{"cache"},
nil,
)

so, as a consumer, i will see the same metric but with a new label cache=schema

# HELP spicedb_cache_hits_total Number of cache hits
# TYPE spicedb_cache_hits_total counter
spicedb_cache_hits_total{cache="cluster_dispatch"} 0
spicedb_cache_hits_total{cache="dispatch"} 0
spicedb_cache_hits_total{cache="namespace"} 0
# HELP spicedb_cache_misses_total Number of cache misses
# TYPE spicedb_cache_misses_total counter
spicedb_cache_misses_total{cache="cluster_dispatch"} 0
spicedb_cache_misses_total{cache="dispatch"} 0
spicedb_cache_misses_total{cache="namespace"} 0

Member Author

Changed

if err != nil {
return shared.RewriteErrorWithoutConfig(ctx, err)
}
// If cursor has empty schema hash (legacy cursor), skip cache
Contributor

  1. "skip cache"? there is no "cache" mentioned in this file. is this leaking a detail from one of the subsequent method calls?
  2. can we not move this if to inside decodeCursor()?

Member Author

Moved

perfinsights.SetInContext(ctx, perfinsights.NoLabels)

atRevision, _, err := consistency.RevisionFromContext(ctx)
atRevision, _, _, err := consistency.RevisionAndSchemaHashFromContext(ctx)
Contributor

why change this if you're going to ignore the returned value?

Member Author

Because we need it in many other places?


handle.(*revisionHandle).revision = revision
// Note: For some consistency modes (AtLeastAsFresh, AtExactSnapshot), schema hash may be empty.
// This will cause cache lookups to fail, which is expected until schema hash is properly threaded
Contributor
@miparnisari miparnisari Feb 13, 2026

"expected until schema hash is properly threaded through these code paths"

is this a follow-up PR?

Member Author

Removed; it was a WIP comment

Comment on lines +680 to +688
SnapshotReader(Revision, SchemaHash) Reader

// OptimizedRevision gets a revision that will likely already be replicated
// and will likely be shared amongst many queries.
OptimizedRevision(ctx context.Context) (Revision, error)
OptimizedRevision(ctx context.Context) (Revision, SchemaHash, error)

// HeadRevision gets a revision that is guaranteed to be at least as fresh as
// right now.
HeadRevision(ctx context.Context) (Revision, error)
HeadRevision(ctx context.Context) (Revision, SchemaHash, error)
Contributor

godoc updates please. it's not obvious why SnapshotReader needs a SchemaHash

Member Author

Added but it has to mention the cache now

"fmt"
"sync/atomic"

lru "github.com/hashicorp/golang-lru/v2"
Contributor
@miparnisari miparnisari Feb 13, 2026

we have three cache implementations and you brought another one? 😓


Member Author

I didn't want to wire through the complexities of one of the existing cache libs; we don't really need it. Otherwise, we'd have to actually wire it through into the hashcache.go, which has... issues. I can do it, if you like, but it will make the datastore configuration that much more complicated

Subject: subject,
CaveatContext: caveatContext,
AtRevision: devContext.Revision,
SchemaHash: datastore.NoSchemaHashForTesting, // DevContext uses memdb which doesn't support hashing
Contributor

this seems so wrong.. why does the caller of ComputeCheck have to know what the underlying datastore is to pass the right parameter?

Member Author

Because it's being called from a development environment? Otherwise, we'd have to load it from somewhere

Member Author

I changed it to its own bypass, but the bypass is correct

}

reader := devContext.Datastore.SnapshotReader(devContext.Revision)
// DevContext uses memdb which doesn't support schema hashing
Contributor
@miparnisari miparnisari Feb 13, 2026

this seems so wrong.. why does the caller of SnapshotReader have to know what the underlying datastore is to pass the right parameter?

(this is why i asked you to update the godoc of SnapshotReader)

Member Author

Because if the underlying datastore is memdb, doing the extra work to load the hash makes zero sense and devcontext is always using memdb.

Contributor
@miparnisari miparnisari left a comment

  • can you add instructions to this PR on how to test this by hand, perhaps with docker-compose up?

Contributor
@tstirrat15 tstirrat15 left a comment

See comments

}
errMsg := err.Error()
// Check both the error message and wrapped errors
return strings.Contains(errMsg, "no chunks found") || strings.Contains(errMsg, "failed to reassemble chunks")
Contributor

Is there a way to use types rather than string matching here? Or are we working across types and that's why this approach was taken?

Member Author

Sure

const defaultMaxCacheEntries = 10

// bypassSentinels contains all schema hash values that intentionally bypass the cache.
// These are special sentinel values used in specific contexts where caching is not appropriate.
Contributor

Is the idea that bypassing the cache prevents the cache from being populated and having entries potentially evicted because of new entries?

Member Author

No, it's about not having access to the schema hash and therefore skipping the cache when it provides no value.

// bypassSentinels contains all schema hash values that intentionally bypass the cache.
// These are special sentinel values used in specific contexts where caching is not appropriate.
// Using a map for O(1) lookup instead of slice iteration.
var bypassSentinels = map[datastore.SchemaHash]bool{
Contributor

mapz.NewSet?

Member Author

Not really needed for a quick check; changed to struct{} though to save a few bytes

// The LRU cache itself is thread-safe, so no locks are needed for cache operations.
// The atomic latest entry provides lock-free fast-path for the common case.
type SchemaHashCache struct {
cache *lru.Cache[string, *core.StoredSchema] // Thread-safe LRU
Contributor

Is there a reason that we're not using the same cache(s) that we use for other purposes?

Member Author

It requires wiring it through the entire datastore init system, for a tiny LRU cache with max 10 entries. If you like, I can do so, but it's a lot of code change for something quite small

Contributor

It requires wiring it through the entire datastore init system

That being an artifact of our current cache interface? It surprises me that a handle on a cache would have anything to do with the datastore system.

func NewSchemaHashCache(opts options.SchemaCacheOptions) (*SchemaHashCache, error) {
maxEntries := int(opts.MaximumCacheEntries)
if maxEntries == 0 {
maxEntries = defaultMaxCacheEntries
Contributor

10 seems reasonable. Is this something that we should benchmark in future work to figure out whether a different value works better? Or is this based on reasoning about defaults around default quantization intervals etc?

Member Author

It's simply a guess: unless someone is changing schemas very, very rapidly, the normal number of entries should be 0 in the LRU (1 current entry, 0 LRU ones)

Comment on lines +118 to +120
if err := rwt.writeCaveat(tx, caveats); err != nil {
return err
}
Contributor

Why this rewrite?

Member Author

Same thing: I was adding logs afterwards for testing, but then removed them when I was done

r.db.RLock()
defer r.db.RUnlock()

tx := r.db.db.Txn(false)
Contributor

db.db?

Member Author

Changed to mdb.db

Comment on lines +79 to +81
const (
ctxKeyTransactionID contextKey = "mysql_transaction_id"
)
Contributor

Use ctxkey here

return nil, err
}

// Hash-based cache doesn't need watchers or warming
Contributor

Does this contradict line 344?

Member Author

Yes, it does. Watching is still correct, but I removed it

created_xid xid8 NOT NULL DEFAULT (pg_current_xact_id()),
deleted_xid xid8 NOT NULL DEFAULT ('9223372036854775807'),
CONSTRAINT pk_schema PRIMARY KEY (name, chunk_index, created_xid));`,
`CREATE INDEX ix_schema_gc ON schema (deleted_xid DESC) WHERE deleted_xid < '9223372036854775807'::xid8;`,
Contributor

Curious - does an index help when there's one (or an otherwise small number of) row? Or is it actually faster to let it table scan?

Member Author

The WHERE deleted_xid clause makes this a partial index, which keeps GC lookups fast if there are a lot of schema changes in a short window

Contributor

Oh, yeah, that makes sense - it was more of a general question. If we're using an index for GC, would it make sense to also have an index for the content lookups, or does it not make sense because the number of rows is small (1?) and it'd just be overhead? I don't know enough about datastores to know how indices behave when the data are small

}

// Parse the experimental schema mode string if provided
if opts.experimentalSchemaModeString != "" {
Contributor

@miparnisari miparnisari Feb 17, 2026

If you run go run ./cmd/spicedb/main.go serve --grpc-preshared-key foobar --datastore-experimental-schema-mode=read-legacy-write-both, by the time execution gets here the value of opts.experimentalSchemaModeString is somehow read-legacy-write-legacy. And when the server prints its config, it shows "ExperimentalSchemaMode":0 (why a number and not a string?).

In conclusion: the flag isn't being passed down properly, and the printed version of it makes for a poor dev experience.

}

// Create schema reader/writer
schemaReaderWriter, err := common.NewSQLSchemaReaderWriter[uint64, postgresRevision](BaseSchemaChunkerConfig, config.schemaCacheOptions)
Contributor

Putting this call here means we are creating a new cache object for every replica that the server is connected to.

Why did you choose this design and not a proxy datastore object that holds the cache and falls back to the actual datastore impl when an entry isn't there?

Member Author

I didn't want any proxies to be able to interact directly with the cache. The current proxy-based design suffers from a problem: it is easy to forget to add the proxy, and it makes it hard or impossible for the datastores themselves to warm and update the cache. This was a bigger problem when I was using a watch-based update mechanism, but I suppose I could move to a proxy-based design now
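For comparison, the proxy-based alternative the reviewer describes would look roughly like this; the interface and names are illustrative stand-ins, not the PR's actual types:

```go
package main

import (
	"fmt"
	"sync"
)

// SchemaReader is a stand-in for the datastore's schema-read surface.
type SchemaReader interface {
	ReadSchema(hash string) (string, error)
}

// cachingProxy wraps any SchemaReader: hits are served from the cache,
// misses fall through to the underlying datastore and are stored.
// The downside noted above: every call site must remember to wrap the
// datastore, and the datastore itself cannot warm this cache.
type cachingProxy struct {
	mu       sync.Mutex
	cache    map[string]string
	delegate SchemaReader
}

func newCachingProxy(d SchemaReader) *cachingProxy {
	return &cachingProxy{cache: map[string]string{}, delegate: d}
}

func (p *cachingProxy) ReadSchema(hash string) (string, error) {
	p.mu.Lock()
	if s, ok := p.cache[hash]; ok {
		p.mu.Unlock()
		return s, nil
	}
	p.mu.Unlock()

	s, err := p.delegate.ReadSchema(hash)
	if err != nil {
		return "", err
	}
	p.mu.Lock()
	p.cache[hash] = s
	p.mu.Unlock()
	return s, nil
}

// fakeDatastore counts how many reads reach the real datastore.
type fakeDatastore struct{ calls int }

func (f *fakeDatastore) ReadSchema(hash string) (string, error) {
	f.calls++
	return "definition user {}", nil
}

func main() {
	ds := &fakeDatastore{}
	p := newCachingProxy(ds)
	p.ReadSchema("abc")
	p.ReadSchema("abc") // second call is served from the cache
	fmt.Println(ds.calls) // prints: 1
}
```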

@josephschorr
Member Author

Closing in favor of #2924

@github-actions github-actions bot locked and limited conversation to collaborators Feb 25, 2026

Labels

area/api v1: Affects the v1 API
area/cli: Affects the command line
area/datastore: Affects the storage system
area/dispatch: Affects dispatching of requests
area/tooling: Affects the dev or user toolchain (e.g. tests, ci, build tools)
