
Add distributed cache for horizontal scaling #877

Open

hpopuri2 wants to merge 16 commits into trinodb:main from hpopuri2:valkey

Conversation

hpopuri2 (Contributor) commented Jan 27, 2026


## Add Valkey Distributed Cache for Horizontal Scaling

## Summary

This PR implements distributed caching using Valkey to enable horizontal scaling of Trino Gateway. Multiple gateway instances can now share query metadata through a distributed cache layer, ensuring consistent query routing across all instances.

## Motivation

Currently, Trino Gateway uses local Guava caches that are not shared between instances. In multi-instance deployments, this can lead to:

  • Inconsistent query routing when requests hit different gateway instances
  • Cache misses requiring expensive database lookups
  • Inability to leverage cache across horizontally scaled deployments

This implementation addresses these limitations while maintaining backward compatibility and graceful degradation.

## Architecture

3-Tier Caching Strategy

Request Flow:

  1. L1 Cache (Local Guava) → ~1ms
    - Hit: Return immediately
    - Miss: Check L2
  2. L2 Cache (Valkey Distributed) → ~5ms
    - Hit: Populate L1, return
    - Miss: Check L3
  3. L3 Cache (PostgreSQL Database) → ~50ms
    - Found: Populate L2 + L1, return
    - Not Found: Search backends via HTTP (~200ms)
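
The flow above, sketched in Java. This is a conceptual sketch only; the field and helper names (localCache, distributedCache, queryHistoryManager, backendKey(), searchBackendsOverHttp()) are illustrative assumptions, not the literal names used in this PR.

// Sketch of the tiered lookup with backfilling; not the literal PR code.
String findBackendForQueryId(String queryId)
{
    String backend = localCache.getIfPresent(queryId);                  // L1: in-memory
    if (backend != null) {
        return backend;
    }
    backend = distributedCache.get(backendKey(queryId)).orElse(null);   // L2: Valkey
    if (backend != null) {
        localCache.put(queryId, backend);                               // backfill L1
        return backend;
    }
    backend = queryHistoryManager.getBackendForQueryId(queryId);        // L3: database
    if (backend != null) {
        distributedCache.set(backendKey(queryId), backend);             // backfill L2
        localCache.put(queryId, backend);                               // backfill L1
        return backend;
    }
    return searchBackendsOverHttp(queryId);                             // last resort (~200ms)
}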

Cache Keys

Three values are cached for each query:

  • trino:query:backend:{queryId} - Backend URL for query routing
  • trino:query:routing_group:{queryId} - Routing group assignment
  • trino:query:external_url:{queryId} - External URL for query access

All keys use configurable TTL (default 30 minutes / 1800 seconds).
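
For illustration only, writing one such key with the default TTL could look roughly like the snippet below. The query ID, backend URL, and connection details are made up, and this assumes the Jedis-style client that valkey-java exposes under the io.valkey package.

import io.valkey.Jedis;
import io.valkey.JedisPool;

public final class CacheKeyExample
{
    public static void main(String[] args)
    {
        // Illustrative only: store one routing entry under the documented key scheme.
        try (JedisPool pool = new JedisPool("localhost", 6379);
                Jedis jedis = pool.getResource()) {
            String key = "trino:query:backend:20260127_000000_00000_abcde";   // example query ID
            jedis.setex(key, 1800, "http://trino-adhoc:8080");                // default 1800-second (30-minute) TTL
            System.out.println("TTL remaining: " + jedis.ttl(key));
        }
    }
}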

## Implementation Details

Core Components

ValkeyConfiguration (gateway-ha/src/main/java/io/trino/gateway/ha/config/ValkeyConfiguration.java)

  • 9 configurable parameters with sensible defaults
  • Input validation (port range, positive values)
  • Convention over Configuration - only enabled, host, and port required
  • Fixed: cacheTtlSeconds parameter now properly used (was previously hardcoded)

Cache Interface (gateway-ha/src/main/java/io/trino/gateway/ha/cache/Cache.java)

  • Generic caching abstraction: get(), set(), invalidate(), isEnabled()
  • Implementation-agnostic design (name describes contract, not implementation)
  • Enables future alternative implementations (e.g., Redis Cluster, Memcached)
  • Located in dedicated io.trino.gateway.ha.cache package for better organization
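
A sketch of what that contract might look like; the signatures are inferred from the bullet points above rather than copied from the PR.

import java.util.Optional;

// Implementation-agnostic cache contract; an empty Optional signals a miss (or a disabled cache).
public interface Cache
{
    Optional<String> get(String key);

    void set(String key, String value);

    void invalidate(String key);

    boolean isEnabled();
}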

ValkeyDistributedCache (gateway-ha/src/main/java/io/trino/gateway/ha/cache/ValkeyDistributedCache.java)

  • Implements Cache interface
  • JedisPool connection pooling with configurable pool size
  • Graceful degradation when disabled or connection fails
  • Configurable TTL management via cacheTtlSeconds parameter
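
The graceful-degradation behaviour could be sketched like this. Field names, the logger, and the exception handling are assumptions; the real class may differ.

// Sketch: a read that degrades to a cache miss instead of failing the request.
@Override
public Optional<String> get(String key)
{
    if (!enabled) {
        return Optional.empty();
    }
    try (Jedis jedis = jedisPool.getResource()) {
        return Optional.ofNullable(jedis.get(key));
    }
    catch (RuntimeException e) {
        // Log a warning (not an error) and fall through so callers hit L3 instead.
        log.warn("Valkey lookup failed for key %s: %s", key, e.getMessage());
        return Optional.empty();
    }
}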

QueryCacheManager (gateway-ha/src/main/java/io/trino/gateway/ha/cache/QueryCacheManager.java) - NEW

  • Encapsulates all query-related cache operations
  • Manages 3 LoadingCache instances (backend, routing group, external URL)
  • Provides clean separation of concerns between routing and caching logic
  • Handles both L1 (in-memory) and L2 (distributed) cache operations
  • Methods:
    • L1 operations: setBackendInL1(), getBackendFromL1(), etc.
    • L2 operations: cacheBackend(), getCachedBackend(), etc.
    • Combined operations: setBackend(), updateAllCaches()
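
For example, a combined write that keeps both tiers in step might look roughly like this; the method and field names follow the list above but are not the literal code.

// Sketch: write-through to L1 and, when the distributed cache is enabled, to L2.
public void setBackend(String queryId, String backend)
{
    setBackendInL1(queryId, backend);                                     // in-memory LoadingCache
    if (distributedCache.isEnabled()) {
        distributedCache.set("trino:query:backend:" + queryId, backend);  // Valkey, TTL applied by the cache implementation
    }
}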

NoopDistributedCache (gateway-ha/src/test/java/io/trino/gateway/ha/cache/NoopDistributedCache.java)

  • No-op implementation for testing without real cache
  • Always returns empty, always disabled
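
Its whole body is essentially the following, a sketch consistent with that description and with the Cache interface shown earlier.

import java.util.Optional;

// Test double: never stores anything and always reports itself as disabled.
public class NoopDistributedCache
        implements Cache
{
    @Override
    public Optional<String> get(String key)
    {
        return Optional.empty();
    }

    @Override
    public void set(String key, String value) {}

    @Override
    public void invalidate(String key) {}

    @Override
    public boolean isEnabled()
    {
        return false;
    }
}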

Integration

BaseRoutingManager - Simplified routing logic:

  • Now uses single QueryCacheManager instance instead of managing multiple caches
  • Reduced from ~380 to ~310 lines through better separation of concerns
  • updateQueryIdCache() method caches all 3 values via QueryCacheManager
  • All cache operations delegated to QueryCacheManager
  • findBackendForUnknownQueryId() - L1 → L2 → L3 → HTTP search
  • findRoutingGroupForUnknownQueryId() - L1 → L2 → L3 lookup
  • findExternalUrlForUnknownQueryId() - L1 → L2 → L3 lookup
  • Automatic cache backfilling when found in lower tiers

ProxyRequestHandler - Query submission:

  • Updated recordBackendForQueryId() to call updateQueryIdCache() with all 3 values
  • Ensures all query metadata is cached on first submission

HaGatewayProviderModule - Dependency injection:

  • @Provides @Singleton Cache provider method
  • Wires ValkeyConfiguration to ValkeyDistributedCache
  • Passes cacheTtlSeconds from configuration to cache implementation
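
The wiring amounts to roughly the following Guice provider inside HaGatewayProviderModule; the getter name getValkeyConfiguration() is an assumption.

@Provides
@Singleton
public static Cache provideCache(HaGatewayConfiguration configuration)
{
    // ValkeyDistributedCache reads host, port, pool sizes, and cacheTtlSeconds from the configuration object.
    return new ValkeyDistributedCache(configuration.getValkeyConfiguration());
}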

Configuration

Minimal (Recommended for Getting Started)

valkeyConfiguration:
  enabled: true
  host: localhost
  port: 6379

With Authentication

valkeyConfiguration:
  enabled: true
  host: valkey.internal.prod
  port: 6379
  password: ${VALKEY_PASSWORD}
  database: 0

Advanced (Production Tuning)

valkeyConfiguration:
  enabled: true
  host: valkey.internal.prod
  port: 6379
  password: ${VALKEY_PASSWORD}
  database: 0
  maxTotal: 100              # Max connections in pool
  maxIdle: 50                # Max idle connections
  minIdle: 25                # Min idle connections
  timeoutMs: 5000            # Connection timeout
  cacheTtlSeconds: 3600      # 1 hour TTL for long-running queries

Single Instance (No Changes Required)

valkeyConfiguration:
   enabled: false  # Default - local cache sufficient

## Testing

Unit Tests

TestValkeyConfiguration

  • Default values verification
  • Setter/getter correctness

TestValkeyDistributedCache (2 tests)

  • testDisabledCache() - Verifies disabled cache returns empty
  • testNoopDistributedCache() - Tests noop implementation

Integration Tests

TestValkeyDistributedCacheIntegration (9 comprehensive tests using TestContainers)

  • testValkeyConnectionAndBasicOperations() - Basic get/set/invalidate
  • testUpdateQueryIdCachesAllThreeValues() - Verifies all 3 values cached via updateQueryIdCache()
  • testRoutingGroupL2Caching() - L1 miss → L2 hit for routing_group
  • testExternalUrlL2Caching() - L1 miss → L2 hit for external_url
  • testThreeTierCacheLookupForBackend() - L1 miss → L2 hit scenario
  • testCacheBackfillFromDatabase() - L1 miss → L2 miss → L3 hit → backfills L2
  • testMultipleQueryIdsWithDifferentValues() - Multiple concurrent queries
  • testCacheOverwrite() - Cache update behavior
  • testEmptyStringValues() - Edge case handling

TestRoutingManagerExternalUrlCache (6 tests)

  • Tests external URL caching with mocked QueryHistoryManager
  • Verifies L1/L2 cache coordination
  • Tests cache miss fallback to query history

TestContainers Setup

  • Added createValkeyContainer() to TestcontainersUtils
  • Spins up real PostgreSQL and Valkey containers
  • Tests complete 3-tier caching flow end-to-end
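
The container factory is presumably along these lines, a sketch using the generic Testcontainers API; the image tag matches the docker run example given in the migration section below.

import org.testcontainers.containers.GenericContainer;
import org.testcontainers.utility.DockerImageName;

// Sketch: start a disposable Valkey instance for integration tests.
public static GenericContainer<?> createValkeyContainer()
{
    return new GenericContainer<>(DockerImageName.parse("valkey/valkey:latest"))
            .withExposedPorts(6379);
}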

Test Results

  • 194 tests total (routing package), all passing
  • Integration tests verify real Valkey connectivity
  • No regression in existing functionality
  • 0 Checkstyle violations

## Backward Compatibility

✅ Fully backward compatible

  • Disabled by default (enabled: false)
  • No changes required to existing configs
  • Single-instance deployments work exactly as before
  • Existing tests pass without modification

Migration Path

From Single to Multi-Gateway:

  1. Deploy Valkey server
    docker run -d -p 6379:6379 valkey/valkey:latest
  2. Update config.yaml on all gateways
    valkeyConfiguration:
      enabled: true
      host: valkey.internal
      port: 6379
      password: ${VALKEY_PASSWORD}
  3. Rolling restart gateways
  4. Verify the cache is working by checking the Valkey keys:
    docker exec valkey valkey-cli KEYS "trino:query:*"

No data migration needed - cache populates automatically.

## Graceful Degradation

When Valkey is unavailable:

  • ✅ Queries continue working (falling back to L1 and L3)
  • ✅ Falls back to database lookups
  • ✅ Logs warnings (not errors)
  • ✅ Auto-recovery when Valkey returns

Dependencies

Added:

  • io.valkey:valkey-java:5.5.0
    • Valkey is a Redis fork with compatible protocol
    • Works with both Valkey and Redis servers
    • Apache 2.0 licensed
    • Modern, actively maintained

### Code Quality Improvements

New Files (8)

Core Implementation:

  • gateway-ha/src/main/java/io/trino/gateway/ha/config/ValkeyConfiguration.java (121 lines)
  • gateway-ha/src/main/java/io/trino/gateway/ha/cache/Cache.java (40 lines)
  • gateway-ha/src/main/java/io/trino/gateway/ha/cache/ValkeyDistributedCache.java (156 lines)
  • gateway-ha/src/main/java/io/trino/gateway/ha/cache/QueryCacheManager.java (184 lines) - NEW

Tests:

  • gateway-ha/src/test/java/io/trino/gateway/ha/config/TestValkeyConfiguration.java (71 lines)
  • gateway-ha/src/test/java/io/trino/gateway/ha/cache/NoopDistributedCache.java (47 lines)
  • gateway-ha/src/test/java/io/trino/gateway/ha/router/TestValkeyDistributedCache.java (44 lines)
  • gateway-ha/src/test/java/io/trino/gateway/ha/router/TestValkeyDistributedCacheIntegration.java (267 lines)

Modified Files (10)

Configuration:

  • gateway-ha/src/main/java/io/trino/gateway/ha/config/HaGatewayConfiguration.java - Added ValkeyConfiguration field

Core:

  • gateway-ha/src/main/java/io/trino/gateway/ha/module/HaGatewayProviderModule.java - Added Cache provider with cacheTtlSeconds
  • gateway-ha/src/main/java/io/trino/gateway/ha/router/BaseRoutingManager.java - Refactored to use QueryCacheManager
  • gateway-ha/src/main/java/io/trino/gateway/ha/router/QueryCountBasedRouter.java - Updated to use Cache interface
  • gateway-ha/src/main/java/io/trino/gateway/ha/router/StochasticRoutingManager.java - Updated to use Cache interface
  • gateway-ha/src/main/java/io/trino/gateway/proxyserver/ProxyRequestHandler.java - Cache all 3 values on query submission

Build:

  • gateway-ha/pom.xml - Added valkey-java dependency

Tests:

  • gateway-ha/src/test/java/io/trino/gateway/ha/util/TestcontainersUtils.java - Added createValkeyContainer()
  • gateway-ha/src/test/java/io/trino/gateway/ha/router/TestRoutingManagerExternalUrlCache.java - Updated to use NoopDistributedCache
  • 6 additional test files updated to use Cache interface and new package structure

Future Enhancements

  • Add cache metrics tracking and exposure via /metrics endpoint
  • Add TLS/SSL support for Valkey connections
  • Support Redis Cluster mode for high availability
  • Implement cache warming on startup
  • Add circuit breaker pattern for cache failures
  • Implement cache eviction strategies beyond TTL

@cla-bot cla-bot bot added the cla-signed label Jan 27, 2026
@hpopuri2 hpopuri2 requested a review from kbhatianr January 28, 2026 10:46
@hpopuri2 hpopuri2 requested a review from kbhatianr January 29, 2026 16:12
@mosabua mosabua changed the title from "Valkey" to "Add distributed cache for horizontal scaling" on Jan 31, 2026
oneonestar (Member) left a comment

Just a quick skim. Please rebase to main since we migrated to Caffeine cache =)

hpopuri2 (Contributor, Author) commented Feb 5, 2026

@oneonestar Addressed the comments and rebased as well. Please review again.

@hpopuri2 hpopuri2 requested a review from oneonestar February 8, 2026 20:16
hpopuri2 (Contributor, Author) commented Feb 8, 2026

@oneonestar Addressed the comment: moved all the logic into QueryCacheManager and introduced the new cache design.

@hpopuri2 hpopuri2 requested a review from oneonestar February 9, 2026 08:35
hpopuri2 (Contributor, Author) commented Feb 9, 2026

@oneonestar Addressed the comments; one conversation is still open, please let me know your answer there.

hpopuri2 (Contributor, Author) commented Feb 9, 2026

@oneonestar Addressed the comments. Please review the single-cache design.

hpopuri2 (Contributor, Author) commented:

@ebyhr Resolved all the comments. Please review the changes.

@hpopuri2 hpopuri2 requested a review from ebyhr February 12, 2026 07:53
- Fixed cacheTtlSeconds configuration not being used in ValkeyDistributedCache
- Refactored repetitive distributedCache.isEnabled() checks into helper methods
- Created QueryCacheManager to encapsulate cache management logic
- Moved all cache classes to dedicated io.trino.gateway.ha.cache package
- Renamed DistributedCache interface to Cache for better abstraction

These changes provide better separation of concerns and make the caching
infrastructure more maintainable and reusable across the gateway.
  Resolved code review comments from @kbhatianr:

  1. Applied proper dependency injection pattern in HaGatewayProviderModule
     - Made provider methods static with injected parameters
     - HaGatewayConfiguration is injected (already bound in BaseApp)

  2. Simplified ValkeyDistributedCache constructor
     - Accept ValkeyConfiguration object instead of 10 individual parameters

  3. Implemented proper DI for QueryCacheManager
     - Added @Provides method in HaGatewayProviderModule
     - Separated concerns: QueryCacheManager handles L2 (distributed cache),
       BaseRoutingManager owns L1 (LoadingCache)
     - QueryCacheManager is now injected into routing managers

  4. Abstracted cache tier orchestration
     - Added getBackend/getRoutingGroup/getExternalUrl methods to QueryCacheManager
     - These methods internally handle L2→L3 fallback and automatic backfilling
     - Eliminated manual cache tier checking from BaseRoutingManager
Use Duration, fix database logging, update documentation
Move all cache logic into QueryCacheManager
Consolidated three separate caches (backend, routingGroup, externalUrl)
into a single cache storing QueryMetadata objects. This reduces cache
operations by 3x, ensures atomic updates, and improves consistency across
the 3-tier cache architecture (L1: Caffeine, L2: Valkey, L3: Database).

Added @JsonIgnore annotations to prevent Jackson from serializing helper
methods (isEmpty, isComplete) as JSON properties, which was causing
deserialization failures in distributed cache operations.
@hpopuri2 hpopuri2 force-pushed the valkey branch 3 times, most recently from 381dbeb to fdcb77f on February 12, 2026 18:14
hpopuri2 (Contributor, Author) commented:

@ebyhr Resolved all the comments. Please review.

hpopuri2 (Contributor, Author) commented:

@oneonestar, @ebyhr please review.

oneonestar (Member) commented:

I removed the two unnecessary caches in #923.
Would you like to take a look?

