
Add distributed cache for horizontal scaling #877

Open

hpopuri2 wants to merge 16 commits into trinodb:main from hpopuri2:valkey

Conversation

hpopuri2 (Contributor) commented Jan 27, 2026


## Add Valkey Distributed Cache for Horizontal Scaling

## Summary

This PR implements distributed caching using Valkey to enable horizontal scaling of Trino Gateway. Multiple gateway instances can now share query metadata through a distributed cache layer, ensuring consistent query routing across all instances.

## Motivation

Currently, Trino Gateway uses local Guava caches that are not shared between instances. In multi-instance deployments, this can lead to:

  • Inconsistent query routing when requests hit different gateway instances
  • Cache misses requiring expensive database lookups
  • Inability to leverage cache across horizontally scaled deployments

This implementation addresses these limitations while maintaining backward compatibility and graceful degradation.

## Architecture

3-Tier Caching Strategy

Request Flow:

  1. L1 Cache (Local Guava) → ~1ms
    - Hit: Return immediately
    - Miss: Check L2
  2. L2 Cache (Valkey Distributed) → ~5ms
    - Hit: Populate L1, return
    - Miss: Check L3
  3. L3 Cache (PostgreSQL Database) → ~50ms
    - Found: Populate L2 + L1, return
    - Not Found: Search backends via HTTP (~200ms)
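
The flow above, sketched in Java. This is a conceptual sketch only; the field and helper names (localCache, distributedCache, queryHistoryManager, backendKey(), searchBackendsOverHttp()) are illustrative assumptions, not the literal names used in this PR.

// Sketch of the tiered lookup with backfilling; not the literal PR code.
String findBackendForQueryId(String queryId)
{
    String backend = localCache.getIfPresent(queryId);                  // L1: in-memory
    if (backend != null) {
        return backend;
    }
    backend = distributedCache.get(backendKey(queryId)).orElse(null);   // L2: Valkey
    if (backend != null) {
        localCache.put(queryId, backend);                               // backfill L1
        return backend;
    }
    backend = queryHistoryManager.getBackendForQueryId(queryId);        // L3: database
    if (backend != null) {
        distributedCache.set(backendKey(queryId), backend);             // backfill L2
        localCache.put(queryId, backend);                               // backfill L1
        return backend;
    }
    return searchBackendsOverHttp(queryId);                             // last resort (~200ms)
}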

Cache Keys

Three values are cached for each query:

  • trino:query:backend:{queryId} - Backend URL for query routing
  • trino:query:routing_group:{queryId} - Routing group assignment
  • trino:query:external_url:{queryId} - External URL for query access

All keys use configurable TTL (default 30 minutes / 1800 seconds).
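
For illustration only, writing one such key with the default TTL could look roughly like the snippet below. The query ID, backend URL, and connection details are made up, and this assumes the Jedis-style client that valkey-java exposes under the io.valkey package.

import io.valkey.Jedis;
import io.valkey.JedisPool;

public final class CacheKeyExample
{
    public static void main(String[] args)
    {
        // Illustrative only: store one routing entry under the documented key scheme.
        try (JedisPool pool = new JedisPool("localhost", 6379);
                Jedis jedis = pool.getResource()) {
            String key = "trino:query:backend:20260127_000000_00000_abcde";   // example query ID
            jedis.setex(key, 1800, "http://trino-adhoc:8080");                // default 1800-second (30-minute) TTL
            System.out.println("TTL remaining: " + jedis.ttl(key));
        }
    }
}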

## Implementation Details

Core Components

ValkeyConfiguration (gateway-ha/src/main/java/io/trino/gateway/ha/config/ValkeyConfiguration.java)

  • 9 configurable parameters with sensible defaults
  • Input validation (port range, positive values)
  • Convention over Configuration - only enabled, host, and port required
  • Fixed: cacheTtlSeconds parameter now properly used (was previously hardcoded)

Cache Interface (gateway-ha/src/main/java/io/trino/gateway/ha/cache/Cache.java)

  • Generic caching abstraction: get(), set(), invalidate(), isEnabled()
  • Implementation-agnostic design (name describes contract, not implementation)
  • Enables future alternative implementations (e.g., Redis Cluster, Memcached)
  • Located in dedicated io.trino.gateway.ha.cache package for better organization
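
A sketch of what that contract might look like; the signatures are inferred from the bullet points above rather than copied from the PR.

import java.util.Optional;

// Implementation-agnostic cache contract; an empty Optional signals a miss (or a disabled cache).
public interface Cache
{
    Optional<String> get(String key);

    void set(String key, String value);

    void invalidate(String key);

    boolean isEnabled();
}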

ValkeyDistributedCache (gateway-ha/src/main/java/io/trino/gateway/ha/cache/ValkeyDistributedCache.java)

  • Implements Cache interface
  • JedisPool connection pooling with configurable pool size
  • Graceful degradation when disabled or connection fails
  • Configurable TTL management via cacheTtlSeconds parameter
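
The graceful-degradation behaviour could be sketched like this. Field names, the logger, and the exception handling are assumptions; the real class may differ.

// Sketch: a read that degrades to a cache miss instead of failing the request.
@Override
public Optional<String> get(String key)
{
    if (!enabled) {
        return Optional.empty();
    }
    try (Jedis jedis = jedisPool.getResource()) {
        return Optional.ofNullable(jedis.get(key));
    }
    catch (RuntimeException e) {
        // Log a warning (not an error) and fall through so callers hit L3 instead.
        log.warn("Valkey lookup failed for key %s: %s", key, e.getMessage());
        return Optional.empty();
    }
}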

QueryCacheManager (gateway-ha/src/main/java/io/trino/gateway/ha/cache/QueryCacheManager.java) - NEW

  • Encapsulates all query-related cache operations
  • Manages 3 LoadingCache instances (backend, routing group, external URL)
  • Provides clean separation of concerns between routing and caching logic
  • Handles both L1 (in-memory) and L2 (distributed) cache operations
  • Methods:
    • L1 operations: setBackendInL1(), getBackendFromL1(), etc.
    • L2 operations: cacheBackend(), getCachedBackend(), etc.
    • Combined operations: setBackend(), updateAllCaches()
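
For example, a combined write that keeps both tiers in step might look roughly like this; the method and field names follow the list above but are not the literal code.

// Sketch: write-through to L1 and, when the distributed cache is enabled, to L2.
public void setBackend(String queryId, String backend)
{
    setBackendInL1(queryId, backend);                                     // in-memory LoadingCache
    if (distributedCache.isEnabled()) {
        distributedCache.set("trino:query:backend:" + queryId, backend);  // Valkey, TTL applied by the cache implementation
    }
}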

NoopDistributedCache (gateway-ha/src/test/java/io/trino/gateway/ha/cache/NoopDistributedCache.java)

  • No-op implementation for testing without real cache
  • Always returns empty, always disabled
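
Its whole body is essentially the following, a sketch consistent with that description and with the Cache interface shown earlier.

import java.util.Optional;

// Test double: never stores anything and always reports itself as disabled.
public class NoopDistributedCache
        implements Cache
{
    @Override
    public Optional<String> get(String key)
    {
        return Optional.empty();
    }

    @Override
    public void set(String key, String value) {}

    @Override
    public void invalidate(String key) {}

    @Override
    public boolean isEnabled()
    {
        return false;
    }
}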

Integration

BaseRoutingManager - Simplified routing logic:

  • Now uses single QueryCacheManager instance instead of managing multiple caches
  • Reduced from ~380 to ~310 lines through better separation of concerns
  • updateQueryIdCache() method caches all 3 values via QueryCacheManager
  • All cache operations delegated to QueryCacheManager
  • findBackendForUnknownQueryId() - L1 → L2 → L3 → HTTP search
  • findRoutingGroupForUnknownQueryId() - L1 → L2 → L3 lookup
  • findExternalUrlForUnknownQueryId() - L1 → L2 → L3 lookup
  • Automatic cache backfilling when found in lower tiers

ProxyRequestHandler - Query submission:

  • Updated recordBackendForQueryId() to call updateQueryIdCache() with all 3 values
  • Ensures all query metadata is cached on first submission

HaGatewayProviderModule - Dependency injection:

  • @Provides @Singleton Cache provider method
  • Wires ValkeyConfiguration to ValkeyDistributedCache
  • Passes cacheTtlSeconds from configuration to cache implementation
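
The wiring amounts to roughly the following Guice provider inside HaGatewayProviderModule; the getter name getValkeyConfiguration() is an assumption.

@Provides
@Singleton
public static Cache provideCache(HaGatewayConfiguration configuration)
{
    // ValkeyDistributedCache reads host, port, pool sizes, and cacheTtlSeconds from the configuration object.
    return new ValkeyDistributedCache(configuration.getValkeyConfiguration());
}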

Configuration

Minimal (Recommended for Getting Started)

valkeyConfiguration:
  enabled: true
  host: localhost
  port: 6379

With Authentication

valkeyConfiguration:
  enabled: true
  host: valkey.internal.prod
  port: 6379
  password: ${VALKEY_PASSWORD}
  database: 0

Advanced (Production Tuning)

valkeyConfiguration:
  enabled: true
  host: valkey.internal.prod
  port: 6379
  password: ${VALKEY_PASSWORD}
  database: 0
  maxTotal: 100              # Max connections in pool
  maxIdle: 50                # Max idle connections
  minIdle: 25                # Min idle connections
  timeoutMs: 5000            # Connection timeout
  cacheTtlSeconds: 3600      # 1 hour TTL for long-running queries

Single Instance (No Changes Required)

valkeyConfiguration:
   enabled: false  # Default - local cache sufficient

## Testing

Unit Tests

TestValkeyConfiguration

  • Default values verification
  • Setter/getter correctness

TestValkeyDistributedCache (2 tests)

  • testDisabledCache() - Verifies disabled cache returns empty
  • testNoopDistributedCache() - Tests noop implementation

Integration Tests

TestValkeyDistributedCacheIntegration (9 comprehensive tests using TestContainers)

  • testValkeyConnectionAndBasicOperations() - Basic get/set/invalidate
  • testUpdateQueryIdCachesAllThreeValues() - Verifies all 3 values cached via updateQueryIdCache()
  • testRoutingGroupL2Caching() - L1 miss → L2 hit for routing_group
  • testExternalUrlL2Caching() - L1 miss → L2 hit for external_url
  • testThreeTierCacheLookupForBackend() - L1 miss → L2 hit scenario
  • testCacheBackfillFromDatabase() - L1 miss → L2 miss → L3 hit → backfills L2
  • testMultipleQueryIdsWithDifferentValues() - Multiple concurrent queries
  • testCacheOverwrite() - Cache update behavior
  • testEmptyStringValues() - Edge case handling

TestRoutingManagerExternalUrlCache (6 tests)

  • Tests external URL caching with mocked QueryHistoryManager
  • Verifies L1/L2 cache coordination
  • Tests cache miss fallback to query history

TestContainers Setup

  • Added createValkeyContainer() to TestcontainersUtils
  • Spins up real PostgreSQL and Valkey containers
  • Tests complete 3-tier caching flow end-to-end
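
The container factory is presumably along these lines, a sketch using the generic Testcontainers API; the image tag matches the docker run example given in the migration section below.

import org.testcontainers.containers.GenericContainer;
import org.testcontainers.utility.DockerImageName;

// Sketch: start a disposable Valkey instance for integration tests.
public static GenericContainer<?> createValkeyContainer()
{
    return new GenericContainer<>(DockerImageName.parse("valkey/valkey:latest"))
            .withExposedPorts(6379);
}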

Test Results

  • 194 tests total (routing package), all passing
  • Integration tests verify real Valkey connectivity
  • No regression in existing functionality
  • 0 Checkstyle violations

## Backward Compatibility

✅ Fully backward compatible

  • Disabled by default (enabled: false)
  • No changes required to existing configs
  • Single-instance deployments work exactly as before
  • Existing tests pass without modification

Migration Path

From Single to Multi-Gateway:

  1. Deploy Valkey server
    docker run -d -p 6379:6379 valkey/valkey:latest
  2. Update config.yaml on all gateways
    valkeyConfiguration:
      enabled: true
      host: valkey.internal
      port: 6379
      password: ${VALKEY_PASSWORD}
  3. Rolling restart gateways
  4. Verify the cache is working by checking the Valkey keys:
    docker exec valkey valkey-cli KEYS "trino:query:*"

No data migration needed - cache populates automatically.

## Graceful Degradation

When Valkey is unavailable:

  • ✅ Queries continue working (falling back to L1 and L3)
  • ✅ Falls back to database lookups
  • ✅ Logs warnings (not errors)
  • ✅ Auto-recovery when Valkey returns

Dependencies

Added:

  • io.valkey:valkey-java:5.5.0
    • Valkey is a Redis fork with compatible protocol
    • Works with both Valkey and Redis servers
    • Apache 2.0 licensed
    • Modern, actively maintained

### Code Quality Improvements

New Files (8)

Core Implementation:

  • gateway-ha/src/main/java/io/trino/gateway/ha/config/ValkeyConfiguration.java (121 lines)
  • gateway-ha/src/main/java/io/trino/gateway/ha/cache/Cache.java (40 lines)
  • gateway-ha/src/main/java/io/trino/gateway/ha/cache/ValkeyDistributedCache.java (156 lines)
  • gateway-ha/src/main/java/io/trino/gateway/ha/cache/QueryCacheManager.java (184 lines) - NEW

Tests:

  • gateway-ha/src/test/java/io/trino/gateway/ha/config/TestValkeyConfiguration.java (71 lines)
  • gateway-ha/src/test/java/io/trino/gateway/ha/cache/NoopDistributedCache.java (47 lines)
  • gateway-ha/src/test/java/io/trino/gateway/ha/router/TestValkeyDistributedCache.java (44 lines)
  • gateway-ha/src/test/java/io/trino/gateway/ha/router/TestValkeyDistributedCacheIntegration.java (267 lines)

Modified Files (10)

Configuration:

  • gateway-ha/src/main/java/io/trino/gateway/ha/config/HaGatewayConfiguration.java - Added ValkeyConfiguration field

Core:

  • gateway-ha/src/main/java/io/trino/gateway/ha/module/HaGatewayProviderModule.java - Added Cache provider with cacheTtlSeconds
  • gateway-ha/src/main/java/io/trino/gateway/ha/router/BaseRoutingManager.java - Refactored to use QueryCacheManager
  • gateway-ha/src/main/java/io/trino/gateway/ha/router/QueryCountBasedRouter.java - Updated to use Cache interface
  • gateway-ha/src/main/java/io/trino/gateway/ha/router/StochasticRoutingManager.java - Updated to use Cache interface
  • gateway-ha/src/main/java/io/trino/gateway/proxyserver/ProxyRequestHandler.java - Cache all 3 values on query submission

Build:

  • gateway-ha/pom.xml - Added valkey-java dependency

Tests:

  • gateway-ha/src/test/java/io/trino/gateway/ha/util/TestcontainersUtils.java - Added createValkeyContainer()
  • gateway-ha/src/test/java/io/trino/gateway/ha/router/TestRoutingManagerExternalUrlCache.java - Updated to use NoopDistributedCache
  • 6 additional test files updated to use Cache interface and new package structure

Future Enhancements

  • Add cache metrics tracking and exposure via /metrics endpoint
  • Add TLS/SSL support for Valkey connections
  • Support Redis Cluster mode for high availability
  • Implement cache warming on startup
  • Add circuit breaker pattern for cache failures
  • Implement cache eviction strategies beyond TTL

@cla-bot cla-bot bot added the cla-signed label Jan 27, 2026
@hpopuri2 hpopuri2 requested a review from kbhatianr January 28, 2026 10:46
@hpopuri2 hpopuri2 requested a review from kbhatianr January 29, 2026 16:12
@mosabua mosabua changed the title from "Valkey" to "Add distributed cache for horizontal scaling" on Jan 31, 2026
oneonestar (Member) left a comment

Just a quick skim. Please rebase to main since we migrated to Caffeine cache =)

hpopuri2 (Contributor, Author) commented Feb 5, 2026

@oneonestar Addressed the comments and rebased as well. Please review again.

@hpopuri2 hpopuri2 requested a review from oneonestar February 8, 2026 20:16
hpopuri2 (Contributor, Author) commented Feb 8, 2026

@oneonestar Addressed the comment: moved all the logic into QueryCacheManager and introduced the new cache design.

@hpopuri2 hpopuri2 requested a review from oneonestar February 9, 2026 08:35
hpopuri2 (Contributor, Author) commented Feb 9, 2026

@oneonestar Addressed the comments; one conversation is still open, please let me know your answer there.

hpopuri2 (Contributor, Author) commented Feb 9, 2026

@oneonestar Addressed the comments. Please review the single-cache design.

hpopuri2 (Contributor, Author) commented:

@ebyhr Resolved all the comments. Please review the changes.

@hpopuri2 hpopuri2 requested a review from ebyhr February 12, 2026 07:53
- Fixed cacheTtlSeconds configuration not being used in ValkeyDistributedCache
- Refactored repetitive distributedCache.isEnabled() checks into helper methods
- Created QueryCacheManager to encapsulate cache management logic
- Moved all cache classes to dedicated io.trino.gateway.ha.cache package
- Renamed DistributedCache interface to Cache for better abstraction

These changes provide better separation of concerns and make the caching
infrastructure more maintainable and reusable across the gateway.
  Resolved code review comments from @kbhatianr:

  1. Applied proper dependency injection pattern in HaGatewayProviderModule
     - Made provider methods static with injected parameters
     - HaGatewayConfiguration is injected (already bound in BaseApp)

  2. Simplified ValkeyDistributedCache constructor
     - Accept ValkeyConfiguration object instead of 10 individual parameters

  3. Implemented proper DI for QueryCacheManager
     - Added @Provides method in HaGatewayProviderModule
     - Separated concerns: QueryCacheManager handles L2 (distributed cache),
       BaseRoutingManager owns L1 (LoadingCache)
     - QueryCacheManager is now injected into routing managers

  4. Abstracted cache tier orchestration
     - Added getBackend/getRoutingGroup/getExternalUrl methods to QueryCacheManager
     - These methods internally handle L2→L3 fallback and automatic backfilling
     - Eliminated manual cache tier checking from BaseRoutingManager
Use Duration, fix database logging, update documentation
Move all cache logic into QueryCacheManager
Consolidated three separate caches (backend, routingGroup, externalUrl)
into a single cache storing QueryMetadata objects. This reduces cache
operations by 3x, ensures atomic updates, and improves consistency across
the 3-tier cache architecture (L1: Caffeine, L2: Valkey, L3: Database).

Added @JsonIgnore annotations to prevent Jackson from serializing helper
methods (isEmpty, isComplete) as JSON properties, which was causing
deserialization failures in distributed cache operations.
@hpopuri2 hpopuri2 force-pushed the valkey branch 3 times, most recently from 381dbeb to fdcb77f on February 12, 2026 18:14
hpopuri2 (Contributor, Author) commented:

@ebyhr Resolved all the comments. Please review.

hpopuri2 (Contributor, Author) commented:

@oneonestar, @ebyhr please review.

oneonestar (Member) commented:

I removed the two unnecessary caches in #923.
Would you like to take a look?

