An AWS-native storage backend for SGLang's HiCache system that enables multi-regional KV cache persistence using DynamoDB for metadata and S3 for data storage via NIXL.
This implementation decouples "cache locality" from "GPU locality," allowing KV cache state to survive instance scale-down events and be shared across regions while maintaining HiCache's performance benefits (up to 6× throughput improvement, 80% TTFT reduction).
- Multi-Region Support: Cache data can be accessed across AWS regions with automatic read-through
- Two-Phase Commit: Ensures readers never access partially-written or corrupted cache blocks
- GPU-Optimized Transfers: Uses NIXL for efficient GPU ↔ S3 data movement
- Graceful Degradation: Continues serving inference requests even during AWS service outages
- Cost Management: Configurable rate limits and hotness-based replication to control costs
The backend separates concerns into two planes:
- Metadata Plane (DynamoDB): Fast key lookups, state management, access tracking
- Data Plane (S3 + NIXL): Bulk KV cache block storage and GPU-optimized transfers
graph TB
subgraph App["SGLang Application Layer"]
CC[Cache Controller]
end
subgraph Backend["HiCache DynamoDB+S3+NIXL Backend"]
Main[HiCacheDynamoS3NixlBackend]
MS[MetadataStore]
S3S[S3NixlStore]
SF[SingleFlightCoordinator]
WSM[WriteStateManager]
CRR[CrossRegionReplicator]
MC[MetricsCollector]
end
subgraph Region1["AWS Region: us-east-1"]
DDB1[(DynamoDB Global Table)]
S3_1[(S3 Bucket)]
end
subgraph Region2["AWS Region: us-west-2"]
DDB2[(DynamoDB Replica)]
S3_2[(S3 Bucket)]
end
subgraph Region3["AWS Region: eu-west-1"]
DDB3[(DynamoDB Replica)]
S3_3[(S3 Bucket)]
end
%% Main connections
CC --> Main
Main --> MS
Main --> S3S
Main --> SF
Main --> WSM
Main --> CRR
Main --> MC
%% Storage connections
MS --> DDB1
MS -.-> DDB2
MS -.-> DDB3
S3S --> S3_1
S3S -.-> S3_2
S3S -.-> S3_3
%% Replication
DDB1 <-.-> DDB2
DDB2 <-.-> DDB3
DDB1 <-.-> DDB3
S3_1 -.-> S3_2
S3_2 -.-> S3_3
%% Styling
classDef aws fill:#ff9900,stroke:#232f3e,stroke-width:2px,color:#fff
classDef backend fill:#4a90e2,stroke:#2c5aa0,stroke-width:2px,color:#fff
classDef app fill:#50c878,stroke:#2d5016,stroke-width:2px,color:#fff
class DDB1,DDB2,DDB3,S3_1,S3_2,S3_3 aws
class Main,MS,S3S,SF,WSM,CRR,MC backend
class CC app
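To make the metadata plane concrete, here is the shape a single DynamoDB item might take. The table's key schema (`cache_key` hash key, `region` range key) matches the table-creation command later in this document; every other attribute name below is an illustrative assumption, not the backend's actual schema.

```python
import time
import uuid

def make_metadata_item(cache_key: str, region: str, s3_path: str, size: int) -> dict:
    """Build an illustrative DynamoDB item for one committed cache block.

    Partition key: cache_key, sort key: region (matching the table
    schema in the setup commands). All other attribute names are
    assumptions for illustration only.
    """
    now = int(time.time())
    return {
        "cache_key": cache_key,          # hash of the token prefix
        "region": region,                # region holding this S3 copy
        "state": "COMMITTED",            # WRITING | COMMITTED
        "write_id": str(uuid.uuid4()),   # fences the two-phase commit
        "s3_path": s3_path,              # object key in the regional bucket
        "size_bytes": size,
        "hit_count": 0,                  # drives hotness-based replication
        "created_at": now,
        "expires_at": now + 86400,       # DynamoDB TTL attribute
    }

item = make_metadata_item("abc123", "us-east-1", "blocks/abc123", 4096)
```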
sequenceDiagram
participant App as SGLang App
participant Backend as HiCache Backend
participant WSM as WriteStateManager
participant DDB as DynamoDB
participant S3 as S3 + NIXL
App->>Backend: set(key, kv_data)
Backend->>WSM: begin_write(key, region)
WSM->>DDB: PutItem(state=WRITING, write_id=uuid)
alt Write Claim Successful
DDB-->>WSM: Success
WSM-->>Backend: write_id
Backend->>S3: NIXL write(kv_data)
S3-->>Backend: {etag, version_id, size}
Backend->>WSM: complete_write(write_id, s3_result)
WSM->>DDB: UpdateItem(state=COMMITTED, condition: write_id match)
DDB-->>WSM: Success
WSM-->>Backend: Success
Backend-->>App: True
else Write Claim Failed (Key Exists)
DDB-->>WSM: ConditionalCheckFailedException
WSM-->>Backend: None
Backend-->>App: False
end
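The claim-then-commit sequence above can be sketched in a few lines. The dict below stands in for DynamoDB so the protocol logic is runnable without AWS; a real implementation would issue `PutItem`/`UpdateItem` calls with `ConditionExpression`s, and the class and method names here mirror the diagram but are otherwise illustrative assumptions.

```python
import uuid

class WriteStateSketch:
    """In-memory sketch of the WriteStateManager's two-phase commit.

    An in-process dict emulates DynamoDB's conditional-write semantics:
    begin_write() acts like PutItem with attribute_not_exists(cache_key),
    and complete_write() acts like UpdateItem conditioned on write_id.
    """

    def __init__(self):
        self.table = {}  # (cache_key, region) -> item

    def begin_write(self, key: str, region: str):
        """Claim the key; returns a write_id, or None if the claim fails."""
        if (key, region) in self.table:
            return None  # ConditionalCheckFailedException -> claim failed
        write_id = str(uuid.uuid4())
        self.table[(key, region)] = {"state": "WRITING", "write_id": write_id}
        return write_id

    def complete_write(self, key: str, region: str, write_id: str, s3_result: dict) -> bool:
        """Commit the entry, conditioned on the write_id still matching."""
        item = self.table.get((key, region))
        if item is None or item["write_id"] != write_id:
            return False  # lost the claim (e.g. stale-write cleanup raced us)
        item.update(state="COMMITTED", **s3_result)
        return True
```

Because readers only ever trust COMMITTED entries, a crash between the claim and the commit leaves at worst a stale WRITING row for background cleanup, never a half-visible block.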
sequenceDiagram
participant App as SGLang App
participant Backend as HiCache Backend
participant SF as SingleFlight
participant DDB as DynamoDB
participant S3Local as S3 Local
participant S3Remote as S3 Remote
participant CRR as CrossRegionReplicator
App->>Backend: get(key)
Backend->>DDB: GetItem(key, local_region)
alt Local Cache Hit
DDB-->>Backend: {state: COMMITTED, s3_path}
Backend->>S3Local: NIXL read(s3_path)
S3Local-->>Backend: kv_data
Backend->>DDB: update_access_async(hit_count++)
Backend-->>App: kv_data
else Local Cache Miss
DDB-->>Backend: NOT_FOUND
Backend->>DDB: Query(key, all_regions)
DDB-->>Backend: [{region: us-west-2, state: COMMITTED}]
Backend->>SF: do(key, fetch_from_remote)
SF->>CRR: fetch_from_remote(key, remote_metadata)
CRR->>S3Remote: NIXL read(remote_s3_path)
S3Remote-->>CRR: kv_data
alt Should Replicate Locally
CRR->>S3Local: NIXL write(kv_data)
CRR->>DDB: PutItem(local_region_metadata)
end
CRR-->>SF: kv_data
SF-->>Backend: kv_data
Backend-->>App: kv_data
end
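The SingleFlight step of the miss path, which coalesces concurrent `get()` calls for the same key into one remote fetch, can be sketched with asyncio futures. This is an assumed shape, not the backend's actual single_flight.py.

```python
import asyncio

class SingleFlight:
    """Coalesce concurrent calls so each key is fetched at most once."""

    def __init__(self):
        self._inflight = {}  # key -> Future shared by all waiters

    async def do(self, key, fn):
        if key in self._inflight:
            return await self._inflight[key]  # join the in-flight fetch
        fut = asyncio.get_running_loop().create_future()
        self._inflight[key] = fut
        try:
            result = await fn(key)
            fut.set_result(result)   # wake every coalesced waiter
            return result
        except Exception as exc:
            fut.set_exception(exc)   # propagate the failure to all waiters
            raise
        finally:
            del self._inflight[key]  # next caller starts a fresh fetch
```

Under this scheme a burst of identical cache misses costs one cross-region read instead of N, which is what keeps the `bytes_remote_read` metric bounded during traffic spikes.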
- Python 3.8+
- AWS credentials configured (IAM role recommended)
- NIXL library installed
- SGLang framework
pip install boto3 aioboto3 nixl
DynamoDB Global Table:

aws dynamodb create-table \
  --table-name sglang-hicache-metadata \
  --attribute-definitions \
    AttributeName=cache_key,AttributeType=S \
    AttributeName=region,AttributeType=S \
  --key-schema \
    AttributeName=cache_key,KeyType=HASH \
    AttributeName=region,KeyType=RANGE \
  --billing-mode PAY_PER_REQUEST

S3 Buckets (per region):

aws s3 mb s3://sglang-hicache-us-east-1 --region us-east-1
aws s3 mb s3://sglang-hicache-us-west-2 --region us-west-2

Enable S3 Versioning:

aws s3api put-bucket-versioning \
  --bucket sglang-hicache-us-east-1 \
  --versioning-configuration Status=Enabled
export HICACHE_DYNAMODB_TABLE=sglang-hicache-metadata
export HICACHE_S3_BUCKET_US_EAST_1=sglang-hicache-us-east-1
export HICACHE_S3_BUCKET_US_WEST_2=sglang-hicache-us-west-2
export AWS_DEFAULT_REGION=us-east-1

Enable the backend via command-line flag:
python -m sglang.launch_server \
--model-path meta-llama/Llama-2-7b-chat-hf \
--hicache-storage-backend dynamodb_s3_nixl \
--other-sglang-options

@dataclass
class BackendConfig:
# Required
dynamodb_table_name: str = "sglang-hicache-metadata"
local_region: str = "us-east-1"
local_s3_bucket: str = "sglang-hicache-us-east-1"
# Multi-region setup
regional_buckets: Dict[str, str] = field(default_factory=dict)
# Performance tuning
max_concurrent_remote_fetches: int = 10
max_bytes_per_minute_replication: int = 100 * 1024 * 1024 # 100MB/min
hotness_threshold_for_replication: int = 3
# Lifecycle management
cache_ttl_seconds: int = 86400 # 24 hours
stale_write_ttl_seconds: int = 60

The backend implements SGLang's standard HiCache interface:
from sglang.srt.mem_cache.storage.dynamodb_s3_nixl import HiCacheDynamoS3NixlBackend
# Initialize backend
backend = HiCacheDynamoS3NixlBackend(config)
# Check if cache key exists
exists = await backend.exist("cache_key_hash")
# Retrieve cached data
data = await backend.get("cache_key_hash")
# Store new cache data
success = await backend.set("cache_key_hash", kv_cache_data)

Required IAM policy for DynamoDB access:

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"dynamodb:GetItem",
"dynamodb:PutItem",
"dynamodb:UpdateItem",
"dynamodb:DeleteItem",
"dynamodb:Query"
],
"Resource": "arn:aws:dynamodb:*:*:table/sglang-hicache-metadata*"
}
]
}

Required IAM policy for S3 access:

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject"
],
"Resource": "arn:aws:s3:::sglang-hicache-*/*"
}
]
}

The backend exposes the following metrics for monitoring:
- hit_local_s3: Cache hits from local-region S3
- hit_remote_s3: Cache hits from remote-region S3
- miss_total: Total cache misses (with reason breakdown)
- bytes_remote_read: Cross-region data transfer volume
- bytes_written: Data written to cache
- dynamodb_errors: DynamoDB operation failures
- s3_errors: S3 operation failures
- exist_latency_p50_p95: Latency percentiles for exist() calls
- get_latency_p50_p95: Latency percentiles for get() calls
- set_latency_p50_p95: Latency percentiles for set() calls

Performance targets:

- exist() p95 latency: ≤10ms (single-region)
- get() metadata lookup p95 latency: ≤10ms
- set() non-blocking time: ≤50ms (before async transfer)
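As a sketch of how the per-call latency percentiles could be derived from raw samples, here is a nearest-rank percentile collector; the class name and method signatures are assumptions, not the backend's actual MetricsCollector.

```python
import math

class LatencyHistogram:
    """Collect latency samples and report percentiles (nearest-rank)."""

    def __init__(self):
        self.samples = []

    def record(self, latency_ms: float):
        self.samples.append(latency_ms)

    def percentile(self, p: float) -> float:
        if not self.samples:
            return 0.0
        ordered = sorted(self.samples)
        # nearest-rank: the smallest sample covering at least p% of the data
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

h = LatencyHistogram()
for ms in [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]:
    h.record(ms)
```

A production collector would typically use a bounded sliding window or a streaming quantile sketch rather than storing every sample, but the reporting contract is the same.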
- VPC Endpoints: Use VPC Gateway Endpoints for DynamoDB and S3 to reduce data transfer costs
- S3 Lifecycle Rules: Configure automatic transition to cheaper storage tiers
- Replication Limits: Set max_bytes_per_minute_replication appropriately based on budget
- TTL Configuration: Use DynamoDB TTL and S3 lifecycle rules for automatic cleanup
- Cross-region data transfer incurs charges
- Configure hotness_threshold_for_replication to replicate only frequently accessed cache entries
- Monitor the bytes_remote_read metric to track cross-region costs
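The two replication knobs above can be combined into a simple gate deciding whether a remotely fetched block is worth copying into the local region. The class below is an illustrative sketch reusing the config values' semantics, not the backend's actual CrossRegionReplicator logic.

```python
class ReplicationGate:
    """Decide whether a remotely fetched block should be copied locally.

    Combines the hotness threshold with a per-minute byte budget,
    mirroring the hotness_threshold_for_replication and
    max_bytes_per_minute_replication settings from BackendConfig.
    """

    def __init__(self, hotness_threshold=3, max_bytes_per_minute=100 * 1024 * 1024):
        self.hotness_threshold = hotness_threshold
        self.max_bytes_per_minute = max_bytes_per_minute
        self.bytes_this_minute = 0  # assumed to be reset once per minute

    def should_replicate(self, hit_count: int, size_bytes: int) -> bool:
        if hit_count < self.hotness_threshold:
            return False  # not hot enough to justify the transfer cost
        if self.bytes_this_minute + size_bytes > self.max_bytes_per_minute:
            return False  # budget exhausted; retry in the next window
        self.bytes_this_minute += size_bytes
        return True
```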
- High Miss Rate: Check DynamoDB and S3 service health, verify IAM permissions
- Cross-Region Timeouts: Increase max_concurrent_remote_fetches or check network connectivity
- Cost Spikes: Review replication settings and cross-region transfer volumes
- Stale WRITING Entries: Background cleanup runs automatically; check stale_write_ttl_seconds
Enable detailed logging:
import logging
logging.getLogger('sglang.hicache.dynamodb_s3_nixl').setLevel(logging.DEBUG)

# Unit tests
python -m pytest sglang/srt/mem_cache/storage/dynamodb_s3_nixl/
# Integration tests (requires LocalStack)
docker run -d -p 4566:4566 localstack/localstack
python -m pytest sglang/srt/mem_cache/storage/dynamodb_s3_nixl/ --integration

sglang/srt/mem_cache/storage/dynamodb_s3_nixl/
├── __init__.py
├── backend.py # Main HiCache backend implementation
├── config.py # Configuration dataclasses
├── dynamodb_store.py # DynamoDB metadata operations
├── s3_store.py # S3+NIXL data operations
├── single_flight.py # Concurrent request coalescing
├── write_state_manager.py # Two-phase commit lifecycle
├── metadata.py # Data models and schemas
└── test_*.py # Test files
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
This project is part of the SGLang framework. See the main SGLang repository for license information.
For issues and questions:
- SGLang GitHub Issues: sglang repository
- Documentation: SGLang Documentation