1. Summary
This RFC proposes two independent, composable features to enhance Mooncake Store for model weight storage workloads:
- Hard Pin — A mechanism that guarantees an object will never be evicted until it is explicitly unpinned or removed. Hard pin is set at creation time via `ReplicateConfig`.
- Upsert — An update interface that replaces an existing object's data in-place (when size is unchanged) or via delete-and-reallocate (when size differs). Concurrent in-progress writes on the same key are preempted.
Both features are designed as orthogonal primitives. They can be used independently or together, preserving flexibility for reinforcement learning, model management, and hybrid KVCache + model weight workloads.
The rest of this document first explains the design of the two features, then summarizes implementation impact, client APIs, configuration, compatibility, and testing.
2. Design
2.1 Design Principles
- Minimal data structure changes — Reuse existing mechanisms (`replicas_`, `ReplicaStatus`, `processing_keys`, `DiscardedReplicas`) wherever possible.
- Predictable reader behavior — During upsert, the key temporarily has no readable replicas (all replicas are PROCESSING). Readers get an empty result, consistent with a key mid-`PutStart`. This is acceptable for model weight workloads where updates are coordinated.
- Memory efficiency — Avoid 2x memory overhead by reusing existing buffers (in-place) or freeing before reallocating, instead of copy-on-write (COW).
- Failure safety — Crashed or failed operations must be automatically cleaned up by existing timeout mechanisms.
2.2 Hard Pin
2.2.1 Data Model
Add a hard_pinned boolean to ObjectMetadata:
```cpp
// master_service.h — ObjectMetadata (alongside existing lease_timeout and soft_pin_timeout)
mutable bool hard_pinned GUARDED_BY(lock){false};
```

Why a boolean instead of a timeout?
- Model weights need indefinite protection. Using a timeout to approximate "forever" is the exact anti-pattern we're replacing.
- A boolean is cheaper to check (no clock comparison) and semantically unambiguous.
2.2.2 Setting Hard Pin
Hard pin is set at creation time via ReplicateConfig:
```cpp
// replica.h
struct ReplicateConfig {
    size_t replica_num{1};
    bool with_soft_pin{false};
    bool with_hard_pin{false};  // NEW
    // ...
};
```

When `with_hard_pin=true`, the object is hard-pinned from creation. This is the common path for model weight storage. Hard pin state is preserved across Upsert operations.
To unpin a hard-pinned object, the user must Remove it and re-create it without with_hard_pin.
2.2.3 Eviction Behavior
The eviction loop in BatchEvict() currently has two passes:
| Pass | What it evicts |
|---|---|
| First | Non-soft-pinned objects with expired leases |
| Second | Soft-pinned objects (if allow_evict_soft_pinned_objects_ is true) |
With hard pin, the change is minimal — add a single check at the top of the iteration:
```cpp
// master_service.cpp — BatchEvict(), in the per-object loop (3 locations)
if (it->second.IsHardPinned()) {
    ++it;
    continue;  // Skip — hard-pinned objects are NEVER evicted
}
// ... existing soft pin / lease checks unchanged
```

The eviction priority becomes:
Eviction Priority (first evicted → last evicted):
1. Non-pinned, lease-expired objects ← evicted first
2. Soft-pinned objects (if allowed) ← evicted under pressure
3. Hard-pinned objects ← NEVER evicted
2.2.4 HA Mode Serialization
The MetadataSerializer must persist the hard_pinned field so that hard pin state survives master restarts. This is a single boolean field appended to the existing msgpack serialization.
2.3 Upsert (In-Place Update / Delete-and-Reallocate)
2.3.1 Core Insight
Instead of a COW approach that requires 2x memory (old + new data coexist), Upsert takes a simpler, more memory-efficient approach:
- When size is unchanged: reuse the existing memory buffers — mark replicas as PROCESSING, let the client overwrite the data in-place, then mark them COMPLETE again.
- When size differs: free the old replicas first, then allocate new ones at the required size.
- When the key has an in-progress write (Put or Upsert): preempt the in-progress operation — move its PROCESSING replicas to `DiscardedReplicas` for delayed release, then proceed with the new Upsert.
This design reuses existing mechanisms with minimal changes:
| Existing Mechanism | Role in Upsert |
|---|---|
| `ReplicaStatus` state machine | COMPLETE → PROCESSING → COMPLETE lifecycle for in-place updates |
| `processing_keys` set | Tracks keys with active Upsert/Put writes |
| `DiscardedReplicas` + TTL | Safe delayed release of preempted PROCESSING replicas |
| `DiscardExpiredProcessingReplicas` | Auto-cleanup of a failed upsert's replicas on timeout |
Required metadata changes:
- `ObjectMetadata::put_start_time` must become non-`const` — refreshed to now on in-place UpsertStart, so `DiscardExpiredProcessingReplicas` times out the current write cycle, not the original creation time.
- `ObjectMetadata::client_id` must become non-`const` — updated to the new client on in-place UpsertStart.
- `ObjectMetadata::size` remains `const` — when size changes, the metadata entry is deleted and recreated.
- `Replica` needs new methods:
  - `mark_processing()` — transitions COMPLETE → PROCESSING for in-place updates.
  - `is_processing()` / `fn_is_processing()` — predicate for filtering PROCESSING replicas during preemption.
  - `is_busy()` / `fn_is_busy()` — checks `refcnt > 0`, used as a safety gate before allowing Case B/C.
New error code:
`OBJECT_REPLICA_BUSY` (-714) — returned when UpsertStart finds replicas with non-zero refcnt (active RDMA reads in flight). The client should retry after readers finish.
No changes are needed to eviction logic or any other existing mechanism.
2.3.2 API Design
Following the existing PutStart/PutEnd/PutRevoke pattern:
```cpp
auto UpsertStart(const UUID& client_id, const std::string& key,
                 const uint64_t slice_length, const ReplicateConfig& config)
    -> tl::expected<std::vector<Replica::Descriptor>, ErrorCode>;

auto UpsertEnd(const UUID& client_id, const std::string& key,
               ReplicaType replica_type)
    -> tl::expected<void, ErrorCode>;

auto UpsertRevoke(const UUID& client_id, const std::string& key,
                  ReplicaType replica_type)
    -> tl::expected<void, ErrorCode>;
```

`UpsertEnd` and `UpsertRevoke` are functionally equivalent to `PutEnd` and `PutRevoke` — no special logic is needed.
2.3.3 Detailed Flow
```text
UpsertStart(client_id, key, slice_length, config):

  Step 0: Cleanup stale handles
    IF key exists AND all handles are stale (CleanupStaleHandles returns true):
      → Erase processing_keys entry and metadata
      → Treat as non-existent (fall through to Case A)

  Step 1: Safety checks and preemption (only if key exists)
    1a. IF key has active replication_tasks (Copy/Move in progress):
      → Return OBJECT_HAS_REPLICATION_TASK
    1b. IF key has active offloading_tasks:
      → Return OBJECT_HAS_REPLICATION_TASK
    1c. IF key is in processing_keys (another Put/Upsert in progress):
      → Pop all PROCESSING replicas → DiscardedReplicas (TTL for delayed release)
      → Remove key from processing_keys
      → If no COMPLETE replicas remain, erase metadata entirely
        and fall through to Case A

  Step 2: Dispatch by key existence and size
    Case A — Key does not exist (or erased by Step 0/1c):
      → Equivalent to PutStart (allocate replicas, create ObjectMetadata)
      → Return new replica descriptors

    (Safety gate before Case B/C)
    IF any replica has non-zero refcnt (is_busy):
      → Return OBJECT_REPLICA_BUSY

    Case B — Key exists, size unchanged (in-place update):
      → Mark all COMPLETE replicas → PROCESSING (via mark_processing())
      → Refresh put_start_time = now
      → Update client_id = new client_id
      → Insert key into processing_keys
      → Return existing replica descriptors (same memory addresses)

    Case C — Key exists, size changed (delete-and-reallocate):
      → Pop all replicas → DiscardedReplicas (TTL for delayed release,
        same as MoveEnd — needed because GetReplicaList doesn't use refcnt,
        so readers may still hold descriptors without incrementing refcnt)
      → Erase metadata entry
      → Allocate new replicas + create new ObjectMetadata (same as PutStart)
      → Return new replica descriptors
```
Client transfers data (identical to Put — TransferWrite via RDMA/TCP)
UpsertEnd(client_id, key, replica_type):
- Equivalent to `PutEnd` — mark PROCESSING → COMPLETE, grant lease, remove from processing_keys.

UpsertRevoke(client_id, key, replica_type):
- Equivalent to `PutRevoke` — erase PROCESSING replicas, clean up metadata if empty.
2.3.4 Swimlane Diagrams
In-Place Update Path (key exists, size unchanged)
```mermaid
sequenceDiagram
    participant C as Client
    participant M as Master Service
    participant TE as Transfer Engine
    participant R as Server
    C->>M: UpsertStart(client_id, key, same_size, config)
    Note over M: Acquire shard write lock
    M->>M: Find existing ObjectMetadata
    M->>M: Safety checks (replication/offloading tasks, refcnt)
    M->>M: size unchanged → in-place path
    M->>M: mark_processing() on all COMPLETE replicas
    M->>M: Refresh put_start_time, update client_id
    M->>M: Insert key into processing_keys
    Note over M: Release shard write lock
    M-->>C: Return existing replica descriptors (same addresses)
    C->>TE: TransferWrite(descriptors, new_data)
    TE->>R: RDMA Write to same buffer
    R-->>TE: Write complete
    TE-->>C: Transfer success
    C->>M: UpsertEnd(client_id, key, MEMORY)
    Note over M: Same as PutEnd
    M->>M: mark_complete() on PROCESSING replicas
    M->>M: Remove from processing_keys
    M->>M: GrantLease (preserve hard_pin/soft_pin)
    M-->>C: Success
```
Delete-and-Reallocate Path (key exists, size changed)
```mermaid
sequenceDiagram
    participant C as Client
    participant M as Master Service
    participant TE as Transfer Engine
    participant R as Server
    C->>M: UpsertStart(client_id, key, new_size, config)
    Note over M: Acquire shard write lock
    M->>M: Find existing ObjectMetadata
    M->>M: size differs → delete-and-reallocate path
    M->>M: Check refcnt → reject if busy (OBJECT_REPLICA_BUSY)
    M->>M: Pop all replicas → DiscardedReplicas (TTL delayed release)
    M->>M: Erase old metadata
    M->>M: Allocate new replicas (same as PutStart)
    M->>M: Create new ObjectMetadata
    Note over M: Release shard write lock
    M-->>C: Return new replica descriptors
    C->>TE: TransferWrite(descriptors, data)
    TE->>R: RDMA Write to new buffer
    R-->>TE: Write complete
    TE-->>C: Transfer success
    C->>M: UpsertEnd(client_id, key, MEMORY)
    M->>M: Same as PutEnd
    M-->>C: Success
```
Preemption Path (key has in-progress write)
```mermaid
sequenceDiagram
    participant A as Client A
    participant B as Client B
    participant M as Master Service
    A->>M: PutStart(key) or UpsertStart(key)
    M-->>A: Return descriptors
    Note over A: Transfer in progress...
    B->>M: UpsertStart(key, ...)
    Note over M: key is in processing_keys
    M->>M: Pop A's PROCESSING replicas → DiscardedReplicas (with TTL)
    M->>M: Remove from processing_keys
    M->>M: Continue with normal Upsert flow (Case A/B/C)
    M-->>B: Return descriptors
    Note over M: A's replicas released after TTL<br/>by ReleaseExpiredDiscardedReplicas()
    A->>M: PutEnd(key) or UpsertEnd(key)
    M-->>A: ILLEGAL_CLIENT or OBJECT_NOT_FOUND
```
2.3.5 Failure Handling
- Client crash after UpsertStart: `DiscardExpiredProcessingReplicas` will clean up after `put_start_discard_timeout_sec_`, same as for `PutStart`. For in-place updates, the replicas are freed normally (they are PROCESSING with no COMPLETE replicas, so the entire metadata entry is removed). For delete-and-reallocate, cleanup is identical to a crashed `PutStart`.
- Allocation failure in delete-and-reallocate: If the old replicas have been freed but the new allocation fails, the key's data is lost. This is acceptable for model weight workloads where the source data (GPU/CPU) is still available and the client can retry.
- Preempted client calls UpsertEnd/PutEnd: Returns `ILLEGAL_CLIENT` (if metadata was reused with a new client_id) or `OBJECT_NOT_FOUND` (if metadata was erased and recreated). The client should handle this as a failed operation.
- Concurrent Copy/Move/Offload: Returns `OBJECT_HAS_REPLICATION_TASK`. The client should retry after the replication task completes.
- Active readers (non-zero refcnt): Returns `OBJECT_REPLICA_BUSY`. The client should retry after readers release their references. Note: this only blocks Case B/C; if the key is mid-write (in processing_keys), preemption proceeds regardless of refcnt on the PROCESSING replicas.
4. Client-Side API
C++:

```cpp
// Upsert (same signature as put_from, but updates if the key exists)
ReplicateConfig config;
config.with_hard_pin = true;  // hard pin set at creation time
client->upsert_from("model/policy/weights", buffer, size, config);
```

Python:

```python
# Upsert
config = ReplicateConfig(with_hard_pin=True)
store.upsert_from("model/policy/weights", buffer, size, config)
```

5. Configuration
No new configuration parameters are required. Hard pin is a per-object attribute set via ReplicateConfig.with_hard_pin at creation time. All existing configuration parameters retain their current behavior.
6. Migration and Backward Compatibility
- Wire format: `ReplicateConfig` gains one new boolean field (`with_hard_pin`, default `false`). Old clients sending the old format will default to `false` — no hard pin, existing behavior preserved.
- Snapshot format: A new `hard_pinned` field is appended to the serialized `ObjectMetadata`. On deserialization, if the field is absent (old snapshot), it defaults to `false`.
- Existing APIs: `Put`, `Get`, `Remove`, `PutStart`/`PutEnd`/`PutRevoke` are unchanged. No existing behavior is modified.
7. Testing Strategy
- Unit tests for `MasterService`:
  - UpsertNewKey: Case A — key doesn't exist, equivalent to PutStart
  - UpsertSameSize: Case B — in-place update, verifies descriptor addresses are reused
  - UpsertSameSizeRefreshesMetadata: Case B — verifies client_id is refreshed (old client gets `ILLEGAL_CLIENT`)
  - UpsertDifferentSize: Case C — verifies new descriptor addresses, old replicas released
  - UpsertConflictReplicationTask: CopyStart in progress → returns `OBJECT_HAS_REPLICATION_TASK`
  - UpsertPreemptsInProgressPut: Preempts an incomplete Put; the old client's PutEnd returns an error
  - UpsertRevoke: Case A + UpsertRevoke cleans up
  - UpsertInPlaceThenRevoke: Case B + UpsertRevoke, key is deleted
  - BatchUpsertStart: Mix of Case A and Case B in a single batch call
- Integration tests:
  - End-to-end upsert with data verification (write new data, verify data is updated after `UpsertEnd`)
  - Hard pin + upsert combined (update a hard-pinned object, verify it remains pinned)
  - Preemption scenario: two clients race on the same key, the second one wins
  - Client crash simulation (kill the client after `UpsertStart`, verify auto-cleanup by timeout)
- Snapshot/Restore tests:
  - Hard-pinned objects retain their pin state after master restart