1. Summary
This RFC proposes two independent, composable features to enhance Mooncake Store for model weight storage workloads:
- Hard Pin — A mechanism that guarantees an object will never be evicted until it is explicitly unpinned or removed. Hard pin is set at creation time via `ReplicateConfig`.
- Upsert — An update interface that replaces an existing object's data in-place (when size is unchanged) or via delete-and-reallocate (when size differs). Concurrent in-progress writes on the same key are preempted.
Both features are designed as orthogonal primitives. They can be used independently or together, preserving flexibility for reinforcement learning, model management, and hybrid KVCache + model weight workloads.
The rest of this document first explains the design of the two features, then summarizes implementation impact, client APIs, configuration, compatibility, and testing.
2. Design
2.1 Design Principles
- Minimal data structure changes — Reuse existing mechanisms (`replicas_`, `ReplicaStatus`, `processing_keys`, `DiscardedReplicas`) wherever possible.
- Predictable reader behavior — During upsert, the key temporarily has no readable replicas (all replicas are PROCESSING). Readers get an empty result, consistent with a key mid-`PutStart`. This is acceptable for model weight workloads where updates are coordinated.
- Memory efficiency — Avoid 2x memory overhead by reusing existing buffers (in-place) or freeing before reallocating, instead of copy-on-write (COW).
- Failure safety — Crashed or failed operations must be automatically cleaned up by existing timeout mechanisms.
2.2 Hard Pin
2.2.1 Data Model
Add a hard_pinned boolean to ObjectMetadata:
```cpp
// master_service.h — ObjectMetadata (alongside existing lease_timeout and soft_pin_timeout)
mutable bool hard_pinned GUARDED_BY(lock){false};
```

Why a boolean instead of a timeout?
- Model weights need indefinite protection. Using a timeout to approximate "forever" is the exact anti-pattern we're replacing.
- A boolean is cheaper to check (no clock comparison) and semantically unambiguous.
2.2.2 Setting Hard Pin
Hard pin is set at creation time via ReplicateConfig:
```cpp
// replica.h
struct ReplicateConfig {
    size_t replica_num{1};
    bool with_soft_pin{false};
    bool with_hard_pin{false};  // NEW
    // ...
};
```

When `with_hard_pin=true`, the object is hard-pinned from creation. This is the common path for model weight storage. Hard pin state is preserved across Upsert operations.
To unpin a hard-pinned object, the user must Remove it and re-create it without with_hard_pin.
2.2.3 Eviction Behavior
The eviction loop in BatchEvict() currently has two passes:
| Pass | What it evicts |
|---|---|
| First | Non-soft-pinned objects with expired leases |
| Second | Soft-pinned objects (if allow_evict_soft_pinned_objects_ is true) |
With hard pin, the change is minimal — add a single check at the top of the iteration:
```cpp
// master_service.cpp — BatchEvict(), in the per-object loop (3 locations)
if (it->second.IsHardPinned()) {
    ++it;
    continue;  // Skip — hard-pinned objects are NEVER evicted
}
// ... existing soft pin / lease checks unchanged
```

The eviction priority becomes:
Eviction Priority (first evicted → last evicted):
1. Non-pinned, lease-expired objects ← evicted first
2. Soft-pinned objects (if allowed) ← evicted under pressure
3. Hard-pinned objects ← NEVER evicted
2.2.4 HA Mode Serialization
The MetadataSerializer must persist the hard_pinned field so that hard pin state survives master restarts. This is a single boolean field appended to the existing msgpack serialization.
2.3 Upsert (In-Place Update / Delete-and-Reallocate)
2.3.1 Core Insight
Instead of a COW approach that requires 2x memory (old + new data coexist), Upsert takes a simpler, more memory-efficient approach:
- When size is unchanged: reuse the existing memory buffers — mark replicas as PROCESSING, let the client overwrite the data in-place, then mark them COMPLETE again.
- When size differs: free the old replicas first, then allocate new ones at the required size.
- When the key has an in-progress write (Put or Upsert): preempt the in-progress operation — move its PROCESSING replicas to `DiscardedReplicas` for delayed release, then proceed with the new Upsert.
This design reuses existing mechanisms with minimal changes:
| Existing Mechanism | Role in Upsert |
|---|---|
| `ReplicaStatus` state machine | COMPLETE → PROCESSING → COMPLETE lifecycle for in-place updates |
| `processing_keys` set | Tracks keys with active Upsert/Put writes |
| `DiscardedReplicas` + TTL | Safe delayed release of preempted PROCESSING replicas |
| `DiscardExpiredProcessingReplicas` | Auto-cleanup of a failed upsert's replicas on timeout |
Required metadata changes:
- `ObjectMetadata::put_start_time` must become non-`const` — refreshed to now on in-place UpsertStart, so `DiscardExpiredProcessingReplicas` times out the current write cycle, not the original creation time.
- `ObjectMetadata::client_id` must become non-`const` — updated to the new client on in-place UpsertStart.
- `ObjectMetadata::size` remains `const` — when size changes, the metadata entry is deleted and recreated.
- `Replica` needs new methods:
  - `mark_processing()` — transitions COMPLETE → PROCESSING for in-place updates.
  - `is_processing()` / `fn_is_processing()` — predicate for filtering PROCESSING replicas during preemption.
  - `is_busy()` / `fn_is_busy()` — checks `refcnt > 0`, used as a safety gate before allowing Case B/C.
New error code:
`OBJECT_REPLICA_BUSY` (-714) — returned when UpsertStart finds replicas with non-zero refcnt (active RDMA reads in flight). The client should retry after readers finish.
No changes are needed to eviction logic or any other existing mechanism.
2.3.2 API Design
Following the existing PutStart/PutEnd/PutRevoke pattern:
```cpp
auto UpsertStart(const UUID& client_id, const std::string& key,
                 const uint64_t slice_length, const ReplicateConfig& config)
    -> tl::expected<std::vector<Replica::Descriptor>, ErrorCode>;

auto UpsertEnd(const UUID& client_id, const std::string& key,
               ReplicaType replica_type)
    -> tl::expected<void, ErrorCode>;

auto UpsertRevoke(const UUID& client_id, const std::string& key,
                  ReplicaType replica_type)
    -> tl::expected<void, ErrorCode>;
```

`UpsertEnd` and `UpsertRevoke` are functionally equivalent to `PutEnd` and `PutRevoke` — no special logic is needed.
2.3.3 Detailed Flow
```text
UpsertStart(client_id, key, slice_length, config):

  Step 0: Cleanup stale handles
    IF key exists AND all handles are stale (CleanupStaleHandles returns true):
      → Erase processing_keys entry and metadata
      → Treat as non-existent (fall through to Case A)

  Step 1: Safety checks and preemption (only if key exists)
    1a. IF key has active replication_tasks (Copy/Move in progress):
      → Return OBJECT_HAS_REPLICATION_TASK
    1b. IF key has active offloading_tasks:
      → Return OBJECT_HAS_REPLICATION_TASK
    1c. IF key is in processing_keys (another Put/Upsert in progress):
      → Pop all PROCESSING replicas → DiscardedReplicas (TTL for delayed release)
      → Remove key from processing_keys
      → If no COMPLETE replicas remain, erase metadata entirely
        and fall through to Case A

  Step 2: Dispatch by key existence and size
    Case A — Key does not exist (or erased by Step 0/1c):
      → Equivalent to PutStart (allocate replicas, create ObjectMetadata)
      → Return new replica descriptors

    (Safety gate before Case B/C)
    IF any replica has non-zero refcnt (is_busy):
      → Return OBJECT_REPLICA_BUSY

    Case B — Key exists, size unchanged (in-place update):
      → Mark all COMPLETE replicas → PROCESSING (via mark_processing())
      → Refresh put_start_time = now
      → Update client_id = new client_id
      → Insert key into processing_keys
      → Return existing replica descriptors (same memory addresses)

    Case C — Key exists, size changed (delete-and-reallocate):
      → Pop all replicas → DiscardedReplicas (TTL for delayed release,
        same as MoveEnd — needed because GetReplicaList doesn't use refcnt,
        so readers may still hold descriptors without incrementing refcnt)
      → Erase metadata entry
      → Allocate new replicas + create new ObjectMetadata (same as PutStart)
      → Return new replica descriptors
```
Client transfers data (identical to Put — TransferWrite via RDMA/TCP)
UpsertEnd(client_id, key, replica_type):
- Equivalent to `PutEnd` — mark PROCESSING → COMPLETE, grant lease, remove from processing_keys.

UpsertRevoke(client_id, key, replica_type):
- Equivalent to `PutRevoke` — erase PROCESSING replicas, clean up metadata if empty.
2.3.4 Swimlane Diagrams
In-Place Update Path (key exists, size unchanged)
```mermaid
sequenceDiagram
    participant C as Client
    participant M as Master Service
    participant TE as Transfer Engine
    participant R as Server
    C->>M: UpsertStart(client_id, key, same_size, config)
    Note over M: Acquire shard write lock
    M->>M: Find existing ObjectMetadata
    M->>M: Safety checks (replication/offloading tasks, refcnt)
    M->>M: size unchanged → in-place path
    M->>M: mark_processing() on all COMPLETE replicas
    M->>M: Refresh put_start_time, update client_id
    M->>M: Insert key into processing_keys
    Note over M: Release shard write lock
    M-->>C: Return existing replica descriptors (same addresses)
    C->>TE: TransferWrite(descriptors, new_data)
    TE->>R: RDMA Write to same buffer
    R-->>TE: Write complete
    TE-->>C: Transfer success
    C->>M: UpsertEnd(client_id, key, MEMORY)
    Note over M: Same as PutEnd
    M->>M: mark_complete() on PROCESSING replicas
    M->>M: Remove from processing_keys
    M->>M: GrantLease (preserve hard_pin/soft_pin)
    M-->>C: Success
```
Delete-and-Reallocate Path (key exists, size changed)
```mermaid
sequenceDiagram
    participant C as Client
    participant M as Master Service
    participant TE as Transfer Engine
    participant R as Server
    C->>M: UpsertStart(client_id, key, new_size, config)
    Note over M: Acquire shard write lock
    M->>M: Find existing ObjectMetadata
    M->>M: size differs → delete-and-reallocate path
    M->>M: Check refcnt → reject if busy (OBJECT_REPLICA_BUSY)
    M->>M: Pop all replicas → DiscardedReplicas (TTL delayed release)
    M->>M: Erase old metadata
    M->>M: Allocate new replicas (same as PutStart)
    M->>M: Create new ObjectMetadata
    Note over M: Release shard write lock
    M-->>C: Return new replica descriptors
    C->>TE: TransferWrite(descriptors, data)
    TE->>R: RDMA Write to new buffer
    R-->>TE: Write complete
    TE-->>C: Transfer success
    C->>M: UpsertEnd(client_id, key, MEMORY)
    M->>M: Same as PutEnd
    M-->>C: Success
```
Preemption Path (key has in-progress write)
```mermaid
sequenceDiagram
    participant A as Client A
    participant B as Client B
    participant M as Master Service
    A->>M: PutStart(key) or UpsertStart(key)
    M-->>A: Return descriptors
    Note over A: Transfer in progress...
    B->>M: UpsertStart(key, ...)
    Note over M: key is in processing_keys
    M->>M: Pop A's PROCESSING replicas → DiscardedReplicas (with TTL)
    M->>M: Remove from processing_keys
    M->>M: Continue with normal Upsert flow (Case A/B/C)
    M-->>B: Return descriptors
    Note over M: A's replicas released after TTL<br/>by ReleaseExpiredDiscardedReplicas()
    A->>M: PutEnd(key) or UpsertEnd(key)
    M-->>A: ILLEGAL_CLIENT or OBJECT_NOT_FOUND
```
2.3.5 Failure Handling
- Client crash after UpsertStart: `DiscardExpiredProcessingReplicas` will clean up after `put_start_discard_timeout_sec_`, same as for `PutStart`. For in-place updates, the replicas are freed normally (they are PROCESSING with no COMPLETE replicas, so the entire metadata entry is removed). For delete-and-reallocate, cleanup is identical to a crashed `PutStart`.
- Allocation failure in delete-and-reallocate: If the old replicas have been freed but the new allocation fails, the key's data is lost. This is acceptable for model weight workloads where the source data (GPU/CPU) is still available and the client can retry.
- Preempted client calls UpsertEnd/PutEnd: Returns `ILLEGAL_CLIENT` (if metadata was reused with a new client_id) or `OBJECT_NOT_FOUND` (if metadata was erased and recreated). The client should handle this as a failed operation.
- Concurrent Copy/Move/Offload: Returns `OBJECT_HAS_REPLICATION_TASK`. The client should retry after the replication task completes.
- Active readers (non-zero refcnt): Returns `OBJECT_REPLICA_BUSY`. The client should retry after readers release their references. Note: this only blocks Case B/C; if the key is mid-write (in processing_keys), preemption proceeds regardless of refcnt on the PROCESSING replicas.
4. Client-Side API
C++:

```cpp
// Upsert (same signature as put_from, but updates if the key exists)
ReplicateConfig config;
config.with_hard_pin = true;  // hard pin set at creation time
client->upsert_from("model/policy/weights", buffer, size, config);
```

Python:

```python
# Upsert
config = ReplicateConfig(with_hard_pin=True)
store.upsert_from("model/policy/weights", buffer, size, config)
```

5. Configuration
No new configuration parameters are required. Hard pin is a per-object attribute set via ReplicateConfig.with_hard_pin at creation time. All existing configuration parameters retain their current behavior.
6. Migration and Backward Compatibility
- Wire format: `ReplicateConfig` gains one new boolean field (`with_hard_pin`, default `false`). Old clients sending the old format will default to `false` — no hard pin, existing behavior preserved.
- Snapshot format: A new `hard_pinned` field is appended to the serialized `ObjectMetadata`. On deserialization, if the field is absent (old snapshot), it defaults to `false`.
- Existing APIs: `Put`, `Get`, `Remove`, `PutStart`/`PutEnd`/`PutRevoke` are unchanged. No existing behavior is modified.
7. Testing Strategy
- Unit tests for `MasterService`:
  - UpsertNewKey: Case A — key doesn't exist, equivalent to PutStart
  - UpsertSameSize: Case B — in-place update, verifies descriptor addresses are reused
  - UpsertSameSizeRefreshesMetadata: Case B — verifies client_id is refreshed (old client gets `ILLEGAL_CLIENT`)
  - UpsertDifferentSize: Case C — verifies new descriptor addresses, old replicas released
  - UpsertConflictReplicationTask: CopyStart in progress → returns `OBJECT_HAS_REPLICATION_TASK`
  - UpsertPreemptsInProgressPut: Preempts an incomplete Put; the old client's PutEnd returns an error
  - UpsertRevoke: Case A + UpsertRevoke cleans up
  - UpsertInPlaceThenRevoke: Case B + UpsertRevoke, key is deleted
  - BatchUpsertStart: Mix of Case A and Case B in a single batch call
- Integration tests:
  - End-to-end upsert with data verification (write new data, verify data is updated after `UpsertEnd`)
  - Hard pin + upsert combined (update a hard-pinned object, verify it remains pinned)
  - Preemption scenario: two clients race on the same key, the second one wins
  - Client crash simulation (kill the client after `UpsertStart`, verify auto-cleanup by timeout)
- Snapshot/Restore tests:
  - Hard-pinned objects retain their pin state after master restart