Implementation of state management and state-leak fix for RWKV GGUF models #441

Open
A190nux wants to merge 5 commits into josStorer:master from A190nux:master
Conversation


A190nux (Contributor) commented Feb 19, 2026

Overview

This PR introduces comprehensive state management for RWKV models running via the GGUF (llama.cpp) backend.
Because RWKV is a recurrent architecture, the model's "memory" lives in an internal hidden state rather than a growing KV cache. This implementation allows users to extract, inject, and reset that internal state, enabling instant context switching and persistent sequential memory without re-computation.

Additionally, this PR addresses a critical state-leak bug where RWKV models would maintain their recurrent state between unrelated generation calls even when the state cache was disabled.

Key Changes

1. Specialized API Endpoints
To avoid confusion with native .st/.pth RWKV state management, new GGUF-specific endpoints have been added to state_cache.py:

  • /gguf-get-state: Extracts the raw byte buffer of the RWKV hidden state from the C context using llama_cpp bindings.
  • /gguf-set-state: Injects a provided state buffer back into the model context and primes the token count, allowing generation to resume immediately from a saved state.
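Conceptually, the two endpoints reduce to a save/restore round-trip over an opaque byte buffer plus a token count. The sketch below illustrates that contract; `StubRwkvContext` is a stand-in invented here for the llama.cpp context, and the real endpoints would go through the `llama_cpp` bindings rather than these stub methods.

```python
class StubRwkvContext:
    """Stand-in for a llama.cpp context holding an RWKV recurrent state."""

    def __init__(self):
        self._state = b"\x00" * 16   # pretend hidden-state buffer
        self.n_tokens = 0            # tokens already evaluated

    def copy_state_data(self) -> bytes:
        # Analogous to llama_copy_state_data: serialize the hidden state.
        return self._state

    def set_state_data(self, buf: bytes) -> None:
        # Analogous to llama_set_state_data: overwrite the hidden state.
        if len(buf) != len(self._state):
            raise ValueError("state buffer size mismatch")
        self._state = buf


def gguf_get_state(ctx: StubRwkvContext) -> dict:
    """What /gguf-get-state returns: the raw state plus the token count."""
    return {"state": ctx.copy_state_data(), "n_tokens": ctx.n_tokens}


def gguf_set_state(ctx: StubRwkvContext, snapshot: dict) -> None:
    """What /gguf-set-state does: inject the buffer and prime the count."""
    ctx.set_state_data(snapshot["state"])
    ctx.n_tokens = snapshot["n_tokens"]
```

Priming `n_tokens` alongside the raw buffer is what makes resumption "immediate": the prefix cache can treat the restored context as if those tokens had just been evaluated.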

2. State-Leak Fix & Hardware Sync

  • stateless Flag: Added a stateless toggle to the AbstractLlama and TextLlama classes in llama.py.
  • Automated Reset: When the global State Cache is disabled via /disable-state-cache, the model's stateless flag is set to True. This forces the generate() method to trigger clear_rwkv_state() before every new generation, ensuring no "memory" leaks from previous prompts.
  • Persistent Memory: When the State Cache is enabled, the model maintains its sequential state as intended, allowing the Trie-based prefix cache to function correctly for long-form conversations.
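The reset logic above can be sketched as follows. The class and method names mirror those mentioned in the PR (`stateless`, `clear_rwkv_state`, `generate`), but the body is an illustrative simplification, not the PR's actual code; the counter exists only to make the behavior observable.

```python
class AbstractLlamaSketch:
    """Illustrative sketch of the stateless toggle on the GGUF backend."""

    def __init__(self):
        self.stateless = False        # set True by /disable-state-cache
        self.state_clears = 0         # illustration only: count resets

    def clear_rwkv_state(self):
        # The real method would zero the RWKV hidden state in the
        # llama.cpp context; here we just record that a reset happened.
        self.state_clears += 1

    def generate(self, prompt: str) -> str:
        if self.stateless:
            # State-leak fix: start every generation from a blank
            # recurrent state so nothing carries over between prompts.
            self.clear_rwkv_state()
        return f"completion for: {prompt}"
```

With `stateless = False` (cache enabled), no reset fires and the sequential state accumulates across calls, which is exactly what the Trie-based prefix cache relies on.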

Technical Implementation Details

  • Backend: Utilizes llama_get_state_size, llama_copy_state_data, and llama_set_state_data from the llama-cpp-python C-bindings to manipulate the RNN hidden state.
  • Safety Guards: All new endpoints and state-manipulation methods include checks (e.g., is_rwkv_model) to ensure they only execute when an RWKV GGUF model is loaded, preventing incompatible operations on standard Transformer models.
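The guard pattern might look like the sketch below. The architecture strings and the error shape are assumptions for illustration (llama.cpp reports the model architecture in GGUF metadata, e.g. `rwkv6`/`rwkv7` for RWKV versus `llama` for Transformer models); the PR's actual `is_rwkv_model` check may differ.

```python
def is_rwkv_model(model_arch: str) -> bool:
    # Assumed check: RWKV architectures in GGUF metadata start with "rwkv".
    return model_arch.startswith("rwkv")


def guarded_get_state(model_arch: str, get_state):
    """Refuse state operations on non-RWKV models.

    `get_state` is a callable producing the raw state buffer; in the real
    code this would wrap the llama_cpp binding calls.
    """
    if not is_rwkv_model(model_arch):
        # Standard Transformer models have no single recurrent state
        # buffer to extract, so the operation is rejected up front.
        return {"error": "state endpoints require an RWKV GGUF model"}
    return {"state": get_state()}
```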

How to Test

  1. Load Model: Load an RWKV-7 GGUF model.
  2. Verify Leak Fix: Call /disable-state-cache. Send two unrelated prompts; the model should no longer "remember" the first prompt when answering the second.
  3. Test State Persistence: Call /enable-state-cache. The model should now maintain continuous memory across turns.
  4. Manual State Management: Use /gguf-get-state to save a conversation snapshot and /gguf-set-state to resume that exact state in a fresh session.
