Implementation of state management and state-leak fix for RWKV GGUF models #441

Open
A190nux wants to merge 5 commits into josStorer:master from A190nux:master
Conversation


A190nux (Contributor) commented Feb 19, 2026

Overview

This PR introduces comprehensive state management for RWKV models running via the GGUF (llama.cpp) backend.
Because RWKV is a recurrent architecture, the model's "memory" lives in an internal hidden state rather than a growing KV cache. This implementation allows users to extract, inject, and reset that internal state, enabling instant context switching and persistent sequential memory without re-computation.

Additionally, this PR addresses a critical state-leak bug where RWKV models would maintain their recurrent state between unrelated generation calls even when the state cache was disabled.

Key Changes

1. Specialized API Endpoints
To avoid confusion with native .st/.pth RWKV state management, new GGUF-specific endpoints have been added to state_cache.py:

  • /gguf-get-state: Extracts the raw byte buffer of the RWKV hidden state from the C context using llama_cpp bindings.
  • /gguf-set-state: Injects a provided state buffer back into the model context and primes the token count, allowing generation to resume immediately from a saved state.
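Conceptually, the two endpoints reduce to a save/restore round-trip over an opaque byte buffer plus a token count. The sketch below illustrates that contract; `StubRwkvContext` is a stand-in invented here for the llama.cpp context, and the real endpoints would go through the `llama_cpp` bindings rather than these stub methods.

```python
class StubRwkvContext:
    """Stand-in for a llama.cpp context holding an RWKV recurrent state."""

    def __init__(self):
        self._state = b"\x00" * 16   # pretend hidden-state buffer
        self.n_tokens = 0            # tokens already evaluated

    def copy_state_data(self) -> bytes:
        # Analogous to llama_copy_state_data: serialize the hidden state.
        return self._state

    def set_state_data(self, buf: bytes) -> None:
        # Analogous to llama_set_state_data: overwrite the hidden state.
        if len(buf) != len(self._state):
            raise ValueError("state buffer size mismatch")
        self._state = buf


def gguf_get_state(ctx: StubRwkvContext) -> dict:
    """What /gguf-get-state returns: the raw state plus the token count."""
    return {"state": ctx.copy_state_data(), "n_tokens": ctx.n_tokens}


def gguf_set_state(ctx: StubRwkvContext, snapshot: dict) -> None:
    """What /gguf-set-state does: inject the buffer and prime the count."""
    ctx.set_state_data(snapshot["state"])
    ctx.n_tokens = snapshot["n_tokens"]
```

Priming `n_tokens` alongside the raw buffer is what makes resumption "immediate": the prefix cache can treat the restored context as if those tokens had just been evaluated.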

2. State-Leak Fix & Hardware Sync

  • stateless Flag: Added a stateless toggle to the AbstractLlama and TextLlama classes in llama.py.
  • Automated Reset: When the global State Cache is disabled via /disable-state-cache, the model's stateless flag is set to True. This forces the generate() method to trigger clear_rwkv_state() before every new generation, ensuring no "memory" leaks from previous prompts.
  • Persistent Memory: When the State Cache is enabled, the model maintains its sequential state as intended, allowing the Trie-based prefix cache to function correctly for long-form conversations.
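The reset logic above can be sketched as follows. The class and method names mirror those mentioned in the PR (`stateless`, `clear_rwkv_state`, `generate`), but the body is an illustrative simplification, not the PR's actual code; the counter exists only to make the behavior observable.

```python
class AbstractLlamaSketch:
    """Illustrative sketch of the stateless toggle on the GGUF backend."""

    def __init__(self):
        self.stateless = False        # set True by /disable-state-cache
        self.state_clears = 0         # illustration only: count resets

    def clear_rwkv_state(self):
        # The real method would zero the RWKV hidden state in the
        # llama.cpp context; here we just record that a reset happened.
        self.state_clears += 1

    def generate(self, prompt: str) -> str:
        if self.stateless:
            # State-leak fix: start every generation from a blank
            # recurrent state so nothing carries over between prompts.
            self.clear_rwkv_state()
        return f"completion for: {prompt}"
```

With `stateless = False` (cache enabled), no reset fires and the sequential state accumulates across calls, which is exactly what the Trie-based prefix cache relies on.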

Technical Implementation Details

  • Backend: Utilizes llama_get_state_size, llama_copy_state_data, and llama_set_state_data from the llama-cpp-python C-bindings to manipulate the RNN hidden state.
  • Safety Guards: All new endpoints and state-manipulation methods include checks (e.g., is_rwkv_model) to ensure they only execute when an RWKV GGUF model is loaded, preventing incompatible operations on standard Transformer models.
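The guard pattern might look like the sketch below. The architecture strings and the error shape are assumptions for illustration (llama.cpp reports the model architecture in GGUF metadata, e.g. `rwkv6`/`rwkv7` for RWKV versus `llama` for Transformer models); the PR's actual `is_rwkv_model` check may differ.

```python
def is_rwkv_model(model_arch: str) -> bool:
    # Assumed check: RWKV architectures in GGUF metadata start with "rwkv".
    return model_arch.startswith("rwkv")


def guarded_get_state(model_arch: str, get_state):
    """Refuse state operations on non-RWKV models.

    `get_state` is a callable producing the raw state buffer; in the real
    code this would wrap the llama_cpp binding calls.
    """
    if not is_rwkv_model(model_arch):
        # Standard Transformer models have no single recurrent state
        # buffer to extract, so the operation is rejected up front.
        return {"error": "state endpoints require an RWKV GGUF model"}
    return {"state": get_state()}
```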

How to Test

  1. Load Model: Load an RWKV-7 GGUF model.
  2. Verify Leak Fix: Call /disable-state-cache. Send two unrelated prompts; the model should no longer "remember" the first prompt when answering the second.
  3. Test State Persistence: Call /enable-state-cache. The model should now maintain continuous memory across turns.
  4. Manual State Management: Use /gguf-get-state to save a conversation snapshot and /gguf-set-state to resume that exact state in a fresh session.
