Implementation of state management and state-leak fix for RWKV GGUF models#441
Open
A190nux wants to merge 5 commits into josStorer:master from
Conversation
Overview
This PR introduces comprehensive state management for RWKV models running via the GGUF (`llama.cpp`) backend. Because RWKV is a recurrent architecture, the model carries an internal hidden state between tokens; this implementation allows users to extract, inject, and reset that internal state, enabling instant context switching and persistent sequential memory without re-computation.
Additionally, this PR addresses a critical state-leak bug where RWKV models would maintain their recurrent state between unrelated generation calls even when the state cache was disabled.
Key Changes
1. Specialized API Endpoints
To avoid confusion with the native `.st`/`.pth` RWKV state management, new GGUF-specific endpoints have been added to `state_cache.py`:

- `/gguf-get-state`: Extracts the raw byte buffer of the RWKV hidden state from the C context using the `llama_cpp` bindings.
- `/gguf-set-state`: Injects a provided state buffer back into the model context and primes the token count to allow immediate resumption of a previous state.

2. State-Leak Fix & Hardware Sync
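The round trip these two endpoints expose can be illustrated with a minimal in-process sketch. Everything below (`MockRwkvContext`, the helper functions) is a hypothetical stand-in for illustration, not the PR's actual code, which operates on the real llama.cpp context:

```python
# Hypothetical stand-in for an RWKV GGUF context; the real endpoints
# manipulate the llama.cpp context via llama-cpp-python bindings.
class MockRwkvContext:
    def __init__(self):
        self.state = bytes(16)   # recurrent hidden state as a raw byte buffer
        self.token_count = 0

    def feed(self, tokens):
        # Pretend each token perturbs the hidden state.
        for t in tokens:
            self.state = bytes((b + t) % 256 for b in self.state)
            self.token_count += 1


def gguf_get_state(ctx):
    """Mirrors /gguf-get-state: snapshot the raw state buffer and token count."""
    return ctx.state, ctx.token_count


def gguf_set_state(ctx, state, token_count):
    """Mirrors /gguf-set-state: inject a saved buffer and prime the token count."""
    ctx.state = state
    ctx.token_count = token_count


ctx = MockRwkvContext()
ctx.feed([3, 7, 11])
snapshot, n = gguf_get_state(ctx)   # save a conversation snapshot

ctx.feed([99, 42])                  # an unrelated turn mutates the state
gguf_set_state(ctx, snapshot, n)    # resume the saved state exactly

assert ctx.state == snapshot and ctx.token_count == 3
```

Because the state travels as an opaque byte buffer, the caller never needs to know the RNN's internal layout; saving and restoring is a straight copy.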
- `stateless` flag: Added a `stateless` toggle to the `AbstractLlama` and `TextLlama` classes in `llama.py`. When `/disable-state-cache` is called, the model's `stateless` flag is set to `True`. This forces the `generate()` method to trigger `clear_rwkv_state()` before every new generation, ensuring no "memory" leaks from previous prompts.

Technical Implementation Details
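The gating logic can be sketched as follows. The class and inference internals are simplified stand-ins; only the `stateless` flag and `clear_rwkv_state()` call pattern come from the PR:

```python
class TextLlamaSketch:
    """Illustrative sketch of the stateless gating, not the PR's actual class."""

    def __init__(self, is_rwkv_model=True):
        self.is_rwkv_model = is_rwkv_model
        self.stateless = False     # toggled by /enable- and /disable-state-cache
        self.rwkv_state = None
        self.clear_calls = 0       # instrumentation for this sketch only

    def clear_rwkv_state(self):
        self.rwkv_state = None
        self.clear_calls += 1

    def generate(self, prompt):
        # The fix: when the state cache is disabled, wipe the recurrent
        # state before every generation so nothing leaks between prompts.
        if self.stateless and self.is_rwkv_model:
            self.clear_rwkv_state()
        self.rwkv_state = f"state-after:{prompt}"   # stand-in for real inference
        return f"reply-to:{prompt}"


model = TextLlamaSketch()
model.stateless = True        # what /disable-state-cache effectively sets
model.generate("prompt A")
model.generate("prompt B")    # state from "prompt A" is cleared first
```

Clearing at the start of `generate()` rather than at the end means a snapshot taken after a generation is still valid, while the next unrelated call always begins from a clean state.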
- Uses `llama_get_state_size`, `llama_copy_state_data`, and `llama_set_state_data` from the `llama-cpp-python` C bindings to manipulate the RNN hidden state.
- The state operations are guarded by a model check (`is_rwkv_model`) to ensure they only execute when an RWKV GGUF model is loaded, preventing incompatible operations on standard Transformer models.

How to Test
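The save/restore pattern behind those bindings looks roughly like the sketch below, with a `ctypes` buffer standing in for the real llama.cpp context. The three helpers only mimic the call shapes of `llama_get_state_size`, `llama_copy_state_data`, and `llama_set_state_data`; everything here is an illustrative stand-in, not the actual bindings:

```python
import ctypes

STATE_SIZE = 32

# Stand-in "context": in the real code this is the llama.cpp context object.
ctx_state = (ctypes.c_uint8 * STATE_SIZE)(*range(STATE_SIZE))

def get_state_size(ctx):
    # Mimics llama_get_state_size(ctx): how many bytes the state occupies.
    return ctypes.sizeof(ctx)

def copy_state_data(ctx, dst):
    # Mimics llama_copy_state_data(ctx, dst): copy state into the caller's buffer.
    ctypes.memmove(dst, ctx, ctypes.sizeof(ctx))

def set_state_data(ctx, src):
    # Mimics llama_set_state_data(ctx, src): overwrite state from the caller's buffer.
    ctypes.memmove(ctx, src, ctypes.sizeof(ctx))

# Save: size the buffer first, then copy the raw state out.
size = get_state_size(ctx_state)
saved = (ctypes.c_uint8 * size)()
copy_state_data(ctx_state, saved)

# Mutate the "state", then restore the snapshot byte-for-byte.
ctx_state[0] = 255
set_state_data(ctx_state, saved)
assert bytes(ctx_state) == bytes(saved)
```

The size-then-copy sequence matters: the caller allocates a buffer of exactly the reported size before the copy, which is the same contract the C API imposes.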
- Call `/disable-state-cache`, then send two unrelated prompts; the model should no longer "remember" the first prompt when answering the second.
- Call `/enable-state-cache`; the model should now maintain continuous memory across turns.
- Use `/gguf-get-state` to save a conversation snapshot and `/gguf-set-state` to resume that exact state in a fresh session.