Create Pmll.cpp #14981
Conversation
How it works

1. State I/O
The llama API exposes whole-context save/restore (llama_state_save_file, llama_state_load_file) and per-sequence variants (llama_state_seq_*). The official example examples/save-load-state.cpp shows the minimal pattern. We wrap that in persist() / restore() so each sequence's KV-cache, RoPE shifts, logits and RNG state are serialized to disk automatically after every decode step (a minimal sketch follows this description).
2. Memory safety
llama_memory_t is grabbed once at construction and never freed manually; llama_free(ctx_) takes care of it. No direct GGML graph surgery is done, so the code keeps working even with the new unified-KV flag (kv_unified) and the SWA cache tweaks introduced in early 2025.
3. The logic-loop hook
You can subclass pmll::LoopHook to push every token to Graphiti / SQLite / a vector DB, run extra PMLL graph updates, or even call back into another LLM. If the hook returns false, the outer loop aborts, leaving the snapshot on disk for the next run.
4. Extensibility
Because it talks only to the stable C API, the file is independent of any architecture plug-ins you might add per the "HOWTO-add-model" guide. The greedy sampler is deliberately trivial; replace sample_next() with your favourite top-k/top-p or grammar-guided sampler (see src/llama-sampling.cpp for ready-made helpers).

⸻

Why this is safe & portable
- State-saving bugs (e.g. the llama_state_get_size mis-count) were fixed in PR ggml-org#13463, so head-of-master is fine.
- The KV-cache API stabilised after issue ggml-org#730, so the calls we use won't disappear.
- External wrappers (e.g. llama-cpp-python) already rely on the same pattern, proving cross-platform viability.

⸻

Next steps
1. Encrypt snapshots if you handle private data: just AES-GCM the buffer before writing.
2. Delta-KV compression: only write the newly appended KV rows each step and compress them with Zstd.
3. Graphiti bridge: inside your LoopHook, call the Graphiti add_episode() mutation you drafted earlier to keep the PMLL knowledge graph in sync.

Happy looping: your Llama is now stateful. 🐑🔄
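As a rough illustration of the state I/O in item 1, here is a minimal sketch of whole-context save/restore built on the public C API (llama_state_save_file / llama_state_load_file). It is an assumption of how the PR's persist() / restore() wrappers could be layered on those calls, not a copy of Pmll.cpp; the helper names and snapshot path handling are illustrative.

// Minimal save/restore sketch around the llama.cpp state API.
// ctx must be a valid llama_context*; tokens is the token history that has
// already been decoded into it. Per-sequence snapshots would use the
// llama_state_seq_save_file / llama_state_seq_load_file variants instead.
#include <vector>
#include "llama.h"

static bool persist_ctx(llama_context * ctx,
                        const std::vector<llama_token> & tokens,
                        const char * path) {
    // Writes the full context state (KV cache, logits, RNG, ...) plus the
    // token history to a single file.
    return llama_state_save_file(ctx, path, tokens.data(), tokens.size());
}

static bool restore_ctx(llama_context * ctx,
                        std::vector<llama_token> & tokens,
                        const char * path) {
    size_t n_loaded = 0;
    tokens.resize(llama_n_ctx(ctx));   // capacity for the saved token history
    if (!llama_state_load_file(ctx, path, tokens.data(), tokens.size(), &n_loaded)) {
        return false;                  // no usable snapshot: caller starts fresh
    }
    tokens.resize(n_loaded);
    return true;
}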
Pull Request Overview
This PR introduces a persistent memory logic loop adapter for llama.cpp that enables stateful LLM interactions by automatically saving and restoring context state between generation steps. The implementation provides a C++ wrapper around llama.cpp's state management APIs to create resumable conversation sessions.
- Implements automatic state persistence after each token generation step
- Provides an extensible hook system for custom logic injection during generation (a hook sketch follows this list)
- Wraps llama.cpp's low-level state save/restore APIs in a user-friendly interface
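To make the hook system concrete, the sketch below shows one way a pmll::LoopHook subclass could look. The interface (a single on_token() callback whose false return aborts the loop) is an assumption reconstructed from the description above; the method name and signature in the actual Pmll.cpp may differ.

// Hypothetical LoopHook interface and a simple subclass that appends every
// decoded piece to a transcript file. All names here are illustrative.
#include <cstdio>
#include <string>
#include "llama.h"

namespace pmll {
struct LoopHook {
    virtual ~LoopHook() = default;
    // Return false to abort the outer generation loop; the last snapshot
    // stays on disk for the next run.
    virtual bool on_token(llama_token tok, const std::string & piece) = 0;
};
} // namespace pmll

struct TranscriptHook : pmll::LoopHook {
    FILE * out = std::fopen("transcript.txt", "a");
    ~TranscriptHook() override { if (out) std::fclose(out); }
    bool on_token(llama_token /*tok*/, const std::string & piece) override {
        if (!out) return false;                  // abort the loop on I/O failure
        std::fwrite(piece.data(), 1, piece.size(), out);
        std::fflush(out);
        return true;                             // keep generating
    }
};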
std::string tokens_to_str(const std::vector<llama_token>& t) {
    std::string s;
    for (auto tok : t) {
        char buf[8];
Copilot AI, Jul 31, 2025
The buffer size of 8 bytes is insufficient for token-to-string conversion. Some tokens can produce UTF-8 sequences longer than 8 bytes, which could lead to truncated output or buffer overflow. Consider using a larger buffer size (e.g., 32 or 64 bytes) or dynamically allocating based on the token.
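One hedged way to address this, assuming the llama_token_to_piece(vocab, token, buf, length, lstrip, special) signature used in recent llama.cpp (where a negative return value is the negated required length), is to start with a larger buffer and retry if it is still too small:

// Sketch: convert one token to its UTF-8 piece without risking truncation.
// Assumes a recent llama.cpp API where the vocab handle is passed explicitly.
#include <string>
#include "llama.h"

static std::string token_to_piece(const llama_vocab * vocab, llama_token tok) {
    std::string piece(32, '\0');   // roomier default than the 8-byte stack buffer
    int n = llama_token_to_piece(vocab, tok, piece.data(), (int) piece.size(), 0, true);
    if (n < 0) {                   // buffer still too small: -n is the needed size
        piece.resize(-n);
        n = llama_token_to_piece(vocab, tok, piece.data(), (int) piece.size(), 0, true);
    }
    piece.resize(n > 0 ? n : 0);
    return piece;
}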
private:
    llama_token sample_next() {
        const float* logits = llama_get_logits(ctx_);
        int n_vocab = llama_n_vocab(llama_model_get_vocab(model_));
Copilot AI, Jul 31, 2025
The function llama_model_get_vocab() appears to be an incorrect API usage. Based on llama.cpp's API, this should likely be llama_n_vocab(model_) directly, as llama_n_vocab typically takes the model pointer, not a vocab object.
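For context, the PR description says sample_next() is intentionally a trivial greedy sampler. A hedged sketch of that argmax is below; whether the vocabulary size comes from llama_n_vocab(model_) or from a vocab handle (llama_model_get_vocab plus llama_vocab_n_tokens) depends on the llama.cpp revision being targeted, so the calls here are an assumption about a recent API.

// Greedy argmax over the last token's logits. ctx_ and model_ mirror the
// members in the reviewed snippet; assumes only the last batch entry
// requested logits.
#include "llama.h"

static llama_token sample_next_greedy(llama_context * ctx_, const llama_model * model_) {
    const llama_vocab * vocab   = llama_model_get_vocab(model_);
    const int           n_vocab = llama_vocab_n_tokens(vocab);
    const float       * logits  = llama_get_logits(ctx_);
    llama_token best = 0;
    for (llama_token t = 1; t < n_vocab; ++t) {
        if (logits[t] > logits[best]) {
            best = t;              // keep the highest-logit token seen so far
        }
    }
    return best;
}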
for (int i = 0; i < n; ++i) {
    batch.token[i] = tokens[i];
    batch.pos[i] = i;
    batch.seq_id[i] = &seq;
Copilot AI, Jul 31, 2025
Taking the address of the seq parameter is incorrect. The seq_id field expects an array of sequence IDs, not a pointer to the seq variable. This should be batch.seq_id[i] = seq; and the seq_id array should be properly allocated.
Suggested change:
-    batch.seq_id[i] = &seq;
+    batch.seq_id[i] = seq;
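For reference, on llama.cpp versions where llama_batch.seq_id[i] is a small per-token array allocated by llama_batch_init (rather than a plain id), the usual fill pattern writes into that array and also sets n_seq_id. The sketch below is an assumption along those lines; n, tokens, seq and ctx_ mirror the reviewed snippet.

// Hedged sketch of the common llama_batch fill pattern.
llama_batch batch = llama_batch_init(n, 0, 1);   // room for n tokens, 1 seq id each
for (int i = 0; i < n; ++i) {
    batch.token[i]     = tokens[i];
    batch.pos[i]       = i;
    batch.n_seq_id[i]  = 1;             // one sequence id for this token
    batch.seq_id[i][0] = seq;           // write the id into the per-token array
    batch.logits[i]    = (i == n - 1);  // only the last token needs logits
}
batch.n_tokens = n;
// llama_decode(ctx_, batch); ... llama_batch_free(batch);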
llama_batch b1 = llama_batch_init(1, 0, 1);
b1.token[0] = sample_next(); // greedy / top-k
b1.pos[0] = tokens.size() + step;
b1.seq_id[0] = &seq;
Copilot AI, Jul 31, 2025
Same issue as line 75 - taking the address of seq is incorrect. This should be b1.seq_id[0] = seq; and proper sequence ID array handling should be implemented.
Suggested change:
-    b1.seq_id[0] = &seq;
+    b1.seq_id[0] = seq;
LoopHook* user_hook_;
llama_model* model_ = nullptr;
llama_context* ctx_ = nullptr;
llama_memory_t mem_ = nullptr;
Copilot AI, Jul 31, 2025
The mem_ member variable is assigned in the constructor but never used elsewhere in the class. Consider removing it if it's not needed, or document its intended purpose if it will be used in future functionality.
Suggested change:
-    llama_memory_t mem_ = nullptr;
+    // Removed unused mem_ member variable.
    tokens.data(), tokens.size(), true, true);
tokens.resize(n);

llama_batch batch = llama_batch_init(n, 0, 1);
Copilot AI, Jul 31, 2025
The batch allocation and deallocation happen multiple times in the generation loop. Consider reusing batch objects, or allocating them once outside the loop, to reduce memory-allocation overhead during token generation.
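One hedged way to act on this is to allocate a single one-token batch before the loop and reuse it every step, freeing it once afterwards. The sketch assumes the per-token seq-id array layout mentioned above; n_steps, tokens, seq, sample_next() and ctx_ mirror the reviewed snippet.

// Reuse one single-token batch across the generation loop instead of calling
// llama_batch_init / llama_batch_free on every iteration.
llama_batch b1 = llama_batch_init(1, 0, 1);        // capacity: 1 token, 1 seq id
for (int step = 0; step < n_steps; ++step) {
    b1.n_tokens     = 1;
    b1.token[0]     = sample_next();               // greedy / top-k
    b1.pos[0]       = (llama_pos) (tokens.size() + step);
    b1.n_seq_id[0]  = 1;
    b1.seq_id[0][0] = seq;
    b1.logits[0]    = true;                        // need logits for the next sample
    if (llama_decode(ctx_, b1) != 0) {
        break;                                     // decode failed: stop generating
    }
    // persist() / LoopHook calls would go here
}
llama_batch_free(b1);                              // freed once, after the loop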