@drQedwards

How it works

1. State I/O
	•	The llama.cpp C API exposes whole-context save/restore (llama_state_save_file, llama_state_load_file) and per-sequence variants (llama_state_seq_*).
	•	The official example examples/save-load-state.cpp shows the minimal pattern.
	•	We wrap that in persist() / restore() so each sequence’s KV-cache, RoPE shifts, logits & RNG state can be serialized to disk automatically after every decode step (see the sketch below).
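
A minimal sketch of what persist() / restore() boil down to, built on the per-sequence calls named above. Treat it as a sketch: the llama_state_seq_* signatures and return-value conventions should be checked against your llama.h, and the surrounding names are illustrative.

#include <string>
#include <vector>
#include "llama.h"

// Save one sequence's KV state plus the tokens needed to re-seed it on restore.
bool persist(llama_context * ctx, llama_seq_id seq,
             const std::vector<llama_token> & toks, const std::string & path) {
    return llama_state_seq_save_file(ctx, path.c_str(), seq,
                                     toks.data(), toks.size()) > 0;
}

// Returns false when no usable snapshot exists, e.g. on the very first run.
bool restore(llama_context * ctx, llama_seq_id seq,
             std::vector<llama_token> & toks, const std::string & path) {
    toks.resize(llama_n_ctx(ctx));
    size_t n_loaded = 0;
    if (llama_state_seq_load_file(ctx, path.c_str(), seq,
                                  toks.data(), toks.size(), &n_loaded) == 0) {
        return false;
    }
    toks.resize(n_loaded);
    return true;
}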

2. Memory safety
	•	llama_memory_t is grabbed once at construction and never freed manually; llama_free(ctx_) takes care of it (ownership sketched below).
	•	No direct GGML graph surgery is done, so the code keeps working even with the new unified-KV flag (kv_unified) and the SWA cache tweaks introduced in early 2025.
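
The ownership pattern as a sketch. It assumes the context is created with llama_init_from_model() and the handle comes from llama_get_memory(), the accessor that arrived with the 2025 memory refactor; if your checkout differs, swap in whatever the adapter actually calls. Class and member names are illustrative.

// Sketch: the context owns the memory, we only hold a borrowed handle.
class StatefulLoop {
public:
    StatefulLoop(llama_model * model, llama_context_params cparams)
        : model_(model),
          ctx_(llama_init_from_model(model, cparams)),
          mem_(llama_get_memory(ctx_)) {}           // borrowed, never freed directly

    ~StatefulLoop() { llama_free(ctx_); }           // also releases the KV/memory state

    StatefulLoop(const StatefulLoop &) = delete;    // exactly one owner per context
    StatefulLoop & operator=(const StatefulLoop &) = delete;

private:
    llama_model   * model_ = nullptr;               // owned by the caller
    llama_context * ctx_   = nullptr;               // owned here
    llama_memory_t  mem_   = nullptr;               // view into ctx_, dies with it
};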

3. The logic-loop hook

You can subclass pmll::LoopHook to push every token to Graphiti / SQLite / a vector DB, run extra PMLL graph updates, or even call back into another LLM.
If the hook returns false, the outer loop aborts, leaving the snapshot on disk for the next run.
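
For example, a hook that mirrors every generated piece into an external store could look roughly like this. The exact pmll::LoopHook interface is defined by the adapter, so the on_token() signature below is an assumption based on the description above.

#include <string>

// Assumed interface: one callback per generated token; returning false aborts the loop.
class TranscriptHook : public pmll::LoopHook {
public:
    bool on_token(llama_token tok, const std::string & piece) override {
        (void) tok;
        transcript_ += piece;                   // push `piece` to Graphiti / SQLite / a vector DB here
        return transcript_.size() < 16 * 1024;  // false => outer loop stops, snapshot stays on disk
    }
private:
    std::string transcript_;
};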

4. Extensibility
	•	Because it talks only to the stable C API, the file is independent of any architecture plug-ins you might add per the “HOWTO-add-model” guide.
	•	The greedy sampler is deliberately trivial: replace sample_next() with your favourite top-k / top-p or grammar-guided sampler (see src/llama-sampling.cpp for ready-made helpers, and the sketch below).
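
As a concrete starting point, a drop-in replacement built on llama.cpp's sampler-chain C API might look like the sketch below. The chain composition and parameters are illustrative, and the chain should eventually be released with llama_sampler_free().

// Build the chain once, reuse it every step; samples from the last decoded token's logits.
llama_token sample_next_topk_topp(llama_context * ctx, llama_sampler *& chain) {
    if (chain == nullptr) {
        chain = llama_sampler_chain_init(llama_sampler_chain_default_params());
        llama_sampler_chain_add(chain, llama_sampler_init_top_k(40));
        llama_sampler_chain_add(chain, llama_sampler_init_top_p(0.95f, /*min_keep=*/1));
        llama_sampler_chain_add(chain, llama_sampler_init_temp(0.80f));
        llama_sampler_chain_add(chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));
    }
    return llama_sampler_sample(chain, ctx, /*idx=*/-1);
}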

⸻

Why this is safe & portable
	•	State-saving bugs (e.g. the llama_state_get_size mis-count) were fixed in PR ggml-org#13463, so head-of-master is fine.
	•	The KV-cache API stabilised after issue ggml-org#730, so the calls we use won’t disappear.
	•	External wrappers (e.g. llama-cpp-python) already relied on the same pattern, proving cross-platform viability.

⸻

Next steps
	1.	Encrypt snapshots if you handle private data—just AES-GCM the buffer before writing.
	2.	Delta-KV compression: only write the newly appended KV rows each step and compress them with Zstd (see the sketch after this list).
	3.	Graphiti bridge: inside your LoopHook, call the Graphiti “add_episode()” mutation you drafted earlier to keep the PMLL knowledge-graph in sync.
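
A first cut at step 2, assuming libzstd plus the llama_state_seq_get_size() / llama_state_seq_get_data() getters. It is not a true row-level delta yet: it skips the write when the sequence state has not grown since the last snapshot and compresses the whole blob otherwise. The helper name, compression level and path handling are illustrative.

#include <zstd.h>
#include <cstdint>
#include <cstdio>
#include <vector>

// Returns the number of compressed bytes written, or 0 if nothing was written.
size_t snapshot_compressed(llama_context * ctx, llama_seq_id seq,
                           size_t & last_size, const char * path) {
    const size_t sz = llama_state_seq_get_size(ctx, seq);
    if (sz <= last_size) return 0;                          // nothing appended this step

    std::vector<uint8_t> raw(sz);
    llama_state_seq_get_data(ctx, raw.data(), raw.size(), seq);

    std::vector<uint8_t> zbuf(ZSTD_compressBound(sz));
    const size_t zsz = ZSTD_compress(zbuf.data(), zbuf.size(),
                                     raw.data(), raw.size(), /*level=*/3);
    if (ZSTD_isError(zsz)) return 0;

    if (FILE * f = std::fopen(path, "wb")) {                // AES-GCM the buffer here for step 1
        std::fwrite(zbuf.data(), 1, zsz, f);
        std::fclose(f);
        last_size = sz;
        return zsz;
    }
    return 0;
}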

Happy looping — your Llama is now stateful. 🐑🔄
@am17an requested a review from Copilot on July 31, 2025, 09:16

Copilot AI left a comment

Pull Request Overview

This PR introduces a persistent memory logic loop adapter for llama.cpp that enables stateful LLM interactions by automatically saving and restoring context state between generation steps. The implementation provides a C++ wrapper around llama.cpp's state management APIs to create resumable conversation sessions.

  • Implements automatic state persistence after each token generation step
  • Provides extensible hook system for custom logic injection during generation
  • Wraps llama.cpp's low-level state save/restore APIs in a user-friendly interface

std::string tokens_to_str(const std::vector<llama_token>& t) {
    std::string s;
    for (auto tok : t) {
        char buf[8];

Copilot AI Jul 31, 2025

The buffer size of 8 bytes is insufficient for token-to-string conversion. Some tokens can produce UTF-8 sequences longer than 8 bytes, which could lead to truncated output or buffer overflow. Consider using a larger buffer size (e.g., 32 or 64 bytes) or dynamically allocating based on the token.
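
For reference, a conversion that sizes the buffer from the API itself could look like the sketch below. It assumes the current llama_token_to_piece() signature (vocab pointer first, plus lstrip/special arguments), which has changed across llama.cpp versions, so check it against your checkout.

#include <string>
#include <vector>

// Convert one token to text, growing the buffer if the first attempt truncates.
static std::string token_to_piece(const llama_vocab * vocab, llama_token tok) {
    std::vector<char> buf(64);
    int32_t n = llama_token_to_piece(vocab, tok, buf.data(), (int32_t) buf.size(),
                                     /*lstrip=*/0, /*special=*/true);
    if (n < 0) {                       // negative return = required buffer size
        buf.resize((size_t) -n);
        n = llama_token_to_piece(vocab, tok, buf.data(), (int32_t) buf.size(), 0, true);
    }
    return n > 0 ? std::string(buf.data(), (size_t) n) : std::string();
}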

private:
    llama_token sample_next() {
        const float* logits = llama_get_logits(ctx_);
        int n_vocab = llama_n_vocab(llama_model_get_vocab(model_));

Copilot AI Jul 31, 2025

The function llama_model_get_vocab() appears to be an incorrect API usage. Based on llama.cpp's API, this should likely be llama_n_vocab(model_) directly, as llama_n_vocab typically takes the model pointer, not a vocab object.

for (int i = 0; i < n; ++i) {
    batch.token[i] = tokens[i];
    batch.pos[i] = i;
    batch.seq_id[i] = &seq;

Copilot AI Jul 31, 2025

Taking the address of the seq parameter is incorrect. The seq_id field expects an array of sequence IDs, not a pointer to the seq variable. This should be batch.seq_id[i] = seq; and the seq_id array should be properly allocated.

Suggested change:
-   batch.seq_id[i] = &seq;
+   batch.seq_id[i] = seq;
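
For context, llama_batch as declared in llama.h keeps a per-token array of sequence IDs: seq_id[i] points at a small array allocated by llama_batch_init() and n_seq_id[i] records how many entries are used. A fill consistent with that layout (a sketch, not the PR's code) would be:

for (int i = 0; i < n; ++i) {
    batch.token[i]     = tokens[i];
    batch.pos[i]       = i;
    batch.n_seq_id[i]  = 1;
    batch.seq_id[i][0] = seq;          // the inner array comes from llama_batch_init
    batch.logits[i]    = (i == n - 1); // only the last token needs logits
}
batch.n_tokens = n;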

llama_batch b1 = llama_batch_init(1, 0, 1);
b1.token[0] = sample_next(); // greedy / top-k
b1.pos[0] = tokens.size() + step;
b1.seq_id[0] = &seq;

Copilot AI Jul 31, 2025

Same issue as line 75 - taking the address of seq is incorrect. This should be b1.seq_id[0] = seq; and proper sequence ID array handling should be implemented.

Suggested change:
-   b1.seq_id[0] = &seq;
+   b1.seq_id[0] = seq;

LoopHook* user_hook_;
llama_model* model_ = nullptr;
llama_context* ctx_ = nullptr;
llama_memory_t mem_ = nullptr;

Copilot AI Jul 31, 2025

The mem_ member variable is assigned in the constructor but never used elsewhere in the class. Consider removing it if it's not needed, or document its intended purpose if it will be used in future functionality.

Suggested change:
-   llama_memory_t mem_ = nullptr;
+   // Removed unused mem_ member variable.

tokens.data(), tokens.size(), true, true);
tokens.resize(n);

llama_batch batch = llama_batch_init(n, 0, 1);

Copilot AI Jul 31, 2025

The batch allocation and deallocation happens multiple times in the generation loop. Consider reusing batch objects or allocating them once outside the loop to reduce memory allocation overhead during token generation.
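
One way to act on this, sketched with the standard llama_batch fields and illustrative variable names (next_token, n_past, max_new_tokens and seq are assumed to exist in the surrounding loop), is to allocate a single one-token batch before the loop and reuse it every step:

// Allocate once, reuse for every generated token, free after the loop.
llama_batch step_batch = llama_batch_init(/*n_tokens=*/1, /*embd=*/0, /*n_seq_max=*/1);
for (int step = 0; step < max_new_tokens; ++step) {
    step_batch.n_tokens     = 1;
    step_batch.token[0]     = next_token;     // produced by the sampler
    step_batch.pos[0]       = n_past + step;  // running position in the context
    step_batch.n_seq_id[0]  = 1;
    step_batch.seq_id[0][0] = seq;
    step_batch.logits[0]    = true;
    if (llama_decode(ctx, step_batch) != 0) break;
    next_token = sample_next();               // pick the next token from the new logits
}
llama_batch_free(step_batch);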

@slaren closed this on Jul 31, 2025