@drQedwards

How it works

1. State I/O
	•	The llama.cpp C API exposes whole-context save/restore (llama_state_save_file, llama_state_load_file) and per-sequence variants (llama_state_seq_*).
	•	The official example examples/save-load-state.cpp shows the minimal pattern.
	•	We wrap that in persist() / restore() so each sequence’s KV-cache, RoPE shifts, logits & RNG state can be serialized to disk automatically after every decode step (see the sketch below).
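
A minimal sketch of what persist() / restore() boil down to, built on the per-sequence calls named above. Treat it as a sketch: the llama_state_seq_* signatures and return-value conventions should be checked against your llama.h, and the surrounding names are illustrative.

#include <string>
#include <vector>
#include "llama.h"

// Save one sequence's KV state plus the tokens needed to re-seed it on restore.
bool persist(llama_context * ctx, llama_seq_id seq,
             const std::vector<llama_token> & toks, const std::string & path) {
    return llama_state_seq_save_file(ctx, path.c_str(), seq,
                                     toks.data(), toks.size()) > 0;
}

// Returns false when no usable snapshot exists, e.g. on the very first run.
bool restore(llama_context * ctx, llama_seq_id seq,
             std::vector<llama_token> & toks, const std::string & path) {
    toks.resize(llama_n_ctx(ctx));
    size_t n_loaded = 0;
    if (llama_state_seq_load_file(ctx, path.c_str(), seq,
                                  toks.data(), toks.size(), &n_loaded) == 0) {
        return false;
    }
    toks.resize(n_loaded);
    return true;
}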

2. Memory safety
	•	llama_memory_t is grabbed once at construction and never freed manually; llama_free(ctx_) takes care of it (ownership sketched below).
	•	No direct GGML graph surgery is done, so the code keeps working even with the new unified-KV flag (kv_unified) and the SWA cache tweaks introduced in early 2025.
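
The ownership pattern as a sketch. It assumes the context is created with llama_init_from_model() and the handle comes from llama_get_memory(), the accessor that arrived with the 2025 memory refactor; if your checkout differs, swap in whatever the adapter actually calls. Class and member names are illustrative.

// Sketch: the context owns the memory, we only hold a borrowed handle.
class StatefulLoop {
public:
    StatefulLoop(llama_model * model, llama_context_params cparams)
        : model_(model),
          ctx_(llama_init_from_model(model, cparams)),
          mem_(llama_get_memory(ctx_)) {}           // borrowed, never freed directly

    ~StatefulLoop() { llama_free(ctx_); }           // also releases the KV/memory state

    StatefulLoop(const StatefulLoop &) = delete;    // exactly one owner per context
    StatefulLoop & operator=(const StatefulLoop &) = delete;

private:
    llama_model   * model_ = nullptr;               // owned by the caller
    llama_context * ctx_   = nullptr;               // owned here
    llama_memory_t  mem_   = nullptr;               // view into ctx_, dies with it
};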

3. The logic-loop hook

You can subclass pmll::LoopHook to push every token to Graphiti / SQLite / a vector DB, run extra PMLL graph updates, or even call back into another LLM.
If the hook returns false, the outer loop aborts, leaving the snapshot on disk for the next run.
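
For example, a hook that mirrors every generated piece into an external store could look roughly like this. The exact pmll::LoopHook interface is defined by the adapter, so the on_token() signature below is an assumption based on the description above.

#include <string>

// Assumed interface: one callback per generated token; returning false aborts the loop.
class TranscriptHook : public pmll::LoopHook {
public:
    bool on_token(llama_token tok, const std::string & piece) override {
        (void) tok;
        transcript_ += piece;                   // push `piece` to Graphiti / SQLite / a vector DB here
        return transcript_.size() < 16 * 1024;  // false => outer loop stops, snapshot stays on disk
    }
private:
    std::string transcript_;
};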

4. Extensibility
	•	Because it talks only to the stable C API, the file is independent of any architecture plug-ins you might add per the “HOWTO-add-model” guide.
	•	The greedy sampler is deliberately trivial: replace sample_next() with your favourite top-k / top-p or grammar-guided sampler (see src/llama-sampling.cpp for ready-made helpers, and the sketch below).
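
As a concrete starting point, a drop-in replacement built on llama.cpp's sampler-chain C API might look like the sketch below. The chain composition and parameters are illustrative, and the chain should eventually be released with llama_sampler_free().

// Build the chain once, reuse it every step; samples from the last decoded token's logits.
llama_token sample_next_topk_topp(llama_context * ctx, llama_sampler *& chain) {
    if (chain == nullptr) {
        chain = llama_sampler_chain_init(llama_sampler_chain_default_params());
        llama_sampler_chain_add(chain, llama_sampler_init_top_k(40));
        llama_sampler_chain_add(chain, llama_sampler_init_top_p(0.95f, /*min_keep=*/1));
        llama_sampler_chain_add(chain, llama_sampler_init_temp(0.80f));
        llama_sampler_chain_add(chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));
    }
    return llama_sampler_sample(chain, ctx, /*idx=*/-1);
}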

⸻

Why this is safe & portable
	•	State-saving bugs (e.g. the llama_state_get_size mis-count) were fixed in PR ggml-org#13463, so head-of-master is fine.
	•	The KV-cache API stabilised after issue ggml-org#730, so the calls we use won’t disappear.
	•	External wrappers (e.g. llama-cpp-python) already relied on the same pattern, proving cross-platform viability.

⸻

Next steps
	1.	Encrypt snapshots if you handle private data—just AES-GCM the buffer before writing.
	2.	Delta-KV compression: only write the newly appended KV rows each step and compress them with Zstd (see the sketch after this list).
	3.	Graphiti bridge: inside your LoopHook, call the Graphiti “add_episode()” mutation you drafted earlier to keep the PMLL knowledge-graph in sync.
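
A first cut at step 2, assuming libzstd plus the llama_state_seq_get_size() / llama_state_seq_get_data() getters. It is not a true row-level delta yet: it skips the write when the sequence state has not grown since the last snapshot and compresses the whole blob otherwise. The helper name, compression level and path handling are illustrative.

#include <zstd.h>
#include <cstdint>
#include <cstdio>
#include <vector>

// Returns the number of compressed bytes written, or 0 if nothing was written.
size_t snapshot_compressed(llama_context * ctx, llama_seq_id seq,
                           size_t & last_size, const char * path) {
    const size_t sz = llama_state_seq_get_size(ctx, seq);
    if (sz <= last_size) return 0;                          // nothing appended this step

    std::vector<uint8_t> raw(sz);
    llama_state_seq_get_data(ctx, raw.data(), raw.size(), seq);

    std::vector<uint8_t> zbuf(ZSTD_compressBound(sz));
    const size_t zsz = ZSTD_compress(zbuf.data(), zbuf.size(),
                                     raw.data(), raw.size(), /*level=*/3);
    if (ZSTD_isError(zsz)) return 0;

    if (FILE * f = std::fopen(path, "wb")) {                // AES-GCM the buffer here for step 1
        std::fwrite(zbuf.data(), 1, zsz, f);
        std::fclose(f);
        last_size = sz;
        return zsz;
    }
    return 0;
}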

Happy looping — your Llama is now stateful. 🐑🔄
@am17an requested a review from Copilot on July 31, 2025, 09:16

Copilot AI left a comment

Pull Request Overview

This PR introduces a persistent memory logic loop adapter for llama.cpp that enables stateful LLM interactions by automatically saving and restoring context state between generation steps. The implementation provides a C++ wrapper around llama.cpp's state management APIs to create resumable conversation sessions.

  • Implements automatic state persistence after each token generation step
  • Provides extensible hook system for custom logic injection during generation
  • Wraps llama.cpp's low-level state save/restore APIs in a user-friendly interface

std::string tokens_to_str(const std::vector<llama_token>& t) {
    std::string s;
    for (auto tok : t) {
        char buf[8];

Copilot AI Jul 31, 2025

The buffer size of 8 bytes is insufficient for token-to-string conversion. Some tokens can produce UTF-8 sequences longer than 8 bytes, which could lead to truncated output or buffer overflow. Consider using a larger buffer size (e.g., 32 or 64 bytes) or dynamically allocating based on the token.
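
For reference, a conversion that sizes the buffer from the API itself could look like the sketch below. It assumes the current llama_token_to_piece() signature (vocab pointer first, plus lstrip/special arguments), which has changed across llama.cpp versions, so check it against your checkout.

#include <string>
#include <vector>

// Convert one token to text, growing the buffer if the first attempt truncates.
static std::string token_to_piece(const llama_vocab * vocab, llama_token tok) {
    std::vector<char> buf(64);
    int32_t n = llama_token_to_piece(vocab, tok, buf.data(), (int32_t) buf.size(),
                                     /*lstrip=*/0, /*special=*/true);
    if (n < 0) {                       // negative return = required buffer size
        buf.resize((size_t) -n);
        n = llama_token_to_piece(vocab, tok, buf.data(), (int32_t) buf.size(), 0, true);
    }
    return n > 0 ? std::string(buf.data(), (size_t) n) : std::string();
}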

private:
    llama_token sample_next() {
        const float* logits = llama_get_logits(ctx_);
        int n_vocab = llama_n_vocab(llama_model_get_vocab(model_));

Copilot AI Jul 31, 2025

The function llama_model_get_vocab() appears to be an incorrect API usage. Based on llama.cpp's API, this should likely be llama_n_vocab(model_) directly, as llama_n_vocab typically takes the model pointer, not a vocab object.

for (int i = 0; i < n; ++i) {
    batch.token[i] = tokens[i];
    batch.pos[i] = i;
    batch.seq_id[i] = &seq;

Copilot AI Jul 31, 2025

Taking the address of the seq parameter is incorrect. The seq_id field expects an array of sequence IDs, not a pointer to the seq variable. This should be batch.seq_id[i] = seq; and the seq_id array should be properly allocated.

Suggested change:
-   batch.seq_id[i] = &seq;
+   batch.seq_id[i] = seq;
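
For context, llama_batch as declared in llama.h keeps a per-token array of sequence IDs: seq_id[i] points at a small array allocated by llama_batch_init() and n_seq_id[i] records how many entries are used. A fill consistent with that layout (a sketch, not the PR's code) would be:

for (int i = 0; i < n; ++i) {
    batch.token[i]     = tokens[i];
    batch.pos[i]       = i;
    batch.n_seq_id[i]  = 1;
    batch.seq_id[i][0] = seq;          // the inner array comes from llama_batch_init
    batch.logits[i]    = (i == n - 1); // only the last token needs logits
}
batch.n_tokens = n;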

llama_batch b1 = llama_batch_init(1, 0, 1);
b1.token[0] = sample_next(); // greedy / top-k
b1.pos[0] = tokens.size() + step;
b1.seq_id[0] = &seq;

Copilot AI Jul 31, 2025

Same issue as line 75 - taking the address of seq is incorrect. This should be b1.seq_id[0] = seq; and proper sequence ID array handling should be implemented.

Suggested change:
-   b1.seq_id[0] = &seq;
+   b1.seq_id[0] = seq;

LoopHook* user_hook_;
llama_model* model_ = nullptr;
llama_context* ctx_ = nullptr;
llama_memory_t mem_ = nullptr;

Copilot AI Jul 31, 2025

The mem_ member variable is assigned in the constructor but never used elsewhere in the class. Consider removing it if it's not needed, or document its intended purpose if it will be used in future functionality.

Suggested change:
-   llama_memory_t mem_ = nullptr;
+   // Removed unused mem_ member variable.

tokens.data(), tokens.size(), true, true);
tokens.resize(n);

llama_batch batch = llama_batch_init(n, 0, 1);

Copilot AI Jul 31, 2025

The batch allocation and deallocation happens multiple times in the generation loop. Consider reusing batch objects or allocating them once outside the loop to reduce memory allocation overhead during token generation.
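
One way to act on this, sketched with the standard llama_batch fields and illustrative variable names (next_token, n_past, max_new_tokens and seq are assumed to exist in the surrounding loop), is to allocate a single one-token batch before the loop and reuse it every step:

// Allocate once, reuse for every generated token, free after the loop.
llama_batch step_batch = llama_batch_init(/*n_tokens=*/1, /*embd=*/0, /*n_seq_max=*/1);
for (int step = 0; step < max_new_tokens; ++step) {
    step_batch.n_tokens     = 1;
    step_batch.token[0]     = next_token;     // produced by the sampler
    step_batch.pos[0]       = n_past + step;  // running position in the context
    step_batch.n_seq_id[0]  = 1;
    step_batch.seq_id[0][0] = seq;
    step_batch.logits[0]    = true;
    if (llama_decode(ctx, step_batch) != 0) break;
    next_token = sample_next();               // pick the next token from the new logits
}
llama_batch_free(step_batch);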

@slaren closed this on Jul 31, 2025