Implementing RAG, some questions on llama.cpp #12125
Replies: 3 comments 2 replies
-
You’re definitely heading in the right direction: using pgvector for embedding search is solid, and the fact that retrieval is already working puts you ahead of many setups. But the tricky part is exactly where you are now: how to inject retrieved chunks into the prompt without breaking logic or hallucinating. From what I’ve seen helping others debug similar pipelines, your case likely falls under one of a few recurring failure modes.
I maintain a full diagnostic map of 16 such issues based on real-world RAG systems. If you’re interested, I can share the map and some background-safe injection strategies I’ve tested. Also, I work on a terminal-native reasoning engine (MIT license, backed by the creator of Tesseract.js) that handles semantic injection with fine-grained control; happy to share if helpful.
-
Hi @gnusupport, thanks for sharing your setup and scripts. From your description (embeddings + retrieval working, but injected chunks sometimes shifting context or getting misinterpreted), it sounds exactly like Problem No. 2: Interpretation Collapse in our diagnostic map. In practice, even valid chunks can be skipped or mangled if the injection timing or formatting isn’t handled as “symbolic logic” rather than raw text. You might also bump into Problem No. 1: Chunk Drift (when semantic boundaries aren’t strictly enforced) and Problem No. 14: Bootstrap Ordering (if the prompt context isn’t stabilized before you inject). I’ve put together fixes and background-safe injection strategies for all of these in our MIT-licensed problem map; feel free to dive into each entry there. Let me know which symptom you’re seeing most, and I can point you to the exact workaround. Cheers!
-
Hey, thanks. I really appreciate your thoughtful reply, and it’s great to see someone genuinely engaging with how embeddings and retrieval interact with human-level context. The “context shift” I described is indeed a human-observed effect: it happens when retrieved chunks technically match via vector similarity but semantically land just outside the intended reasoning scope, like injecting a quote into an argument one paragraph too early. The system doesn’t crash, but the meaning slowly drifts. I’d be happy to walk through a concrete example if you’re curious, especially if you want to try out what we call “diagnosable cognition” (tracking how each chunk influenced the answer over time). And haha, as for software: curious what you meant there? Were you asking about what stack I use, or something else? Thanks again, great discussion.
-
While working with Emacs Lisp, I have so far implemented text splitting into chunks and generating embeddings with the embeddings model.
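For concreteness, the embedding step can be driven from Emacs Lisp roughly like this. This is only a minimal sketch, assuming llama-server was started with `--embedding` and is listening on http://localhost:8080, using its OpenAI-compatible `/v1/embeddings` endpoint; the function name and port are placeholders:

```elisp
;; Minimal sketch: fetch the embedding vector for one text chunk from a local
;; llama-server.  Assumes the server was started with --embedding and listens
;; on port 8080; the route is the OpenAI-compatible /v1/embeddings endpoint.
(require 'url)
(require 'json)

(defun my/llama-embed (text)
  "Return the embedding for TEXT as a vector of floats."
  (let ((url-request-method "POST")
        (url-request-extra-headers '(("Content-Type" . "application/json")))
        (url-request-data
         (encode-coding-string (json-encode `((input . ,text))) 'utf-8)))
    (with-current-buffer
        (url-retrieve-synchronously "http://localhost:8080/v1/embeddings")
      (goto-char url-http-end-of-headers)
      (let ((response (json-read)))
        ;; OpenAI-style reply: {"data": [{"embedding": [...], ...}], ...}
        (alist-get 'embedding (aref (alist-get 'data response) 0))))))
```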
That part works fine, and the embeddings are recorded in a PostgreSQL database with the pgvector extension. Searching by embeddings also works well.
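For reference, the similarity search boils down to a query like the one built below. This is a sketch only: the `chunks` table and `embedding`/`content` column names are made up, and `<=>` is pgvector's cosine-distance operator. How the query is executed depends on the PostgreSQL client in use (pg.el, emacsql, or shelling out to psql):

```elisp
;; Illustrative only: table and column names are placeholders.
;; pgvector's `<=>` operator is cosine distance, so ORDER BY ... LIMIT 5
;; returns the five chunks closest to the query embedding.
(defun my/nearest-chunks-sql (query-embedding)
  "Build a pgvector similarity query for QUERY-EMBEDDING, a sequence of floats."
  (format "SELECT content
             FROM chunks
            ORDER BY embedding <=> '[%s]'
            LIMIT 5;"
          (mapconcat #'number-to-string (append query-embedding nil) ",")))
```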
I can quickly implement listing of documents, people, or whatever else I am searching for; that part is handled by the PostgreSQL database.
And then, according to what I have learned, I am supposed to insert that retrieved information into the context of the LLM prompt in order to get the RAG functionality.
Sure, I have some idea of how to do it with curl, or from Emacs Lisp over the API endpoint.
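For example, here is a rough sketch of that kind of call from Emacs Lisp, assuming llama-server is running on http://localhost:8080 and exposing the OpenAI-compatible `/v1/chat/completions` endpoint; the function name, port, and system-prompt wording are placeholders, and the chunks are whatever the pgvector search returned:

```elisp
;; Rough sketch: prepend the retrieved chunks to the prompt as a system
;; message, send the question, and return the model's answer text.
(require 'url)
(require 'json)

(defun my/rag-ask (question chunks)
  "Ask QUESTION with the list of retrieved CHUNKS injected as context."
  (let* ((context (mapconcat #'identity chunks "\n---\n"))
         (url-request-method "POST")
         (url-request-extra-headers '(("Content-Type" . "application/json")))
         (url-request-data
          (encode-coding-string
           (json-encode
            `((messages . [((role . "system")
                            (content . ,(concat "Answer using only this context:\n\n"
                                                context)))
                           ((role . "user")
                            (content . ,question))])
              (temperature . 0.2)))
           'utf-8)))
    (with-current-buffer
        (url-retrieve-synchronously "http://localhost:8080/v1/chat/completions")
      (goto-char url-http-end-of-headers)
      (let ((response (json-read)))
        (alist-get 'content
                   (alist-get 'message
                              (aref (alist-get 'choices response) 0)))))))
```

Calling, say, `(my/rag-ask "Who maintains this package?" (list chunk-1 chunk-2))` returns the answer as a plain string; the same JSON payload can be sent with curl as well.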
What I would like to know, though, is whether there is, or could be, any way of implementing this in the background, so that I can somehow inject the retrieved text and get the responses over the llama.cpp web UI?