Implementing RAG, some questions on llama.cpp #12125
Replies: 3 comments 2 replies
-
You’re definitely heading in the right direction: using pgvector for embedding search is solid, and the fact that retrieval is already working puts you ahead of many setups. But the tricky part is exactly where you are now: how to inject retrieved chunks into the prompt without breaking logic or hallucinating. From what I’ve seen helping others debug similar pipelines, your case likely falls under one of a few recurring failure modes.
I maintain a full diagnostic map of 16 such issues based on real-world RAG systems. If you’re interested, I can share the map and some background-safe injection strategies I’ve tested. Also, I work on a terminal-native reasoning engine (MIT license, backed by the creator of Tesseract.js) that handles semantic injection with fine-grained control; happy to share if helpful.
-
Hi @gnusupport, thanks for sharing your setup and scripts. From your description (embeddings + retrieval working, but injected chunks sometimes shifting context or getting misinterpreted), it sounds exactly like Problem No. 2: Interpretation Collapse in our diagnostic map. In practice, even valid chunks can be skipped or mangled if the injection timing or formatting isn’t handled as “symbolic logic” rather than raw text. You might also bump into Problem No. 1: Chunk Drift (when semantic boundaries aren’t strictly enforced) and Problem No. 14: Bootstrap Ordering (if the prompt context isn’t stabilized before you inject). I’ve put together fixes and background-safe injection strategies for all of these in our MIT-licensed problem map; feel free to dive into each entry there. Let me know which symptom you’re seeing most, and I can point you to the exact workaround. Cheers!
-
Hey, thanks. I really appreciate your thoughtful reply, and it’s great to see someone genuinely engaging with how embeddings and retrieval interact with human-level context. The “context shift” I described is indeed a human-observed effect: it happens when retrieved chunks technically match via vector similarity but semantically land just outside the intended reasoning scope, like injecting a quote into an argument one paragraph too early. The system doesn’t crash, but the meaning slowly drifts. I’d be happy to walk through a concrete example if you’re curious, especially if you want to try out what we call “diagnosable cognition” (tracking how each chunk influenced the answer over time). And haha, as for software: curious what you meant there? Were you asking about what stack I use, or something else? Thanks again, great discussion.
-
While working with Emacs Lisp, I have so far implemented text splitting into chunks and generating embeddings with the embeddings model.
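For concreteness, the embedding step can be driven from Emacs Lisp roughly like this. This is only a minimal sketch, assuming llama-server was started with `--embedding` and is listening on http://localhost:8080, using its OpenAI-compatible `/v1/embeddings` endpoint; the function name and port are placeholders:

```elisp
;; Minimal sketch: fetch the embedding vector for one text chunk from a local
;; llama-server.  Assumes the server was started with --embedding and listens
;; on port 8080; the route is the OpenAI-compatible /v1/embeddings endpoint.
(require 'url)
(require 'json)

(defun my/llama-embed (text)
  "Return the embedding for TEXT as a vector of floats."
  (let ((url-request-method "POST")
        (url-request-extra-headers '(("Content-Type" . "application/json")))
        (url-request-data
         (encode-coding-string (json-encode `((input . ,text))) 'utf-8)))
    (with-current-buffer
        (url-retrieve-synchronously "http://localhost:8080/v1/embeddings")
      (goto-char url-http-end-of-headers)
      (let ((response (json-read)))
        ;; OpenAI-style reply: {"data": [{"embedding": [...], ...}], ...}
        (alist-get 'embedding (aref (alist-get 'data response) 0))))))
```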
That part works fine, and the embeddings are recorded in a PostgreSQL database with the pgvector extension. Searching by embeddings also works well.
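For reference, the similarity search boils down to a query like the one built below. This is a sketch only: the `chunks` table and `embedding`/`content` column names are made up, and `<=>` is pgvector's cosine-distance operator. How the query is executed depends on the PostgreSQL client in use (pg.el, emacsql, or shelling out to psql):

```elisp
;; Illustrative only: table and column names are placeholders.
;; pgvector's `<=>` operator is cosine distance, so ORDER BY ... LIMIT 5
;; returns the five chunks closest to the query embedding.
(defun my/nearest-chunks-sql (query-embedding)
  "Build a pgvector similarity query for QUERY-EMBEDDING, a sequence of floats."
  (format "SELECT content
             FROM chunks
            ORDER BY embedding <=> '[%s]'
            LIMIT 5;"
          (mapconcat #'number-to-string (append query-embedding nil) ",")))
```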
I can quickly implement listing of documents, people, or whatever else I am searching for; that part is handled by the PostgreSQL database.
And then, according to what I have learned, I am supposed to insert that retrieved information into the context of the LLM prompt in order to get the RAG functionality.
Sure, I have some idea of how to do it with curl, or from Emacs Lisp over the API endpoint.
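For example, here is a rough sketch of that kind of call from Emacs Lisp, assuming llama-server is running on http://localhost:8080 and exposing the OpenAI-compatible `/v1/chat/completions` endpoint; the function name, port, and system-prompt wording are placeholders, and the chunks are whatever the pgvector search returned:

```elisp
;; Rough sketch: prepend the retrieved chunks to the prompt as a system
;; message, send the question, and return the model's answer text.
(require 'url)
(require 'json)

(defun my/rag-ask (question chunks)
  "Ask QUESTION with the list of retrieved CHUNKS injected as context."
  (let* ((context (mapconcat #'identity chunks "\n---\n"))
         (url-request-method "POST")
         (url-request-extra-headers '(("Content-Type" . "application/json")))
         (url-request-data
          (encode-coding-string
           (json-encode
            `((messages . [((role . "system")
                            (content . ,(concat "Answer using only this context:\n\n"
                                                context)))
                           ((role . "user")
                            (content . ,question))])
              (temperature . 0.2)))
           'utf-8)))
    (with-current-buffer
        (url-retrieve-synchronously "http://localhost:8080/v1/chat/completions")
      (goto-char url-http-end-of-headers)
      (let ((response (json-read)))
        (alist-get 'content
                   (alist-get 'message
                              (aref (alist-get 'choices response) 0)))))))
```

Calling, say, `(my/rag-ask "Who maintains this package?" (list chunk-1 chunk-2))` returns the answer as a plain string; the same JSON payload can be sent with curl as well.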
What I would like to know, though, is whether there is, or could be, any way of implementing this in the background, so that I can somehow inject the retrieved text and get the responses over the llama.cpp web UI?