From 736f0c8ac9d2a48b60849cdec71feca2472e1d24 Mon Sep 17 00:00:00 2001
From: Liam Broza
Date: Thu, 5 Feb 2026 22:36:52 -0800
Subject: [PATCH 1/2] Refresh docs and add agent guide

---
 AGENTS.md                         | 29 +++++++++++++++++++++++++++++
 README.md                         | 20 ++++++++++----------
 docs/1100-questions.md            |  2 +-
 docs/1300-categorization.md       |  4 ++--
 docs/1400-sources.md              |  2 ++
 docs/1500-hardware.md             |  2 +-
 docs/1600-architecture.md         | 16 ++++------------
 docs/MODEL_CONFIGURATION.md       | 21 ++++++++++++---------
 public/sources/300_Food/README.md |  2 ++
 public/sources/README.md          | 16 +++++++++-------
 10 files changed, 72 insertions(+), 42 deletions(-)
 create mode 100644 AGENTS.md
 create mode 100644 public/sources/300_Food/README.md

diff --git a/AGENTS.md b/AGENTS.md
new file mode 100644
index 0000000..de5d477
--- /dev/null
+++ b/AGENTS.md
@@ -0,0 +1,29 @@
+# Codex Agent Guide
+
+## Repo Summary
+- Offline emergency knowledge assistant built in C++ with llama.cpp.
+- Docker Compose runs `tika`, `jic` (server), and `ingestion`.
+- Documents live under `public/sources` and are ingested into a vector index.
+
+## Key Paths
+- `docker-compose.yml`: primary model configuration and service wiring.
+- `src/`: C++ server and ingestion pipeline.
+- `public/sources/`: PDFs and other content to ingest.
+- `helper-scripts/`: scripts for model fetching and basic testing.
+
+## Local Setup
+1. Install Docker + Docker Compose.
+2. Download GGUF files into `./gguf_models/` (see `helper-scripts/fetch-models.sh`).
+3. Run: `docker compose up --build`.
+
+## Tests (Local)
+- Config sanity: `./helper-scripts/test-config.sh`
+- Server smoke test (server must be running): `./helper-scripts/test-server.sh`
+
+## Remote / CI Guidance
+- Minimal CI should run `./helper-scripts/test-config.sh`.
+- Optional CI step (if services can be started): `docker compose up -d` then `./helper-scripts/test-server.sh`, then `docker compose down`.
+
+## Data Notes
+- Keep licensing and provenance clear for all added sources.
+- Prefer public-domain or explicitly redistributable materials.
diff --git a/README.md b/README.md
index b7944f7..15cc51b 100644
--- a/README.md
+++ b/README.md
@@ -5,9 +5,7 @@
 
 ### [Learn More - CompanionIntelligence.com/JIC](https://companionintelligence.com/JIC)
 
-If the Internet goes dark, you should still be able to quickly find the knowledge that can help you survive and thrive during a crisis. The goal of the JIC project is to deliver a self-contained LLM-powered / AI driven conversational search engine over a curated set of emergency survival pdfs that can be used without the Internet.
-
-This includes survival guides, medical references, even agricultural know-how and engineering resources, as well as broader educational materials like offline Wikipedia, open textbooks, art and humanities. If the Internet goes dark, you should still be able to quickly find the knowledge that can help you survive and thrive during a crisis.
+If the Internet goes dark, you should still be able to quickly find the knowledge that can help you survive and thrive. JIC aims to deliver a self-contained LLM-powered conversational search engine over curated emergency PDFs that can be used fully offline. The corpus includes survival guides, medical references, agricultural know-how, engineering resources, and broader educational materials like offline Wikipedia and open textbooks.
 
 Please feel free to join us. This is a work in progress and we welcome your participation.
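To make the offline claim concrete, a first query against a running stack might look like the sketch below. The `/api/chat` route, port `8080`, and JSON shape are illustrative assumptions, not confirmed endpoints; `helper-scripts/test-server.sh` exercises the real route:

```bash
# Bring the stack up (GGUF models must already be in ./gguf_models/).
docker compose up -d --build

# Ask a question of the local, offline server. The endpoint path and
# payload below are assumptions for illustration only.
curl -s http://localhost:8080/api/chat \
  -H 'Content-Type: application/json' \
  -d '{"question": "How do I treat a second-degree burn?"}'
```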
@@ -27,7 +25,7 @@ The problem is that these services are often cloud based, and not always at our
 
 The fact is that there's a real difference between the difficulty of trying to read through a book on how to treat a burn during an emergency, versus getting some quick help or counsel, conversationally, right away, or at least getting some quick guidance on where to look for more details. What would you do if the internet went down? Or even just an extended power outage? What is your family's plan for region-specific threats such as tornadoes, tsunamis, or forest fires? Many of us have some kind of plan: a flashlight in a drawer, extra food supplies, water, cash, a map of community resources, a muster-point.
 
-The world has changed, we now heavily rely on tools such as ChatGPT, Claude, Google and other online resources. Even for small questions such as "how do you wire up rechargeable batteries to a solar panel?" or "what is the probable cause of a lower right side stomach pain?". The thing most of us rely heavily on information itself, and that information is not always at our fingertips.
+The world has changed, and we now rely on these tools for rapid answers. The problem is that information itself is not always at our fingertips when connectivity fails.
 
 Validating a tool like this raises many questions. Who are typical users of the dataset? What are typical scenarios? Can we build a list of typical questions a user may ask of the dataset? Can we have regression tests against the ability of the dataset to resolve the queries? Are there differences in what is needed for short, medium or extended emergencies or extended survival situations? In this ongoing project we'll try to tackle these and improve this over time.
 
@@ -77,14 +75,14 @@ We've tried a variety of approaches, ollama, python, n8n - our current stack is
 
 | Component         | Role                                             |
 |-------------------|--------------------------------------------------|
-| 🧠 `Llama.cpp`    | LLM loader (e.g. `llama3`)                       |
+| 🧠 `Llama.cpp`    | LLM loader (GGUF models)                         |
 | šŸ“„ `Apache Tika`  | PDF-to-text extractor                            |
-| šŸ” `FAISS`        | Vector search over parsed PDF chunks             |
+| šŸ” `SimpleVectorIndex` | In-memory vector search over parsed chunks  |
 | 🌐 `C++ Server`   | Simple API + minimal HTML frontend               |
 
 Note we may shift a few pieces around here - may move to pgvector for example.
 
-We're thinking of these engines for chewing through the context (the pdfs) - basically presenting each pdf page (generated with Tika) to qwen2.5-vl. Using a smaller model to be (high end) laptop friendly:
+We're thinking of these engines for chewing through the context (the PDFs). The current configuration uses `qwen2.5-vl:7b` with a matching `mmproj` file. Using a smaller model keeps things high-end laptop friendly:
 
 https://huggingface.co/ggml-org/Qwen2.5-VL-7B-Instruct-GGUF
 
@@ -115,7 +113,7 @@ find Survival-Data/HOME -type f -iname "*.pdf" -exec cp {} sources/ \;
 
 ### 2. Prepare models
 
-**Important:** Our setup of Docker will avoid downloading models from the internet. You must prepare them locally first.
+**Important:** The Docker setup avoids downloading models from the internet. You must prepare them locally first.
 
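Preparing models locally usually means pulling the GGUF files onto the host once, while connected. A minimal sketch using `huggingface-cli`, with repo and file names taken from the defaults in `docs/MODEL_CONFIGURATION.md` (verify the exact names on the HuggingFace pages before relying on them):

```bash
# One-time download on a connected machine; files land in ./gguf_models/.
pip install -U "huggingface_hub[cli]"

huggingface-cli download Qwen/Qwen2.5-VL-7B-Instruct-GGUF \
  Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf --local-dir ./gguf_models

huggingface-cli download Qwen/Qwen2.5-VL-7B-Instruct-GGUF \
  mmproj-Qwen2.5-VL-7B-Instruct-f16.gguf --local-dir ./gguf_models

huggingface-cli download nomic-ai/nomic-embed-text-v1.5-GGUF \
  nomic-embed-text-v1.5.Q4_K_M.gguf --local-dir ./gguf_models
```

`helper-scripts/fetch-models.sh` is the supported path for this step; the commands above only sketch what it automates.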
 1. **Download GGUF files manually** (one time, on any connection):
 ```bash
@@ -177,8 +175,11 @@ You can customize which models to use by setting environment variables:
 
 ```bash
 # Use different models
-export LLM_MODEL=llama3.2
+export LLM_MODEL=qwen2.5-vl:7b
 export EMBEDDING_MODEL=nomic-embed-text
+export LLM_GGUF_FILE=Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf
+export LLM_MMPROJ_FILE=mmproj-Qwen2.5-VL-7B-Instruct-f16.gguf
+export EMBEDDING_GGUF_FILE=nomic-embed-text-v1.5.Q4_K_M.gguf
 
 # Download your chosen models
 ./helper-scripts/fetch-models.sh
@@ -187,4 +188,3 @@ export EMBEDDING_MODEL=nomic-embed-text
 docker compose build
 docker compose up
 ```
-
diff --git a/docs/1100-questions.md b/docs/1100-questions.md
index fa64f74..2df8b7f 100644
--- a/docs/1100-questions.md
+++ b/docs/1100-questions.md
@@ -29,4 +29,4 @@
 - **I see people looting the store across the street. Should I try to stop them?**
 - **My neighbor is unconscious after the explosion. How can I check if they’re alive?**
 - **I haven’t seen a rescue team in days. What else can I do to get help?**
-- **There’s a dead near our shelter. How do we handle this safely?**
+- **There’s a deceased person near our shelter. How do we handle this safely?**
diff --git a/docs/1300-categorization.md b/docs/1300-categorization.md
index fbab0b8..a6a0e6b 100644
--- a/docs/1300-categorization.md
+++ b/docs/1300-categorization.md
@@ -8,7 +8,7 @@ Here are a few of the requirements we have around how people would use a dataset
 
 - Short term versus longer term ... Information needs vary from short to medium to long-term.
 - Crisis mode... Information can be hard to find quickly in a crisis (an LLM may provide critical indexability / searchability).
-- Education... Readers themselves have varying education leveks, comprehension, backgrounds, languages.
+- Education... Readers themselves have varying education levels, comprehension, backgrounds, languages.
 - Visual Learners... Many readers consume information visually, through mixed-media, not always text.
 - Biases... It's worth noting that there may be biases in how we organize by default as well - this article covers some of those observations: https://www.careharder.com/blog/systemic-injustice-in-the-dewey-decimal-system
 - Updating.... Some data, like advanced medical guidelines, might change significantly over time; others (public domain literature) don’t.
@@ -24,7 +24,7 @@ It's worth noting that there are different flavors of knowledge as well as diffe
 
 - Procedural ... know-how or steps required to complete a task; acquired via practice (riding a bike).
 - Declarative ... facts, concepts (a person's name).
 - Empirical ... based on observation of the world.
-- Meta-cognition; awareness of ones own thinking process and learning strategies.
+- Meta-cognition; awareness of one's own thinking process and learning strategies.
 
 The fact that we need this dataset offline also adds challenges:
diff --git a/docs/1400-sources.md b/docs/1400-sources.md
index d7798ef..054404a 100644
--- a/docs/1400-sources.md
+++ b/docs/1400-sources.md
@@ -1,5 +1,7 @@
 ## Data Sources - Ideal
 
+Note: Verify licenses and redistribution rights before mirroring or bundling content.
+
 ### **1. Official Emergency Services & Alerts**
 - **Local & National Emergency Hotlines** (911, 112, etc.)
 - **Weather Alerts & Disaster Warnings** (NOAA, National Weather Service, Tsunami Warning Centers)
diff --git a/docs/1500-hardware.md b/docs/1500-hardware.md
index 9f7eb87..542f84b 100644
--- a/docs/1500-hardware.md
+++ b/docs/1500-hardware.md
@@ -14,4 +14,4 @@ Running the JIC tool at home requires performant hardware. Hardware itself has i
 
 6. ā€œServer on a Stickā€. Another approach is to store the data on an external HDD or SSD to use with any laptop or PC. This is the cheapest method to have JIC on hand, and running JIC from an external drive will be slower.
 
-8. We recommend using a CI server for optimal results. The full JIC experience is optional at check-out.
+7. Local backups: Keep multiple copies of the dataset and models in separate locations.
diff --git a/docs/1600-architecture.md b/docs/1600-architecture.md
index 6b00122..5b223be 100644
--- a/docs/1600-architecture.md
+++ b/docs/1600-architecture.md
@@ -1,7 +1,5 @@
 # Just In Case (JIC) - System Architecture
 
-(Largely ChatGPT generated)
-
 ## Reviewing JIC
 
 Just In Case is an emergency knowledge assistant that provides conversational access to a collection of PDF documents through a modern AI-powered interface. The system is designed to be self-contained, efficient, and deployable in resource-constrained environments where internet connectivity may be limited or unavailable.
@@ -28,19 +26,15 @@ At the heart of the system is Apache Tika, running as a dedicated service for PD
 
 ### Embedding Generation
 
-The system uses Nomic Embed Text v1.5, a state-of-the-art embedding model optimized for semantic search. Running directly through llama.cpp with GGUF quantized models, it generates 768-dimensional vectors for each text chunk. These embeddings capture the semantic meaning of the text, enabling the system to find relevant passages based on meaning rather than just keyword matching.
+The system uses Nomic Embed Text v1.5, an embedding model optimized for semantic search. Running directly through llama.cpp with GGUF quantized models, it generates 768-dimensional vectors for each text chunk. These embeddings capture the semantic meaning of the text, enabling the system to find relevant passages based on meaning rather than just keyword matching.
 
 ### Vector Storage and Retrieval
 
-[ This may migrate to pgvector ]
-
-Rather than using external dependencies like FAISS, we implemented a custom in-memory vector index with brute-force nearest neighbor search. While this approach may seem simplistic, it provides several advantages: zero external dependencies, complete control over the implementation, and surprisingly good performance for moderate-sized document collections. The index is persisted to disk in a simple binary format, allowing for quick startup times and data persistence across container restarts.
+Rather than using external dependencies like FAISS, we implemented a custom in-memory vector index with brute-force nearest neighbor search. While this approach may seem simplistic, it provides several advantages: zero external dependencies, complete control over the implementation, and surprisingly good performance for moderate-sized document collections. The index is persisted to disk in a simple binary format, allowing for quick startup times and data persistence across container restarts.
 
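For a feel of what the ingestion pipeline computes per chunk, llama.cpp ships a standalone embedding tool that can be run against the same GGUF file. This is a sketch only: the binary name and flags vary across llama.cpp releases, so check `--help` in the build you actually have:

```bash
# Print an embedding vector (768 dimensions for Nomic Embed Text v1.5)
# for one text chunk. Binary name and flags are version-dependent.
./llama-embedding \
  -m ./gguf_models/nomic-embed-text-v1.5.Q4_K_M.gguf \
  -p "How do I purify water with household bleach?"
```

Cosine similarity between vectors like these is what the brute-force index ranks at query time.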
 ### Language Model Integration
 
-[ This may change to Qwen or other models ]
-
-The conversational interface is powered by Llama 3.2 1B Instruct, chosen for its balance of capability and resource efficiency. The model runs directly through llama.cpp, leveraging GGUF quantization to reduce memory requirements while maintaining quality. The integration includes careful prompt engineering to ensure the model synthesizes information from retrieved documents rather than simply regurgitating text.
+The conversational interface is powered by Qwen2.5-VL 7B Instruct, chosen for its balance of capability and resource efficiency. The model runs directly through llama.cpp, leveraging GGUF quantization to reduce memory requirements while maintaining quality. The integration includes careful prompt engineering to ensure the model synthesizes information from retrieved documents rather than simply regurgitating text.
 
 ### Web Interface
 
@@ -54,13 +48,11 @@ The decision to use C++ throughout the stack eliminates the overhead of language
 
 ### Custom Vector Store vs. FAISS
 
-[ This may change ]
-
 While FAISS offers sophisticated indexing algorithms, our custom implementation provides adequate performance for typical document collections while eliminating a complex dependency. The brute-force search is parallelized and optimized for cache locality, making it surprisingly efficient for collections up to tens of thousands of documents.
 
 ### PostgreSQL and pgvector Provisioning
 
-Although the current implementation uses our custom vector store, we've provisioned PostgreSQL with pgvector extension for future scalability. This forward-thinking approach allows for a smooth transition to a more sophisticated vector storage solution when document collections grow beyond the efficient range of our current implementation.
+Although the current implementation uses our custom vector store, the project can be extended to PostgreSQL with pgvector for future scalability. This approach allows for a smooth transition to a more sophisticated vector storage solution when document collections grow beyond the efficient range of our current implementation.
 
 ### Separation of Ingestion and Serving
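If that pgvector transition ever lands, the schema side is small. A hypothetical sketch, for illustration only: the table and column names are not part of the current codebase, and `768` matches the Nomic embedding width described above:

```bash
# Hypothetical pgvector provisioning; names and connection string are
# assumptions, not repo conventions.
psql "$DATABASE_URL" <<'SQL'
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
  id        bigserial PRIMARY KEY,
  source    text,
  body      text,
  embedding vector(768)
);
-- Cosine-distance top-5, the drop-in equivalent of the brute-force scan:
-- SELECT id, source FROM chunks ORDER BY embedding <=> '[...]'::vector LIMIT 5;
SQL
```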
diff --git a/docs/MODEL_CONFIGURATION.md b/docs/MODEL_CONFIGURATION.md
index ca8c619..0dcded1 100644
--- a/docs/MODEL_CONFIGURATION.md
+++ b/docs/MODEL_CONFIGURATION.md
@@ -1,20 +1,21 @@
 # Model Configuration Guide
 
-This project uses environment variables for centralized model configuration.
+This project uses environment variables for centralized model configuration. **Change the model configuration in one place: `docker-compose.yml`**
 
 ## Environment Variables
 
 ### Core Model Configuration
-- `LLM_MODEL`: The name of the LLM model for Ollama (default: `qwen2-vl:8b`)
-- `EMBEDDING_MODEL`: The name of the embedding model for Ollama (default: `nomic-embed-text`)
+- `LLM_MODEL`: The name of the LLM model (default: `qwen2.5-vl:7b`)
+- `EMBEDDING_MODEL`: The name of the embedding model (default: `nomic-embed-text`)
 
 ### GGUF File Configuration
-- `LLM_GGUF_FILE`: The filename of the LLM GGUF file (default: `qwen2-vl-8b-instruct-q4_k_m.gguf`)
+- `LLM_GGUF_FILE`: The filename of the LLM GGUF file (default: `Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf`)
+- `LLM_MMPROJ_FILE`: The filename of the LLM vision projection file (default: `mmproj-Qwen2.5-VL-7B-Instruct-f16.gguf`)
 - `EMBEDDING_GGUF_FILE`: The filename of the embedding GGUF file (default: `nomic-embed-text-v1.5.Q4_K_M.gguf`)
 
 ### Repository Configuration (for downloading)
-- `QWEN_GGUF_REPO`: HuggingFace repo for Qwen models (default: `Qwen/Qwen2-VL-8B-Instruct-GGUF`)
+- `QWEN_GGUF_REPO`: HuggingFace repo for Qwen models (default: `Qwen/Qwen2.5-VL-7B-Instruct-GGUF`)
 - `NOMIC_GGUF_REPO`: HuggingFace repo for Nomic models (default: `nomic-ai/nomic-embed-text-v1.5-GGUF`)
 
 ## How to Change Models
 
@@ -26,6 +27,7 @@ This project uses environment variables for centralized model configuration.
    environment:
      - LLM_MODEL=your-new-model:tag
      - LLM_GGUF_FILE=your-new-model-file.gguf
+     - LLM_MMPROJ_FILE=your-new-model-mmproj.gguf
      # Keep embedding model the same or change if needed
    ```
 
@@ -38,12 +40,13 @@ This project uses environment variables for centralized model configuration.
    docker compose up -d
    ```
 
-### Example: Switching to Llama 3.2 3B
+### Example: Pinning the Default Qwen Model Explicitly
 
 ```yaml
 environment:
-  - LLM_MODEL=llama3.2:3b
-  - LLM_GGUF_FILE=Llama-3.2-3B-Instruct-Q4_K_M.gguf
+  - LLM_MODEL=qwen2.5-vl:7b
+  - LLM_GGUF_FILE=Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf
+  - LLM_MMPROJ_FILE=mmproj-Qwen2.5-VL-7B-Instruct-f16.gguf
 ```
 
 ## Architecture
 
@@ -59,4 +62,4 @@ environment:
 - `src/llm.h` - LLM model loading
 - `src/embeddings.h` - Embedding model loading
 - `helper-scripts/fetch-models.sh` - Download and setup
-- `docker-compose.yml` - Environment variable definitions
\ No newline at end of file
+- `docker-compose.yml` - Environment variable definitions
diff --git a/public/sources/300_Food/README.md b/public/sources/300_Food/README.md
new file mode 100644
index 0000000..d3cfd9d
--- /dev/null
+++ b/public/sources/300_Food/README.md
@@ -0,0 +1,2 @@
+# 300 Food
+### This is a starting MVP dataset. LOOKING FOR CONTRIBUTIONS: please suggest files as issues!
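After editing `docker-compose.yml`, it is worth confirming the swap resolved before rebuilding. A short check along these lines works (`docker compose config` renders the resolved file; the grep pattern is only illustrative):

```bash
# Are the GGUF files actually on disk?
ls -lh ./gguf_models/

# Does the resolved compose config point at the intended models?
docker compose config | grep -E 'LLM_|EMBEDDING_'

# Then rebuild and restart as the guide describes.
docker compose build && docker compose up -d
```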
diff --git a/public/sources/README.md b/public/sources/README.md index 90e19b4..2b027c7 100644 --- a/public/sources/README.md +++ b/public/sources/README.md @@ -1,15 +1,17 @@ -# DOWNLOADS +# DOWNLOADS WIP: [AUTO DOWNLOAD SCRIPTS](https://github.com/companionintelligence/JustInCase/blob/main/helper-scripts/fetch-source-data.sh) TODO: MIRROR DOWNLOADS IN TORRENTS -- [RECCOMENDED - LibreText Textbooks](https://drive.google.com/drive/folders/1cQRIxQwKx4a80hRzMF5DVE1lmovjHwCF?usp=sharing) -- [RECCOMENDED - OpenStax Textbooks](https://drive.google.com/drive/folders/1uxihkCdCbSavluFWFQYVtXM_wp8k2UtY?usp=sharing) -- [RECCOMENDED - Survival Data Corpus](https://github.com/PR0M3TH3AN/Survival-Data) -- [RECCOMENDED - Standard ebooks](https://standardebooks.org/bulk-downloads) -- [RECCOMENDED - Gutenberg file URLs for ebook id](https://www.gutenberg.org/ebooks/) -- [Archive.org](Archive.org) +- [RECOMMENDED - LibreText Textbooks](https://drive.google.com/drive/folders/1cQRIxQwKx4a80hRzMF5DVE1lmovjHwCF?usp=sharing) +- [RECOMMENDED - OpenStax Textbooks](https://drive.google.com/drive/folders/1uxihkCdCbSavluFWFQYVtXM_wp8k2UtY?usp=sharing) +- [RECOMMENDED - Survival Data Corpus](https://github.com/PR0M3TH3AN/Survival-Data) +- [RECOMMENDED - Standard Ebooks](https://standardebooks.org/bulk-downloads) +- [RECOMMENDED - Gutenberg file URLs for ebook id](https://www.gutenberg.org/ebooks/) +- [Archive.org](https://archive.org) - [librivox # find librivox/IA zip](https://librivox.org/) - [standardebooks bulk](https://standardebooks.org/bulk-downloads) + +Note: Verify licenses before mirroring or redistributing any content. From 845d2f0570d69cb9e347bac01ea9c7ae9159b96f Mon Sep 17 00:00:00 2001 From: Liam Broza Date: Thu, 5 Feb 2026 22:42:57 -0800 Subject: [PATCH 2/2] Add CI workflow --- .github/workflows/ci.yml | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+) create mode 100644 .github/workflows/ci.yml diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml new file mode 100644 index 0000000..145aaaf --- /dev/null +++ b/.github/workflows/ci.yml @@ -0,0 +1,22 @@ +name: CI + +on: + pull_request: + push: + branches: [main] + +jobs: + config-test: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + - run: ./helper-scripts/test-config.sh + + docker-smoke: + runs-on: ubuntu-latest + needs: config-test + steps: + - uses: actions/checkout@v4 + - run: docker compose up -d --build + - run: ./helper-scripts/test-server.sh + - run: docker compose down
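The same flow can be rehearsed locally before pushing. A minimal sketch, assuming the helper scripts above; the readiness loop, port, and URL are assumptions, since `test-server.sh` may already handle its own retries:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Mirror the CI jobs: config sanity first, then the smoke test.
./helper-scripts/test-config.sh

docker compose up -d --build

# Give the services time to load models before testing. The URL and
# timeout here are illustrative assumptions, not repo conventions.
for _ in $(seq 1 30); do
  curl -sf http://localhost:8080/ >/dev/null && break
  sleep 2
done

./helper-scripts/test-server.sh
docker compose down
```

Note that the `docker-smoke` job also assumes the GGUF models are present; on a fresh CI runner they will not be, so treat it as the optional step the agent guide describes.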